Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that provide utility for robot navigation. However, when humans communicate with robots in the real world, spoken instructions may carry ambiguity and uncertainty. While LLMs are proficient at processing text in human conversations, they often struggle with the nuances of verbal instructions and thus remain prone to hallucination and misplaced trust in human commands. In this work, we present TrustNavGPT, an LLM-based audio-guided navigation agent that uses affective cues in spoken communication -- elements such as tone and inflection that convey meaning beyond words -- to assess the trustworthiness of human commands and make effective, safe decisions.
Current LLM-based navigation methods struggle to make accurate decisions when faced with ambiguous audio instructions. Our approach incorporates affective cues from spoken communication into the LLM, enabling it to evaluate the reliability of human instructions based on both semantic and vocal uncertainty, thus allowing for safe and successful navigation.
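To make the idea of combining the two uncertainty signals concrete, here is a minimal sketch. It assumes the semantic and vocal uncertainty estimates are already available as scalars in [0, 1]; the equal weighting and the threshold value are illustrative assumptions, not values from the paper.

```python
def trust_command(semantic_uncertainty: float,
                  vocal_uncertainty: float,
                  threshold: float = 0.5) -> bool:
    """Treat a spoken command as trustworthy only if the combined
    semantic + vocal uncertainty stays below a threshold."""
    combined = 0.5 * semantic_uncertainty + 0.5 * vocal_uncertainty
    return combined < threshold

# A confident, unambiguous command is executed; a hesitant, ambiguous one
# triggers a safer fallback such as asking the human for clarification.
print(trust_command(0.2, 0.1))  # True  -> execute the command
print(trust_command(0.4, 0.8))  # False -> fall back to a clarification request
```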
Human audio passes through an audio-processing module that transcribes it, while a vocal-cue model identifies three essential affective cues. We then prompt a language model to generate five possible next-step actions and select one based on the next-token logit probability. Notably, the semantic transcription alone leads to the red choice, whereas incorporating the vocal cues results in the green choice being selected. Finally, a tool library translates the chosen language instruction into agent actions for navigation.
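The sketch below illustrates the selection step only, under stated assumptions: a HuggingFace causal LM stands in for the LLM used in the system, and the model name, cue labels, prompt wording, and candidate actions are all hypothetical. It shows how five candidate actions can be ranked by the next-token logit of their option letters, conditioned on both the transcript and the vocal cues.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM with a compatible tokenizer works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def choose_action(transcript: str, vocal_cues: dict, options: list[str]) -> str:
    """Pick one of the candidate next-step actions by comparing the logit
    (next-token log-probability) of each option letter A, B, C, ..."""
    cue_text = ", ".join(f"{k}: {v}" for k, v in vocal_cues.items())
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Spoken instruction (transcribed): \"{transcript}\"\n"
        f"Vocal cues: {cue_text}\n"
        f"Candidate next-step actions:\n{labeled}\n"
        "Answer with a single letter. Answer:"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]  # logits over the vocabulary
    # Compare the logit of each option letter (" A", " B", ...) and keep the best.
    letter_ids = [tokenizer.encode(f" {chr(65 + i)}")[-1] for i in range(len(options))]
    best = int(torch.argmax(next_token_logits[letter_ids]))
    return options[best]

# Example: the same transcript can yield a different choice once hesitant
# vocal cues are included in the prompt.
options = [
    "Proceed to the door on the left.",
    "Stop and ask the human to confirm the target.",
    "Proceed to the door on the right.",
    "Return to the starting position.",
    "Explore the corridor ahead.",
]
print(choose_action("go to the, uh, left door I think",
                    {"tone": "hesitant", "pitch": "rising", "energy": "low"},
                    options))
```

In this sketch the trust assessment is implicit: the vocal cues are injected into the prompt, so an option that defers to the human (asking for confirmation) can receive a higher logit than directly executing an ambiguous command.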
coming soon