As explained in a blog post last week, the folks at Boston Dynamics (BD) watched with considerable interest the advent of foundation models (FMs) and their use in powering chatbots like ChatGPT. The firm therefore became interested in developing a demo in which Spot uses FMs to make decisions in real time.
"Large Language Models (LLMs) like ChatGPT are basically very big, very capable autocomplete algorithms; they take in a stream of text and predict the next bit of text," the post states. "We were inspired by the apparent ability of LLMs to roleplay, replicate culture and nuance, form plans, and maintain coherence over time, as well as by recently released Visual Question Answering (VQA) models that can caption images and answer simple questions about them."
A robot tour guide was chosen as a good test case. "The robot could walk around, look at objects in the environment, use a VQA or captioning model to describe them, and then elaborate on those descriptions using an LLM," the droid-maker's post states. "Additionally, the LLM could answer questions from the tour audience, and plan what actions the robot should take next. In this way, the LLM can be thought of as an improv actor – we provide a broad strokes script and the LLM fills in the blanks on the fly."
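To make that loop concrete, here is a minimal Python sketch of the caption, prompt, reply, act cycle the post describes. It is not Boston Dynamics' code and does not use the Spot SDK or any real model API: `TourGuide`, `parse_reply`, `fake_captioner`, `fake_llm`, and the `SAY:`/`NEXT:` reply format are all hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass, field


def parse_reply(reply):
    """Split a reply into (speech, next_stop) using the assumed SAY:/NEXT: format."""
    speech, next_stop = "", None
    for line in reply.splitlines():
        if line.startswith("SAY:"):
            speech = line[len("SAY:"):].strip()
        elif line.startswith("NEXT:"):
            next_stop = line[len("NEXT:"):].strip()
    return speech, next_stop


@dataclass
class TourGuide:
    """Hypothetical tour-guide loop: caption -> prompt -> LLM reply -> action."""
    script: str                                   # the "broad strokes script"
    history: list = field(default_factory=list)   # record of past steps

    def build_prompt(self, caption, question=None):
        # Combine the script, what the robot sees, and any audience question.
        lines = [self.script, f"You see: {caption}"]
        if question:
            lines.append(f"A visitor asks: {question}")
        lines.append("Reply with 'SAY: <text>' and 'NEXT: <location>'.")
        return "\n".join(lines)

    def step(self, image, captioner, llm, question=None):
        # 1. Describe the scene, 2. ask the LLM, 3. extract speech and next stop.
        caption = captioner(image)
        reply = llm(self.build_prompt(caption, question))
        speech, next_stop = parse_reply(reply)
        self.history.append((caption, speech, next_stop))
        return speech, next_stop


def fake_captioner(image):
    """Stand-in for a VQA/captioning model (assumption: returns plain text)."""
    return f"a photo of {image}"


def fake_llm(prompt):
    """Stand-in for an LLM call; a real one would condition on the prompt."""
    return "SAY: Behold, a fine charging dock.\nNEXT: lobby"
```

Calling `TourGuide(script="You are a robot tour guide.").step("charging dock", fake_captioner, fake_llm)` returns the text to speak and the name of the next stop. In a real system the two stand-in functions would be replaced by actual model calls, and the next stop would be passed to the robot's navigation layer.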