Thursday, April 3, 2025

NVIDIA unveils ReMEmbR, a new approach that uses generative AI to help robots reason about what to do and then carry out the action.


ReMEmbR combines language models, vision models, and retrieval-augmented generation, enabling robots to reason, plan, and act autonomously. | Source: NVIDIA

By combining the language understanding of foundation models with the visual capabilities of vision transformers, vision language models (VLMs) project text and images into a shared embedding space. They can take unstructured multimodal data, reason over it, and return the output in a structured format.

Building on a broad base of pretraining, these VLMs can be easily adapted to diverse computer vision tasks by providing new prompts or through parameter-efficient fine-tuning.

Given access to a large library of knowledge sources and tools, these models can retrieve additional information when unsure of an answer, or take action once a response is verified. LLMs and generative AI can process vast amounts of information on a robot's behalf, enabling it to accomplish complex tasks that might otherwise be difficult to specify.

Previously, we showcased how large language models (LLMs) and vision language models (VLMs) can be deployed on NVIDIA Jetson Orin devices, enabling applications such as zero-shot object detection, video captioning, and text generation at the edge.

What challenges do we face when applying these advanced AI techniques to perception and autonomy in robotics, and how can we overcome them? Putting these models to work on real robots requires careful design: deployments are long, compute at the edge is limited, and environments vary widely.

In this post, we focus on ReMEmbR, a project that combines LLMs, vision language models (VLMs), and retrieval-augmented generation (RAG) to enable robots to reason and take actions based on what they observe during a long-horizon deployment, spanning hours or even days.

ReMEmbR's memory-building phase uses VLMs to build a robust long-term semantic memory, and its querying phase uses that memory to answer questions efficiently. The solution is open source and runs directly on the device.

Many challenges arise when using LLMs and VLMs in robotics applications, where the system must:

  • handle massive contexts.
  • reason over a spatial memory.
  • build a prompt-based agent that iteratively queries for more information until the user's question is resolved.

To show how these ideas translate into practice, we ran ReMEmbR on a real robot. We used Nova Carter to accomplish this, and we are sharing our code and the step-by-step process. For more information, see the resources at the end of this post.

ReMEmbR helps with long-term memory, reasoning, and action

As robots are increasingly expected to operate in their environments over long stretches of time, memory becomes a crucial capability. Robots are often deployed for hours or even days at a time, during which they encounter a wide range of novel objects, events, and environments.

To enable robots to answer complex, multi-step questions after being deployed for extended periods, we developed ReMEmbR, a retrieval-augmented memory for embodied robots.

ReMEmbR gives robots a modular, scalable way to reason and act over what they observe, combining long-term memory retrieval with LLM-based reasoning. The system consists of two phases: memory building and querying.

In the memory-building phase, we used VLMs to construct a structured memory on top of a vector database. In the querying phase, an LLM-based agent iteratively retrieves different pieces of that memory and ultimately produces an answer to the user's question.

Schematic of NVIDIA's full ReMEmbR system for connecting generative AI to robotics.

Figure 1. The full ReMEmbR system. | Source: NVIDIA

Building a smarter memory

ReMEmbR's memory-building phase is all about making memory work for robotic reasoning. When robots are deployed for hours or days, they need an efficient way to store what they perceive. Videos are easy to record, but hard to query and understand.

During memory building, we take short segments of video, caption them with the VILA captioning VLM, and then embed and store the captions in a vector database such as MilvusDB. We also store timestamps and the robot's coordinate data alongside each entry.

This setup let us efficiently query every kind of data in the robot's memory. By captioning video segments with VILA and storing them in a MilvusDB vector database, the system retains anything VILA can detect, from dynamic events such as people walking around, to specific small objects, up to more general categories.
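For intuition, the storage-and-retrieval pattern above can be sketched in a few lines of plain Python. This is an illustrative toy, not the ReMEmbR code: a bag-of-words embedding over a small fixed vocabulary stands in for a real text-embedding model, hand-written captions stand in for VILA output, and a Python list stands in for MilvusDB. All names are made up for the example.

```python
import math
from dataclasses import dataclass

VOCAB = ["person", "walking", "elevator", "snack", "bar", "chips", "fruit", "stairs"]

def embed(text: str) -> list:
    # Toy bag-of-words embedding over a fixed vocabulary; a real system
    # would use a learned text-embedding model instead.
    tokens = text.lower().split()
    vec = [float(tokens.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

@dataclass
class MemoryEntry:
    caption: str        # VILA-style caption of a short video segment
    embedding: list     # embedding of the caption
    timestamp: float    # when the segment was recorded
    position: tuple     # robot (x, y) position at that time

class VectorMemory:
    """Minimal in-memory stand-in for a vector database such as MilvusDB."""
    def __init__(self):
        self.entries = []

    def insert(self, caption, timestamp, position):
        self.entries.append(MemoryEntry(caption, embed(caption), timestamp, position))

    def search(self, query: str, k: int = 3):
        # Rank entries by cosine similarity to the query embedding.
        q = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda e: sum(a * b for a, b in zip(q, e.embedding)),
            reverse=True,
        )
        return ranked[:k]

memory = VectorMemory()
memory.insert("a person walking past an elevator", timestamp=12.0, position=(3.1, 4.0))
memory.insert("a snack bar with chips and fruit", timestamp=95.5, position=(10.2, 1.4))
hits = memory.search("where can I get a snack", k=1)
print(hits[0].caption, hits[0].position)
```

Because each entry carries a timestamp and a position as well as an embedding, the same store can answer "where" and "when" questions, not just "what" questions.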

Using a vector database also makes it easy to add new types of data for ReMEmbR to consider.

ReMEmbR agent

With such a long memory stored in the database, a standard LLM would struggle to reason quickly over the full context.

The LLM backend for the ReMEmbR agent can run on-device or call out to various LLM application programming interfaces (APIs). When a user poses a question, the LLM generates targeted queries to the database, retrieving relevant information iteratively. Depending on the question, it can query text data, time data, or position data. This process continues until the question is resolved.

By giving the LLM agent this diverse set of tools, the robot can go beyond answering questions about how to do things: it can navigate to specific locations and reason spatially and temporally. Figure 2 shows an example of what this reasoning can look like.
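The iterative tool-calling loop can be sketched as follows. This is a simplified illustration, not the ReMEmbR agent itself: a hand-scripted policy stands in for the LLM backend, and the tool names (`query_text`, `query_position`) and the one-entry database are hypothetical stand-ins for the real retrieval functions.

```python
# A hand-scripted policy stands in for the LLM backend; it decides which
# tool to call next based on what has been retrieved so far.
def scripted_llm(question, observations):
    if not observations:
        return ("query_text", question)           # first, find relevant captions
    if len(observations) == 1:
        return ("query_position", observations[0]["caption"])  # then, where is it?
    return ("answer", f"Go to {observations[1]['position']}: {observations[0]['caption']}")

DATABASE = [
    {"caption": "an elevator next to the lobby", "timestamp": 40.0, "position": (5.0, 2.0)},
]

def query_text(q):
    return DATABASE[0]                            # toy retrieval: best caption match

def query_position(q):
    return {"position": DATABASE[0]["position"]}  # toy retrieval: its location

TOOLS = {"query_text": query_text, "query_position": query_position}

def run_agent(question, max_steps=5):
    observations = []
    for _ in range(max_steps):
        tool, arg = scripted_llm(question, observations)
        if tool == "answer":
            return arg                            # loop ends once the LLM answers
        observations.append(TOOLS[tool](arg))
    return "no answer found"

print(run_agent("How do I get to the elevator?"))
```

The key structural point is the loop: the agent keeps issuing text, time, or position queries and accumulating observations until it decides it can answer.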

GIF shows the LLM agent being asked how to get upstairs. It first determines that it must query the database for stairs, for which it retrieves an outdoor staircase that is not sufficient. Then, it queries and returns an elevator, which may be sufficient. The LLM then queries the database for stairs that are indoors. It finds the elevator as a sufficient response and returns that to the user as an answer to their question.

Figure 2. Example ReMEmbR query and reasoning flow. | Source: NVIDIA

How can you use ReMEmbR in your own robotics application?

To illustrate how ReMEmbR can be integrated into a practical robotics application, we built a demo combining ReMEmbR with NVIDIA Isaac ROS and Nova Carter. Isaac ROS, built on the open-source ROS 2 framework, is a collection of accelerated computing packages and AI models that bring NVIDIA acceleration to ROS developers everywhere.

In the demo, the robot acts as a tour guide, answering questions and guiding people around a real office environment. To show how we built the application, we outline the steps we followed:

  • Building an occupancy grid map
  • Running the memory builder
  • Running the ReMEmbR agent
  • Adding speech recognition

Building an occupancy grid map

The first step was to build a map of the environment. To build its vector database, ReMEmbR needs just two inputs: monocular camera images and global location (pose) data.

Picture shows the Nova Carter robot with an arrow pointing at the 3D Lidar + odometry being fed into a Nav2 2D SLAM pipeline, which is used to build a map.

Figure 3. Using Nova Carter to build an occupancy grid map. | Source: NVIDIA

Depending on your environment or platform, obtaining global pose data can be challenging. Fortunately, Nova Carter makes this simple.

Nova Carter, powered by the Nova Orin reference architecture, is a complete robotics development platform that accelerates the development and deployment of autonomous mobile robots. It is equipped with a 3D lidar, allowing it to generate accurate and globally consistent metric maps.

GIF shows a 2D occupancy grid being built online using Nova Carter. The map fills out over time as the robot moves throughout the environment.

Figure 4. Building the occupancy grid map online with Nova Carter. | Source: NVIDIA

Following this process, we quickly generated an occupancy map by teleoperating the robot. The map is then used for localization while building the ReMEmbR database, as well as for path planning and navigation once the robot is finally deployed.
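For intuition about what the map-building step produces, here is a from-scratch toy of how an occupancy grid is filled in from range measurements. This is purely illustrative; the demo actually uses the Nav2 SLAM pipeline shown in Figure 3, and all names here are made up.

```python
import math

# Minimal occupancy-grid sketch: cells start unknown (-1); each lidar ray
# marks the cells it passes through as free (0) and the hit cell as occupied (1).
class OccupancyGrid:
    def __init__(self, width, height, resolution):
        self.res = resolution                      # meters per cell
        self.cells = [[-1] * width for _ in range(height)]

    def to_cell(self, x, y):
        return int(x / self.res), int(y / self.res)

    def integrate_scan(self, robot_xy, angle, hit_range):
        rx, ry = robot_xy
        hx = rx + hit_range * math.cos(angle)      # world position of the hit
        hy = ry + hit_range * math.sin(angle)
        steps = max(1, int(hit_range / self.res))
        for i in range(steps):                     # free cells along the ray
            t = i / steps
            cx, cy = self.to_cell(rx + (hx - rx) * t, ry + (hy - ry) * t)
            self.cells[cy][cx] = 0
        cx, cy = self.to_cell(hx, hy)              # cell containing the hit
        self.cells[cy][cx] = 1

grid = OccupancyGrid(width=20, height=20, resolution=0.5)
grid.integrate_scan(robot_xy=(1.0, 1.0), angle=0.0, hit_range=4.0)
print(grid.cells[2][10])   # cell at the lidar hit (5.0, 1.0)
```

A real SLAM system additionally estimates the robot's pose while mapping and fuses many noisy scans probabilistically, but the grid data structure it produces is essentially this.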

Running the memory builder

After mapping the environment, our next step was to populate the vector database that underlies ReMEmbR. For this, we teleoperated the robot while running localization against the map for global positioning. For more information about how to do this with Nova Carter, see the Nova Carter documentation.

The system diagram shows running the ReMEmBr demo memory builder. The occupancy grid map is used as input. The VILA node captions images from the camera. The captions and localization information are stored in a vector database.

Figure 5. Running the ReMEmbR memory builder. | Source: NVIDIA

With localization running in the background, we ran two custom ROS nodes for the memory-building phase.

The first ROS node runs the VILA model to generate captions for images from the robot's camera. Because the node runs on the device itself, the system stays reliable even with intermittent network connectivity, letting us build a trustworthy database.

Running this node on Jetson is made much easier by the NanoLLM library for quantization and inference. This library, along with many others, is featured on the Jetson AI Lab. There is even a recently released ROS package for easily interfacing NanoLLM models with a ROS graph.

The second ROS node subscribes to the captions produced by VILA, as well as the global pose estimated by the AMCL node. It builds text embeddings of the captions and stores the pose, text, embeddings, and timestamps in the vector database.
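The core logic of this second node, stripped of ROS plumbing, might look like the following sketch: pair each incoming caption with the most recent pose estimate and form the record to insert into the database. The class and method names are hypothetical; a real implementation would subscribe to ROS topics and insert into MilvusDB rather than a Python list.

```python
# Sketch of the memory-builder node's core logic, without ROS dependencies.
class MemoryBuilder:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn     # text-embedding function (placeholder below)
        self.latest_pose = None
        self.records = []

    def on_pose(self, x, y, theta, stamp):
        # Would be the AMCL pose subscription callback.
        self.latest_pose = {"x": x, "y": y, "theta": theta, "stamp": stamp}

    def on_caption(self, text, stamp):
        # Would be the VILA caption subscription callback.
        if self.latest_pose is None:
            return None              # not localized yet; drop the caption
        record = {
            "text": text,
            "embedding": self.embed_fn(text),
            "pose": self.latest_pose,
            "timestamp": stamp,
        }
        self.records.append(record)  # a real system would insert into MilvusDB
        return record

builder = MemoryBuilder(embed_fn=lambda t: [float(len(t))])  # placeholder embedding
builder.on_pose(2.0, 3.5, 0.0, stamp=10.0)
rec = builder.on_caption("a whiteboard with notes", stamp=10.2)
print(rec["pose"]["x"], rec["timestamp"])
```

Keeping only the latest pose is a simplification; one could instead interpolate between the poses bracketing the caption's timestamp for better alignment.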

Running the ReMEmbR agent

Diagram shows that when the user has a question, the agent node leverages the pose information from AMCL and generates queries for the vector database in a loop. When the LLM has an answer, and if it is a goal position for the robot, a message is sent on the goal pose topic, which navigates the robot using Nav2.

Figure 6. Running the ReMEmbR agent to answer user queries and navigate to the desired goal. | Source: NVIDIA

Once the vector database was populated, the ReMEmbR agent had the knowledge it needed to answer user queries and take meaningful actions.

The third step was to run the live agent. To keep the robot's memory fixed, we disabled the captioning and database-updating nodes, and instead enabled the ReMEmbR agent node.

The ReMEmbR agent is responsible for taking a user's query, querying the vector database, and deciding the appropriate action the robot should take. In this case, the action is a goal destination that answers the user's query.

To test the system end to end, we typed in a variety of user queries by hand:

  • “Can you take me to the nearest elevator?”
  • “Can you take me to a place where I can grab a quick bite?”

The ReMEmbR agent determines the best goal pose and publishes it to the /goal_pose topic. The path planner then generates a global path for the robot to follow to navigate to the goal.
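A goal pose message carries a position and an orientation quaternion. As a small illustration of what ends up on the /goal_pose topic, a planar (x, y, yaw) goal can be packaged as follows; this uses plain dicts with no ROS dependencies, and the helper name is made up for the example.

```python
import math

# Illustrative helper: package an (x, y, yaw) goal as the fields of a
# PoseStamped-like message for the /goal_pose topic.
def make_goal_pose(x, y, yaw):
    return {
        "position": {"x": x, "y": y, "z": 0.0},
        # Planar yaw expressed as a quaternion: rotation about the z-axis only.
        "orientation": {
            "x": 0.0,
            "y": 0.0,
            "z": math.sin(yaw / 2.0),
            "w": math.cos(yaw / 2.0),
        },
    }

# Face 90 degrees left at a hypothetical destination.
goal = make_goal_pose(4.2, -1.0, math.pi / 2)
print(round(goal["orientation"]["z"], 3), round(goal["orientation"]["w"], 3))
```

Once a message of this shape is published, the navigation stack treats it like any other goal, so the LLM agent needs no special integration with the planner beyond choosing the pose.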

Adding speech recognition

Users of the application likely won't have access to a terminal to type in queries, so they need an intuitive way to interact with the system. To take the application a step further, we added speech recognition to generate queries for the agent.

Adding speech recognition on Jetson Orin platforms is straightforward. We accomplished this by writing a ROS node that wraps the recently released WhisperTRT project.

WhisperTRT optimizes OpenAI's Whisper model with NVIDIA TensorRT to deliver low-latency inference on NVIDIA Jetson AGX Orin and NVIDIA Jetson Orin Nano.

The WhisperTRT ROS node accesses the microphone directly using PyAudio and publishes the recognized speech on a topic for the agent to consume.
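To give a feel for the audio side, here is a toy endpointing sketch that groups microphone frames into utterances using a simple energy threshold. This is a simplification for illustration only; it is not how WhisperTRT itself works, and the function name and thresholds are invented for the example.

```python
# Toy endpointing: accumulate loud frames into an utterance and close it
# after a run of quiet frames, roughly what a VAD front-end does before
# handing audio to a speech-recognition model.
def segment_utterances(frames, threshold=0.1, max_silence=2):
    utterances, current, silence = [], [], 0
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)  # mean squared amplitude
        if energy >= threshold:
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence > max_silence:    # enough trailing silence: close utterance
                utterances.append(current)
                current, silence = [], 0
    if current:                          # flush a trailing utterance
        utterances.append(current)
    return utterances

quiet, loud = [0.01] * 160, [0.5] * 160  # fake 160-sample audio frames
stream = [quiet, loud, loud, quiet, quiet, quiet, loud, quiet]
print(len(segment_utterances(stream)))
```

In the real node, each closed utterance would be passed to the Whisper model and the resulting text published on the speech topic the ReMEmbR agent listens to.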

The diagram shows taking in user input, which is recognized with a WhisperTRT speech recognition node that publishes a speech topic that the ReMEmbR agent node listens to.

Figure 7. Using speech recognition with WhisperTRT for easy user interaction. | Source: NVIDIA

Putting it all together

With all the pieces in place, we combined the components into a full demonstration of the system running on the robot.

Get started

We hope this post inspires you to explore generative AI in robotics. To learn more about the topics covered here, explore the ReMEmbR code and start building your own generative AI robotics applications with these resources:

Join our community for updates on additional resources and reference architectures to support your development goals.

Stay up to date and connect with the robotics community through our social media channels on Twitter, LinkedIn, and YouTube.

About the authors

Abrar Anwar is a Ph.D. student at the University of Southern California and an intern at NVIDIA. His research focuses on the intersection of language and robotics, with an emphasis on navigation and human-robot interaction.

Anwar received his B.S. in computer science from The University of Texas at Austin.


John Welsh is a developer of autonomous machines at NVIDIA, where he builds advanced capabilities on the NVIDIA Jetson platform. Whether it's Legos, robots, or a guitar track, he always enjoys creating things.

Welsh holds a B.S. and an M.S. in electrical engineering from the University of Maryland, where he focused on robotics and computer vision.


Yan Chang is a principal engineer and senior engineering manager at NVIDIA, where she currently leads the robotics mobility team.

Before joining NVIDIA, Chang led the behavior validation ML team at Amazon's self-driving car subsidiary. She received her Ph.D. from the University of Michigan.

This article was syndicated, with permission, from NVIDIA.

RoboBusiness 2024, taking place October 16 and 17 in Santa Clara, California, will offer additional learning opportunities. Amit Goel, head of robotics and edge AI ecosystem at NVIDIA, is set to participate in a discussion titled “Driving the Way Forward for Robotics Innovation.”

On day one of the event, Rohan Kumar, senior strategic alliances and ecosystem manager for robotics at NVIDIA, will participate in a panel discussion titled “Generative AI’s Influence on Robotics.”



