This multimodal assistant can plan, execute, inspect, and learn
Table of Contents
- Introduction
- Related Work
- AssistGPT
  - Planner
  - Executor
  - Inspector
  - Learner
- Experiments
  - Quantitative Results
  - Qualitative Results
- Highlights
- FAQs
- Conclusion
Introduction
Large language models (LLMs) have recently made significant progress as AI assistants. However, they remain limited when it comes to understanding visual environments and complex tasks. To overcome these limitations, researchers have proposed coordinating multiple domain experts, such as pretrained models and APIs, converting visual input into text and breaking user queries down into smaller tasks. This article introduces AssistGPT, a multimodal AI assistant system that combines language and code reasoning to handle complex visual tasks.
Related Work
Before the advent of LLMs, models were developed to handle multiple types of inputs, including visual elements, actions, and text, but these models had limited generalizability. Two strategies have emerged to create more general multimodal systems: pre-training LLMs with visual inputs and combining multiple models or APIs. Compositional reasoning methods have also been used to decompose questions into subtasks. Early modular models employed end-to-end reinforcement learning, while newer ones use self-supervised approaches. The proposed AssistGPT addresses the limitations of earlier methods by intertwining language and code reasoning.
AssistGPT
Planner
The Planner module in AssistGPT uses the GPT-4 large language model to direct global reasoning and planning. It takes three types of input: the Instruction Prompt, the Input Query, and the Summary of Visual Inputs. Based on these, the Planner generates a Thought and an Action: the Thought guides the planning process, while the Action dictates which external tool to invoke and with what arguments.
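To make this concrete, below is a minimal Python sketch of a single Planner step. The prompt layout, the function names, and the parsing regex are illustrative assumptions; only the Thought/Action output format comes from the description above.

```python
# Illustrative sketch of one Planner step; names and prompt layout are
# assumptions, not AssistGPT's actual code.
import re

def build_planner_prompt(instruction_prompt: str, query: str,
                         visual_summary: str, history: str) -> str:
    """Assemble the three Planner inputs plus the reasoning so far."""
    return (
        f"{instruction_prompt}\n\n"
        f"Visual inputs:\n{visual_summary}\n\n"
        f"Query: {query}\n\n"
        f"Reasoning so far:\n{history}\n"
        "Reply in the format:\n"
        "Thought: <why this step>\n"
        "Action: <module>(<arguments>)\n"
    )

def parse_planner_output(text: str):
    """Split the LLM response into a Thought and an Action (tool + arguments)."""
    thought = re.search(r"Thought:\s*(.+)", text)
    action = re.search(r"Action:\s*(\w+)\((.*)\)", text)
    if not (thought and action):
        raise ValueError("Planner output did not follow the Thought/Action format")
    return thought.group(1).strip(), action.group(1), action.group(2)
```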
Executor
The Executor module takes the code generated by the Planner and performs three steps: a validation check, module execution, and post-processing. The validation check ensures the code is executable; module execution runs the call against the various modules and APIs, which are standardized behind a unified interface; post-processing converts the results into a language format the Planner can read.
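A rough sketch of these three stages in Python, assuming a string-in/string-out tool interface and a simple `MODULES` registry (both simplifications for illustration):

```python
# Sketch of the Executor pipeline; the registry and the string-based tool
# interface are simplifying assumptions.
MODULES = {}  # maps module names to callables, registered elsewhere

def execute(module_name: str, args: str) -> str:
    # 1. Validation check: confirm the Planner's call is actually executable.
    if module_name not in MODULES:
        return f"Error: unknown module '{module_name}'"
    # 2. Module execution: every tool sits behind the same unified interface,
    #    so the Executor can invoke any of them the same way.
    raw_result = MODULES[module_name](args)
    # 3. Post-processing: phrase the raw result as language for the Planner.
    return f"Observation from {module_name}: {raw_result}"
```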
Inspector
The Inspector module manages the visual inputs by recording metadata for each visual element, including its type, source, and a description of its content. This metadata helps the Planner decide which source should be directed to which module. The Inspector monitors inputs and outputs and appends the metadata to the Planner's reasoning history.
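This bookkeeping might be represented roughly as below; the field names mirror the metadata listed above, while the class and method names are hypothetical.

```python
# Illustrative Inspector bookkeeping; class and method names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class VisualRecord:
    name: str         # handle the Planner can refer to, e.g. "image-1"
    type: str         # kind of visual element, e.g. "image" or "video"
    source: str       # where it came from: user upload or a module's output
    description: str  # short summary of the content

@dataclass
class Inspector:
    records: list = field(default_factory=list)

    def register(self, record: VisualRecord) -> str:
        """Track a new visual input and return the metadata line that is
        appended to the Planner's reasoning history."""
        self.records.append(record)
        return (f"[{record.name}] type={record.type}, "
                f"source={record.source}: {record.description}")
```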
Learner
The Learner module self-assesses and verifies the reasonableness of the system's output. It operates in two modes: self-assessment and ground-truth comparison. The Learner keeps retrying until it receives positive feedback or reaches the maximum number of attempts, and successful predictions are stored as in-context examples for future reference.
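The retry logic could look like the sketch below; `run_system` and `check_answer` are hypothetical stand-ins for a full AssistGPT run and for either feedback mode (LLM self-assessment or ground-truth comparison).

```python
# Sketch of the Learner's retry loop; helper names are hypothetical.
MAX_ATTEMPTS = 3
in_context_examples = []  # successful traces, reused in future Planner prompts

def learn(query, run_system, check_answer):
    """Retry until feedback is positive or the attempt budget is exhausted."""
    answer = None
    for _ in range(MAX_ATTEMPTS):
        trace, answer = run_system(query)
        if check_answer(query, answer):        # positive feedback
            in_context_examples.append(trace)  # store as an exemplar
            return answer
    return answer  # best effort after the last attempt
```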
Experiments
AssistGPT was evaluated on two benchmarks: A-OKVQA and NExT-QA. A-OKVQA tests the system's ability to answer questions about visual data using common sense and general knowledge; NExT-QA tests its reasoning abilities in video question answering. AssistGPT's performance was compared with top-tier methods on both benchmarks.
Quantitative Results
AssistGPT outperformed existing techniques in the in-context learning setting, particularly on multiple-choice questions. On direct-answer questions, its performance was on par with previous top-performing models, partly because it tends to generate full phrases instead of the expected single-word answers. Compared to the state-of-the-art method PromptCap, AssistGPT showed higher versatility. It performed well on causal and descriptive questions but struggled with temporal questions due to the lack of open-world temporal grounding models.
Qualitative Results
AssistGPT demonstrated its ability to decompose complex questions into manageable sub-tasks and to self-correct when necessary. Interweaving code and language reasoning allowed the model to pass each tool exactly the content it needs as input. Examples were provided to showcase the model's performance in real-world scenarios.
Highlights
- AssistGPT is a multimodal AI assistant system that combines language and code reasoning.
- The system consists of four modules: Planner, Executor, Inspector, and Learner; a sketch of how they fit together follows this list.
- The Planner directs global reasoning and planning based on inputs from the Instruction Prompt, Input Query, and Summary of Visual Inputs.
- The Executor performs a validation check, module execution, and post-processing on the generated code.
- The Inspector manages the visual inputs and appends metadata to the reasoning history.
- The Learner self-assesses and verifies the reasonableness of the system's output and incorporates successful predictions as in-context examples.
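Putting the four modules together, one full reasoning loop could look like the following sketch, which reuses the Planner and Executor functions sketched earlier; the `final_answer` terminating module and the fixed step budget are illustrative assumptions.

```python
# End-to-end sketch combining the module sketches above; illustrative only.
def answer(query: str, visual_summary: str, llm, instruction_prompt: str,
           max_steps: int = 10) -> str:
    history = ""
    for _ in range(max_steps):
        prompt = build_planner_prompt(instruction_prompt, query,
                                      visual_summary, history)
        thought, module, args = parse_planner_output(llm(prompt))  # Planner
        observation = execute(module, args)                        # Executor
        # Log the step; in AssistGPT the Inspector also appends metadata
        # here for any new visual outputs the module produced.
        history += f"Thought: {thought}\nAction: {module}({args})\n{observation}\n"
        if module == "final_answer":  # hypothetical terminating module
            return observation
    return "No answer reached within the step budget."
```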
FAQs
Q: How does AssistGPT handle complex visual tasks?
A: AssistGPT combines language and code reasoning methods to effectively handle complex visual tasks. It breaks down questions into sub-tasks and tackles them one at a time until the final answer is reached.
Q: What benchmarks were used to evaluate AssistGPT?
A: AssistGPT was evaluated on two benchmarks: A-OKVQA and NExT-QA. A-OKVQA tests the system's ability to answer questions using common sense and general knowledge, while NExT-QA focuses on video question answering.
Q: How does the Learner module improve the performance of AssistGPT?
A: The Learner module in AssistGPT self-assesses the system's output and collects successful predictions as in-context examples. This allows the model to continuously improve its planning abilities.
Q: How does AssistGPT compare to other existing techniques?
A: AssistGPT outperformed existing techniques in the in-context learning setting, particularly on multiple-choice questions. However, its performance on direct-answer questions was on par with previous top-performing models.
Q: Can AssistGPT handle real-world scenarios?
A: Yes, AssistGPT has demonstrated its ability to handle real-world scenarios by effectively decomposing complex questions and self-correcting when necessary. Examples of its performance in real-world scenarios are provided in the article.
Conclusion
AssistGPT is a versatile multimodal AI assistant that combines the benefits of flexible reasoning and robust tool invocation. It integrates multiple models and external tools to handle complex visual tasks. The system's performance was evaluated on benchmarks and showed promising results. AssistGPT's ability to handle complex problems and self-optimize sets it apart from existing techniques.