Predicting Multi-Agent Motion with MotionLM
Table of Contents
- Introduction
- Auto Regressive Trajectory Prediction
- Discrete Sequence Modeling in Continuous Domains
- MotionLM: Combining Trajectory Generation and Interaction Modeling
- Training Objective
- Ego Agent Reference Frames
- Enforcing Temporal Causality
- Rollout Aggregation
- Interactive Motion Prediction
- Conclusion
Introduction
This article explores how modern sequence models, which are trained to predict the next token in a sequence, can be applied to driving, and specifically to predicting the behavior of road users. Navigating the complex web of interactions on the road requires anticipating the likely actions and responses of other road users. We introduce MotionLM, an approach that combines trajectory generation and interaction modeling for multi-agent motion forecasting. It has shown promising results on several downstream behavior prediction tasks.
Auto Regressive Trajectory Prediction
Auto-regressive trajectory prediction is a common way to predict the future paths of multiple agents in a scene, such as cars or pedestrians. Rather than relying on latent variables or beam search, our approach samples directly from a learned distribution over sequences of discrete motion tokens. Drawing several samples yields several possible joint trajectories with no additional machinery. Because trajectories live in a continuous space, this requires discrete sequence modeling in a continuous domain: the output space is divided into discrete bins, and the model predicts a categorical distribution over those bins at each step.
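The discretization described above can be sketched as follows. This is a minimal illustration, not the model's actual tokenizer: it assumes uniform bins over per-step displacements, and the constants `MAX_ABS_DELTA` and `NUM_BINS` as well as all function names are hypothetical choices made for the example.

```python
import math
import random

# Hypothetical settings, chosen only for illustration.
MAX_ABS_DELTA = 6.5   # metres of displacement covered per axis, per step
NUM_BINS = 13         # bins per axis, so the vocabulary has 13 * 13 tokens
BIN_WIDTH = 2.0 * MAX_ABS_DELTA / NUM_BINS


def quantize(delta):
    """Map a continuous per-step displacement to a bin index in [0, NUM_BINS - 1]."""
    clipped = max(-MAX_ABS_DELTA, min(MAX_ABS_DELTA, delta))
    return min(int((clipped + MAX_ABS_DELTA) / BIN_WIDTH), NUM_BINS - 1)


def dequantize(index):
    """Return the centre of a bin, in metres."""
    return -MAX_ABS_DELTA + (index + 0.5) * BIN_WIDTH


def tokenize(trajectory):
    """Turn a list of (x, y) waypoints into one discrete token per step."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        tokens.append(quantize(x1 - x0) * NUM_BINS + quantize(y1 - y0))
    return tokens


def detokenize(tokens, start):
    """Rebuild an approximate trajectory by accumulating bin centres."""
    x, y = start
    points = [(x, y)]
    for token in tokens:
        ix, iy = divmod(token, NUM_BINS)
        x += dequantize(ix)
        y += dequantize(iy)
        points.append((x, y))
    return points


def sample_token(logits, rng):
    """Sample one token from the categorical distribution softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Reconstruction is lossy: `detokenize(tokenize(traj), traj[0])` recovers the waypoints only up to the bin width, which is the usual trade-off when a continuous output space is replaced by a finite vocabulary.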
MotionLM: Combining Trajectory Generation and Interaction Modeling
Our proposed model, MotionLM, aims to capture the interactions between multiple agents in a way that applies to different tasks, such as marginal, joint, and conditional forecasting. The model consists of an encoder that processes the initial scene elements and a trajectory decoder that performs both cross-attention to the scene encodings and self-attention along agent motion tokens. By using a temporally causal decoding process over discrete motion tokens, MotionLM generates joint trajectories step by step: at each step, interacting agents sample their tokens simultaneously and attend to one another's previous tokens.
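The step-by-step joint decoding can be sketched as a simple loop. This is only an illustration under stated assumptions: `toy_logits` is a hypothetical stand-in for the real decoder (which would also attend to the scene encodings), and the function names and signatures are invented for the example.

```python
import math
import random


def toy_logits(agent_idx, histories, vocab_size):
    """Hypothetical stand-in for the decoder: scores tokens from past tokens only."""
    logits = [0.0] * vocab_size
    for other_idx, history in enumerate(histories):
        if history:
            # Lean towards the most recent token of every agent, own or other.
            logits[history[-1]] += 0.5 if other_idx == agent_idx else 1.0
    return logits


def sample_categorical(logits, rng):
    """Sample an index from softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def joint_rollout(num_agents, num_steps, vocab_size, logits_fn, rng):
    """Sample one joint rollout: a list of token sequences, one per agent."""
    histories = [[] for _ in range(num_agents)]
    for _ in range(num_steps):
        # Every agent samples from a distribution conditioned on the same
        # snapshot of past tokens; no agent sees a token from the current step.
        step_tokens = [
            sample_categorical(logits_fn(n, histories, vocab_size), rng)
            for n in range(num_agents)
        ]
        # Only after all agents have sampled are the histories extended.
        for n, token in enumerate(step_tokens):
            histories[n].append(token)
    return histories
```

Calling `joint_rollout` repeatedly with different random states gives a set of joint samples; no beam search or latent variable is involved.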
Training Objective
To train MotionLM, we use a maximum likelihood objective, similar to how modern language models are trained. During training, each agent is exposed to the ground-truth action sequences of all other agents up to the current time step (teacher forcing). Because the ground-truth tokens are known in advance, all time steps can be processed in parallel by modern attention-based architectures. The approach does have theoretical limitations, however, such as the potential for compounding errors and for self-delusions caused by unobserved factors.
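In code, a maximum likelihood objective of this kind amounts to minimizing the negative log-likelihood of the ground-truth tokens. The sketch below assumes the per-agent, per-step logits have already been produced under teacher forcing; the nested-list layout and the name `joint_nll` are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def joint_nll(logits, targets):
    """Mean negative log-likelihood of the ground-truth tokens.

    logits[n][t] holds the vocabulary logits for agent n at step t, computed
    with the ground-truth tokens of all agents before step t as input.
    targets[n][t] is the ground-truth token index for that agent and step.
    """
    total, count = 0.0, 0
    for agent_logits, agent_targets in zip(logits, targets):
        for step_logits, token in zip(agent_logits, agent_targets):
            total -= log_softmax(step_logits)[token]
            count += 1
    return total / count
```

Since every `(agent, step)` term depends only on ground-truth inputs, the terms are independent of one another at training time, which is what makes the parallel computation possible.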
Ego Agent Reference Frames
In our model, each agent in the scene is represented from its own perspective: it is treated as the central, or ego, agent, and the scene is expressed relative to it. This lets the model focus on the scene features that are relevant to each agent. By batching these ego-centric views together, we can process all agents simultaneously during training and inference, which speeds up both.
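Geometrically, moving a scene into an agent's own frame is a translation followed by a rotation. The sketch below assumes 2D positions, headings in radians measured counter-clockwise from the global x-axis, and a frame in which the ego agent sits at the origin facing +x; the function names and the `(x, y, heading)` layout are assumptions made for the example.

```python
import math


def to_ego_frame(point, ego_position, ego_heading):
    """Express a global (x, y) point in the frame of one ego agent."""
    dx = point[0] - ego_position[0]
    dy = point[1] - ego_position[1]
    cos_h, sin_h = math.cos(ego_heading), math.sin(ego_heading)
    # Rotate by -ego_heading so the ego agent's forward direction becomes +x.
    return (cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy)


def scene_per_ego(agents):
    """Build one ego-centric copy of the scene per agent.

    agents is a list of (x, y, heading) tuples in the global frame. The result
    has one entry per agent, each listing all agent positions in that agent's frame.
    """
    views = []
    for ego_x, ego_y, ego_heading in agents:
        views.append(
            [to_ego_frame((x, y), (ego_x, ego_y), ego_heading) for x, y, _ in agents]
        )
    return views
```

Each entry of `scene_per_ego` is an independent view of the same scene, so the views can be stacked along a batch dimension and processed together.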
Enforcing Temporal Causality
To ensure that the model respects temporal causality, we apply an attention mask during training so that representations are updated using past actions only. Under this mask, each agent can attend to its own and the other agents' actions up to the current step, while all future actions are hidden. This helps the model make more accurate predictions and reduces the likelihood of inconsistencies.
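One way to build such a mask is sketched below. It assumes the decoder inputs of all agents are flattened into a single agent-major sequence (index = agent * num_steps + step) and that the input at step t encodes the action taken before step t, so allowing attention to steps `<= t` never reveals the action being predicted. Both the layout and the name `build_causal_mask` are assumptions for illustration.

```python
def build_causal_mask(num_agents, num_steps):
    """Return a square boolean mask; True means the query may attend to the key.

    Rows are queries and columns are keys, both indexed as
    agent * num_steps + step.
    """
    size = num_agents * num_steps
    mask = [[False] * size for _ in range(size)]
    for query in range(size):
        query_step = query % num_steps
        for key in range(size):
            key_step = key % num_steps
            # Any agent's input is visible as long as it is not from a later step.
            mask[query][key] = key_step <= query_step
    return mask
```

For several agents the result is a grid of lower-triangular blocks, one block per pair of agents, rather than the single lower triangle used for an ordinary one-sequence language model.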
Rollout Aggregation
Rollout aggregation condenses a large set of sampled rollouts into a small number of representative modes. Each mode represents one possible outcome and is assigned a probability. We employ non-maximum suppression (NMS) aggregation and model ensembling to uncover the underlying modes and estimate their probabilities. Given enough samples from the model, these modes can accurately represent the multimodal future distribution.
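A simplified, NMS-style aggregation can be sketched as follows. The exact procedure is not spelled out here, so this is an assumed greedy variant: it compares trajectory endpoints only, repeatedly keeps the rollout whose endpoint has the most neighbours within `radius`, and suppresses those neighbours. The function name, the endpoint-only distance, and the `radius` parameter are all assumptions made for the example.

```python
import math


def aggregate_rollouts(rollouts, num_modes, radius):
    """Greedy endpoint NMS: return up to num_modes (trajectory, probability) pairs.

    rollouts is a list of trajectories, each a list of (x, y) points. A mode's
    probability is the fraction of all rollouts it absorbed.
    """
    endpoints = [trajectory[-1] for trajectory in rollouts]

    def distance(i, j):
        return math.hypot(
            endpoints[i][0] - endpoints[j][0], endpoints[i][1] - endpoints[j][1]
        )

    remaining = set(range(len(rollouts)))
    modes = []
    while remaining and len(modes) < num_modes:
        # Keep the rollout with the most not-yet-absorbed neighbours.
        best = max(
            sorted(remaining),
            key=lambda i: sum(distance(i, j) <= radius for j in remaining),
        )
        absorbed = {j for j in remaining if distance(best, j) <= radius}
        modes.append((rollouts[best], len(absorbed) / len(rollouts)))
        # Suppress everything close to the chosen mode.
        remaining -= absorbed
    return modes
```

If `num_modes` is reached before every rollout has been absorbed, the returned probabilities sum to less than one and would need renormalizing.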
Interactive Motion Prediction
Our model achieved strong results in the interactive prediction challenge, outperforming previous approaches. Instead of scoring pairs of pre-built marginal trajectories, it generates joint rollouts directly. Interactive attention plays a crucial role here, because it lets agents respond to one another as the rollout unfolds, which results in more accurate predictions. We also studied how the frequency of interactive attention during joint rollouts affects the results: higher frequencies improved performance and reduced the likelihood of implausible overlaps or collisions between the predictions for different agents.
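A simple way to count overlapping predictions is sketched below. It is a rough proxy rather than the metric actually used in the evaluation: agents are treated as points that must stay at least `min_separation` apart at every shared time step, whereas a real check would typically compare oriented bounding boxes. The function names and the threshold parameter are assumptions for the example.

```python
import math
from itertools import combinations


def trajectories_overlap(traj_a, traj_b, min_separation):
    """True if two time-aligned trajectories come closer than min_separation."""
    return any(
        math.hypot(ax - bx, ay - by) < min_separation
        for (ax, ay), (bx, by) in zip(traj_a, traj_b)
    )


def overlap_rate(joint_rollouts, min_separation):
    """Fraction of joint rollouts in which at least one pair of agents overlaps.

    joint_rollouts is a list of rollouts; each rollout is a list with one
    trajectory per agent, and each trajectory is a list of (x, y) points.
    """
    if not joint_rollouts:
        return 0.0
    flagged = sum(
        any(
            trajectories_overlap(a, b, min_separation)
            for a, b in combinations(rollout, 2)
        )
        for rollout in joint_rollouts
    )
    return flagged / len(joint_rollouts)
```

Comparing this rate across models that differ only in how often agents attend to one another gives a direct view of how interactive attention affects scene consistency.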
Conclusion
In conclusion, MotionLM shows great promise for multi-agent motion forecasting. By combining trajectory generation and interaction modeling in a single temporally causal decoding process, the model can predict the behavior of road users more accurately. Generating joint rollouts directly, using interactive attention, and applying rollout aggregation together allow it to better represent multimodal future distributions. Our evaluation demonstrates that the model improves performance on several motion prediction challenges.