Unlocking the Power of Mistral MoE: Outperforming ChatGPT

Table of Contents

  • The Release of Mixture of Experts by Mistral AI
  • The Open Access Model
  • Comparison with GPT-3.5
  • Architecture of the Sparse Mixture-of-Experts Model
  • Training and Data Collection
  • Fine-Tuning and Alignment
  • Using the Models: Local and Cloud Options
  • Pricing Comparison with GPT-3.5 Turbo
  • Controversy Surrounding the Terms of Use
  • Additional Models and Services by Mistral AI
  • Embedding Endpoint: Mistral Embed
  • API Specifications and Compatibility
  • Conclusion

🚀 The Release of Mixture of Experts by Mistral AI

Mistral AI has made a significant announcement with the official release of their mixture-of-experts (MoE) model, called Mixtral. This open-access model has shown superior performance compared to ChatGPT (GPT-3.5) on various benchmarks. In fact, Mistral AI has released not just one but three different models. Alongside Mixtral, they have introduced an embedding endpoint, Mistral Embed, which can be integrated into your own applications, and launched their own platform, which provides access to their models through an API.

👀 The Open Access Model

Mixtral, a sparse mixture-of-experts (SMoE) model, stands out as the strongest open-weight model with a permissive license. It surpasses Llama 2 70B on most benchmarks while achieving six times faster inference. With a context window of 32,000 tokens, Mixtral offers a much larger capacity than Mistral's original 7B model. It supports multiple languages (English, French, Italian, German, and Spanish), excels at code generation, and can be fine-tuned for instruction-following tasks, where it reaches an impressive score of 8.3 on MT-Bench.

🎯 Pros

  • Superior performance compared to ChatGPT (GPT-3.5)
  • Permissive license
  • Context window of 32,000 tokens
  • Multilingual support
  • Impressive code generation abilities
  • Fine-tuning options available

🚫 Cons

  • No support for Asian languages

🔍 Comparison with GPT-3.5

In terms of performance, Mixtral outperforms GPT-3.5 on five of the seven standard benchmarks, a significant achievement for an open-weight, open-access model. Although it slightly lags behind GPT-3.5 on the remaining two, WinoGrande and MT-Bench, the difference is negligible. Overall, Mixtral presents a compelling alternative to GPT-3.5, with competitive performance and favorable cost-performance trade-offs.

🏗️ Architecture of the Sparse Mixture-of-Experts Model

Mixtral follows a sparse mixture-of-experts architecture. It is a decoder-only model in which each feed-forward block is replaced by eight distinct groups of parameters (the "experts"); at every layer, a router network chooses two of the eight experts to process each token. By activating only a fraction of the network's parameters per token, Mixtral controls cost and improves inference speed without sacrificing quality. Of roughly 47 billion total parameters, each token uses only about 13 billion for its predictions.
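The routing idea above can be sketched in a few lines of NumPy. This is a toy illustration, not Mistral's implementation: the "experts" here are plain linear maps standing in for feed-forward blocks, and names like `moe_layer` are invented for the example.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Toy sparse MoE feed-forward: route each token to its top-k experts.

    x: (tokens, dim) activations; gate_w: (dim, n_experts) router weights;
    experts: list of callables, each standing in for a feed-forward block.
    """
    logits = x @ gate_w                            # router score per expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over only the selected experts' scores
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()
        # weighted sum of the chosen experts' outputs; the other 6 never run
        for weight, e_idx in zip(w, top[t]):
            out[t] += weight * experts[e_idx](x[t])
    return out

rng = np.random.default_rng(0)
dim, n_experts, tokens = 8, 8, 4
gate_w = rng.normal(size=(dim, n_experts))
mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda v, m=m: v @ m for m in mats]
x = rng.normal(size=(tokens, dim))
y = moe_layer(x, gate_w, experts)
print(y.shape)  # (4, 8): same shape as the input, but only 2 of 8 experts ran per token
```

This is why the parameter math works out the way it does: all eight experts' weights must sit in memory, but each token's forward pass only touches two of them.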

⚙️ Training and Data Collection

Mixtral is pre-trained on data extracted from the open web; the cutoff date of the training data remains undisclosed. Mistral AI trains the experts and the router network simultaneously. The sparse architecture allows for increased inference speed. In comparative plots, Mixtral outperforms other models, including the Llama 2 family, across tasks such as comprehension, mathematics, and code generation, and the base model's results demonstrate its efficiency relative to the inference budget.

🚀 Fine-Tuning and Alignment

To turn the base model into an instruction-following version, Mistral AI used supervised fine-tuning. Alignment was then achieved with direct preference optimization (DPO), an alternative to reinforcement learning from human feedback. Instead of training a separate reward model, DPO optimizes directly on pairs of preferred and rejected responses, where the preference labels can come from another model such as GPT-4; this streamlines the alignment process. The instruct version is powerful but requires careful usage: it can be prompted to ban certain outputs, and it may need additional preference tuning for moderation purposes.
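The core of DPO is a single loss over a (chosen, rejected) response pair, comparing the tuned policy to a frozen reference model. Below is a minimal sketch of that loss for one pair; the function name and the toy log-probability values are illustrative only.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_w / logp_l: summed token log-probs of the chosen (w) and rejected (l)
    responses under the policy being tuned; ref_logp_* are the same quantities
    under the frozen reference model. beta scales the implicit KL penalty.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy prefers the chosen answer more strongly than the reference
# does, the margin is positive and the loss is small; if it prefers the
# rejected answer, the loss grows.
good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
bad = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-6.0, ref_logp_l=-6.0)
print(good < bad)  # True
```

Minimizing this loss pushes probability mass toward preferred responses without ever fitting an explicit reward model, which is what makes the pipeline simpler than RLHF.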

⚙️ Using the Models: Local and Cloud Options

Mistral AI offers multiple options for using their models. Running Mixtral locally requires roughly 35 GB of VRAM, so for many developers a cloud-based setup is the more practical approach. Developers can access Mixtral through Mistral AI's platform, or explore fine-tuned versions on platforms like Hugging Face. Mistral AI's own platform offers cost-effective endpoints for its models: Mistral Tiny, Mistral Small, and Mistral Medium.
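A quick back-of-the-envelope calculation shows where VRAM figures like this come from. Only the ~47B total parameter count is taken from the section above; the precision levels are illustrative, and real usage adds overhead (KV cache, activations) on top of the weights.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Rough memory footprint of the model weights alone (no KV cache or overhead)."""
    return n_params * bits_per_param / 8 / 1e9

N = 47e9  # Mixtral's approximate total parameter count
for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(N, bits):.0f} GB")
# fp16: ~94 GB, int8: ~47 GB, 4-bit: ~24 GB
```

At full 16-bit precision the weights alone would need ~94 GB, so a ~35 GB figure presumably assumes quantized weights plus runtime overhead.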

💵 Pricing Comparison with GPT-3.5 Turbo

When it comes to pricing, Mistral AI's models, especially Mistral Tiny and Mistral Small, offer a more affordable alternative to GPT-3.5 Turbo. Note, however, that the top-performing Mistral Medium model is not available for direct download; Mistral AI provides access to it only through their API endpoint. Potential users should also be aware of the terms of use, which prohibit using the outputs to reverse engineer the services or to develop competing services.

❗ Controversy Surrounding the Terms of Use

Mistral AI's terms of use have sparked controversy due to restrictions on using outputs or derived versions for reverse engineering or competing purposes. While further legal clarification is required, this restriction potentially limits the ability to retrain models that compete directly with Mistral AI. Developers are advised to review and fully understand the terms of use before undertaking any projects that may conflict with the stated conditions.

🌐 Additional Models and Services by Mistral AI

In addition to Mixtral and the embedding endpoint, Mistral AI offers other models and services. Mistral Tiny and Mistral Small are cost-effective alternatives tailored to different requirements. Their most powerful model, Mistral Medium, posts remarkable benchmark scores that set it apart from the other available models. However, Mistral Medium is currently served as a prototype rather than a full release, and it remains unclear whether it also uses a mixture-of-experts architecture.

📊 Embedding Endpoint: Mistral Embed

Mistral Embed, Mistral AI's embedding endpoint, produces vectors of size 1,024. Although its score on MTEB (the Massive Text Embedding Benchmark) is not the highest, having an embedding model available alongside the language models is significant for building robust pipelines with a single provider. Mistral Embed is accessible only through the API endpoint.
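In a typical pipeline, those 1,024-dimensional vectors are compared with cosine similarity to rank documents against a query. A minimal sketch, with hard-coded stand-in vectors in place of real API responses:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for two 1,024-dimensional vectors an embedding endpoint might return.
v1 = [1.0] * 1024
v2 = [1.0] * 512 + [0.0] * 512
print(round(cosine_similarity(v1, v2), 3))  # 0.707
```

Because cosine similarity only depends on vector direction, it works the same regardless of which provider produced the embeddings, as long as both vectors come from the same model.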

📡 API Specifications and Compatibility

Mistral AI's API follows the specifications of the widely used chat interface proposed by their competitor, OpenAI. By adopting this format, Mistral AI ensures compatibility and simplifies integration for developers using both OpenAI and Mistral AI's services. This standardization facilitates the seamless transition between different vendors.
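In practice, that compatibility means the request body has the same shape for both vendors, so switching is mostly a matter of changing the endpoint URL, API key, and model name. A hypothetical request body (the model name and message contents are illustrative):

```python
import json

# OpenAI-style chat completions request body, as mirrored by Mistral's API.
payload = {
    "model": "mistral-tiny",  # illustrative model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize sparse mixture-of-experts in one sentence."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}
body = json.dumps(payload)  # this JSON is what gets POSTed to the chat endpoint
print(len(body) > 0)  # True
```

The `messages` list with `role`/`content` pairs is the shared convention; only the `model` value and the base URL differ between providers.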

📝 Conclusion

With the release of the promising Mixtral model and its accompanying offerings, Mistral AI has expanded the landscape of open-access models. Mixtral demonstrates outstanding performance, surpassing GPT-3.5 on most standard benchmarks while offering faster inference and a large context window. By using a sparse mixture-of-experts architecture, Mistral AI strikes a remarkable balance between performance, cost, and speed. As the open-source community continues to explore and fine-tune these models, the future holds exciting possibilities in the field of large language models.
