Scaling ML Workloads on Amazon EC2 | Boost Performance and Productivity with Intel


Table of Contents

  1. Introduction
  2. The AWS ML Stack
    • AI Services Layer
    • SageMaker Layer
    • Frameworks and Infrastructure Layer
  3. Scaling ML Workloads using EC2 Services and Open Source Tools
  4. Accelerating Proof of Concepts with Intel Partnership
  5. Enabling PyTorch Adoption and Scale on AWS
  6. Trends in Deep Learning Models and Training
    • Increasing Model Size and Complexity
    • Distributed Training and Orchestration
  7. Strategies for Training Big Models
    • Utilizing the Right Parallelism Paradigm
    • Achieving Linear Scaling
    • Experimenting with Multiple Accelerators
    • Profiling and Managing Large Compute Fleets
  8. ML Inference on CPU-based Instances
    • The Volume and Cost of ML Inference
    • Sparse Inference vs. Dense Inference
  9. Optimization Strategies for ML Inference
    • Matching ML Inference Requirements to Workload Types
    • Utilizing Intel Habana Gaudi for Dense Inference
    • Model Bin Packing for Cost Efficiency
  10. Convergence of ML and HPC Workloads
    • Physics-Enabled and Physics-Informed Workloads
    • Collaborating with Intel for HPC+ML Solutions
  11. Case Study: Optimizing Car Geometry in Formula One
  12. Summary and Conclusion

🔍 Introduction

Good morning, folks! I'm Sundar, and I lead the ML Frameworks team at AWS. I'm thrilled to be here today to walk you through a wide range of topics in the field of machine learning. We have a lot to cover, so let's dive right in.

🚀 The AWS ML Stack

The AWS ML Stack is designed to cater to three classes of customers. At the topmost layer, we have the AI Services layer, which offers AI capabilities through APIs, allowing customers to embed AI without directly training models. Moving down, we have the SageMaker layer, a comprehensive end-to-end platform that covers the entire ML workflow, from data preparation to deployment and MLOps. Finally, we have the Frameworks and Infrastructure layer, where things get interesting and complex.

AI Services Layer

The AI Services layer is perfect for customers looking to leverage AI capabilities without the need to train models directly. With a variety of APIs, customers can easily integrate AI into their applications, enhancing user experiences and automating processes.

SageMaker Layer

The SageMaker layer is a fully managed offering that provides a complete platform for ML development. From data wrangling to model training and deployment, SageMaker streamlines the entire ML workflow. Customers benefit from the simplicity and efficiency of a unified platform.
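
To make this concrete, here is a minimal sketch of launching a managed training job with the SageMaker Python SDK. The entry-point script, IAM role ARN, S3 path, and instance type are placeholders, not values from the talk; substitute settings from your own account.

```python
# Minimal sketch of a managed PyTorch training job via the SageMaker Python SDK.
# The entry point, role ARN, instance type, and S3 URI below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",     # placeholder IAM role
    instance_count=1,
    instance_type="ml.c6i.4xlarge",                          # CPU instance; choose per workload
    framework_version="1.13",
    py_version="py39",
)
estimator.fit({"training": "s3://my-bucket/train-data"})     # placeholder S3 channel
```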

Frameworks and Infrastructure Layer

At the Frameworks and Infrastructure layer, we encounter the true complexity of ML. This layer is where customers face the challenge of making choices from a wide range of tools, compute instances, orchestration options, open-source frameworks, libraries, and inference servers. It is in this diverse landscape that our partnership with Intel adds tremendous value, as it helps us guide customers through the decision-making process.

📚 Scaling ML Workloads using EC2 Services and Open Source Tools

One of our key areas of focus is developing architectures to scale ML workloads using core EC2 services and open-source tools. We understand that customers appreciate prescriptive solutions, particularly when it comes to selecting the right compute instances for their specific workloads. Through our partnership with Intel, we have been able to optimize ML inference on CPUs and leverage the capabilities of DL1 instances. This collaboration has resulted in significant performance improvements and cost savings for our customers.

🚀 Accelerating Proof of Concepts with Intel Partnership

Proof of Concepts (PoCs) play a crucial role in helping customers progress from exploring the art of the possible to full-scale production. Our team focuses on accelerating PoCs by showcasing the potential of ML solutions and providing hands-on support. We have been working closely with Intel to achieve impressive results, such as training a one-trillion-parameter model across 512 accelerators. Additionally, our collaboration has enabled DeepSpeed support in the SynapseAI 1.5/1.6 release, further enhancing our customers' experience.

🌟 Enabling PyTorch Adoption and Scale on AWS

PyTorch has gained significant popularity in the ML community due to its simplicity and flexibility. To support our customers in adopting and scaling PyTorch on AWS, we have teamed up with Facebook (Meta) for joint go-to-market efforts. Through this partnership, we aim to provide seamless integration, comprehensive support, and efficient utilization of PyTorch on the AWS platform. By working closely with Intel, we ensure that our customers have the best ecosystem for PyTorch development.

📈 Trends in Deep Learning Models and Training

Over the past few years, we have witnessed an exponential growth in the size and complexity of deep learning models. State-of-the-art NLP models have evolved from 20 billion to over 1 trillion parameters. This rapid advancement presents new challenges and opportunities in training and deploying these models effectively.

Increasing Model Size and Complexity

The growing size and complexity of models have necessitated innovations in training methodologies. Training models with billions or trillions of parameters requires distributed training across hundreds or even thousands of accelerators. Streamlining inter-node communication and orchestration is integral to achieving linear scaling and optimal performance.

Distributed Training and Orchestration

As model sizes increase, addressing bottlenecks in network communication becomes crucial. Customers have adopted solutions like EKS or traditional HPC scaling methods to efficiently orchestrate large fleets of accelerators, CPUs, and nodes. We collaborate closely with Intel to ensure our customers have access to the best orchestration options and achieve high training efficiencies.
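
Whatever the orchestrator, the per-node training code tends to follow the same pattern. Below is a minimal sketch of a PyTorch DistributedDataParallel job launched with torchrun; the node counts, model, and batch are placeholders rather than anything discussed in the talk.

```python
# Minimal sketch of multi-node data-parallel training with PyTorch DDP.
# Launch on every node with, for example:
#   torchrun --nnodes=16 --nproc_per_node=8 --rdzv_backend=c10d \
#            --rdzv_endpoint=<head-node>:29500 train_ddp.py
# The model and batch below are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # or "gloo" on CPU-only nodes
    local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    batch = torch.randn(32, 1024, device=local_rank)        # placeholder batch
    loss = ddp_model(batch).sum()
    loss.backward()                                          # gradients are all-reduced here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```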

💡 Strategies for Training Big Models

Training big models involves facing multiple challenges related to parallelism, scalability, and performance. By implementing the right strategies, customers can overcome these challenges and train models efficiently.

Utilizing the Right Parallelism Paradigm

Choosing the appropriate parallelism paradigm depends on the size of the model and resource limitations. With the increasing size of models, GPU and CPU memory constraints pose a challenge. Customers need to explore parallelism techniques and solutions that can overcome these limitations.
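
One such technique is sharded data parallelism. The sketch below shows how a model that no longer fits on a single device can be wrapped with PyTorch Fully Sharded Data Parallel (FSDP); it assumes a process group has already been initialized (as in the DDP example above), and the model itself is a placeholder.

```python
# Sketch of sharding a model across ranks with PyTorch FSDP.
# Assumes torch.distributed has already been initialized; the model is a placeholder.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Sequential(               # placeholder "big" model
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
)

# Each rank keeps only a shard of the parameters, gradients, and optimizer state,
# gathering full parameters layer by layer during forward and backward passes.
sharded_model = FSDP(model)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)
```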

Achieving Linear Scaling

Efficiently scaling ML workloads is crucial to achieve high training efficiencies. Addressing network bottlenecks and utilizing software and hardware solutions can help achieve linear scaling. Our collaboration with Meta (Facebook) and Intel has resulted in significant improvements, such as 10x faster training times through gradient compression methods using PowerSGD.
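
PyTorch ships a PowerSGD communication hook for DDP, so enabling gradient compression is typically a couple of lines. The sketch below registers it on a DDP-wrapped model (`ddp_model` from the earlier example); the rank and warm-up settings are illustrative, not tuned values from the collaboration.

```python
# Sketch of enabling PowerSGD gradient compression on a DDP-wrapped model.
# The approximation rank and warm-up iteration count are illustrative only.
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(
    process_group=None,              # default process group
    matrix_approximation_rank=2,     # low-rank approximation of the gradients
    start_powerSGD_iter=1000,        # run vanilla allreduce during warm-up
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```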

Experimenting with Multiple Accelerators

To optimize both price and performance, it is advisable to experiment with multiple accelerators. Starting with one accelerator and then switching to another can help identify the most suitable option for a given workload. We have demonstrated the effectiveness of this approach with Intel, checkpointing models trained on other accelerators and resuming training on DL1 instances.
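
At the framework level, switching accelerator families is mostly a matter of writing a device-agnostic checkpoint and reloading it on the new device. The sketch below uses a placeholder model and path; on DL1 (Gaudi) the target device string would be "hpu" with the Habana PyTorch bridge installed.

```python
# Sketch of checkpointing on one accelerator family and resuming on another.
# The model, checkpoint path, and target device are placeholders.
import torch

model = torch.nn.Linear(1024, 1024)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# --- on the original accelerator: write a device-agnostic checkpoint ---
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# --- on the new instance type: reload and move to the target device ---
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
model.to("cpu")   # e.g. "cuda" on GPU instances or "hpu" on DL1 (Gaudi)
```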

Profiling and Managing Large Compute Fleets

When dealing with large fleets of compute resources, profiling becomes crucial. Identifying and replacing faulty nodes is essential for maintaining performance and avoiding downtime. Our collaboration with Intel enabled us to profile DL1 instances for a customer, resulting in improved fault identification and replacement processes.
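
One simple way to spot stragglers is to profile each rank and compare the results. The sketch below uses torch.profiler around a placeholder training step; in practice each rank would write its trace to shared storage so outliers can be identified and the corresponding nodes replaced.

```python
# Sketch of per-node profiling with torch.profiler; the training step is a placeholder.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024)          # placeholder model
batch = torch.randn(64, 1024)                # placeholder batch

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    loss = model(batch).sum()                # placeholder training step
    loss.backward()

# Ranks whose step or communication time is an outlier are candidates for
# replacement before they slow down the whole job.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```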

💻 ML Inference on CPU-based Instances

Although ML inference is traditionally associated with GPU-based instances, we have observed that a significant portion of ML inference is executed on CPU-based instances. Many customers, especially those using traditional machine learning or small deep learning models, rely on Xeon-based CPU instances such as those in the C5 family to run their inference workloads. As model sizes grow and inference volume increases, it becomes essential to optimize the price-performance ratio for ML inference.

The Volume and Cost of ML Inference

ML inference involves processing a large number of API calls, which can lead to increased costs and resource utilization. Managing costs and fluctuations in capacity becomes paramount. Customers must have a clear understanding of price-performance trade-offs and align their ML inference requirements with the most appropriate workload types.

Sparse Inference vs. Dense Inference

Sparse inference, characterized by medium latency requirements and low throughput, is best served by Xeon-based scalable C6i instances. Leveraging the Intel Extension for PyTorch (IPEX) further accelerates sparse inference, enabling the efficient processing of millions of API requests within strict latency requirements. On the other hand, dense inference, which requires low latency and high throughput, benefits from Intel Habana Gaudi-based DL1 instances. These powerful accelerators allow large models to be served at impressive speeds.
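
With IPEX, CPU inference optimization is typically a one-line wrap around an existing model. The sketch below uses a placeholder model and bfloat16 purely as an example of a low-precision path on a Xeon-based instance such as C6i.

```python
# Sketch of CPU inference with the Intel Extension for PyTorch (IPEX).
# The model is a placeholder; bfloat16 is one example of a low-precision path.
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(768, 2).eval()              # placeholder model
model = ipex.optimize(model, dtype=torch.bfloat16)  # apply IPEX graph/kernel optimizations

inputs = torch.randn(1, 768)
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    logits = model(inputs)
```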

📊 Optimization Strategies for ML Inference

Achieving cost efficiency and optimal performance in ML inference requires careful optimization strategies. Customers must match ML inference requirements to the most suitable workload types, leverage specialized accelerators for dense inference tasks, and employ model bin packing techniques to minimize costs.

Matching ML Inference Requirements to Workload Types

Understanding the characteristics of ML inference workloads is crucial for selecting the appropriate compute instances. Sparse inference workloads with medium latency requirements and low throughput are best suited for Xeon-based scalable C6i instances. Dense inference workloads, demanding low latency and high throughput, benefit from Intel Habana Gaudi-based DL1 instances.

Utilizing Intel Habana Gaudi for Dense Inference

Intel Habana Gaudi and DL1 instances offer a powerful solution for dense inference workloads. These instances provide high throughput and low-latency processing, enabling customers to infer large models efficiently. By leveraging the capabilities of Intel Habana Gaudi, customers can achieve remarkable performance gains in their dense inference workloads.
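
For illustration, the sketch below runs a placeholder model on a Gaudi device, assuming the Habana SynapseAI PyTorch bridge is installed on the DL1 instance; the model, batch size, and shapes are made up.

```python
# Sketch of dense inference on a Gaudi (DL1) device, assuming the Habana
# SynapseAI PyTorch bridge is installed. The model and batch are placeholders.
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).eval().to(device)   # placeholder model

inputs = torch.randn(64, 1024).to(device)
with torch.no_grad():
    outputs = model(inputs)
    htcore.mark_step()     # flush the accumulated graph to the device
```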

Model Bin Packing for Cost Efficiency

To optimize costs, customers can employ model bin packing techniques. By grouping together models with similar resource requirements, customers can reduce the overall cost profile. This approach efficiently utilizes compute instances, enabling customers to achieve cost savings without compromising performance.
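
As a toy illustration of the idea, the first-fit-decreasing sketch below packs models onto instances by memory footprint alone; the model names, sizes, and instance capacity are made up, and a real packer would also account for latency SLAs and per-model throughput.

```python
# Toy first-fit-decreasing sketch of bin packing models onto inference instances
# by memory footprint. All names and numbers are illustrative.
def pack_models(models_gb, instance_capacity_gb):
    bins = []  # each bin is a list of (name, size_gb) tuples hosted on one instance
    for name, size in sorted(models_gb.items(), key=lambda kv: kv[1], reverse=True):
        for b in bins:
            if sum(s for _, s in b) + size <= instance_capacity_gb:
                b.append((name, size))
                break
        else:
            bins.append([(name, size)])   # no existing instance fits; open a new one
    return bins

models = {"ranker": 6.0, "intent": 2.5, "ner": 1.5, "sentiment": 1.0, "rerank": 5.0}
for i, b in enumerate(pack_models(models, instance_capacity_gb=8.0)):
    print(f"instance {i}: {b}")
```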

🌐 Convergence of ML and HPC Workloads

The convergence of ML and HPC (High-Performance Computing) workloads presents exciting opportunities and challenges. ML is increasingly being used to accelerate simulations and optimize complex systems, while HPC techniques are embedded within ML architectures. We collaborate closely with Intel to develop innovative solutions that bridge the gap between ML and HPC.

Physics-Enabled and Physics-Informed Workloads

ML plays a crucial role in extracting insights and accelerating simulations in physics-enabled workloads. By utilizing ML techniques, we can optimize car geometries in Formula One to reduce drag and improve performance. Additionally, ML can tackle physics-informed problems, where partial differential equations are embedded within neural networks to produce solutions. These advancements create new possibilities for solving complex problems efficiently and accurately.
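
To show what "embedding an equation in the network" looks like in practice, here is a toy physics-informed loss for a simple 1-D equation, du/dx = -u with u(0) = 1. The equation and the tiny network are illustrative only and are not the Formula One CFD setup.

```python
# Toy sketch of a physics-informed loss: the residual of du/dx = -u is penalized
# together with the boundary condition u(0) = 1. Purely illustrative.
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

x = torch.rand(128, 1, requires_grad=True)          # collocation points in [0, 1]
u = net(x)
du_dx = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]

physics_residual = ((du_dx + u) ** 2).mean()              # enforce du/dx = -u
boundary = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()   # enforce u(0) = 1
loss = physics_residual + boundary
loss.backward()
```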

Collaborating with Intel for HPC+ML Solutions

The collaboration between AWS and Intel extends beyond ML inference to encompass HPC and ML workloads. By leveraging Intel CPU-based C6 instances, we enable customers to perform simulations and optimizations at scale. Our partnership with Intel's HPC team allows us to address the unique challenges of HPC+ML workloads and develop solutions that meet the needs of our customers.

🏎️ Case Study: Optimizing Car Geometry in Formula One

To illustrate the power of ML and HPC convergence, let's take a look at a case study involving the optimization of car geometry in Formula One. The goal is to maximize car performance by reducing drag on the following car. By utilizing ML techniques and embedding partial differential equations within deep learning models, we can propose changes to the lead car's geometry. This process allows us to optimize car designs and achieve better race performance. Our collaboration with Intel has been invaluable in developing efficient and accurate solutions for Formula One and other similar use cases.

📝 Summary and Conclusion

In conclusion, the world of ML is diverse and ever-evolving. Scaling ML workloads, optimizing ML inference, and bridging the gap between ML and HPC are critical challenges that we aim to address with the help of our partnership with Intel. By developing robust architectures, utilizing the right strategies, and leveraging the power of Intel's technology, we are able to provide our customers with scalable, cost-effective, and high-performance ML solutions.

With Intel's support, we are continuously innovating and pushing the boundaries of what is possible in the world of ML. By collaborating closely with our customers and partners, we strive to deliver meaningful and impactful solutions that drive progress in the field of machine learning.

Thank you for joining me today, and I hope you found this presentation insightful.
