Streamline ML Development: Train and Deploy Models in Cloud and On-Prem
Table of Contents
- Introduction
- About Karthik Ramasamy
- User Research and Feedback
- Introduction to Notebook Based Machine Learning Development
- Kubeflow: ML Platform Based on Kubernetes
- Introducing Kubeflow and Fairing
- Integrating Kubeflow with Jupyter Notebooks
- Different Ways to Express Training Logic
- Executing Training Jobs
- Using Python Functions for Training Jobs
- Executing Complete Notebooks
- Packaging Dependencies with Docker
- Specifying Resources for Training Jobs
- Training XGBoost Models
- Distributed Training for XGBoost
- TensorFlow Distributed Training
- Prediction Endpoints
- Model Versioning and Managing Configurations
- Optimizing Cost with Auto-Scaling
- Getting Started with Kubeflow
- Conclusion
- Highlights
- Frequently Asked Questions
Introduction
In this article, we will explore the topic of notebook-based machine learning development using the Kubeflow platform. We will delve into the capabilities of Kubeflow for managing machine learning workloads on a Kubernetes cluster, with a specific focus on the Fairing project. Fairing aims to simplify the use of Kubeflow for data scientists, allowing them to seamlessly integrate their machine learning models with Jupyter notebooks and Python programming.
But first, let's learn more about Karthik Ramasamy, the ML engineer leading the data science experience for Kubeflow at Google.
About Karthik Ramasamy
Karthik Ramasamy is an ML engineer and the lead for the data science experience on the Kubeflow project at Google. Before joining Google, he worked on the ML platform team in Uber's self-driving division and led a team of data scientists on Uber's risk team. With a passion for making data scientists more productive, Karthik's expertise lies at the intersection of machine learning, Kubernetes, and cloud-native technologies.
User Research and Feedback
At the start of the session, the speaker emphasized the collaborative nature of the meet-up and encouraged attendees to provide feedback and suggestions for future topics. This commitment to user research and feedback ensures that Kubeflow and Fairing continue to meet the needs of data scientists and ML engineers in the community.
Introduction to Notebook Based Machine Learning Development
Notebook-based machine learning development has become increasingly popular among data scientists. Jupyter notebooks, with their support for multiple languages and easy-to-use interface, have become the de facto IDE for data scientists. With the rise of Kubeflow and its integration with Jupyter, data scientists can now seamlessly build and deploy machine learning models from the familiar notebook environment.
Kubeflow: ML Platform Based on Kubernetes
Kubeflow is an ML platform built on Kubernetes, a cloud-native technology that allows for scalable and portable deployment of applications. Just as Hadoop and YARN revolutionized the way production applications were deployed, Kubernetes is quickly becoming the de facto standard for deploying applications at scale. Kubeflow aims to make machine learning on Kubernetes simpler and more accessible, with Fairing focusing specifically on data scientists' needs.
Pros:
- Scalable and portable deployment of ML models.
- Integration with Kubernetes, which offers powerful orchestration capabilities.
- Simplified ML development and deployment for data scientists.
Cons:
- Steep learning curve for data scientists unfamiliar with Kubernetes.
- The ecosystem is still maturing; tooling for machine learning on Kubernetes was limited before Kubeflow.
Introducing Kubeflow and Fairing
Kubeflow provides a comprehensive suite of tools for machine learning on Kubernetes. Fairing is one of the key components of Kubeflow, designed to make machine learning on Kubernetes much simpler for data scientists. Fairing focuses on bridging the gap between the Kubernetes infrastructure and the notebook-based ML development environment, making it easier for data scientists to build and deploy models.
Pros:
- Simplifies the deployment of machine learning models on Kubernetes.
- Provides a seamless integration of Jupyter notebooks with Kubeflow.
- Enables data scientists to focus on their model development instead of managing infrastructure.
Cons:
- Limited documentation and learning resources for data scientists new to Kubeflow and Fairing.
- Requires some understanding of Kubernetes to leverage its full capabilities.
Integrating Kubeflow with Jupyter Notebooks
One of the key features of Fairing is its integration with Jupyter notebooks. Data scientists can specify their training logic using various methods, including Python files, Python functions, or complete notebooks. Fairing provides an easy way to execute these training jobs, abstracting away the complexities of packaging the code into Docker containers and running them on a remote Kubeflow cluster.
Pros:
- Seamless integration with Jupyter notebooks, a popular IDE for data scientists.
- Provides flexibility in expressing training logic using Python files, functions, or notebooks.
- Abstracts away the complexities of packaging code into Docker containers and deploying them.
Cons:
- Limited support for notebook languages other than Python.
- Requires some understanding of Docker and containerization concepts.
Different Ways to Express Training Logic
Fairing supports multiple ways to express training logic. Data scientists can choose between specifying their training logic as a Python file, a Python function, or a complete notebook. This flexibility allows data scientists to work with their preferred coding styles and easily integrate their existing code into Kubeflow for distributed learning.
Pros:
- Accommodates different coding styles and preferences.
- Enables seamless integration of existing code with Kubeflow.
- Allows for easy reuse of code and promotes code modularity.
Cons:
- Requires data scientists to adapt to the specific syntax and structure depending on the chosen method.
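To make this concrete, Fairing's lower-level fairing.config API lets you choose how the training logic is picked up before building and deploying. The following is a minimal sketch assuming the kubeflow-fairing package; the preprocessor names and parameters follow older releases and may differ in your version, and the registry is a placeholder:

```python
from kubeflow import fairing

# Choose how Fairing picks up the training logic:
#   'python'        - a standalone .py file
#   'function'      - a Python function defined in the notebook
#   'full_notebook' - the entire notebook, executed end to end
fairing.config.set_preprocessor('python',
                                executable='train.py',
                                input_files=['train.py'])

# Build a Docker image around the code and push it to a registry.
fairing.config.set_builder('docker',
                           registry='gcr.io/my-project',  # placeholder
                           base_image='python:3.7')

# Run the image as a Kubernetes job on the Kubeflow cluster.
fairing.config.set_deployer('job')

fairing.config.run()
```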
Executing Training Jobs
Fairing provides a straightforward API for executing training jobs. Data scientists can submit their training jobs using the TrainJob API, which accepts the training logic as a Python file, a Python function, or a Python class. Fairing then packages the code into a Docker image and executes it on a remote Kubeflow cluster.
Pros:
- Simple API for executing training jobs.
- Abstracts away the complexities of Docker image creation and cluster execution.
- Facilitates seamless scaling and management of training jobs.
Cons:
- Requires familiarity with the TrainJob API syntax and parameters.
- Limited visibility into the underlying infrastructure and execution details.
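For example, submitting a training script takes only a few lines. A minimal sketch, assuming the high-level TrainJob API from the kubeflow-fairing package (the registry and base image are placeholders):

```python
from kubeflow.fairing import TrainJob

# Package train.py into a Docker image, push it to the registry,
# and run it as a job on the remote Kubeflow cluster.
job = TrainJob(
    'train.py',                           # entry point: a Python file
    base_docker_image='python:3.7',       # image to build on top of
    docker_registry='gcr.io/my-project')  # placeholder registry
job.submit()
```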
Using Python Functions for Training Jobs
Data scientists can also use Python functions to express their training logic and execute training jobs. Fairing provides an API for submitting Python functions as training jobs, allowing data scientists to easily launch their training logic on the remote Kubeflow cluster.
Pros:
- Familiar syntax for data scientists using Python.
- Enables easy integration of existing Python code into Kubeflow.
- Provides flexibility in defining complex training logic.
Cons:
- Requires data scientists to adhere to the specific signature and structure of the Python function.
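Concretely, a plain Python function defined in the notebook can be handed to the same TrainJob API. A hedged sketch (the function body is a stand-in for real training logic):

```python
from kubeflow.fairing import TrainJob

def train():
    # Any self-contained logic works: Fairing captures the function,
    # ships it inside a Docker image, and invokes it on the cluster.
    import random
    data = [(x, 2 * x + random.random()) for x in range(100)]
    print('trained on', len(data), 'examples')

job = TrainJob(train,                     # entry point: a function
               base_docker_image='python:3.7',
               docker_registry='gcr.io/my-project')  # placeholder
job.submit()
```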
Executing Complete Notebooks
For data scientists who prefer to work entirely within Jupyter notebooks, Fairing supports the execution of entire notebooks on the remote Kubeflow cluster. This feature allows data scientists to capture and execute their complete model development process, including data preprocessing, model training, and model analysis, within a notebook environment.
Pros:
- Seamless integration of Jupyter notebooks with Kubeflow.
- Enables end-to-end model development and analysis within a single notebook.
- Simplifies the deployment process for data scientists working primarily within notebooks.
Cons:
- Limited support for models developed in notebook languages other than Python.
- Requires careful resource management within the notebook to avoid hitting cluster resource limits.
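With the lower-level config API, whole-notebook execution is just a preprocessor choice. A sketch assuming the full_notebook preprocessor found in older kubeflow-fairing releases (file and image names are placeholders):

```python
from kubeflow import fairing

# Execute this entire notebook, top to bottom, on the Kubeflow cluster.
fairing.config.set_preprocessor('full_notebook',
                                notebook_file='model_dev.ipynb')
fairing.config.set_builder('docker',
                           registry='gcr.io/my-project',
                           base_image='jupyter/scipy-notebook')
fairing.config.set_deployer('job')
fairing.config.run()
```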
Packaging Dependencies with Docker
One of the challenges in machine learning development is managing dependencies. Fairing solves this problem by allowing data scientists to specify their Python dependencies in a requirements.txt file or by providing a custom Docker image with the necessary dependencies pre-installed. Fairing then automatically builds the Docker image and uses it for executing the training job.
Pros:
- Simplifies dependency management for machine learning projects.
- Provides flexibility in specifying dependencies either through requirements.txt or custom Docker images.
- Streamlines the deployment process by packaging the code and dependencies into a Docker image.
Cons:
- Requires data scientists to manage and define their dependencies accurately.
- Limited support for managing system dependencies.
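In practice, you pin your Python dependencies in a requirements.txt and ship it with the entry point; Fairing installs them while building the image. A sketch under the same assumptions as the earlier TrainJob examples:

```python
from kubeflow.fairing import TrainJob

# requirements.txt (example contents):
#   xgboost==0.90
#   pandas==0.25.3
job = TrainJob(
    'train.py',
    input_files=['requirements.txt'],     # installed into the image
    base_docker_image='python:3.7',       # or a custom prebuilt image
    docker_registry='gcr.io/my-project')  # placeholder
job.submit()
```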
Specifying Resources for Training Jobs
Fairing allows data scientists to specify the resources needed for training jobs, including CPU and GPU resources. Data scientists can easily adjust the resource requirements based on the complexity and scale of their training job. Additionally, Fairing integrates with the auto-scaling capabilities of Kubernetes, ensuring efficient resource allocation and cost optimization.
Pros:
- Enables fine-grained control of resource allocation for training jobs.
- Facilitates efficient scaling and resource management based on job requirements.
- Optimizes cost by dynamically adjusting the resource allocation.
Cons:
- Requires data scientists to have a good understanding of their training job's resource requirements.
- Limited visibility into the Kubernetes auto-scaling process.
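Resource requests are attached by mutating the pod spec that Fairing generates. A hedged sketch using the get_resource_mutator helper shipped with kubeflow-fairing (parameter names may vary by version; memory is in GiB):

```python
from kubeflow.fairing import TrainJob
from kubeflow.fairing.kubernetes.utils import get_resource_mutator

# Request 4 CPUs and 8 GiB of memory for this training job.
job = TrainJob(
    'train.py',
    base_docker_image='python:3.7',
    docker_registry='gcr.io/my-project',  # placeholder
    pod_spec_mutators=[get_resource_mutator(cpu=4, memory=8)])
job.submit()
```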
Training XGBoost Models
Fairing provides support for training XGBoost models, a popular machine learning framework for tabular data. Data scientists can leverage Fairing's APIs to train XGBoost models on remote Kubeflow clusters, abstracting away the complexities of distributed training and enabling them to focus on their model development.
Pros:
- Simplifies the training of XGBoost models on remote clusters.
- Provides an easy and consistent API for training XGBoost models within a notebook environment.
- Supports parallel and distributed training for faster model training.
Cons:
- Limited support for machine learning frameworks other than XGBoost.
- Requires data scientists to familiarize themselves with the syntax and specific requirements of XGBoost training.
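As an end-to-end illustration, the same training function can run locally in the notebook and then, unchanged, on the cluster. A sketch with a toy dataset (the registry is a placeholder, and requirements.txt must pin xgboost and numpy):

```python
import numpy as np
import xgboost as xgb
from kubeflow.fairing import TrainJob

def train():
    # Toy regression data as a stand-in for a real dataset.
    X = np.random.rand(500, 10)
    y = X @ np.arange(10) + 0.1 * np.random.randn(500)
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({'max_depth': 4, 'objective': 'reg:squarederror'},
                        dtrain, num_boost_round=50)
    booster.save_model('model.bst')

train()  # iterate locally in the notebook first...

job = TrainJob(train,  # ...then submit the identical function to the cluster
               base_docker_image='python:3.7',
               input_files=['requirements.txt'],
               docker_registry='gcr.io/my-project')  # placeholder
job.submit()
```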
Distributed Training for XGBoost
Fairing extends its support for XGBoost by providing the ability to perform distributed training. Data scientists can leverage Fairing to distribute their XGBoost training across multiple nodes, allowing for parallelized and faster model training. Fairing handles the complexities of setting up the distributed training environment, providing a seamless experience for data scientists.
Pros:
- Enables parallelized and distributed training for improved performance.
- Abstracts away the complexities of setting up distributed training for XGBoost.
- Facilitates seamless scaling and resource management for data scientists.
Cons:
- Requires data scientists to have a good understanding of distributed training concepts and XGBoost configurations.
- Limited to distributed training using XGBoost and requires a specific environment setup.
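Under the hood, distributed XGBoost coordinates workers through the Rabit tracker, which the platform wires up across pods. The sketch below shows the general shape of what each worker runs; it is generic XGBoost (pre-1.6 Rabit API), not a Fairing-specific interface:

```python
import xgboost as xgb

def distributed_train(dtrain):
    # dtrain: an xgb.DMatrix over this worker's partition of the data.
    # Each worker joins the Rabit ring set up by the platform; gradient
    # statistics are then synchronized automatically during training.
    xgb.rabit.init()
    try:
        booster = xgb.train(
            {'max_depth': 4, 'objective': 'reg:squarederror'},
            dtrain, num_boost_round=50)
        if xgb.rabit.get_rank() == 0:  # only one worker saves the model
            booster.save_model('model.bst')
    finally:
        xgb.rabit.finalize()
```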
TensorFlow Distributed Training
In addition to XGBoost, Fairing also supports distributed training for TensorFlow, a popular deep learning framework. Data scientists can leverage Fairing to distribute their TensorFlow training jobs across multiple nodes, enabling faster training and improved model performance. Fairing abstracts away the complexities of TensorFlow's distributed training setup, making it easier for data scientists to take advantage of distributed learning capabilities.
Pros:
- Facilitates distributed training for TensorFlow models.
- Simplifies the setup and execution of TensorFlow distributed training jobs.
- Provides seamless integration with Jupyter notebooks, enabling end-to-end deep learning workflows.
Cons:
- Requires data scientists to have a good understanding of distributed training concepts and TensorFlow's APIs.
- Limited to distributed training using TensorFlow and may not support other deep learning frameworks.
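With the config API, moving from a single-pod job to distributed TensorFlow is mostly a deployer change: the tfjob deployer launches the code as a Kubeflow TFJob. A hedged sketch (deployer parameters follow older kubeflow-fairing releases; the training function is a stub):

```python
from kubeflow import fairing
import tensorflow as tf

def train_fn():
    # The TFJob controller injects TF_CONFIG into every pod, so
    # tf.distribute strategies / Estimators can discover their peers.
    print('TensorFlow', tf.__version__)

fairing.config.set_preprocessor('function', function_obj=train_fn)
fairing.config.set_builder('docker',
                           registry='gcr.io/my-project',  # placeholder
                           base_image='tensorflow/tensorflow:1.14.0')
# Launch as a TFJob with two workers and one parameter server.
fairing.config.set_deployer('tfjob', worker_count=2, ps_count=1)
fairing.config.run()
```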
Prediction Endpoints
Fairing goes beyond training jobs and provides support for deploying prediction endpoints. Data scientists can use Fairing to package their trained models into Docker containers and deploy them as prediction endpoints on remote Kubeflow clusters. This allows data scientists to expose their models as services, making it easier to serve predictions and incorporate the models into production workflows.
Pros:
- Streamlines the deployment of trained models as prediction endpoints.
- Abstracts away the complexities of packaging models into Docker containers and exposing them as services.
- Enables data scientists to easily integrate their models into production workflows.
Cons:
- Requires data scientists to manage the infrastructure and expose the prediction endpoints themselves.
- Limited to deploying prediction endpoints using Fairing's APIs and may not support other deployment strategies.
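Serving follows the same pattern as training: wrap the model in a small class with a predict method and hand it to PredictionEndpoint. A sketch modeled on the kubeflow-fairing XGBoost example (the image, registry, and file names are placeholders):

```python
import numpy as np
import xgboost as xgb
from kubeflow.fairing import PredictionEndpoint

class ModelServe:
    def __init__(self):
        # model.bst is shipped into the image via input_files below.
        self.model = xgb.Booster(model_file='model.bst')

    def predict(self, X, feature_names=None):
        return self.model.predict(xgb.DMatrix(np.asarray(X)))

endpoint = PredictionEndpoint(
    ModelServe,
    base_docker_image='python:3.7',
    docker_registry='gcr.io/my-project',           # placeholder
    input_files=['model.bst', 'requirements.txt'])
endpoint.create()
# result = endpoint.predict_nparray(test_X)  # query the deployed service
```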
Model Versioning and Managing Configurations
When it comes to managing models and configurations, Fairing provides some level of support through its integration with Kubeflow Metadata. Data scientists can capture and track metadata about their training jobs, including versioning information, model performance metrics, and configurations. However, Fairing currently offers only a basic solution for model versioning, and work is ongoing to improve this aspect of the platform.
Pros:
- Allows data scientists to capture and track metadata about their training jobs and models.
- Provides basic support for model versioning and configuration management through Kubeflow Metadata.
- Facilitates collaboration and knowledge sharing among data scientists working on similar projects.
Cons:
- Limited functionality and ease of use for model versioning and configuration management.
- Requires data scientists to manually track and manage metadata through Kubeflow Metadata.
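For reference, logging a trained model to Kubeflow Metadata looks roughly like the following. This is a loose sketch of the kubeflow.metadata SDK from the Kubeflow 0.x era; the class names, parameters, and gRPC address are assumptions that may not match your installed version:

```python
from kubeflow.metadata import metadata

# Connect to the metadata gRPC service inside the cluster (assumed address).
store = metadata.Store(grpc_host='metadata-grpc-service.kubeflow',
                       grpc_port=8080)
ws = metadata.Workspace(store=store, name='fraud-detection')
run = metadata.Run(workspace=ws, name='run-2019-10-01')
execution = metadata.Execution(name='training', workspace=ws, run=run)

# Record the trained model with an explicit version string.
execution.log_output(metadata.Model(
    name='xgb-fraud',
    uri='gs://my-bucket/model.bst',  # placeholder artifact location
    version='v0.3',
    model_type='xgboost'))
```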
Optimizing Cost with Auto-Scaling
One of the key features of Kubernetes that Fairing leverages is auto-scaling. By utilizing Kubernetes' auto-scaling capabilities, Fairing optimizes cost by starting and stopping resources as needed. This allows data scientists to allocate resources based on their specific training job requirements and scale up or down dynamically, reducing unnecessary costs.
Pros:
- Balances resource allocation based on training job requirements and cost optimization.
- Provides flexibility in resource management through Kubernetes' auto-scaling capabilities.
- Enables data scientists to use resources effectively and avoid unnecessary costs.
Cons:
- Requires data scientists to have a good understanding of their training job's resource requirements and Kubernetes' auto-scaling settings.
- Limited visibility and control over the auto-scaling process.
Getting Started with Kubeflow
To get started with Kubeflow, visit the Kubeflow website, which provides extensive documentation and resources. The site offers step-by-step guides for deploying Kubeflow on various platforms, including GCP, AWS, Microsoft Azure, and on-premises clusters. Once the deployment is complete, you can access the Kubeflow dashboard, a user-friendly interface for managing your Kubeflow cluster, notebooks, and pipelines.
Conclusion
In this article, we explored the topic of notebook-based machine learning development using Kubeflow and Fairing. We discussed the various features and capabilities of Fairing, including its seamless integration with Jupyter notebooks, support for different ways to express training logic, packaging dependencies with Docker, and resource management. We also touched upon the challenges and limitations of the platform, such as the need for familiarity with Kubernetes and specific framework requirements. Overall, Fairing provides a powerful toolset for data scientists to develop, train, deploy, and manage their machine learning models in a notebook-based environment. By leveraging Kubeflow and Fairing, data scientists can streamline their workflows, optimize resource allocation, and focus on what they do best - building and deploying models.
Highlights
- Fairing is a key component of Kubeflow, providing a seamless integration of Jupyter notebooks with the Kubeflow platform.
- Data scientists can express their training logic using Python files, Python functions, or complete notebooks.
- Fairing simplifies the packaging and deployment of machine learning models with Docker containers.
- Resource allocation for training jobs can be optimized using the auto-scaling capabilities of Kubernetes.
- Kubeflow Metadata provides limited support for managing model versions and configurations.
- Fairing supports XGBoost and TensorFlow for distributed training, enabling data scientists to take advantage of parallelized and faster training.
- Prediction endpoints can be deployed using Fairing, allowing data scientists to serve their trained models as services.
Frequently Asked Questions
Q: Can I use Fairing for distributed training with frameworks other than XGBoost or TensorFlow?
A: Fairing is designed to be extensible, and support for other frameworks is actively being developed. The current version of Fairing focuses on XGBoost and TensorFlow due to their popularity and wide adoption in the machine learning community. However, the goal is to provide support for more frameworks in the future.
Q: Can Fairing track and monitor the performance of training jobs during execution?
A: Fairing itself does not provide real-time monitoring of training job performance. However, you can integrate Fairing with tools like TensorBoard to track and visualize various performance metrics during training.
Q: Can Fairing be used for hyperparameter tuning?
A: Fairing itself does not provide built-in support for hyperparameter tuning. However, you can integrate Fairing with tools like Katib or Ray Tune to perform hyperparameter optimization and automate the tuning process.
Q: Is it possible to debug and troubleshoot training jobs using Fairing?
A: Fairing provides basic logging and error reporting capabilities. However, for more in-depth debugging and troubleshooting, it is recommended to leverage the logging and debugging features of the underlying Kubernetes cluster.
Q: Can I use Fairing to deploy trained models on different environments, such as cloud services or on-premises?
A: Yes, Fairing supports deployment on various environments, including cloud services like GCP, AWS, and on-premises setups. Fairing abstracts away the deployment details and provides a unified API for deploying trained models, making it easier to switch between different environments.