3 Reasons to Start Using Kubernetes in Machine Learning and MLOps

Mateusz Kwaśniak
5 min read · Jan 19, 2023

While cloud infrastructure and Kubernetes are already well known and widely used in software engineering teams and companies, the same is not necessarily true in machine learning projects.

Very often, when designing a machine learning system or platform, architects and engineers debate whether Kubernetes is a good choice.

In this post I will introduce Kubernetes briefly and describe, based on my own experience, how machine learning projects can leverage it.

Photo by Growtika Developer Marketing Agency on Unsplash

What is Kubernetes?

For the sake of simplicity, I will not explain Kubernetes in depth here. I assume you are reading this because you are already at least somewhat familiar with the general concept, which should be more or less enough to grasp the ideas in the following sections.

If you are completely new to Kubernetes or even containerization topics, I recommend starting with the basics well described in The Illustrated Children’s Guide to Kubernetes.

Kubernetes for Machine Learning

If you are familiar with K8s features such as scalability and deployments, and with microservices in general, you may already have an idea of how they could be used in a machine learning context.

Let me introduce some ideas.

Figure 1. Kubernetes Batch Jobs (source: https://hevodata.com/learn/kubernetes-batch-job/)

Workloads

The title of this section may sound a bit abstract, so let me explain right away. By “workloads” I mean any kind of one-off or recurring job you may want to run:

  • Batch (data) processing jobs,
  • Machine learning training pipelines,
  • Single, computationally expensive tasks, e.g. a training script wrapped in a Docker container.

Whatever it is, you can run it on Kubernetes.

All you need to do is wrap it in a (Docker) image.

You can execute a job as a simple container (read more: Pod), as a recurring cron job (read more: CronJob), or as a multi-step (multi-container) pipeline, whatever your use case requires.
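To make the “wrap it in an image and run it” idea concrete, here is a minimal sketch of a Kubernetes batch/v1 Job manifest, built as a plain Python dict. The image name and training command are hypothetical placeholders; you could submit the equivalent YAML to a cluster with `kubectl apply -f`.

```python
import json


def training_job_manifest(name: str, image: str, command: list[str]) -> dict:
    """Build a minimal Kubernetes batch/v1 Job manifest as a plain dict.

    `restartPolicy: Never` plus a small `backoffLimit` makes the job fail
    fast instead of restarting a broken training script forever.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Placeholder image and entrypoint -- substitute your own training container.
manifest = training_job_manifest(
    "train-model", "registry.example.com/train:latest", ["python", "train.py"]
)
print(json.dumps(manifest, indent=2))
```

The same structure, serialized to YAML, is exactly what a CronJob wraps in its `jobTemplate` when you need the run to recur on a schedule.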

Available tools

Furthermore, you do not have to work at such a low level: you can use one of the many machine learning (or MLOps) tools instead.

Airflow, for instance, has its KubernetesPodOperator, which allows you to execute tasks on top of a cluster. Kubeflow Pipelines, on the other hand, is a K8s-native framework that lets you use cluster nodes for batch jobs and training pipelines.
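As a rough illustration of the Airflow route, here is a sketch of a DAG that delegates a containerized training task to the cluster via KubernetesPodOperator. The image, namespace and schedule are hypothetical placeholders, and the import path can differ slightly between versions of the `apache-airflow-providers-cncf-kubernetes` provider, so treat this as a sketch rather than copy-paste code.

```python
from datetime import datetime

from airflow import DAG
# Older provider versions expose this operator under
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="train_model_on_k8s",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # `schedule_interval` on Airflow older than 2.4
    catchup=False,
) as dag:
    train = KubernetesPodOperator(
        task_id="train",
        name="train-pod",
        namespace="ml-jobs",  # hypothetical namespace for ML workloads
        image="registry.example.com/train:latest",
        cmds=["python", "train.py"],
        get_logs=True,  # stream container logs back into the Airflow UI
    )
```

The nice part of this setup is that Airflow itself does not need to run inside the cluster; the operator only needs credentials to talk to the Kubernetes API.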

Figure 2. Architecture behind KServe framework (source: kubeflow.org)

Model Deployment

Another popular use case for K8s in a machine learning context is the deployment of models or model services for inference. Depending on your project and your client’s requirements, you might have deployed models as, e.g.:

  • an AWS Lambda function exposed to the user,
  • a Flask or FastAPI inference service.

Another option is to deploy and expose the application through Kubernetes. In fact, it does not differ much from implementing the Flask/FastAPI endpoint I mentioned earlier.

As I already said, you can deploy and run anything wrapped in a container; this can just as well be a FastAPI endpoint.

Available tools

Once again, let me list some of the most popular choices for model deployment/serving if you choose to do it on a K8s cluster.

KServe, Seldon or Bento, to name a few, all allow you to deploy your machine learning models (or even ensembles and pipelines) as Services.

They all offer a different set of features, supported frameworks and extras (e.g. built-in drift detection) so you may want to check out the documentation and decide which one suits your needs.
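To give a feel for how little boilerplate these frameworks require, here is a minimal KServe InferenceService, sketched as a Python dict (the model name and storage URI are hypothetical placeholders; check the KServe documentation for the predictor spec matching your framework). Applying the equivalent YAML is essentially all it takes to get a served, autoscaled model.

```python
import json

# Minimal KServe InferenceService for a scikit-learn model. KServe pulls
# the serialized model from storageUri and wraps it in a serving runtime.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-demo"},
    "spec": {
        "predictor": {
            "sklearn": {
                "storageUri": "s3://models/sklearn-demo",  # placeholder bucket
            }
        }
    },
}
print(json.dumps(inference_service, indent=2))
```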

More than that, while we are on the topic of model deployment: Kubernetes also has great support for monitoring such services. The immortal Prometheus and Grafana stack can be easily integrated with these tools to enable advanced monitoring of model and service metrics.
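On the application side, exposing such metrics is usually a matter of a few lines with the `prometheus_client` library; a cluster-side Prometheus then scrapes them from the service. A minimal sketch (the metric names are illustrative, and the scoring function is a dummy stand-in):

```python
from prometheus_client import (
    CollectorRegistry,
    Counter,
    Histogram,
    generate_latest,
)

registry = CollectorRegistry()
PREDICTIONS = Counter(
    "model_predictions_total",
    "Number of predictions served",
    registry=registry,
)
LATENCY = Histogram(
    "model_prediction_latency_seconds",
    "Prediction latency in seconds",
    registry=registry,
)


def serve_prediction(features: list[float]) -> float:
    with LATENCY.time():      # record how long scoring takes
        PREDICTIONS.inc()     # count every request served
        return sum(features)  # dummy stand-in for model.predict(...)


serve_prediction([1.0, 2.0])
# This is the text format a Prometheus scrape of /metrics would receive.
metrics_text = generate_latest(registry).decode()
```

Grafana dashboards on top of these series then give you request rates, latency percentiles and, with a bit more instrumentation, model-quality signals.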

End-to-end ML Platforms

Finally, there is a set of platforms (complex applications) that support the end-to-end lifecycle of machine learning projects.

What does that mean?

These “platforms” are usually quite advanced web applications with a backend, UI and database, and they provide a set of components such as:

  • Notebook Server,
  • Pipelines framework and UI,
  • Metadata store,
  • Experiment tracking,
  • Model deployment & monitoring.

Such a platform is usually a good choice if you need most (if not all) of the components from the list; it is much more convenient to maintain them within one platform than as a set of separate applications and frameworks.

Figure 3. Architecture of Kubeflow platform (source: kubeflow.org)

Naturally, a drawback of this option is that these applications require more maintenance than smaller frameworks. Whenever making such design decisions, always think of your team’s skills and capabilities.

Available tools

Probably the most popular choice out there is Kubeflow, an open-source platform that supports notebook sessions, pipeline and job execution, model serving, storage and much more.

Apart from that, you can also take a look at Flyte. It is a bit younger and arguably more modern than Kubeflow, but it may lack some ML-specific functionality, so read its list of features carefully.
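For a taste of what working with such a platform looks like in code, here is a sketch of a Kubeflow Pipelines (kfp v2 SDK) pipeline with a single training step. The component body and parameter are hypothetical; each decorated component runs in its own container on the cluster once the compiled pipeline is submitted.

```python
from kfp import dsl


@dsl.component
def train(learning_rate: float) -> float:
    # Placeholder training step; in a real pipeline this body is packaged
    # into a container image and executed on a cluster node.
    print(f"training with lr={learning_rate}")
    return 0.0


@dsl.pipeline(name="training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train(learning_rate=learning_rate)


# Compiling produces a YAML spec you can upload via the Kubeflow UI or API:
#   from kfp import compiler
#   compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```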

When not to introduce Kubernetes?

Naturally, Kubernetes is not a silver bullet; while it gives you many capabilities, it also has its drawbacks and costs. The most obvious and most important ones are:

  • maintenance costs,
  • the engineers and skills required to provision, deploy and maintain it.

In other words, moving your work and deployments to Kubernetes may not be a good idea unless the required skills are already in place in your team.

While most of these tools and platforms abstract away the low-level Kubernetes operations, you still need to provision and maintain a cluster. Deploying new applications, updating them and debugging them requires significant knowledge of cluster management.

Conclusion

As you can see, Kubernetes can do a great job of supporting your day-to-day work in machine learning projects.

From notebooks, through pipelines and batch jobs, to model serving and monitoring, you will find plenty of actively developed tools to help you move your workloads to a cluster.

Bear in mind that with the power of Kubernetes comes responsibility: the bigger the cluster (and the more pipelines or services you deploy), the more maintenance skill is required.


Mateusz Kwaśniak

Lead MLOps Engineer, ML Architect // I write about machine learning engineering, system design and platforms // linkedin.com/in/mtszkw