Deploying production Machine Learning pipelines to Kubernetes with Argo
In some sense, most machine learning projects look exactly the same. There are four stages to be concerned with, no matter what the project is:
- Sourcing the data
- Transforming it
- Building the model
- Deploying it
It’s often said that #1 and #2 take up most of an ML engineer’s time, which underscores how little time the most fun part, #3, sometimes seems to get.
In the real world, though, #4 can over time take almost as much effort as the previous three combined.
Deployed models sometimes need to be rebuilt, and they consume data that has to keep flowing through #1 and #2. It rarely looks like the classroom setting, where datasets fit comfortably in memory and model training takes at most a couple of hours on an old laptop.
Working with gigantic datasets isn’t the only problem. Data pipelines can take long hours to complete. What if some part of your infrastructure has unexpected downtime? Do you start all over again from the very beginning?
Many solutions exist, of course. In this article, I’d like to go over this problem space and present an approach that feels clean and pleasant to work with.
Project description
End Point Corporation was founded in 1995. That’s 24 years! About 9 years later, the oldest article on the company’s blog was published. Since then, a staggering 1435 unique articles have appeared. That’s a lot of words, and something we can definitely use in a smart way.
For the purpose of having fun with building a production-grade data pipeline, let’s imagine the following project:
- A doc2vec model trained on the corpus of End Point’s blog articles
- Use of the paragraph vectors for each article to find the 10 other, most similar articles
I previously blogged about using matrix factorization as a simple, collaborative-filtering style of recommender system. Today’s doc2vec-based model can be thought of as an example of content-based filtering. The business value would be potentially increased blog traffic, with users staying longer on the website.
Scalable pipelines
The data pipeline problem has certainly found some really great solutions. The Hadoop project brought in HDFS, a distributed file system for huge data artifacts, and its MapReduce component plays a vital role in distributed data processing.
Then the fantastic Spark project came along. Its architecture keeps data in memory by default, with explicit caching of data to disk. The project claims to run some workloads up to 100 times faster than Hadoop.
Both projects, though, require the developer to use a very specific set of libraries. It’s not easy, for example, to distribute spaCy training and inference on Spark.
Containers
On the other side of the spectrum, there’s Dask. It’s a Python package that mirrors the APIs of NumPy, Pandas, and Scikit-Learn, enabling developers to work with huge piles of data just as they would with smaller datasets. The data is partitioned and distributed among the cluster nodes. Dask can work with groups of processes as well as clusters of containers. The APIs of the above-mentioned projects are (mostly) preserved, while all the processing suddenly becomes distributed.
Some teams like to use Dask along with Luigi and build production pipelines around Docker or Kubernetes.
In this article, I’d like to present another Dask-friendly solution: Kubernetes-native workflows using Argo. What’s great about it compared to Luigi is that you don’t need to care about having a particular version of Python and Luigi installed to orchestrate the pipeline. All you need is a Kubernetes cluster with Argo installed on it.
Hands-on work on the project
The first thing to do when developing this project is to get access to a Kubernetes cluster. For development, you can set up a one-node cluster using either one of:
- MicroK8s
- Minikube
I love them both. The first is developed by Canonical, while the second comes from the Kubernetes team itself.
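Spinning either one up takes just a couple of commands. For illustration only (the exact commands and available add-ons vary between versions, so follow each project’s documentation):
# MicroK8s (Ubuntu)
$ sudo snap install microk8s --classic
$ microk8s.enable dns storage
# Minikube
$ minikube start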
This isn’t going to be a step-by-step tutorial on using Kubernetes. I encourage you to read the documentation, or possibly seek out a good online course, if you’re new to it. Read on even in that case, though: nothing here is overly complex.
Next, you’ll need Argo Workflows. The installation is really easy; the full yet simple documentation can be found here.
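For reference, installing it boils down to creating a namespace, applying the install manifest of the release you picked, and installing the argo CLI. The manifest URL changes between Argo versions, so treat this as a sketch and copy the exact URL from the documentation:
$ kubectl create namespace argo
$ kubectl apply -n argo -f <install.yaml of the Argo release you picked>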
The project structure
Here’s what the project looks like in the end:
.
├── Makefile
├── notebooks
│   └── scratch.ipynb
├── notebooks.yml
├── pipeline.yaml
└── tasks
    ├── base
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── build_model
    │   ├── Dockerfile
    │   └── run.py
    ├── clone_repo
    │   ├── Dockerfile
    │   └── run.sh
    ├── infer
    │   ├── Dockerfile
    │   └── run.py
    ├── notebooks
    │   └── Dockerfile
    └── preprocess
        ├── Dockerfile
        └── run.py
The main parts are as follows:
- Makefile provides easy-to-use helpers for building images, pushing them to the Docker registry, and running the Argo workflow
- notebooks.yml defines a Kubernetes service and deployment for the exploratory Jupyter Lab instance
- notebooks contains the individual Jupyter notebooks
- pipeline.yaml defines our machine learning pipeline in the form of an Argo workflow
- tasks contains the workflow steps as containers, along with their Dockerfiles
- tasks/base defines the base Docker image for the other tasks
- tasks/**/run.(py|sh) is the single entry point for a given pipeline step
The idea is to minimize the boilerplate while retaining the features offered e.g. by Luigi.
Makefile
SHELL := /bin/bash
VERSION?=latest
TASK_IMAGES:=$(shell find tasks -name Dockerfile -printf '%h ')
REGISTRY=base:5000

tasks/%: FORCE
	set -e ;\
	docker build -t blog_pipeline_$(@F):$(VERSION) $@ ;\
	docker tag blog_pipeline_$(@F):$(VERSION) $(REGISTRY)/blog_pipeline_$(@F):$(VERSION) ;\
	docker push $(REGISTRY)/blog_pipeline_$(@F):$(VERSION)

images: $(TASK_IMAGES)

run: images
	argo submit pipeline.yaml --watch

start_notebooks:
	kubectl apply -f notebooks.yml

stop_notebooks:
	kubectl delete deployment jupyter-notebook

FORCE: ;
When you invoke make run, it first needs to resolve the images dependency. This, in turn, resolves each of the tasks/* targets too. Notice how the TASK_IMAGES variable is constructed: it uses make’s shell function to call the Unix find command, which lists the subdirectories of tasks that contain a Dockerfile. Here’s what the output would be if you ran it directly:
$ find tasks -name Dockerfile -printf '%h '
tasks/notebooks tasks/base tasks/preprocess tasks/infer tasks/build_model tasks/clone_repo
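Because tasks/% is a pattern rule, you can also rebuild and push a single task image without touching the others, for example:
$ make tasks/notebooks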
Setting up Jupyter Notebooks as a scratch pad and for EDA
Let’s start off by defining our base Docker image:
FROM python:3.7
COPY requirements.txt /requirements.txt
RUN pip install -r /requirements.txt
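The requirements.txt itself isn’t listed here. Judging by the libraries the pipeline steps import (pandas, Gensim, PyYAML) and the Parquet output format, it presumably contains at least something along these lines:
pandas
pyarrow
gensim
pyyaml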
Following is the Dockerfile that extends it and adds Jupyter Lab:
FROM endpoint-blog-pipeline/base:latest
RUN pip install jupyterlab
RUN mkdir ~/.jupyter
RUN echo "c.NotebookApp.token = ''" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.password = ''" >> ~/.jupyter/jupyter_notebook_config.py
RUN mkdir /notebooks
WORKDIR /notebooks
The last step is to add the Kubernetes service and deployment definitions in notebooks.yml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter-notebook
  labels:
    app: jupyter-notebook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter-notebook
  template:
    metadata:
      labels:
        app: jupyter-notebook
    spec:
      containers:
      - name: minimal-notebook
        image: base:5000/blog_pipeline_notebooks
        ports:
        - containerPort: 8888
        command: ["/usr/local/bin/jupyter"]
        args: ["lab", "--allow-root", "--port", "8888", "--ip", "0.0.0.0"]
---
kind: Service
apiVersion: v1
metadata:
  name: jupyter-notebook
spec:
  type: NodePort
  selector:
    app: jupyter-notebook
  ports:
    - protocol: TCP
      nodePort: 30040
      port: 8888
      targetPort: 8888
This can be run using our Makefile with make start_notebooks, or directly with:
$ kubectl apply -f notebooks.yml
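With the NodePort service in place, the Jupyter Lab instance is reachable on port 30040 of the cluster node. On a one-node development cluster you can look up the node’s address and open the notebook in a browser:
$ kubectl get nodes -o wide          # note the node's INTERNAL-IP
$ kubectl get svc jupyter-notebook   # confirm the 8888:30040 port mapping
Then browse to http://<node-ip>:30040.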
Exploration
The notebook itself feels more like a scratch pad than a proper exploratory data analysis. It’s very informal and doesn’t include much exploration or visualization; in more real-world code you likely wouldn’t omit those.
I used it to make sure the model would work at all. I was then able to grab portions of the code and paste them directly into the step definitions.
Implementation
Step 1: Source blog articles
The blog’s articles are stored on GitHub in Markdown files.
Our first pipeline task will need to either clone the repo, or pull the latest changes if it’s already present in the pipeline’s shared volume.
We’ll use a Kubernetes hostPath volume as the cross-step storage. What’s nice about it is that it’s easy to peek into the volume during development to see whether the data artifacts are being generated correctly.
In our example here, I’m hardcoding the path on my local system:
# ...
volumes:
  - name: endpoint-blog-src
    hostPath:
      path: /home/kamil/data/endpoint-blog-src
      type: Directory
# ...
This is one of the downsides of hostPath: it only accepts absolute paths. It will do just fine for now, though.
In pipeline.yaml we define the task’s container with:
# ...
templates:
  - name: clone-repo
    container:
      image: base:5000/blog_pipeline_clone_repo
      command: [bash]
      args: ["/run.sh"]
      volumeMounts:
        - mountPath: /data
          name: endpoint-blog-src
# ...
The full pipeline forms a tree, which is conveniently expressed as a directed acyclic graph within Argo. Here’s the definition of the whole pipeline (some of these steps haven’t been shown yet):
# ...
- name: article-vectors
  dag:
    tasks:
      - name: src
        template: clone-repo
      - name: dataframe
        template: preprocess
        dependencies: [src]
      - name: model
        template: build-model
        dependencies: [dataframe]
      - name: infer
        template: infer
        dependencies: [model]
# ...
Notice how the dependencies field makes it easy to tell Argo in what order to execute the tasks. Argo steps can also define inputs and outputs, just like Luigi. For this simple example, I decided to omit them and instead rely on the convention that each step expects its data artifacts at a certain location in the mounted volume. If you’re curious about other Argo features, here is its documentation.
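For completeness, here is roughly how the fragments above fit together in pipeline.yaml. The surrounding boilerplate is my reconstruction based on the standard Argo Workflow layout and the generated pod names shown later, so treat it as a sketch:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: endpoint-blog-pipeline-
spec:
  entrypoint: article-vectors
  volumes:
    - name: endpoint-blog-src
      hostPath:
        path: /home/kamil/data/endpoint-blog-src
        type: Directory
  templates:
    - name: article-vectors
      dag:
        # ... the DAG tasks shown above
    - name: clone-repo
      container:
        # ... as shown above
    # preprocess, build-model, and infer follow the same container pattern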
The entry point script for the task is pretty simple:
#!/bin/bash
cd /data
if [ -d ./blog ]
then
    cd blog
    git pull origin master
else
    git clone https://github.com/EndPointCorp/end-point-blog.git blog
fi
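The Dockerfile for this task isn’t shown in the article; it’s presumably just a thin layer that copies the script into the image. The base image reference below is an assumption that mirrors the notebooks Dockerfile:
FROM endpoint-blog-pipeline/base:latest
COPY run.sh /run.sh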
Step 2: Data wrangling
At this point, we have the blog articles as Markdown source files. To be able to run them through any kind of machine learning modeling, we need to load them into a data frame. We’ll also need to clean the text a bit. Here is the reasoning behind the cleanup routine:
- I want the relations between articles to ignore the code snippets, so that articles aren’t grouped by programming language or library merely because of the code keywords they contain
- I also want the metadata about tags and authors to be omitted, as I don’t want to see e.g. only my own articles listed as similar to my other ones
The full source for the run.py of the “preprocess” task can be viewed here.
Notice that unlike make or Luigi, Argo workflows will run the same task fully again even when the step’s artifact has already been created. I like this flexibility: after all, it’s extremely easy to skip the processing in the Python or shell script if the artifact already exists.
At the end of this step, the data frame is written out as an Apache Parquet file.
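The preprocessing script isn’t reproduced here, but based on the conventions described above (Markdown sources in /data/blog, a data frame written to /data/articles.parquet, code snippets and metadata stripped out), a minimal sketch could look roughly like the following. The glob pattern and cleanup regexes are assumptions:
# Illustrative sketch of the "preprocess" step; paths follow the article's
# conventions, while the glob pattern and cleanup rules are assumptions.
import re
from pathlib import Path

import pandas as pd

BLOG_DIR = Path('/data/blog')
OUTPUT = Path('/data/articles.parquet')

def clean(text):
    # Drop fenced code blocks so articles don't cluster by code keywords.
    text = re.sub(r'```.*?```', ' ', text, flags=re.DOTALL)
    # Drop a leading metadata block (tags, authors) if one is present.
    text = re.sub(r'\A---.*?---', ' ', text, flags=re.DOTALL)
    return text

def preprocess():
    rows = []
    for path in sorted(BLOG_DIR.glob('*/*/*/*.html.md')):
        rows.append({
            'file': str(path.relative_to(BLOG_DIR)),
            'text': clean(path.read_text(errors='ignore')),
        })
    pd.DataFrame(rows).to_parquet(OUTPUT)

if __name__ == '__main__':
    if OUTPUT.is_file():
        print("Skipping as the data frame file already exists")
    else:
        preprocess()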
Step 3: Building the model
The model from the paper mentioned earlier has already been implemented in a variety of other projects. There are implementations for each major deep learning framework on GitHub. There’s also a pretty good one included in Gensim. Its documentation can be found here.
The run.py is pretty short and straightforward as well, which is one of the goals for this pipeline. In the end, it writes the trained model into the shared volume.
Notice that re-running the pipeline with the model already stored will not trigger the training again. This is what we want: imagine a new article being pushed into the repository. It’s very unlikely that retraining on it would affect the model’s performance in any significant way, yet we still need to predict the most similar documents for it. The model building step therefore short-circuits with:
if __name__ == '__main__':
    if os.path.isfile('/data/articles.model'):
        print("Skipping as the model file already exists")
    else:
        build_model()
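The build_model function itself isn’t listed in the article. Here’s a minimal sketch of what it might look like with Gensim; the tokenization helper and hyperparameters are assumptions, but tagging each document with its file path is what later makes model.docvecs.most_similar(path) work:
# Illustrative sketch of the model-building step; hyperparameters are assumptions.
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

def build_model():
    articles = pd.read_parquet('/data/articles.parquet')
    # Tag each document with its relative file path so it can be looked up later.
    corpus = [
        TaggedDocument(words=simple_preprocess(row['text']), tags=[row['file']])
        for _, row in articles.iterrows()
    ]
    model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    model.save('/data/articles.model')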
Step 4: Predict similar articles
The listing of the run.py isn’t overly long:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec
import yaml
from pathlib import Path
import os

def write_similar_for(path, model):
    similar_paths = model.docvecs.most_similar(path)
    yaml_path = (Path('/data/blog/') / path).parent / 'similar.yaml'
    with open(yaml_path, "w") as file:
        file.write(yaml.dump([p for p, _ in similar_paths]))
    print(f"Wrote similar paths to {yaml_path}")

def infer_similar():
    articles = pd.read_parquet('/data/articles.parquet')
    model = Doc2Vec.load('/data/articles.model')
    for tag in articles['file'].tolist():
        write_similar_for(tag, model)

if __name__ == '__main__':
    infer_similar()
The idea is to first load the saved Gensim model and the data frame with the articles. Then, for each article, the model is used to get the 10 most similar other articles.
As the step’s output, the listing of similar articles is placed in a similar.yaml file in each article’s subdirectory.
The blog’s Markdown → HTML compiler could then use this file to e.g. inject a “You might find these articles interesting too” section.
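Just to illustrate, a consumer of that file could be as small as the following; the URL scheme here is made up for the example:
# Purely illustrative: how a site generator might render a similar.yaml file.
import yaml

with open('similar.yaml') as f:
    similar = yaml.safe_load(f)

items = '\n'.join(f'<li><a href="/blog/{p[:-3]}">{p}</a></li>' for p in similar)
print('<h3>You might find these articles interesting too</h3>\n<ul>\n' + items + '\n</ul>')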
Results
The scratch notebook already includes example results of running this doc2vec model. Examples:
model.docvecs.most_similar('2019/01/09/liquid-galaxy-at-instituto-moreira-salles.html.md')
Giving the output of:
[('2016/04/22/liquid-galaxy-for-real-estate.html.md', 0.8872901201248169),
('2017/07/03/liquid-galaxy-at-2017-boma.html.md', 0.8766101598739624),
('2017/01/25/smartracs-liquid-galaxy-at-national.html.md',
0.8722846508026123),
('2016/01/04/liquid-galaxy-at-new-york-tech-meetup_4.html.md',
0.8693454265594482),
('2017/06/16/successful-first-geoint-symposium-for.html.md',
0.8679709434509277),
('2014/08/22/liquid-galaxy-for-daniel-island-school.html.md',
0.8659971356391907),
('2016/07/21/liquid-galaxy-featured-on-reef-builders.html.md',
0.8644022941589355),
('2017/11/17/president-of-the-un-general-assembly.html.md',
0.8620222806930542),
('2016/04/27/we-are-bigger-than-vr-gear-liquid-galaxy.html.md',
0.8613147139549255),
('2015/11/04/end-pointers-favorite-liquid-galaxy.html.md',
0.8601428270339966)]
Or the following:
model.docvecs.most_similar('2019/01/08/speech-recognition-with-tensorflow.html.md')
Giving:
[('2019/05/01/facial-recognition-amazon-deeplens.html.md', 0.8850516080856323),
('2017/05/30/recognizing-handwritten-digits-quick.html.md',
0.8535605072975159),
('2018/10/10/image-recognition-tools.html.md', 0.8495659232139587),
('2018/07/09/training-tesseract-models-from-scratch.html.md',
0.8377258777618408),
('2015/12/18/ros-has-become-pivotal-piece-of.html.md', 0.8344655632972717),
('2013/03/07/streaming-live-with-red5-media.html.md', 0.8181146383285522),
('2012/04/27/streaming-live-with-red5-media-server.html.md',
0.8142604827880859),
('2013/03/15/generating-pdf-documents-in-browser.html.md',
0.7829260230064392),
('2016/05/12/sketchfab-on-liquid-galaxy.html.md', 0.7779937386512756),
('2018/08/29/self-driving-toy-car-using-the-a3c-algorithm.html.md',
0.7659779787063599)]
Or
model.docvecs.most_similar('2016/06/03/adding-bash-completion-to-python-script.html.md')
With:
[('2014/03/12/provisioning-development-environment.html.md',
0.8298013806343079),
('2015/04/03/manage-python-script-options.html.md', 0.7975824475288391),
('2012/01/03/automating-removal-of-ssh-key-patterns.html.md',
0.7794561386108398),
('2014/03/14/provisioning-development-environment_14.html.md',
0.7763932943344116),
('2012/04/16/easy-creating-ramdisk-on-ubuntu.html.md', 0.7579266428947449),
('2016/03/03/loading-json-files-into-postgresql-95.html.md',
0.7410352230072021),
('2015/02/06/vim-plugin-spotlight-ctrlp.html.md', 0.7385793924331665),
('2017/10/27/hot-deploy-java-classes-and-assets-in.html.md',
0.7358890771865845),
('2012/03/21/puppet-custom-fact-ruby-plugin.html.md', 0.718029260635376),
('2012/01/14/using-disqus-and-rails.html.md', 0.716759443283081)]
To run the whole pipeline, all you need is:
$ make run
Or directly with:
$ argo submit pipeline.yaml --watch
Argo gives a nice-looking overview of all the steps:
Name: endpoint-blog-pipeline-49ls5
Namespace: default
ServiceAccount: default
Status: Succeeded
Created: Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Started: Wed Jun 26 13:27:51 +0200 (17 seconds ago)
Finished: Wed Jun 26 13:28:08 +0200 (now)
Duration: 17 seconds
STEP PODNAME DURATION MESSAGE
✔ endpoint-blog-pipeline-49ls5
├-✔ src endpoint-blog-pipeline-49ls5-3331170004 3s
├-✔ dataframe endpoint-blog-pipeline-49ls5-2286787535 3s
├-✔ model endpoint-blog-pipeline-49ls5-529475051 3s
└-✔ infer endpoint-blog-pipeline-49ls5-1778224726 6s
The resulting similar.yaml files look as follows:
$ ls ~/data/endpoint-blog-src/blog/2013/03/15/
generating-pdf-documents-in-browser.html.md similar.yaml
$ cat ~/data/endpoint-blog-src/blog/2013/03/15/similar.yaml
- 2016/03/17/creating-video-player-with-time-markers.html.md
- 2014/07/17/creating-symbol-web-font.html.md
- 2018/10/10/image-recognition-tools.html.md
- 2015/08/04/how-to-big-beautiful-background-video.html.md
- 2014/11/06/simplifying-mobile-development-with.html.md
- 2016/03/23/learning-from-data-basics-naive-bayes.html.md
- 2019/01/08/speech-recognition-with-tensorflow.html.md
- 2013/11/19/asynchronous-page-switches-with-django.html.md
- 2016/03/11/strict-typing-fun-example-free-monads.html.md
- 2018/07/09/training-tesseract-models-from-scratch.html.md
Although it’s difficult to quantify, those sets of “similar” documents do seem to be linked in many ways to their “anchor” articles. You’re invited to read them and see for yourself!
Closing words
The code presented here is hosted on GitHub. There’s lots of room for improvement, of course. Still, it shows a nice approach that could be used both for small model deployments (like the one above) and for very big ones.
Argo workflows can also be used in tandem with Kubernetes deployments. You could e.g. run distributed TensorFlow model training and then deploy the result on Kubernetes via TensorFlow Serving. If you’re more into PyTorch, distributing the training would be possible via Horovod. Have data scientists who use R? Deploy RStudio Server instead of JupyterLab, using an image from DockerHub, and run some or all of the tasks with a simpler, R-base-only image.
If you have any questions or projects you’d like us to help you with, reach out right away through our contact form!
machine-learning kubernetes natural-language-processing python docker containers