GPU-as-a-Service on KubeFlow: Fast, Scalable and Efficient ML


Machine Learning (ML) and Deep Learning (DL) involve compute and data intensive tasks. In order to maximize our model accuracy, we want to train on larger datasets, evaluate a variety of algorithms, and try out different parameters for each algorithm (hyper-parameter tuning).

As our datasets and model complexity grow, so does the time we need to wait for our jobs to complete, leading to inefficient use of our time. We end up running fewer iterations and tests or working on smaller datasets as a result.

NVIDIA GPUs are a great tool to accelerate our data science work. They are the obvious choice for Deep Learning workloads and deliver better ROI than CPUs. With new developments like RAPIDS, NVIDIA is tackling data analytics and machine learning workloads (like XGBoost) efficiently (read the details in my previous post: Python Pandas at Extreme Performance). For example, an analytic task that reads JSON data, aggregates its metrics, and writes the result back to a compressed (Parquet) file runs in 1.4 seconds on a GPU versus 43.4 seconds on a CPU (that's 30x faster!).
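To make that concrete, here is a minimal sketch of such a task with the RAPIDS cuDF library (file and column names are illustrative, not the benchmark code):

import cudf  # RAPIDS GPU DataFrame library

# Read JSON records onto the GPU, aggregate, and write a compressed Parquet file
df = cudf.read_json('events.json', lines=True)
summary = df.groupby('user_id').agg({'latency': 'mean', 'bytes': 'sum'})
summary.to_parquet('events_summary.parquet', compression='snappy')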

The Challenge: Sharing GPUs

CPUs have long supported technologies such as virtualization (hypervisors), virtual memory, and IO management. We can run many different workloads, virtual machines and containers on the same CPUs while ensuring maximum isolation. We can use a variety of clustering technologies to scale computation across multiple CPUs and systems and use schedulers to dynamically allocate computation to tasks.

GPUs, on the other hand, must be assigned to specific VMs or containers. This leads to inefficiency: when a GPU-intensive task finishes, the GPU stays idle. If we associate a notebook server with a GPU, for example, we waste GPU resources whenever a task is not running, such as when we are writing or debugging code or when we are out to lunch. This is a problem, especially since GPU-equipped servers are more expensive.

Some solutions partition a single GPU into smaller virtual GPUs, but this doesn't solve the problem: the (mostly idle) fragment we get is often too small or has too little memory to run our task, and because it is hard to isolate memory between tasks sharing the same GPU, we can run into many potential glitches.

Solution: Dynamic GPU Allocation and Scale-Out

The solution users are looking for is one that can harness multiple GPUs for a single task (so it can complete faster) and allocate GPUs just for the duration of the task. This can be made possible by combining containers with orchestration, clustering and a shared data layer.

Let’s assume we write some code in our Jupyter notebook or IDE (e.g. PyCharm). We can execute it locally, but when we need scale, we turn on a knob and it runs 10–100x faster on a distributed cluster. Wouldn’t that be nice? Can we implement such a dream? Yes, we can, and I will show you a demo a little further into this article.

To achieve this, we need to be able to package and clone our code and libraries into multiple dynamically scheduled containers at run time. We need all those containers to share the same data and to implement a task distribution/parallelism mechanism, as illustrated in the following diagram.


Dynamic GPU/CPU Allocation (image by author)

A new open-source ML orchestration framework called MLRun allows us to define "serverless" ML functions which consist of code, configuration, packages, and infrastructure dependencies (such as CPUs, GPUs, memory, storage, etc.). Those "serverless" functions can run locally in our notebook or over one or more containers which are created dynamically for the duration of the task (or stay longer if needed). The client/notebook and the containers can share the same code and data through a low-latency shared data plane (i.e. virtualized as one logical system).
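As a rough illustration of the idea (the file, handler, and parameter names here are mine, not the demo's), defining and running such a function from a notebook might look like this:

import mlrun

# Package code from a file (or the notebook itself) into an MLRun "serverless" function
fn = mlrun.code_to_function(name='my-trainer', filename='train.py',
                            kind='job', image='mlrun/mlrun')

# Run it locally, inside the notebook process ...
fn.run(handler='train', params={'lr': 0.01}, local=True)

# ... or run the very same function as a dynamically created container on the cluster
fn.run(handler='train', params={'lr': 0.01}, local=False)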

MLRun builds on top of Kubernetes and KubeFlow and uses the Kubernetes API to create and manage resources. It leverages KubeFlow custom resources (CRDs) to seamlessly run horizontally scaling workloads (such as Nuclio functions, Spark, Dask, Horovod ...), the KubeFlow SDK to attach tasks to resources like storage volumes and secrets, and KubeFlow Pipelines to create multi-step execution graphs (DAGs).

Every local or remote task executed through MLRun is tracked by the MLRun service controller; all inputs, outputs, logs, and artifacts are stored in a versioned database and can be browsed using a simple UI, the SDK, or REST API calls, i.e. built-in job and artifact management. MLRun functions can be chained to form a pipeline, and they support hyper-parameters and AutoML tasks, Git integration, and project packaging, but those are topics for other posts; read more here.


GPU-as-a-Service Stack (image by author)

Example: Distributed Image Classification Using Keras and TensorFlow

In our example we have a 4-step pipeline based on the famous Cats & Dogs TensorFlow use-case:

  1. Data ingestion function — loads an image archive from AWS S3
  2. Data labeling function — labels the images as dogs or cats
  3. Distributed training function — uses TensorFlow/Keras with Horovod to train the model
  4. Model serving function — deploys an interactive model server

You can see the full MLRun project and notebook sources here; we will focus on the 3rd and 4th steps. For this to work you need Kubernetes with a few open-source services running on it (Jupyter, KubeFlow Pipelines, KubeFlow MpiJob, Nuclio, MLRun) and shared file system access, or you can ask for an Iguazio cloud trial with those pre-integrated.

Our code can run locally (see the 1st and 2nd steps in the notebook). To run the distributed training on a cluster with GPUs, we simply define our function as an MPIJob kind (one which uses MPI and Horovod to distribute our TensorFlow logic). We specify a link to the code, a container image (alternatively, MLRun can build the image for us), the required number of containers, and the number of GPUs per container, and we attach the function to a file mount (we use the Iguazio low-latency v3io fabric mount, but other Kubernetes shared file volume drivers or object storage solutions work just as well).
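A minimal sketch of such a definition with the MLRun Python API (the file name, image, and resource numbers are illustrative):

from mlrun import code_to_function, mount_v3io

# Distributed training function: kind='mpijob' runs it via the KubeFlow MPIJob
# operator (MPI + Horovod). The script path and image are assumptions.
trainer = code_to_function(name='trainer',
                           filename='horovod-training.py',
                           kind='mpijob',
                           image='mlrun/ml-models-gpu')

trainer.spec.replicas = 4        # number of worker containers
trainer.with_limits(gpus=1)      # GPUs per container
trainer.apply(mount_v3io())      # attach the shared (v3io) file mount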

Once we have defined the function object, all we need to do is run it with a set of parameters and inputs (datasets or files), and specify the default location for output artifacts (e.g. trained model files):

mprun = trainer.run(name='train', params=params, artifact_path='/User/mlrun/data', inputs=inputs, watch=True)

Note that in this example we didn't need to move code or data. Because we use the same low-latency shared file system mounts across our notebook and the worker containers, we can simply modify the code in Jupyter and re-run the job (all the job containers will see the new .py file changes), and all the job results are instantly visible in the Jupyter file browser or the MLRun UI.

In addition to viewing the job progress interactively (watch=True), the run object (mprun) holds all the information about the run, including pointers to the output artifacts, logs, status, etc. We can use the MLRun web-based UI to track job progress, compare experiment results, or access versioned artifacts.
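For example (attribute names here are assumed from the MLRun SDK):

print(mprun.state())                  # e.g. 'completed'
print(mprun.outputs)                  # result metrics and paths to produced artifacts
model_path = mprun.outputs['model']   # assumes the training code logged a 'model' artifact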

We use .save() to serialize and version our function object into a database, so we can retrieve the same function object later in a different notebook or in CI/CD pipelines (no need to copy code and config between notebooks and teams).
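A sketch of that flow (the project name and the db:// address are illustrative):

# In the training notebook: persist the versioned function object
trainer.save()

# Later, in another notebook or a CI/CD job (assumes MLRun's db:// function addressing)
import mlrun
trainer = mlrun.import_function('db://cat-vs-dog/trainer')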

If we want to deploy the generated model as an interactive serverless function, all we need to do is feed the "model" and "category_map" outputs into a "serving" function and deploy it to our test cluster.

MLRun orchestrates auto-scaling Nuclio functions, which are super fast and can be stateful (they support GPU attachment, shared file mounts, state caching, streaming, etc.). The functions auto-scale to fit the workload and scale to zero if no requests arrive for a few minutes (consuming zero resources). In this example we use "nuclio-serving" functions (Nuclio functions which host standard KFServing model classes); as you can see below, it only takes one command (deploy) to make it run as a live serverless function.
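For illustration, here is a minimal sketch using MLRun's serving-function interface rather than the exact nuclio-serving code from the demo (the function name, model path, and model class name are assumptions):

import mlrun

# Create a serving function (kind/image are assumptions, not the exact demo code)
serving = mlrun.code_to_function(name='cat-vs-dog-serving', kind='serving',
                                 image='mlrun/mlrun')

# Register the trained model produced by the training step
# ('TFModel' stands in for the KFServing-style model class used in the demo)
serving.add_model('cat_vs_dog_v1',
                  model_path=mprun.outputs['model'],
                  class_name='TFModel')

# One command deploys it as a live, auto-scaling Nuclio function
addr = serving.deploy()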


Now we have a running inference function, and we can test the endpoint with a simple HTTP request carrying an image URL or even a binary image in the payload.
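For example, a hypothetical test call (the endpoint URL and payload schema depend on the deployed function):

import requests

# Hypothetical endpoint; the real address is printed by the deploy command
url = 'http://my-cluster:30100/cat_vs_dog_v1/predict'

# Send an image URL in a JSON payload ...
print(requests.post(url, json={'data_url': 'https://example.com/cat.jpg'}).text)

# ... or post the binary image itself
with open('cat.jpg', 'rb') as f:
    print(requests.post(url, data=f.read()).text)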


End to End Workflow with KubeFlow Pipelines

Now that we have tested each step of our pipeline manually, we may want to automate the process and potentially run it on a given schedule or trigger it from an event (e.g. a Git push). The next step is to define a KubeFlow Pipelines graph (DAG) which chains the 4 steps into a sequence, and to run that graph.

MLRun functions can be converted into KubeFlow Pipeline steps with a single method call (.as_step()), specifying how the outputs of one step feed into the inputs of the next. Check the full notebook example here; the graph DSL is sketched below.
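A rough sketch of that graph DSL (the function objects utils, trainer, and inference, and all parameter names, are illustrative; see the linked notebook for the real definition):

from kfp import dsl
from mlrun import mount_v3io

@dsl.pipeline(name='cats-vs-dogs', description='train and deploy an image classifier')
def kfpipeline(image_archive, epochs=2):
    # step 1: download and extract the image archive
    ingest = utils.as_step(name='ingest', handler='open_archive',
                           inputs={'archive_url': image_archive},
                           outputs=['content']).apply(mount_v3io())

    # step 2: label the extracted images as cats or dogs
    label = utils.as_step(name='label', handler='categorize',
                          inputs={'source_dir': ingest.outputs['content']},
                          outputs=['categories']).apply(mount_v3io())

    # step 3: distributed training (the MPIJob/Horovod function defined earlier)
    train = trainer.as_step(name='train', params={'epochs': epochs},
                            inputs={'data_path': label.outputs['categories']},
                            outputs=['model', 'category_map']).apply(mount_v3io())

    # step 4: deploy the serving function with the newly trained model
    inference.deploy_step(models={'cat_vs_dog_v1': train.outputs['model']})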

MLRun projects can have multiple workflows, and they can be launched with a single command or can be triggered by various events such as a Git push or HTTP REST call.

Once we run our pipeline we can track its progress using KubeFlow; MLRun automatically registers metrics, inputs, outputs, and artifacts in the KubeFlow UI without our writing a single extra line of code (I guess you should try doing it without MLRun first to appreciate it :blush:).


KubeFlow Output (image by author)

For a more basic project example, see the MLRun Iris XGBoost Project; other demos can be found in the MLRun Demos repository, and you can check the MLRun readme and examples for tutorials and simple examples.

Summary

This article demonstrated how computational resources can be used efficiently to run data science jobs at scale. More importantly, it showed how data science development can be simplified and automated, enabling far greater productivity and faster time to market. Ping me on the KubeFlow Slack if you have additional questions.

Check out my GPU-as-a-Service presentation and live demo from KubeCon NA in November.

Yaron

