Why we don’t use Lambda for serverless machine learning

AWS’s serverless compute platform doesn’t work for ML inference

Mar 27 · 4 min read

When it comes to serving model predictions, serverless has become a popular architecture. Most ML deployments can be conceptualized as a straightforward predict() function, and if, from the developer’s perspective, a model is essentially just a function, it makes perfect sense to deploy it on a serverless compute platform like Lambda.
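
As a rough illustration (the model file and field names here are placeholders, not any particular framework’s API), a deployed model often boils down to something like this:

```python
import pickle

# Load the trained model once at startup, so each request only pays for inference.
# Assumes a scikit-learn-style model saved as model.pkl (a placeholder).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def predict(payload):
    # payload is the request body, e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    features = payload["features"]
    prediction = model.predict([features])[0]
    return {"prediction": prediction}
```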

That was our initial assumption before we began work on Cortex. What we found, however, was that while serverless is (in our opinion) the correct approach to model inference, Lambda is the wrong choice of platform.

In fact, Lambda’s shortcomings as a model inference platform provided us an early roadmap for key Cortex features. These shortcomings included:

1. Deep learning models are too big for Lambda

The trend in deep learning is towards bigger and bigger models. Looking at state-of-the-art language models released since 2019:

  • OpenAI’s GPT-2 has 1.5 billion parameters.
  • Salesforce’s CTRL has 1.6 billion parameters.
  • Google’s Meena has 2.6 billion parameters.
  • Nvidia’s Megatron has 8.3 billion parameters.

And thanks to transfer learning, more and more engineers are fine-tuning these models for new domains. AI Dungeon, the AI-powered choose-your-own-adventure text game, is an example of a fine-tuned GPT-2 model.

For a modern inference serving platform to work, it has to be able to deploy models of this size, and Lambda lacks the requisite storage and memory.

A trained GPT-2 is 5 GB. Lambda’s deployment package limit is 250 MB uncompressed. Additionally, GPT-2 needs to be loaded into memory in order to serve predictions, and Lambda functions have an upper bound of 3,008 MB of memory.
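
Put concretely, a back-of-the-envelope comparison using those figures:

```python
# Back-of-the-envelope comparison of GPT-2's footprint against Lambda's limits.
model_size_mb = 5 * 1024      # ~5 GB trained GPT-2
package_limit_mb = 250        # Lambda's uncompressed deployment package limit
memory_limit_mb = 3008        # Lambda's maximum function memory

print(model_size_mb / package_limit_mb)   # ~20x over the package limit
print(model_size_mb / memory_limit_mb)    # ~1.7x over the memory ceiling
```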

Simple math dictates that Lambda is a poor choice for serving big models.

2. Lambda doesn’t support GPU inference

Speaking of big models, larger deep learning models often benefit from GPU processing for inference.

We recently benchmarked GPT-2 on CPU and GPU, and we found that the average inference latency on a GPU was 199 ms, 4.6 times faster than the 925 ms we saw on CPU.
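
That benchmark script isn’t reproduced here, but a minimal sketch of this kind of comparison, using PyTorch and the Hugging Face transformers library (an assumption for illustration, not necessarily the exact setup we used), looks roughly like this:

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def average_latency_ms(device, runs=20):
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
    inputs = tokenizer("The weather tomorrow will be", return_tensors="pt").to(device)

    with torch.no_grad():
        model.generate(**inputs, max_length=30)  # warm-up run

        start = time.perf_counter()
        for _ in range(runs):
            model.generate(**inputs, max_length=30)
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / runs * 1000  # average ms per request

print("CPU latency (ms):", average_latency_ms("cpu"))
if torch.cuda.is_available():
    print("GPU latency (ms):", average_latency_ms("cuda"))
```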

This difference matters. Think about predictive text features, like Gmail’s Smart Compose. In order for the feature to be useful, it needs to serve predictions before you type more characters — not after.


The average person types roughly 200 characters per minute, or 3.33 characters per second, meaning there is about 300 ms between each character.

If you’re running on a CPU taking 925 ms per request, you’re way too slow for Gmail’s Smart Compose. By the time you process one of a user’s characters, they’re roughly 3 characters ahead — even more if they’re a fast typer.

With GPUs, however, you’re well ahead of them. At 199 ms per request, you’ll be able to predict the rest of their message with about 100 ms to spare — which is useful when you consider their browser still needs to render your prediction.
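
The arithmetic behind that comparison, using the figures above:

```python
chars_per_minute = 200
ms_between_chars = 60_000 / chars_per_minute    # ~300 ms per character

cpu_latency_ms = 925
gpu_latency_ms = 199

print(cpu_latency_ms / ms_between_chars)     # ~3.1: the user is ~3 characters ahead on CPU
print(ms_between_chars - gpu_latency_ms)     # ~101 ms of headroom per character on GPU
```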

The fact that Lambda doesn’t support GPUs makes it a nonstarter for many ML projects.

3. Concurrency is suboptimal in Lambda

Each Lambda instance is capable of processing one request at a time, and while Lambda instances can handle consecutive requests, they cannot handle concurrent requests.

This may not be a problem for many web applications, where requests can be served in rapid succession and the total number of instances kept low, but ML inference can involve significant latency, which creates problems.

It is common, for example, to make IO requests within ML prediction APIs. Recommendation engines will often call out to a database to get more user data before generating a prediction. If a prediction takes 700 ms to generate, plus the time it takes to execute the IO request, it’s feasible that each request would occupy an entire Lambda instance for more than a second.

It doesn’t make sense, however, to lock down an entire instance waiting for the IO to complete. There are plenty of idle resources on the instance to field parallel requests.

By utilizing these extra resources, users could process more concurrent requests with fewer instances, bringing down their bill substantially—but because of Lambda’s inflexibility, this is not possible.
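
As an illustration of what Lambda’s one-request-per-instance model rules out, here is a hedged sketch (plain Python with asyncio, not Lambda code) of a single process overlapping the IO waits of several requests, so the instance isn’t idle while a database call is in flight:

```python
import asyncio
import time

async def fetch_user_data(user_id):
    # Stand-in for a database or feature-store lookup (~300 ms of IO wait).
    await asyncio.sleep(0.3)
    return {"user_id": user_id, "history": []}

def run_model(user_data):
    # Stand-in for model inference (~700 ms of work).
    time.sleep(0.7)
    return {"recommendations": ["item_a", "item_b"]}

async def handle_request(user_id):
    user_data = await fetch_user_data(user_id)            # other requests run during this wait
    return await asyncio.to_thread(run_model, user_data)  # keep inference off the event loop

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(handle_request(i) for i in range(4)))
    print(f"{len(results)} requests in {time.perf_counter() - start:.2f}s")
    # With these stand-ins, 4 requests finish in roughly 1 second on a single
    # process, instead of occupying 4 separate instances for ~1 second each.

asyncio.run(main())
```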

Inference workloads require more control

A large part of Lambda’s appeal is its plug-and-play nature. Machine learning inference workloads, however, require more control. Users need the ability to directly configure their resources and concurrency, as well as access to a wider variety of instance types than Lambda supports.

On the other hand, the ease and simplicity of serverless is a natural fit for inference workloads. A deployed model is, effectively, just a predict() function in the cloud, which makes serverless the ideal architecture.

With Cortex, we’ve focused on building a platform that strikes that balance, providing serverless’s ease of use while exposing the knobs users need to control their inference workloads.
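
In Cortex, for example, a model is deployed as a Python class whose predict method serves requests, while compute (CPU, GPU, memory) and concurrency are declared in the deployment configuration. The snippet below is a rough sketch of that predictor shape, not a verbatim copy of Cortex’s interface; consult the Cortex docs for the exact API:

```python
# Rough sketch of a Cortex-style predictor (names approximate).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PythonPredictor:
    def __init__(self, config):
        # Runs once per replica: load the (potentially multi-GB) model onto
        # whatever instance type the deployment was configured with.
        self.device = "cuda"  # assumes the API was configured with a GPU
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.model = GPT2LMHeadModel.from_pretrained("gpt2").to(self.device).eval()

    def predict(self, payload):
        # Runs per request.
        inputs = self.tokenizer(payload["text"], return_tensors="pt").to(self.device)
        output = self.model.generate(**inputs, max_length=50)
        return self.tokenizer.decode(output[0])
```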

Lambda is fantastic—just not for ML.

