AWS’s serverless compute platform doesn’t work for ML inference
Mar 27 · 4 min read
When it comes to serving model predictions, serverless has become the popular architecture. Most ML deployments can be conceptualized as a straightforward predict() function, and if, from the developer's perspective, a model is essentially just a function, it makes perfect sense to deploy it on a serverless compute platform like Lambda.
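To make that concrete, here is a minimal sketch of what that deployment surface looks like, assuming a pickled scikit-learn-style model and hypothetical feature names (nothing here is prescribed by Lambda or Cortex, it's just the shape of the problem):

```python
# A minimal "model as a function" sketch. The model path, feature names,
# and pickle format are illustrative assumptions.
import pickle

# Load the trained model once, at startup, so each request only pays for inference.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def predict(payload: dict) -> dict:
    """The entire deployment surface: features in, prediction out."""
    features = [payload["feature_1"], payload["feature_2"]]
    return {"prediction": float(model.predict([features])[0])}
```

From the developer's perspective, everything else (provisioning, scaling, routing) is the platform's job.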
That was our initial assumption before we began work on Cortex. What we found, however, was that while serverless is (in our opinion) the correct approach to model inference, Lambda is the wrong choice of platform.
In fact, Lambda’s shortcomings as a model inference platform provided us an early roadmap for key Cortex features. These shortcomings included:
1. Deep learning models are too big for Lambda
The trend in deep learning is towards bigger and bigger models. Looking at state-of-the-art language models released since 2019:
- OpenAI’s GPT-2 has 1.5 billion parameters.
- Salesforce’s CTRL has 1.6 billion parameters.
- Google’s Meena has 2.6 billion parameters.
- Nvidia’s Megatron has 8.3 billion parameters.
And thanks to transfer learning, more and more engineers are fine-tuning these models for new domains. AI Dungeon, the AI-powered choose-your-own-adventure text game, is built on a fine-tuned GPT-2 model.
For a modern inference serving platform to work, it has to be able to deploy models of this size, and Lambda lacks the requisite storage and memory.
A trained GPT-2 is 5 GB. Lambda’s deployment package limit is 250 MB uncompressed. Additionally, GPT-2 needs to be loaded into memory in order to serve predictions, and Lambda functions have an upper bound of 3,008 MB of memory.
Simple math dictates that Lambda is a poor choice for serving big models.
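As a rough sanity check, the arithmetic looks something like this (the 4-bytes-per-parameter figure assumes fp32 weights and is an illustrative estimate, not an exact measurement):

```python
# Back-of-the-envelope sizing for GPT-2 against Lambda's limits.
gpt2_params = 1.5e9                    # GPT-2's parameter count
bytes_per_param = 4                    # assuming fp32 weights
model_size_gb = gpt2_params * bytes_per_param / 1024**3   # ~5.6 GB

lambda_package_limit_gb = 250 / 1024   # 250 MB uncompressed deployment package
lambda_memory_limit_gb = 3008 / 1024   # 3,008 MB function memory cap

print(f"GPT-2 weights: ~{model_size_gb:.1f} GB")
print(f"~{model_size_gb / lambda_package_limit_gb:.0f}x over the package limit")
print(f"~{model_size_gb / lambda_memory_limit_gb:.1f}x over the memory limit")
```

Even before counting the framework, tokenizer, and runtime overhead, the weights alone don't fit.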
2. Lambda doesn’t support GPU inference
Speaking of big models, larger deep learning models often benefit from GPU processing for inference.
We recently benchmarked GPT-2 on CPU and GPU, and found that the average inference latency was 199 ms on a GPU versus 925 ms on a CPU, roughly 4.6 times faster.
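A minimal version of that kind of comparison, using Hugging Face transformers and PyTorch (the model variant, prompt, and timing loop here are illustrative assumptions, not the exact benchmark setup), might look like:

```python
# Sketch of a CPU-vs-GPU latency comparison for GPT-2.
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Serverless inference is", return_tensors="pt")["input_ids"]

def avg_latency_ms(device: str, runs: int = 20) -> float:
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
    ids = input_ids.to(device)
    with torch.no_grad():
        model(ids)                      # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(ids)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000

print(f"CPU: {avg_latency_ms('cpu'):.0f} ms per forward pass")
if torch.cuda.is_available():
    print(f"GPU: {avg_latency_ms('cuda'):.0f} ms per forward pass")
```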
This difference matters. Think about predictive text features, like Gmail’s Smart Compose. In order for the feature to be useful, it needs to serve predictions before you type more characters — not after.
The average person types roughly 200 characters per minute, or 3.33 characters per second, meaning there is about 300 ms between each character.
If you’re running on a CPU taking 925 ms per request, you’re way too slow for Gmail’s Smart Compose. By the time you process one of a user’s characters, they’re roughly 3 characters ahead — even more if they’re a fast typer.
With GPUs, however, you’re well ahead of them. At 199 ms per request, you’ll be able to predict the rest of their message with about 100 ms to spare — which is useful when you consider their browser still needs to render your prediction.
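The back-of-the-envelope math behind those keystroke numbers:

```python
# Latency budget for a predictive-text feature, using the figures above.
chars_per_minute = 200
ms_per_keystroke = 60_000 / chars_per_minute          # ~300 ms between characters

cpu_latency_ms, gpu_latency_ms = 925, 199

print(f"Budget per keystroke: {ms_per_keystroke:.0f} ms")
print(f"CPU: ~{cpu_latency_ms / ms_per_keystroke:.1f} keystrokes behind per request")
print(f"GPU: ~{ms_per_keystroke - gpu_latency_ms:.0f} ms to spare per request")
```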
The fact that Lambda doesn’t support GPUs makes it a nonstarter for many ML projects.
3. Concurrency is suboptimal in Lambda
Each Lambda instance is capable of processing one request at a time, and while Lambda instances can handle consecutive requests, they cannot handle concurrent requests.
This may not be a problem for many web applications, where requests can be served rapid fire and the total number of instances can be kept low, but ML inferences can have quite a bit of latency, which creates problems.
It is common, for example, to make IO requests within ML prediction APIs: recommendation engines will often call out to a database to fetch more user data before generating a prediction. If a prediction takes 700 ms to generate, plus the time it takes to execute the IO request, each request can feasibly occupy an entire Lambda instance for more than a second.
It doesn’t make sense, however, to lock down an entire instance waiting for the IO to complete. There are plenty of idle resources on the instance to field parallel requests.
By utilizing these extra resources, users could process more concurrent requests with fewer instances, bringing down their bill substantially—but because of Lambda’s inflexibility, this is not possible.
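As a sketch of the alternative that Lambda's one-request-per-instance model rules out, here is roughly what overlapping the IO wait with other requests looks like on a conventional server (FastAPI, the endpoint shape, and both helper functions are illustrative assumptions):

```python
# While one request awaits its database round trip, the same process keeps
# serving other requests instead of sitting idle.
import asyncio
from fastapi import FastAPI

app = FastAPI()

async def fetch_user_features(user_id: str) -> dict:
    await asyncio.sleep(0.1)           # stand-in for a database round trip
    return {"user_id": user_id, "history": []}

def run_model(features: dict) -> list:
    return ["item_42"]                 # stand-in for the ~700 ms model call

@app.post("/predict/{user_id}")
async def predict(user_id: str):
    # Other requests are handled while this one awaits its IO.
    features = await fetch_user_features(user_id)
    # Run the CPU-bound model call in a worker thread so the event loop stays free.
    prediction = await asyncio.get_running_loop().run_in_executor(None, run_model, features)
    return {"recommendations": prediction}
```

On Lambda, every one of those overlapping requests would instead spin up (and bill for) its own instance.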
Inference workloads require more control
A large part of Lambda’s appeal is its plug-and-play nature. Machine learning inference workloads, however, require more control. Users need the ability to directly configure their resources and concurrency, as well as access to a wider variety of instance types than Lambda supports.
On the other hand, the ease and simplicity of serverless is a natural fit for inference workloads. A deployed model is, effectively, just a predict() function in the cloud, which makes serverless the ideal architecture.
With Cortex, we’ve focused on building a platform that strikes that balance, providing serverless’s ease of use while exposing the knobs users need to control their inference workloads.
Lambda is fantastic—just not for ML.