Why we switched from Flask to FastAPI for production machine learning



The most popular tool isn’t always the best

Jun 11 · 6 min read


To productionize a machine learning model, the standard approach is to wrap it in a REST API and deploy it as a microservice. Flask is currently the de facto choice for writing these APIs for a couple of reasons: it's minimal, it's written in Python, and it's widely adopted by the Python community. For many inference services, all you really need to expose is a single predict() route, and Flask makes that trivial.

When we were selecting a framework to use under the hood of Cortex, our open source model serving platform, we picked Flask for these same reasons. However, with the release of version 0.14, we switched from Flask to FastAPI.

Several releases later, we’re very pleased with our decision.

Below, I’ve gone in depth on the core reasons we switched from Flask to FastAPI. If you’re curious about using something other than Flask for building your inference APIs, this will hopefully provide some useful context.

Note: If you are unfamiliar, FastAPI is a Python API microframework built on top of Starlette and Uvicorn.

1. ML inference benefits from native async support

We initially began looking for alternatives to Flask because of issues we were running into with autoscaling.

Rearchitecting autoscaling within Cortex is its own story, but the short version is that Cortex used to autoscale by measuring CPU utilization, and now autoscales according to how many incoming requests an API is receiving.

The problem we ran into here was that we didn't have an elegant solution for counting incoming requests. For request-based autoscaling to work, Cortex needs to be able to asynchronously count queued and in-flight requests, and Flask, being designed for WSGI servers like Gunicorn, doesn't have native async support.

The easiest solution, for us, was to switch to a framework with native async support. Being built on top of Uvicorn, an ASGI server, FastAPI made it easy to run an async event loop that counts incoming requests.
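To make that concrete, here is a minimal sketch (not Cortex's actual implementation) of how a FastAPI HTTP middleware can track in-flight requests on the event loop; the counter and the /metrics endpoint are hypothetical:

```python
from fastapi import FastAPI, Request

app = FastAPI()

# Hypothetical in-flight request counter; Cortex's real request tracking
# and autoscaling logic is more involved than this sketch.
in_flight_requests = 0


@app.middleware("http")
async def count_requests(request: Request, call_next):
    global in_flight_requests
    in_flight_requests += 1  # a request has entered the API
    try:
        return await call_next(request)
    finally:
        in_flight_requests -= 1  # the request finished (or errored)


@app.get("/metrics")
async def metrics():
    # An autoscaler could poll a count like this to decide when to scale.
    return {"in_flight_requests": in_flight_requests}
```

Because everything runs on a single event loop, the counter can be updated without locks, which is exactly the kind of bookkeeping that is awkward to do cleanly across WSGI worker processes.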

But even beyond solving our autoscaler issues, async support has enabled us to begin working on more complex inference features.

For example, Cortex allows users to write their own request handling code using a Predictor interface. The interface is a Python class that provides methods for initializing a model file and generating predictions:
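Roughly, the interface looks like the following (a simplified sketch; the method names and the pickle-based loading are illustrative rather than Cortex's exact API):

```python
import pickle


class PythonPredictor:
    def __init__(self, config):
        # Called once at startup: load the model file into memory.
        with open(config["model_path"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, payload):
        # Called per request: turn the request payload into a prediction.
        return self.model.predict([payload["features"]]).tolist()
```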

But some users need to include operations beyond generating predictions in their predict() method—saving a file from S3, logging predictions to an external service, etc.

Ideally, these tasks wouldn’t run in predict() , as they add to inference latency. We’re currently working on implementing pre- and post-predict hooks, which will allow users to asynchronously run operations that don’t need to block the actual inference for the request.
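Those hooks aren't released yet, but FastAPI's built-in BackgroundTasks gives a feel for the pattern: non-essential work is scheduled to run after the response is sent, so it never blocks inference. Below is a minimal sketch with a hypothetical endpoint and logging helper:

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def log_prediction(payload: dict, prediction: dict) -> None:
    # Hypothetical post-predict work, e.g. shipping the prediction to an
    # external monitoring service. Runs outside the request's critical path.
    print("logged:", payload, prediction)


@app.post("/predict")
async def predict(payload: dict, background_tasks: BackgroundTasks):
    prediction = {"label": "positive"}  # stand-in for a real model call
    # Scheduled to run after the response is returned, so it adds no latency
    # to the prediction itself.
    background_tasks.add_task(log_prediction, payload, prediction)
    return prediction
```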

2. Improved latency is a huge deal for inference

Latency and throughput are always important, but in production machine learning, their importance is amplified.

For example, if Uber’s ETA prediction is a few seconds late on your location or on traffic data, its utility decreases significantly. Similarly, if Gmail’s Smart Compose suggests text slower than you type, the feature has little value.

Because of this, every improvement we can make in overall latency and throughput is valuable, even seemingly minor ones.

FastAPI, as its name suggests, is one of the fastest Python frameworks, outperforming Flask by over 300%:

[Benchmark chart comparing framework throughput, with FastAPI well ahead of Flask. Source: Web Framework Benchmarks]

For most deployments, the speed of the underlying framework is not the largest factor in determining inference latency. However, when you consider the cost of improving latency, it is clear that any improvement is valuable.

For example, Smart Compose needs to serve predictions at under 100ms. Even after designing a model specifically for faster predictions, the team couldn't hit this threshold. They had to deploy on Cloud TPUs—the smallest of which is $4.50/hour on-demand—in order to get latency under 100ms.

In that context, improving the speed of the underlying framework can have a large benefit. Even a small decrease in latency can prevent a team from needing more expensive hardware.

3. FastAPI is easy to switch to—by design

There are other frameworks faster than Flask that have native support for async. Our decision to choose FastAPI over the rest of them, while still largely motivated by its technical advantages, was heavily impacted by its low switching cost.

For context, we’re a small team. Cortex has an open source community with amazing contributors, but only four of us work on it full time. Because our engineering hours are precious, switching costs are a major consideration for us anytime we consider a change.

One of FastAPI’s selling points is that it is by design very similar to Flask in terms of syntax. For example, this is a snippet of routing code from Cortex v0.13, when it was built on Flask:

[Code screenshot: Flask routing code in Cortex v0.13]
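A stripped-down Flask route in the same spirit looks something like this (a sketch, not Cortex's actual source):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def run_predict(payload):
    # Stand-in for the real call into the user's Predictor.
    return {"prediction": 1}


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    return jsonify(run_predict(payload))
```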

And here is the equivalent code in v0.14, when we first transitioned to FastAPI:

[Code screenshot: FastAPI routing code in Cortex v0.14]
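Again as a sketch of the equivalent, the FastAPI version is nearly line-for-line the same:

```python
from fastapi import FastAPI

app = FastAPI()


def run_predict(payload: dict) -> dict:
    # Stand-in for the real call into the user's Predictor.
    return {"prediction": 1}


@app.post("/predict")
def predict(payload: dict):
    # FastAPI parses the JSON request body into `payload` automatically.
    return run_predict(payload)
```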

The initial transition from Flask to FastAPI required surprisingly little rewriting (ignoring features like autoscaling, for which we were introducing new designs).

Obviously, if another framework offered a dramatic performance advantage over FastAPI, we wouldn’t select FastAPI solely because of its ease of adoption. But, with no framework being significantly faster than FastAPI, its ease of adoption was even more reason for us to select it over others.

Balancing minimalism and maturity in production machine learning

Another interesting thing that we’ve seen since switching to FastAPI is that some of the features we initially wrote off as “nice to haves”—data validation, improved error handling, etc.—have actually proven to be valuable to our users.
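FastAPI's data validation, for instance, comes from declaring request schemas as Pydantic models. The fields below are hypothetical, but the pattern is roughly what users get for free:

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PredictionRequest(BaseModel):
    # Hypothetical schema; a real one depends on the model being served.
    user_id: str
    features: List[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    # Malformed payloads are rejected with a descriptive 422 response
    # before this handler ever runs.
    return {"prediction": sum(request.features)}
```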

This, in my opinion, reflects a broader trend within ML.

In the past, there weren’t many teams deploying models as real-time production APIs. For most data science teams, Flask was “good enough” in that it was popular, minimal, and written in Python.

But production ML, as a field, has matured. It’s increasingly common for companies to have at least one model in production. As more teams deploy models, the conversation around tooling has shifted from “What gets the job done?” to “What does it take to deploy a model at production scale?”

This maturation is the same reason we built Cortex in the first place. Years ago, data science teams could get by kludging together a "good enough" deployment process. As the field has matured, however, real infrastructure features—rolling updates, autoscaling, prediction monitoring, etc.—have gone from being "nice to haves" to being essential.

The models teams are deploying are getting bigger. The applications they’re building are more complex. The traffic these models are handling is increasing. With all of these challenges, the definition of a “good enough” solution is changing, and mature tooling is becoming essential.

