Stop making data scientists manage Kubernetes clusters

Category: IT Technology · Published: 6 years ago


Building models is hard enough

Feb 17 · 5 min read

[Image. Source: Pexels]

Disclaimer: The following is based on my observations of machine learning teams, not an academic survey of the industry. For context, I’m a contributor to Cortex, an open source platform for deploying models in production.

Production machine learning has an organizational problem, one that is a byproduct of its relative youth. While more mature fields—web development, for example—have developed best practices over decades, production machine learning hasn’t yet.

To illustrate, imagine you were tasked with growing a product engineering org for your startup, which develops a web app. Even if you had no experience building a team, you could find thousands of articles and books on how your engineering org should be structured and grown.

Now imagine you are at a startup that has dabbled with machine learning. You’ve hired a data scientist to lead the initial efforts, and the results have been good. As machine learning becomes more deeply embedded into your product, it becomes obvious that the machine learning team needs to grow, as the responsibilities of the data scientist have ballooned.

In this situation, there are not thousands of articles and books on how a production machine learning team should be structured.

This is not an uncommon scenario, and what frequently happens is that the new responsibilities of the machine learning org (infrastructure, in particular) get passed on to the data scientist(s).

This is a mistake.

The difference between machine learning and machine learning infrastructure

The difference between a platform engineer and a product engineer is pretty well understood at this point. Similarly, data analysts and data engineers are clearly differentiated roles.

Machine learning, at many companies, is still missing that specialization.

To see why the delineation between machine learning and machine learning infrastructure is important, it’s helpful to look at the work and tooling required for each.

To design and train new models, a data scientist is going to:

  • Spend their time in a notebook, analyzing data and running experiments.
  • Worry about things like data hygiene and selecting the right model architecture for their dataset.
  • Use a programming language like Python, R, Swift, or Julia.
  • Be opinionated about machine learning frameworks like PyTorch or TensorFlow.

In other words, their responsibilities, skills, and tools are going to revolve around manipulating data to develop models, and their ultimate output will be models that deliver the most accurate predictions possible.

The infrastructure side is fundamentally different.

A common way to put a model into production is to deploy it to the cloud as a microservice. To deploy a model as a production API, an engineer is going to:

  • Spend their time split between config files, their terminal, and their cloud provider’s console, trying to optimize stability, latency, and cost.
  • Worry about things like autoscaling their instances, updating models without crashing APIs, and serving inferences on GPUs.
  • Use tools like Docker, Kubernetes, Istio, Flask, and whatever services/APIs their cloud provider offers.
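To make the serving side concrete, here is a minimal sketch of a prediction API that can swap in a new model without taking the endpoint down. It is an illustration under stated assumptions, not how Cortex or any particular platform does it: it uses Python's standard library http.server instead of Flask so it runs without dependencies, and the names ModelStore, average_model, handle_predict, and serve are hypothetical, as is the placeholder "model" that just averages its inputs.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ModelStore:
    """Holds the live model and lets it be replaced while requests are served."""

    def __init__(self, model):
        self._lock = threading.Lock()
        self._model = model

    def get(self):
        with self._lock:
            return self._model

    def swap(self, new_model):
        # Replace the reference in one step: requests that already fetched the
        # old model finish with it, later requests get the new one.
        with self._lock:
            self._model = new_model


def average_model(features):
    """Hypothetical stand-in for a trained model: the mean of the inputs."""
    return sum(features) / len(features)


STORE = ModelStore(average_model)


def handle_predict(body: bytes):
    """Turn a raw JSON request body into an (HTTP status, response dict) pair."""
    try:
        features = json.loads(body)["features"]
        prediction = STORE.get()(features)
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        # Malformed JSON, a missing key, or unusable inputs become a 400
        # response instead of an unhandled exception in the server.
        return 400, {"error": 'expected JSON like {"features": [1.0, 2.0]}'}
    return 200, {"prediction": prediction}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the API; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling serve() starts the API, and STORE.swap(new_model) replaces the model while it runs. Even this toy version leaves out most of what the list above describes: containerizing it, running replicas behind a load balancer, autoscaling them, and scheduling them onto GPUs are all separate problems.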

An easy way to visualize the difference between working on machine learning and working on machine learning infrastructure is like this:

[Figure: Machine learning vs. machine learning infrastructure]

Intuitively, it makes sense that a data scientist should handle the circle on the left, but not so much the circle on the right.

What’s wrong with having non-specialists manage infrastructure?

Let’s run this as a hypothetical. Say you had to assign someone to manage your machine learning infrastructure, but you didn’t want to dedicate someone full-time to it. Your only two options would be:

  • A data scientist, because of their familiarity with machine learning.
  • A devops engineer, because of their familiarity with general infrastructure.

Both of these options have issues.

First, data scientists should spend as much time as possible doing what they’re best at—data science. While learning infrastructure certainly isn’t beyond them, both infrastructure and data science are full-time jobs, and splitting a data scientist’s time between them will reduce the quality of output in both roles.

Second, your organization needs someone dedicated specifically to machine learning infrastructure. Serving models in production is different from hosting a web app. You need someone specialized for the role, who can advocate for machine learning infrastructure within your org.

This advocacy piece turns out to be crucial. I get to see inside a lot of machine learning orgs, and you’d be surprised how often their bottlenecks stem not from technical challenges, but from organizational ones.

For instance, I’ve seen machine learning teams that need GPUs for inference (big models like GPT-2 basically require them for reasonable latency) but can’t get them, because their infrastructure is managed by the broader devops team, which doesn’t want to put the cost on its account.

Having someone dedicated to your machine learning infrastructure means you have not only a team member who is constantly improving your infrastructure, but also an advocate who can get your team what it needs.

Who should manage the infrastructure then?

Machine learning infrastructure engineers.

Now, before you disagree about the official title, let’s just acknowledge that it’s still early days for production machine learning and that it’s the wild west when it comes to titles. Different companies might call it:

  • Machine learning infrastructure engineer
  • Data science platform engineer
  • ML production engineer

We can already see mature machine learning organizations hiring for this role, including Spotify:

[Image. Source: Spotify]

As well as Netflix:

[Image. Source: Netflix]

As ML-powered features like Gmail’s Smart Compose, Uber’s ETA prediction, and Netflix’s content recommendation become ubiquitous in software, machine learning infrastructure is becoming more and more important.

If we want a future in which ML-powered software is truly commonplace, removing the infrastructure bottleneck is essential—and to do that, we need to treat it as a real specialization, and let data scientists focus on data science.

