How companies can help data scientists do their job
May 28 · 4 min read
Machine learning in production is a fairly new phenomenon, and as such, the playbook for managing data science teams that build production ML pipelines is still being written. As a result, well-intentioned companies often get things wrong, and accidentally institute policies that, while designed to improve things, actually hamper their data scientists.
I work with a lot of data science teams (for context, I contribute to Cortex, an open source model serving platform), and I constantly hear stories about weird internal policies that, while designed to make things smoother, make their day-to-day harder.
Policies like…
1. All cloud resources need a formal request
If your team doesn’t use local machines, then just about every operation you do requires cloud resources. Processing data, training and evaluating models, experimenting in notebooks, deploying models to production — everything requires a server.
I’ve worked with more than one team in which any request for cloud resources needed to be submitted formally for approval. That means nearly any new ML-related operation required a request process.
Typically, companies adopt this setup for security and control purposes. The fewer people have cloud access, the thinking goes, the fewer opportunities for mistakes. The problem is that development slows to a crawl in these environments. Instead of exploring new ideas or building new models, data scientists spend days navigating the red tape required to get an EC2 instance.
Giving data scientists cloud privileges, at least enough that they can independently do the basic cloud operations required for their job, would significantly increase the speed at which teams move.
2. GPUs are fine — but only for training
One data science team we worked with was actually given their own AWS accounts, one for dev and another for prod. Their dev account, where they did model training, had access to GPUs. Their prod account, where models were deployed, did not.
On some level, you can kind of see where the DevOps team was coming from. They were held accountable for cloud spend, and GPU inference can get expensive. If data scientists were getting by without it before, why did they need it now?
In order to get GPUs, the data science team would need to prove that the models they were deploying couldn’t generate predictions with acceptable latency without GPUs. That’s a lot of friction—particularly when you realize they were already running GPU instances in dev.
3. Notebook instances must be monitored nonstop
One data science team had some weird policies when it came to notebooks.
Basically, the IT department was very vigilant about unnecessary spending, specifically on notebook instances. They were aggressive about deleting instances as soon as possible, and any new notebook instance required their sign-off.
In theory, holding someone responsible for how resources are managed is a good thing. However, holding someone accountable for managing another team’s resources is clearly a recipe for disaster. Instead of ending up with a lean, financially responsible data science team, these companies burn money on wasted time, as data scientists spend hours twiddling their thumbs waiting for approval from IT.
In addition, this is terrible for morale. Imagine needing to file a ticket just to open up a notebook, or having to answer probing questions about any long-running notebook instance.
Infrastructure makes a better safeguard than policy
As frustrating and bizarre as these policies can seem, they’re all designed for legitimate purposes. Companies want to institute safeguards to control costs, create accountability, and ensure security.
The mistake, however, is that these safeguards are instituted through management policies when they should be baked into the infrastructure.
For example, instead of having a policy for requesting cloud resources, some of the best teams we work with simply give data scientists IAM roles with tightly scoped permissions. Similarly, instead of having another team constantly monitoring data scientists, many companies give the data science team a provisioned-but-limited sandbox to experiment in.
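To make "tightly scoped permissions" concrete, here is a minimal sketch of what such an IAM policy document could look like, built in Python. The allow-list of instance types, the statement IDs, and the function name are all hypothetical, and the policy has not been validated against a real account; treat it as an illustration of the idea, not a recommended configuration.

```python
import json

# Hypothetical allow-list of instance types for a data science sandbox.
ALLOWED_INSTANCE_TYPES = ["t3.medium", "g4dn.xlarge"]


def build_sandbox_policy(allowed_types):
    """Return an IAM policy document (as a dict) for a scoped sandbox role.

    Assumption: data scientists only need to list, launch, and terminate
    EC2 instances. A real role would likely need more (S3, logs, etc.).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Let the role perform basic EC2 operations on its own.
                "Sid": "AllowBasicEc2Operations",
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeInstances",
                    "ec2:RunInstances",
                    "ec2:TerminateInstances",
                ],
                "Resource": "*",
            },
            {
                # Guardrail: refuse launches of instance types off the list.
                "Sid": "DenyOtherInstanceTypes",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringNotEquals": {"ec2:InstanceType": list(allowed_types)}
                },
            },
        ],
    }


policy = build_sandbox_policy(ALLOWED_INSTANCE_TYPES)
print(json.dumps(policy, indent=2))
```

The point of the design is that the limit lives in the role itself: nobody has to approve a ticket, and a launch outside the allow-list simply fails.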
We think about how to solve these issues via infrastructure all the time when building our model serving platform. For example, we want to lower inference costs, but we also want to give data scientists the latitude to use the best tool for the job. Instead of putting limits around GPU inference, we built out spot instance support. Now, teams can use GPU instances with ~90% discounts.
When you try to enforce these safeguards through management policies, you introduce a complex web of human dependencies and a hierarchy of authority. Inevitably, this slows progress and introduces friction.
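As a rough illustration of what a ~90% discount means for a GPU inference budget, here is a back-of-the-envelope calculation. The hourly rate is a made-up placeholder, not a real price, and the 720-hour month is an assumption; actual spot discounts vary by instance type and over time, and spot capacity can be reclaimed by the provider.

```python
def monthly_cost(hourly_rate, hours=720, discount=0.0):
    """Cost of one instance running for `hours` at `hourly_rate`.

    `discount` is a fraction between 0 and 1 (0.90 means 90% off).
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_rate * hours * (1.0 - discount)


# Hypothetical on-demand price for a GPU instance, in dollars per hour.
ON_DEMAND_RATE = 3.00
# The ~90% figure from the text, taken at face value.
SPOT_DISCOUNT = 0.90

on_demand = monthly_cost(ON_DEMAND_RATE)
spot = monthly_cost(ON_DEMAND_RATE, discount=SPOT_DISCOUNT)

print(f"on-demand: ${on_demand:,.2f}/month")
print(f"spot:      ${spot:,.2f}/month")
```

Under these assumptions, the same always-on GPU endpoint costs about a tenth as much per month, which changes the conversation from "prove you need GPUs" to "here is the budget they fit in."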
When you institute these safeguards as features of your infrastructure, however, you enable your team to move independently and quickly—just with some guardrails.