How companies can help data scientists do their job
Machine learning in production is a fairly new phenomenon, and as such, the playbook for managing data science teams that build production ML pipelines is still being written. As a result, well-intentioned companies often get things wrong, and accidentally institute policies that, while designed to improve things, actually hamper their data scientists.
I work with a lot of data science teams (for context, I contribute to Cortex, an open source model serving platform), and I constantly hear stories about weird internal policies that, while designed to make things smoother, make their day-to-day harder.
Policies like…
1. All cloud resources need a formal request
If your team doesn’t use local machines, then just about every operation you do requires cloud resources. Processing data, training and evaluating models, experimenting in notebooks, deploying models to production — everything requires a server.
I’ve worked with more than one team in which any request for cloud resources needed to be submitted formally for approval. That means nearly any new ML-related operation required a request process.
Typically, companies adopt this setup for security and control purposes. The fewer people have cloud access, the thinking goes, the fewer opportunities for mistakes. The problem is that development slows to a crawl in these environments. Instead of exploring new ideas or building new models, data scientists spend days navigating the red tape required to get an EC2 instance.
Giving data scientists cloud privileges, at least enough that they can independently do the basic cloud operations required for their job, would significantly increase the speed at which teams move.
2. GPUs are fine — but only for training
One data science team we worked with was actually given their own AWS accounts, one for dev and another for prod. Their dev account, where they did model training, had access to GPUs. Their prod account, where models were deployed, did not.
On some level, you can kind of see where the DevOps team was coming from. They were held accountable for cloud spend, and GPU inference can get expensive. If data scientists were getting by without it before, why did they need it now?
In order to get GPUs, the data science team would need to prove that the models they were deploying couldn’t generate predictions with acceptable latency without GPUs. That’s a lot of friction—particularly when you realize they were already running GPU instances in dev.
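For what it's worth, the evidence they were asked for is not hard to produce. Here is a minimal sketch of such a latency check, assuming a PyTorch model; the ResNet-50 and the single-image batch are stand-ins, not the team's actual workload:

```python
# Rough p95 latency comparison for one model on CPU vs. GPU.
# Model, input shape, and run count are illustrative placeholders.
import time
import torch
import torchvision.models as models

def p95_latency_ms(model, device, n_runs=100):
    model = model.to(device).eval()
    x = torch.randn(1, 3, 224, 224, device=device)  # one image per request
    latencies = []
    with torch.no_grad():
        for _ in range(n_runs):
            start = time.perf_counter()
            model(x)
            if device == "cuda":
                torch.cuda.synchronize()  # wait for GPU work to finish
            latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return latencies[int(0.95 * len(latencies))]

model = models.resnet50()
print("CPU p95 latency (ms):", p95_latency_ms(model, "cpu"))
if torch.cuda.is_available():
    print("GPU p95 latency (ms):", p95_latency_ms(model, "cuda"))
```

A script like this settles the CPU-versus-GPU question in an afternoon, which is exactly why a multi-week approval process around it feels so unnecessary.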
3. Notebook instances must be monitored nonstop
One data science team had some weird policies when it came to notebooks.
Basically, the IT department was vigilant about preventing unnecessary spending, particularly on notebook instances. They pushed aggressively to have instances deleted as soon as possible, and their sign-off was required for any new notebook instance.
In theory, holding someone responsible for how resources are managed is a good thing. However, holding someone accountable for managing another team’s resources is clearly a recipe for disaster. Instead of ending up with a lean, financially responsible data science team, these companies burn money on wasted time, as data scientists spend hours twiddling their thumbs waiting for approval from IT.
In addition, this is terrible for morale. Imagine needing to file a ticket just to open up a notebook? Or having to answer probing questions about any long-running notebook instance?
Infrastructure makes a better safeguard than policy
As frustrating and bizarre as these policies can seem, they’re all designed for legitimate purposes. Companies want to institute safeguards to control costs, create accountability, and ensure security.
The mistake, however, is that these safeguards are instituted through management policies when they should be baked into the infrastructure.
For example, instead of having a policy for requesting cloud resources, some of the best teams we work with simply give data scientists IAM roles with tightly scoped permissions. Similarly, instead of having another team constantly monitoring data scientists, many companies give the data science team a provisioned-but-limited sandbox to experiment in.
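As a rough sketch of what "tightly scoped" can mean in practice, here is one way to define such a role's permissions with boto3. The policy name and the exact action list are assumptions; a real policy would also constrain resources, regions, and instance types.

```python
# Illustrative only: a narrow IAM policy that lets data scientists manage
# EC2 instances themselves, and nothing else. Names and actions are
# assumptions, not a recommendation for any particular team.
import json
import boto3

iam = boto3.client("iam")

data_science_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBasicInstanceLifecycle",
            "Effect": "Allow",
            "Action": [
                "ec2:RunInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances",
                "ec2:DescribeInstances",
            ],
            "Resource": "*",  # a real policy would scope this down further
        }
    ],
}

iam.create_policy(
    PolicyName="DataScienceSandbox",  # hypothetical name
    PolicyDocument=json.dumps(data_science_policy),
    Description="Scoped permissions for the data science team",
)
```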
We think about how to solve these issues via infrastructure all the time when building our model serving platform. For example, we want to lower inference costs, but we also want to give data scientists the latitude to use the best tool for the job. Instead of putting limits around GPU inference, we built out spot instance support. Now, teams can use GPU instances with ~90% discounts.
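For context on what spot support boils down to at the EC2 level, here is a minimal boto3 sketch of requesting a GPU instance on the spot market (Cortex handles this for you; the AMI ID and instance type below are placeholders):

```python
# Illustrative only: launch a GPU instance on the spot market via boto3.
# The AMI ID is a placeholder; g4dn.xlarge is one common GPU inference type.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder deep learning AMI
    InstanceType="g4dn.xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print("Launched:", response["Instances"][0]["InstanceId"])
```

The trade-off is that spot capacity can be reclaimed at short notice, which is exactly why it makes sense to bake that handling into the platform rather than ask each data scientist to deal with it.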
When you try to enforce these safeguards through management policies, you create a complex web of human dependencies and a hierarchy of authority. Inevitably, this slows progress and introduces friction.
When you institute these safeguards as features of your infrastructure, however, you enable your team to move independently and quickly—just with some guardrails.
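To make that concrete, the notebook policy from earlier could be replaced by an automated guardrail. A minimal sketch, assuming notebook instances are plain EC2 machines carrying a hypothetical purpose=notebook tag and that "idle" means low average CPU over the last two hours:

```python
# Illustrative only: stop notebook-tagged EC2 instances that look idle.
# The tag, the 2% CPU threshold, and the two-hour window are assumptions.
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

IDLE_CPU_PERCENT = 2.0
now = datetime.now(timezone.utc)

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:purpose", "Values": ["notebook"]},  # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - timedelta(hours=2),
            EndTime=now,
            Period=300,  # 5-minute buckets
            Statistics=["Average"],
        )["Datapoints"]
        if datapoints and all(dp["Average"] < IDLE_CPU_PERCENT for dp in datapoints):
            print(f"Stopping idle notebook instance {instance_id}")
            ec2.stop_instances(InstanceIds=[instance_id])
```

Run something like this on a schedule and the cost control still happens, but nobody has to file a ticket and nobody has to hover over another team's notebooks.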