Machine learning engineering is maturing
Feb 28 · 5 min read
Disclaimer: The following is based on my observations, not an academic survey of the industry. For context, I’m a contributor to Cortex, an open source machine learning platform (the “our” in this article’s title).
There are dozens of articles written comparing the relative merits of Python and R for data science, and this isn’t one of them.
Instead, this is an article about the divergence of data analysts and machine learning engineers, and their differing needs in a programming language.
The simple version is that machine learning engineers are, fundamentally, software engineers, and they use programming languages designed for software engineering—not statistics.
This may sound fairly obvious, but it represents a change in the machine learning ecosystem, one that is worth diving into further.
Python and R are both suited for data analysis
Comparisons of R and Python often highlight perceived advantages of either language that are, at best, marginal and subjective. While some believe R’s out-of-the-box statistical functions provide an advantage over Python, which requires the use of third-party libraries like NumPy, those differences don’t matter much in practice.
The simple truth is that R and Python are both completely adequate for the analysis of data.
For example, say you want to run a simple linear regression model on some data, like housing prices. In R, it would look something like this:
    square_feet <- c(1000, 1300, 942, 1423, 2189)
    price <- c(300000, 299000, 240000, 420000, 600322)

    correlation <- lm(price ~ square_feet)

    new_house <- data.frame(square_feet = 1100)
    new_house_price <- predict(correlation, new_house)
    print(new_house_price)
Here it is in Python:
    import pandas as pd
    import statsmodels.formula.api as smf

    data = {'square_feet': [1000, 1300, 942, 1423, 2189],
            'price': [300000, 299000, 240000, 420000, 600322]}
    housing_data = pd.DataFrame(data=data)

    # The formula interface fits an intercept, like R's lm(price ~ square_feet).
    # Plain sm.OLS(y, X) would fit a line through the origin unless a constant is added.
    model = smf.ols('price ~ square_feet', data=housing_data).fit()

    new_data = {'square_feet': [1400]}
    new_housing_data = pd.DataFrame(data=new_data)
    print(model.predict(new_housing_data))
The differences aren’t dramatic. Some people may feel particularly attached to the syntax of one language, or may prefer R’s most popular plotting library (ggplot2) over Matplotlib or other Python options. Others will point out that Python is more performant than R.
The reality is, if all you want to do is analyze data, either language will get the job done fine.
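To make concrete what the R example above computes, here is a minimal sketch of the same least-squares fit using only Python’s standard library. The function names `fit_line` and `predict_price` are hypothetical, chosen for illustration; the data is the housing data from the examples.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict_price(intercept, slope, square_feet):
    """Evaluate the fitted line at a given house size."""
    return intercept + slope * square_feet


square_feet = [1000, 1300, 942, 1423, 2189]
price = [300000, 299000, 240000, 420000, 600322]

intercept, slope = fit_line(square_feet, price)
print(predict_price(intercept, slope, 1100))
```

R’s `lm(price ~ square_feet)` estimates this same line and adds standard errors and other diagnostics on top. The arithmetic itself is not where the two languages differ.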
But machine learning engineering is about software—not business intelligence
The needs of a company that analyzes data to learn about its business (business intelligence, in other words) are different from those of a company for which machine learning is an actual part of the product.
As Adam Waksman, Head of Core Technology at Foursquare, explains:
“A lot of times when companies say they have a “data science team”, they mean they have an analytics support function. At Foursquare, where machine learning models are a big chunk of our product…. we think of data science as part of our product development team”
Waksman continues to explain that at Foursquare, “We don’t have a data science department — we have an engineering department that cuts across a lot of functions.”
The needs of machine learning engineers are different. Let’s look at a real example.
To build a customer service bot for your company, you’d probably deploy your model as a microservice, which would take customer input and return a response to be rendered within the bot’s frontend.
In building this API, you’d need to:
- Load your model. Whatever framework you used to train it almost certainly has native Python bindings.
- Use a framework for serving your API. Python has several options, Flask being the most popular, while R is stuck with just Plumber.
- Worry about things like parsing user input and, potentially, communicating with other services. This is more easily done in a general-purpose scripting language like Python.
In other words, machine learning engineers have to deal with engineering concerns, where Python is the better choice.
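The three steps above can be sketched without committing to a particular web framework. This is a minimal sketch under stated assumptions, not how any specific platform does it: `EchoModel`, `load_model`, and `handle_request` are hypothetical names, and the “model” is a stand-in that returns canned replies. In a Flask app, the route function would pass the raw request body to `handle_request` and return its result.

```python
import json


class EchoModel:
    """Stand-in for a trained model (hypothetical); a real one would be loaded from disk."""

    def predict(self, text):
        if "refund" in text.lower():
            return "I can help with refunds. What is your order number?"
        return "Could you tell me more about the issue?"


def load_model():
    # Step 1: load the model once at startup, not on every request.
    return EchoModel()


MODEL = load_model()


def handle_request(body):
    """Steps 2 and 3: parse user input, run the model, build a JSON response.

    Returns (status_code, response_body) so any web framework can wrap it.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "request body must be valid JSON"})

    message = payload.get("message") if isinstance(payload, dict) else None
    if not isinstance(message, str) or not message.strip():
        return 400, json.dumps({"error": "'message' must be a non-empty string"})

    reply = MODEL.predict(message)
    return 200, json.dumps({"reply": reply})


status, response = handle_request(b'{"message": "I want a refund"}')
print(status, response)
```

Most of the code is input validation and response formatting rather than modeling, which is the point: the work surrounding the model is ordinary software engineering.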
Machine learning is both a research and an engineering discipline
To understand the emergence of machine learning engineering, it is useful to look at what happened in a related field, web development.
In 2000, there was only one product that relied on asynchronous communication between the client and server: Outlook Web Access. The team at Microsoft working on Outlook Web Access was the same team that invented XMLHTTP, the technology that made background HTTP requests possible.
In other words, the only people who could build asynchronous apps were the people who invented the technology that enabled them.
Not long ago, the same was true of machine learning. The only companies building products with machine learning also had sizable machine learning research teams, like Google, Facebook, and Netflix.
However, it didn’t take long for the web development field to split into researchers and practitioners. While researchers still work on new technologies and frameworks—typically while employed by larger organizations—practitioners mostly use these inventions to build products.
A similar trend is happening in machine learning. Machine learning engineers are emerging as practitioners who build ML-powered products using state-of-the-art models and frameworks produced by large companies and research labs.
For example, Nick Walton built AI Dungeon, an ML-driven choose-your-own-adventure game, at a hackathon using a fine-tuned version of OpenAI’s GPT-2.
Just as most web developers don’t design their own database or framework, Walton did not invent his own model architecture. Instead, he used the outputs of machine learning researchers to build a new product.
Practitioners like Walton, who are focused on building software, need to work in a language suited to building software, not dashboards.
Machine learning is moving out of the lab and into products—and that means Python
Business intelligence and data analysis will always exist, and within those communities, R will remain a popular choice. ML engineering, however, has moved on.
More and more, we are seeing teams like Foursquare’s, for whom data science and machine learning are matters of product development and engineering. The people responsible for them aren’t data analysts; they’re engineers (in terms of responsibilities, not titles), and they use tools and languages familiar to software engineers, like Python.
R will always be a valid tool for generating dashboards and reports. Building a predictive ETA feature for your ridesharing app, a content recommendation engine for your streaming service, or a face recognizer for your photo app, however, is a job for machine learning engineers and Python.
We built Cortex for machine learning engineers because we were originally software engineers who wanted to use machine learning. Our concerns had less to do with designing new models, and more to do with engineering problems, like:
- What is the best language for integrating with popular ML frameworks? Every framework has native Python bindings.
- What language is best suited to writing request processing code? A general-purpose language like Python.
- What is the simplest microservice framework we could use for wrapping models in APIs? Flask, which of course is Python.
In other words, we built a platform for machine learning engineers, not data analysts, and that meant supporting Python over R.