Why our machine learning platform supports Python, not R

Category: IT Technology · Published: 5 years ago


Machine learning engineering is maturing

Caleb Kaiser

Feb 28 · 5 min read


Disclaimer: The following is based on my observations, not an academic survey of the industry. For context, I’m a contributor to Cortex, an open source machine learning platform (the “our” in this article’s title).

There are dozens of articles written comparing the relative merits of Python and R for data science, and this isn’t one of them.

Instead, this is an article about the divergence of data analysts and machine learning engineers, and their differing needs in a programming language.

The simple version is that machine learning engineers are, fundamentally, software engineers, and they use programming languages designed for software engineering—not statistics.

This may sound fairly obvious, but it represents a change in the machine learning ecosystem, one that is worth diving into further.

Python and R are both suited for data analysis

Comparisons of R and Python often highlight perceived advantages of either language that are, at best, marginal and subjective. While some believe R’s out-of-the-box statistical functions provide an advantage over Python, which requires third-party libraries like NumPy, those differences matter little in practice.

The simple truth is that R and Python are both completely adequate for the analysis of data.

For example, say you want to run a simple linear regression model on some data, like housing prices. In R, it would look something like this:

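# Training data: floor area and sale price
square_feet <- c(1000, 1300, 942, 1423, 2189)
price <- c(300000, 299000, 240000, 420000, 600322)

# Fit price as a linear function of square_feet (lm includes an intercept by default)
correlation <- lm(price ~ square_feet)

# Predict the price of a new 1,100 sq ft house
new_house <- data.frame(square_feet = 1100)
new_house_price <- predict(correlation, new_house)
print(new_house_price)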

Here it is in Python:

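import pandas as pd
import statsmodels.api as sm

data = {'square_feet': [1000, 1300, 942, 1423, 2189],
        'price': [300000, 299000, 240000, 420000, 600322]}
housing_data = pd.DataFrame(data=data)

# sm.OLS fits no intercept unless a constant column is added (R's lm adds one by default)
X = sm.add_constant(housing_data[['square_feet']])
model = sm.OLS(housing_data['price'], X).fit()

new_data = {'square_feet': [1400]}
new_housing_data = pd.DataFrame(data=new_data)
# has_constant='add' forces the constant column even for a single-row frame
print(model.predict(sm.add_constant(new_housing_data, has_constant='add')))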

The differences aren’t dramatic. Some people may feel particularly attached to the syntax of one language, or may prefer R’s ggplot2 plotting package over Matplotlib or other Python options. Others will point out that Python is more performant than R.

The reality is, if all you want to do is analyze data, either language will get the job done fine.
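The R snippet above fits an ordinary least squares line with an intercept. As a minimal sketch of that closed-form calculation in plain Python, using the same housing data (the helper name `fit_line` is hypothetical and not part of any library):

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return intercept, slope


square_feet = [1000, 1300, 942, 1423, 2189]
price = [300000, 299000, 240000, 420000, 600322]

intercept, slope = fit_line(square_feet, price)
new_house_price = intercept + slope * 1100  # same 1,100 sq ft house as the R example
print(round(new_house_price, 2))
```

In practice you would still reach for a library such as statsmodels, which also reports standard errors and diagnostics; the sketch only shows how little separates the two languages for a task like this.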

But machine learning engineering is about software—not business intelligence

The needs of a company that analyzes data to learn about its business (business intelligence, in other words) are different from those of a company for whom machine learning is an actual part of the product.

As Adam Waksman, Head of Core Technology at Foursquare, explains:

“A lot of times when companies say they have a “data science team”, they mean they have an analytics support function. At Foursquare, where machine learning models are a big chunk of our product…. we think of data science as part of our product development team”

Waksman continues to explain that at Foursquare, “We don’t have a data science department — we have an engineering department that cuts across a lot of functions.”

The needs of machine learning engineers are different. Let’s look at a real example.

To build a customer service bot for your company, you’d probably deploy your model as a microservice, which would take customer input and return a response to be rendered within the bot’s frontend.

In building this API, you’d need to:

Load your model. Whatever framework it was built with almost certainly has native Python bindings.
Use a framework for serving your API. Python has several options, Flask being the most popular, while R is largely limited to Plumber.
Worry about things like parsing user input and, potentially, communicating with other services. This is more easily done in a general-purpose scripting language like Python.

In other words, machine learning engineers have to deal with engineering concerns, where Python is the better choice.
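The steps above can be sketched as a small Flask service. This is only an illustration under stated assumptions: the `/reply` route, the `parse_message` helper, the `create_app` factory, and the `EchoModel` stand-in are all hypothetical names, and a real bot would load a trained model where `EchoModel` is used.

```python
def parse_message(payload):
    """Validate the request body and return the customer's message as a clean string."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("'message' must be a non-empty string")
    return message.strip()


class EchoModel:
    """Hypothetical stand-in for a real model; a real service would load trained weights."""

    def predict(self, message):
        return f"Thanks for reaching out. You said: {message}"


def create_app(model):
    """Wrap an already-loaded model in an HTTP API."""
    # Flask is imported here so the helpers above stay usable without it installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/reply", methods=["POST"])
    def reply():
        try:
            # silent=True yields None instead of an error page on malformed JSON
            message = parse_message(request.get_json(silent=True))
        except ValueError as err:
            return jsonify({"error": str(err)}), 400
        return jsonify({"reply": model.predict(message)})

    return app
```

To try it locally, one option is to call `create_app(EchoModel()).run(port=5000)` and POST a JSON body such as `{"message": "Where is my order?"}` to `/reply`. The model is passed in once at startup rather than loaded on every request, which is the usual pattern for this kind of microservice.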

Machine learning is both a research and an engineering discipline

To understand the emergence of machine learning engineering, it is useful to look at what happened in a related field, web development.

In 2000, there was only one product that relied on asynchronous communication between the client and server: Outlook Web Access. The team at Microsoft working on Outlook Web Access was the same team that invented XMLHTTP, the technology that made background HTTP requests possible.

In other words, the only people who could build asynchronous apps were the people who invented the technology that enabled them.

Not long ago, the same was true of machine learning. The only companies building products with machine learning also had sizable machine learning research teams, like Google, Facebook, and Netflix.

However, it didn’t take long for the web development field to split into researchers and practitioners. While researchers still work on new technologies and frameworks—typically while employed by larger organizations—practitioners mostly use these inventions to build products.

A similar trend is happening in machine learning. Machine learning engineers are emerging as practitioners who build ML-powered products using state-of-the-art models and frameworks produced by large companies and research labs.

For example, Nick Walton built AI Dungeon, an ML-driven choose-your-own-adventure game, at a hackathon using a fine-tuned version of OpenAI’s GPT-2.

Just as most web developers don’t design their own database or framework, Walton did not invent his own model architecture. Instead, he used the outputs of machine learning researchers to build a new product.

Practitioners like Walton, who are focused on building software, need to work in a language suited to building software, not dashboards.

Machine learning is moving out of the lab and into products—and that means Python

Business intelligence and data analysis will always exist, and within those communities, R will remain a popular choice. ML engineering, however, has moved on.

More and more, we are seeing teams like Foursquare, for whom data science and machine learning are matters of product development and engineering. The people responsible for them aren’t data analysts, they’re engineers (in terms of responsibilities, not titles), and they use tools and languages familiar to software engineers—like Python.

R will always be a valid tool for generating dashboards and reports. Building a predictive ETA feature for your ridesharing app, a content recommendation engine for your streaming service, or a face recognizer for your photo app, however, is a job for machine learning engineers and Python.

We built Cortex for machine learning engineers because we were originally software engineers who wanted to use machine learning. Our concerns had less to do with designing new models and more to do with engineering problems, like:

What is the best language for integrating with popular ML frameworks? Nearly every popular framework has native Python bindings.
What language is best suited to writing request processing code? A general-purpose language like Python.
What is the simplest microservice framework we could use for wrapping models in APIs? Flask, which of course is Python.

In other words, we built a platform for machine learning engineers, not data analysts, and that meant supporting Python over R.
