What the f?
A good hard look at the ‘f’ word of Machine Learning and why it can’t be ignored!
May 6 · 6 min read
I know you are eager to find out what this ‘f’ word actually is. Stay with me, we will get to it very soon. One thing I can tell you right away is that regardless of your familiarity with Machine Learning, understanding this ‘f’ word will help you understand what most of Machine Learning is all about.
Before that, let’s indulge in a little bit of role play. You are a Data Scientist, and your startup has tasked you with working alongside a marketing colleague to improve the sales of your company’s product. You have to advise the “marketing guy” on how to adjust the advertisement budget across three different media outlets: TV, Radio and Newspaper.
You take a look at the past data (Fig. 1) and you can tell with the naked eye that, clearly, how much money you put into advertising on each media outlet — TV, Radio and Newspaper — has an impact on the product’s sales.
As a Data Scientist, you would like to understand and explain how these three work together to influence sales. In other words, you would like to model sales as a function of the TV, Radio and Newspaper budgets. Yes, that’s our elusive ‘f’ word: function.
What does this ‘f’ mean?
Simply put, you can think of f as something that takes an input X and produces an output Y. A good analogy is a washing machine: you put dirty clothes (X) into the washing machine (f) and it gives you back washed clothes (Y).
In the context of product sales and advertisement media budget, the function f will take TV, Radio and Newspaper budgets, represented by X1, X2, X3 respectively, as input and return sales Y as output. (We represent X1, X2 and X3 in a combined form — as a vector X)
Spoiler Alert! Much of Machine Learning is actually just coming up with a good f that can take some input data and return a reliable output.
Why do we want this f?
There are 3 main reasons why we want to find a good f:
- With a good f we can input budgets for all 3 media and predict how much the sales would be.
- We can understand which predictors (the TV, Radio and Newspaper budgets) are important in influencing Y. We might find out that spending money on Newspapers is actually a waste because Newspaper ads do not boost sales by much.
- We might be able to understand how each predictor influences Y. For example, we might find that investing in TV ads is 5x more effective than investing in Newspaper ads.
Enough teasing… how do I find this f?
Before we can answer that question, we need to ask ourselves the following question:
Is there some perfect f out there in the wide, gorgeous Universe?
Well, maybe not a “perfect” f, but there is an ideal/optimal f. If we take a look at Fig. 2, we notice something curious: in some cases, a single point on the X-axis (Newspaper budget) has multiple corresponding Y (sales) values. For example, in the data plotted in Fig. 2, x = 6.4 has two corresponding values on the Y-axis: y = 11.9 and y = 17.3.
So an ideal function can simply be the average of all y values corresponding to a particular x. In other words, for the two points at x = 6.4 in the figure above:
f(6.4) = (11.9 + 17.3) / 2 = 14.6
In more ‘mathy’ terms, the average value of all Ys at a given X = x is called the conditional expected value, E(Y | X = x). Taking the average of all Y values at any X can therefore serve as our ‘ideal’ function, which can be expressed in the following way:
f(x) = E(Y | X = x)
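The averaging idea can be sketched in a few lines of Python. This is only an illustration: the function name `ideal_f` and the data list are assumptions, and only the two points at x = 6.4 come from the article; the other pairs are made up.

```python
# Hypothetical (newspaper budget, sales) pairs. Only the two points at
# x = 6.4 come from the article; the rest are made up for illustration.
observations = [
    (6.4, 11.9),
    (6.4, 17.3),
    (23.5, 10.0),
    (23.5, 14.0),
    (45.1, 12.5),
]


def ideal_f(x, data):
    """Average of all y values observed exactly at x (an estimate of E(Y | X = x))."""
    ys = [y for xi, y in data if xi == x]
    if not ys:
        # The plain averaging idea has nothing to say where no data exists.
        raise ValueError(f"no observations at x = {x}")
    return sum(ys) / len(ys)


print(round(ideal_f(6.4, observations), 2))  # average of 11.9 and 17.3
```

Note that the function raises an error for any x that never appears in the data, because there is nothing to average at that point.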
Okay… but then why do we need Machine Learning?
Sadly, because we live in the “real world”.
In the “real world”, we do not have all the data that we need to reliably estimate Y using the averaging idea we discussed above. Even for the sales-advertisement data, you can see that in Fig. 2, there is no corresponding Y value for x=77.5, x=95, x=110 etc.
One neat solution for this problem of missing data is to use the idea of a neighbourhood.
What it means is that instead of taking the average of Y values strictly at x=77.5, we can take the average of all values of Y that occur at points neighbouring x=77.5. A possible neighbourhood could be, say, from x=75 to x=80 (refer to the blue vertical lines in Fig. 3).
Our definition and notation change a little bit to reflect the idea that we are no longer restricted to values of Y occurring exactly at a given point X=x, but instead are now looking at Y values occurring in the neighbourhood of X=x.
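One common way to write this, borrowed from the textbooks listed in the references rather than from the original figures, is f(x) = Ave(Y | X ∈ N(x)), where N(x) is the neighbourhood of x. A minimal Python sketch of the idea follows; the function name, the `half_width` parameter and the data points are all hypothetical.

```python
def neighbourhood_f(x, data, half_width=2.5):
    """Average y over all observations whose x lies within half_width of the query x."""
    ys = [y for xi, y in data if abs(xi - x) <= half_width]
    if not ys:
        return None  # no observations in the neighbourhood, so no estimate
    return sum(ys) / len(ys)


# Hypothetical (newspaper budget, sales) pairs, made up for illustration.
observations = [(75.6, 10.0), (76.3, 14.0), (79.2, 12.0), (89.4, 20.0)]

# There is no observation exactly at x = 77.5, but three points fall
# inside the neighbourhood [75, 80], so their sales values are averaged.
estimate = neighbourhood_f(77.5, observations)
print(estimate)
```

The width of the neighbourhood is a choice: a wider window is more likely to contain data, but it averages over points that are less similar to the x being queried.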
This works fine until we run into two major problems:
- What happens when there are multiple predictors apart from just Newspaper budget (e.g. TV, Radio, Facebook ads, Google ads…)? In that case the problem expands into multiple dimensions (beyond just the x and y axes) and it gets increasingly difficult to define our precious little ‘neighbourhood’. (This problem has a badass name: The Curse of Dimensionality.)
- What happens when there is no data in the neighbouring area? For example in Fig. 3 there is no data from x=115 to x=145 and beyond.
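To get a feel for the first problem, here is a rough back-of-the-envelope sketch. It assumes something the article does not state: that every predictor is scaled to the range 0 to 1 and the data is spread uniformly. Under that assumption, a cube-shaped neighbourhood that should contain a given fraction of the data needs an edge length of fraction^(1/d) in d dimensions.

```python
def side_length_needed(fraction, dims):
    """Edge length of a cube-shaped neighbourhood that covers `fraction`
    of the volume of a unit hypercube with `dims` dimensions."""
    return fraction ** (1.0 / dims)


# How wide must the neighbourhood be to capture 10% of uniformly spread data?
for dims in (1, 2, 3, 10):
    print(dims, round(side_length_needed(0.10, dims), 2))
```

With one predictor the neighbourhood spans 10% of the axis, but with ten predictors it has to span roughly 79% of every axis, at which point it is hardly a ‘neighbourhood’ any more.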
Machine Learning to the rescue!
To avoid being constrained by the two problems mentioned above, we turn to Machine Learning to estimate this f instead. While there is a wide assortment of Machine Learning models to choose from, let’s consider a simple but effective one: a linear regression model. In a linear regression model, the inputs X1 (TV budget), X2 (Radio budget) and X3 (Newspaper budget) are multiplied by w1, w2 and w3 respectively and added together, along with an intercept w0, to obtain Y:
Y = w0 + w1·X1 + w2·X2 + w3·X3
In the equation above, w0, w1, w2, w3 are parameters whose values are learnt through training and fitting the model on the data. In other words, the values of these parameters change by ‘looking’ at the data and repeatedly making guesses that get better over time till we obtain a good enough f.
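The article does not say which learning procedure is used, so the sketch below picks one common option, batch gradient descent on the mean squared error, as an assumption. The data is synthetic: budgets are random numbers scaled to the range 0 to 1, and sales are generated without noise from hypothetical ‘true’ weights so that it is easy to check whether the guesses converge.

```python
import random


def predict(w, x):
    """Linear model: w0 + w1*x1 + w2*x2 + w3*x3."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def fit(X, y, lr=0.2, steps=3000):
    """Batch gradient descent on mean squared error (one possible learning procedure)."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # initial guess: all parameters zero
    for _ in range(steps):
        errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
        # Gradient of the mean squared error with respect to w0, then w1..wd.
        grad = [2 * sum(errors) / n]
        for j in range(d):
            grad.append(2 * sum(e * xi[j] for e, xi in zip(errors, X)) / n)
        # Nudge every parameter a little in the direction that reduces the error.
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Synthetic data: hypothetical weights [w0, w1 (TV), w2 (Radio), w3 (Newspaper)].
random.seed(0)
true_w = [3.0, 5.0, 2.0, 0.1]
X = [[random.random() for _ in range(3)] for _ in range(50)]
y = [predict(true_w, xi) for xi in X]

learned_w = fit(X, y)
print([round(wi, 3) for wi in learned_w])
```

Because the synthetic sales contain no noise, the learned parameters end up very close to the hypothetical weights used to generate the data. With real, noisy data the fit would only be approximate, and raw budgets would usually need scaling first for a fixed learning rate like this one to behave well.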
Conclusion
Which model(s) to choose for estimating f, how to carry out the learning procedure and what a “good enough” f means are non-trivial questions that Machine Learning practitioners investigate iteratively when working on a particular problem. Machine Learning practitioners often rely on experience, domain knowledge and empirical evidence to try to answer these questions. Nonetheless, regardless of the context and nature of a problem, finding a good f is what underlies much of prediction, inference and problem solving using Machine Learning.
References/Inspiration
- James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. New York: Springer, 2013.
- Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009.