Evaluating the Basics of Machine Learning
Machine learning (ML) is “an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.” ML algorithms are used to find patterns in data that generate insight and help make data-driven decisions and predictions. These types of algorithms are employed every day to make critical decisions in medical diagnosis, stock trading, transportation, legal matters and much more. It is easy to see, then, why data scientists place ML on such a high pedestal: it provides a medium for high-priority decisions that can guide better business and smarter actions, in real time and without human intervention.
Now, ML models do not necessarily ‘learn’ the way humans do. Rather, these algorithms use computational methods to understand information directly from data without relying on a predetermined equation as a model. To do this, the algorithms determine a pattern in the data and develop a target function that best maps an input variable, x, to a target variable, y. It must be noted here that the true form of the target function is usually unknown. If the function were known, then ML would not be needed.
Therefore, the idea is to determine the best estimate of this target function by conducting sound inference about the sample data, and then to apply and optimize the appropriate ML technique for the situation at hand. Different situations require that different assumptions be made about the form of the function being estimated. Additionally, different ML algorithms make different assumptions about the shape of the function and, thus, how it should be optimized. Understandably, it is easy to get overwhelmed by how much there is to learn with ML. So, in this post, I discuss two important topics in ML that every data scientist should know.
1. The Type of Learning
ML algorithms are often categorized as either supervised or unsupervised, and this broadly refers to whether the dataset being used is labelled or not. Supervised ML algorithms apply what has been learned in the past to new data by using labelled examples to predict future outcomes. Essentially, the correct answer is known for these types of problems, and the estimated model’s performance is judged based on whether or not the predicted output is correct. In contrast, unsupervised ML algorithms are those developed when the information used to train the model is neither classified nor labelled. These algorithms work by attempting to make sense of the data by extracting features and patterns found within the sample.
Now semi-supervised learning does exist, and it takes the middle ground between supervised and unsupervised learning. That is, a small portion of the data might be labelled, and the remainder is not.
Supervised learning is useful when the task given is a classification or regression problem. Classification problems refer to grouping observations or input data into discrete ‘classes’ based on particular criteria developed by the model. A typical example of this would be predicting whether an email is spam or non-spam. The model would be developed and trained on a dataset containing both spam and non-spam emails, where each observation is appropriately labelled.
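As a concrete illustration, here is a minimal sketch of a supervised spam classifier built with scikit-learn; the handful of emails and labels below are made up purely for demonstration.

```python
# A minimal sketch of supervised classification (spam vs. non-spam),
# using scikit-learn; the tiny labelled dataset below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["Win a free prize now", "Meeting moved to 3pm",
          "Claim your reward today", "Please review the attached report"]
labels = ["spam", "not spam", "spam", "not spam"]

# Convert the text into word counts, then fit a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["Claim your free prize"]))  # likely ['spam']
```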
Regression problems, on the other hand, refer to the process of accepting a set of input data and determining a continuous quantity as the output. A common example of this is predicting an individual’s income given their education level, gender, and the total number of hours worked.
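As a sketch of the regression case, the snippet below fits a linear model to a few invented rows of (education, gender, hours worked) data; the numbers are hypothetical and only illustrate the workflow.

```python
# A minimal sketch of supervised regression, assuming a hypothetical tabular
# dataset with education level, gender, and hours worked as predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: years of education, gender (0/1 encoded), hours worked per year.
X = np.array([[12, 0, 1800],
              [16, 1, 2000],
              [18, 0, 2200],
              [14, 1, 1600]])
y = np.array([35000, 55000, 72000, 40000])  # annual income (illustrative)

model = LinearRegression().fit(X, y)
print(model.predict([[16, 0, 2100]]))  # predicted income for a new individual
```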
Unsupervised learning is most appropriate when the answer to a particular question is more or less unknown. These algorithms are mainly used for clustering and anomaly detection, because it is possible to detect similarities across observations without knowing exactly what each observation refers to. For example, one can look at the colour, size, and shape of various flowers and then roughly separate them into groups without truly knowing the species of each flower. Additionally, consider a credit card company monitoring consumer behaviour. It would be possible to detect fraudulent transactions by monitoring where transactions occur. For example, suppose a credit card is frequently used in New York. If, on a particular day, the card is used in New York, Los Angeles and Hong Kong, then this could be considered an anomaly and the system should alert the relevant parties.
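To illustrate the flower example, the sketch below clusters the well-known Iris measurements with k-means while deliberately ignoring the species labels; the choice of three clusters is an assumption made only for demonstration.

```python
# A minimal sketch of unsupervised clustering: grouping flowers by their
# measurements without using the species labels (Iris data via scikit-learn).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # the labels are deliberately ignored

# Ask for three clusters; the algorithm groups similar flowers together.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster assignment for the first ten flowers
```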
2. Model Fitting
Fitting a model refers to making an algorithm determine the relationship between the predictors and the outcome so that future values can be predicted. Recall that models are developed using training data, which is ideally a large random sample that accurately reflects the population. This necessary step comes with some very undesirable risks. Fully accurate models are difficult to estimate because sample data are subject to random noise. This random noise, along with the number of assumptions made by the researcher, has the potential to cause ML models to learn spurious patterns within the data. If one tries to combat this risk by making too few assumptions, the model may not learn enough information from the data. These issues are known as overfitting and underfitting, and the goal is to strike an appropriate balance between simplicity and complexity.
Overfitting occurs when a model learns ‘too much’ from the training data, including the random noise. The model is then able to determine very intricate patterns within the data, but this negatively affects its performance on new data. The noise picked up in the training data does not apply to new or unseen data, and the model is unable to generalize the patterns it found. Certain ML models are more prone to overfitting than others, including nonlinear and nonparametric models. For these types of models, overfitting can be overcome by altering the model itself. Consider a nonlinear model that uses a polynomial of the 4th power. Overfitting can be reduced by lowering the power of the model to, say, the 3rd power, provided acceptable results are still produced. Alternatively, overfitting can be limited by applying cross-validation or by regularizing the model parameters.
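The sketch below illustrates that idea on synthetic data: the same pipeline is fit with a 4th-degree and a 3rd-degree polynomial, and comparing training and test scores gives a rough indication of whether the extra flexibility is overfitting.

```python
# A minimal sketch of taming overfitting by lowering a polynomial model's
# degree, using synthetic data and a scikit-learn pipeline.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = X.ravel() ** 3 + rng.normal(scale=2.0, size=40)  # cubic signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (4, 3):  # the more flexible model versus the simpler one
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # A large gap between training and test scores is a sign of overfitting.
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))
```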
Underfitting, on the other hand, occurs when a model is unable to learn a sufficient amount of information from the training data. The model is then unable to determine suitable patterns within the data, and this negatively affects its performance on new data. Since very little is learned, the model cannot apply much to unseen data and is unable to generalize observations for the research problem at hand. Commonly, underfitting is a result of model misspecification and can be fixed by using a more appropriate ML algorithm. For example, if a linear equation is used to estimate a nonlinear problem, underfitting will occur. That said, underfitting can also be corrected through cross-validation and parameter regularization.
Cross-validation is a technique used to evaluate a model’s fit by training several models on various subsets of the sample dataset and then evaluating each of them on the complementary subset of the training data.
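A minimal sketch of k-fold cross-validation with scikit-learn, using the Iris dataset and a logistic regression model purely as stand-ins:

```python
# A minimal sketch of k-fold cross-validation: the model is trained and
# scored on several complementary subsets of the data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # five train/validate splits
print(scores, scores.mean())  # per-fold accuracy and its average
```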
Regularization refers to the process of adding information to the model parameters in order to combat poor model performance. This can be done by specifying that a parameter follows a particular distribution, such as the normal distribution rather than a uniform distribution, or by restricting the range of values that a parameter must fall within.
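One common form of regularization is ridge regression, which penalizes large coefficients and thereby constrains the values the parameters can take; the sketch below compares it with ordinary least squares on synthetic data.

```python
# A minimal sketch of regularization: ridge regression penalizes large
# coefficients, pulling the estimated parameters towards zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # larger alpha means a stronger penalty

# The regularized coefficients are shrunk relative to the unpenalized ones.
print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())
```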
Machine learning models are extremely powerful, but with great power comes great responsibility. Developing the most appropriate ML model requires that the researcher adequately understands the problem at hand and which techniques are suitable given the circumstances. Understanding whether a problem is supervised or unsupervised provides insight into which type of ML algorithm should be used, while understanding model fit can prevent poor performance once the model is deployed. Happy modelling!
References:
machinelearningmastery.com/how-machine-learning-algorithms-work/
Other Useful Material:
simplilearn.com/importance-of-machine-learning-for-data-scientists-article
towardsdatascience.com/important-topics-in-machine-learning-you-need-to-know-21ad02cc6be5