The Bias-Variance Tradeoff Should be Considered For Every Model


What is the Bias-Variance Tradeoff?



Image by meineresterampe from Pixabay

Why Do Interviewers Ask About It?

Questions about the bias-variance tradeoff come up very frequently in interviews for data scientist positions. They often serve to separate a seasoned data scientist who knows their stuff from a junior one… and, more specifically, from one who is unfamiliar with their options for mitigating prediction error within a model.

So what is it again?

So, the bias-variance tradeoff… ever heard of it? If not, you'll want to tune in.

The bias-variance tradeoff is a simple idea, but one that should inform much of the statistical analysis & modeling that you do, primarily when it comes to eliminating error from predictions.

Where error comes into play

When you create a model, your model will have some error. Makes sense! Nothing new here; what is new is the idea that said error is actually made up of two things… you guessed it, bias & variance! Sorry to drill this in so hard, but the reason this matters is that once you understand the component pieces of your error, you can determine a plan to minimize it.
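For squared-error loss, this breakdown has a standard textbook form (the post doesn't write it out, but it's worth seeing once; the σ² term is irreducible noise that no modeling choice can remove):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$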

There are different methods and approaches you can take to manage and minimize bias or variance, but the act of doing so comes with considerations of its own. That is why it is so pivotal for you as a data scientist to understand the effects of each.

Let's break down bias

Bias represents the difference between our model's average prediction and the actual values.
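Formally (a standard definition, not spelled out in the original), bias is the gap between the model's average prediction and the true function it is trying to approximate:

$$\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)$$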

High bias vs low bias

A model with high bias is one that garners little from the data when generating predictions. A common phrase you might hear is that a high-bias model is 'over generalized'. It depends very little on the training data to determine its predictions, so when it comes to generating accurate predictions on your test data… it performs very poorly.

There may be assumptions implicit within our approach that lead to a lack of attention given to the features that would allow a model to generate predictions with greater performance.

Conversely, low bias represents a model whose predictions land close to the actuals. Bias, then, is clearly something we'd want to minimize.
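To make this concrete, here's a minimal sketch of a high-bias model. It assumes scikit-learn and a synthetic sine-curve dataset, neither of which comes from the original post: a straight line is simply too rigid to capture the curve, so it scores poorly on training and test data alike.

```python
# A minimal sketch of a high-bias (underfit) model, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)  # nonlinear signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line can't capture the sine curve: the model is 'over generalized'.
model = LinearRegression().fit(X_train, y_train)
print(f"train R^2: {model.score(X_train, y_train):.2f}")  # low
print(f"test R^2:  {model.score(X_test, y_test):.2f}")    # also low
```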

What does variance mean for your model?

Variance is pretty much what it sounds like: it has to do with the distribution of our predictions and how 'variable' they are. If you've ever heard the term 'overfitting', that is effectively a description of the outcomes of a high-variance model.
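The textbook counterpart to the bias definition above: variance measures how much the prediction at a point would wobble if we retrained the model on different samples of data:

$$\mathrm{Var}\big[\hat{f}(x)\big] = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]$$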

What happens is, very different from a high-bias model, a high-variance model is one that 'over depends', you could say, on your training data. In fact, that model may perform very well on its training data. It may fit the training data so well that it looks like an excellent model at first glance, but the moment you attempt to generalize it to your test data… it does so very poorly. The model is fit far too closely to your training data.
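Here's the mirror image of the earlier sketch, again assuming scikit-learn and synthetic data rather than anything from the original post: a 15th-degree polynomial on a small sample chases the noise, so the training score looks excellent while the test score collapses.

```python
# A minimal sketch of a high-variance (overfit) model, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(30, 1))  # small sample: easy to overfit
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A 15th-degree polynomial has enough flexibility to chase the noise.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print(f"train R^2: {model.score(X_train, y_train):.2f}")  # near-perfect
print(f"test R^2:  {model.score(X_test, y_test):.2f}")    # much worse
```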

Understanding the overlap between bias and variance

The image below is an excellent representation of models that are high or low in variance and bias. This concept has been visualized a million times and remains a staple for interpreting outcomes associated with the bias-variance tradeoff.

[Bullseye diagram: predictions scattered around a target for each combination of high/low bias and high/low variance]

Image credit to myself: Robert Wood

High Bias

Let's talk about situations in which bias is high: no matter how the predictions vary, the model is implicitly missing whatever signals it would need to interpret or leverage, and as a result it lands far from the bullseye.

Low Bias

In situations where bias is low, we can see that predictions are at least centered on the actuals; whether variable or not, we're directionally better off.

High Variance

With high variance, we see that the outcomes are all over the place, clearly overfitting to the data the model has seen before. While these outcomes may appear directionally correct, they lack generalizability to new data… which is typically the whole point of building a model.

Low Variance

In instances of low variance, the predictions themselves vary significantly less, clustering tightly together even if not necessarily on target.

Obviously each form of error occurs along a spectrum, but this visualization serves to cement the challenges of this tradeoff.

Why is it difficult to have both?

When it comes to the design of your model, you will be forced to make certain decisions, and implicit in those decisions lies the act of leaning in one direction or the other.

Let's say you are working with a random forest algorithm, and in an effort to improve performance, you begin tuning hyperparameters… one option being to let each tree grow deeper and sample more variables at each split. While this can buy you performance gains up to a point, what happens over time is that your model becomes far too familiar with the data it has seen, and any subsequent call to generate predictions will likely treat new data too much like the data it was trained on.
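As a rough sketch of what that tuning experiment might look like (scikit-learn and a synthetic dataset assumed; the specific hyperparameter values are illustrative, not from the original post), you can watch the train/test gap widen as the forest's trees get more capacity:

```python
# Illustrative sketch: growing tree capacity tends to widen the train/test gap.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=20, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):  # None = grow each tree until its leaves are pure
    rf = RandomForestRegressor(n_estimators=200, max_depth=depth, random_state=0)
    rf.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2 {rf.score(X_train, y_train):.2f}, "
          f"test R^2 {rf.score(X_test, y_test):.2f}")
```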

You can also think about this from the perspective of the number of variables included, especially categorical ones. The more inputs, the more a model may learn about your training data, but potentially the less it will be capable of generalizing to data it has never seen. Again we see the consideration one might need to make in favor of mitigating either bias or variance.
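One hedged way to see this effect (again a synthetic sketch of my own, not from the original post): pad a dataset with uninformative columns and watch the cross-validated score drift down as a flexible model starts splitting on noise.

```python
# Sketch: padding a model with uninformative features can hurt generalization.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

for n_noise in (0, 50, 500):
    # Append columns of pure noise alongside the 5 informative features.
    X_padded = np.hstack([X, rng.normal(size=(len(X), n_noise))]) if n_noise else X
    score = cross_val_score(DecisionTreeRegressor(random_state=0),
                            X_padded, y, cv=5).mean()
    print(f"{n_noise} noise features -> mean CV R^2: {score:.2f}")
```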

Conclusion

So, we've thrown a variety of definitions around and talked about how they play together… but what's the point of talking about this? I'd boil it all down to consideration. Without an awareness of the effects of model design on outcomes, and the ability to decompose our error, we have no recourse to improve.

You now have greater insight into how your model's design might affect its utility in the end. Use that insight, be methodical in your considerations, and build some awesome models!

I hope you enjoyed this! For more posts about machine learning, data science, and the like, visit me at datasciencelessons.com or follow me on Medium!

Happy data science-ing!

