The Bias-Variance Tradeoff Should be Considered For Every Model


What is the Bias-Variance Tradeoff?



Image by meineresterampe from Pixabay

Why Do Interviewers Ask About It?

Questions about the bias-variance tradeoff come up very frequently in interviews for data scientist positions. They often serve to separate a seasoned data scientist who knows their stuff from a junior one… and, more specifically, from one who is unfamiliar with their options for mitigating prediction error within a model.

So what is it again?

So, the bias-variance tradeoff… ever heard of it? If not, you'll want to tune in.

The bias-variance tradeoff is a simple idea, but one that should inform much of the statistical analysis & modeling that you do, primarily when it comes to eliminating error from predictions.

Where error comes into play

When you create a model, your model will have some error. Makes sense! Nothing new here; what is new is the idea that said error is actually made up of two things… you guessed it, bias & variance! Sorry to drill this in so hard, but the reason this matters is that once you understand the component pieces of your error, you can determine a plan to minimize it.
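For squared-error loss, this breakdown has a standard textbook form (the post doesn't write it out, but it's worth seeing once; the σ² term is irreducible noise that no modeling choice can remove):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$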

There are different methods and approaches you can take to manage and minimize bias or variance, but the act of doing so comes with considerations of its own. That is why it is so pivotal for you as a data scientist to understand the effects of each.

Let's break down bias

Bias represents the difference between our model's average prediction and the actual values.
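Formally (a standard definition, not spelled out in the original), bias is the gap between the model's average prediction and the true function it is trying to approximate:

$$\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)$$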

High bias vs low bias

A model with high bias is one that garners little from the data when generating predictions. A common phrase you might hear is that a high-bias model is 'over generalized'. It depends very little on the training data to determine its predictions, so when it comes to generating accurate predictions on your test data… it performs very poorly.

There may be assumptions implicit within our approach that lead to a lack of attention given to the features that would allow a model to generate predictions with greater performance.

Conversely, low bias represents a model whose predictions land close to the actuals. Bias, then, is clearly something we'd want to minimize.
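To make this concrete, here's a minimal sketch of a high-bias model. It assumes scikit-learn and a synthetic sine-curve dataset, neither of which comes from the original post: a straight line is simply too rigid to capture the curve, so it scores poorly on training and test data alike.

```python
# A minimal sketch of a high-bias (underfit) model, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)  # nonlinear signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line can't capture the sine curve: the model is 'over generalized'.
model = LinearRegression().fit(X_train, y_train)
print(f"train R^2: {model.score(X_train, y_train):.2f}")  # low
print(f"test R^2:  {model.score(X_test, y_test):.2f}")    # also low
```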

What does variance mean for your model?

Variance is pretty much what it sounds like: it has to do with the distribution of our predictions and how 'variable' they are. If you've ever heard the term 'overfitting', that is effectively a description of the outcomes of a high-variance model.
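The textbook counterpart to the bias definition above: variance measures how much the prediction at a point would wobble if we retrained the model on different samples of data:

$$\mathrm{Var}\big[\hat{f}(x)\big] = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]$$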

What happens is, very different from a high-bias model, a high-variance model is one that 'over depends', you could say, on your training data. In fact, that model may perform very well on its training data. It may fit the training data so well that it looks like an excellent model at first glance, but the moment you attempt to generalize it to your test data… it does so very poorly. The model is fit far too closely to your training data.
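Here's the mirror image of the earlier sketch, again assuming scikit-learn and synthetic data rather than anything from the original post: a 15th-degree polynomial on a small sample chases the noise, so the training score looks excellent while the test score collapses.

```python
# A minimal sketch of a high-variance (overfit) model, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(30, 1))  # small sample: easy to overfit
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A 15th-degree polynomial has enough flexibility to chase the noise.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print(f"train R^2: {model.score(X_train, y_train):.2f}")  # near-perfect
print(f"test R^2:  {model.score(X_test, y_test):.2f}")    # much worse
```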

Understanding the overlap between bias and variance

The image below is an excellent representation of models that are high or low in variance and bias. This concept has been visualized a million times and remains a staple for interpreting outcomes associated with the bias-variance tradeoff.

[Bullseye diagram: predictions scattered around a target for each combination of high/low bias and high/low variance]

Image credit to myself: Robert Wood

High Bias

Let's talk about situations in which bias is high: no matter how the predictions vary, the model is implicitly missing whatever signals it would need to interpret or leverage, and as a result it lands far from the bullseye.

Low Bias

In situations where bias is low, we can see that predictions are at least centered on the actuals; whether variable or not, we're directionally better off.

High Variance

With high variance, we see that the outcomes are all over the place, clearly overfitting to the data the model has seen before. While these outcomes may appear directionally correct, they lack generalizability to new data… which is typically the whole point of building a model.

Low Variance

In instances of low variance, the predictions themselves vary significantly less, clustering tightly together even if not necessarily on target.

Obviously each form of error occurs along a spectrum, but this visualization serves to cement the challenges of this tradeoff.

Why is it difficult to have both?

When it comes to the design of your model, you will be forced to make certain decisions, and implicit in those decisions lies the act of leaning in one direction or the other.

Let's say you are working with a random forest algorithm, and in an effort to improve performance, you begin tuning hyperparameters… one option being to let each tree grow deeper and sample more variables at each split. While this can buy you performance gains up to a point, what happens over time is that your model becomes far too familiar with the data it has seen, and any subsequent call to generate predictions will likely treat new data too much like the data it was trained on.
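As a rough sketch of what that tuning experiment might look like (scikit-learn and a synthetic dataset assumed; the specific hyperparameter values are illustrative, not from the original post), you can watch the train/test gap widen as the forest's trees get more capacity:

```python
# Illustrative sketch: growing tree capacity tends to widen the train/test gap.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=20, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):  # None = grow each tree until its leaves are pure
    rf = RandomForestRegressor(n_estimators=200, max_depth=depth, random_state=0)
    rf.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2 {rf.score(X_train, y_train):.2f}, "
          f"test R^2 {rf.score(X_test, y_test):.2f}")
```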

You can also think about this from the perspective of the number of variables included, especially categorical ones. The more inputs, the more a model may learn about your training data, but potentially the less it will be capable of generalizing to data it has never seen. Again we see the consideration one might need to make in favor of mitigating either bias or variance.
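One hedged way to see this effect (again a synthetic sketch of my own, not from the original post): pad a dataset with uninformative columns and watch the cross-validated score drift down as a flexible model starts splitting on noise.

```python
# Sketch: padding a model with uninformative features can hurt generalization.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

for n_noise in (0, 50, 500):
    # Append columns of pure noise alongside the 5 informative features.
    X_padded = np.hstack([X, rng.normal(size=(len(X), n_noise))]) if n_noise else X
    score = cross_val_score(DecisionTreeRegressor(random_state=0),
                            X_padded, y, cv=5).mean()
    print(f"{n_noise} noise features -> mean CV R^2: {score:.2f}")
```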

Conclusion

So, we've thrown a variety of definitions around and talked about how they play together… but what's the point of talking about this? I'd boil it all down to consideration. Without an awareness of the effects of model design on outcomes, and the ability to decompose our error, we have no recourse to improve.

You now have greater insight into how your model's design might affect its utility in the end. Use that insight, be methodical in your considerations, and build some awesome models!

I hope you enjoyed this! For more posts about machine learning, data science, and the like, visit me at datasciencelessons.com or follow me on Medium!

Happy data science-ing!

