What is the Bias-Variance Tradeoff?
Jun 17 · 5 min read
Why Do Interviewers Ask About It?
Questions about the bias-variance tradeoff come up very frequently in interviews for data scientist positions. They often serve to separate a seasoned data scientist who knows their stuff from a junior one… and more specifically, from one who is unfamiliar with the options for mitigating prediction error in a model.
So what is it again?
So, the bias-variance tradeoff… ever heard of it? If not, you’ll want to tune in.
The bias-variance tradeoff is a simple idea, but one that should inform much of the statistical analysis & modeling that you do, primarily when it comes to reducing error in your predictions.
Where error comes into play
When you create a model, your model will have some error. Makes sense! Nothing new here; what is new is the idea that said error is actually made up of two things… you guessed it, bias & variance! (Formally, expected squared error decomposes into squared bias, variance, and an irreducible noise term that no model can eliminate.) Sorry to drill this in so hard, but the reason this matters is that once you understand the component pieces of your error, you can determine a plan to minimize it.
There are different methods and approaches you can take to manage and minimize bias or variance, but each comes with tradeoffs of its own. That is why it is so pivotal for you as a data scientist to understand the effects of both.
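To make the decomposition concrete, here is a minimal simulation sketch (my own illustration, not from the original post): it refits a model on many freshly drawn training sets, then measures how far the average prediction sits from the truth (squared bias) and how much the predictions scatter around their own average (variance). The true function, noise level, and model choice are arbitrary assumptions for the demo.

```python
# Estimate bias^2 and variance of a model by refitting on resampled data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)  # assumed "ground truth" function

x_test = np.linspace(0, 1, 50)
n_rounds, preds = 200, []

for _ in range(n_rounds):
    # Draw a fresh noisy training set each round
    x_train = rng.uniform(0, 1, 30)
    y_train = true_f(x_train) + rng.normal(0, 0.3, 30)
    model = DecisionTreeRegressor(max_depth=8)  # flexible: low bias, high variance
    model.fit(x_train.reshape(-1, 1), y_train)
    preds.append(model.predict(x_test.reshape(-1, 1)))

preds = np.array(preds)            # shape: (n_rounds, len(x_test))
avg_pred = preds.mean(axis=0)

bias_sq = np.mean((avg_pred - true_f(x_test)) ** 2)  # squared bias
variance = np.mean(preds.var(axis=0))                # average variance

print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Try swapping `max_depth=8` for `max_depth=1`: the variance term shrinks and the bias term grows, which is the tradeoff in miniature.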
Let's break down bias
Bias represents the difference between our model’s average prediction and the actual values.
High bias vs low bias
A model with high bias is one that garners little from the data when generating predictions. A common phrase you might hear is that a high bias model is ‘over-generalized’. It depends very little on the training data to determine its predictions, so when it comes to generating accurate predictions on your test data… it performs very poorly.
There may be assumptions implicit in our approach that lead to a lack of attention to the features that would allow the model to generate predictions with greater performance.
Conversely, low bias describes a model whose predictions are, on average, right on top of the actuals. Bias, then, is clearly something we’d want to minimize.
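If a quick illustration helps, here is a minimal, hypothetical sketch of high bias in action: a straight line fit to curved data misses the signal, so its error stays high on the training data and test data alike. The data-generating function and noise level are assumptions for the demo.

```python
# High bias (underfitting): a linear model on nonlinear data fails everywhere.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)  # assumed nonlinear signal

X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

line = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, line.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, line.predict(X_test)))
# Both errors stay high: the model is too simple to capture the signal.
```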
What does variance mean for your model?
Variance is pretty much what it sounds like: it has to do with the spread of our predictions and how ‘variable’ they are. If you’ve ever heard the term ‘overfitting’, that is effectively a description of what a high variance model produces.
What happens, very differently from a high bias model, is that a high variance model ‘over-depends’, you could say, on your training data. In fact, that model may perform very well on its training data. It may be fit so well to the training data that it looks like an excellent model at first glance, but the moment you attempt to generalize your model to your test data… it does so very poorly. The model is fit far too closely to your training data.
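Here is the mirror image of the previous sketch, again a hypothetical illustration rather than anything from the original post: a very flexible polynomial model memorizes the training points but generalizes badly.

```python
# High variance (overfitting): a degree-15 polynomial memorizes training data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
X = x.reshape(-1, 1)
# Interleave points into train and test sets
X_train, X_test, y_train, y_test = X[::2], X[1::2], y[::2], y[1::2]

wiggly = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
wiggly.fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, wiggly.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, wiggly.predict(X_test)))
# Train error is near zero while test error blows up: the hallmark of variance.
```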
Understanding the overlap between bias and variance
The classic bullseye image is an excellent representation of the four combinations of high and low bias and variance: each target shows predictions as shots aimed at the bullseye, which represents the actual values. This concept has been visualized a million times and remains a staple for interpreting outcomes associated with the bias-variance tradeoff.
High Bias
Let’s talk about situations in which bias is high: no matter how much the predictions vary, the model is implicitly missing whatever signals it might need to interpret or leverage, and as a result it finds itself far from the bullseye.
Low Bias
In situations where bias is low, we can see that predictions are at least centered on the actuals. Whether variable or not, we’re directionally better off.
High Variation
With high variation, we see that the outcomes are scattered all over the place, a clear sign of overfitting to the data the model has seen before. While these outcomes may appear directionally correct, they lack generalizability to new data… which should typically be the purpose behind building any model.
Low Variation
In instances of low variation, we can see that the predictions themselves vary significantly less.
Obviously each form of error occurs along a spectrum, but this visualization serves to cement the challenges of this tradeoff.
Why is it difficult to have both?
When it comes to the design of your model, you will be forced to make certain decisions, and implicit in those decisions lies the act of leaning in one direction or the other.
Let’s say you are working with a random forest algorithm, and in an effort to improve performance you begin tuning hyperparameters… for instance, growing deeper trees and sampling more variables at each split. While this will give you certain performance gains up to a point, what happens over time is that your model becomes far too familiar with the data it has seen, and any subsequent call to generate predictions will likely treat new data too much like the data it was trained on.
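Here is a minimal, hypothetical sketch of that effect using scikit-learn; the dataset and the particular parameter values are arbitrary assumptions, not anything prescribed by the article. As the forest’s trees are allowed to grow deeper, the gap between training and test performance widens.

```python
# Deeper trees in a random forest: training fit improves faster than test fit.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=20, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):  # None lets trees grow fully
    rf = RandomForestRegressor(max_depth=depth, n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train R^2={rf.score(X_train, y_train):.2f}, "
          f"test R^2={rf.score(X_test, y_test):.2f}")
```

Watching the train/test gap like this, rather than training performance alone, is the practical way to spot when tuning has started trading bias for variance.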
You can also think about this from the perspective of the number of variables that are included, especially categorical ones. The more inputs, the more a model may learn about your training data, but potentially the less it will be capable of generalizing to data it has never seen; the sketch below illustrates this. Again we see the consideration one might need to make in favor of mitigating either bias or variance.
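A minimal, hypothetical sketch of that point (the feature counts and data-generating process are assumptions for the demo): padding a dataset with irrelevant noise features lets a flexible model fit the training set just as tightly while its test performance degrades.

```python
# Adding uninformative features: training fit stays perfect, test fit suffers.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 300
X_signal = rng.normal(size=(n, 3))  # 3 genuinely informative features
y = X_signal @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, n)

for n_noise in (0, 50):
    X = np.hstack([X_signal, rng.normal(size=(n, n_noise))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    print(f"{3 + n_noise:>3} features: "
          f"train R^2={tree.score(X_tr, y_tr):.2f}, "
          f"test R^2={tree.score(X_te, y_te):.2f}")
```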
Conclusion
So, we’ve thrown a variety of definitions around and talked about how they play together… but what’s the point of talking about this? I’d boil it all down to consideration. Without an awareness of the effects of model design on outcomes, and without the ability to break down our error, we have no recourse to improve.
You now have greater insight into how your model’s design might affect its utility in the end. Use that insight, be methodical in your considerations, and build some awesome models!
I hope you enjoyed this. For more posts about machine learning, data science, and the like, visit me at datasciencelessons.com or follow me on Medium!
Happy data science-ing!