Visualize Missing Values with Missingno

Category: IT Technology · Published: 4 years ago

Explore the missing values in your dataset.

Data is the new fuel. However, raw data is cheap; we need to process it well to get the most value out of it. Complex, well-structured models are only as good as the data we feed them. Thus, data needs to be cleaned and processed thoroughly in order to build robust and accurate models.

One of the issues that we are likely to encounter in raw data is missing values. Consider a case where we have features (columns in a dataframe) on some observations (rows in a dataframe). If we do not have the value for a particular row-column pair, then we have a missing value. We may have only a few missing values, or half of an entire column may be missing. In some cases, we can just ignore or drop the rows or columns with missing values. On the other hand, there might be cases in which we cannot afford to drop even a single missing value. In any case, the process of handling missing values starts with exploring them in the dataset.
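
To make the row-column picture concrete, here is a minimal sketch of the "drop" option using pandas. The dataframe, its column names, and its values are made up for illustration and are not taken from the dataset used in this post.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical data): np.nan marks a missing row-column pair
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2001, np.nan, 2010, 2015],
    "rating": [np.nan, np.nan, 7.5, np.nan],
})

# Drop every row that contains at least one missing value
rows_dropped = toy.dropna(axis=0)

# Drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Even in this tiny example, dropping rows leaves a single observation, which illustrates why it is worth looking at where values are missing before deciding how to handle them.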

Pandas provides functions to check the number of missing values in a dataset. The missingno library goes one step further and shows the distribution of missing values through informative visualizations. With the plots of missingno, we can see where the missing values are located in each column and whether the missing values of different columns are correlated. Because exploring missing values should come before handling them, I consider missingno a highly valuable asset in the data cleaning and preprocessing steps.
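
As a rough sketch of the pandas side of this, the snippet below counts missing values per column and computes how strongly missingness in one column goes together with missingness in another. It uses a small made-up dataframe with hypothetical column names, not the dataset from this post.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical data) with missing values in every column
toy = pd.DataFrame({
    "year": [2001, np.nan, 2010, np.nan, 2020],
    "rating": [7.1, np.nan, 6.4, np.nan, 8.0],
    "runtime": [np.nan, 95, 102, 88, 110],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of missing values in each column (between 0 and 1)
missing_share = toy.isna().mean()

# Correlation between the missing/not-missing indicators of the columns:
# values near 1 mean two columns tend to be missing in the same rows
nullity_corr = toy.isna().corr()

print(missing_counts)
print(missing_share)
print(nullity_corr)
```

The counts tell us how much is missing; the indicator correlation hints at whether columns are missing together, which is the kind of pattern that is easier to read from a plot than from a table.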

In this post, we will explore the functionality of missingno's plots by going through some examples.

Let’s first explore a dataset about movies on streaming platforms. The dataset is available on Kaggle.

import numpy as np
import pandas as pd

# Load the dataset and take a first look at its size and first rows
df = pd.read_csv("/content/MoviesOnStreamingPlatforms.csv")
print(df.shape)
df.head()
