Accelerate your Exploratory Data Analysis with Pandas Profiling

Category: IT Technology · Published: 4 years ago

Exploratory Data Analysis is tedious. Automate the process and generate detailed interactive reports with a single line of code using Pandas-Profiling.

Apr 19 · 8 min read

Photo by Lukas Blazek on Unsplash

When starting a new data science project, the first step after getting your hands on the data set is to understand it. We achieve this by performing Exploratory Data Analysis (EDA). This includes finding out the data type of each variable, the distribution of the target variable, the number of distinct values for each predictor variable, whether there are any duplicate or missing values in the data set, and so on.

If you have ever done EDA on any data set (and I assume you have, as you are reading this article), I don’t need to tell you how time-consuming this process can be. And if you have been part of many data science projects (be it in your job or through personal projects), you know how repetitive these steps can be. But with the open-source library Pandas-profiling, that doesn’t have to be the case anymore.

What is Pandas-Profiling?

Photo by Juan Rumimpunu on Unsplash

Pandas-profiling is an open-source library that can generate beautiful interactive reports for any data set with just a single line of code. Sounds interesting? Let’s take a look at the documentation to get a better understanding of what it does.

Pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics — if relevant for the column type — are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a data frame.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, inter-quartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables (Spearman, Pearson and Kendall matrices)
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
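
To make the list above more concrete, here is a minimal sketch of how a few of these per-column statistics could be computed by hand with plain pandas. It is only meant to show what the report automates, not how pandas-profiling computes them internally, and the `ages` values are made up for illustration.

```python
import pandas as pd

# Toy values made up for illustration (not taken from a real data set).
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4], name="Age")

# Quartiles, using pandas' default linear interpolation.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])

stats = {
    "minimum": ages.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": ages.max(),
    "range": ages.max() - ages.min(),
    "IQR": q3 - q1,
    "mean": ages.mean(),
    "std": ages.std(),  # sample standard deviation (ddof=1)
    "coefficient of variation": ages.std() / ages.mean(),
    "MAD": (ages - ages.median()).abs().median(),  # median absolute deviation
    "skewness": ages.skew(),
    "kurtosis": ages.kurt(),
    "sum": ages.sum(),
    "distinct": ages.nunique(),
    "missing": ages.isna().sum(),
}

for name, value in stats.items():
    print(f"{name:>26}: {float(value):.3f}")
```

Doing this for every column of every new data set is exactly the repetitive work the report takes over.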

Now that we know what pandas-profiling is all about, let’s see how to install it and use it in a Jupyter Notebook or in Google Colab in the following section.

Install Pandas-profiling:

Using pip

You can install pandas-profiling very easily using the pip package manager with the following command:

pip install pandas-profiling[notebook,html]

Alternatively, you could install the latest version directly from Github:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using Conda

If you are using conda, you can install it with the following command:

conda install -c conda-forge pandas-profiling

Installation in Google Colab

Google Colab comes with Pandas-profiling pre-installed, but unfortunately it is an older version (v1.4). If you are following this article or the GitHub documentation, the code will not run on Google Colab unless you install the latest version of the library (v2.6).

To do that, you need to first uninstall the existing library and install the latest one as follows:

# To uninstall
!pip uninstall pandas_profiling

Now to install, we need to run the pip install command.

!pip install pandas-profiling[notebook,html]
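
After reinstalling, it can be worth checking which version Python actually sees. The snippet below is a small sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is my own and not part of pandas-profiling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "pandas-profiling" is the distribution name used with pip in this article.
print("pandas-profiling:", installed_version("pandas-profiling"))
```

If this prints `None`, the package is not installed in the environment the notebook is running in.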

Generate Reports:

Photo by Kevin Ku on Unsplash

Now that we are done with the prerequisites, let’s get into the fun part: analyzing a data set.

The data set I will be using for this example is the Titanic data set.

Load the libraries:

import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

Import the data

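# Download the CSV once and reuse the local copy on later runs
file = cache_file("titanic.csv",
                  "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Load the cached file into a DataFrame
data = pd.read_csv(file)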

Loading the dataset
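
The `cache_file` helper used above downloads the file only if a local copy does not exist yet. If you want the same behaviour without that helper, a rough stand-in can be written with the standard library. This is a sketch of the idea, not the library's implementation; the function name `fetch_once` and the default `data` directory are assumptions of mine.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_once(file_name: str, url: str, cache_dir: Path = Path("data")) -> Path:
    """Download `url` to cache_dir/file_name unless that file already exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / file_name
    if not target.exists():
        # Only hit the network (or disk, for file:// URLs) on the first call.
        urlretrieve(url, target)
    return target
```

It would be used the same way as in the article: pass a file name and the CSV URL, then hand the returned path to `pd.read_csv`.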

Generate report:

To generate the report, run the following code in the notebook.

profile = ProfileReport(data, title="Titanic Dataset", html={'style': {'full_width': True}}, sort="None")

Generate report

That’s it. With a single line of code you have generated a detailed profile report. Now let us see the results by including the report in the notebook.

Include the report in Notebook as IFrame

profile.to_notebook_iframe()

This will include the interactive report as an HTML iframe in the notebook.

Saving the report

Save the report as an HTML file using the following code:

profile.to_file(output_file="your_report.html")

Or obtain the data as JSON using:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file(output_file="your_report.json")

The Results:

Now that we know how to generate reports using pandas-profiling, let’s look at the result.

Overview:

Overview

Warnings

Pandas-profiling creates a very descriptive overview of the data set, calculating the total number of missing cells and duplicate rows, as well as the number of distinct values, missing values and zeros for each variable. It also flags variables that have high cardinality or missing values in the Warnings section, as you can see in the image above.

Besides all this, it generates a detailed analysis for each variable. I will go through some of them in this article. To see the full report with all the code, find the Colab link at the end of the article.
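
For comparison, the overview figures described above can be approximated by hand with a few pandas calls. This is a sketch on a toy DataFrame whose rows are invented for illustration (only the column names mimic the Titanic data), and it is not the code pandas-profiling runs internally.

```python
import pandas as pd

# Toy rows invented for illustration; column names mimic the Titanic data.
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 0, 1],
        "Age": [22.0, 38.0, None, 35.0, 35.0, None],
        "Fare": [7.25, 71.28, 0.0, 8.05, 8.05, 0.0],
        "Cabin": [None, "C85", None, None, None, "E46"],
    }
)

# Data-set level numbers.
overview = {
    "variables": df.shape[1],
    "observations": df.shape[0],
    "missing cells": int(df.isna().sum().sum()),
    "duplicate rows": int(df.duplicated().sum()),
}

# Per-variable numbers.
per_column = pd.DataFrame(
    {
        "distinct": df.nunique(),  # missing values are not counted as a value
        "missing": df.isna().sum(),
        "missing %": df.isna().mean() * 100,
        "zeros": (df == 0).sum(),
    }
)

print(overview)
print(per_column)
```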

Class distribution:

Numerical Features:

For numerical features, besides detailed statistics like the mean, standard deviation, min, max and interquartile range (IQR), it also plots a histogram and lists the common and extreme values.
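
As a rough illustration of what sits behind the histogram and the common and extreme value lists, here is a minimal sketch with pandas and NumPy. The `fare` values are invented for illustration, and the choice of four bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy fares invented for illustration.
fare = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 8.05, 51.86, 21.07, 8.05, 30.07])

# Bin counts and bin edges: the numbers a histogram is drawn from.
counts, edges = np.histogram(fare.to_numpy(), bins=4)

common = fare.value_counts().head(3)  # most frequent values
smallest = fare.nsmallest(3)          # extreme values, low end
largest = fare.nlargest(3)            # extreme values, high end

print("histogram counts:", counts.tolist())
print("most common value:", common.index[0], "seen", int(common.iloc[0]), "times")
print("smallest:", smallest.tolist())
print("largest:", largest.tolist())
```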

Categorical Features:

Similar to the numerical features, for categorical features it calculates common values, lengths, characters, etc.
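
The same kind of summary for a text column can be sketched with pandas string methods. The `cabin` values below are invented for illustration, and the three summaries shown (common values, string lengths, character counts) are my approximation of what the report lists, not its actual implementation.

```python
import pandas as pd

# Toy values invented for illustration.
cabin = pd.Series(["C85", "C123", "E46", None, "C85", "G6"], name="Cabin")

values = cabin.dropna()

common = values.value_counts()  # how often each category occurs
lengths = values.str.len()      # length of every string value

# Frequency of each individual character across all values.
all_chars = "".join(values)
chars = pd.Series(list(all_chars)).value_counts()
uppercase = sum(ch.isupper() for ch in all_chars)
digits = sum(ch.isdigit() for ch in all_chars)

print(common)
print("length min/max/mean:", lengths.min(), lengths.max(), lengths.mean())
print("uppercase letters:", uppercase, "digits:", digits)
```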

Interactions:

One of the most interesting parts of the report is the Interactions and Correlations sections. In the Interactions section, the pandas_profiling library automatically generates interaction plots for every pair of variables. You can get the interaction plot of any pair by selecting the specific variables from the two headers (in this example, I have selected PassengerId and Age).

Correlation Matrix:

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", etc. Correlation can tell you just how much of the variation in people’s weights is related to their heights.

The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger, the other tends to get larger. If r is negative, it means that as one gets larger, the other tends to get smaller (often called an “inverse” correlation).

When it comes to generating the correlation matrix for all the numerical features, the pandas_profiling library gives us all the popular options to choose from, including Pearson’s r, Spearman’s ρ, etc.

Correlations
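
To see how the coefficients behave, here is a small sketch using pandas' own `DataFrame.corr`. The columns are synthetic: two are exact linear functions of `x`, and one is a cubic, which always increases with `x` but not in a straight line.

```python
import pandas as pd

x = pd.Series(range(1, 11))
df = pd.DataFrame(
    {
        "x": x,
        "linear_up": 2 * x + 1,  # perfect positive linear relationship
        "linear_down": -3 * x,   # perfect negative linear relationship
        "cubic": x ** 3,         # monotonic but not linear
    }
)

pearson = df.corr(method="pearson")    # measures linear association
spearman = df.corr(method="spearman")  # measures monotonic (rank) association

print(pearson["x"].round(3))
print(spearman["x"].round(3))
```

Pearson’s r is +1 and -1 for the two linear columns but drops below 1 for the cubic one, while Spearman’s ρ stays at 1 because the ranks still line up perfectly. That difference is why it is useful to have more than one correlation matrix in the report.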

Now that we know the advantages of using pandas_profiling, it is also useful to note the main disadvantage of this library.

Disadvantage:

The main disadvantage of pandas-profiling shows up with large data sets: as the size of the data grows, the time needed to generate the report grows considerably.

One way to solve this problem is to generate the profile report for a part of the data set. But while doing this, it is very important to make sure that the data is randomly sampled so that it is representative of all the data we have. We can do this by:

from pandas_profiling import ProfileReport

# Generate report for 10000 data points (data must have at least 10000 rows)
profile = ProfileReport(data.sample(n=10000), title="Titanic Data set", html={'style': {'full_width': True}}, sort="None")
# Save to file
profile.to_file(output_file='10000datapoints.html')
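
One caveat with the sampling step: `DataFrame.sample` raises a `ValueError` when `n` is larger than the number of rows (unless sampling with replacement), and without a seed every run draws a different sample. The sketch below guards against both; the helper name `sample_for_profiling` and the seed value are my own choices, and the two DataFrames are synthetic.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, n: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Return at most `n` randomly chosen rows; the seed makes the draw repeatable."""
    if len(df) <= n:
        # Small data sets are profiled in full.
        return df
    return df.sample(n=n, random_state=seed)


# Synthetic frames standing in for a large and a small data set.
big = pd.DataFrame({"value": range(50_000)})
small = pd.DataFrame({"value": range(500)})

print(len(sample_for_profiling(big)), len(sample_for_profiling(small)))
```

The returned DataFrame can then be passed to `ProfileReport` in place of `data.sample(n=10000)`.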

Alternatively, if you insist on getting a report for the whole data set, you can use the minimal mode. In minimal mode, a simplified report is generated with less information than the full one, but it can be generated relatively quickly for a large data set. The code is given below:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file(output_file="output.html")

Conclusion:

Now that you know what pandas-profiling is and how to use it, I hope it will save you a ton of time, which you can use for more advanced analysis specific to the problem at hand.

If you want to get the full report with working code, you can take a look at the following notebook. And if you would like to read some of my other articles then you can find the links below.

Pandas-Profiling GitHub repo:

If you loved this article, you may also like some of my other articles.

About Me:

Hi, I am Sukanta Roy, a software developer, an aspiring Machine Learning Engineer, a former Google Summer of Code 2018 student and a huge psychology buff. If any of these things interest you, you can follow me on Medium or connect with me on LinkedIn.

