Accelerate Your Exploratory Data Analysis With Pandas-Profiling
Exploratory Data Analysis is tedious. Automate the process and generate detailed interactive reports with a single line of code using Pandas-Profiling.
Apr 19 · 8 min read
When starting a new data science project, the first step after getting your hands on the data set is to understand it. We achieve this by performing Exploratory Data Analysis (EDA). This includes finding out the data type of each variable, the distribution of the target variable, the number of distinct values for each predictor variable, whether there are any duplicate or missing values in the data set, etc.
If you have ever done EDA on any data set (and I assume you have, as you are reading this article), I don’t need to tell you how time-consuming this process can be. And if you have been part of many data science projects (whether at your job or as personal projects), you know how repetitive these steps can be. But with the open source library pandas-profiling, that doesn’t have to be the case anymore.
What is Pandas-Profiling?
Pandas-profiling is an open source library that can generate beautiful interactive reports for any data set, with just a single line of code. Sounds interesting? Let’s take a look at the documentation to get a better understanding of what it does.
Pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics — if relevant for the column type — are presented in an interactive HTML report:
- Type inference: detect the types of columns in a data frame.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, inter-quartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables (Spearman, Pearson and Kendall matrices)
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
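To make a few of the statistics in this list concrete, here is a rough sketch of how some of them could be computed by hand with plain pandas. This is not how pandas-profiling computes them internally; the tiny DataFrame and the helper name describe_column are made up for illustration only.

```python
import pandas as pd

# Tiny made-up data set, only for illustration
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})


def describe_column(series: pd.Series) -> dict:
    """Hand-rolled versions of a few per-column statistics (hypothetical helper)."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    mean, std = series.mean(), series.std()
    return {
        "type": str(series.dtype),
        "distinct": series.nunique(),         # missing values are not counted
        "missing": int(series.isna().sum()),
        "min": series.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": series.max(),
        "range": series.max() - series.min(),
        "iqr": q3 - q1,                       # inter-quartile range
        "mean": mean,
        "std": std,
        "cv": std / mean,                     # coefficient of variation
        "skewness": series.skew(),
        "kurtosis": series.kurt(),
    }


stats = describe_column(df["age"])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Writing this once is easy enough; the point of the library is that you do not have to repeat it for every column of every data set.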
Now that we know what pandas-profiling is all about, let’s see how to install it and use it in a Jupyter Notebook or in Google Colab in the following section.
Install Pandas-profiling:
Using pip
You can install pandas-profiling very easily using the pip package manager with the following command:
pip install pandas-profiling[notebook,html]
Alternatively, you could install the latest version directly from GitHub:
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
Using Conda
If you are using conda, you can install it with the following command:
conda install -c conda-forge pandas-profiling
Installation in Google Colab
Google Colab comes with pandas-profiling pre-installed, but unfortunately it is an older version (v1.4). The code in this article and in the GitHub documentation will not run on Google Colab unless you install the latest version of the library (v2.6).
To do that, you need to first uninstall the existing library and install the latest one as follows:
# To uninstall
!pip uninstall pandas_profiling
Now to install, we need to run the pip install command.
!pip install pandas-profiling[notebook,html]
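After installing, it can be handy to confirm which version you actually ended up with. Below is a minimal sketch that uses only Python's standard library (Python 3.8 or newer). The helper name installed_version is my own, and I am assuming the package is registered under the same name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is missing
print(installed_version("pandas-profiling"))
```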
Generate Reports:
Now that we are done with the prerequisites, let’s get into the fun part of analyzing a data set.
The data set I will be using for this example is the Titanic data set.
Load the libraries:
import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
Import the data
file = cache_file("titanic.csv", "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
data = pd.read_csv(file)
Generate report:
To generate the report, run the following code in the notebook.
profile = ProfileReport(data, title="Titanic Dataset", html={'style': {'full_width': True}}, sort="None")
That’s it. With a single line of code you have generated a detailed profile report. Now let us see the results by including the report in the notebook.
Include the report in Notebook as IFrame
profile.to_notebook_iframe()
This will include the interactive report as an HTML iframe in the notebook.
Saving the report
Save the report as an HTML file using the following code:
profile.to_file(output_file="your_report.html")
Or obtain the data as JSON using:
# As a string
json_data = profile.to_json()

# As a file
profile.to_file(output_file="your_report.json")
The Results:
Now that we know how to generate reports using pandas-profiling, let’s look at the result.
Overview:
Pandas_profiling creates a very descriptive overview of the data set, calculating the total number of missing cells and duplicate rows, as well as the number of distinct values, missing values and zeros for each predictor variable. It also flags the variables that have high cardinality or missing values in the warning section, as you can see in the image above.
Besides all this, it generates a detailed analysis for each variable. I will go through some of them in this article. To see the full report with all the code, find the Colab link at the end of the article.
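For comparison, here is roughly what you would write by hand in plain pandas to get the same kind of overview numbers. This is only a sketch on a made-up stand-in DataFrame, not the library's own implementation.

```python
import pandas as pd

# Made-up stand-in for a real data set
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 0, 1],
    "age": [22.0, 38.0, None, 35.0, 22.0, 38.0],
    "cabin": [None, "C85", None, "C123", "E46", "C85"],
})

# Data set level numbers
overview = {
    "variables": df.shape[1],
    "observations": df.shape[0],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}

# One row per variable
per_column = pd.DataFrame({
    "distinct": df.nunique(),
    "missing": df.isna().sum(),
    "zeros": (df == 0).sum(),
})

print(overview)
print(per_column)
```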
Class distribution:
Numerical Features:
For the numerical features, besides detailed statistics such as the mean, standard deviation, min, max and interquartile range (IQR), it also plots a histogram and lists the common and extreme values.
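If you are wondering what "common" and "extreme" values mean here, the sketch below shows one way to get similar lists with plain pandas. The numbers are made up, and the choice of three values and four bins is arbitrary on my part.

```python
import pandas as pd

# Made-up ages, only for illustration
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 35, 22], name="age")

common_values = ages.value_counts().head(3)   # most frequent values first
smallest = ages.nsmallest(3).tolist()         # extreme values at the low end
largest = ages.nlargest(3).tolist()           # extreme values at the high end

# Count values in 4 equal-width bins, like a coarse histogram
histogram = pd.cut(ages, bins=4).value_counts().sort_index()

print(common_values)
print(smallest, largest)
print(histogram)
```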
Categorical Features:
Similar to the numerical features, for categorical features it calculates common values, lengths, characters etc.
Interactions:
One of the most interesting things is the interactions and correlation sections of the report. In the interaction section, the pandas_profiling library automatically generates interaction plots for every pair of variables. You can get the interaction plot of any pair by selecting the specific variables from the two headers (in this example, I have selected PassengerId and Age).
Correlation Matrix:
Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", etc. Correlation can tell you just how much of the variation in people’s weights is related to their heights.
The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.
If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).
When it comes to generating a correlation matrix for all the numerical features, the pandas_profiling library gives us all the popular options to choose from, including Pearson’s r, Spearman’s ρ etc.
Now that we know the advantages of using pandas_profiling, it is also useful to note the main disadvantage of this library.
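To see the difference between these two coefficients, here is a small hand-written sketch in pure Python. The height and weight numbers are made up, and the ranks helper is a simplified version that assumes there are no tied values. It is meant as an illustration of the definitions, not as what the library runs internally.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n; this simple version assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks of the data."""
    return pearson_r(ranks(x), ranks(y))


heights = [150, 160, 170, 180, 190]
weights = [50, 56, 65, 80, 100]  # made-up: always increasing, but not linearly

r = pearson_r(heights, weights)
rho = spearman_rho(heights, weights)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the weights always increase with height, Spearman’s ρ comes out as a perfect 1, while Pearson’s r is slightly lower because the increase is not a straight line. In everyday pandas code you would normally just call df.corr(method="pearson") or df.corr(method="spearman") instead of writing this yourself.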
Disadvantage:
The main disadvantage of pandas-profiling shows up with large data sets. As the size of the data increases, the time needed to generate the report increases considerably.
One way to solve this problem is to generate the profile report for a part of the data set. But while doing this, it is very important to make sure that the data is randomly sampled so that it is representative of all the data we have. We can do this by:
from pandas_profiling import ProfileReport

# Generate report for 10000 randomly sampled data points
# (the data set must contain at least 10000 rows for sample(n=10000) to work)
profile = ProfileReport(data.sample(n=10000), title="Titanic Data set", html={'style': {'full_width': True}}, sort="None")

# Save to file
profile.to_file(output_file='10000datapoints.html')
Alternatively, if you insist on getting the report for the whole data set, you can do that by using the minimal mode. In the minimal mode, a simplified report is generated with less information than the full one, but it can be generated relatively quickly for a large data set. The code for this is given below:
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file(output_file="output.html")
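One caveat with the sampling approach: DataFrame.sample(n=10000) raises a ValueError when the data set has fewer than 10000 rows (the Titanic data set has only 891). A small guard avoids that. The sketch below uses a hypothetical helper name, sample_for_profiling, and a fixed random_state so the same rows are picked on every run. Both choices are mine and not part of pandas-profiling.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, n: int = 10000, seed: int = 42) -> pd.DataFrame:
    """Return at most n randomly chosen rows (hypothetical helper)."""
    if len(df) <= n:
        return df  # small enough already, profile everything
    return df.sample(n=n, random_state=seed)


# Made-up frames standing in for a small and a large data set
small = pd.DataFrame({"x": range(891)})
large = pd.DataFrame({"x": range(50000)})

print(len(sample_for_profiling(small)))
print(len(sample_for_profiling(large)))
```

The returned DataFrame can then be passed to ProfileReport in place of data.sample(n=10000).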
Conclusion:
Now that you know what pandas-profiling is and how to use it, I hope it will save you a ton of time, which you can use for more advanced analysis specific to the problem at hand.
If you want to get the full report with working code, you can take a look at the following notebook. And if you would like to read some of my other articles then you can find the links below.
Demo
Demo on Titanic Data set
colab.research.google.com
Pandas-Profiling GitHub repo:
If you loved this article, you may also like some of my other articles.
What is ACM ICPC and how to prepare for it (the beginner’s guide)
What is ACM ICPC?
codeburst.io
About Me:
Hi, I am Sukanta Roy. A software developer, an aspiring Machine Learning Engineer, former Google Summer of Code 2018 student and a huge psychology buff. If any of these things interest you, you can follow me on Medium or connect with me on LinkedIn.