Accelerate your Exploratory Data Analysis with Pandas Profiling

Exploratory Data Analysis is tedious. Automate the process and generate detailed interactive reports with a single line of code using Pandas-Profiling.

Apr 19 · 8 min read

When starting a new data science project, the first step after getting your hands on the data set is to understand it. We achieve this by performing Exploratory Data Analysis (EDA). This includes finding out the data type of each variable, the distribution of the target variable, the number of distinct values for each predictor variable, and whether there are any duplicate or missing values in the data set.
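To make those checks concrete, here is a minimal sketch of how they might be done by hand with plain pandas. The tiny DataFrame and its column names are invented for illustration; they are not a real data set.

```python
import pandas as pd

# Toy data invented for illustration; not a real data set.
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 22.0],
    "sex": ["male", "female", "female", "female", "male"],
    "survived": [0, 1, 1, 1, 0],
})

dtypes = df.dtypes                                    # data type of each variable
target_distribution = df["survived"].value_counts()  # distribution of the target variable
distinct_counts = df.nunique()                        # distinct values per column (missing values excluded)
missing_counts = df.isna().sum()                      # missing values per column
duplicate_rows = int(df.duplicated().sum())           # rows identical to an earlier row

print(dtypes, target_distribution, distinct_counts, missing_counts, sep="\n\n")
print("duplicate rows:", duplicate_rows)
```

Each of these is a one-liner, but repeating them for every column of every new data set is exactly the kind of work that adds up.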

If you have ever done EDA on a data set (and I assume you have, as you are reading this article), I don’t need to tell you how time-consuming it can be. And if you have been part of many data science projects (be it in your job or through personal projects), you know how repetitive these steps can be. But with the open-source library pandas-profiling, that doesn’t have to be the case anymore.

What is Pandas-Profiling?

Pandas-profiling is an open-source library that can generate beautiful interactive reports for any data set with just a single line of code. Sounds interesting? Let’s take a look at the documentation to get a better understanding of what it does.

Pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics — if relevant for the column type — are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a data frame.
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, inter-quartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables (Spearman, Pearson and Kendall matrices)
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
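To give a feel for what the quantile and descriptive statistics in this list amount to, here is a rough sketch that computes them for a single numeric column with plain pandas. The values and the column name are invented, and pandas-profiling may use slightly different conventions for some of these measures, so treat this as an illustration rather than a reproduction of the report.

```python
import pandas as pd

# Invented values, used only to illustrate the calculations.
s = pd.Series([2.0, 4.0, 4.0, 6.0, 8.0, 12.0], name="fare")

q1, median, q3 = s.quantile([0.25, 0.5, 0.75])

stats = {
    "minimum": s.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": s.max(),
    "range": s.max() - s.min(),
    "IQR": q3 - q1,                       # inter-quartile range
    "mean": s.mean(),
    "std": s.std(),                       # sample standard deviation (ddof=1)
    "sum": s.sum(),
    "MAD": (s - median).abs().median(),   # median absolute deviation
    "CV": s.std() / s.mean(),             # coefficient of variation
    "skewness": s.skew(),
    "kurtosis": s.kurt(),                 # excess kurtosis as pandas defines it
}

for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```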

Now that we know what pandas-profiling is all about, let’s see how to install it and use it in a Jupyter Notebook or in Google Colab in the following section.

Install Pandas-profiling:

Using pip

You can install pandas-profiling very easily using the pip package manager with the following command:

pip install pandas-profiling[notebook,html]

Alternatively, you could install the latest version directly from Github:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using Conda

If you are using conda, you can install it with the following command:

conda install -c conda-forge pandas-profiling

Installation in Google Colab

Google Colab comes with pandas-profiling pre-installed, but unfortunately with an older version (v1.4). If you are following this article or the GitHub documentation, the code will not run on Google Colab unless you install the latest version of the library (v2.6).

To do that, you need to first uninstall the existing library and install the latest one as follows:

# To uninstall (-y skips the confirmation prompt)
!pip uninstall -y pandas_profiling

Now to install, we need to run the pip install command.

!pip install pandas-profiling[notebook,html]
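After reinstalling, it can be worth confirming which version is actually installed. The sketch below uses only the Python standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is my own and not part of pandas-profiling. On Colab you may also need to restart the runtime before the new version is picked up.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "pandas-profiling" is the name used in the pip command above.
print("pandas-profiling:", installed_version("pandas-profiling"))
```

If this prints None, the package is not installed in the environment the notebook is running in.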

Generate Reports:

Accelerate your Exploratory Data Analysis with Pandas Profiling

Photo by Kevin Ku on Unsplash

Now that we are done with the prerequisites, let’s get into the fun part of analyzing some data set.

The data set I will be using for this example is the Titanic data set.

Load the libraries:

import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

Import the data

file = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
data = pd.read_csv(file)

Figure: Loading the dataset

Generate report:

To generate the report, run the following code in the notebook.

profile = ProfileReport(data, title="Titanic Dataset", html={'style': {'full_width': True}}, sort="None")

Figure: Generate report

That’s it. With a single line of code you have generated a detailed profile report. Now let us see the results by including the report in the notebook.

Include the report in Notebook as IFrame

profile.to_notebook_iframe()

This will include the interactive report as HTML iframe in the notebook.

Saving the report

Save the report as an HTML file using the following code:

profile.to_file(output_file="your_report.html")

Or obtain the data as JSON using:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file(output_file="your_report.json")

The Results:

Now that we know how to generate reports using pandas-profiling, let’s look at the result.

Overview:

Figure: Overview

Figure: Warnings

Pandas-profiling creates a very descriptive overview of the data set: it reports the total number of missing cells and duplicate rows and, for each variable, the number of distinct values, missing values and zeros. It also flags the variables that have high cardinality or missing values in the Warnings section, as you can see in the image above.
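To illustrate the idea behind such warnings, here is a small sketch that flags columns by hand. The data, the three thresholds and the flag labels are all hypothetical choices of mine for this example; they are not the rules pandas-profiling itself applies.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve", "Fay"],
    "cabin": [None, "A1", None, None, "B2", None],
    "children": [0, 0, 0, 1, 0, 2],
})

# Hypothetical thresholds, chosen for this sketch only.
MAX_DISTINCT_SHARE = 0.9   # non-numeric column where nearly every value is unique
MAX_MISSING_SHARE = 0.5    # column that is more than half empty
MAX_ZERO_SHARE = 0.5       # numeric column that is mostly zeros

total_missing_cells = int(df.isna().sum().sum())

flags = []
for column in df.columns:
    values = df[column]
    is_numeric = pd.api.types.is_numeric_dtype(values)
    if not is_numeric and values.nunique() / len(values) > MAX_DISTINCT_SHARE:
        flags.append((column, "high cardinality"))
    if values.isna().mean() > MAX_MISSING_SHARE:
        flags.append((column, "missing values"))
    if is_numeric and (values == 0).mean() > MAX_ZERO_SHARE:
        flags.append((column, "zeros"))

print("missing cells:", total_missing_cells)
for column, reason in flags:
    print(f"{column}: {reason}")
```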

Besides all this, it generates a detailed analysis of each variable. I will go through some of them in this article. To see the full report with all the code, find the Colab link at the end of the article.

Class distribution:

Figure: Class distribution

Numerical Features:

For numerical features, besides detailed statistics such as the mean, standard deviation, minimum, maximum and interquartile range (IQR), the report also plots a histogram and lists the common and extreme values.

Categorical Features:

As with numerical features, for categorical features it calculates common values, string lengths, character statistics and so on.
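As a rough illustration of what those categorical summaries involve, the sketch below computes common values, string lengths and a simple character breakdown for one text column. The values are invented, and the three character groups (digit, letter, other) are a simplification of mine, not the categories the report uses.

```python
from collections import Counter

import pandas as pd

# Invented values, used only for illustration.
codes = pd.Series(["AB 101", "CD 2020", "777", "AB 101", "9"], name="code")

common_values = codes.value_counts()   # most frequent values first
lengths = codes.str.len()              # length of each string

# Classify every character into one of three coarse groups.
all_characters = "".join(codes)
category_counts = Counter(
    "digit" if ch.isdigit() else "letter" if ch.isalpha() else "other"
    for ch in all_characters
)

print(common_values.head(3))
print("length min/mean/max:", lengths.min(), lengths.mean(), lengths.max())
print(dict(category_counts))
```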

Interactions:

One of the most interesting things is the interactions and correlation sections of the report. In the Interactions section, pandas-profiling automatically generates interaction plots for every pair of variables. You can view the plot for any pair by selecting the two variables from the two headers (in this example, I have selected PassengerId and Age).

Correlation Matrix:

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people. The relationship isn’t perfect. People of the same height vary in weight, and you can easily think of two people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of people 5'5" is less than the average weight of people 5'6", and their average weight is less than that of people 5'7", etc. Correlation can tell you just how much of the variation in people’s weights is related to their heights.

The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as one variable gets larger, the other tends to get larger. If r is negative, it means that as one gets larger, the other tends to get smaller (often called an “inverse” correlation).
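To make the coefficient concrete, here is a minimal sketch that computes Pearson’s r directly from its definition in plain Python. The function name pearson_r is my own, and the height and weight figures are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: this divides by zero if either sequence is constant.
    return covariance / (spread_x * spread_y)


# Made-up measurements for five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 60, 72, 75]

r = pearson_r(heights_cm, weights_kg)
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])   # exact positive linear relationship
r_inverse = pearson_r([1, 2, 3], [6, 4, 2])   # exact negative linear relationship

print(round(r, 3), round(r_perfect, 3), round(r_inverse, 3))
```

The made-up heights and weights give a strong but imperfect positive r, while the two exact linear cases land on the ends of the scale.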

When it comes to generating a correlation matrix for all the numerical features, pandas-profiling gives us all the popular options to choose from, including Pearson’s r, Spearman’s ρ and others.
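For comparison, plain pandas can produce such matrices through DataFrame.corr. The sketch below uses invented numbers to show how Pearson and Spearman can disagree when a relationship is monotonic but not linear; it says nothing about how pandas-profiling computes its matrices internally.

```python
import pandas as pd

# Invented numbers: y grows with x but not linearly, z falls linearly as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 8, 27, 64, 125],
    "z": [10, 8, 6, 4, 2],
})

pearson = df.corr(method="pearson")     # measures linear association
spearman = df.corr(method="spearman")   # measures monotonic (rank) association

print(pearson.round(3))
print(spearman.round(3))
```

Here Spearman reports a perfect relationship between x and y because the ranks agree exactly, while Pearson stays below 1 because the relationship is curved.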

Figure: Correlations

Now that we know the advantages of using pandas-profiling, it is also useful to note its main disadvantage.

Disadvantage:

The main disadvantage of pandas-profiling is its performance on large data sets: as the size of the data grows, the time needed to generate the report increases considerably.

One way to solve this problem is to generate the profile report for a part of the data set. But while doing this, it is very important to make sure that the data is randomly sampled so that it is representative of all the data we have. We can do this by:

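from pandas_profiling import ProfileReport

# Generate the report for a random sample of at most 10,000 rows.
# Cap n at len(data): sample() raises an error if n exceeds the number of rows.
sample_size = min(10000, len(data))
profile = ProfileReport(data.sample(n=sample_size), title="Titanic Data set", html={'style': {'full_width': True}}, sort="None")

# Save to file
profile.to_file(output_file='10000datapoints.html')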
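Before profiling a sample, a quick sanity check can show whether it looks representative. The sketch below draws a random sample from a synthetic stand-in for a large data set and compares its mean with the full data. The DataFrame, the column name and the fixed random_state are assumptions of mine for the example.

```python
import pandas as pd

# Synthetic stand-in for a large data set; the column name is invented.
large_dataset = pd.DataFrame({"value": range(100_000)})

# random_state makes the random sample reproducible between runs.
sample = large_dataset.sample(n=10_000, random_state=0)

full_mean = large_dataset["value"].mean()
sample_mean = sample["value"].mean()
relative_gap = abs(sample_mean - full_mean) / full_mean

print(f"full mean:   {full_mean:.1f}")
print(f"sample mean: {sample_mean:.1f}")
print(f"relative gap: {relative_gap:.2%}")
```

If the gap is large for the columns you care about, consider a bigger sample or sampling within groups instead.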

Alternatively, if you insist on getting a report for the whole data set, you can use minimal mode. In minimal mode, a simplified report is generated with less information than the full one, but it can be produced relatively quickly for a large data set. The code is given below:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file(output_file="output.html")

Conclusion:

Now that you know what pandas-profiling is and how to use it, I hope it will save you a ton of time that you can spend on more advanced analysis specific to the problem at hand.

If you want to get the full report with working code, you can take a look at the following notebook. And if you would like to read some of my other articles then you can find the links below.

Pandas-Profiling GitHub repo:

If you loved this article, you may also like some of my other articles.

About Me:

Hi, I am Sukanta Roy: a software developer, an aspiring Machine Learning Engineer, a former Google Summer of Code 2018 student and a huge psychology buff. If any of these things interest you, you can follow me on Medium or connect with me on LinkedIn.

