Dexplot: Python library for data visualization

Category: IT Technology · Published: 5 years ago


Dexplot is a Python library for delivering beautiful data visualizations with a simple and intuitive user experience.

The primary goals for dexplot are:

- Maintain a very consistent API with as few functions as necessary to make the desired statistical plots
- Allow the user tremendous power without using matplotlib

pip install dexplot

Built for long and wide data

Dexplot is primarily built for long data, which is a form of data where each row represents a single observation and each column represents a distinct quantity. It is often referred to as "tidy" data. Here, we have some long data.


Dexplot also has the ability to handle wide data, where multiple columns may contain values that represent the same kind of quantity. The same data above has been aggregated to show the mean for each combination of neighborhood and property type. It is now wide data as each column contains the same quantity (price).
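To make the distinction concrete, here is a minimal pandas sketch that reshapes a small long table into a wide one. The four rows are invented for illustration and are not taken from the Airbnb dataset.

```python
import pandas as pd

# Made-up long data: one row per listing, one column per distinct quantity
long_df = pd.DataFrame({
    'neighborhood': ['Shaw', 'Shaw', 'Edgewood', 'Edgewood'],
    'property_type': ['House', 'Apartment', 'House', 'Apartment'],
    'price': [200, 100, 150, 90],
})

# Wide data: one row per neighborhood, and every column holds the same
# quantity (mean price), one column per property type
wide_df = long_df.pivot_table(index='neighborhood', columns='property_type',
                              values='price', aggfunc='mean')
print(wide_df)
```

In the long table, price lives in a single column; in the wide table, it is spread across one column per property type.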


Dexplot provides a small number of powerful functions that all work similarly. Most plotting functions have the following signature:

dxp.plotting_func(x, y, data, aggfunc, split, row, col, orientation, ...)

x - Column name along the x-axis
y - Column name along the y-axis
data - Pandas DataFrame
aggfunc - String of pandas aggregation function, 'min', 'max', 'mean', etc...
split - Column name to split data into distinct groups
row - Column name to split data into distinct subplots row-wise
col - Column name to split data into distinct subplots column-wise
orientation - Either vertical ( 'v' ) or horizontal ( 'h' ). Default for most plots is vertical.

When aggfunc is provided, x will be the grouping variable and y will be aggregated when vertical and vice-versa when horizontal. The best way to learn how to use dexplot is with the examples below.
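The role swap between orientations can be sketched in a few lines. The helper `resolve_roles` below is hypothetical and only illustrates the rule stated above; it is not part of dexplot.

```python
def resolve_roles(x, y, orientation='v'):
    """Return (grouping_column, aggregating_column) for an aggregating plot.

    Hypothetical helper: vertical plots group by x and aggregate y,
    horizontal plots group by y and aggregate x.
    """
    if orientation == 'v':
        return x, y
    if orientation == 'h':
        return y, x
    raise ValueError("orientation must be 'v' or 'h'")


print(resolve_roles('neighborhood', 'price', 'v'))
print(resolve_roles('price', 'neighborhood', 'h'))
```

Both calls describe the same aggregation: neighborhood is the grouping column and price is the aggregated column.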

Families of plots

There are two primary families of plots, aggregation and distribution. Aggregation plots take a sequence of values and reduce it to a single value using the function passed to aggfunc. Distribution plots take a sequence of values and depict the shape of the distribution in some manner.

Aggregation
- bar
- line
- scatter
- count
Distribution
- box
- violin
- hist
- kde
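The difference between the two families can be illustrated with the standard library alone. This is only a sketch of the idea, not how dexplot computes anything; the five prices are the ones from the sample rows of the Airbnb data.

```python
import statistics

prices = [433, 154, 83, 475, 118]

# Aggregation: the whole sequence collapses to one number
aggregated = statistics.median(prices)

# Distribution: several numbers that describe the shape of the data,
# here the three quartile cut points
quartiles = statistics.quantiles(prices, n=4)

print(aggregated)
print(quartiles)
```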

Comparison with Seaborn

If you have used the seaborn library, you should notice many similarities, since much of dexplot was inspired by seaborn. Below is a list of the extra features in dexplot not found in seaborn:

catplot
groupby

Most of the examples below use long data.

Aggregating plots - bar, line and scatter

We'll begin by covering the plots that aggregate. An aggregation is defined as a function that summarizes a sequence of numbers with a single value. The examples come from the Airbnb dataset, which contains many property rental listings from the Washington D.C. area.

import dexplot as dxp
import pandas as pd
airbnb = dxp.load_dataset('airbnb')
airbnb.head()

	neighborhood	property_type	accommodates	bathrooms	bedrooms	price	cleaning_fee	rating	superhost	response_time	latitude	longitude
0	Shaw	Townhouse	16	3.5	4	433	250	95.0	No	within an hour	38.90982	-77.02016
1	Brightwood Park	Townhouse	4	3.5	4	154	50	97.0	No	NaN	38.95888	-77.02554
2	Capitol Hill	House	2	1.5	1	83	35	97.0	Yes	within an hour	38.88791	-76.99668
3	Shaw	House	2	2.5	1	475	0	98.0	No	NaN	38.91331	-77.02436
4	Kalorama Heights	Apartment	3	1.0	1	118	15	91.0	No	within an hour	38.91933	-77.04124

There are more than 4,000 listings in our dataset. We will use bar charts to aggregate the data.

airbnb.shape

(4581, 12)

Vertical bar charts

In order to perform an aggregation, you must supply a value for aggfunc. Here, we find the median price per neighborhood. Notice that the long neighborhood names on the x-axis wrap automatically.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median')


Line and scatter plots can be created with the same command by substituting the name of the function. Neither is a good choice for this visualization, since the grouping variable (neighborhood) has no meaningful order.

dxp.line(x='neighborhood', y='price', data=airbnb, aggfunc='median')


dxp.scatter(x='neighborhood', y='price', data=airbnb, aggfunc='median')


Components of the groupby aggregation

Anytime the aggfunc parameter is set, you have performed a groupby aggregation, which always consists of three components:

Grouping column - unique values of this column form independent groups (neighborhood)
Aggregating column - the column that will get summarized with a single value (price)
Aggregating function - a function that returns a single value (median)

The general format for doing this in pandas is:

df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})

Specifically, the following code is executed within dexplot.

airbnb.groupby('neighborhood').agg({'price': 'median'})

	price
neighborhood
Brightwood Park	87.0
Capitol Hill	129.5
Columbia Heights	95.0
Dupont Circle	125.0
Edgewood	100.0
Kalorama Heights	118.0
Shaw	133.5
Union Station	120.0
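The three components can also be spelled out without pandas. The function `groupby_agg` below is a hypothetical plain-Python sketch, shown only to make each component visible; it uses the five sample rows of the dataset rather than all listings, so its numbers differ from the table above.

```python
from collections import defaultdict
import statistics


def groupby_agg(rows, grouping_column, aggregating_column, func):
    """Group rows by one column and summarize another with func."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[grouping_column]].append(row[aggregating_column])
    # One summarized value per unique value of the grouping column
    return {key: func(values) for key, values in sorted(groups.items())}


rows = [
    {'neighborhood': 'Shaw', 'price': 433},
    {'neighborhood': 'Brightwood Park', 'price': 154},
    {'neighborhood': 'Capitol Hill', 'price': 83},
    {'neighborhood': 'Shaw', 'price': 475},
    {'neighborhood': 'Kalorama Heights', 'price': 118},
]

result = groupby_agg(rows, 'neighborhood', 'price', statistics.median)
print(result)
```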

Number and percent of missing values with `'countna'` and `'percna'`

In addition to all the common aggregating functions, you can use the strings 'countna' and 'percna' to get the number and percentage of missing values per group.

dxp.bar(x='neighborhood', y='response_time', data=airbnb, aggfunc='countna')
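A rough pandas equivalent of counting missing values per group is sketched below on a small made-up frame. Whether dexplot reports 'percna' as a fraction or as a percentage is not stated here, so the sketch computes a fraction and leaves any scaling as an open assumption.

```python
import pandas as pd

# Made-up data: None marks a missing response time
df = pd.DataFrame({
    'neighborhood': ['Shaw', 'Shaw', 'Edgewood', 'Edgewood', 'Edgewood'],
    'response_time': ['within an hour', None, None, None, 'within a day'],
})

is_missing = df['response_time'].isna()

# Number of missing values per neighborhood
count_na = is_missing.groupby(df['neighborhood']).sum()

# Fraction of missing values per neighborhood (multiply by 100 for a percentage)
frac_na = is_missing.groupby(df['neighborhood']).mean()

print(count_na)
print(frac_na)
```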


Sorting the bars by values

By default, the bars will be sorted by the grouping column (x-axis here) in alphabetical order. Use the sort_values parameter to sort the bars by value.

- 'asc' - sort values from least to greatest
- 'desc' - sort values from greatest to least

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', sort_values='asc')


Here, we sort the values from greatest to least.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', sort_values='desc')
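The effect of sorting by value can be checked by hand against the median prices from the groupby output earlier. The helper `sort_by_value` is hypothetical and only mirrors the 'asc'/'desc' behavior described above.

```python
# Median price per neighborhood, copied from the groupby output
medians = {
    'Brightwood Park': 87.0, 'Capitol Hill': 129.5, 'Columbia Heights': 95.0,
    'Dupont Circle': 125.0, 'Edgewood': 100.0, 'Kalorama Heights': 118.0,
    'Shaw': 133.5, 'Union Station': 120.0,
}


def sort_by_value(mapping, order='asc'):
    """Return (label, value) pairs sorted by value; 'desc' puts the largest first."""
    if order not in ('asc', 'desc'):
        raise ValueError("order must be 'asc' or 'desc'")
    return sorted(mapping.items(), key=lambda item: item[1],
                  reverse=(order == 'desc'))


for label, value in sort_by_value(medians, 'desc'):
    print(label, value)
```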


Specify order with `x_order`

Specify a specific order of the labels on the x-axis by passing a list of values to x_order. This can also act as a filter to limit the number of bars.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median',
        x_order=['Dupont Circle', 'Edgewood', 'Union Station'])


By default, x_order and all of the other _order parameters are set to 'asc', which orders the values alphabetically. Use the string 'desc' to sort in the opposite direction.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', x_order='desc')


Filter for the neighborhoods with most/least frequency of occurrence

You can use x_order again to filter for the x-values that appear the most/least often by setting it to the string 'top n' or 'bottom n' where n is an integer. Here, we filter for the top 4 most frequently occurring neighborhoods. This option is useful when there are dozens of unique values in the grouping column.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median',
        x_order='top 4')


We can verify that the four neighborhoods are the most common.

airbnb['neighborhood'].value_counts()

Columbia Heights    773
Union Station       713
Capitol Hill        654
Edgewood            610
Dupont Circle       549
Shaw                514
Brightwood Park     406
Kalorama Heights    362
Name: neighborhood, dtype: int64
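The 'top n' and 'bottom n' selection can be sketched with a frequency counter. The helper `select_by_frequency` is hypothetical, and how dexplot breaks ties between equally frequent values is not covered here. The counts are the ones printed by value_counts above.

```python
from collections import Counter


def select_by_frequency(values, spec):
    """Keep the n most ('top n') or n least ('bottom n') frequent values."""
    kind, n = spec.split()
    n = int(n)
    # Unique values ranked from most to least frequent
    ranked = [value for value, _ in Counter(values).most_common()]
    if kind == 'top':
        return ranked[:n]
    if kind == 'bottom':
        return ranked[max(len(ranked) - n, 0):]
    raise ValueError("spec must look like 'top n' or 'bottom n'")


counts = {
    'Columbia Heights': 773, 'Union Station': 713, 'Capitol Hill': 654,
    'Edgewood': 610, 'Dupont Circle': 549, 'Shaw': 514,
    'Brightwood Park': 406, 'Kalorama Heights': 362,
}
# Rebuild a column of raw values with the same frequencies
neighborhoods = [name for name, count in counts.items() for _ in range(count)]

top_four = select_by_frequency(neighborhoods, 'top 4')
bottom_two = select_by_frequency(neighborhoods, 'bottom 2')
print(top_four)
print(bottom_two)
```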

Set orientation to 'h' for horizontal bars. When you do this, you'll need to switch x and y since the grouping column (neighborhood) will be along the y-axis and the aggregating column (price) will be along the x-axis.

dxp.bar(x='price', y='neighborhood', data=airbnb, aggfunc='median', 
        orientation='h', sort_values='desc')


Switching orientation is possible for most other plots.

dxp.line(x='price', y='neighborhood', data=airbnb, aggfunc='median', orientation='h')


Split bars into groups

You can split each bar into further groups by setting the split parameter to another column.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', split='superhost')


We can use the pivot_table method to verify the results in pandas.

airbnb.pivot_table(index='superhost', columns='neighborhood', 
                   values='price', aggfunc='median')

neighborhood	Brightwood Park	Capitol Hill	Columbia Heights	Dupont Circle	Edgewood	Kalorama Heights	Shaw	Union Station
superhost
No	85.0	129.0	90.5	120.0	100.0	110.0	130.0	120.0
Yes	90.0	130.0	103.0	135.0	100.0	124.0	135.0	125.0

Set the order of the unique split values with split_order, which can also act as a filter.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', 
        split='superhost', split_order=['Yes', 'No'])


Like all the _order parameters, split_order defaults to 'asc' (alphabetical) order. Set it to 'desc' for the opposite.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median',
        split='property_type', split_order='desc')


Filtering for the most/least frequent split categories is possible.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', 
        split='property_type', split_order='bottom 2')


We can verify that the least frequent property types are Townhouse and Condominium.

airbnb['property_type'].value_counts()

Apartment      2403
House           877
Townhouse       824
Condominium     477
Name: property_type, dtype: int64

Stacked bar charts

Stack all the split groups one on top of the other by setting stacked to True.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', 
        split='superhost', split_order=['Yes', 'No'], stacked=True)


Split into multiple plots

It's possible to split the data further into separate plots by the unique values in a different column with the row and col parameters. Here, each kind of property_type has its own plot.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', 
        split='superhost', col='property_type')


If there isn't room for all of the plots, set the wrap parameter to an integer to set the maximum number of plots per row/col. We also specify the col_order to be descending alphabetically.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', 
        split='superhost', col='property_type', wrap=2, col_order='desc')
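One plausible reading of wrap for column-wise subplots is sketched below: at most wrap plots per row, with extra rows added as needed. The helper `grid_shape` is hypothetical and the layout rule is an assumption, not dexplot's documented behavior.

```python
import math


def grid_shape(n_plots, wrap=None):
    """Return (n_rows, n_cols) when laying plots out left to right.

    Assumed rule: without wrap everything sits on one row; with wrap,
    each row holds at most `wrap` plots.
    """
    if wrap is None or wrap >= n_plots:
        return 1, n_plots
    return math.ceil(n_plots / wrap), wrap


# Four property types, as in the example above
print(grid_shape(4))
print(grid_shape(4, wrap=2))
```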


Use col_order to both filter and set a specific order for the plots.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median',
        split='superhost', col='property_type', col_order=['House', 'Condominium'])


Splits can be made simultaneously along rows and columns.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', split='superhost', 
        col='property_type', col_order=['House', 'Condominium', 'Apartment'],
        row='bedrooms', row_order=[1, 2, 3])


By default, all axis limits are shared. Allow each plot to set its own limits by setting sharex and sharey to False.

dxp.bar(x='neighborhood', y='price', data=airbnb, aggfunc='median', split='superhost', 
        col='property_type', col_order=['House', 'Condominium', 'Apartment'],
        row='bedrooms', row_order=[1, 2, 3], sharey=False)


Set the width of each bar with `size`

The width (height when horizontal) of the bars is set with the size parameter. By default, this value is .9. Think of this number as the relative width of all the bars for a particular x/y value, where 1 is the distance between each x/y value.

dxp.bar(x='neighborhood', y='price', data=airbnb, 
        aggfunc='median', split='property_type',
        split_order=['Apartment', 'House'], 
        x_order=['Dupont Circle', 'Capitol Hill', 'Union Station'], size=.5)
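To see what a relative width means in practice, the sketch below places the bars of one group around its x position. It assumes the split bars share the total width equally, which is a guess about the layout; `bar_positions` is a hypothetical helper, not dexplot code.

```python
def bar_positions(center, n_bars, size=0.9):
    """Return (bar_centers, bar_width) for one group of bars.

    `size` is the total width of the group, where 1 is the distance
    between neighboring x values. Equal sharing is an assumption.
    """
    width = size / n_bars
    left_edge = center - size / 2
    centers = [left_edge + width * (i + 0.5) for i in range(n_bars)]
    return centers, width


# Two split groups (Apartment, House) at x position 0 with size=.5
centers, width = bar_positions(0, 2, size=0.5)
print(centers, width)
```

With size=.5, the two bars together cover half of the distance between neighboring x values, leaving a visible gap between groups.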


Splitting line plots

All the other aggregating plots work similarly.

dxp.line(x='neighborhood', y='price', data=airbnb,
         aggfunc='median', split='property_type',
         split_order=['Apartment', 'House'],
         x_order=['Dupont Circle', 'Capitol Hill', 'Union Station'])


Distribution plots - box, violin, histogram, kde

Distribution plots work similarly, but do not have an aggfunc since they do not aggregate. They take their group of values and draw some kind of shape that gives information on how that variable is distributed.

Box plots have colored boxes with ends at the first and third quartiles and a line at the median. The whiskers extend from the box to the most extreme points that lie within 1.5 times the interquartile range (IQR), which is the difference between the third and first quartiles. Fliers are the points outside this range and are plotted individually. By default, both box and violin plots are plotted horizontally.

dxp.box(x='price', y='neighborhood', data=airbnb)
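The quantities a box plot draws can be computed by hand. The sketch below uses the standard library with linearly interpolated quartiles and the common convention that whiskers stop at the most extreme points within 1.5 IQR of the box; quartile conventions vary between tools, so dexplot's exact numbers may differ. The helper `box_stats` and the sample data are made up for illustration.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Compute quartiles, whisker ends and fliers for one box."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme points still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    fliers = [v for v in data if v < low_fence or v > high_fence]
    return {'q1': q1, 'median': median, 'q3': q3,
            'whiskers': (inside[0], inside[-1]), 'fliers': fliers}


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```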


Split the groups in the same manner as with the aggregation plots.

dxp.box(x='price', y='neighborhood', data=airbnb, 
        split='superhost', split_order=['Yes', 'No'])


Order the appearance of the splits alphabetically (in descending order here).

dxp.box(x='price', y='neighborhood', data=airbnb, 
        split='property_type', split_order='desc')


Filter range of values with `x_order`

It's possible to filter the range of possible values by passing in a list of the minimum and maximum to x_order.

dxp.box(x='price', y='neighborhood', data=airbnb, 
        split='superhost', x_order=[50, 250])


Swap x and y and set orientation to 'v' to make vertical box plots.

dxp.box(x='neighborhood', y='price', data=airbnb, orientation='v',
        split='property_type', split_order='top 2')


Violin plots work identically to box plots, but show "violins", kernel density plots duplicated on both sides of a line.

dxp.violin(x='price', y='neighborhood', data=airbnb, 
          split='superhost', split_order=['Yes', 'No'])


Splitting by rows and columns is possible as well with distribution plots.

dxp.box(x='price', y='neighborhood', data=airbnb, split='superhost',
        col='property_type', col_order=['House', 'Condominium', 'Apartment'],
        row='bedrooms', row_order=[1, 2])


Histograms work in a slightly different manner. Instead of passing both x and y, you pass a single numeric column to val. By default, a vertical histogram of counts with 20 bins is created.

dxp.hist(val='price', data=airbnb)


We can use split just like we did above and also create horizontal histograms.

dxp.hist(val='price', data=airbnb, orientation='h', split='superhost', bins=15)


Here, we customize our histogram by plotting the cumulative density as opposed to the raw frequency count using the outline of the bars ('step').

dxp.hist(val='price', data=airbnb, split='bedrooms', split_order=[1, 2, 3], 
         bins=30, density=True, histtype='step', cumulative=True)


Kernel density estimates provide an estimate for the probability distribution of a continuous variable. Here, we examine how price is distributed by bedroom.

dxp.kde(x='price', data=airbnb, split='bedrooms', split_order=[1, 2, 3])
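For intuition, a kernel density estimate can be written in a few lines: place a small bell curve on every observation and average them. The sketch below uses a Gaussian kernel with a hand-picked bandwidth of 40, which is an arbitrary assumption; dexplot's kernel and bandwidth selection are not described here and may differ.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point."""
    if bandwidth <= 0:
        raise ValueError('bandwidth must be positive')
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each observation
        return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                           for s in samples)

    return density


prices = [433, 154, 83, 475, 118]  # prices of the five sample rows
density = gaussian_kde(prices, bandwidth=40)

# A density should integrate to about 1 (Riemann sum with step 1)
area = sum(density(x) for x in range(-400, 1000))
print(round(area, 3))
```

The estimate is highest near clusters of observations and close to zero in the gaps between them.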


Graph the cumulative distribution instead on multiple plots.

dxp.kde(x='price', data=airbnb, split='bedrooms', 
        split_order=[1, 2, 3], cumulative=True, col='property_type', wrap=2)


Two-dimensional KDE's

Provide two numeric columns to x and y to get a two-dimensional KDE.

dxp.kde(x='price', y='cleaning_fee', data=airbnb)


Create a grid of two-dimensional KDE's.

dxp.kde(x='price', y='cleaning_fee', data=airbnb, row='neighborhood', wrap=3)


The count function graphs the frequency of unique values as bars. By default, it plots the values in descending order.

dxp.count(val='neighborhood', data=airbnb)


In pandas, this is a straightforward call to the value_counts method.

airbnb['neighborhood'].value_counts()

Columbia Heights    773
Union Station       713
Capitol Hill        654
Edgewood            610
Dupont Circle       549
Shaw                514
Brightwood Park     406
Kalorama Heights    362
Name: neighborhood, dtype: int64

Relative frequency with `normalize`

Instead of the raw counts, get the relative frequency by setting normalize to True.

dxp.count(val='neighborhood', data=airbnb, normalize=True)


Here, we split by property type.

dxp.count(val='neighborhood', data=airbnb, split='property_type')


In pandas, this is done with the crosstab function.

pd.crosstab(index=airbnb['property_type'], columns=airbnb['neighborhood'])

neighborhood	Brightwood Park	Capitol Hill	Columbia Heights	Dupont Circle	Edgewood	Kalorama Heights	Shaw	Union Station
property_type
Apartment	167	299	374	397	244	284	315	323
Condominium	35	70	97	62	65	42	52	54
House	131	137	157	47	146	23	61	175
Townhouse	73	148	145	43	155	13	86	161

Count plots can also be horizontal and stacked.

dxp.count(val='neighborhood', data=airbnb, split='property_type', 
          orientation='h', stacked=True, col='superhost')


Normalize over different variables

Setting normalize to True returns the relative frequency with respect to all of the data. You can also normalize over any of the variables provided.

dxp.count(val='neighborhood', data=airbnb, split='property_type', normalize='neighborhood', 
                title='Relative Frequency by Neighborhood')
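The difference between the two kinds of normalization can be worked out in pandas on a slice of the crosstab shown earlier (the Shaw and Edgewood columns). The sketch assumes that normalizing over neighborhood means the frequencies sum to 1 within each neighborhood; that reading is an interpretation, not a statement of dexplot's implementation.

```python
import pandas as pd

# Counts by property type for two neighborhoods, copied from the crosstab
counts = pd.DataFrame(
    {'Shaw': [315, 52, 61, 86], 'Edgewood': [244, 65, 146, 155]},
    index=['Apartment', 'Condominium', 'House', 'Townhouse'],
)

# Relative to all of the data: every cell divided by the grand total
overall = counts / counts.to_numpy().sum()

# Relative to each neighborhood: every column divided by its own total
within_neighborhood = counts / counts.sum(axis=0)

print(overall.round(3))
print(within_neighborhood.round(3))
```

In the first table all cells together sum to 1; in the second, each neighborhood column sums to 1 on its own.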


Normalize over several variables at once with a list.

dxp.count(val='neighborhood', data=airbnb, split='superhost', 
          row='property_type', col='bedrooms', col_order=[1, 2],
          normalize=['neighborhood', 'property_type', 'bedrooms'], stacked=True)


Dexplot can also plot wide data, or data where no aggregation happens. Here is a scatter plot of the location of each listing.

dxp.scatter(x='longitude', y='latitude', data=airbnb, 
            split='neighborhood', col='bedrooms', col_order=[2, 3])


If you've already aggregated your data, you can plot it directly without specifying x or y.

df = airbnb.pivot_table(index='neighborhood', columns='property_type', 
                        values='price', aggfunc='mean')
df

property_type	Apartment	Condominium	House	Townhouse
neighborhood
Brightwood Park	96.119760	105.000000	121.671756	133.479452
Capitol Hill	141.210702	104.200000	170.153285	184.459459
Columbia Heights	114.676471	126.773196	135.292994	124.358621
Dupont Circle	146.858942	130.709677	179.574468	139.348837
Edgewood	108.508197	112.846154	156.335616	147.503226
Kalorama Heights	122.542254	155.928571	92.695652	158.230769
Shaw	153.888889	158.500000	202.114754	173.279070
Union Station	128.458204	133.833333	162.748571	162.167702

dxp.bar(data=df, orientation='h')


stocks = pd.read_csv('../data/stocks10.csv', parse_dates=['date'], index_col='date')
stocks.head()

	MSFT	AAPL	SLB	AMZN	TSLA	XOM	WMT	T	FB	V
date
1999-10-25	29.84	2.32	17.02	82.75	NaN	21.45	38.99	16.78	NaN	NaN
1999-10-26	29.82	2.34	16.65	81.25	NaN	20.89	37.11	17.28	NaN	NaN
1999-10-27	29.33	2.38	16.52	75.94	NaN	20.80	36.94	18.27	NaN	NaN
1999-10-28	29.01	2.43	16.59	71.00	NaN	21.19	38.85	19.79	NaN	NaN
1999-10-29	29.88	2.50	17.21	70.62	NaN	21.47	39.25	20.00	NaN	NaN

dxp.line(data=stocks.head(500))
