A to Z: Master Data Visualization with this Ruleset

Category: IT Technology · Published: 4 years ago

Image by Free-Photos from Pixabay

Intro:

Whether you’re trying to break into the world of data analytics or data science, or you’re a product manager, sales leader, or anybody seeking to understand their business, being able to use data in a meaningful way is key. Whether you’re using data visualization software like Tableau, Domo, or Power BI, or a language like R or Python, there are a variety of principles and concepts that will help you get started.

Purpose of your analysis:

Before anything else, keep in mind that any analysis should have some purpose. Without one, it’s easy to look at a chart and ask yourself, “why am I looking at this?” or “what am I supposed to get here?” To boil it down to a very simple principle: we want to understand the nature of a given variable and how that variable might relate to others.

Key things to keep in mind:

Dimensionality, Data types

Dimensionality:

Here we’re talking about the number of Ds in 3D. When you played Super Mario in two dimensions, you had an X-axis and a Y-axis. Most of us have seen a lot of two-dimensional charts and graphs. The way to think about this is, “how many variables do I want to include in a given visualization?” As a general rule, less is often more.

Data types:

Is a field numeric (something like age or weight), categorical (gender, hair color, etc.), or time-based (the date, month, or day something occurred)? Once you understand this and have some hypothesis about how certain variables may relate to one another, you can begin to decide which types of visualizations to use.

Language and datasets:

All of the visualizations were made using the ggplot2 package in R, with a variety of sample datasets including iris, mtcars, mpg, and economics. I won’t be including much R code here as I want this to be broadly applicable, but if you’d like the code for any of these, please comment below or reach out.

Jumping in:

There are other things we’d do to understand the data before we might actually make any visualizations, but we’ll jump right into the visualizations for the sake of getting to rules around visualization specifically.

We’ll go through a variety of options & rules by datatype & dimensionality, starting with a single dimension.

Number of dimensions: 1

Datatype: Numeric

Dataset: mtcars; sample dataset included in base R that gives a variety of datapoints on cars

Purpose: Understanding distribution & summary statistics

Charts: histogram & boxplot

When trying to understand a numeric variable in isolation, you’ll first seek to understand its distribution. For this, you’ll use a simple histogram, which groups values into ranges (bins) and shows how many observations fall into each. The first variable that we want to understand is horsepower (hp).

[Figure: histogram of horsepower (hp) from mtcars]

What we’re seeing here is that horsepower is right-skewed: the tail on the right side of the peak stretches further than the tail on the left. If you think about the typical horsepower for a car, most will be less than 250, but there are certainly still cars being made to push that envelope, albeit far fewer.
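For readers who want to see the numbers behind a histogram, here is a minimal Python sketch using only the standard library. The hp values are transcribed from R's mtcars dataset, and the bin width of 50 is my own arbitrary assumption, not necessarily the binning used in the chart.

```python
from collections import Counter
from statistics import mean, median

# Horsepower (hp) values transcribed from R's mtcars dataset (32 cars).
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113,
      264, 175, 335, 109]

BIN_WIDTH = 50  # assumed bin width; pick whatever suits your data

# Assign each value to the lower edge of its bin, e.g. 110 -> 100.
bin_counts = Counter((value // BIN_WIDTH) * BIN_WIDTH for value in hp)

# Print a text histogram: one row per bin, one '#' per car.
for lower in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    count = bin_counts.get(lower, 0)
    print(f"{lower:>3}-{lower + BIN_WIDTH - 1:<3} {'#' * count} ({count})")

# In a right-skewed distribution the long tail pulls the mean above the median.
print("mean:", mean(hp), "median:", median(hp))
```

Most cars land in the lower bins and only a couple in the highest ones, which is the right skew described above; the mean sitting above the median is another quick check for it.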

[Figure: box plot of horsepower (hp)]

This is a box plot (also called a box-and-whisker plot) for the same variable, hp. Box plots are great for visualizing a number of summary statistics for a given variable. The lines at the ends of the whiskers represent the max and min; note that many tools, ggplot2 included, stop the whiskers at the most extreme value within 1.5 × IQR of the box and draw anything beyond that as individual outlier points. The dark horizontal line is the median. The box the median sits within represents the IQR, or interquartile range: if you break your data into four equal quartiles, the IQR is the range between the 1st and 3rd quartiles.
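The statistics a box plot draws can be computed directly. The sketch below is one way to do it in Python, assuming the "inclusive" quantile method (which, to my understanding, matches R's default quantile rule) and the common 1.5 × IQR convention for flagging outliers. The hp values are again transcribed from mtcars.

```python
from statistics import quantiles

# Horsepower (hp) values transcribed from R's mtcars dataset (32 cars).
hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113,
      264, 175, 335, 109]

# Three cut points split the data into four quartiles: Q1, median, Q3.
q1, q2, q3 = quantiles(hp, n=4, method="inclusive")
iqr = q3 - q1  # the height of the box

# Assumed convention: values more than 1.5 * IQR beyond the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in hp if v < lower_fence or v > upper_fence]

print("min:", min(hp), "Q1:", q1, "median:", q2, "Q3:", q3, "max:", max(hp))
print("IQR:", iqr, "outliers:", outliers)
```

Other tools use slightly different quantile rules, so the box edges can shift a little from one package to another.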

[Figure: histogram of miles per gallon (mpg)]

Here we see a histogram of miles per gallon, and we see a slightly right-skewed distribution. It also appears nearly bimodal. Bimodality is when there are effectively two peaks, and a common explanation is that two distinct groups are overlapping in one distribution. As a hypothetical example, let’s say the average mpg for a gas-powered car is between 15–20, but for electric it’s the equivalent of 30–35; given enough volume of each, we could see two peaks in our distribution. (mtcars covers 1973–74 model cars, so the grouping here would have to be something else, such as engine size.)

[Figure: histogram of quarter-mile time (qsec)]

We’re now looking at qsec, a car performance metric: the time it takes the car to travel ¼ of a mile. What we see here is a roughly normal, bell-shaped distribution.

Number of dimensions: 1

Datatype: Categorical

Dataset: mpg; sample dataset included with the ggplot2 package that gives a variety of datapoints on cars

Purpose: Understanding proportions

Charts: bar chart & pie chart

When trying to understand a single categorical variable in isolation, the main thing you want to consider is how often each category occurs.

Now let's take a look at the transmission & class variables from the mpg dataset. We’ll do so by creating a bar chart with the categorical variable on the X-axis & the count of occurrences on the Y-axis.

[Figure: bar chart of counts by transmission type]

Here you can get an idea of which transmissions occur frequently versus those that are rarer. Typically you don’t need to include color; I’ve added it here just to make things a tad clearer.

[Figure: bar chart of counts by class]

Similarly, we see the count of occurrences charted by class. We can see that 2seaters and minivans occur far less frequently than SUVs or compacts. This bar chart could just as easily be shown as a pie chart. With a pie chart, though, it can be a tad more difficult to judge the size of a given slice than the height of a bar, as slices sit at different angles, on different sides of the pie, etc.

[Figure: pie chart of class]

As mentioned, here is a pie chart of the class variable.
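To make the bar-versus-pie comparison concrete, this small Python sketch computes what each chart encodes: a bar's height is the raw count, while a pie slice's angle is that count's share of 360 degrees. The class counts are transcribed from ggplot2's mpg dataset as I recall them (234 rows in total), so treat them as an assumption and check against your own copy.

```python
# Count of cars per class, transcribed from ggplot2's mpg dataset (assumption).
class_counts = {
    "2seater": 5, "compact": 47, "midsize": 41, "minivan": 11,
    "pickup": 33, "subcompact": 35, "suv": 62,
}

total = sum(class_counts.values())

# Share of the whole for each class; these are the pie slices.
shares = {name: count / total for name, count in class_counts.items()}

# List classes from most to least common, with the angle each slice would span.
for name, count in sorted(class_counts.items(), key=lambda item: item[1], reverse=True):
    angle = shares[name] * 360
    print(f"{name:<11} count={count:>3} share={shares[name]:6.1%} angle={angle:5.1f} deg")
```

Subcompact and pickup differ by only a few degrees of arc, which is much harder to spot in a pie than a two-unit difference in bar height.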

Additionally, we can use bar charts to plot other aggregations by categorical variables, for instance the average mpg for each car class, but we’ll get into that later.
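As a rough illustration of that kind of aggregation, the Python sketch below groups records by category and takes the mean of a numeric field, which is the number a bar chart of averages would plot. The records are hypothetical values invented for this example, not rows from the mpg dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (class, highway mpg) records, invented for illustration only.
records = [
    ("compact", 29), ("compact", 31),
    ("suv", 18), ("suv", 20), ("suv", 16),
    ("pickup", 17), ("pickup", 15),
]

# Collect the numeric values under each category.
values_by_class = defaultdict(list)
for car_class, hwy in records:
    values_by_class[car_class].append(hwy)

# One aggregated number per category: this becomes the bar height.
mean_by_class = {car_class: mean(values) for car_class, values in values_by_class.items()}

for car_class, avg in sorted(mean_by_class.items()):
    print(f"{car_class:<8} {'#' * round(avg)} {avg:.1f}")
```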

Ok so now we’ve looked at numeric & categorical variables in isolation; let's increase the number of dimensions we’re charting to two and look at some different combinations.

Number of dimensions: 2

Datatype: numeric

Dataset: iris & mpg; iris is included in base R and gives a variety of measurements for three species of iris, and mpg is included with ggplot2

Purpose: Understanding the relationship between two variables

Charts: Scatter plot

Whenever you’re trying to understand the relationship between two numeric variables, a scatter plot is the standard choice.

[Figure: scatter plot of sepal length (Y-axis) against sepal width (X-axis)]

Here we are trying to observe the relationship between the length (Y Axis) and width (X axis) of a sepal (for the plant anatomy, just run a quick google search.. :) )

I also looked at the correlation for these two variables. Correlation measures how two variables move together: 1 means they move perfectly in sync, and -1 means they move perfectly inversely. As a rough guide, .5 or -.5 is a good relationship, .3 or -.3 is a weak one, and anything much closer to 0 suggests little or no linear relationship. Here I found that it was -.11, suggesting that there is no real relationship. While it appears that these two variables are unrelated, it is important to consider the many potential layers of a relationship among variables.
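For anyone curious how that number is produced, here is a minimal Python sketch of the Pearson correlation coefficient, which I am assuming is the measure used above. The helper name `pearson` and the toy data are hypothetical, chosen so the three reference cases (in sync, inverse, no linear relationship) are easy to see.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How the two variables vary together, and how much each varies alone.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)


# Hypothetical toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
in_sync = [2, 4, 6, 8, 10]    # rises exactly as xs rises
inverse = [10, 8, 6, 4, 2]    # falls exactly as xs rises
no_linear = [2, 5, 1, 5, 2]   # no overall upward or downward pattern

print("in sync:  ", round(pearson(xs, in_sync), 2))
print("inverse:  ", round(pearson(xs, inverse), 2))
print("no linear:", round(pearson(xs, no_linear), 2))
```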

I’ll talk more about this later, but I’ll give a sneak peek now. While in the previous plot we saw that width and length appeared not to relate, it is important to include as many potential perspectives as possible. Looking at the same plot as before, we are going to add one more dimension to it: species. We will visualize species using color.

[Figure: the same scatter plot with points colored by species]

Once we add in the third dimension, we can see that within each species there is a clear positive linear relationship between length & width.

Below I’ve included the correlation when grouping by species and we can see on the high end a correlation of .74 and on the low end .46, which is still considerable.
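The same effect can be reproduced with a tiny made-up example. The Python sketch below uses hypothetical rows (not iris measurements) in which the pooled correlation is negative even though the relationship inside each group is perfectly positive; `pearson` is a helper defined here for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)


# Hypothetical (group, width, length) rows, invented for illustration only.
rows = [
    ("A", 1, 5), ("A", 2, 6), ("A", 3, 7),
    ("B", 5, 1), ("B", 6, 2), ("B", 7, 3),
]

# Correlation over all rows, ignoring the grouping variable.
pooled = pearson([width for _, width, _ in rows],
                 [length for _, _, length in rows])

# Correlation computed separately within each group.
by_group = defaultdict(lambda: ([], []))
for group, width, length in rows:
    by_group[group][0].append(width)
    by_group[group][1].append(length)
per_group = {group: pearson(ws, ls) for group, (ws, ls) in by_group.items()}

print(f"pooled: {pooled:.2f}")
for group, r in sorted(per_group.items()):
    print(f"group {group}: {r:.2f}")
```

Mixing groups that sit at different locations can hide, or even reverse, the pattern that holds inside each one, which is why adding the categorical dimension changed the picture.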

[Table: correlation of sepal length and width by species]

[Figure: scatter plot of city MPG and highway MPG]

Now let's get back to assessing two numeric variables. Here we’re looking at City MPG and Highway MPG. Here the scatter is moving up and to the right in a linear fashion, indicating a positively correlated relationship between the two. These two variables correlate at .96.

Something to keep in mind if you’re new to statistics: even though these variables move together, it doesn’t necessarily mean that one is causing the other. It just indicates they’re related.

To continue evaluating other combinations of two dimensional data, let's consider how we might analyze the relationship between two categorical variables.

Number of dimensions: 2

Datatype: categorical

Dataset: mpg; sample dataset included with ggplot2

Purpose: Understanding the relationship between two variables

Charts: table, heatmap, bar chart

Before jumping into visualizations, I’m going to show two categorical variables in a table, since any visualization of two categorical dimensions is essentially a picture of what we’d find in that table.

The table below is from the mpg dataset; we’re looking at the class of car and whether the car is four wheel drive, front wheel drive, or rear wheel drive.

Here we can see the frequency of records for each combination of the two categorical variables.

[Table: counts by class and drive type]

At a quick glance, we can see that four-wheel-drive SUVs are the most common. Another layer you can add to this is looking at each cell as a percent of the whole.

[Table: each class and drive type combination as a proportion of the whole]

There is more to do with prop tables, but we’ll save that for another time.
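For reference, a proportion table is just each cell count divided by the grand total. The Python sketch below works through that, using cell counts transcribed from ggplot2's mpg dataset as I recall them (class by drv, where 4 = four-wheel, f = front-wheel, r = rear-wheel); treat the counts as an assumption and verify against your own copy.

```python
# Cell counts for class x drive type, transcribed from ggplot2's mpg dataset
# (assumption). Combinations that never occur are simply absent.
counts = {
    ("2seater", "r"): 5,
    ("compact", "4"): 12, ("compact", "f"): 35,
    ("midsize", "4"): 3, ("midsize", "f"): 38,
    ("minivan", "f"): 11,
    ("pickup", "4"): 33,
    ("subcompact", "4"): 4, ("subcompact", "f"): 22, ("subcompact", "r"): 9,
    ("suv", "4"): 51, ("suv", "r"): 11,
}

total = sum(counts.values())
classes = sorted({car_class for car_class, _ in counts})
drives = sorted({drive for _, drive in counts})

# Each cell as a proportion of the whole table.
proportions = {cell: count / total for cell, count in counts.items()}

# Print the proportion table: one row per class, one column per drive type.
print(f"{'class':<11}" + "".join(f"{drive:>7}" for drive in drives))
for car_class in classes:
    cells = "".join(f"{proportions.get((car_class, drive), 0):>7.1%}" for drive in drives)
    print(f"{car_class:<11}{cells}")

# The three most common combinations, largest first.
top3 = sorted(counts, key=counts.get, reverse=True)[:3]
print("most common:", top3)
```

These proportions (or the raw counts) are exactly the values a heatmap would map to color.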

From here we’re looking to visualize what we’re seeing in the prop table.

An excellent way to turn this table into a visualization is with a heatmap. Take a look at the chart below. Each axis shows one of the categorical variables, and the color of each cell represents the count of occurrences.

[Figure: heatmap of counts by class and drive type]

As we saw in the original table, SUV, 4 wheel drive is the most common, with midsize front-wheel drive coming in second and compact front wheel drive taking third.

A heatmap presents some difficulty in terms of being able to gauge exactly what the count is. We have a legend, and depending on the software you’re using you may be able to serve that up easily as a tooltip.

Another option is to go back to the bar chart, with one of the categorical variables on the x-axis, the numeric variable (count) on the y-axis, and the second categorical variable represented by color.

[Figure: bar chart of counts with color showing the second categorical variable]

Here again, we can see which of the combinations is the most frequently occurring, thus bringing us to a similar end.

This next plotting option is very similar, but rather than just representing the third dimension on color, we can also use faceting. Faceting is a technique that allows you to give each level of a categorical variable its own plot. Take a look below.

[Figure: bar charts faceted by class]

As mentioned, here we can see a plot very similar to the previous ones, but each level of class is broken out into its own plot.

At this point, we’ve seen a very similar result with various charts. The considerations you have to make are: How well does a given plot convey your message or produce the necessary insight? Is it overly complex? Is it taking up too much space? What screen sizes will potential stakeholders be viewing your charts on?

Now we’re on to plots with multiple dimensions.

Number of dimensions: 3–5

Datatype: numeric & categorical

Dataset: mpg; sample dataset included with ggplot2

Purpose: Understanding the relationship between multiple variables

Charts: Scatter

[Figure: scatter plot of city and highway MPG with point size showing engine displacement]

Here we see the same plot as before with a single modification. To include the third dimension of ‘engine displacement’, we now vary the size of the points according to engine displacement. As we can see, displacement seems to relate inversely to city and highway gas mileage: larger engines tend to get lower mileage.
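Under the hood, mapping a variable to point size means rescaling its values into a range of marker sizes. The Python sketch below shows a simple linear version of that idea; the function name `rescale`, the displacement values, and the size range of 1 to 6 are all hypothetical choices of mine. Plotting libraries differ in the details (ggplot2, for example, scales point area rather than radius by default).

```python
def rescale(values, out_min, out_max):
    """Linearly map values onto the range [out_min, out_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: give every point the middle size.
        return [(out_min + out_max) / 2 for _ in values]
    return [out_min + (v - lo) / (hi - lo) * (out_max - out_min) for v in values]


# Hypothetical engine displacements in litres, invented for illustration only.
displ = [1.6, 2.0, 2.8, 4.0, 5.7, 7.0]

# Smallest engine gets the smallest marker, largest engine the largest.
sizes = rescale(displ, 1.0, 6.0)

for d, s in zip(displ, sizes):
    print(f"displ={d:.1f} -> point size {s:.2f}")
```

Mapping to color works the same way, except the output range is a color gradient instead of a marker size.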

Now an alternative option for adding a dimension is color.

[Figure: the same scatter plot with color showing engine displacement]

Rather than using size, I’ve now included color to indicate engine displacement.

Below I’ve included an additional plot where we do both, which makes it even easier to see.

[Figure: the same scatter plot with both size and color showing engine displacement]

One thing to keep in mind is that instead of putting engine displacement in color and size, we could actually introduce a fourth numeric dimension.

[Figure: the same scatter plot with size showing cyl]

Here I’ve swapped in cyl (number of cylinders) for the size dimension, while color still shows engine displacement; cyl also appears to relate inversely to highway and city mpg.

A couple of other options we could introduce here could be to use color to introduce a categorical variable or to facet according to a categorical variable.

[Figure: the same scatter plot faceted by class]

Here you can see the same plot now faceted by class, which allows us to see how these numeric variables relate to one another across different levels of a categorical variable.

Before we wrap up, a final case to consider is two-dimensional data where one variable represents time. A great rule of thumb for time is to use a line chart.

Number of dimensions: 2

Datatype: numeric & time

Dataset: economics; sample dataset included with ggplot2

Purpose: Understanding the relationship between time and a numeric variable

Chart: line

The data we’ll be looking at comes from the economics sample dataset and represents unemployment over the last 50 years or so.

[Figure: line chart of unemployment over time]

When charting a dataset with a time dimension, we are attempting to identify the trend in a given numeric variable and whether particular events or activity might coincide with its movement. Put your time dimension on the X-axis and whatever numeric variable you are measuring on the Y-axis.
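Two small preparation steps usually sit behind a line chart: ordering the observations by time, and optionally smoothing them so the trend stands out. The Python sketch below shows both with a simple moving average. The observations and the window of 3 are hypothetical, invented for this example rather than taken from the economics dataset.

```python
from datetime import date

# Hypothetical (month, value) observations, deliberately out of order.
observations = [
    (date(2021, 3, 1), 11), (date(2021, 1, 1), 10), (date(2021, 2, 1), 12),
    (date(2021, 6, 1), 17), (date(2021, 4, 1), 15), (date(2021, 5, 1), 18),
]

# Sort by date so time runs left to right along the X-axis.
series = sorted(observations)
values = [value for _, value in series]

WINDOW = 3  # assumed smoothing window, in number of observations

# Moving average: each point is the mean of the current and previous two values.
moving_avg = [
    sum(values[i - WINDOW + 1:i + 1]) / WINDOW
    for i in range(WINDOW - 1, len(values))
]

for (month, _), avg in zip(series[WINDOW - 1:], moving_avg):
    print(month.isoformat(), f"{avg:.2f}")
```

Here the smoothed values rise steadily, even though the raw series dips from one month to the next.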

There is a lot more that could potentially be done even here, but I’ll save that for another time.

Takeaway:

  • 1 dimension
      • numeric: histogram, box plot
      • categorical: table, bar chart, pie chart
  • 2 dimensions
      • numeric/numeric: scatter
      • numeric/categorical: bar chart
      • categorical/categorical: bar chart, table, heatmap
      • numeric/time: line chart
  • 3 dimensions
      • numeric/numeric/numeric: scatter with size/color
      • numeric/numeric/categorical: scatter with color/facets
      • numeric/categorical/categorical: bar chart with color/facets
  • 4+ dimensions
      • Use varying combinations of datatypes and variables with your x & y axis, as well as color, fill, size, facets, etc.
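If it helps to have the ruleset in a form you can query, here is a minimal Python sketch that encodes it as a lookup table. The names `CHART_RULES` and `suggest_charts` are hypothetical, and the entries come from the list above plus the heatmap and line chart sections earlier in the article.

```python
# Data-type combinations mapped to suggested charts. Keys are sorted tuples
# so that the order in which types are passed does not matter.
CHART_RULES = {
    ("numeric",): ["histogram", "box plot"],
    ("categorical",): ["table", "bar chart", "pie chart"],
    ("numeric", "numeric"): ["scatter"],
    ("categorical", "numeric"): ["bar chart"],
    ("categorical", "categorical"): ["bar chart", "table", "heatmap"],
    ("numeric", "time"): ["line chart"],
    ("numeric", "numeric", "numeric"): ["scatter with size/color"],
    ("categorical", "numeric", "numeric"): ["scatter with color/facets"],
    ("categorical", "categorical", "numeric"): ["bar chart with color/facets"],
}

# Fallback for combinations not listed, such as four or more dimensions.
FALLBACK = ["combine x & y axes with color, fill, size, facets, etc."]


def suggest_charts(*data_types):
    """Return suggested chart types for the given variable data types."""
    key = tuple(sorted(data_types))
    return CHART_RULES.get(key, FALLBACK)


print(suggest_charts("numeric"))
print(suggest_charts("numeric", "categorical"))
print(suggest_charts("time", "numeric"))
print(suggest_charts("numeric", "numeric", "numeric", "categorical"))
```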

There is so much you can do with data visualization & this is just the start of it. Here’s to hoping this helps you get started!

Add yourself to my email list if this was helpful; also be sure to let me know if you’d prefer code examples, additional information, etc.

Come check out some of my other posts at datasciencelessons.com & happy data science-ing!

