Hands-On with the dplyr Package in R

Category: IT Technology · Published: 4 years ago

1. Introduction to dplyr

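dplyr is an R package for data analysis, similar to pandas in Python, that makes it convenient to process and analyze data stored in a data frame. The odd name puzzled me at first; one explanation I found is:

- d stands for data frame
- plyr is a near-homophone of the English word "pliers"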

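Like most R packages, dplyr follows a functional style, which differs noticeably from the object-oriented style common in Python. An advantage is that beginners tend to pick up this way of thinking easily: it resembles an assembly line in which each function is a workshop, and several workshops together complete one production (data analysis) task.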

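dplyr provides the pipe operator %>% (re-exported from the magrittr package). The left side of the operator is the input data, and the right side is the downstream processing step.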
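To make the assembly-line picture concrete, here is a minimal sketch comparing a nested call with the equivalent piped call. The `prices` table is a hypothetical toy dataset built inline, not one of the article's CSV files.

```r
library(dplyr)

# Hypothetical toy data (assumption: not taken from aapl.csv)
prices <- tibble(Close = c(144.18, 142.73, 144.09))

# Nested form: must be read from the inside out
nested <- head(arrange(prices, Close), 2)

# Piped form: read left to right, each step feeds the next
piped <- prices %>% arrange(Close) %>% head(2)

# Both forms should produce the same table
identical(nested, piped)
```

The piped form passes the left-hand value as the first argument of the function on the right, which is why `arrange(prices, Close)` and `prices %>% arrange(Close)` are interchangeable.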

2. Installing and loading dplyr

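The p_load function from the pacman package combines the following two steps:

install.packages("dplyr")
library(dplyr)

This form is simpler to use (pacman itself must already be installed):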

pacman::p_load("dplyr")
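If pacman is not available, a similar "install only when missing" behaviour can be sketched in base R. This is one possible approach under that assumption, not a description of how p_load works internally.

```r
# Install dplyr only when it is not already available, then attach it
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)
```

`requireNamespace()` returns FALSE instead of raising an error when the package is missing, which makes it suitable for this kind of check.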

3. Reading the data

# Set the working directory
setwd("/Users/thunderhit/Desktop/dplyr_learn")

# Load the CSV data
aapl <- read.csv('aapl.csv', 
                 header=TRUE,
                 sep=',',
                 stringsAsFactors = FALSE) %>% as_tibble()
head(aapl)

A tibble: 6 × 6
Date	Open	High	Low	Close	Volume
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
7-Jul-17	142.90	144.75	142.90	144.18	19201712
6-Jul-17	143.02	143.50	142.41	142.73	24128782
5-Jul-17	143.69	144.79	142.72	144.09	21569557
3-Jul-17	144.88	145.30	143.10	143.50	14277848
30-Jun-17	144.45	144.96	143.78	144.02	23024107
29-Jun-17	144.71	145.13	142.28	143.68	31499368

Check the class of the data:

class(aapl)

'tbl_df'
'tbl'
'data.frame'

List the column names:

colnames(aapl)

'Date'
'Open'
'High'
'Low'
'Close'
'Volume'

Get the number of rows and columns:

dim(aapl)
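The CSV file is not included here, so the following sketch rebuilds a small table from the first three rows of the head(aapl) output shown earlier. It relies on the `text` argument of `read.csv`, which parses a string instead of a file; the name `aapl_demo` is made up for this illustration.

```r
library(dplyr)

# First three rows of the head(aapl) output, embedded as text so no file is needed
csv_text <- "Date,Open,High,Low,Close,Volume
7-Jul-17,142.90,144.75,142.90,144.18,19201712
6-Jul-17,143.02,143.50,142.41,142.73,24128782
5-Jul-17,143.69,144.79,142.72,144.09,21569557"

aapl_demo <- read.csv(text = csv_text,
                      header = TRUE,
                      stringsAsFactors = FALSE) %>% as_tibble()

class(aapl_demo)     # class vector of the tibble
colnames(aapl_demo)  # column names
dim(aapl_demo)       # number of rows, number of columns
```

Note that the Date column is read as plain character strings, matching the `<chr>` type in the output above.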

4. Common dplyr functions

4.1 Arrange

Sort the aapl data by the Volume column in descending order:

arrange(aapl, -Volume)

A tibble: 6 × 6
Date	Open	High	Low	Close	Volume
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
14-Sep-16	108.73	113.03	108.60	111.77	112340318
1-Feb-17	127.03	130.49	127.01	128.75	111985040
27-Jul-16	104.26	104.35	102.75	102.95	92344820
15-Sep-16	113.86	115.73	113.49	115.57	90613177
16-Sep-16	115.12	116.13	114.04	114.92	79886911
12-Jun-17	145.74	146.09	142.51	145.42	72307330

We can also use the pipe %>%. Both forms give the same result. After some use, the pipe form tends to feel more readable, so all the code below uses %>%.

aapl %>% arrange(-Volume)

A tibble: 6 × 6
Date	Open	High	Low	Close	Volume
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
14-Sep-16	108.73	113.03	108.60	111.77	112340318
1-Feb-17	127.03	130.49	127.01	128.75	111985040
27-Jul-16	104.26	104.35	102.75	102.95	92344820
15-Sep-16	113.86	115.73	113.49	115.57	90613177
16-Sep-16	115.12	116.13	114.04	114.92	79886911
12-Jun-17	145.74	146.09	142.51	145.42	72307330
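The minus sign works because Volume is numeric. dplyr also provides `desc()`, which expresses the same intent and additionally works on character columns. The sketch below uses three rows copied from the earlier output; the table name `vol` is hypothetical.

```r
library(dplyr)

# Three rows taken from the head(aapl) output
vol <- tibble(
  Date   = c("7-Jul-17", "6-Jul-17", "5-Jul-17"),
  Close  = c(144.18, 142.73, 144.09),
  Volume = c(19201712L, 24128782L, 21569557L)
)

# Two equivalent ways to sort a numeric column in descending order
by_minus <- vol %>% arrange(-Volume)
by_desc  <- vol %>% arrange(desc(Volume))

# desc() also accepts character columns; this is a plain string sort,
# not a chronological one, because Date is stored as text
by_date_desc <- vol %>% arrange(desc(Date))
```

Negating a character column with `-` raises an error, so `desc()` is the safer habit when column types are not known in advance.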

4.2 Select

Select the three columns Date, Close, and Volume:

aapl %>% select(Date, Close, Volume)

A tibble: 6 × 3
Date	Close	Volume
<chr>	<dbl>	<int>
7-Jul-17	144.18	19201712
6-Jul-17	142.73	24128782
5-Jul-17	144.09	21569557
3-Jul-17	143.50	14277848
30-Jun-17	144.02	23024107
29-Jun-17	143.68	31499368

Selecting only Date, Close, and Volume can also be expressed as "exclude Open, High, and Low, and keep the remaining columns":

aapl %>% select(-c("Open", "High", "Low"))

A tibble: 6 × 3
Date	Close	Volume
<chr>	<dbl>	<int>
7-Jul-17	144.18	19201712
6-Jul-17	142.73	24128782
5-Jul-17	144.09	21569557
3-Jul-17	143.50	14277848
30-Jun-17	144.02	23024107
29-Jun-17	143.68	31499368
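A few other ways to write the same selection, sketched on a single row copied from the output above. The variable names (`row1`, `drop_cols`) are made up for this example; `all_of()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# One row from the head(aapl) output
row1 <- tibble(Date = "7-Jul-17", Open = 142.90, High = 144.75,
               Low = 142.90, Close = 144.18, Volume = 19201712L)

# Name the columns to keep
keep <- row1 %>% select(Date, Close, Volume)

# Drop columns whose names are stored in a character vector
drop_cols <- c("Open", "High", "Low")
dropped <- row1 %>% select(-all_of(drop_cols))

# Use a range of adjacent columns (Close through Volume)
ranged <- row1 %>% select(Date, Close:Volume)
```

Wrapping the vector in `all_of()` makes it explicit that `drop_cols` is an external variable holding column names rather than a column of the table.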

4.3 Filter

Keep only the rows that satisfy a condition.

# Select the AAPL trading days whose closing price is at least 150 USD
aapl %>% filter(Close>=150)

A tibble: 6 × 6
Date	Open	High	Low	Close	Volume
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
8-Jun-17	155.25	155.54	154.40	154.99	21250798
7-Jun-17	155.02	155.98	154.48	155.37	21069647
6-Jun-17	153.90	155.81	153.78	154.45	26624926
5-Jun-17	154.34	154.45	153.46	153.93	25331662
2-Jun-17	153.58	155.45	152.89	155.45	27770715
1-Jun-17	153.17	153.33	152.22	153.18	16404088

Select the AAPL trading days whose closing price is at least 150 USD and is also higher than the opening price:

aapl %>% filter((Close>=150) & (Close>Open))

A tibble: 11 × 6
Date	Open	High	Low	Close	Volume
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
7-Jun-17	155.02	155.98	154.48	155.37	21069647
6-Jun-17	153.90	155.81	153.78	154.45	26624926
2-Jun-17	153.58	155.45	152.89	155.45	27770715
1-Jun-17	153.17	153.33	152.22	153.18	16404088
30-May-17	153.42	154.43	153.33	153.67	20126851
25-May-17	153.73	154.35	153.03	153.87	19235598
18-May-17	151.27	153.34	151.13	152.54	33568215
12-May-17	154.70	156.42	154.67	156.10	32527017
11-May-17	152.45	154.07	152.31	153.95	27255058
9-May-17	153.87	154.88	153.45	153.99	39130363
8-May-17	149.03	153.70	149.03	153.01	48752413
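In `filter()`, several conditions separated by commas are combined with AND, so the `&` form above can be written without the extra parentheses. The sketch below checks this on four rows copied from the outputs in this article; the table name `days` is hypothetical.

```r
library(dplyr)

# Four rows taken from the outputs above (Open and Close only)
days <- tibble(
  Date  = c("7-Jul-17", "8-Jun-17", "7-Jun-17", "8-May-17"),
  Open  = c(142.90, 155.25, 155.02, 149.03),
  Close = c(144.18, 154.99, 155.37, 153.01)
)

# Explicit AND
and_form <- days %>% filter((Close >= 150) & (Close > Open))

# Comma-separated conditions are also combined with AND
comma_form <- days %>% filter(Close >= 150, Close > Open)

# OR must always be written explicitly with |
either_form <- days %>% filter(Close >= 150 | Close > Open)
```

Here `and_form` and `comma_form` keep only 7-Jun-17 and 8-May-17, while `either_form` keeps all four rows.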

4.4 Mutate

Create new columns computed from existing ones.

# Define maxDif as the highest price (High) minus the lowest price (Low), then take its natural log
aapl %>% mutate(maxDif = High-Low,
                log_maxDif=log(maxDif))

A tibble: 6 × 8
Date	Open	High	Low	Close	Volume	maxDif	log_maxDif
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>	<dbl>	<dbl>
7-Jul-17	142.90	144.75	142.90	144.18	19201712	1.85	0.6151856
6-Jul-17	143.02	143.50	142.41	142.73	24128782	1.09	0.0861777
5-Jul-17	143.69	144.79	142.72	144.09	21569557	2.07	0.7275486
3-Jul-17	144.88	145.30	143.10	143.50	14277848	2.20	0.7884574
30-Jun-17	144.45	144.96	143.78	144.02	23024107	1.18	0.1655144
29-Jun-17	144.71	145.13	142.28	143.68	31499368	2.85	1.0473190

Get the position (row number) of each record:

aapl  %>% mutate(n=row_number())

A tibble: 6 × 7
Date	Open	High	Low	Close	Volume	n
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<int>	<int>
7-Jul-17	142.90	144.75	142.90	144.18	19201712	1
6-Jul-17	143.02	143.50	142.41	142.73	24128782	2
5-Jul-17	143.69	144.79	142.72	144.09	21569557	3
3-Jul-17	144.88	145.30	143.10	143.50	14277848	4
30-Jun-17	144.45	144.96	143.78	144.02	23024107	5
29-Jun-17	144.71	145.13	142.28	143.68	31499368	6
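As the maxDif example shows, a column created inside `mutate()` can be used by later expressions in the same call. The sketch below builds on that with two more derived columns; `direction` and `maxDif_pct` are names invented for this illustration, and the two rows are copied from the outputs above.

```r
library(dplyr)

# Two rows taken from the outputs above
demo <- tibble(
  Open  = c(142.90, 155.25),
  High  = c(144.75, 155.54),
  Low   = c(142.90, 154.40),
  Close = c(144.18, 154.99)
)

res <- demo %>%
  mutate(
    maxDif     = High - Low,                            # daily price range
    maxDif_pct = maxDif / Open * 100,                   # reuses maxDif from the line above
    direction  = if_else(Close > Open, "up", "down")    # label each day
  )
```

`if_else()` is dplyr's stricter variant of base R's `ifelse()`: it requires the two result branches to have compatible types.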

4.5 Group_By

Group the data. Here we load a new dataset, weather:

# Load the CSV data
weather <- read.csv('weather.csv', 
                    header=TRUE,
                    sep=',',
                    stringsAsFactors = FALSE) %>% as_tibble()  
weather

A tibble: 6 × 5
Date	city	temperature	windspeed	event
<chr>	<chr>	<int>	<int>	<chr>
1/1/2017	new york	32	6	Rain
1/1/2017	mumbai	90	5	Sunny
1/1/2017	paris	45	20	Sunny
1/2/2017	new york	36	7	Sunny
1/2/2017	mumbai	85	12	Fog
1/2/2017	paris	50	13	Cloudy

Group by city:

weather %>% group_by(city)

A grouped_df: 6 × 5
Date	city	temperature	windspeed	event
<chr>	<chr>	<int>	<int>	<chr>
1/1/2017	new york	32	6	Rain
1/1/2017	mumbai	90	5	Sunny
1/1/2017	paris	45	20	Sunny
1/2/2017	new york	36	7	Sunny
1/2/2017	mumbai	85	12	Fog
1/2/2017	paris	50	13	Cloudy

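On its own, group_by does not change the rows that are displayed. To see the effect of grouping, compute the mean temperature separately for each city: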

weather %>% group_by(city) %>% summarise(mean_temperature = mean(temperature))

`summarise()` ungrouping output (override with `.groups` argument)

A tibble: 3 × 2
city	mean_temperature
<chr>	<dbl>
mumbai	87.5
new york	34.0
paris	47.5

For comparison, calling summarise without group_by returns a single mean over all rows:

weather %>% summarise(mean_temperature = mean(temperature))

A tibble: 1 × 1
mean_temperature
<dbl>
56.33333
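A single `summarise()` call can compute several statistics per group, and setting `.groups = "drop"` avoids the ungrouping message shown earlier. The sketch below rebuilds the six weather rows inline (the `event` column is omitted); the names `weather_demo`, `max_windspeed`, and `days` are assumptions for this example, and `.groups` requires dplyr 1.0.0 or later.

```r
library(dplyr)

# The six weather rows shown above, without the event column
weather_demo <- tibble(
  city        = c("new york", "mumbai", "paris", "new york", "mumbai", "paris"),
  temperature = c(32L, 90L, 45L, 36L, 85L, 50L),
  windspeed   = c(6L, 5L, 20L, 7L, 12L, 13L)
)

stats <- weather_demo %>%
  group_by(city) %>%
  summarise(
    mean_temperature = mean(temperature),  # average per city
    max_windspeed    = max(windspeed),     # strongest wind per city
    days             = n(),                # number of rows per city
    .groups = "drop"                       # return an ungrouped tibble, no message
  )
```

The result has one row per city, ordered by the grouping key, with the same mean temperatures as in the output above.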
