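1. Introduction to dplyr
dplyr is an R package for data analysis, comparable to pandas in Python, that makes it convenient to process and analyze data frames. The name puzzled me at first; one explanation I came across is:
- d stands for data frame
- plyr is a play on the English word "pliers"
dplyr also gives you the pipe operator %>% (re-exported from the magrittr package): the data on its left is the input, and the expression on its right is the next processing step.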
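To make the left-to-right reading concrete, here is a minimal sketch that compares a nested call with its piped equivalent. The `prices` data frame and its values are made up for illustration and are not part of the article's dataset.

```r
library(dplyr)  # dplyr re-exports %>% from the magrittr package

# A tiny, hypothetical data frame used only for illustration
prices <- tibble(Close = c(144.18, 142.73, 144.09))

# Nested call: has to be read from the inside out
nested <- head(arrange(prices, Close), 2)

# Piped call: reads left to right, one step at a time
piped <- prices %>% arrange(Close) %>% head(2)

# Both forms should produce the same result
identical(nested, piped)
```

In general, `x %>% f(y)` is evaluated as `f(x, y)`: the left-hand side becomes the first argument of the function on the right.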
2. Installing and loading dplyr
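The original code for this step is not shown, so the following is only a sketch of the usual approach, assuming dplyr is installed from CRAN. The `requireNamespace()` guard is an addition here so the install runs only when the package is missing.

```r
# Install dplyr from CRAN only if it is not already available
# (needs an internet connection and a configured CRAN mirror)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package; this has to be done once in every new R session
library(dplyr)
```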
3. Reading the data
# Set the working directory
setwd("/Users/thunderhit/Desktop/dplyr_learn")
# Import the CSV data
aapl <- read.csv('aapl.csv', header=TRUE, sep=',', stringsAsFactors = FALSE) %>% as_tibble()
head(aapl)
Date | Open | High | Low | Close | Volume |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
7-Jul-17 | 142.90 | 144.75 | 142.90 | 144.18 | 19201712 |
6-Jul-17 | 143.02 | 143.50 | 142.41 | 142.73 | 24128782 |
5-Jul-17 | 143.69 | 144.79 | 142.72 | 144.09 | 21569557 |
3-Jul-17 | 144.88 | 145.30 | 143.10 | 143.50 | 14277848 |
30-Jun-17 | 144.45 | 144.96 | 143.78 | 144.02 | 23024107 |
29-Jun-17 | 144.71 | 145.13 | 142.28 | 143.68 | 31499368 |
4. Commonly used dplyr functions
4.1 Arrange
Sort the rows by Volume in descending order:
arrange(aapl, -Volume)
Date | Open | High | Low | Close | Volume |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
14-Sep-16 | 108.73 | 113.03 | 108.60 | 111.77 | 112340318 |
1-Feb-17 | 127.03 | 130.49 | 127.01 | 128.75 | 111985040 |
27-Jul-16 | 104.26 | 104.35 | 102.75 | 102.95 | 92344820 |
15-Sep-16 | 113.86 | 115.73 | 113.49 | 115.57 | 90613177 |
16-Sep-16 | 115.12 | 116.13 | 114.04 | 114.92 | 79886911 |
12-Jun-17 | 145.74 | 146.09 | 142.51 | 145.42 | 72307330 |
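The same call can be written with the pipe operator %>%. Both forms return the same result; with practice the piped form tends to read more clearly, so the rest of this article uses %>%.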
aapl %>% arrange(-Volume)
Date | Open | High | Low | Close | Volume |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
14-Sep-16 | 108.73 | 113.03 | 108.60 | 111.77 | 112340318 |
1-Feb-17 | 127.03 | 130.49 | 127.01 | 128.75 | 111985040 |
27-Jul-16 | 104.26 | 104.35 | 102.75 | 102.95 | 92344820 |
15-Sep-16 | 113.86 | 115.73 | 113.49 | 115.57 | 90613177 |
16-Sep-16 | 115.12 | 116.13 | 114.04 | 114.92 | 79886911 |
12-Jun-17 | 145.74 | 146.09 | 142.51 | 145.42 | 72307330 |
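The unary minus used above only works for numeric columns. As a hedged aside, dplyr also provides `desc()`, which sorts in descending order for any column type. The sketch below uses three rows copied from the `head(aapl)` output; `sample_aapl` is a name chosen here for illustration.

```r
library(dplyr)

# Three rows copied from the head(aapl) output shown earlier
sample_aapl <- tibble(
  Date   = c("7-Jul-17", "6-Jul-17", "5-Jul-17"),
  Volume = c(19201712L, 24128782L, 21569557L)
)

# Two ways to sort by Volume, largest first
by_minus <- sample_aapl %>% arrange(-Volume)
by_desc  <- sample_aapl %>% arrange(desc(Volume))
identical(by_minus, by_desc)

# desc() also handles character columns, where unary minus would raise an error
sample_aapl %>% arrange(desc(Date))
```

Note that Date is stored as text here, so sorting on it orders the strings alphabetically, not chronologically.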
4.2 Select
Select the three columns Date, Close, and Volume:
aapl %>% select(Date, Close, Volume)
Date | Close | Volume |
<chr> | <dbl> | <int> |
7-Jul-17 | 144.18 | 19201712 |
6-Jul-17 | 142.73 | 24128782 |
5-Jul-17 | 144.09 | 21569557 |
3-Jul-17 | 143.50 | 14277848 |
30-Jun-17 | 144.02 | 23024107 |
29-Jun-17 | 143.68 | 31499368 |
The same three columns can also be obtained by dropping Open, High, and Low:
aapl %>% select(-c("Open", "High", "Low"))
Date | Close | Volume |
<chr> | <dbl> | <int> |
7-Jul-17 | 144.18 | 19201712 |
6-Jul-17 | 142.73 | 24128782 |
5-Jul-17 | 144.09 | 21569557 |
3-Jul-17 | 143.50 | 14277848 |
30-Jun-17 | 144.02 | 23024107 |
29-Jun-17 | 143.68 | 31499368 |
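Dropping columns can be written in a couple of other ways. The sketch below shows unquoted names and, for names held in a character vector, the `all_of()` helper (available through dplyr 1.0 and later). The single row is copied from the `head(aapl)` output; `sample_aapl` and `drop_cols` are illustrative names.

```r
library(dplyr)

# One row copied from the head(aapl) output shown earlier
sample_aapl <- tibble(
  Date = "7-Jul-17", Open = 142.90, High = 144.75,
  Low = 142.90, Close = 144.18, Volume = 19201712L
)

# Unquoted column names, the more common style
a <- sample_aapl %>% select(-c(Open, High, Low))

# Column names stored in a character vector: wrap them in all_of()
drop_cols <- c("Open", "High", "Low")
b <- sample_aapl %>% select(-all_of(drop_cols))

names(a)
identical(a, b)
```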
4.3 Filter
# Select the AAPL trading days with a closing price of at least 150 US dollars
aapl %>% filter(Close>=150)
Date | Open | High | Low | Close | Volume |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
8-Jun-17 | 155.25 | 155.54 | 154.40 | 154.99 | 21250798 |
7-Jun-17 | 155.02 | 155.98 | 154.48 | 155.37 | 21069647 |
6-Jun-17 | 153.90 | 155.81 | 153.78 | 154.45 | 26624926 |
5-Jun-17 | 154.34 | 154.45 | 153.46 | 153.93 | 25331662 |
2-Jun-17 | 153.58 | 155.45 | 152.89 | 155.45 | 27770715 |
1-Jun-17 | 153.17 | 153.33 | 152.22 | 153.18 | 16404088 |
Select the AAPL trading days where the closing price is at least 150 US dollars and the close is higher than the open:
aapl %>% filter((Close>=150) & (Close>Open))
Date | Open | High | Low | Close | Volume |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> |
7-Jun-17 | 155.02 | 155.98 | 154.48 | 155.37 | 21069647 |
6-Jun-17 | 153.90 | 155.81 | 153.78 | 154.45 | 26624926 |
2-Jun-17 | 153.58 | 155.45 | 152.89 | 155.45 | 27770715 |
1-Jun-17 | 153.17 | 153.33 | 152.22 | 153.18 | 16404088 |
30-May-17 | 153.42 | 154.43 | 153.33 | 153.67 | 20126851 |
25-May-17 | 153.73 | 154.35 | 153.03 | 153.87 | 19235598 |
18-May-17 | 151.27 | 153.34 | 151.13 | 152.54 | 33568215 |
12-May-17 | 154.70 | 156.42 | 154.67 | 156.10 | 32527017 |
11-May-17 | 152.45 | 154.07 | 152.31 | 153.95 | 27255058 |
9-May-17 | 153.87 | 154.88 | 153.45 | 153.99 | 39130363 |
8-May-17 | 149.03 | 153.70 | 149.03 | 153.01 | 48752413 |
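When several conditions must all hold, `filter()` also accepts them as separate, comma-separated arguments, which it combines with a logical AND. A minimal sketch using three rows copied from the outputs above (`sample_aapl` is an illustrative name):

```r
library(dplyr)

# Three rows copied from the outputs shown earlier
sample_aapl <- tibble(
  Date  = c("8-Jun-17", "7-Jun-17", "7-Jul-17"),
  Open  = c(155.25, 155.02, 142.90),
  Close = c(154.99, 155.37, 144.18)
)

# Explicit & between the conditions
with_and <- sample_aapl %>% filter((Close >= 150) & (Close > Open))

# Comma-separated conditions: every one of them must be TRUE
with_comma <- sample_aapl %>% filter(Close >= 150, Close > Open)

identical(with_and, with_comma)
with_comma
```

Only the 7-Jun-17 row passes both conditions: 8-Jun-17 closed below its open, and 7-Jul-17 closed below 150.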
4.4 Mutate
# Define maxDif as the daily high (High) minus the daily low (Low), then take its natural log
aapl %>% mutate(maxDif = High-Low, log_maxDif=log(maxDif))
Date | Open | High | Low | Close | Volume | maxDif | log_maxDif |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <dbl> |
7-Jul-17 | 142.90 | 144.75 | 142.90 | 144.18 | 19201712 | 1.85 | 0.6151856 |
6-Jul-17 | 143.02 | 143.50 | 142.41 | 142.73 | 24128782 | 1.09 | 0.0861777 |
5-Jul-17 | 143.69 | 144.79 | 142.72 | 144.09 | 21569557 | 2.07 | 0.7275486 |
3-Jul-17 | 144.88 | 145.30 | 143.10 | 143.50 | 14277848 | 2.20 | 0.7884574 |
30-Jun-17 | 144.45 | 144.96 | 143.78 | 144.02 | 23024107 | 1.18 | 0.1655144 |
29-Jun-17 | 144.71 | 145.13 | 142.28 | 143.68 | 31499368 | 2.85 | 1.0473190 |
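The call above works because `mutate()` evaluates its arguments in order, so a column created earlier in the same call can be used by a later one. A small sketch with two rows copied from the `head(aapl)` output (`sample_aapl` and `result` are illustrative names):

```r
library(dplyr)

# Two rows copied from the head(aapl) output shown earlier
sample_aapl <- tibble(
  Date = c("7-Jul-17", "6-Jul-17"),
  High = c(144.75, 143.50),
  Low  = c(142.90, 142.41)
)

result <- sample_aapl %>%
  mutate(
    maxDif     = High - Low,   # created first
    log_maxDif = log(maxDif)   # can already refer to maxDif; log() is the natural log
  )

result
```

One thing to keep in mind: on a day where High equals Low, maxDif is 0 and `log(0)` returns `-Inf` in R.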
Add a row-number column n:
aapl %>% mutate(n=row_number())
Date | Open | High | Low | Close | Volume | n |
<chr> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <int> |
7-Jul-17 | 142.90 | 144.75 | 142.90 | 144.18 | 19201712 | 1 |
6-Jul-17 | 143.02 | 143.50 | 142.41 | 142.73 | 24128782 | 2 |
5-Jul-17 | 143.69 | 144.79 | 142.72 | 144.09 | 21569557 | 3 |
3-Jul-17 | 144.88 | 145.30 | 143.10 | 143.50 | 14277848 | 4 |
30-Jun-17 | 144.45 | 144.96 | 143.78 | 144.02 | 23024107 | 5 |
29-Jun-17 | 144.71 | 145.13 | 142.28 | 143.68 | 31499368 | 6 |
4.5 Group_By
group_by() splits the data into groups. This section uses a new dataset, weather.
# Import the CSV data
weather <- read.csv('weather.csv', header=TRUE, sep=',', stringsAsFactors = FALSE) %>% as_tibble()
weather
Date | city | temperature | windspeed | event |
<chr> | <chr> | <int> | <int> | <chr> |
1/1/2017 | new york | 32 | 6 | Rain |
1/1/2017 | mumbai | 90 | 5 | Sunny |
1/1/2017 | paris | 45 | 20 | Sunny |
1/2/2017 | new york | 36 | 7 | Sunny |
1/2/2017 | mumbai | 85 | 12 | Fog |
1/2/2017 | paris | 50 | 13 | Cloudy |
weather %>% group_by(city)
Date | city | temperature | windspeed | event |
<chr> | <chr> | <int> | <int> | <chr> |
1/1/2017 | new york | 32 | 6 | Rain |
1/1/2017 | mumbai | 90 | 5 | Sunny |
1/1/2017 | paris | 45 | 20 | Sunny |
1/2/2017 | new york | 36 | 7 | Sunny |
1/2/2017 | mumbai | 85 | 12 | Fog |
1/2/2017 | paris | 50 | 13 | Cloudy |
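The rows printed after `group_by(city)` look the same as before because grouping only attaches metadata; it does not reorder or change the data. The sketch below rebuilds two columns of the weather table from the output above and inspects that metadata with `is_grouped_df()`, `group_vars()`, and `n_groups()`.

```r
library(dplyr)

# city and temperature copied from the weather output shown earlier
weather <- tibble(
  city        = c("new york", "mumbai", "paris", "new york", "mumbai", "paris"),
  temperature = c(32L, 90L, 45L, 36L, 85L, 50L)
)

grouped <- weather %>% group_by(city)

is_grouped_df(grouped)  # is grouping metadata attached?
group_vars(grouped)     # which column(s) define the groups
n_groups(grouped)       # how many distinct groups there are

# The row order and values are unchanged by grouping
identical(grouped$city, weather$city)
```

Functions applied afterwards, such as `summarise()`, then operate once per group.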
weather %>% group_by(city) %>% summarise(mean_temperature = mean(temperature))
`summarise()` ungrouping output (override with `.groups` argument)
city | mean_temperature |
<chr> | <dbl> |
mumbai | 87.5 |
new york | 34.0 |
paris | 47.5 |
Without group_by(), summarise() returns a single overall mean:
weather %>% summarise(mean_temperature = mean(temperature))
mean_temperature |
<dbl> |
56.33333 |
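A single `summarise()` call can compute several statistics per group, and the `.groups` argument named in the message above states explicitly what grouping the result should keep. The sketch below rebuilds the six weather rows from the output shown earlier; the summary column names `max_windspeed` and `n_days` are chosen here for illustration.

```r
library(dplyr)

# The six rows copied from the weather output shown earlier
weather <- tibble(
  Date        = c("1/1/2017", "1/1/2017", "1/1/2017", "1/2/2017", "1/2/2017", "1/2/2017"),
  city        = c("new york", "mumbai", "paris", "new york", "mumbai", "paris"),
  temperature = c(32L, 90L, 45L, 36L, 85L, 50L),
  windspeed   = c(6L, 5L, 20L, 7L, 12L, 13L),
  event       = c("Rain", "Sunny", "Sunny", "Sunny", "Fog", "Cloudy")
)

city_stats <- weather %>%
  group_by(city) %>%
  summarise(
    mean_temperature = mean(temperature),  # average temperature per city
    max_windspeed    = max(windspeed),     # strongest wind per city
    n_days           = n(),                # number of rows per city
    .groups = "drop"                       # return an ungrouped tibble
  )

city_stats
```

Setting `.groups = "drop"` makes the result an ordinary ungrouped tibble and avoids the informational message.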