R-数据操作

栏目: R语言 · 发布时间: 7年前

本文内容:

  • 基础函数操作数据框
  • sqldf包使用 SQL 查询数据框
  • data.table包操作数据
  • dplyr管道操作处理数据
  • rlist包处理嵌套数据结构

使用内置函数操作数据框

数据框的本质是一个由向量构成的列表,由于列长度相同,所以可以当做矩阵进行访问和操作。比如选择满足特定条件的行,使用 [] 符号,第一个参数提供一个逻辑向量,第二个参数留空。

本文大部分的代码都是基于一组产品的虚拟数据。我们先将数据载入,然后学习怎么用不同的方法操作数据。

if(!require(readr)) install.packages("readr")

product_info = read_csv("../../R/dataset/product-info.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   name = col_character(),
#>   type = col_character(),
#>   class = col_character(),
#>   released = col_character()
#> )
product_info
#> # A tibble: 6 x 5
#>   id    name      type  class   released
#>   <chr> <chr>     <chr> <chr>   <chr>   
#> 1 T01   SupCar    toy   vehicle yes     
#> 2 T02   SupPlane  toy   vehicle no      
#> 3 M01   JeepX     model vehicle yes     
#> 4 M02   AircraftX model vehicle yes     
#> 5 M03   Runner    model people  yes     
#> 6 M04   Dancer    model people  no

当数据以数据框的形式载入内存后,我们可以使用下面的代码查看每一列的类型:

sapply(product_info, class)
#>          id        name        type       class    released 
#> "character" "character" "character" "character" "character"

注意 read_csv 函数载入的数据框与内置函数 read.csv 函数是不同的,主要体现在不会将字符串转换为因子变量,当然前者的速度要快得多。

接下来我们正式学习用R内置的函数操作数据框进行分析和统计的一些方法。

内置函数操作数据框

选取 typetoy 的行:

product_info[product_info$type == "toy", ]
#> # A tibble: 2 x 5
#>   id    name     type  class   released
#>   <chr> <chr>    <chr> <chr>   <chr>   
#> 1 T01   SupCar   toy   vehicle yes     
#> 2 T02   SupPlane toy   vehicle no

或选取 releasedno 的行:

product_info[product_info$released == "no", ]
#> # A tibble: 2 x 5
#>   id    name     type  class   released
#>   <chr> <chr>    <chr> <chr>   <chr>   
#> 1 T02   SupPlane toy   vehicle no      
#> 2 M04   Dancer   model people  no

对列进行筛选需要将第1个参数留空,给第2个参数提供字符向量。

product_info[, c("id", "name", "type")]
#> # A tibble: 6 x 3
#>   id    name      type 
#>   <chr> <chr>     <chr>
#> 1 T01   SupCar    toy  
#> 2 T02   SupPlane  toy  
#> 3 M01   JeepX     model
#> 4 M02   AircraftX model
#> 5 M03   Runner    model
#> 6 M04   Dancer    model

行列筛选也是可以的,我们只要组合前面的两种情况即可。

product_info[product_info$type == "toy", c("name", "class", "released")]
#> # A tibble: 2 x 3
#>   name     class   released
#>   <chr>    <chr>   <chr>   
#> 1 SupCar   vehicle yes     
#> 2 SupPlane vehicle no

内置函数 subset() 可以简化取子集操作的过程:

subset(product_info,
       subset = type == "model" & released == "yes",
       select = name:class)
#> # A tibble: 3 x 3
#>   name      type  class  
#>   <chr>     <chr> <chr>  
#> 1 JeepX     model vehicle
#> 2 AircraftX model vehicle
#> 3 Runner    model people

使用 with() 函数在数据框的语义中计算表达式,即可以直接使用数据框的列名,而不必重复指定数据框:

with(product_info, name[released == "no"])
#> [1] "SupPlane" "Dancer"

除了构建子集,表达式还可以用来统计每列各个可能值出现的频数。

with(product_info, table(type[released == "yes"]))
#> 
#> model   toy 
#>     3     1

除了产品信息表,还有一张产品属性的统计表:

product_stats = read_csv("../../R/dataset/product-stats.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   material = col_character(),
#>   size = col_integer(),
#>   weight = col_double()
#> )
product_stats
#> # A tibble: 6 x 4
#>   id    material  size weight
#>   <chr> <chr>    <int>  <dbl>
#> 1 T01   Metal      120   10  
#> 2 T02   Metal      350   45  
#> 3 M01   Plastics    50   NA  
#> 4 M02   Plastics    85    3  
#> 5 M03   Wood        15   NA  
#> 6 M04   Wood        16    0.6

如果现在要获取尺寸最大的前3个产品的名字该怎么办?

一种方法是将 product_stats 按尺寸降序排列,选择前3个记录的id,然后用id值筛选 product_info 的行:

top3_id = unlist(product_stats[order(product_stats$size, decreasing = TRUE), "id"])[1:3]

product_info[product_info$id %in% top3_id, ]
#> # A tibble: 3 x 5
#>   id    name      type  class   released
#>   <chr> <chr>     <chr> <chr>   <chr>   
#> 1 T01   SupCar    toy   vehicle yes     
#> 2 T02   SupPlane  toy   vehicle no      
#> 3 M02   AircraftX model vehicle yes

我们用比较冗长的方式完成了任务。但仔细在想想,两个数据框是通过 id 连接到一起的,我们可以把它们合并到一起,然后执行提取操作:

product_table = merge(product_info, product_stats, by = "id")
product_table
#>    id      name  type   class released material size weight
#> 1 M01     JeepX model vehicle      yes Plastics   50     NA
#> 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
#> 3 M03    Runner model  people      yes     Wood   15     NA
#> 4 M04    Dancer model  people       no     Wood   16    0.6
#> 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0
#> 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

现在通过合并的数据框,我们可以根据任意一列 排序 数据框,而不需要处理其他的表格数据:

product_table[order(product_table$size), ]
#>    id      name  type   class released material size weight
#> 3 M03    Runner model  people      yes     Wood   15     NA
#> 4 M04    Dancer model  people       no     Wood   16    0.6
#> 1 M01     JeepX model vehicle      yes Plastics   50     NA
#> 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
#> 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0
#> 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

前面的问题我们也可以利用合并的数据框加以解决:

product_table[order(product_table$size, decreasing = TRUE), "name"][1:3]
#> [1] "SupPlane"  "SupCar"    "AircraftX"

有时候我们需要生成新数据框来对原始数据基础上进行调整和处理,从而避免破坏原始数据。 transform() 函数可以帮助我们完成这类任务,例如:

transform(product_table,
          released = ifelse(released == "yes", TRUE, FALSE),
          density = weight / size)
#>    id      name  type   class released material size weight density
#> 1 M01     JeepX model vehicle     TRUE Plastics   50     NA      NA
#> 2 M02 AircraftX model vehicle     TRUE Plastics   85    3.0  0.0353
#> 3 M03    Runner model  people     TRUE     Wood   15     NA      NA
#> 4 M04    Dancer model  people    FALSE     Wood   16    0.6  0.0375
#> 5 T01    SupCar   toy vehicle     TRUE    Metal  120   10.0  0.0833
#> 6 T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0  0.1286

前面数据中我们看到有一些缺失值(用 NA 表示),很多时候我们不希望数据出现任何缺失值,因此需要某种办法处理它们。为了演示处理的方法,我们再载入一张包含缺失值的表,包含每件产品的质量、耐久性、防水性的测试结果。

product_tests = read_csv("../../R/dataset/product-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   quality = col_integer(),
#>   durability = col_integer(),
#>   waterproof = col_character()
#> )
product_tests
#> # A tibble: 6 x 4
#>   id    quality durability waterproof
#>   <chr>   <int>      <int> <chr>     
#> 1 T01        NA         10 no        
#> 2 T02        10          9 no        
#> 3 M01         6          4 yes       
#> 4 M02         6          5 yes       
#> 5 M03         5         NA yes       
#> 6 M04         6          6 yes

na.omit() 可以删除所有包含缺失值的行:

na.omit(product_tests)
#> # A tibble: 4 x 4
#>   id    quality durability waterproof
#>   <chr>   <int>      <int> <chr>     
#> 1 T02        10          9 no        
#> 2 M01         6          4 yes       
#> 3 M02         6          5 yes       
#> 4 M04         6          6 yes

另外,函数 complete.cases() 可以返回逻辑向量,表明某行是否完整。

complete.cases(product_tests)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE

利用该函数可以筛选数据框,比如获得不含缺失值的 id 值:

product_tests[complete.cases(product_tests), "id"]
#> # A tibble: 4 x 1
#>   id   
#>   <chr>
#> 1 T02  
#> 2 M01  
#> 3 M02  
#> 4 M04

前面给出的3个表格有共同的 id 列,可惜R里面内置函数只能一次合并2个数据框:

product_full = merge(product_table, product_tests, by = "id")
product_full
#>    id      name  type   class released material size weight quality
#> 1 M01     JeepX model vehicle      yes Plastics   50     NA       6
#> 2 M02 AircraftX model vehicle      yes Plastics   85    3.0       6
#> 3 M03    Runner model  people      yes     Wood   15     NA       5
#> 4 M04    Dancer model  people       no     Wood   16    0.6       6
#> 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0      NA
#> 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0      10
#>   durability waterproof
#> 1          4        yes
#> 2          5        yes
#> 3         NA        yes
#> 4          6        yes
#> 5         10         no
#> 6          9         no

对完全合并好的表格,我们利用 tapply() 函数( apply 家族成员)可以进行统计,该函数专门用于处理表格数据,使用某些方法根据某列队另一列的数据进行统计。

例如根据type列计算quality列的均值:

mean_quality1 = tapply(product_full$quality,
                       list(product_full$type),
                       mean, na.rm=TRUE)
mean_quality1
#> model   toy 
#>  5.75 10.00

返回的结果看起来是个数值向量,我们使用 str() 看看:

str(mean_quality1)
#>  num [1:2(1d)] 5.75 10
#>  - attr(*, "dimnames")=List of 1
#>   ..$ : chr [1:2] "model" "toy"

实际上,这是个一维数组:

is.array(mean_quality1)
#> [1] TRUE

tapply() 返回的是一个数组,而不是简单的数值向量,因此可以方便地计算多组操作。

例如计算每一对 typeclass 组合的 quality 均值:

mean_quality2 = tapply(product_full$quality,
                       list(product_full$type, product_full$class),
                       mean, na.rm = TRUE)
mean_quality2
#>       people vehicle
#> model    5.5       6
#> toy       NA      10

对于二维数组,我们可以使用两个参数来获取其中的值:

typeof(mean_quality2)
#> [1] "double"
class(mean_quality2)
#> [1] "matrix"

mean_quality2["model", "vehicle"]
#> [1] 6

同理我们可以根据多列分组,使用 with() 可以避免反复输入 product_full

mean_quality3 = with(product_full,
                     tapply(quality, list(type, material, released),
                            mean, na.rm = TRUE))

mean_quality3
#> , , no
#> 
#>       Metal Plastics Wood
#> model    NA       NA    6
#> toy      10       NA   NA
#> 
#> , , yes
#> 
#>       Metal Plastics Wood
#> model    NA        6    5
#> toy     NaN       NA   NA

使用3个参数可以获取单元格中的值:

mean_quality3["model", "Wood", "yes"]
#> [1] 5

reshape2重塑数据框

前面我们学习了如何筛选、排序、合并和汇总数据框,有时候我们需要做些更复杂的操作。

例如下面数据包含两种产品不同日期的质量和耐久性的测试结果:

toy_tests = read_csv("../../R/dataset/product-toy-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   date = col_integer(),
#>   sample = col_integer(),
#>   quality = col_integer(),
#>   durability = col_integer()
#> )
toy_tests
#> # A tibble: 8 x 5
#>   id        date sample quality durability
#>   <chr>    <int>  <int>   <int>      <int>
#> 1 T01   20160201    100       9          9
#> 2 T01   20160302    150      10          9
#> 3 T01   20160405    180       9         10
#> 4 T01   20160502    140       9          9
#> 5 T02   20160201     70       7          9
#> 6 T02   20160303     75       8          8
#> 7 T02   20160403     90       9          8
#> 8 T02   20160502     85      10          9

如果需要同时比较两种产品的质量和耐久性,这种格式就比较麻烦,如果是下面的格式就好了:

date    T01 T02
20160201    9   9
2016    10  9

reshape2 包就是用来搞定这种任务的,如果没有安装,运行下面代码:

install.packages("reshape2")

安装成功后,我们就可以使用 dcast() 来转换数据,用于比较:

library(reshape2)
#> 
#> 载入程辑包:'reshape2'
#> The following object is masked from 'package:tidyr':
#> 
#>     smiths

toy_quality = dcast(toy_tests, date ~ id, value.var = "quality")
toy_quality
#>       date T01 T02
#> 1 20160201   9   7
#> 2 20160302  10  NA
#> 3 20160303  NA   8
#> 4 20160403  NA   9
#> 5 20160405   9  NA
#> 6 20160502   9  10

上述代码重塑了 toy_testsdate 列被共享, id 值被单独分割为列,每个 dateid 对应的值是 quality

可以看到数据中存在缺失值,有一种叫 末次观测值结转法(LOCF) 可以填补缺失值,当非缺失值后面紧跟一个缺失值时,就用该缺失值填补后面的缺失值,直到所有缺失值都被填满。 zoo 包提供了LOCF的一个实现,使用下面代码安装:

install.packages("zoo")

下面用一组简单的向量演示:

library(zoo)
#> 
#> 载入程辑包:'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
na.locf(c(1, 2, NA, NA, 3, 1, NA, 2, NA))
#> [1] 1 2 2 2 3 1 1 2 2

同样的方法我们可以应用于现在处理的数据:

na.locf(toy_quality$T01)
#> [1]  9 10 10 10  9  9

如果需要填补的数据很多,包含上千个产品,更好的做法是使用 lapply 进行自动分配:

toy_quality[-1] = lapply(toy_quality[-1], na.locf )
toy_quality
#>       date T01 T02
#> 1 20160201   9   7
#> 2 20160302  10   7
#> 3 20160303  10   8
#> 4 20160403  10   9
#> 5 20160405   9   9
#> 6 20160502   9  10

这里数据虽然已经没有了缺失值,但每一行数据的含义却发生了变化。原始数据中产品T01在20160303这天并没有测试,所以这一天的值应该被解释为在此之前的最后一次quality的测试值。另一个问题是两种产品都是按月测试的,但重塑后的数据框没有以固定的频率对其date。

下面方法进问题进行修正。

toy_tests$ym = substr(toy_tests$date, 1, 6)
toy_tests
#> # A tibble: 8 x 6
#>   id        date sample quality durability ym    
#>   <chr>    <int>  <int>   <int>      <int> <chr> 
#> 1 T01   20160201    100       9          9 201602
#> 2 T01   20160302    150      10          9 201603
#> 3 T01   20160405    180       9         10 201604
#> 4 T01   20160502    140       9          9 201605
#> 5 T02   20160201     70       7          9 201602
#> 6 T02   20160303     75       8          8 201603
#> 7 T02   20160403     90       9          8 201604
#> 8 T02   20160502     85      10          9 201605

我们只提取年月信息,然后利用它进行重塑。

toy_quality = dcast(toy_tests, ym ~ id, value.var = "quality")
toy_quality
#>       ym T01 T02
#> 1 201602   9   7
#> 2 201603  10   8
#> 3 201604   9   9
#> 4 201605   9  10

现在,两种产品每月的质量得分自然地展示出来,而且每月缺失值。

有时候,我们需要将 许多列合并为1列 ,用于表示被测量的对象,另外用1列存储对应的结果值。下面用 melt() 函数将原始数据两种测量组合到一起:

toy_tests2 = melt(toy_tests, id.vars = c("id", "ym"), 
                  measure.vars = c("quality", "durability"),
                  variable.name = "measure")
toy_tests2
#>     id     ym    measure value
#> 1  T01 201602    quality     9
#> 2  T01 201603    quality    10
#> 3  T01 201604    quality     9
#> 4  T01 201605    quality     9
#> 5  T02 201602    quality     7
#> 6  T02 201603    quality     8
#> 7  T02 201604    quality     9
#> 8  T02 201605    quality    10
#> 9  T01 201602 durability     9
#> 10 T01 201603 durability     9
#> 11 T01 201604 durability    10
#> 12 T01 201605 durability     9
#> 13 T02 201602 durability     9
#> 14 T02 201603 durability     8
#> 15 T02 201604 durability     8
#> 16 T02 201605 durability     9

这种格式正是 ggplot2 所喜爱的长格式数据,我们可以来画图:

library(ggplot2)

ggplot(toy_tests2, aes(x = ym, y = value)) + 
    geom_point() + 
    facet_grid(id ~ measure)

R-数据操作

我们得到了按照产品id和measure分组,以ym为x轴,以value为y轴的散点图,可以清晰对比分组后两种产品质量差异(以年月)。

我们还可以用不同的颜色来表示产品,下图可以给出与上图相同的信息:

ggplot(toy_tests2, aes(x = ym, y = value, color = id)) + 
    geom_point() + facet_grid(. ~ measure)

R-数据操作

通过sqldf包使用SQL查询数据框

有没有一种方法,能够直接使用SQL进行数据框查询,就像数据框是关系型数据库中的表一样呢? sqldf 包给出肯定答案。该包吸收了 SQLite 轻量结构和易于嵌入R会话的优点,可以用下面代码安装:

install.packages("sqldf")

首先加载包:

library(sqldf)
#> 载入需要的程辑包:gsubfn
#> 载入需要的程辑包:proto
#> 载入需要的程辑包:RSQLite

注意加载 sqldf 包时,几个依赖包会自动加载进来。 sql 包的实现依赖这些包,它基本上是在R和SQLite之间传输数据和转换数据类型

读入前面使用的产品表格:

product_info = read_csv("../../R/dataset/product-info.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   name = col_character(),
#>   type = col_character(),
#>   class = col_character(),
#>   released = col_character()
#> )
product_stats = read_csv("../../R/dataset/product-stats.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   material = col_character(),
#>   size = col_integer(),
#>   weight = col_double()
#> )
product_tests = read_csv("../../R/dataset/product-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   quality = col_integer(),
#>   durability = col_integer(),
#>   waterproof = col_character()
#> )
toy_tests = read_csv("../../R/dataset/product-toy-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   date = col_integer(),
#>   sample = col_integer(),
#>   quality = col_integer(),
#>   durability = col_integer()
#> )

sqldf 包的神奇之处在于我们可以使用SQL语句查询工作环境中的数据框,例如:

sqldf("select * from product_info")
#>    id      name  type   class released
#> 1 T01    SupCar   toy vehicle      yes
#> 2 T02  SupPlane   toy vehicle       no
#> 3 M01     JeepX model vehicle      yes
#> 4 M02 AircraftX model vehicle      yes
#> 5 M03    Runner model  people      yes
#> 6 M04    Dancer model  people       no

sqldf与SQLite一样,支持简单的选择性请求。

比如选择特定列:

sqldf("select id, name, class from product_info")
#>    id      name   class
#> 1 T01    SupCar vehicle
#> 2 T02  SupPlane vehicle
#> 3 M01     JeepX vehicle
#> 4 M02 AircraftX vehicle
#> 5 M03    Runner  people
#> 6 M04    Dancer  people

根据条件筛选记录:

sqldf("select id, name from product_info where released = 'yes' ")
#>    id      name
#> 1 T01    SupCar
#> 2 M01     JeepX
#> 3 M02 AircraftX
#> 4 M03    Runner

除了基本的数据库操作和分组统计,该包还支持查询多个数据框,比如:

sqldf("select * from product_info join product_stats using (id)")
#>    id      name  type   class released material size weight
#> 1 T01    SupCar   toy vehicle      yes    Metal  120   10.0
#> 2 T02  SupPlane   toy vehicle       no    Metal  350   45.0
#> 3 M01     JeepX model vehicle      yes Plastics   50     NA
#> 4 M02 AircraftX model vehicle      yes Plastics   85    3.0
#> 5 M03    Runner model  people      yes     Wood   15     NA
#> 6 M04    Dancer model  people       no     Wood   16    0.6

不过sqldf包的缺点也很明显:

  1. sqldf默认基于SQLite,因此SQLite的局限性就是该包的局限性,比如内置的分组汇总函数是有限的,而R本身的统计汇总函数要多得多
  2. 不方便动态编程
  3. SQL的限制性也限制了该包,我们难以像操作dplyr包一样用sqldf进行表格数据的操作、变换等等

如果你喜欢这个包并想用起来,阅读sqldf更多操作例子: https://github.com/ggrothendieck/sqldf#examples

使用data.table包操作数据

data.table 包提供了一个加强版的 data.frame ,它运行效率极高,而且能够处理适合内存的大数据集,它使用 [] 实现了一种自然地数据操作语法 。使用下面命令进行安装:

install.packages("data.table")

载入包:

library(data.table)
#> 
#> 载入程辑包:'data.table'
#> The following objects are masked from 'package:reshape2':
#> 
#>     dcast, melt
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

注意, data.table 包提供了加强版的 dcast()melt() ,它们的功能更强大、性能更高,内存使用也更高效。

创建 data.table 与创建 data.frame 类似:

dt = data.table(x = 1:3, y = rnorm(3), z = letters[1:3])
dt
#>    x     y z
#> 1: 1 0.819 a
#> 2: 2 0.426 b
#> 3: 3 0.289 c

检查它的结构:

str(dt)
#> Classes 'data.table' and 'data.frame':   3 obs. of  3 variables:
#>  $ x: int  1 2 3
#>  $ y: num  0.819 0.426 0.289
#>  $ z: chr  "a" "b" "c"
#>  - attr(*, ".internal.selfref")=<externalptr>

可以看到, dt 的类是 data.tabledata.frame ,也就是说 data.table 继承了 data.frame 的一些行为,但增强了其他部分。

** data.table 的基本语法是 dt[i, j, by],简单说就是使用 i 选择行,用 by 分组,然后计算 j **。接下来我们看看 data.table`继承了什么,增强了什么。

首先,我们仍然载入之前用到的产品数据,不过这里我们使用 data.table 包提供的 fread() 函数,它非常高效和智能,默认返回 data.table

product_info = fread("../../R/dataset/product-info.csv")
product_stats = fread("../../R/dataset/product-stats.csv")
product_tests = fread("../../R/dataset/product-tests.csv")
toy_tests = fread("../../R/dataset/product-toy-tests.csv")

如果查看表格信息,你会发现它和 data.frame 没什么两样:

product_info
#>     id      name  type   class released
#> 1: T01    SupCar   toy vehicle      yes
#> 2: T02  SupPlane   toy vehicle       no
#> 3: M01     JeepX model vehicle      yes
#> 4: M02 AircraftX model vehicle      yes
#> 5: M03    Runner model  people      yes
#> 6: M04    Dancer model  people       no

再看结构:

str(product_info)
#> Classes 'data.table' and 'data.frame':   6 obs. of  5 variables:
#>  $ id      : chr  "T01" "T02" "M01" "M02" ...
#>  $ name    : chr  "SupCar" "SupPlane" "JeepX" "AircraftX" ...
#>  $ type    : chr  "toy" "toy" "model" "model" ...
#>  $ class   : chr  "vehicle" "vehicle" "vehicle" "vehicle" ...
#>  $ released: chr  "yes" "no" "yes" "yes" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

data.frame 不同,如果只提供一个参数用来构建子集, data.table 是选择行而不是列:

product_info[1]
#>     id   name type   class released
#> 1: T01 SupCar  toy vehicle      yes

product_info[1:3]
#>     id     name  type   class released
#> 1: T01   SupCar   toy vehicle      yes
#> 2: T02 SupPlane   toy vehicle       no
#> 3: M01    JeepX model vehicle      yes

如果提供的是负数,那么将删除指定的行:

product_info[-1]
#>     id      name  type   class released
#> 1: T02  SupPlane   toy vehicle       no
#> 2: M01     JeepX model vehicle      yes
#> 3: M02 AircraftX model vehicle      yes
#> 4: M03    Runner model  people      yes
#> 5: M04    Dancer model  people       no

data.table提供了许多特殊符号,它们是data.table的重要组成。 .N 是最常用的符号之一,它表示当前分组中,对象的数目(就不用调用 nrow 函数啦)。在 [] 使用它指提取最后一行。

product_info[.N]
#>     id   name  type  class released
#> 1: M04 Dancer model people       no
product_info[c(1, .N)]
#>     id   name  type   class released
#> 1: T01 SupCar   toy vehicle      yes
#> 2: M04 Dancer model  people       no

在对 data.table 构建子集时,能够自动根据语义计算表达式,因此可以直接使用列名,像 with()subset() 那样。

比如:

product_info[released == "yes"]
#>     id      name  type   class released
#> 1: T01    SupCar   toy vehicle      yes
#> 2: M01     JeepX model vehicle      yes
#> 3: M02 AircraftX model vehicle      yes
#> 4: M03    Runner model  people      yes

方括号内的第1个参数是行筛选器,第2个则对筛选后的数据进行适当的计算。

例如提取列:

product_info[released == "yes", id]
#> [1] "T01" "M01" "M02" "M03"

在这里使用 "id" 结果不同,返回的必然是个data.table。

product_info[released == "yes", "id"]
#>     id
#> 1: T01
#> 2: M01
#> 3: M02
#> 4: M03

第二个参数可以是表达式,例如生成一张表,反应每种 typeclass 组合中 releasedyes 的数量:

product_info[released == "yes", table(type, class)]
#>        class
#> type    people vehicle
#>   model      1       2
#>   toy        0       1

要注意,给第2个参数提供list(),结果仍然转换为data.table:

product_info[released == "yes", list(id, name)]
#>     id      name
#> 1: T01    SupCar
#> 2: M01     JeepX
#> 3: M02 AircraftX
#> 4: M03    Runner

我们可以替换原有列,生成新的data.table:

product_info[, list(id, name, released = released == "yes")]
#>     id      name released
#> 1: T01    SupCar     TRUE
#> 2: T02  SupPlane    FALSE
#> 3: M01     JeepX     TRUE
#> 4: M02 AircraftX     TRUE
#> 5: M03    Runner     TRUE
#> 6: M04    Dancer    FALSE

还可以创建新列:

product_stats[, list(id, material, size, weight, density = size/weight)]
#>     id material size weight density
#> 1: T01    Metal  120   10.0   12.00
#> 2: T02    Metal  350   45.0    7.78
#> 3: M01 Plastics   50     NA      NA
#> 4: M02 Plastics   85    3.0   28.33
#> 5: M03     Wood   15     NA      NA
#> 6: M04     Wood   16    0.6   26.67

为了简化,data.table使用.()作为list()的缩写,这两者等价:

product_info[, .(id, name, type, class)]
#>     id      name  type   class
#> 1: T01    SupCar   toy vehicle
#> 2: T02  SupPlane   toy vehicle
#> 3: M01     JeepX model vehicle
#> 4: M02 AircraftX model vehicle
#> 5: M03    Runner model  people
#> 6: M04    Dancer model  people

product_info[released == "yes", .(id, name)]
#>     id      name
#> 1: T01    SupCar
#> 2: M01     JeepX
#> 3: M02 AircraftX
#> 4: M03    Runner

提供排序索引可以对记录排序:

product_stats[order(size, decreasing = TRUE)]
#>     id material size weight
#> 1: T02    Metal  350   45.0
#> 2: T01    Metal  120   10.0
#> 3: M02 Plastics   85    3.0
#> 4: M01 Plastics   50     NA
#> 5: M04     Wood   16    0.6
#> 6: M03     Wood   15     NA

前面都是在构建子集后,又创建新的data.table。这样挺麻烦的,因此 data.table 包提供了对列进行原地赋值的符号 := ,例如 product_stats 开始是这样的:

product_stats
#>     id material size weight
#> 1: T01    Metal  120   10.0
#> 2: T02    Metal  350   45.0
#> 3: M01 Plastics   50     NA
#> 4: M02 Plastics   85    3.0
#> 5: M03     Wood   15     NA
#> 6: M04     Wood   16    0.6

使用 := 直接在上面数据框创建新列:

product_stats[, density := size / weight]

虽然没有任何返回,但数据已经被修改了:

product_stats
#>     id material size weight density
#> 1: T01    Metal  120   10.0   12.00
#> 2: T02    Metal  350   45.0    7.78
#> 3: M01 Plastics   50     NA      NA
#> 4: M02 Plastics   85    3.0   28.33
#> 5: M03     Wood   15     NA      NA
#> 6: M04     Wood   16    0.6   26.67

使用 := 替换已有的列:

product_info[, released := released == "yes"]
product_info
#>     id      name  type   class released
#> 1: T01    SupCar   toy vehicle     TRUE
#> 2: T02  SupPlane   toy vehicle    FALSE
#> 3: M01     JeepX model vehicle     TRUE
#> 4: M02 AircraftX model vehicle     TRUE
#> 5: M03    Runner model  people     TRUE
#> 6: M04    Dancer model  people    FALSE

使用键获取值

索引支持是data.table另一个独特功能,即我们可以创建键(key),使用键获取记录及其高效。

例如,使用 setkey()id 设置为 product_info 中的一个键:

setkey(product_info, id)

同样的,函数无任何返回,但我们已经为原始数据设置了键,而且原来的数据看起来也没变化:

product_info
#>     id      name  type   class released
#> 1: M01     JeepX model vehicle     TRUE
#> 2: M02 AircraftX model vehicle     TRUE
#> 3: M03    Runner model  people     TRUE
#> 4: M04    Dancer model  people    FALSE
#> 5: T01    SupCar   toy vehicle     TRUE
#> 6: T02  SupPlane   toy vehicle    FALSE

但键已生成:

key(product_info)
#> [1] "id"

现在我们可以用它来获取数据了,比如提供一个id值:

product_info["M01"]
#>     id  name  type   class released
#> 1: M01 JeepX model vehicle     TRUE

也可以使用 setkeyv() 来设置键,但它只接受字符向量:

setkeyv(product_stats, "id")

当key是一个动态变化的向量时,这个函数会非常好用。

product_stats["M02"]
#>     id material size weight density
#> 1: M02 Plastics   85      3    28.3

如果两个表格有相同的键,我们可以轻松把他们连接到一起:

product_info[product_stats]
#>     id      name  type   class released material size weight density
#> 1: M01     JeepX model vehicle     TRUE Plastics   50     NA      NA
#> 2: M02 AircraftX model vehicle     TRUE Plastics   85    3.0   28.33
#> 3: M03    Runner model  people     TRUE     Wood   15     NA      NA
#> 4: M04    Dancer model  people    FALSE     Wood   16    0.6   26.67
#> 5: T01    SupCar   toy vehicle     TRUE    Metal  120   10.0   12.00
#> 6: T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0    7.78

data.table的键可以不止一个。例如使用 iddate 定位 toy_tests 中的记录:

setkey(toy_tests, id, date)

现在提供key中的两个元素就可以获取记录了

toy_tests[.("T01", 20160201)]
#>     id     date sample quality durability
#> 1: T01 20160201    100       9          9

如果提供第一个元素,会返回匹配的多个值:

toy_tests["T01"]
#>     id     date sample quality durability
#> 1: T01 20160201    100       9          9
#> 2: T01 20160302    150      10          9
#> 3: T01 20160405    180       9         10
#> 4: T01 20160502    140       9          9

key不能错序,因此不能单独提供第2个元素以及反序排列。

toy_tests[20160201]
#>      id date sample quality durability
#> 1: <NA>   NA     NA      NA         NA

toy_tests[.(20160202,"T01")]
#> Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'id' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns.

对数据进行分组汇总

by 是data.table中另一个重要参数(即方括号内的第3个参数),它可以将数据按照 by 值进行分组,并对分组计算第2个参数。

接下来,我们学习如何通过by以简便的方式实现数据的分组汇总。

最简单的用法是计算每组的记录条数:

product_info[, .N, by = released]
#>    released N
#> 1:     TRUE 4
#> 2:    FALSE 2

分组的变量可以不止一个,例如由 typeclass 确定一个分组:

product_info[, .N, by = .(type, class)]
#>     type   class N
#> 1: model vehicle 2
#> 2: model  people 2
#> 3:   toy vehicle 2

可以对每个分组进行统计计算,这里计算防水和非防水产品的质量得分均值:

product_tests[, mean(quality, na.rm = TRUE), by = .(waterproof)]
#>    waterproof    V1
#> 1:         no 10.00
#> 2:        yes  5.75

可以看到结果存储在V1列中,我们可以手动指定列名:

product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)), by = .(waterproof)]
#>    waterproof mean_quality
#> 1:         no        10.00
#> 2:        yes         5.75

注意操作需要�放在 list 中进行( .() )。

我们可以将多个[]按顺序连接起来,形成工作流(类似管道 %>% )。

下面的例子中,首先使用通用键id将product_info和product_tests连接起来,然后筛选已发布的产品,再按type和class进行分组,最后计算每组的quality和durability的均值。

type_class_test0 = product_info[product_tests][released == TRUE,
                                               .(mean_quality = mean(quality, na.rm=TRUE),
                                                 mean_durability = mean(durability, na.rm=TRUE)),
                                               by = .(type, class)]

type_class_test0
#>     type   class mean_quality mean_durability
#> 1:   toy vehicle          NaN            10.0
#> 2: model vehicle            6             4.5
#> 3: model  people            5             NaN

在返回的data.table中,by所对应的组合中的值是唯一的,虽然实现了目标,但结果中没有设置键:

key(type_class_test0)
#> NULL

这种情况下,我们可以使用keyby来确保结果的data.table自动将keyby对应的分组向量设置为键。一般data.table会保持原来的顺序返回,有时候我们想要设定排序,keyby也可以实现,所以是一举两得:

type_class_test = product_info[product_tests][released == TRUE, 
                                              .(mean_quality = mean(quality, na.rm = TRUE),
                                                mean_durability = mean(durability, na.rm = TRUE)),
                                              keyby = .(type, class)]
type_class_test
#>     type   class mean_quality mean_durability
#> 1: model  people            5             NaN
#> 2: model vehicle            6             4.5
#> 3:   toy vehicle          NaN            10.0
key(type_class_test)
#> [1] "type"  "class"

下面可以直接用键来获取值:

type_class_test[.("model", "vehicle"), mean_quality]
#> [1] 6

对大数据集使用键进行搜索,能够比迭代使用逻辑比较快得多,因为键搜索利用了二进制搜索,而迭代在不必要的计算上浪费了时间。

下面举例说明,首先创建有1000万行的数据,其中一列是索引列id,其他两列是随机数:

n = 10000000
test1 = data.frame(id = 1:n, x = rnorm(n), y = rnorm(n))

现在查找id为876543的行,看要花多少时间:

system.time(row <- test1[test1$id == 876543, ])
#>  用户  系统  流逝 
#> 0.152 0.020 0.172

作为对比,我们使用 data.table 来完成这个任务,使用 setDT() 将数据框转换为 data.table ,该函数可以原地转换,不需要复制,并可以设定键。

setDT(test1, key = "id")
class(test1)
#> [1] "data.table" "data.frame"

现在我们搜索相同的元素:

system.time(row <- test1[.(876543)])
#>  用户  系统  流逝 
#> 0.001 0.000 0.001

结果一致,但data.table用的时间要少得多。

重塑data.table

data.table扩展包为data.table对象提供了更强更快得 dcast()melt() 函数。

例如将toy_tests的每个产品质量得分按照年和月进行对齐

toy_tests[, ym := substr(date, 1, 6)]
toy_quality = dcast(toy_tests, ym ~ id, value.var = "quality")
toy_quality
#>        ym T01 T02
#> 1: 201602   9   7
#> 2: 201603  10   8
#> 3: 201604   9   9
#> 4: 201605   9  10

data.table::dcast() 提供了更强大的多变量支持:

toy_tests2 = data.table::dcast(toy_tests, ym ~ id, value.var = c("quality", "durability"))
toy_tests2
#>        ym quality_T01 quality_T02 durability_T01 durability_T02
#> 1: 201602           9           7              9              9
#> 2: 201603          10           8              9              8
#> 3: 201604           9           9             10              8
#> 4: 201605           9          10              9              9

看到没,data.table可以自动将id值与质量分类连接起来。

此时 ym 是键:

key(toy_tests2)
#> [1] "ym"

我们可以利用它提取数据:

toy_tests2["201602"]
#>        ym quality_T01 quality_T02 durability_T01 durability_T02
#> 1: 201602           9           7              9              9

使用原地设置函数

我们知道R存在复制修改机制,这在进行大数据计算时开销很大, data.table 提供了一系列支持语义的 set 函数,它们可以原地修改data.table,因此避免不必要的复制。

仍以 product_stats 为例,我们可以使用 setDF() 函数不要任何复制就可以将data.table变成data.frame。

product_stats
#>     id material size weight density
#> 1: M01 Plastics   50     NA      NA
#> 2: M02 Plastics   85    3.0   28.33
#> 3: M03     Wood   15     NA      NA
#> 4: M04     Wood   16    0.6   26.67
#> 5: T01    Metal  120   10.0   12.00
#> 6: T02    Metal  350   45.0    7.78

setDF(product_stats)

class(product_stats)
#> [1] "data.frame"

setDT() 可以将任意的data.frame转换为data.table,并设置键。

setDT(product_stats, key = "id")
class(product_stats)
#> [1] "data.table" "data.frame"

使用 setnames() 可以对列重命名:

setnames(product_stats, "size", "volume")

product_stats
#>     id material volume weight density
#> 1: M01 Plastics     50     NA      NA
#> 2: M02 Plastics     85    3.0   28.33
#> 3: M03     Wood     15     NA      NA
#> 4: M04     Wood     16    0.6   26.67
#> 5: T01    Metal    120   10.0   12.00
#> 6: T02    Metal    350   45.0    7.78

如果给行添加索引,使用:

product_stats[, i := .I]
product_stats
#>     id material volume weight density i
#> 1: M01 Plastics     50     NA      NA 1
#> 2: M02 Plastics     85    3.0   28.33 2
#> 3: M03     Wood     15     NA      NA 3
#> 4: M04     Wood     16    0.6   26.67 4
#> 5: T01    Metal    120   10.0   12.00 5
#> 6: T02    Metal    350   45.0    7.78 6

为方便,索引一般在第1列,所以我们要修改列的顺序:

setcolorder(product_stats, c("i", "id", "material", "weight", "volume", "density"))
product_stats
#>    i  id material weight volume density
#> 1: 1 M01 Plastics     NA     50      NA
#> 2: 2 M02 Plastics    3.0     85   28.33
#> 3: 3 M03     Wood     NA     15      NA
#> 4: 4 M04     Wood    0.6     16   26.67
#> 5: 5 T01    Metal   10.0    120   12.00
#> 6: 6 T02    Metal   45.0    350    7.78

data.table的动态作用域

我们不仅可以直接使用列,也可以提前定义注入 .N.I.SD 来指代数据中的重要部分。

为演示,我们先创建新的data.table,命名为 market_data ,其中date列是连续的。

market_data = data.table(date = as.Date("2015-05-01") + 0:299)
head(market_data)
#>          date
#> 1: 2015-05-01
#> 2: 2015-05-02
#> 3: 2015-05-03
#> 4: 2015-05-04
#> 5: 2015-05-05
#> 6: 2015-05-06

向调用函数一样,我们给data.table添加数据列:

set.seed(123)

market_data[, `:=`(
    price = round(30 * cumprod(1 + rnorm(300, 0.001, 0.05)), 2),
    volume = rbinom(300, 5000, 0.8)
)]

注意这里的price和volumn都是服从正态分布的随机数:

head(market_data)
#>          date price volume
#> 1: 2015-05-01  29.2   4021
#> 2: 2015-05-02  28.9   4000
#> 3: 2015-05-03  31.2   4033
#> 4: 2015-05-04  31.3   4036
#> 5: 2015-05-05  31.5   3995
#> 6: 2015-05-06  34.3   3955

我们以图形的方式展示数据:

plot(price ~ date, data = market_data,
     type = "l",
     main = "Market data")

R-数据操作

数据准备好后,我们看看动态作用域如何让事情变得简单。

看下时间范围:

market_data[, range(date)]
#> [1] "2015-05-01" "2016-02-24"

将数据整合缩减为月度数据:

monthly = market_data[,
                      .(open = price[[1]], high = max(price),
                        low = min(price), close = price[[.N]]),
                      keyby = .(year = year(date), month = month(date))]

head(monthly)
#>    year month open high  low close
#> 1: 2015     5 29.2 37.7 26.1  28.4
#> 2: 2015     6 28.1 37.6 28.1  37.2
#> 3: 2015     7 36.3 41.0 32.1  41.0
#> 4: 2015     8 41.5 50.0 30.9  30.9
#> 5: 2015     9 30.5 34.5 22.9  27.0
#> 6: 2015    10 25.7 33.2 24.6  29.3

计算过程为: 先根据by表达式将原始数据分割,分割后的每个部分都是原始数据的一个子集,并且原始数据和子集都是data.table。然后在每个子集data.table的语义中计算j表达式

下面代码没有按组聚合数据,而是画了每年的价格图:

oldpar = par(mfrow = c(1, 2))
market_data[, {
    plot(price ~ date, type = "l",
         main = sprintf("Market data (%d)", year))
}, by = .(year = year(date))]
par(oldpar)

这里我们没有为 plot() 设定data参数,图像也成功绘制,这是因为该操作是在data.table的语义中进行的。

此外,j表达式还可以用于构建模型的代码,下面是一个批量拟合线性模型的例子。这里使用 diamonds 数据集。

data("diamonds", package = "ggplot2")

setDT(diamonds)
head(diamonds)
#>    carat       cut color clarity depth table price    x    y    z
#> 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

该数据集包含超过5万条钻石信息的记录,每条记录了钻石的10个属性,现在我们队cut列中的每种切割类型都你拟合一个线性回归模型,由此观察每种切割类型中carat与depth是如何反映log(price)的信息。

diamonds[, {
    m = lm(log(price) ~ carat + depth)
    as.list(coef(m))
}, keyby = .(cut)]
#>          cut (Intercept) carat    depth
#> 1:      Fair        7.73  1.26 -0.01498
#> 2:      Good        7.08  1.97 -0.01460
#> 3: Very Good        6.29  2.09 -0.00289
#> 4:   Premium        5.93  1.85  0.00594
#> 5:     Ideal        8.50  2.13 -0.03808

动态作用域允许我们组合使用data.table内部或外部预定义的符号。举例,我们定义一个函数,计算market_data中由用户定义的列的年度均值:

average = function(column){
    market_data[, .(average = mean(.SD[[column]])),
                by = .(year = year(date))]
}

这里我们使用 .SD[[x]] 提取x列的值,这跟通过名字从列表中提取成分或元素相同。

下面计算每年的平均价格:

average("price")
#>    year average
#> 1: 2015    32.3
#> 2: 2016    32.4

每年平均数量:

average("volume")
#>    year average
#> 1: 2015    4000
#> 2: 2016    4003

我们可以利用此包专门的语法创造一个列数动态变化的组合,并且组合中的列是由动态变化的名称决定的。

这里我们假设添加额外的3列数据,每一列都是原始价格加了随机噪声生成的。不用重复调用 market_date[, price1 := ...] ,而是使用 market_data[, (columns) := list(...)] 来动态设定列,其中 columns 是一个包含列名的字符向量, list(...) 是每个列对应的值:

price_cols = paste0("price", 1:3)
market_data[, (price_cols) := lapply(1:3,
                                     function(i) round(price + rnorm(.N, 0, 5), 2))]
head(market_data)
#>          date price volume price1 price2 price3
#> 1: 2015-05-01  29.2   4021   30.6   27.4   33.2
#> 2: 2015-05-02  28.9   4000   29.7   20.4   36.0
#> 3: 2015-05-03  31.2   4033   34.3   26.9   27.2
#> 4: 2015-05-04  31.3   4036   29.3   29.0   28.0
#> 5: 2015-05-05  31.5   3995   36.0   32.1   34.8
#> 6: 2015-05-06  34.3   3955   30.1   31.0   35.2

另一方面,如果表格有很多列,并且需要对它们的子集进行一些计算,也可以用类似的语法来解决。

举例,我们现在需要对每个价格列调用 na.locf() 以去掉缺失值,先获取所有的价格列:

cols = colnames(market_data)

price_cols = cols[grep("^price", cols)]

price_cols
#> [1] "price"  "price1" "price2" "price3"

然后我们用类似的语法,并添加一个参数 .SDcols = price_cols ,这是为了让 .SD 中的列只是我们想要的那些价格列。

market_data[, (price_cols) := lapply(.SD, zoo::na.locf), .SDcols =  price_cols]

head(market_data)
#>          date price volume price1 price2 price3
#> 1: 2015-05-01  29.2   4021   30.6   27.4   33.2
#> 2: 2015-05-02  28.9   4000   29.7   20.4   36.0
#> 3: 2015-05-03  31.2   4033   34.3   26.9   27.2
#> 4: 2015-05-04  31.3   4036   29.3   29.0   28.0
#> 5: 2015-05-05  31.5   3995   36.0   32.1   34.8
#> 6: 2015-05-06  34.3   3955   30.1   31.0   35.2

最后,更多操作请前往 https://github.com/Rdatatable/data.table/wiki 查看完整功能列表。

使用dplyr操作数据框

关于dplyr的基本操作我已经写过很多笔记了,不再赘述,想学习的读者请参阅下面几篇文章,这部分我只挑没接触过的学习下。

data.table 类似, dplyr 也提供了 do() 函数来对每组数据进行任意操作。

例如将 diamondscut 分组,每组都按 log(price) ~ carat 拟合一个线性模型。和 data.table 不同的是,我们需要为操作指定一个名称,以便将结果存储在列中。而且 do() 表达式不能直接在分组数据的语义下计算 ,我们需要使用 . 来表示数据。

data("diamonds", package = "ggplot2")
models = diamonds %>% 
    group_by(cut) %>% 
    do(lmod = lm(log(price) ~ carat, data = .))
models
#> Source: local data frame [5 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 5 x 2
#>   cut       lmod    
#> * <ord>     <list>  
#> 1 Fair      <S3: lm>
#> 2 Good      <S3: lm>
#> 3 Very Good <S3: lm>
#> 4 Premium   <S3: lm>
#> 5 Ideal     <S3: lm>

注意结果创建了一个新列,该列不是典型的原子向量,每个元素都是模型的结果,包含线性回归对象的列表。我们可以通过索引来提取模型结果:

models$lmod[[1]]
#> 
#> Call:
#> lm(formula = log(price) ~ carat, data = .)
#> 
#> Coefficients:
#> (Intercept)        carat  
#>        6.78         1.25

在需要完成高度定制的操作时, do() 的优势非常明显 。下面举例。

假如我们需要分析 toy_tests 数据,要对每种产品的质量和耐久性进行汇总。如果只需要样本数最多的3个测试记录,并且每个产品的质量和耐久性是经样本数加权的平均数 ,下面是做法。

toy_tests %>% 
    group_by(id) %>% 
    arrange(desc(sample)) %>% 
    do(head(., 3)) %>% 
    summarise(
        quality = sum(quality * sample) / sum(sample),
        durability = sum(durability * sample) / sum(sample)
    )
#> # A tibble: 2 x 3
#>   id    quality durability
#>   <chr>   <dbl>      <dbl>
#> 1 T01      9.32       9.38
#> 2 T02      9.04       8.34

为了查看中间结果,可以运行 do() 之前的代码:

toy_tests %>% 
    group_by(id) %>% 
    arrange(desc(sample))
#> # A tibble: 8 x 6
#> # Groups:   id [2]
#>   id        date sample quality durability ym    
#>   <chr>    <int>  <int>   <int>      <int> <chr> 
#> 1 T01   20160405    180       9         10 201604
#> 2 T01   20160302    150      10          9 201603
#> 3 T01   20160502    140       9          9 201605
#> 4 T01   20160201    100       9          9 201602
#> 5 T02   20160403     90       9          8 201604
#> 6 T02   20160502     85      10          9 201605
#> 7 T02   20160303     75       8          8 201603
#> 8 T02   20160201     70       7          9 201602

使用rlist包处理嵌套数据结构

在R中,最常见的嵌套数据结构是列表对象,之前关注的都是操作表格数据,这部分我们一起玩转 rlist 包,它是针对非表格数据设计的包

rlist 的设计与 dplyr 非常相似,它提供了针对列表对象的映射、筛选、选择、排序和聚合功能。

安装:

install.packages("rlist")

载入:

library(rlist)

为了把数据以列表形式载入R,我们需要用 jsonlite::fromJSON() 或者直接使用 rlist 提供的 list.load() 函数 (这里我找不到原作者的数据文件,所以自己谷歌找了类似的学习下):

products = list.load("../../R/dataset/products.json")

products 的每个成分都包含产品所有信息,使用 list.map() 可以在每个成分的语义中计算表达式:

str(list.map(products, id))
#> List of 66
#>  $ : chr "amazonite"
#>  $ : chr "amber"
#>  $ : chr "amethyst"
#>  $ : chr "ametrine"
#>  $ : chr "apophyllite"
#>  $ : chr "aquamarine"
#>  $ : chr "aragonite"
#>  $ : chr "azurite"
#>  $ : chr "barite"
#>  $ : chr "bismuth"
#>  $ : chr "black_tourmaline"
#>  $ : chr "bloodstone"
#>  $ : chr "blue_calcite"
#>  $ : chr "kyanite"
#>  $ : chr "carnelian"
#>  $ : chr "celestite"
#>  $ : chr "chrysocolla"
#>  $ : chr "chrysoprase"
#>  $ : chr "citrine"
#>  $ : chr "copper"
#>  $ : chr "dalmatian_stone"
#>  $ : chr "danburite"
#>  $ : chr "desert_rose"
#>  $ : chr "emerald"
#>  $ : chr "epidote"
#>  $ : chr "fluorite"
#>  $ : chr "garnet"
#>  $ : chr "green_calcite"
#>  $ : chr "hematite"
#>  $ : chr "honey_calcite"
#>  $ : chr "howlite"
#>  $ : chr "jasper"
#>  $ : chr "jet"
#>  $ : chr "kunzite"
#>  $ : chr "labradorite"
#>  $ : chr "lapis_lazuli"
#>  $ : chr "lemon_quartz"
#>  $ : chr "lemurian_quartz"
#>  $ : chr "lepidolite"
#>  $ : chr "libyan_gold_tektite"
#>  $ : chr "magnesite"
#>  $ : chr "malachite"
#>  $ : chr "meteorite"
#>  $ : chr "moldavite"
#>  $ : chr "moonstone"
#>  $ : chr "orange_calcite"
#>  $ : chr "peridot"
#>  $ : chr "pyrite"
#>  $ : chr "quartz"
#>  $ : chr "rhodochrosite"
#>  $ : chr "rhodonite"
#>  $ : chr "rose_quartz"
#>  $ : chr "ruby"
#>  $ : chr "scolecite"
#>  $ : chr "selenite"
#>  $ : chr "serpentine"
#>  $ : chr "shungite"
#>  $ : chr "smoky_quartz"
#>  $ : chr "sodalite"
#>  $ : chr "tektite"
#>  $ : chr "tibetan_black_quartz"
#>  $ : chr "tigers_eye"
#>  $ : chr "titanium_quartz"
#>  $ : chr "tourmalined_quartz"
#>  $ : chr "turquoise"
#>  $ : chr "vanadinite"

list.mapv() 简化这个列表,返回一个向量:

list.mapv(products, name)
#>  [1] "Amazonite"            "Amber"                "Amethyst"            
#>  [4] "Ametrine"             "Apophyllite"          "Aquamarine"          
#>  [7] "Aragonite"            "Azurite"              "Barite"              
#> [10] "Bismuth"              "Black Tourmaline"     "Bloodstone"          
#> [13] "Blue Calcite"         "Blue Kyanite"         "Carnelian"           
#> [16] "Celestite"            "Chrysocolla"          "Chrysoprase"         
#> [19] "Citrine"              "Copper"               "Dalmatian Stone"     
#> [22] "Danburite"            "Desert Rose"          "Emerald"             
#> [25] "Epidote"              "Fluorite"             "Garnet"              
#> [28] "Green Calcite"        "Hematite"             "Honey Calcite"       
#> [31] "Howlite"              "Jasper"               "Jet"                 
#> [34] "Kunzite"              "Labradorite"          "Lapis Lazuli"        
#> [37] "Lemon Quartz"         "Lemurian Quartz"      "Lepidolite"          
#> [40] "Libyan Gold Tektite"  "Magnesite"            "Malachite"           
#> [43] "Meteorite"            "Moldavite"            "Moonstone"           
#> [46] "Orange Calcite"       "Peridot"              "Pyrite"              
#> [49] "Quartz"               "Rhodochrosite"        "Rhodonite"           
#> [52] "Rose Quartz"          "Ruby"                 "Scolecite"           
#> [55] "Selenite"             "Serpentine"           "Shungite"            
#> [58] "Smoky Quartz"         "Sodalite"             "Tektite"             
#> [61] "Tibetan Black Quartz" "Tiger Eye"            "Titanium Quartz"     
#> [64] "Tourmalined Quartz"   "Turquoise"            "Vanadinite"

为了对products进行筛选,我们可以使用 list.filter() ,只有所有条件都为 TRUEproducts 元素才会被返回:

liveTRUE = list.filter(products, livemode = TRUE)
list.mapv(liveTRUE, name)
#>  [1] "Amazonite"            "Amber"                "Amethyst"            
#>  [4] "Ametrine"             "Apophyllite"          "Aquamarine"          
#>  [7] "Aragonite"            "Azurite"              "Barite"              
#> [10] "Bismuth"              "Black Tourmaline"     "Bloodstone"          
#> [13] "Blue Calcite"         "Blue Kyanite"         "Carnelian"           
#> [16] "Celestite"            "Chrysocolla"          "Chrysoprase"         
#> [19] "Citrine"              "Copper"               "Dalmatian Stone"     
#> [22] "Danburite"            "Desert Rose"          "Emerald"             
#> [25] "Epidote"              "Fluorite"             "Garnet"              
#> [28] "Green Calcite"        "Hematite"             "Honey Calcite"       
#> [31] "Howlite"              "Jasper"               "Jet"                 
#> [34] "Kunzite"              "Labradorite"          "Lapis Lazuli"        
#> [37] "Lemon Quartz"         "Lemurian Quartz"      "Lepidolite"          
#> [40] "Libyan Gold Tektite"  "Magnesite"            "Malachite"           
#> [43] "Meteorite"            "Moldavite"            "Moonstone"           
#> [46] "Orange Calcite"       "Peridot"              "Pyrite"              
#> [49] "Quartz"               "Rhodochrosite"        "Rhodonite"           
#> [52] "Rose Quartz"          "Ruby"                 "Scolecite"           
#> [55] "Selenite"             "Serpentine"           "Shungite"            
#> [58] "Smoky Quartz"         "Sodalite"             "Tektite"             
#> [61] "Tibetan Black Quartz" "Tiger Eye"            "Titanium Quartz"     
#> [64] "Tourmalined Quartz"   "Turquoise"            "Vanadinite"

注意, rlist 的设计与 dplyr 函数非常类似,我们可以利用管道符号将数据往下传递:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.mapv(name)
#>  [1] "Amazonite"            "Amber"                "Amethyst"            
#>  [4] "Ametrine"             "Apophyllite"          "Aquamarine"          
#>  [7] "Aragonite"            "Azurite"              "Barite"              
#> [10] "Bismuth"              "Black Tourmaline"     "Bloodstone"          
#> [13] "Blue Calcite"         "Blue Kyanite"         "Carnelian"           
#> [16] "Celestite"            "Chrysocolla"          "Chrysoprase"         
#> [19] "Citrine"              "Copper"               "Dalmatian Stone"     
#> [22] "Danburite"            "Desert Rose"          "Emerald"             
#> [25] "Epidote"              "Fluorite"             "Garnet"              
#> [28] "Green Calcite"        "Hematite"             "Honey Calcite"       
#> [31] "Howlite"              "Jasper"               "Jet"                 
#> [34] "Kunzite"              "Labradorite"          "Lapis Lazuli"        
#> [37] "Lemon Quartz"         "Lemurian Quartz"      "Lepidolite"          
#> [40] "Libyan Gold Tektite"  "Magnesite"            "Malachite"           
#> [43] "Meteorite"            "Moldavite"            "Moonstone"           
#> [46] "Orange Calcite"       "Peridot"              "Pyrite"              
#> [49] "Quartz"               "Rhodochrosite"        "Rhodonite"           
#> [52] "Rose Quartz"          "Ruby"                 "Scolecite"           
#> [55] "Selenite"             "Serpentine"           "Shungite"            
#> [58] "Smoky Quartz"         "Sodalite"             "Tektite"             
#> [61] "Tibetan Black Quartz" "Tiger Eye"            "Titanium Quartz"     
#> [64] "Tourmalined Quartz"   "Turquoise"            "Vanadinite"

使用 list.select 可以筛选字段以及创建新的字段:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.select(id, name, caption) %>% 
    str()
#> List of 66
#>  $ :List of 3
#>   ..$ id     : chr "amazonite"
#>   ..$ name   : chr "Amazonite"
#>   ..$ caption: chr "love,calm,communication"
#>  $ :List of 3
#>   ..$ id     : chr "amber"
#>   ..$ name   : chr "Amber"
#>   ..$ caption: chr "cleansing,vitality,protection"
#>  $ :List of 3
#>   ..$ id     : chr "amethyst"
#>   ..$ name   : chr "Amethyst"
#>   ..$ caption: chr "cleansing,protection,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "ametrine"
#>   ..$ name   : chr "Ametrine"
#>   ..$ caption: chr "Clarity,motivation,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "apophyllite"
#>   ..$ name   : chr "Apophyllite"
#>   ..$ caption: chr "insight,intuition,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "aquamarine"
#>   ..$ name   : chr "Aquamarine"
#>   ..$ caption: chr "Clarity,calm,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "aragonite"
#>   ..$ name   : chr "Aragonite"
#>   ..$ caption: chr "balance,strength,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "azurite"
#>   ..$ name   : chr "Azurite"
#>   ..$ caption: chr "Calm,balance,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "barite"
#>   ..$ name   : chr "Barite"
#>   ..$ caption: chr "Dreamwork,insight,intuition"
#>  $ :List of 3
#>   ..$ id     : chr "bismuth"
#>   ..$ name   : chr "Bismuth"
#>   ..$ caption: chr "Transformation,vitality,focus"
#>  $ :List of 3
#>   ..$ id     : chr "black_tourmaline"
#>   ..$ name   : chr "Black Tourmaline"
#>   ..$ caption: chr "grounding,protection,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "bloodstone"
#>   ..$ name   : chr "Bloodstone"
#>   ..$ caption: chr "courage,cleansing,vitality,strength,manifestation,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "blue_calcite"
#>   ..$ name   : chr "Blue Calcite"
#>   ..$ caption: chr "Creativity,clarity,intuition,calm,protection,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "kyanite"
#>   ..$ name   : chr "Blue Kyanite"
#>   ..$ caption: chr "communication,intuition,insight"
#>  $ :List of 3
#>   ..$ id     : chr "carnelian"
#>   ..$ name   : chr "Carnelian"
#>   ..$ caption: chr "confidence,courage,strength,passion,motivation,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "celestite"
#>   ..$ name   : chr "Celestite"
#>   ..$ caption: chr "Intuition,protection,calm"
#>  $ :List of 3
#>   ..$ id     : chr "chrysocolla"
#>   ..$ name   : chr "Chrysocolla"
#>   ..$ caption: chr "communication,strength,insight"
#>  $ :List of 3
#>   ..$ id     : chr "chrysoprase"
#>   ..$ name   : chr "Chrysoprase"
#>   ..$ caption: chr "Balance,joy,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "citrine"
#>   ..$ name   : chr "Citrine"
#>   ..$ caption: chr "confidence,love,passion,manifestation,intuition,creativity"
#>  $ :List of 3
#>   ..$ id     : chr "copper"
#>   ..$ name   : chr "Copper"
#>   ..$ caption: chr "balance,vitality,grounding"
#>  $ :List of 3
#>   ..$ id     : chr "dalmatian_stone"
#>   ..$ name   : chr "Dalmatian Stone"
#>   ..$ caption: chr "protection,calm,joy"
#>  $ :List of 3
#>   ..$ id     : chr "danburite"
#>   ..$ name   : chr "Danburite"
#>   ..$ caption: chr "Calm,joy,communication"
#>  $ :List of 3
#>   ..$ id     : chr "desert_rose"
#>   ..$ name   : chr "Desert Rose"
#>   ..$ caption: chr "insight,courage,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "emerald"
#>   ..$ name   : chr "Emerald"
#>   ..$ caption: chr "Love,Vitality,Balance"
#>  $ :List of 3
#>   ..$ id     : chr "epidote"
#>   ..$ name   : chr "Epidote"
#>   ..$ caption: chr "Transformation,manifestation,motivation"
#>  $ :List of 3
#>   ..$ id     : chr "fluorite"
#>   ..$ name   : chr "Fluorite"
#>   ..$ caption: chr "focus,clarity,insight"
#>  $ :List of 3
#>   ..$ id     : chr "garnet"
#>   ..$ name   : chr "Garnet"
#>   ..$ caption: chr "Creativity,manifestation,passion"
#>  $ :List of 3
#>   ..$ id     : chr "green_calcite"
#>   ..$ name   : chr "Green Calcite"
#>   ..$ caption: chr "Calm,Balance,Love"
#>  $ :List of 3
#>   ..$ id     : chr "hematite"
#>   ..$ name   : chr "Hematite"
#>   ..$ caption: chr "balance,grounding,manifestation"
#>  $ :List of 3
#>   ..$ id     : chr "honey_calcite"
#>   ..$ name   : chr "Honey Calcite"
#>   ..$ caption: chr "confidence,motivation,manifestation"
#>  $ :List of 3
#>   ..$ id     : chr "howlite"
#>   ..$ name   : chr "Howlite"
#>   ..$ caption: chr "focus,calm,creativity"
#>  $ :List of 3
#>   ..$ id     : chr "jasper"
#>   ..$ name   : chr "Jasper"
#>   ..$ caption: chr "Grounding,Balance,Strength"
#>  $ :List of 3
#>   ..$ id     : chr "jet"
#>   ..$ name   : chr "Jet"
#>   ..$ caption: chr "Grounding,cleansing,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "kunzite"
#>   ..$ name   : chr "Kunzite"
#>   ..$ caption: chr "Love,joy,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "labradorite"
#>   ..$ name   : chr "Labradorite"
#>   ..$ caption: chr "Creativity,joy,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "lapis_lazuli"
#>   ..$ name   : chr "Lapis Lazuli"
#>   ..$ caption: chr "insight,calm,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "lemon_quartz"
#>   ..$ name   : chr "Lemon Quartz"
#>   ..$ caption: chr "Joy,Balance,Vitality"
#>  $ :List of 3
#>   ..$ id     : chr "lemurian_quartz"
#>   ..$ name   : chr "Lemurian Quartz"
#>   ..$ caption: chr "Love,strength,balance"
#>  $ :List of 3
#>   ..$ id     : chr "lepidolite"
#>   ..$ name   : chr "Lepidolite"
#>   ..$ caption: chr "Calm,Balance,Cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "libyan_gold_tektite"
#>   ..$ name   : chr "Libyan Gold Tektite"
#>   ..$ caption: chr "Motivation,confidence,manifestation,protection,creativity,strength"
#>  $ :List of 3
#>   ..$ id     : chr "magnesite"
#>   ..$ name   : chr "Magnesite"
#>   ..$ caption: chr "balance,calm,insight"
#>  $ :List of 3
#>   ..$ id     : chr "malachite"
#>   ..$ name   : chr "Malachite"
#>   ..$ caption: chr "creativity,strength,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "meteorite"
#>   ..$ name   : chr "Meteorite"
#>   ..$ caption: chr "Vitality,Intuition,Insight"
#>  $ :List of 3
#>   ..$ id     : chr "moldavite"
#>   ..$ name   : chr "Moldavite"
#>   ..$ caption: chr "Cleansing,Manifestation,Vitality,Insight,Transformation,Dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "moonstone"
#>   ..$ name   : chr "Moonstone"
#>   ..$ caption: chr "insight,intuition,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "orange_calcite"
#>   ..$ name   : chr "Orange Calcite"
#>   ..$ caption: chr "Creativity,vitality,passion"
#>  $ :List of 3
#>   ..$ id     : chr "peridot"
#>   ..$ name   : chr "Peridot"
#>   ..$ caption: chr "manifestation,vitality,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "pyrite"
#>   ..$ name   : chr "Pyrite"
#>   ..$ caption: chr "manifestation,creativity,confidence,passion,strength,focus"
#>  $ :List of 3
#>   ..$ id     : chr "quartz"
#>   ..$ name   : chr "Quartz"
#>   ..$ caption: chr "clarity,transformation,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "rhodochrosite"
#>   ..$ name   : chr "Rhodochrosite"
#>   ..$ caption: chr "Love,courage,joy"
#>  $ :List of 3
#>   ..$ id     : chr "rhodonite"
#>   ..$ name   : chr "Rhodonite"
#>   ..$ caption: chr "Love,transformation,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "rose_quartz"
#>   ..$ name   : chr "Rose Quartz"
#>   ..$ caption: chr "love,calm,creativity"
#>  $ :List of 3
#>   ..$ id     : chr "ruby"
#>   ..$ name   : chr "Ruby"
#>   ..$ caption: chr "Passion,vitality,courage,strength,protection,motivation"
#>  $ :List of 3
#>   ..$ id     : chr "scolecite"
#>   ..$ name   : chr "Scolecite"
#>   ..$ caption: chr "calm,insight,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "selenite"
#>   ..$ name   : chr "Selenite"
#>   ..$ caption: chr "Clarity,cleansing,motivation"
#>  $ :List of 3
#>   ..$ id     : chr "serpentine"
#>   ..$ name   : chr "Serpentine"
#>   ..$ caption: chr "Manifestation,transformation,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "shungite"
#>   ..$ name   : chr "Shungite"
#>   ..$ caption: chr "clarity,transformation,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "smoky_quartz"
#>   ..$ name   : chr "Smoky Quartz"
#>   ..$ caption: chr "calm,clarity,protection"
#>  $ :List of 3
#>   ..$ id     : chr "sodalite"
#>   ..$ name   : chr "Sodalite"
#>   ..$ caption: chr "Insight,Intuition,Dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "tektite"
#>   ..$ name   : chr "Tektite"
#>   ..$ caption: chr "Communication,transformation,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "tibetan_black_quartz"
#>   ..$ name   : chr "Tibetan Black Quartz"
#>   ..$ caption: chr "Protection,balance,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "tigers_eye"
#>   ..$ name   : chr "Tiger Eye"
#>   ..$ caption: chr "Balance,confidence,strength"
#>  $ :List of 3
#>   ..$ id     : chr "titanium_quartz"
#>   ..$ name   : chr "Titanium Quartz"
#>   ..$ caption: chr "strength,focus,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "tourmalined_quartz"
#>   ..$ name   : chr "Tourmalined Quartz"
#>   ..$ caption: chr "Cleansing,protection,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "turquoise"
#>   ..$ name   : chr "Turquoise"
#>   ..$ caption: chr "strength,balance,communication"
#>  $ :List of 3
#>   ..$ id     : chr "vanadinite"
#>   ..$ name   : chr "Vanadinite"
#>   ..$ caption: chr "creativity,motivation,vitality"

创建新字段:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.select(id, name, caption) %>% 
    list.select(connect = paste(id, name, sep = "-")) %>% 
    str()
#> List of 66
#>  $ :List of 1
#>   ..$ connect: chr "amazonite-Amazonite"
#>  $ :List of 1
#>   ..$ connect: chr "amber-Amber"
#>  $ :List of 1
#>   ..$ connect: chr "amethyst-Amethyst"
#>  $ :List of 1
#>   ..$ connect: chr "ametrine-Ametrine"
#>  $ :List of 1
#>   ..$ connect: chr "apophyllite-Apophyllite"
#>  $ :List of 1
#>   ..$ connect: chr "aquamarine-Aquamarine"
#>  $ :List of 1
#>   ..$ connect: chr "aragonite-Aragonite"
#>  $ :List of 1
#>   ..$ connect: chr "azurite-Azurite"
#>  $ :List of 1
#>   ..$ connect: chr "barite-Barite"
#>  $ :List of 1
#>   ..$ connect: chr "bismuth-Bismuth"
#>  $ :List of 1
#>   ..$ connect: chr "black_tourmaline-Black Tourmaline"
#>  $ :List of 1
#>   ..$ connect: chr "bloodstone-Bloodstone"
#>  $ :List of 1
#>   ..$ connect: chr "blue_calcite-Blue Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "kyanite-Blue Kyanite"
#>  $ :List of 1
#>   ..$ connect: chr "carnelian-Carnelian"
#>  $ :List of 1
#>   ..$ connect: chr "celestite-Celestite"
#>  $ :List of 1
#>   ..$ connect: chr "chrysocolla-Chrysocolla"
#>  $ :List of 1
#>   ..$ connect: chr "chrysoprase-Chrysoprase"
#>  $ :List of 1
#>   ..$ connect: chr "citrine-Citrine"
#>  $ :List of 1
#>   ..$ connect: chr "copper-Copper"
#>  $ :List of 1
#>   ..$ connect: chr "dalmatian_stone-Dalmatian Stone"
#>  $ :List of 1
#>   ..$ connect: chr "danburite-Danburite"
#>  $ :List of 1
#>   ..$ connect: chr "desert_rose-Desert Rose"
#>  $ :List of 1
#>   ..$ connect: chr "emerald-Emerald"
#>  $ :List of 1
#>   ..$ connect: chr "epidote-Epidote"
#>  $ :List of 1
#>   ..$ connect: chr "fluorite-Fluorite"
#>  $ :List of 1
#>   ..$ connect: chr "garnet-Garnet"
#>  $ :List of 1
#>   ..$ connect: chr "green_calcite-Green Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "hematite-Hematite"
#>  $ :List of 1
#>   ..$ connect: chr "honey_calcite-Honey Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "howlite-Howlite"
#>  $ :List of 1
#>   ..$ connect: chr "jasper-Jasper"
#>  $ :List of 1
#>   ..$ connect: chr "jet-Jet"
#>  $ :List of 1
#>   ..$ connect: chr "kunzite-Kunzite"
#>  $ :List of 1
#>   ..$ connect: chr "labradorite-Labradorite"
#>  $ :List of 1
#>   ..$ connect: chr "lapis_lazuli-Lapis Lazuli"
#>  $ :List of 1
#>   ..$ connect: chr "lemon_quartz-Lemon Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "lemurian_quartz-Lemurian Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "lepidolite-Lepidolite"
#>  $ :List of 1
#>   ..$ connect: chr "libyan_gold_tektite-Libyan Gold Tektite"
#>  $ :List of 1
#>   ..$ connect: chr "magnesite-Magnesite"
#>  $ :List of 1
#>   ..$ connect: chr "malachite-Malachite"
#>  $ :List of 1
#>   ..$ connect: chr "meteorite-Meteorite"
#>  $ :List of 1
#>   ..$ connect: chr "moldavite-Moldavite"
#>  $ :List of 1
#>   ..$ connect: chr "moonstone-Moonstone"
#>  $ :List of 1
#>   ..$ connect: chr "orange_calcite-Orange Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "peridot-Peridot"
#>  $ :List of 1
#>   ..$ connect: chr "pyrite-Pyrite"
#>  $ :List of 1
#>   ..$ connect: chr "quartz-Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "rhodochrosite-Rhodochrosite"
#>  $ :List of 1
#>   ..$ connect: chr "rhodonite-Rhodonite"
#>  $ :List of 1
#>   ..$ connect: chr "rose_quartz-Rose Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "ruby-Ruby"
#>  $ :List of 1
#>   ..$ connect: chr "scolecite-Scolecite"
#>  $ :List of 1
#>   ..$ connect: chr "selenite-Selenite"
#>  $ :List of 1
#>   ..$ connect: chr "serpentine-Serpentine"
#>  $ :List of 1
#>   ..$ connect: chr "shungite-Shungite"
#>  $ :List of 1
#>   ..$ connect: chr "smoky_quartz-Smoky Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "sodalite-Sodalite"
#>  $ :List of 1
#>   ..$ connect: chr "tektite-Tektite"
#>  $ :List of 1
#>   ..$ connect: chr "tibetan_black_quartz-Tibetan Black Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "tigers_eye-Tiger Eye"
#>  $ :List of 1
#>   ..$ connect: chr "titanium_quartz-Titanium Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "tourmalined_quartz-Tourmalined Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "turquoise-Turquoise"
#>  $ :List of 1
#>   ..$ connect: chr "vanadinite-Vanadinite"

list.sort() 函数,可以按照指定字段或值对列表元素进行排序,再用 list.stack() 将所有元素变成数据框:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.select(id, name, caption) %>% 
    list.sort(id) %>% 
    list.stack()
#>                      id                 name
#> 1             amazonite            Amazonite
#> 2                 amber                Amber
#> 3              amethyst             Amethyst
#> 4              ametrine             Ametrine
#> 5           apophyllite          Apophyllite
#> 6            aquamarine           Aquamarine
#> 7             aragonite            Aragonite
#> 8               azurite              Azurite
#> 9                barite               Barite
#> 10              bismuth              Bismuth
#> 11     black_tourmaline     Black Tourmaline
#> 12           bloodstone           Bloodstone
#> 13         blue_calcite         Blue Calcite
#> 14            carnelian            Carnelian
#> 15            celestite            Celestite
#> 16          chrysocolla          Chrysocolla
#> 17          chrysoprase          Chrysoprase
#> 18              citrine              Citrine
#> 19               copper               Copper
#> 20      dalmatian_stone      Dalmatian Stone
#> 21            danburite            Danburite
#> 22          desert_rose          Desert Rose
#> 23              emerald              Emerald
#> 24              epidote              Epidote
#> 25             fluorite             Fluorite
#> 26               garnet               Garnet
#> 27        green_calcite        Green Calcite
#> 28             hematite             Hematite
#> 29        honey_calcite        Honey Calcite
#> 30              howlite              Howlite
#> 31               jasper               Jasper
#> 32                  jet                  Jet
#> 33              kunzite              Kunzite
#> 34              kyanite         Blue Kyanite
#> 35          labradorite          Labradorite
#> 36         lapis_lazuli         Lapis Lazuli
#> 37         lemon_quartz         Lemon Quartz
#> 38      lemurian_quartz      Lemurian Quartz
#> 39           lepidolite           Lepidolite
#> 40  libyan_gold_tektite  Libyan Gold Tektite
#> 41            magnesite            Magnesite
#> 42            malachite            Malachite
#> 43            meteorite            Meteorite
#> 44            moldavite            Moldavite
#> 45            moonstone            Moonstone
#> 46       orange_calcite       Orange Calcite
#> 47              peridot              Peridot
#> 48               pyrite               Pyrite
#> 49               quartz               Quartz
#> 50        rhodochrosite        Rhodochrosite
#> 51            rhodonite            Rhodonite
#> 52          rose_quartz          Rose Quartz
#> 53                 ruby                 Ruby
#> 54            scolecite            Scolecite
#> 55             selenite             Selenite
#> 56           serpentine           Serpentine
#> 57             shungite             Shungite
#> 58         smoky_quartz         Smoky Quartz
#> 59             sodalite             Sodalite
#> 60              tektite              Tektite
#> 61 tibetan_black_quartz Tibetan Black Quartz
#> 62           tigers_eye            Tiger Eye
#> 63      titanium_quartz      Titanium Quartz
#> 64   tourmalined_quartz   Tourmalined Quartz
#> 65            turquoise            Turquoise
#> 66           vanadinite           Vanadinite
#>                                                               caption
#> 1                                             love,calm,communication
#> 2                                       cleansing,vitality,protection
#> 3                                        cleansing,protection,clarity
#> 4                                       Clarity,motivation,confidence
#> 5                                          insight,intuition,vitality
#> 6                                              Clarity,calm,cleansing
#> 7                                         balance,strength,confidence
#> 8                                              Calm,balance,dreamwork
#> 9                                         Dreamwork,insight,intuition
#> 10                                      Transformation,vitality,focus
#> 11                                     grounding,protection,cleansing
#> 12   courage,cleansing,vitality,strength,manifestation,transformation
#> 13             Creativity,clarity,intuition,calm,protection,dreamwork
#> 14            confidence,courage,strength,passion,motivation,vitality
#> 15                                          Intuition,protection,calm
#> 16                                     communication,strength,insight
#> 17                                         Balance,joy,transformation
#> 18         confidence,love,passion,manifestation,intuition,creativity
#> 19                                         balance,vitality,grounding
#> 20                                                protection,calm,joy
#> 21                                             Calm,joy,communication
#> 22                                            insight,courage,clarity
#> 23                                              Love,Vitality,Balance
#> 24                            Transformation,manifestation,motivation
#> 25                                              focus,clarity,insight
#> 26                                   Creativity,manifestation,passion
#> 27                                                  Calm,Balance,Love
#> 28                                    balance,grounding,manifestation
#> 29                                confidence,motivation,manifestation
#> 30                                              focus,calm,creativity
#> 31                                         Grounding,Balance,Strength
#> 32                                       Grounding,cleansing,vitality
#> 33                                                  Love,joy,vitality
#> 34                                    communication,intuition,insight
#> 35                                            Creativity,joy,vitality
#> 36                                        insight,calm,transformation
#> 37                                               Joy,Balance,Vitality
#> 38                                              Love,strength,balance
#> 39                                             Calm,Balance,Cleansing
#> 40 Motivation,confidence,manifestation,protection,creativity,strength
#> 41                                               balance,calm,insight
#> 42                                     creativity,strength,confidence
#> 43                                         Vitality,Intuition,Insight
#> 44  Cleansing,Manifestation,Vitality,Insight,Transformation,Dreamwork
#> 45                                        insight,intuition,dreamwork
#> 46                                        Creativity,vitality,passion
#> 47                              manifestation,vitality,transformation
#> 48         manifestation,creativity,confidence,passion,strength,focus
#> 49                                   clarity,transformation,cleansing
#> 50                                                   Love,courage,joy
#> 51                                        Love,transformation,clarity
#> 52                                               love,calm,creativity
#> 53            Passion,vitality,courage,strength,protection,motivation
#> 54                                             calm,insight,dreamwork
#> 55                                       Clarity,cleansing,motivation
#> 56                              Manifestation,transformation,vitality
#> 57                                   clarity,transformation,cleansing
#> 58                                            calm,clarity,protection
#> 59                                        Insight,Intuition,Dreamwork
#> 60                              Communication,transformation,vitality
#> 61                                       Protection,balance,cleansing
#> 62                                        Balance,confidence,strength
#> 63                                          strength,focus,confidence
#> 64                                       Cleansing,protection,clarity
#> 65                                     strength,balance,communication
#> 66                                     creativity,motivation,vitality

除此之外, list.group() 函数可以用指定变量字段值对元素分组, list.table() 提供了一个加强版的 table() 用于处理嵌套列表的表格处理。

更多 rlist 函数,请阅读 rlist 教程( https://renkun.me/rlist-tutorial ),另外 purrr 是基于其他理念处理嵌套数据结构的包,有兴趣可以看看。


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Numerical Recipes 3rd Edition

Numerical Recipes 3rd Edition

William H. Press、Saul A. Teukolsky、William T. Vetterling、Brian P. Flannery / Cambridge University Press / 2007-9-6 / GBP 64.99

Do you want easy access to the latest methods in scientific computing? This greatly expanded third edition of Numerical Recipes has it, with wider coverage than ever before, many new, expanded and upd......一起来看看 《Numerical Recipes 3rd Edition》 这本书的介绍吧!

URL 编码/解码
URL 编码/解码

URL 编码/解码

XML 在线格式化
XML 在线格式化

在线 XML 格式化压缩工具

RGB CMYK 转换工具
RGB CMYK 转换工具

RGB CMYK 互转工具