Introduction to Machine Learning with R

Category: R · Published: 6 years ago


These are the slides from the workshop Introduction to Machine Learning with R that I gave at the University of Heidelberg, Germany, on June 28th, 2018. The code for the entire workshop can be found below the video.

The workshop covered the basics of machine learning. With an example dataset, I went through a standard machine learning workflow in R with the caret and h2o packages:

  • Reading in data
  • Exploratory data analysis
  • Missing data
  • Feature engineering
  • Training/test split
  • Model training with random forests, gradient boosting, neural nets, etc.
  • Hyperparameter tuning

Setup

All analyses were done in R using RStudio. For detailed session information, including the R version, operating system, and package versions, see the sessionInfo() output at the end of this post.

All plots were generated with ggplot2.

library(tidyverse)  # data wrangling and visualization (includes ggplot2)
library(readr)      # fast reading of delimited files
library(mice)       # missing-data patterns and imputation

Data preparation

The dataset I am using in this example analysis is the Breast Cancer Wisconsin (Diagnostic) Dataset. It can be downloaded from the UC Irvine Machine Learning Repository.

The first look at the dataset is at the response classes to predict:

  • malignant or
  • benign breast mass

The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nucleoli
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read_delim("datasets/breast-cancer-wisconsin.data.txt",
                      delim = ",",
                      col_names = c("sample_code_number", 
                       "clump_thickness", 
                       "uniformity_of_cell_size", 
                       "uniformity_of_cell_shape", 
                       "marginal_adhesion", 
                       "single_epithelial_cell_size", 
                       "bare_nuclei", 
                       "bland_chromatin", 
                       "normal_nucleoli", 
                       "mitosis", 
                       "classes")) %>%
  mutate(bare_nuclei = as.numeric(bare_nuclei),  # non-numeric "?" entries become NA
         classes = ifelse(classes == "2", "benign",
                          ifelse(classes == "4", "malignant", NA)))
summary(bc_data)

Missing data

md.pattern(bc_data, plot = FALSE)
bc_data <- bc_data %>%
  drop_na() %>%
  select(classes, everything(), -sample_code_number)
head(bc_data)

Missing values can be imputed with the mice package.

For more information and a tutorial with code, see:

https://shirinsplayground.netlify.com/2018/04/flu_prediction/
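
As a minimal sketch of such an imputation, assuming it is run on the data as read in above (i.e. instead of the drop_na()/select() step); m and method are illustrative defaults, not settings from the workshop:

# impute the numeric predictors; the ID and class columns are left out
to_impute <- select(bc_data, -sample_code_number, -classes)
imp <- mice(to_impute, m = 5, method = "pmm", seed = 42, printFlag = FALSE)

# mice::complete() returns one fully imputed dataset
bc_data_imputed <- bind_cols(select(bc_data, classes), mice::complete(imp, 1))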

Data exploration

  • Response variable for classification
ggplot(bc_data, aes(x = classes, fill = classes)) +
  geom_bar()


For more information on dealing with unbalanced classes, see: https://shiring.github.io/machine_learning/2017/04/02/unbalanced

  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) +
  geom_histogram(bins = 10)


  • Features
gather(bc_data, x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = classes, fill = classes)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)


  • Correlation graphs
co_mat_benign <- filter(bc_data, classes == "benign") %>%
  select(-1) %>%
  cor()

co_mat_malignant <- filter(bc_data, classes == "malignant") %>%
  select(-1) %>%
  cor()

library(igraph)
g_benign <- graph.adjacency(co_mat_benign,
                         weighted = TRUE,
                         diag = FALSE,
                         mode = "upper")

g_malignant <- graph.adjacency(co_mat_malignant,
                         weighted = TRUE,
                         diag = FALSE,
                         mode = "upper")


# http://kateto.net/networks-r-igraph

cut.off_b <- mean(E(g_benign)$weight)
cut.off_m <- mean(E(g_malignant)$weight)

g_benign_2 <- delete_edges(g_benign, E(g_benign)[weight < cut.off_b])
g_malignant_2 <- delete_edges(g_malignant, E(g_malignant)[weight < cut.off_m])

c_g_benign_2 <- cluster_fast_greedy(g_benign_2) 
c_g_malignant_2 <- cluster_fast_greedy(g_malignant_2) 
par(mfrow = c(1,2))

plot(c_g_benign_2, g_benign_2,
     vertex.size = colSums(co_mat_benign) * 10,
     vertex.frame.color = NA, 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = E(g_benign_2)$weight * 15,
     layout = layout_with_fr(g_benign_2),
     main = "Benign tumors")

plot(c_g_malignant_2, g_malignant_2,
     vertex.size = colSums(co_mat_malignant) * 10,
     vertex.frame.color = NA, 
     vertex.label.color = "black", 
     vertex.label.cex = 0.8,
     edge.width = E(g_malignant_2)$weight * 15,
     layout = layout_with_fr(g_malignant_2),
     main = "Malignant tumors")


Principal Component Analysis (PCA)

library(ellipse)

# perform pca and extract scores
pcaOutput <- prcomp(as.matrix(bc_data[, -1]), scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$x)
  
# define groups for plotting
pcaOutput2$groups <- bc_data$classes
  
centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)

conf.rgn  <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
  data.frame(groups = as.character(t),
             ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
                   centre = as.matrix(centroids[centroids$groups == t, 2:3]),
                   level = 0.95),
             stringsAsFactors = FALSE)))
    
ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + 
    geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
    geom_point(size = 2, alpha = 0.6) + 
    labs(color = "",
         fill = "")


Multidimensional Scaling (MDS)

select(bc_data, -1) %>%
  dist() %>%
  cmdscale %>%
  as.data.frame() %>%
  mutate(group = bc_data$classes) %>%
  ggplot(aes(x = V1, y = V2, color = group)) +
    geom_point()


t-SNE dimensionality reduction

library(tsne)

select(bc_data, -1) %>%
  dist() %>%
  tsne() %>%
  as.data.frame() %>%
  mutate(group = bc_data$classes) %>%
  ggplot(aes(x = V1, y = V2, color = group)) +
    geom_point()


Machine learning packages in R

caret

# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)

Training, validation, and test data

set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data  <- bc_data[-index, ]

bind_rows(data.frame(group = "train", train_data),
      data.frame(group = "test", test_data)) %>%
  gather(x, y, clump_thickness:mitosis) %>%
  ggplot(aes(x = y, color = group, fill = group)) +
    geom_density(alpha = 0.3) +
    facet_wrap( ~ x, scales = "free", ncol = 3)


Regression

set.seed(42)
model_glm <- caret::train(clump_thickness ~ .,
                          data = train_data,
                          method = "glm",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))
model_glm
predictions <- predict(model_glm, test_data)
# model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values
data.frame(residuals = resid(model_glm),
           predictors = model_glm$finalModel$linear.predictors) %>%
  ggplot(aes(x = predictors, y = residuals)) +
    geom_jitter() +
    geom_smooth(method = "lm")


# y == train_data$clump_thickness
data.frame(residuals = resid(model_glm),
           y = model_glm$finalModel$y) %>%
  ggplot(aes(x = y, y = residuals)) +
    geom_jitter() +
    geom_smooth(method = "lm")


data.frame(actual = test_data$clump_thickness,
           predicted = predictions) %>%
  ggplot(aes(x = actual, y = predicted)) +
    geom_jitter() +
    geom_smooth(method = "lm")


Classification

  • Decision trees

rpart

library(rpart)
library(rpart.plot)

set.seed(42)
fit <- rpart(classes ~ .,
             data = train_data,
             method = "class",
             control = rpart.control(xval = 10,
                                     minbucket = 2,
                                     cp = 0),
             parms = list(split = "information"))

rpart.plot(fit, extra = 100)


  • Random Forests

Random Forests predictions are based on the generation of multiple classification trees. They can be used for both classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_rf <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 5, 
                                                  repeats = 3, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))

When you specify savePredictions = TRUE in trainControl, you can access the cross-validation results with model_rf$pred.
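
For example, a quick look at these held-out predictions (one row per observation per resample; this snippet is not from the original workshop code):

head(model_rf$pred)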

model_rf
model_rf$finalModel$confusion

Dealing with unbalanced data

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down-sampling here; the rest stays the same as with the original model. (Besides "down", trainControl also accepts "up", "smote", and "rose" as sampling options.)

set.seed(42)
model_rf_down <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  sampling = "down"))
model_rf_down

Feature importance

imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]
importance <- varImp(model_rf, scale = TRUE)
plot(importance)


Predicting on the test data

confusionMatrix(predict(model_rf, test_data), as.factor(test_data$classes))
results <- data.frame(actual = test_data$classes,
                      predict(model_rf, test_data, type = "prob"))

results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))

results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")


ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)


Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a fast and improved implementation of gradient boosting for supervised learning.

"XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance," says Tianqi Chen, the developer of xgboost.

XGBoost is a tree ensemble model, which means it sums the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to Random Forests, but it uses a different approach to model training. It can be used for classification and regression tasks. Here, I show a classification task.

set.seed(42)
model_xgb <- caret::train(classes ~ .,
                          data = train_data,
                          method = "xgbTree",
                          preProcess = c("scale", "center"),
                          trControl = trainControl(method = "repeatedcv", 
                                                  number = 5, 
                                                  repeats = 3, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))
model_xgb
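
The model above uses caret's default tuning grid. To make XGBoost's regularization controls explicit, a custom grid could be supplied instead; as a sketch with the tuning parameters caret exposes for method = "xgbTree" (the values are illustrative, not tuned):

xgb_grid <- expand.grid(nrounds = c(50, 100),    # boosting iterations
                        max_depth = c(3, 6),     # maximum tree depth
                        eta = c(0.1, 0.3),       # learning rate (shrinkage)
                        gamma = 0,               # minimum loss reduction to split
                        colsample_bytree = 0.8,  # column subsampling per tree
                        min_child_weight = 1,    # minimum sum of instance weights per leaf
                        subsample = 0.8)         # row subsampling per boosting round

# this grid would then be passed to caret::train() via tuneGrid = xgb_grid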

Feature importance

importance <- varImp(model_xgb, scale = TRUE)
plot(importance)


Predicting on the test data

confusionMatrix(predict(model_xgb, test_data), as.factor(test_data$classes))
results <- data.frame(actual = test_data$classes,
                      predict(model_xgb, test_data, type = "prob"))

results$prediction <- ifelse(results$benign > 0.5, "benign",
                             ifelse(results$malignant > 0.5, "malignant", NA))

results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)

ggplot(results, aes(x = prediction, fill = correct)) +
  geom_bar(position = "dodge")


ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) +
  geom_jitter(size = 3, alpha = 0.6)


Available models in caret

https://topepo.github.io/caret/available-models.html
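
The same list can also be queried from within R, for example:

# names of all model methods known to the installed caret version
head(names(caret::getModelInfo()))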

Feature selection

Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualized with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean correlation.

library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")


#Apply correlation filter at 0.70,
highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
highlyCor
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]

  • Recursive Feature Elimination (RFE)

Another way to select features is Recursive Feature Elimination. RFE uses a random forest algorithm to test combinations of features and rates each combination with an accuracy score. The combination with the highest score is usually preferred.

set.seed(7)
results_rfe <- rfe(x = train_data[, -1], 
                   y = as.factor(train_data$classes), 
                   sizes = c(1:9), 
                   rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))
predictors(results_rfe)
train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]

  • Genetic Algorithms (GA)

Genetic algorithms (GAs) were developed based on the evolutionary principle of natural selection: they aim to optimize a population of individuals of a given genotype by modeling selection over time. In each generation (i.e. iteration), each individual's fitness is calculated based on its genotype. The fittest individuals are then chosen to produce the next generation. This next generation of individuals will have genotypes that are recombinations of the parental alleles. These new genotypes again determine each individual's fitness. This selection process is iterated for a given number of generations and (ideally) leads to the fixation of the fittest alleles in the gene pool.

This concept of optimization can also be applied to non-evolutionary models, such as feature selection in machine learning.

set.seed(27)
model_ga <- gafs(x = train_data[, -1], 
                 y = as.factor(train_data$classes),
                 iters = 10, # generations of algorithm
                 popSize = 10, # population size for each generation
                 levels = c("malignant", "benign"),
                 gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF
                                           method = "cv",    # 10 fold cross validation
                                           genParallel = TRUE, # Use parallel programming
                                           allowParallel = TRUE))
plot(model_ga)


train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]

Hyperparameter tuning with caret

  • Cartesian grid
  • mtry: the number of variables randomly sampled as candidates at each split
set.seed(42)
grid <- expand.grid(mtry = c(1:10))

model_rf_tune_man <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE),
                         tuneGrid = grid)
model_rf_tune_man
plot(model_rf_tune_man)


  • Random search
set.seed(42)
model_rf_tune_auto <- caret::train(classes ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 10, 
                                                  repeats = 10, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  search = "random"),
                         tuneLength = 15)  # number of random parameter sets to try
model_rf_tune_auto
plot(model_rf_tune_auto)


Grid search with h2o

The R package h2o provides a convenient interface to H2O, an open-source machine learning and deep learning platform. H2O offers a broad range of common machine learning algorithms for classification, regression, and deep learning.

library(h2o)
h2o.init(nthreads = -1)
h2o.no_progress()
bc_data_hf <- as.h2o(bc_data)
h2o.describe(bc_data_hf) %>%
  gather(x, y, Zeros:Sigma) %>%
  mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", 
                        ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% 
  ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
    geom_point(size = 4, alpha = 0.6) +
    scale_color_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    facet_grid(group ~ ., scales = "free") +
    labs(x = "Feature",
         y = "Value",
         color = "")


library(reshape2) # for melting

bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1])

cor <- h2o.cor(bc_data_hf)
rownames(cor) <- colnames(cor)

melt(cor) %>%
  mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
  mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
  mutate(variable = factor(variable, levels = colnames(cor))) %>%
  ggplot(aes(x = variable, y = Var2, fill = value)) + 
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", 
         y = "")


Training, validation, and test data

splits <- h2o.splitFrame(bc_data_hf, 
                         ratios = c(0.7, 0.15), 
                         seed = 1)

train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

response <- "classes"
features <- setdiff(colnames(train), response)
summary(as.factor(train$classes), exact_quantiles = TRUE)
summary(as.factor(valid$classes), exact_quantiles = TRUE)
summary(as.factor(test$classes), exact_quantiles = TRUE)
pca <- h2o.prcomp(training_frame = train,
           x = features,
           validation_frame = valid,
           transform = "NORMALIZE",
           impute_missing = TRUE,
           k = 3,
           seed = 42)

eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features

library(ggrepel)
ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
  geom_point(color = "navy", alpha = 0.7) +
  geom_text_repel()


Classification

  • Random Forests
hyper_params <- list(
                     ntrees = c(25, 50, 75, 100),
                     max_depth = c(10, 20, 30),
                     min_rows = c(1, 3, 5)
                     )

search_criteria <- list(
                        strategy = "RandomDiscrete", 
                        max_models = 50,
                        max_runtime_secs = 360,
                        stopping_rounds = 5,          
                        stopping_metric = "AUC",      
                        stopping_tolerance = 0.0005,
                        seed = 42
                        )
f_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, 
                                                # alternatively h2o.gbm 
                                                # for Gradient boosting trees
                    x = features,
                    y = response,
                    grid_id = "rf_grid",
                    training_frame = train,
                    validation_frame = valid,
                    nfolds = 25,                           
                    fold_assignment = "Stratified",
                    hyper_params = hyper_params,
                    search_criteria = search_criteria,
                    seed = 42
                    )
# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss")

for (sort_by_1 in sort_options_1) {
  
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  h2o.saveModel(best_model, path="models", force = TRUE)
  
}


# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")

for (sort_by_2 in sort_options_2) {
  
  grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE)
  
  model_ids <- grid@model_ids
  best_model <- h2o.getModel(model_ids[[1]])
  
  h2o.saveModel(best_model, path = "models", force = TRUE)
  
}
files <- list.files(path = "models")
rf_models <- files[grep("rf_grid_model", files)]

for (model_id in rf_models) {
  
  path <- paste0(getwd(), "/models/", model_id)
  best_model <- h2o.loadModel(path)
  mse_auc_test <- data.frame(model_id = model_id, 
                             mse = h2o.mse(h2o.performance(best_model, test)),
                             auc = h2o.auc(h2o.performance(best_model, test)))
  
  if (model_id == rf_models[[1]]) {
    
    mse_auc_test_comb <- mse_auc_test
    
  } else {
    
    mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
    
  }
}
mse_auc_test_comb %>%
  gather(x, y, mse:auc) %>%
  ggplot(aes(x = model_id, y = y, fill = model_id)) +
    facet_grid(x ~ ., scales = "free") +
    geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
          plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
    labs(x = "", y = "value", fill = "")


for (model_id in rf_models) {
  
  best_model <- h2o.getModel(model_id)
  
  finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, 
                                                   nrow(test)),
                                    actual = as.vector(test$classes), 
                                    as.data.frame(h2o.predict(object = best_model, 
                                                              newdata = test)))
  
  finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == 
                                           finalRf_predictions$predict, 
                                         "yes", "no")
  
  finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, 
                                                  "benign", 
                                                  ifelse(finalRf_predictions$malignant 
                                                         > 0.8, "malignant", "uncertain"))
  
  finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == 
                                                     finalRf_predictions$predict_stringent, "yes", 
                                         ifelse(finalRf_predictions$predict_stringent == 
                                                  "uncertain", "na", "no"))
  
  if (model_id == rf_models[[1]]) {
    
    finalRf_predictions_comb <- finalRf_predictions
    
  } else {
    
    finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
    
  }
}
finalRf_predictions_comb %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap(~ model_id, ncol = 2) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")


finalRf_predictions_comb %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    facet_wrap(~ model_id, ncol = 2) +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")


rf_model <- h2o.loadModel("models/rf_grid_model_0")
h2o.varimp_plot(rf_model)


h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE)
h2o.confusionMatrix(rf_model, valid = TRUE)
plot(rf_model,
     timestep = "number_of_trees",
     metric = "classification_error")


plot(rf_model,
     timestep = "number_of_trees",
     metric = "logloss")


plot(rf_model,
     timestep = "number_of_trees",
     metric = "AUC")


plot(rf_model,
     timestep = "number_of_trees",
     metric = "rmse")


h2o.auc(rf_model, train = TRUE)
h2o.auc(rf_model, valid = TRUE)
h2o.auc(rf_model, xval = TRUE)
perf <- h2o.performance(rf_model, test)
perf
plot(perf)


perf@metrics$thresholds_and_metric_scores %>%
  ggplot(aes(x = fpr, y = tpr)) +
    geom_point() +
    geom_line() +
    geom_abline(slope = 1, intercept = 0) +
    labs(x = "False Positive Rate",
         y = "True Positive Rate")


h2o.logloss(perf)
h2o.mse(perf)
h2o.auc(perf)
head(h2o.metric(perf))
finalRf_predictions <- data.frame(actual = as.vector(test$classes), 
                                  as.data.frame(h2o.predict(object = rf_model, 
                                                            newdata = test)))

finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == 
                                         finalRf_predictions$predict, "yes", "no")

finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", 
                                                ifelse(finalRf_predictions$malignant 
                                                       > 0.8, "malignant", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == 
                                                   finalRf_predictions$predict_stringent, "yes", 
                                       ifelse(finalRf_predictions$predict_stringent == 
                                                "uncertain", "na", "no"))

finalRf_predictions %>%
  group_by(actual, predict) %>%
  dplyr::summarise(n = n())
finalRf_predictions %>%
  ggplot(aes(x = actual, fill = accurate)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Default predictions")


finalRf_predictions %>%
  subset(accurate_stringent != "na") %>%
  ggplot(aes(x = actual, fill = accurate_stringent)) +
    geom_bar(position = "dodge") +
    scale_fill_brewer(palette = "Set1") +
    labs(fill = "Were\npredictions\naccurate?",
         title = "Stringent predictions")


df <- finalRf_predictions[, c(1, 3, 4)]

thresholds <- seq(from = 0, to = 1, by = 0.1)

prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA)

for (threshold in thresholds) {
  # classify as benign whenever P(benign) exceeds the current threshold
  pred <- ifelse(df$benign > threshold, "benign", "malignant")
  pred_t <- ifelse(pred == df$actual, TRUE, FALSE)
  
  group <- data.frame(df, "pred" = pred_t) %>%
    group_by(actual, pred) %>%
    dplyr::summarise(n = n())
  
  # proportion of benign cases predicted correctly at this threshold
  group_b <- filter(group, actual == "benign")
  
  prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n)
  prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b
  
  # proportion of malignant cases predicted correctly at this threshold
  group_m <- filter(group, actual == "malignant")
  
  prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n)
  prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m
}

prop_table %>%
  gather(x, y, prop_true_b:prop_true_m) %>%
  ggplot(aes(x = threshold, y = y, color = x)) +
    geom_point() +
    geom_line() +
    scale_color_brewer(palette = "Set1") +
    labs(y = "proportion of true predictions",
         color = "b: benign cases\nm: malignant cases")


If you are interested in more machine learning posts, check out the machine_learning category on my blog: https://shirinsplayground.netlify.com/categories/#posts-list-machine-learning and https://shiring.github.io/categories.html#machine_learning-ref

stopCluster(cl)
h2o.shutdown()
sessionInfo()

Author: Dr. Shirin Glander. Original post: https://shirinsplayground.netlify.com/2018/06/intro_to_ml_workshop_heidelberg/
