内容简介:这是我在德国海德堡大学于2018年6月28日所做的关于研讨会介绍了机器学习的基本知识。通过一个示例数据集,我在R中使用caret和h2o包完成了一个标准的机器学习工作流:所有的分析都是在R中使用RStudio进行的。有关包括R版本、操作系统和包版本在内的详细会话信息,请参阅本文末尾的sessionInfo()输出。
这是我在德国海德堡大学于2018年6月28日所做的关于 R的机器学习介绍 的研讨会的幻灯片。整个研讨会的代码可以在视频下面找到。
研讨会介绍了机器学习的基本知识。通过一个示例数据集,我在R中使用caret和h2o包完成了一个标准的机器学习工作流:
- 读取数据
- 探索性数据分析
- 缺失值
- 特征工程
- 训练和测试划分
- 使用随机森林,梯度提升,ANN等训练模型
- 超参数调优
设置
所有的分析都是在R中使用RStudio进行的。有关包括R版本、操作系统和包版本在内的详细会话信息,请参阅本文末尾的sessionInfo()输出。
所有的图片都是ggplot2生成。
- 包
library(tidyverse) library(readr) library(mice)
数据准备
这个分析例子我使用的数据集是 Breast Cancer Wisconsin (Diagnostic) Dataset 。这个数据集可以在 UC Irvine Machine Learning Repository 下载。
第一个数据集查看预测类:
- 恶性的
- 良性乳腺质量。
这些特征描述了细胞核的特性,是通过对乳腺肿块 细针吸出物(FNA) 的图像分析得出的:
- 样本ID(代码号)
- 丛厚度
- 细胞大小均匀性
- 细胞形状均匀性
- 边际附着力
- 单个上皮细胞大小
- 裸核数
- 乏味的染色质
- 正常核数
- 有丝分裂
- 类,即诊断
bc_data <- read_delim("datasets/breast-cancer-wisconsin.data.txt", delim = ",", col_names = c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")) %>% mutate(bare_nuclei = as.numeric(bare_nuclei), classes = ifelse(classes == "2", "benign", ifelse(classes == "4", "malignant", NA)))
summary(bc_data)
数据缺失
md.pattern(bc_data, plot = FALSE)
bc_data <- bc_data %>% drop_na() %>% select(classes, everything(), -sample_code_number) head(bc_data)
缺失值可以用mice package来插补。
更多信息和教程与代码:
https://shirinsplayground.netlify.com/2018/04/flu_prediction/
数据探索
- 分类响应变量
ggplot(bc_data, aes(x = classes, fill = classes)) + geom_bar()
关于处理不平衡类更多信息: https://shiring.github.io/machine_learning/2017/04/02/unbalanced
- 回归响应变量
ggplot(bc_data, aes(x = clump_thickness)) + geom_histogram(bins = 10)
- 特征集
gather(bc_data, x, y, clump_thickness:mitosis) %>% ggplot(aes(x = y, color = classes, fill = classes)) + geom_density(alpha = 0.3) + facet_wrap( ~ x, scales = "free", ncol = 3)
- 相关图
co_mat_benign <- filter(bc_data, classes == "benign") %>% select(-1) %>% cor() co_mat_malignant <- filter(bc_data, classes == "malignant") %>% select(-1) %>% cor() library(igraph) g_benign <- graph.adjacency(co_mat_benign, weighted = TRUE, diag = FALSE, mode = "upper") g_malignant <- graph.adjacency(co_mat_malignant, weighted = TRUE, diag = FALSE, mode = "upper") # http://kateto.net/networks-r-igraph cut.off_b <- mean(E(g_benign)$weight) cut.off_m <- mean(E(g_malignant)$weight) g_benign_2 <- delete_edges(g_benign, E(g_benign)[weight < cut.off_b]) g_malignant_2 <- delete_edges(g_malignant, E(g_malignant)[weight < cut.off_m]) c_g_benign_2 <- cluster_fast_greedy(g_benign_2) c_g_malignant_2 <- cluster_fast_greedy(g_malignant_2) par(mfrow = c(1,2)) plot(c_g_benign_2, g_benign_2, vertex.size = colSums(co_mat_benign) * 10, vertex.frame.color = NA, vertex.label.color = "black", vertex.label.cex = 0.8, edge.width = E(g_benign_2)$weight * 15, layout = layout_with_fr(g_benign_2), main = "Benign tumors") plot(c_g_malignant_2, g_malignant_2, vertex.size = colSums(co_mat_malignant) * 10, vertex.frame.color = NA, vertex.label.color = "black", vertex.label.cex = 0.8, edge.width = E(g_malignant_2)$weight * 15, layout = layout_with_fr(g_malignant_2), main = "Malignant tumors")
主成分分析
library(ellipse) # perform pca and extract scores pcaOutput <- prcomp(as.matrix(bc_data[, -1]), scale = TRUE, center = TRUE) pcaOutput2 <- as.data.frame(pcaOutput$x) # define groups for plotting pcaOutput2$groups <- bc_data$classes centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean) conf.rgn <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t) data.frame(groups = as.character(t), ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]), centre = as.matrix(centroids[centroids$groups == t, 2:3]), level = 0.95), stringsAsFactors = FALSE))) ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) + geom_point(size = 2, alpha = 0.6) + labs(color = "", fill = "")
多维缩放
select(bc_data, -1) %>% dist() %>% cmdscale %>% as.data.frame() %>% mutate(group = bc_data$classes) %>% ggplot(aes(x = V1, y = V2, color = group)) + geom_point()
t-SNE降维
library(tsne) select(bc_data, -1) %>% dist() %>% tsne() %>% as.data.frame() %>% mutate(group = bc_data$classes) %>% ggplot(aes(x = V1, y = V2, color = group)) + geom_point()
R的机器学习包
# configure multicore library(doParallel) cl <- makeCluster(detectCores()) registerDoParallel(cl) library(caret)
训练,验证和测试数据集
set.seed(42) index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE) train_data <- bc_data[index, ] test_data <- bc_data[-index, ] bind_rows(data.frame(group = "train", train_data), data.frame(group = "test", test_data)) %>% gather(x, y, clump_thickness:mitosis) %>% ggplot(aes(x = y, color = group, fill = group)) + geom_density(alpha = 0.3) + facet_wrap( ~ x, scales = "free", ncol = 3)
回归
set.seed(42) model_glm <- caret::train(clump_thickness ~ ., data = train_data, method = "glm", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE))
model_glm
predictions <- predict(model_glm, test_data)
# model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values data.frame(residuals = resid(model_glm), predictors = model_glm$finalModel$linear.predictors) %>% ggplot(aes(x = predictors, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")
# y == train_data$clump_thickness data.frame(residuals = resid(model_glm), y = model_glm$finalModel$y) %>% ggplot(aes(x = y, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")
data.frame(actual = test_data$clump_thickness, predicted = predictions) %>% ggplot(aes(x = actual, y = predicted)) + geom_jitter() + geom_smooth(method = "lm")
分类
- 决策树
library(rpart) library(rpart.plot) set.seed(42) fit <- rpart(classes ~ ., data = train_data, method = "class", control = rpart.control(xval = 10, minbucket = 2, cp = 0), parms = list(split = "information")) rpart.plot(fit, extra = 100)
- 随机森林
随机森林 的预测是基于多分类树的生成。它们可以用于分类和回归任务。这里,我展示了一个分类任务。
set.seed(42) model_rf <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3, savePredictions = TRUE, verboseIter = FALSE))
当您指定saveforecasts = TRUE时,您可以使用model_rf$pred访问交叉验证结果。
model_rf
model_rf$finalModel$confusion
处理不平衡数据集
幸运的是,caret可以很容易地将过和欠抽样技术与交叉验证重采样结合在一起。我们可以简单地将抽样选项添加到我们的trainControl系统中,并选择欠抽样。剩下的和原来的模型一样。
set.seed(42) model_rf_down <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, sampling = "down"))
model_rf_down
特征的重要性
imp <- model_rf$finalModel$importance imp[order(imp, decreasing = TRUE), ]
importance <- varImp(model_rf, scale = TRUE) plot(importance)
预测测试数据集
confusionMatrix(predict(model_rf, test_data), as.factor(test_data$classes))
results <- data.frame(actual = test_data$classes, predict(model_rf, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")
ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)
极梯度提升树
极端梯度提升(XGBoost) 是一种用于监督学习的 梯度提升 的快速改进实现。
XGBoost使用了一种更正则化的模型正规化来控制过度拟合,从而使其具有更好的性能。xgboost的开发人员陈天琦说
XGBoost是一个树集成模型,它表示一组分类树和回归树(CART)的预测总和。在这一点上,XGBoost类似于Random Forests,但它使用了一种不同的模型训练方法。可用于分类和回归任务。这里,我展示了一个分类任务。
set.seed(42) model_xgb <- caret::train(classes ~ ., data = train_data, method = "xgbTree", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3, savePredictions = TRUE, verboseIter = FALSE))
model_xgb
特征的重要性
importance <- varImp(model_xgb, scale = TRUE) plot(importance)
预测测试数据集
confusionMatrix(predict(model_xgb, test_data), as.factor(test_data$classes))
results <- data.frame(actual = test_data$classes, predict(model_xgb, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")
ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)
caret可用模型
https://topepo.github.io/caret/available-models.html
特征选择
对整个数据集进行特征选择会导致预测偏差,因此我们需要单独对训练数据进行整个建模过程!
- 相关性
所有特征之间的相关性通过corrplot包计算和可视化。然后,我将删除所有相关性大于0.7的特性,使该特征与较低的平均值保持一致。
library(corrplot) # calculate correlation matrix corMatMy <- cor(train_data[, -1]) corrplot(corMatMy, order = "hclust")
#Apply correlation filter at 0.70, highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
highlyCor
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]
- 递归特征消除(RFE)
选择特征的另一种方法是使用递归特征消除。RFE使用一种随机森林算法来测试特征的组合,并对每个特征进行准确率评分。得分最高的组合通常是优选的。
set.seed(7) results_rfe <- rfe(x = train_data[, -1], y = as.factor(train_data$classes), sizes = c(1:9), rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))
predictors(results_rfe)
train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]
- 遗传算法(GA)
遗传算法(GA)是基于自然选择的进化原理开发出来的:它的目标是通过建模选择一组特定的基因型来优化个体群。在每一代(即迭代)中,每个个体的适应度都是根据他们的基因型来计算的。然后,优胜劣汰的个体被选出来生产下一代。这一代的个体将拥有由亲本等位基因重组而成的基因型。这些新的基因型将再次决定每个个体的适应性。这个选择过程在特定的世代中迭代,(理想地)导致在基因库中固定最适合的等位基因。
这种优化概念也可以应用于非进化模型,比如机器学习中的特征选择过程。
set.seed(27) model_ga <- gafs(x = train_data[, -1], y = as.factor(train_data$classes), iters = 10, # generations of algorithm popSize = 10, # population size for each generation levels = c("malignant", "benign"), gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF method = "cv", # 10 fold cross validation genParallel = TRUE, # Use parallel programming allowParallel = TRUE))
plot(model_ga)
train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]
用caret做超参数调优
- 笛卡尔网格
- mtry:在每一次分割中随机抽取的候选变量的数量。
set.seed(42) grid <- expand.grid(mtry = c(1:10)) model_rf_tune_man <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE), tuneGrid = grid)
model_rf_tune_man
plot(model_rf_tune_man)
- 随机选择
set.seed(42) model_rf_tune_auto <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, search = "random"), tuneGrid = grid, tuneLength = 15)
model_rf_tune_auto
plot(model_rf_tune_auto)
使用h2o做网格搜索
R package h2o为 h2o 提供了一个方便的接口,h2o是一个开源的机器学习和深度学习平台。H2O为分类、回归和深度学习提供了广泛的通用机器学习算法。
library(h2o) h2o.init(nthreads = -1)
h2o.no_progress() bc_data_hf <- as.h2o(bc_data)
h2o.describe(bc_data_hf) %>% gather(x, y, Zeros:Sigma) %>% mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% ggplot(aes(x = Label, y = as.numeric(y), color = x)) + geom_point(size = 4, alpha = 0.6) + scale_color_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + facet_grid(group ~ ., scales = "free") + labs(x = "Feature", y = "Value", color = "")
library(reshape2) # for melting bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1]) cor <- h2o.cor(bc_data_hf) rownames(cor) <- colnames(cor) melt(cor) %>% mutate(Var2 = rep(rownames(cor), nrow(cor))) %>% mutate(Var2 = factor(Var2, levels = colnames(cor))) %>% mutate(variable = factor(variable, levels = colnames(cor))) %>% ggplot(aes(x = variable, y = Var2, fill = value)) + geom_tile(width = 0.9, height = 0.9) + scale_fill_gradient2(low = "white", high = "red", name = "Cor.") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + labs(x = "", y = "")
训练,验证和测试数据集
splits <- h2o.splitFrame(bc_data_hf, ratios = c(0.7, 0.15), seed = 1) train <- splits[[1]] valid <- splits[[2]] test <- splits[[3]] response <- "classes" features <- setdiff(colnames(train), response)
summary(as.factor(train$classes), exact_quantiles = TRUE) summary(as.factor(valid$classes), exact_quantiles = TRUE) summary(as.factor(test$classes), exact_quantiles = TRUE)
pca <- h2o.prcomp(training_frame = train, x = features, validation_frame = valid, transform = "NORMALIZE", impute_missing = TRUE, k = 3, seed = 42) eigenvec <- as.data.frame(pca@model$eigenvectors) eigenvec$label <- features library(ggrepel) ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) + geom_point(color = "navy", alpha = 0.7) + geom_text_repel()
分类
- 随机森林
hyper_params <- list( ntrees = c(25, 50, 75, 100), max_depth = c(10, 20, 30), min_rows = c(1, 3, 5) ) search_criteria <- list( strategy = "RandomDiscrete", max_models = 50, max_runtime_secs = 360, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.0005, seed = 42 )
f_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, # alternatively h2o.gbm # for Gradient boosting trees x = features, y = response, grid_id = "rf_grid", training_frame = train, validation_frame = valid, nfolds = 25, fold_assignment = "Stratified", hyper_params = hyper_params, search_criteria = search_criteria, seed = 42 )
# performance metrics where smaller is better -> order with decreasing = FALSE sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss") for (sort_by_1 in sort_options_1) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path="models", force = TRUE) } # performance metrics where bigger is better -> order with decreasing = TRUE sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity") for (sort_by_2 in sort_options_2) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path = "models", force = TRUE) }
files <- list.files(path = "models")
rf_models <- files[grep("rf_grid_model", files)] for (model_id in rf_models) { path <- paste0(getwd(), "/models/", model_id) best_model <- h2o.loadModel(path) mse_auc_test <- data.frame(model_id = model_id, mse = h2o.mse(h2o.performance(best_model, test)), auc = h2o.auc(h2o.performance(best_model, test))) if (model_id == rf_models[[1]]) { mse_auc_test_comb <- mse_auc_test } else { mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test) } }
mse_auc_test_comb %>% gather(x, y, mse:auc) %>% ggplot(aes(x = model_id, y = y, fill = model_id)) + facet_grid(x ~ ., scales = "free") + geom_bar(stat = "identity", alpha = 0.8, position = "dodge") + scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) + labs(x = "", y = "value", fill = "")
for (model_id in rf_models) { best_model <- h2o.getModel(model_id) finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)), actual = as.vector(test$classes), as.data.frame(h2o.predict(object = best_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) if (model_id == rf_models[[1]]) { finalRf_predictions_comb <- finalRf_predictions } else { finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions) } }
finalRf_predictions_comb %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 2) + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")
finalRf_predictions_comb %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 2) + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")
rf_model <- h2o.loadModel("models/rf_grid_model_0")
h2o.varimp_plot(rf_model)
h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE)
h2o.confusionMatrix(rf_model, valid = TRUE)
plot(rf_model, timestep = "number_of_trees", metric = "classification_error")
plot(rf_model, timestep = "number_of_trees", metric = "logloss")
plot(rf_model, timestep = "number_of_trees", metric = "AUC")
plot(rf_model, timestep = "number_of_trees", metric = "rmse")
h2o.auc(rf_model, train = TRUE) h2o.auc(rf_model, valid = TRUE) h2o.auc(rf_model, xval = TRUE) perf <- h2o.performance(rf_model, test) perf
plot(perf)
perf@metrics$thresholds_and_metric_scores %>% ggplot(aes(x = fpr, y = tpr)) + geom_point() + geom_line() + geom_abline(slope = 1, intercept = 0) + labs(x = "False Positive Rate", y = "True Positive Rate")
h2o.logloss(perf) h2o.mse(perf) h2o.auc(perf) head(h2o.metric(perf))
finalRf_predictions <- data.frame(actual = as.vector(test$classes), as.data.frame(h2o.predict(object = rf_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) finalRf_predictions %>% group_by(actual, predict) %>% dplyr::summarise(n = n())
finalRf_predictions %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")
finalRf_predictions %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")
df <- finalRf_predictions[, c(1, 3, 4)] thresholds <- seq(from = 0, to = 1, by = 0.1) prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA) for (threshold in thresholds) { pred <- ifelse(df$benign > threshold, "benign", "malignant") pred_t <- ifelse(pred == df$actual, TRUE, FALSE) group <- data.frame(df, "pred" = pred_t) %>% group_by(actual, pred) %>% dplyr::summarise(n = n()) group_b <- filter(group, actual == "benign") prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n) prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b group_m <- filter(group, actual == "malignant") prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n) prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m } prop_table %>% gather(x, y, prop_true_b:prop_true_m) %>% ggplot(aes(x = threshold, y = y, color = x)) + geom_point() + geom_line() + scale_color_brewer(palette = "Set1") + labs(y = "proportion of true predictions", color = "b: benign cases\nm: malignant cases")
如果你对更多的机器学习帖子感兴趣,可以在我的博客上查看一下machine_learning的分类列表 https://shirinsplayground.netlify.com/categories/#posts-list-machine-learning - https://shiring.github.io/categories.html#machine_learning-ref
stopCluster(cl) h2o.shutdown()
sessionInfo()
作者:Dr. Shirin Glander 原文链接: https://shirinsplayground.netlify.com/2018/06/intro_to_ml_workshop_heidelberg/
您有什么想法,请留言。
版权声明: 作者保留权利。文章为作者独立观点,不代表数据人网立场。严禁修改,转载请注明原文链接:http://shujuren.org/article/764.html
数据人网: 数据人学习,交流和分享的平台,诚邀您创造和分享数据知识,共建和共享数据智库。
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:- 从脑电波到机器人运动——深度学习:介绍
- GitHub 行业内关于智能客服、聊天机器人的应用和架构、算法分享和介绍
- ASP.NET Core模块化前后端分离快速开发框架介绍之3、数据访问模块介绍
- 简编漫画介绍WebAssembly
- CGroup 介绍
- CGroup 介绍
本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。