These are the slides from my workshop "Introduction to Machine Learning with R", which I gave at the University of Heidelberg, Germany, on June 28th, 2018. The entire code for the workshop can be found below the video.
The workshop covered the basics of machine learning. With an example dataset, I went through a standard machine learning workflow in R with the packages caret and h2o:
- reading in data
- exploratory data analysis
- missing values
- feature engineering
- training/test split
- model training with random forests, gradient boosting, neural nets, etc.
- hyperparameter tuning
Setup
All analyses were done in R using RStudio. For detailed session information, including the R version, operating system, and package versions, see the sessionInfo() output at the end of this post.
All plots were generated with ggplot2.
- Packages
library(tidyverse)
library(readr)
library(mice)
Data preparation
The dataset I am using in this example is the Breast Cancer Wisconsin (Diagnostic) Dataset, which can be downloaded from the UC Irvine Machine Learning Repository.
The first dataset looks at the predictor classes:
- malignant or
- benign breast mass.
The features characterize cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:
- Sample ID (code number)
- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape
- Marginal adhesion
- Single epithelial cell size
- Number of bare nuclei
- Bland chromatin
- Number of normal nuclei
- Mitosis
- Classes, i.e. diagnosis
bc_data <- read_delim("datasets/breast-cancer-wisconsin.data.txt",
delim = ",",
col_names = c("sample_code_number",
"clump_thickness",
"uniformity_of_cell_size",
"uniformity_of_cell_shape",
"marginal_adhesion",
"single_epithelial_cell_size",
"bare_nuclei",
"bland_chromatin",
"normal_nucleoli",
"mitosis",
"classes")) %>%
mutate(bare_nuclei = as.numeric(bare_nuclei),
classes = ifelse(classes == "2", "benign",
ifelse(classes == "4", "malignant", NA)))
summary(bc_data)
Missing data
md.pattern(bc_data, plot = FALSE)
bc_data <- bc_data %>%
  drop_na() %>%
  select(classes, everything(), -sample_code_number)

head(bc_data)
Missing values can be imputed with the mice package.
For more information and a tutorial with code, see:
https://shirinsplayground.netlify.com/2018/04/flu_prediction/
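As a minimal sketch (my addition, not part of the workshop code), imputation with mice could look like the following, assuming bc_data still contained the rows with NAs at this point:

# impute the numeric features with predictive mean matching (pmm);
# the response column is excluded from the imputation model
imputed <- mice(select(bc_data, -classes), method = "pmm", m = 5, printFlag = FALSE)

# take the first completed dataset and re-attach the response
# (mice::complete used explicitly to avoid the clash with tidyr::complete)
bc_data_imputed <- mice::complete(imputed, 1) %>%
  mutate(classes = bc_data$classes) %>%
  select(classes, everything())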
Data exploration
- Response variable for classification
ggplot(bc_data, aes(x = classes, fill = classes)) + geom_bar()
For more information on dealing with unbalanced classes, see: https://shiring.github.io/machine_learning/2017/04/02/unbalanced
- Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) + geom_histogram(bins = 10)
- Features
gather(bc_data, x, y, clump_thickness:mitosis) %>%
ggplot(aes(x = y, color = classes, fill = classes)) +
geom_density(alpha = 0.3) +
facet_wrap( ~ x, scales = "free", ncol = 3)
- Correlation graphs
co_mat_benign <- filter(bc_data, classes == "benign") %>%
select(-1) %>%
cor()
co_mat_malignant <- filter(bc_data, classes == "malignant") %>%
select(-1) %>%
cor()
library(igraph)
g_benign <- graph.adjacency(co_mat_benign,
weighted = TRUE,
diag = FALSE,
mode = "upper")
g_malignant <- graph.adjacency(co_mat_malignant,
weighted = TRUE,
diag = FALSE,
mode = "upper")
# http://kateto.net/networks-r-igraph
cut.off_b <- mean(E(g_benign)$weight)
cut.off_m <- mean(E(g_malignant)$weight)
g_benign_2 <- delete_edges(g_benign, E(g_benign)[weight < cut.off_b])
g_malignant_2 <- delete_edges(g_malignant, E(g_malignant)[weight < cut.off_m])
c_g_benign_2 <- cluster_fast_greedy(g_benign_2)
c_g_malignant_2 <- cluster_fast_greedy(g_malignant_2)
par(mfrow = c(1,2))
plot(c_g_benign_2, g_benign_2,
vertex.size = colSums(co_mat_benign) * 10,
vertex.frame.color = NA,
vertex.label.color = "black",
vertex.label.cex = 0.8,
edge.width = E(g_benign_2)$weight * 15,
layout = layout_with_fr(g_benign_2),
main = "Benign tumors")
plot(c_g_malignant_2, g_malignant_2,
vertex.size = colSums(co_mat_malignant) * 10,
vertex.frame.color = NA,
vertex.label.color = "black",
vertex.label.cex = 0.8,
edge.width = E(g_malignant_2)$weight * 15,
layout = layout_with_fr(g_malignant_2),
main = "Malignant tumors")
Principal Component Analysis
library(ellipse)
# perform pca and extract scores
pcaOutput <- prcomp(as.matrix(bc_data[, -1]), scale = TRUE, center = TRUE)
pcaOutput2 <- as.data.frame(pcaOutput$x)
# define groups for plotting
pcaOutput2$groups <- bc_data$classes
centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean)
conf.rgn <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t)
data.frame(groups = as.character(t),
ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]),
centre = as.matrix(centroids[centroids$groups == t, 2:3]),
level = 0.95),
stringsAsFactors = FALSE)))
ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) +
geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) +
geom_point(size = 2, alpha = 0.6) +
labs(color = "",
fill = "")
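To accompany the score plot, it helps to check how much variance the first components capture (a small addition, not in the original code):

# proportion of variance explained by each principal component
summary(pcaOutput)
pcaOutput$sdev^2 / sum(pcaOutput$sdev^2)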
Multidimensional Scaling
select(bc_data, -1) %>%
dist() %>%
cmdscale %>%
as.data.frame() %>%
mutate(group = bc_data$classes) %>%
ggplot(aes(x = V1, y = V2, color = group)) +
geom_point()
t-SNE dimensionality reduction
library(tsne)
select(bc_data, -1) %>%
dist() %>%
tsne() %>%
as.data.frame() %>%
mutate(group = bc_data$classes) %>%
ggplot(aes(x = V1, y = V2, color = group)) +
geom_point()
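tsne() also exposes tuning parameters, most importantly the perplexity; a variant with explicit (purely illustrative) settings could look like this:

select(bc_data, -1) %>%
  dist() %>%
  tsne(perplexity = 30, max_iter = 500) %>%  # illustrative values, not from the slides
  as.data.frame() %>%
  mutate(group = bc_data$classes) %>%
  ggplot(aes(x = V1, y = V2, color = group)) +
  geom_point()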
Machine learning packages for R
# configure multicore
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

library(caret)
Training, validation, and test data
set.seed(42)
index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE)
train_data <- bc_data[index, ]
test_data <- bc_data[-index, ]
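As a quick sanity check (my addition): createDataPartition samples within each class, so the class proportions should be almost identical in both partitions:

prop.table(table(train_data$classes))
prop.table(table(test_data$classes))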
bind_rows(data.frame(group = "train", train_data),
data.frame(group = "test", test_data)) %>%
gather(x, y, clump_thickness:mitosis) %>%
ggplot(aes(x = y, color = group, fill = group)) +
geom_density(alpha = 0.3) +
facet_wrap( ~ x, scales = "free", ncol = 3)
Regression
set.seed(42)
model_glm <- caret::train(clump_thickness ~ .,
data = train_data,
method = "glm",
preProcess = c("scale", "center"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
savePredictions = TRUE,
verboseIter = FALSE))
model_glm
predictions <- predict(model_glm, test_data)
# model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values
data.frame(residuals = resid(model_glm),
predictors = model_glm$finalModel$linear.predictors) %>%
ggplot(aes(x = predictors, y = residuals)) +
geom_jitter() +
geom_smooth(method = "lm")
# y == train_data$clump_thickness
data.frame(residuals = resid(model_glm),
y = model_glm$finalModel$y) %>%
ggplot(aes(x = y, y = residuals)) +
geom_jitter() +
geom_smooth(method = "lm")
data.frame(actual = test_data$clump_thickness,
predicted = predictions) %>%
ggplot(aes(x = actual, y = predicted)) +
geom_jitter() +
geom_smooth(method = "lm")
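To put numbers on the test-set fit, caret's postResample() computes RMSE, R-squared, and MAE from the predictions (a small addition to the original code):

# test-set error metrics for the regression model
postResample(pred = predictions, obs = test_data$clump_thickness)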
Classification
- Decision trees
library(rpart)
library(rpart.plot)
set.seed(42)
fit <- rpart(classes ~ .,
data = train_data,
method = "class",
control = rpart.control(xval = 10,
minbucket = 2,
cp = 0),
parms = list(split = "information"))
rpart.plot(fit, extra = 100)
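Because the tree above is grown without pruning (cp = 0), it is worth checking how it generalizes; a minimal sketch (my addition) of evaluating it on the test set:

# class predictions of the unpruned tree on held-out data
tree_preds <- predict(fit, newdata = test_data, type = "class")

# simple test-set accuracy
mean(tree_preds == test_data$classes)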
- Random forests
Random forests make predictions based on an ensemble of many classification trees. They can be used for classification and regression tasks; here, I show a classification task.
set.seed(42)
model_rf <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
savePredictions = TRUE,
verboseIter = FALSE))
When you specify savePredictions = TRUE, you can access the cross-validation results with model_rf$pred.
model_rf
model_rf$finalModel$confusion
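For example, the held-out predictions can be summarized per resample like this (my addition):

head(model_rf$pred)

# accuracy per cross-validation fold, restricted to the final mtry
model_rf$pred %>%
  filter(mtry == model_rf$bestTune$mtry) %>%
  group_by(Resample) %>%
  summarise(accuracy = mean(pred == obs))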
Dealing with unbalanced data
Luckily, caret makes it very easy to combine over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down-sampling here; the rest stays the same as in the original model.
set.seed(42)
model_rf_down <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
savePredictions = TRUE,
verboseIter = FALSE,
sampling = "down"))
model_rf_down
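The analogous up-sampling model only changes the sampling argument (a sketch of my own; "smote" and "rose" would be further options, each requiring an additional package):

set.seed(42)
model_rf_up <- caret::train(classes ~ .,
                            data = train_data,
                            method = "rf",
                            preProcess = c("scale", "center"),
                            trControl = trainControl(method = "repeatedcv",
                                                     number = 10,
                                                     repeats = 10,
                                                     savePredictions = TRUE,
                                                     verboseIter = FALSE,
                                                     sampling = "up"))  # up-sampling instead of down-sampling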
Feature importance
imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]
importance <- varImp(model_rf, scale = TRUE)
plot(importance)
Predicting test data
confusionMatrix(predict(model_rf, test_data), as.factor(test_data$classes))
results <- data.frame(actual = test_data$classes,
predict(model_rf, test_data, type = "prob"))
results$prediction <- ifelse(results$benign > 0.5, "benign",
ifelse(results$malignant > 0.5, "malignant", NA))
results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)
ggplot(results, aes(x = prediction, fill = correct)) +
geom_bar(position = "dodge")
ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)
Extreme gradient boosting trees
Extreme Gradient Boosting (XGBoost) is a fast and improved implementation of gradient boosting for supervised learning.
"XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance." — Tianqi Chen, developer of xgboost
XGBoost is a tree ensemble model, which means that it sums the predictions of a set of classification and regression trees (CART). In that respect, XGBoost is similar to random forests, but it uses a different approach to model training. It can be used for classification and regression tasks; here, I show a classification task.
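Written as a formula (standard gradient-boosting notation, not taken from the slides), the ensemble prediction for a sample $x_i$ is the sum of the $K$ individual tree predictions:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F},$$

where $\mathcal{F}$ denotes the space of possible classification and regression trees.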
set.seed(42)
model_xgb <- caret::train(classes ~ .,
data = train_data,
method = "xgbTree",
preProcess = c("scale", "center"),
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
savePredictions = TRUE,
verboseIter = FALSE))
model_xgb
Feature importance
importance <- varImp(model_xgb, scale = TRUE)
plot(importance)
Predicting test data
confusionMatrix(predict(model_xgb, test_data), as.factor(test_data$classes))
results <- data.frame(actual = test_data$classes,
predict(model_xgb, test_data, type = "prob"))
results$prediction <- ifelse(results$benign > 0.5, "benign",
ifelse(results$malignant > 0.5, "malignant", NA))
results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE)
ggplot(results, aes(x = prediction, fill = correct)) +
geom_bar(position = "dodge")
ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)
Available models in caret
https://topepo.github.io/caret/available-models.html
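The same list can also be queried from within R (my addition):

# all model codes accepted by caret::train()
head(names(getModelInfo()))

# tuning parameters of a single entry, e.g. the random forest wrapper
getModelInfo("rf", regex = FALSE)$rf$parameters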
Feature selection
Performing feature selection on the whole dataset would lead to prediction bias; we therefore need to run the entire modeling process on the training data alone!
- Correlation
Correlations between all features are calculated and visualized with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.
library(corrplot)

# calculate correlation matrix
corMatMy <- cor(train_data[, -1])
corrplot(corMatMy, order = "hclust")
# apply correlation filter at 0.70
highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
highlyCor
train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]
- Recursive Feature Elimination (RFE)
Another way to choose features is Recursive Feature Elimination. RFE uses a random forest algorithm to test combinations of features and rates each combination with an accuracy score; the combination with the highest score is usually preferred.
set.seed(7)
results_rfe <- rfe(x = train_data[, -1],
y = as.factor(train_data$classes),
sizes = c(1:9),
rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10))
predictors(results_rfe)
train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]
- Genetic Algorithm (GA)
The Genetic Algorithm (GA) was developed based on evolutionary principles of natural selection: it aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual's fitness is calculated based on its genotype. The fittest individuals are then chosen to produce the next generation, whose genotypes result from (re-)combinations of the parental alleles. These new genotypes again determine each individual's fitness. The selection process is iterated over a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.
This concept of optimization can also be applied to non-evolutionary settings, such as feature selection in machine learning.
set.seed(27)
model_ga <- gafs(x = train_data[, -1],
y = as.factor(train_data$classes),
iters = 10, # generations of algorithm
popSize = 10, # population size for each generation
levels = c("malignant", "benign"),
gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF
method = "cv", # 10 fold cross validation
genParallel = TRUE, # Use parallel programming
allowParallel = TRUE))
plot(model_ga)
train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]
Hyperparameter tuning with caret
- Cartesian grid
- mtry: the number of variables randomly sampled as candidates at each split.
set.seed(42)
grid <- expand.grid(mtry = c(1:10))
model_rf_tune_man <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
savePredictions = TRUE,
verboseIter = FALSE),
tuneGrid = grid)
model_rf_tune_man
plot(model_rf_tune_man)
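The winning hyperparameter combination is stored in the bestTune slot (my addition):

# best mtry found by the grid search
model_rf_tune_man$bestTune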
- Random search
set.seed(42)
model_rf_tune_auto <- caret::train(classes ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
savePredictions = TRUE,
verboseIter = FALSE,
search = "random"),
# note: no tuneGrid here – a fixed grid would override the random search
tuneLength = 15)
model_rf_tune_auto
plot(model_rf_tune_auto)
Grid search with h2o
The R package h2o provides a convenient interface to H2O, an open-source machine learning and deep learning platform. H2O offers a wide range of common machine learning algorithms for classification, regression, and deep learning.
library(h2o)
h2o.init(nthreads = -1)

h2o.no_progress()

bc_data_hf <- as.h2o(bc_data)
h2o.describe(bc_data_hf) %>%
gather(x, y, Zeros:Sigma) %>%
mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max",
ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>%
ggplot(aes(x = Label, y = as.numeric(y), color = x)) +
geom_point(size = 4, alpha = 0.6) +
scale_color_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
facet_grid(group ~ ., scales = "free") +
labs(x = "Feature",
y = "Value",
color = "")
library(reshape2) # for melting
bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1])
cor <- h2o.cor(bc_data_hf)
rownames(cor) <- colnames(cor)
melt(cor) %>%
mutate(Var2 = rep(rownames(cor), nrow(cor))) %>%
mutate(Var2 = factor(Var2, levels = colnames(cor))) %>%
mutate(variable = factor(variable, levels = colnames(cor))) %>%
ggplot(aes(x = variable, y = Var2, fill = value)) +
geom_tile(width = 0.9, height = 0.9) +
scale_fill_gradient2(low = "white", high = "red", name = "Cor.") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(x = "",
y = "")
Training, validation, and test data
splits <- h2o.splitFrame(bc_data_hf,
ratios = c(0.7, 0.15),
seed = 1)
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]
response <- "classes"
features <- setdiff(colnames(train), response)
summary(as.factor(train$classes), exact_quantiles = TRUE)
summary(as.factor(valid$classes), exact_quantiles = TRUE)
summary(as.factor(test$classes), exact_quantiles = TRUE)
pca <- h2o.prcomp(training_frame = train,
x = features,
validation_frame = valid,
transform = "NORMALIZE",
impute_missing = TRUE,
k = 3,
seed = 42)
eigenvec <- as.data.frame(pca@model$eigenvectors)
eigenvec$label <- features
library(ggrepel)
ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) +
geom_point(color = "navy", alpha = 0.7) +
geom_text_repel()
Classification
- Random forest
hyper_params <- list(
ntrees = c(25, 50, 75, 100),
max_depth = c(10, 20, 30),
min_rows = c(1, 3, 5)
)
search_criteria <- list(
strategy = "RandomDiscrete",
max_models = 50,
max_runtime_secs = 360,
stopping_rounds = 5,
stopping_metric = "AUC",
stopping_tolerance = 0.0005,
seed = 42
)
f_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest,
# alternatively h2o.gbm
# for Gradient boosting trees
x = features,
y = response,
grid_id = "rf_grid",
training_frame = train,
validation_frame = valid,
nfolds = 25,
fold_assignment = "Stratified",
hyper_params = hyper_params,
search_criteria = search_criteria,
seed = 42
)
# performance metrics where smaller is better -> order with decreasing = FALSE
sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss")
for (sort_by_1 in sort_options_1) {
grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE)
model_ids <- grid@model_ids
best_model <- h2o.getModel(model_ids[[1]])
h2o.saveModel(best_model, path="models", force = TRUE)
}
# performance metrics where bigger is better -> order with decreasing = TRUE
sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity")
for (sort_by_2 in sort_options_2) {
grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE)
model_ids <- grid@model_ids
best_model <- h2o.getModel(model_ids[[1]])
h2o.saveModel(best_model, path = "models", force = TRUE)
}
files <- list.files(path = "models")
rf_models <- files[grep("rf_grid_model", files)]
for (model_id in rf_models) {
path <- paste0(getwd(), "/models/", model_id)
best_model <- h2o.loadModel(path)
mse_auc_test <- data.frame(model_id = model_id,
mse = h2o.mse(h2o.performance(best_model, test)),
auc = h2o.auc(h2o.performance(best_model, test)))
if (model_id == rf_models[[1]]) {
mse_auc_test_comb <- mse_auc_test
} else {
mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test)
}
}
mse_auc_test_comb %>%
gather(x, y, mse:auc) %>%
ggplot(aes(x = model_id, y = y, fill = model_id)) +
facet_grid(x ~ ., scales = "free") +
geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) +
labs(x = "", y = "value", fill = "")
for (model_id in rf_models) {
best_model <- h2o.getModel(model_id)
finalRf_predictions <- data.frame(model_id = rep(best_model@model_id,
nrow(test)),
actual = as.vector(test$classes),
as.data.frame(h2o.predict(object = best_model,
newdata = test)))
finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual ==
finalRf_predictions$predict,
"yes", "no")
finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8,
"benign",
ifelse(finalRf_predictions$malignant
> 0.8, "malignant", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual ==
finalRf_predictions$predict_stringent, "yes",
ifelse(finalRf_predictions$predict_stringent ==
"uncertain", "na", "no"))
if (model_id == rf_models[[1]]) {
finalRf_predictions_comb <- finalRf_predictions
} else {
finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions)
}
}
finalRf_predictions_comb %>%
ggplot(aes(x = actual, fill = accurate)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~ model_id, ncol = 2) +
labs(fill = "Were\npredictions\naccurate?",
title = "Default predictions")
finalRf_predictions_comb %>%
subset(accurate_stringent != "na") %>%
ggplot(aes(x = actual, fill = accurate_stringent)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~ model_id, ncol = 2) +
labs(fill = "Were\npredictions\naccurate?",
title = "Stringent predictions")
rf_model <- h2o.loadModel("models/rf_grid_model_0")
h2o.varimp_plot(rf_model)
h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE)
h2o.confusionMatrix(rf_model, valid = TRUE)
plot(rf_model,
timestep = "number_of_trees",
metric = "classification_error")
plot(rf_model,
timestep = "number_of_trees",
metric = "logloss")
plot(rf_model,
timestep = "number_of_trees",
metric = "AUC")
plot(rf_model,
timestep = "number_of_trees",
metric = "rmse")
h2o.auc(rf_model, train = TRUE)
h2o.auc(rf_model, valid = TRUE)
h2o.auc(rf_model, xval = TRUE)

perf <- h2o.performance(rf_model, test)
perf
plot(perf)
perf@metrics$thresholds_and_metric_scores %>%
ggplot(aes(x = fpr, y = tpr)) +
geom_point() +
geom_line() +
geom_abline(slope = 1, intercept = 0) +
labs(x = "False Positive Rate",
y = "True Positive Rate")
h2o.logloss(perf)
h2o.mse(perf)
h2o.auc(perf)

head(h2o.metric(perf))
finalRf_predictions <- data.frame(actual = as.vector(test$classes),
as.data.frame(h2o.predict(object = rf_model,
newdata = test)))
finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual ==
finalRf_predictions$predict, "yes", "no")
finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign",
ifelse(finalRf_predictions$malignant
> 0.8, "malignant", "uncertain"))
finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual ==
finalRf_predictions$predict_stringent, "yes",
ifelse(finalRf_predictions$predict_stringent ==
"uncertain", "na", "no"))
finalRf_predictions %>%
group_by(actual, predict) %>%
dplyr::summarise(n = n())
finalRf_predictions %>%
ggplot(aes(x = actual, fill = accurate)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(fill = "Were\npredictions\naccurate?",
title = "Default predictions")
finalRf_predictions %>%
subset(accurate_stringent != "na") %>%
ggplot(aes(x = actual, fill = accurate_stringent)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(fill = "Were\npredictions\naccurate?",
title = "Stringent predictions")
df <- finalRf_predictions[, c(1, 3, 4)]
thresholds <- seq(from = 0, to = 1, by = 0.1)
prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA)
for (threshold in thresholds) {
pred <- ifelse(df$benign > threshold, "benign", "malignant")
pred_t <- ifelse(pred == df$actual, TRUE, FALSE)
group <- data.frame(df, "pred" = pred_t) %>%
group_by(actual, pred) %>%
dplyr::summarise(n = n())
group_b <- filter(group, actual == "benign")
prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n)
prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b
group_m <- filter(group, actual == "malignant")
prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n)
prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m
}
prop_table %>%
gather(x, y, prop_true_b:prop_true_m) %>%
ggplot(aes(x = threshold, y = y, color = x)) +
geom_point() +
geom_line() +
scale_color_brewer(palette = "Set1") +
labs(y = "proportion of true predictions",
color = "b: benign cases\nm: malignant cases")
If you are interested in more machine learning posts, check out the machine_learning category listing on my blog: https://shirinsplayground.netlify.com/categories/#posts-list-machine-learning - https://shiring.github.io/categories.html#machine_learning-ref
stopCluster(cl)

h2o.shutdown()
sessionInfo()
Author: Dr. Shirin Glander. Original post: https://shirinsplayground.netlify.com/2018/06/intro_to_ml_workshop_heidelberg/