library(tidyverse) # data wrangling and plotting (loads dplyr, ggplot2, tidyr, ...)
library(data.table) # fast CSV import via fread()
library(dplyr) # already attached via tidyverse; kept for explicitness
library(readxl) # read the Information and Codebook spreadsheets
library(caret) # general ML utilities
library(magrittr) # extra pipe operators
ptm<-proc.time()
f<-"/home/bizon/Documents/nuMoM2b_Dataset_NICHD_Data_Challenge.csv"
mu_data_df <- fread(f)
Warning in fread(f) :
Detected 11717 column names but the data has 11633 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
proc.time()-ptm
user system elapsed
5.312 0.246 2.064
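The fill warning can be silenced by passing fill = TRUE explicitly, as the message itself suggests; a minimal sketch:
mu_data_df <- fread(f, fill = TRUE) # explicit fill, per the fread warning above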
I want to be able to quickly look up and select the correct predictor and response variables directly inside this RStudio environment. Eventually, this could be extended to allow variable selection to pass them on to the “predictors” and “response” variables used in the various machine learning (ML) models below.
For now, we simply select the clinically/medically meaningful predictor/response variables manually, check them for artifacts, clean up those artifacts as appropriate, and proceed to ML modeling to test our hypotheses.
ptm<-proc.time()
f1_info<-"/home/bizon/Documents/mu2b/nuMoM2b_Dataset_Information.xlsx"
f2_code<-"/home/bizon/Documents/mu2b/nuMoM2b_Codebook_NICHD_Data_Challenge.xlsx"
mu_info_df <- read_excel(f1_info)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ...
mu_code_df <- read_excel(f2_code)
proc.time()-ptm
user system elapsed
0.092 0.013 0.101
Now that we have loaded our Information and Codebook spreadsheets, we can use the DT library to view and search them easily.
Note: it is best to open the output tables in a new window.
library(DT)
datatable(mu_code_df, filter = 'top', options = list(pageLength = 10, autoWidth = TRUE))
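Alongside the interactive table, the codebook can also be searched programmatically. A minimal sketch (search_codebook is a hypothetical helper name) that greps every column for a keyword:
# hypothetical helper: return codebook rows where any column matches a keyword
search_codebook <- function(keyword, codebook = mu_code_df) {
  codebook %>%
    filter(if_any(everything(), ~ grepl(keyword, .x, ignore.case = TRUE)))
}
# example: search_codebook("depression")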
I need to remove NA and nonsense values from CMAE04a4c.
Note: this step can be repeated as needed in the preprocessing pipeline for the desired response variable.
# select "CMAE04a4c" | if NA, remove that row
# remove all rows where the column CMAJ01 has NA
clean_CMAE04a4c_df<-mu_data_df[!is.na(mu_data_df$CMAE04a4c), ]
# The above won't work on an h2o object, but works on a regular dataframe, so reading as df first and then, once preprocessing is done, moving on to h2o
glimpse(clean_CMAE04a4c_df$CMAE04a4c)
int [1:8792] 0 0 0 0 0 0 0 0 0 0 ...
count(clean_CMAE04a4c_df, CMAE04a4c)
nrow(clean_CMAE04a4c_df[mu_data_df$CMAE04a4c]) # caution: this indexes rows by the 0/1 values of the column rather than filtering; sum(clean_CMAE04a4c_df$CMAE04a4c == 1) would count positive cases directly
[1] 507
sum(is.na(clean_CMAE04a4c_df$CMAE04a4c))
[1] 0
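Since this NA-removal step recurs for each candidate response variable, it can be wrapped in a small helper; a sketch (drop_na_rows is a hypothetical name):
# hypothetical helper: keep only rows where the chosen response variable is not NA
drop_na_rows <- function(df, var) {
  df[!is.na(df[[var]]), ]
}
# clean_CMAE04a4c_df <- drop_na_rows(mu_data_df, "CMAE04a4c")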
Only ten cases have postpartum PTSD (CMAE04a4c = 1); that is not enough data for the ML approach at this time.
Exploring CMAE04a1c (postpartum depression).
# select "CMAE04a1c" | if NA, remove that row
# remove all rows where the column CMAJ01 has NA
clean_CMAE04a1c_df<-mu_data_df[!is.na(mu_data_df$CMAE04a1c), ]
# The above won't work on an h2o object, but works on a regular dataframe, so reading as df first and then, once preprocessing is done, moving on to h2o
glimpse(clean_CMAE04a1c_df$CMAE04a1c)
int [1:8792] 0 0 0 0 0 0 0 0 0 0 ...
count(clean_CMAE04a1c_df, CMAE04a1c)
nrow(clean_CMAE04a1c_df[mu_data_df$CMAE04a1c]) # caution: indexes rows by the column's 0/1 values rather than filtering (see note above)
[1] 835
sum(is.na(clean_CMAE04a1c_df$CMAE04a1c))
[1] 0
There are n = 338 positive outcomes, so we will pursue this outcome below with the same ML approaches used earlier for the re-hospitalization outcome.
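A quick prevalence check makes the class imbalance explicit before modeling; a sketch:
# positive-class prevalence: 338/8792, i.e. roughly 4% positives
mean(clean_CMAE04a1c_df$CMAE04a1c == 1, na.rm = TRUE)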
library(h2o)
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 hours 19 minutes
H2O cluster timezone: America/Vancouver
H2O data parsing timezone: UTC
H2O cluster version: 3.34.0.3
H2O cluster version age: 5 days
H2O cluster name: H2O_started_from_R_bizon_rik192
H2O cluster total nodes: 1
H2O cluster total memory: 11.02 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.1.1 (2021-08-10)
ptm<-proc.time()
mu_data <- as.h2o(clean_CMAE04a1c_df)
proc.time()-ptm
user system elapsed
5.863 0.478 12.571
mu_data[,"CMAE04a1c"]<-as.factor(mu_data[,"CMAE04a1c"])
# predictors "participant", "demographics" dataset (column A) includes the relevant features
# concludes the variables from "demographics" dataset; model as is and test later by adding participant features from the "CMA" set
predictors<-c("CRace",
"Race",
"eRace",
"eHispanic",
"BMI",
"BMI_Cat",
"Education",
"GravCat",
"SmokeCat1",
"SmokeCat2",
"SmokeCat3",
"Ins_Govt",
"Ins_Mil",
"Ins_Comm",
"Ins_Pers",
"Ins_Othr",
"PctFedPoverty",
"poverty",
"V1AD02g", #adding psych predictors here
"CMAE04a1b",
"CMAE04a4a",
"CMAE04a4b"
) # character vector of predictor variables
response<-"CMAE04a1c"
First, let's do a "quick & dirty" run (no hold-out dataset) to get some sense of the data from an ML standpoint.
test2_mother_mu_data_RFmodel<-h2o.randomForest(x=predictors,y=response,training_frame = mu_data, nfolds=10,seed = 1234)
test2_mother_mu_data_RFmodel
Model Details:
==============
H2OBinomialModel: drf
Model ID: DRF_model_R_1634062886437_2
Model Summary:
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.01885464
RMSE: 0.1373122
LogLoss: 0.117594
Mean Per-Class Error: 0.1108243
AUC: 0.9346045
AUCPR: 0.6157573
Gini: 0.869209
R^2: 0.4899476
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.01800805
RMSE: 0.1341941
LogLoss: 0.1028235
Mean Per-Class Error: 0.09555821
AUC: 0.9277543
AUCPR: 0.6235033
Gini: 0.8555087
R^2: 0.5128496
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
AUC: 0.9346045
AUCPR: 0.6157573
That was a quick test. Now I create a hold-out dataset first, then repeat the above with a train:validation split.
mother2_mu_data_split <- h2o.splitFrame(mu_data, ratios=0.8, seed = 1)
train=mother2_mu_data_split[[1]]
valid=mother2_mu_data_split[[2]]
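Because positives are rare, it is worth checking that the 80:20 split left cases of both classes in each frame; a sketch:
h2o.table(train[,response]) # class counts in the training frame
h2o.table(valid[,response]) # class counts in the hold-out frame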
ptm<-proc.time()
mother2_mu_data_RFmodel<-h2o.randomForest(x=predictors,y=response,training_frame = train, nfolds=10,seed = 1234)
proc.time()-ptm
user system elapsed
0.560 0.021 6.879
This yields AUC: 0.9288383 and AUCPR: 0.6338817 on cross-validation.
ptm<-proc.time()
mother2_mu_data_predict<-h2o.predict(object=mother2_mu_data_RFmodel, newdata=valid)
mother2_mu_data_RFmodel
Model Details:
==============
H2OBinomialModel: drf
Model ID: DRF_model_R_1634062886437_658
Model Summary:
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.0181672
RMSE: 0.1347857
LogLoss: 0.119153
Mean Per-Class Error: 0.117894
AUC: 0.9235196
AUCPR: 0.6503907
Gini: 0.8470391
R^2: 0.5034799
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.01771996
RMSE: 0.1331163
LogLoss: 0.09748087
Mean Per-Class Error: 0.0900565
AUC: 0.9288383
AUCPR: 0.6338817
Gini: 0.8576766
R^2: 0.5157031
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
proc.time()-ptm
user system elapsed
0.080 0.005 0.111
mod=mother2_mu_data_RFmodel
perf <- h2o.performance(mod,valid)
metrics <- as.data.frame(h2o.metric(perf))
metrics
# precision-recall curve on the hold-out frame
metrics %>%
ggplot(aes(recall,precision)) +
geom_line() +
theme_minimal()
# accuracy as a function of precision across thresholds
metrics %>%
ggplot(aes(precision, accuracy)) +
geom_line() +
theme_minimal()
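The same performance object also yields scalar headline metrics on the hold-out frame; a sketch:
h2o.auc(perf) # AUROC on the hold-out frame
h2o.aucpr(perf) # area under the precision-recall curve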
ptm<-proc.time()
# toggle the progress bar if desired:
# h2o.no_progress() / h2o.show_progress()
exp <-h2o.explain(object=mother2_mu_data_RFmodel, newdata=valid)
# collect per-fold cross-validation metrics from an H2O model into a data frame
results_df <- function(h2o_model) {
h2o_model@model$cross_validation_metrics_summary %>%
as.data.frame() %>%
select(-mean, -sd) %>%
t() %>%
as.data.frame() %>%
mutate_all(as.character) %>%
mutate_all(as.numeric) -> k
k %>%
select(Accuracy = accuracy,
AUC = auc,
Precision = precision,
Specificity = specificity,
Recall = recall,
Logloss = logloss) %>%
return()
}
# Using function
results_df(mod) -> outcome
# Outcome
outcome %>%
gather(Metrics, Values) %>%
ggplot(aes(Metrics, Values, fill = Metrics, color = Metrics)) +
geom_boxplot(alpha = 0.3, show.legend = FALSE) +
facet_wrap(~ Metrics, scales = "free") +
labs(title = "Performance of our ML model using H2o package ",
caption = "Data Source: NICHD Decoding Maternal Morbidity Data Challenge\nCreated by Martin Frasch (further credit to https://bit.ly/3BpPqcb)") +
theme_minimal()
# Statistics summary (across() replaces the deprecated summarise_each()/funs())
outcome %>%
gather(Metrics, Values) %>%
group_by(Metrics) %>%
summarise(across(Values, list(mean = mean, median = median, min = min, max = max, sd = sd),
.names = "{.fn}"),
n = n()) %>%
mutate_if(is.numeric, function(x) {round(100*x, 2)}) %>%
knitr::kable(col.names = c("Criterion", "Mean", "Median", "Min", "Max", "SD", "N"))
Criterion | Mean | Median | Min | Max | SD | N |
---|---|---|---|---|---|---|
Accuracy | 97.93 | 97.77 | 97.28 | 98.70 | 0.45 | 1000 |
AUC | 93.41 | 93.81 | 87.33 | 99.12 | 3.72 | 1000 |
Logloss | 9.74 | 10.01 | 4.79 | 15.68 | 3.59 | 1000 |
Precision | 68.19 | 64.58 | 59.26 | 80.77 | 8.26 | 1000 |
Recall | 83.46 | 81.63 | 71.43 | 100.00 | 8.95 | 1000 |
Specificity | 98.48 | 98.51 | 98.12 | 99.33 | 0.37 | 1000 |
maternal_dt_model<-h2o.gbm(x=predictors,y=response,training_frame = train, validation_frame = valid, balance_classes = TRUE, seed = 1234, nfolds=10)
# GBM hyperparameters
gbm_params = list(max_depth = seq(2, 10))
# Train and validate a cartesian grid of GBMs
gbm_grid = h2o.grid("gbm", x = predictors, y = response,
grid_id = "gbm_grid_1tree8",
training_frame = train,
validation_frame = valid,
balance_classes = TRUE,
ntrees = 1, min_rows = 1, sample_rate = 1, col_sample_rate = 1,
learn_rate = .01, seed = 1234,
hyper_params = gbm_params)
gbm_gridperf = h2o.getGrid(grid_id = "gbm_grid_1tree8",
sort_by = "auc",
decreasing = TRUE)
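The top model from the sorted grid can be retrieved by ID for further inspection; a sketch (best_gbm is a hypothetical name):
best_gbm <- h2o.getModel(gbm_gridperf@model_ids[[1]]) # highest-AUC model in the grid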
# what is the performance of this GBM?
maternal_dt_model
Model Details:
==============
H2OBinomialModel: gbm
Model ID: GBM_model_R_1634062886437_1556
Model Summary:
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 0.1187193
RMSE: 0.3445567
LogLoss: 0.4121877
Mean Per-Class Error: 0.02596636
AUC: 0.9941941
AUCPR: 0.9926069
Gini: 0.9883883
R^2: 0.5251227
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: gbm
** Reported on validation data. **
MSE: 0.02092163
RMSE: 0.1446431
LogLoss: 0.07845379
Mean Per-Class Error: 0.1021054
AUC: 0.9468121
AUCPR: 0.5202862
Gini: 0.8936243
R^2: 0.4563593
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: gbm
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.01704429
RMSE: 0.1305538
LogLoss: 0.06699951
Mean Per-Class Error: 0.08252005
AUC: 0.9473067
AUCPR: 0.6510553
Gini: 0.8946135
R^2: 0.5341696
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
We obtain AUC: 0.9473067, AUCPR: 0.6510553
gbm_gridperf
H2O Grid Details
================
Grid ID: gbm_grid_1tree8
Used hyper parameters:
- max_depth
Number of models: 9
Number of failed models: 0
Hyper-Parameter Search Summary: ordered by decreasing auc
The inflection point is at max_depth = 5.
maternal_1_tree = h2o.gbm(x = predictors, y = response,
training_frame = train, balance_classes = TRUE,
ntrees = 1, min_rows = 1, sample_rate = 1, col_sample_rate = 1,
max_depth = 5,
# use early stopping once the validation AUC doesn't improve by at least 0.01%
# for 5 consecutive scoring events
stopping_rounds = 3, stopping_tolerance = 0.01,
stopping_metric = "AUC",
seed = 1)
Warning in .h2o.processResponseWarnings(res) :
early stopping is enabled but neither score_tree_interval or score_each_iteration are defined. Early stopping will not be reproducible!.
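As the warning notes, fixing the scoring cadence makes early stopping reproducible; a sketch of the relevant argument:
# adding score_tree_interval to the h2o.gbm() call above, e.g.
# score_tree_interval = 1, # one scoring event per tree => deterministic early stopping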
maternal_1_tree
Model Details:
==============
H2OBinomialModel: gbm
Model ID: GBM_model_R_1634062886437_2476
Model Summary:
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 0.4571694
RMSE: 0.676143
LogLoss: 1.575699
Mean Per-Class Error: 0.06445889
AUC: 0.9717425
AUCPR: 0.9703432
Gini: 0.943485
R^2: -0.8286777
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
AUCPR: 0.882762
maternal_Tree = h2o.getModelTree(model = maternal_1_tree, tree_number = 1)
# Visualizing H2O Trees
library(data.tree)
createDataTree <- function(h2oTree) {
h2oTreeRoot = h2oTree@root_node
dataTree = Node$new(h2oTreeRoot@split_feature)
dataTree$type = 'split'
addChildren(dataTree, h2oTreeRoot)
return(dataTree)
}
addChildren <- function(dtree, node) {
if(class(node)[1] != 'H2OSplitNode') return(TRUE)
feature = node@split_feature
id = node@id
na_direction = node@na_direction
if(is.na(node@threshold)) {
leftEdgeLabel = printValues(node@left_levels, na_direction=='LEFT', 4)
rightEdgeLabel = printValues(node@right_levels, na_direction=='RIGHT', 4)
}else {
leftEdgeLabel = paste("<", node@threshold, ifelse(na_direction=='LEFT',',NA',''))
rightEdgeLabel = paste(">=", node@threshold, ifelse(na_direction=='RIGHT',',NA',''))
}
left_node = node@left_child
right_node = node@right_child
if(class(left_node)[[1]] == 'H2OLeafNode')
leftLabel = paste("prediction:", left_node@prediction)
else
leftLabel = left_node@split_feature
if(class(right_node)[[1]] == 'H2OLeafNode')
rightLabel = paste("prediction:", right_node@prediction)
else
rightLabel = right_node@split_feature
if(leftLabel == rightLabel) {
leftLabel = paste(leftLabel, "(L)")
rightLabel = paste(rightLabel, "(R)")
}
dtreeLeft = dtree$AddChild(leftLabel)
dtreeLeft$edgeLabel = leftEdgeLabel
dtreeLeft$type = ifelse(class(left_node)[1] == 'H2OSplitNode', 'split', 'leaf')
dtreeRight = dtree$AddChild(rightLabel)
dtreeRight$edgeLabel = rightEdgeLabel
dtreeRight$type = ifelse(class(right_node)[1] == 'H2OSplitNode', 'split', 'leaf')
addChildren(dtreeLeft, left_node)
addChildren(dtreeRight, right_node)
return(FALSE)
}
printValues <- function(values, is_na_direction, n=4) {
l = length(values)
if(l == 0)
value_string = ifelse(is_na_direction, "NA", "")
else
value_string = paste0(paste0(values[1:min(n,l)], collapse = ', '),
ifelse(l > n, ",...", ""),
ifelse(is_na_direction, ", NA", ""))
return(value_string)
}
This decision tree, also supplied as a PDF, is meant to help build intuition about how the model works.
library(DiagrammeR)
# customized DT for our H2O model
maternal_mu2DataTree = createDataTree(maternal_Tree)
GetEdgeLabel <- function(node) {return (node$edgeLabel)}
GetNodeShape <- function(node) {switch(node$type,
split = "diamond", leaf = "oval")}
GetFontName <- function(node) {switch(node$type,
split = 'Palatino-bold',
leaf = 'Palatino')}
SetEdgeStyle(maternal_mu2DataTree, fontname = 'Palatino-italic',
label = GetEdgeLabel, labelfloat = TRUE,
fontsize = "26", fontcolor='royalblue4')
SetNodeStyle(maternal_mu2DataTree, fontname = GetFontName, shape = GetNodeShape,
fontsize = "26", fontcolor='royalblue4',
height="0.75", width="1")
SetGraphStyle(maternal_mu2DataTree, rankdir = "LR", dpi = 70)
plot(maternal_mu2DataTree, output = "graph")
ptm<-proc.time()
exp_dt<-h2o.explain(maternal_dt_model,valid)
proc.time()-ptm
user system elapsed
51.442 1.332 68.781
exp_dt
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
GBM_model_R_1634062886437_1556
------------------------------
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.216446697145969:
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
============
> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
# Build and train the model:
mo2b_nb <- h2o.naiveBayes(x = predictors,
y = response,
training_frame = train,
laplace = 0,
nfolds = 10,
seed = 1234)
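laplace = 0 disables smoothing; with a rare positive class, a small Laplace correction is sometimes preferred to avoid zero conditional probabilities. A sketch, not a tuned choice (mo2b_nb_smooth is a hypothetical name):
mo2b_nb_smooth <- h2o.naiveBayes(x = predictors, y = response,
                                 training_frame = train,
                                 laplace = 1, # add-one smoothing
                                 nfolds = 10, seed = 1234)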
# Eval performance:
perf <- h2o.performance(mo2b_nb)
# Generate the predictions on a test set (if necessary):
pred <- h2o.predict(mo2b_nb, newdata = valid)
perf
H2OBinomialMetrics: naivebayes
** Reported on training data. **
MSE: 0.04521123
RMSE: 0.2126293
LogLoss: 0.6202355
Mean Per-Class Error: 0.08343609
AUC: 0.9259393
AUCPR: 0.5121705
Gini: 0.8518787
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
The NB model achieves AUC: 0.9259393 and AUCPR: 0.5121705.
# best viewed in a new window; see also the PDF included with the submission
exp_nb <- h2o.explain(mo2b_nb,valid)
Warning: StackedEnsemble does not have a variable importance. Picking all columns. Set `columns` to a vector of columns to explain just a subset of columns.
Note the highly variable partial dependence across the different socio-demographic characteristics.
exp_nb
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
NaiveBayes_model_R_1634062886437_2480
-------------------------------------
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.999997946867559:
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Compare these results with those above. The findings are to be interpreted with caution at this stage. Once we obtain a larger external dataset for validation, with a more balanced case distribution, this will become more useful and allow building an inference engine that could be deployed for use. I am therefore presenting this code as a reference for future work.
Nevertheless, it is evident that optimization even at this stage yields a classification performance of AUROC = 0.9937349. This result can vary from run to run.
Please note: this code runs for about 70 minutes on a well-equipped deep learning workstation.
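If the runtime is a concern, h2o.automl also accepts a wall-clock budget instead of (or alongside) the model cap; a sketch with an illustrative 10-minute budget (maternal_aml_fast is a hypothetical name):
maternal_aml_fast <- h2o.automl(x = predictors, y = response, training_frame = train,
                                max_runtime_secs = 600, seed = 1)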
ptm<-proc.time()
maternal_aml <- h2o.automl(x=predictors,y=response,training_frame = train, max_models = 20, seed = 1)
maternal_lb <- maternal_aml@leaderboard
#print(maternal_lb, n = nrow(maternal_lb)) #Print all rows instead of default 6 rows
proc.time()-ptm
user system elapsed
79.778 1.204 1085.929
ptm<-proc.time()
maternal_perf_valid <- h2o.performance(maternal_aml@leader,newdata=valid,xval=FALSE,valid=TRUE)
pred <- h2o.predict(maternal_aml@leader, valid)
h2o.auc(maternal_aml@leader)
[1] 0.9937349
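Note that h2o.auc() on a model defaults to its training metrics; the hold-out AUROC can be read from the performance object computed on valid above; a sketch:
h2o.auc(maternal_perf_valid) # AUROC on the hold-out frame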
We observe no specificity because the dataset is unbalanced, such that by the luck of the draw (when the dataset is split 80:20) we get no true positives.
ptm<-proc.time()
exp <-h2o.explain(maternal_aml@leader, valid)
Warning: StackedEnsemble does not have a variable importance. Picking all columns. Set `columns` to a vector of columns to explain just a subset of columns.
proc.time()-ptm
user system elapsed
5.311 0.429 54.001
print(exp)
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
StackedEnsemble_BestOfFamily_2_AutoML_1_20211012_125843
-------------------------------------------------------
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.411276621237627:
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] DiagrammeR_1.0.6.1 data.tree_1.0.0 ggcorrplot_0.1.3 h2o_3.34.0.3
[5] DT_0.19 magrittr_2.0.1 caret_6.0-90 lattice_0.20-45
[9] readxl_1.3.1 data.table_1.14.2 forcats_0.5.1 stringr_1.4.0
[13] dplyr_1.0.7 purrr_0.3.4 readr_2.0.2 tidyr_1.1.4
[17] tibble_3.1.5 ggplot2_3.3.5 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] nlme_3.1-153 bitops_1.0-7 fs_1.5.0 bit64_4.0.5
[5] lubridate_1.8.0 RColorBrewer_1.1-2 httr_1.4.2 bslib_0.3.1
[9] tools_4.1.1 backports_1.2.1 utf8_1.2.2 R6_2.5.1
[13] rpart_4.1-15 DBI_1.1.1 colorspace_2.0-2 nnet_7.3-16
[17] withr_2.4.2 tidyselect_1.1.1 bit_4.0.4 compiler_4.1.1
[21] cli_3.0.1 rvest_1.0.1 xml2_1.3.2 labeling_0.4.2
[25] sass_0.4.0 scales_1.1.1 digest_0.6.28 rmarkdown_2.11
[29] pkgconfig_2.0.3 htmltools_0.5.2 parallelly_1.28.1 highr_0.9
[33] dbplyr_2.1.1 fastmap_1.1.0 htmlwidgets_1.5.4 rlang_0.4.11
[37] rstudioapi_0.13 visNetwork_2.1.0 farver_2.1.0 jquerylib_0.1.4
[41] generics_0.1.0 jsonlite_1.7.2 ModelMetrics_1.2.2.2 RCurl_1.98-1.5
[45] Matrix_1.3-4 Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0
[49] lifecycle_1.0.1 yaml_2.2.1 stringi_1.7.5 pROC_1.18.0
[53] MASS_7.3-54 plyr_1.8.6 recipes_0.1.17 grid_4.1.1
[57] parallel_4.1.1 listenv_0.8.0 crayon_1.4.1 haven_2.4.3
[61] splines_4.1.1 hms_1.1.1 knitr_1.36 pillar_1.6.3
[65] future.apply_1.8.1 reshape2_1.4.4 codetools_0.2-18 stats4_4.1.1
[69] reprex_2.0.1 glue_1.4.2 evaluate_0.14 renv_0.14.0
[73] modelr_0.1.8 vctrs_0.3.8 tzdb_0.1.2 foreach_1.5.1
[77] cellranger_1.1.0 gtable_0.3.0 future_1.22.1 assertthat_0.2.1
[81] xfun_0.26 gower_0.2.2 prodlim_2019.11.13 broom_0.7.9
[85] class_7.3-19 survival_3.2-13 timeDate_3043.102 iterators_1.0.13
[89] lava_1.6.10 globals_0.14.0 ellipsis_0.3.2 ipred_0.9-12