Quick posting showing use of H2O.
This is a quick post to show use of H2O as an automl tool. As stated in the docs (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html), H2O is an open source, distributed, fast and scalable machine learning platform written in Java. It supports many different supervised and unsupervised algorithms. I am interested in how optimum the generated models are. Here I will just demonstrate the use of the platform, compare to a quick GLM, and will end by showing LIME for explanation of model outcomes.
For the demo dataset, I am using the Palmer Penguins dataset. This dataset has 344 measurements of penguin physical characteristics along with labels for sex, species, location and year of the measurement. For this demonstration, I will see what H2O can accomplish in predicting sex from the other attributes. First, some EDA just to get a sense of what the dataset looks like:
# love the skimr package, quick look shows some missing values and distributions
skim(palmerpenguins::penguins)
Name | palmerpenguins::penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
# get dataset reformatted and filtered, dont want to deal with missing data
penguins_new <- palmerpenguins::penguins %>%
select(-year) %>%
filter(complete.cases(.)) %>%
mutate_if(is.factor, as_factor)
penguins_new %>%
ggplot(aes(x=species, y=(bill_length_mm))) +
geom_violin() +
geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) +
facet_grid(~island) +
labs(title="bill length by species, island, sex")
penguins_new %>%
ggplot(aes(x=species, y=(bill_depth_mm))) +
geom_violin() +
geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) +
facet_grid(~island) +
labs(title="bill depth by species, island, sex")
penguins_new %>%
ggplot(aes(x=species, y=(flipper_length_mm))) +
geom_violin() +
geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) +
facet_grid(~island) +
labs(title="flipper length by species, island, sex")
penguins_new %>%
ggplot(aes(x=species, y=(body_mass_g))) +
geom_violin() +
geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) +
facet_grid(~island) +
labs(title="body by species, island, sex")
This is where an SME would add the secret sauce. I am definitely NOT a penguin biologist, but I am going to pretend I know bill volume is important and can be computed by assuming the bill is a pyramid with volume calculated as \[volume = \frac{1}{3} \ast length \ast depth^2\] I am also going to center and scale the numeric predictors. Bill_volume might actually do some good in separating sexes, so let’s keep it. This will make it’s way into the recipe in the next section.
penguins_new <- penguins_new %>%
mutate(bill_volume = 1/3 * bill_length_mm * bill_depth_mm^2)
penguins_new %>%
ggplot(aes(x=species, y=(bill_volume))) +
geom_violin() +
geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) +
facet_grid(~island) +
labs(title="bill volume by species, island, sex")
Here we will do the normal 10-fold cross validation and have our holdout set for publication metrics. To get this, we will use rsample::initial_split
to create the holdout set of 10% of the data. Additionally, I am creating a recipe for the data processing to assist in later predictions using new data.
# Put most of the data into the training set, want the other set more like a true
# test set as H2O is doing cross validation so the validation data set is pulled
# from the training dataset
set.seed(34547)
penguins_split <- initial_split(
palmerpenguins::penguins %>% filter(complete.cases(.)),
prop = 0.90)
# now do the split
train_data <- training(penguins_split)
test_data <- testing(penguins_split)
# just want to center and scale, leave factors to h2o.
# feature engineering step should be in recipe so that new data is captured
# so adding the bill_volume calc here
pengiun_recipe <- recipe(sex~., data=train_data) %>%
step_select(-year) %>%
step_factor2string() %>%
step_string2factor() %>%
step_mutate(bill_volume = 1/3 * bill_length_mm * bill_depth_mm^2) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
# printing the recipe gives a nice summary
pengiun_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 7
Operations:
Variables selected -year
Character variables from <none>
Factor variables from <none>
Variable mutation for 1 / 3 * bill_length_mm * bill_de...
Centering for all_numeric_predictors()
Scaling for all_numeric_predictors()
With our data prepped, we are now ready to proceed to the modeling stage. In this case, we are creating a model to predict sex and have a labeled dataset for training. In machine learning speak, this is a supervised learning classification problem. H2O supports many different algorithms for performing this task. We are going to run H2O’s AutoML to get an idea of how it does in an automated algorithm search and follow that up by running the h2o.glm
method to see what we can do manually.
H2O’s automl method is an attempt at a completely hands off machine learning excersize. H2O will create models using tree based, deep learning and regression methods. It will run until a user specified time limit or the number of models built reaches the user specified limit. By default, it will use 5-fold cross-validation. I am going to specify a 20 min time limit, change the cross-validation strategy to 10-fold, and set the seed for reproducibility.
train_h2o <- as.h2o(train_data_baked)
test_h2o <- as.h2o(test_data_baked)
y <- "sex"
x <- setdiff(names(train_h2o), y)
automl_models_h2o <- h2o.automl(
x = x,
y = y,
training_frame = train_h2o,
max_runtime_secs = 1200,
nfolds = 10,
seed = 38981
)
From the automl training, we can look at the leaderboard and get a quick sense of how it did by looking at the confusion matrix for our holdout set. In the below, we see the StackedEnsemble and DeepLearning methods performed best using AUC as the metric. I want to focus on the DeepLearning model so will extract the top DeepLearning model.
automl_models_h2o@leaderboard
model_id auc
1 StackedEnsemble_BestOfFamily_7_AutoML_1_20221110_205708 0.9723863
2 StackedEnsemble_BestOfFamily_6_AutoML_1_20221110_205708 0.9721625
3 DeepLearning_grid_1_AutoML_1_20221110_205708_model_11 0.9715807
4 DeepLearning_grid_1_AutoML_1_20221110_205708_model_4 0.9715360
5 StackedEnsemble_BestOfFamily_4_AutoML_1_20221110_205708 0.9709094
6 DeepLearning_grid_1_AutoML_1_20221110_205708_model_12 0.9708199
logloss aucpr mean_per_class_error rmse mse
1 0.2083160 0.9655582 0.07109291 0.2427961 0.05894995
2 0.2132914 0.9653595 0.06429019 0.2455755 0.06030734
3 0.2500111 0.9702776 0.07075725 0.2529788 0.06399825
4 0.2410589 0.9713846 0.08458647 0.2603849 0.06780029
5 0.2176247 0.9617852 0.07086914 0.2462841 0.06065588
6 0.2179888 0.9703479 0.08096133 0.2485852 0.06179461
[166 rows x 7 columns]
# grab first model hash
best_model <- automl_models_h2o@leaderboard %>%
as_tibble() %>%
filter(str_detect(model_id,"DeepLearning")) %>%
slice(1)
best_model
# A tibble: 1 × 7
model_id auc logloss aucpr mean_…¹ rmse mse
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 DeepLearning_grid_1_AutoML… 0.972 0.250 0.970 0.0708 0.253 0.0640
# … with abbreviated variable name ¹mean_per_class_error
# save the results in its own object
best_model_results <- h2o.getModel(paste0(best_model[1]))
Suppose, we are interested specifically in a GLM, H2O allows one to create only the GLM models using the glm method. This allows us to fine tune the model. For instance, in the below I specify lambda = 0
to turn off regularization and again use 10-fold cross validation.
pengiun_glm <- h2o.glm(family = "binomial",
x = x,
y = y,
training_frame = train_h2o,
nfolds = 10,
lambda = 0.0,
seed = 58931)
Exploring H2O model results is facilitated through a bunch of helper functions. Here I will show functions for looking at performance, prediction and explainability.
Model performance is a highly technical discussion I am avoiding here. What I instead want to highlight is how to get to the performance metrics H2O gives access to. First, for classification problems like our example, the confusion matrix is a good first indicator of performance.
AutoML best model:
h2o.confusionMatrix(best_model_results)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.862739508209633:
female male Error Rate
female 146 1 0.006803 =1/147
male 1 151 0.006579 =1/152
Totals 147 152 0.006689 =2/299
GLM model:
h2o.confusionMatrix(pengiun_glm)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.463943952801159:
female male Error Rate
female 136 11 0.074830 =11/147
male 7 145 0.046053 =7/152
Totals 143 156 0.060201 =18/299
Additionally, given a trained model object, depending on the metric desired, one would use the metric method or retrieve the value from a derived performance object. For classification, generally you would operate on the performance object. In this example, incorrect predictions in either direction are equally undesirable such that the metrics we may be concerned with are F1, accuracy or AUC.
# make performance objects
best_model_performance <- h2o.performance(best_model_results)
glm_performance <- h2o.performance(pengiun_glm)
# F1, just retrieving max irrespective of threshold
best_model_F1 <- max(h2o.F1(best_model_performance)$f1)
glm_F1 <- max(h2o.F1(glm_performance)$f1)
# accuracy, just retrieving max irrespective of threshold
best_model_accuracy <- max(h2o.accuracy(best_model_performance)$accuracy)
glm_accuracy <- max(h2o.accuracy(glm_performance)$accuracy)
# AUC
best_model_auc <- h2o.auc(best_model_performance)
glm_auc <- h2o.auc(glm_performance)
results <- as.data.frame(cbind(F1 = c(best_model_F1, glm_F1),
Accuracy = c(best_model_accuracy, glm_accuracy),
AUC = c(best_model_auc, glm_auc)))
rownames(results) <- c("autoML", "GLM")
knitr::kable(round(results, 3), caption = "Classification metrics")
F1 | Accuracy | AUC | |
---|---|---|---|
autoML | 0.993 | 0.993 | 0.999 |
GLM | 0.942 | 0.940 | 0.977 |
The goal of modeling is to create a model that can predict on new data. H2O offers a method to give predictions on new data. Let’s make the predictions and plot the confusion matrix on this. Could have done it using H2O’s confusionMatrix
function, but in preperation for LIME, let’s grab the predictions first for autoML and then the GLM:.
# make some quick predictions to see how we did
automl_best_model_predictions <-
h2o.predict(best_model_results, newdata = test_h2o) %>%
as_tibble() %>%
bind_cols(
test_data_baked %>% select(sex)
)
glm_predictions <-
h2o.predict(pengiun_glm, newdata = test_h2o) %>%
as_tibble() %>%
bind_cols(
test_data_baked %>% select(sex)
)
both_predictions <-
bind_cols(automl_best_model_predictions[,1:3],
glm_predictions)
names(both_predictions) <- c(paste0("automl_",names(automl_best_model_predictions[,1:3])),
paste0("glm_",names(glm_predictions[1:3])),"sex")
both_predictions <- both_predictions %>%
select(sex, contains("predict"), everything())
both_predictions %>%
yardstick::conf_mat(sex, automl_predict) %>%
autoplot(type="heatmap") + labs(title="autoML confusion")
From this, it looks like the GLM performs better on new data. This suggests there may be some over fitting. We can look at the learning curve for the top model. When we do this, it does indeed appear as though the model was overfit as evident by the increasing logloss for the CV training datasets starting around epoc 1000.
h2o.learning_curve_plot(best_model_results)
In many cases, explaining a model is important, for instance, in mechanistic studies or in model improvement efforts. H2O offers a single function for a global look at model factor importance, explain
. Limiting to the classification example, the output includes a confusion matrix, variable importance plots and partial dependence plots as show below. Each of these and more can be generated through individual calls. For the autoML leader:
h2o.explain(best_model_results, test_h2o)
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
DeepLearning_grid_1_AutoML_1_20221110_205708_model_11
-----------------------------------------------------
| | female | male | Error | Rate
|:---:|:---:|:---:|:---:|:---:|
| **female** |15 | 3 | 0.166666666666667 | =3/18 |
| **male** |0 | 16 | 0 | =0/16 |
| **Totals** |15 | 19 | 0.0882352941176471 | =3/34 |
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
From this, it appears species and bill length are the top variables leading to assignment of sex. Note that sex=male
is a positive outcome, so response is 1 in the continuous plots.
Explanations for the glm model:
h2o.explain(pengiun_glm, test_h2o)
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
GLM_model_R_1668131819232_38796
-------------------------------
| | female | male | Error | Rate
|:---:|:---:|:---:|:---:|:---:|
| **female** |18 | 0 | 0 | =0/18 |
| **male** |1 | 15 | 0.0625 | =1/16 |
| **Totals** |19 | 15 | 0.0294117647058824 | =1/34 |
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
The partial dependence plots for the continuous features look a lot more certain as the data approaches the extremes, ie large bill_volume
values suggest with high probability a male penguin.
The interesting features are often the ones where the predictions are incorrect. To look at feature importance on individual predictions, H2O does offer methods for tree based algorithms, but not currently on regression or deep learning models. For this, I am turning to LIME for explanation of individual observations. To use LIME, you first build an explainer using the model and training data, then create an explanation of the data of interest. The explaination uses permutations of the observations variable values to get an idea of the local influence of the variables.
First, let’s remember what our prediction set is along with a filter for incorrect predictions.
both_predictions
# A tibble: 34 × 7
sex automl_predict glm_predict automl…¹ automl…² glm_f…³ glm_m…⁴
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 female female female 1.00e+ 0 4.96e- 6 6.04e-1 0.396
2 male male male 9.78e-65 1 e+ 0 6.59e-3 0.993
3 male female female 1.00e+ 0 1.27e- 4 7.74e-1 0.226
4 male male male 8.56e- 8 1.00e+ 0 3.81e-2 0.962
5 female female female 1 e+ 0 2.50e-28 9.94e-1 0.00637
6 male male male 9.16e-11 1.00e+ 0 8.86e-4 0.999
7 female female female 1.00e+ 0 4.41e- 6 8.69e-1 0.131
8 male male male 4.75e-11 1.00e+ 0 1.39e-1 0.861
9 female female female 1 e+ 0 5.89e-24 9.57e-1 0.0434
10 female female female 1.00e+ 0 6.61e- 7 8.93e-1 0.107
# … with 24 more rows, and abbreviated variable names ¹automl_female,
# ²automl_male, ³glm_female, ⁴glm_male
both_predictions %>%
mutate_if(is.factor, as.character) %>%
filter(!(sex == automl_predict) | !(sex == glm_predict))
# A tibble: 4 × 7
sex automl_predict glm_predict automl_f…¹ autom…² glm_f…³ glm_m…⁴
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 male female female 1.00 1.27e-4 0.774 0.226
2 female male female 0.000153 1.00e+0 0.557 0.443
3 female male female 0.0117 9.88e-1 0.623 0.377
4 female male female 0.00211 9.98e-1 0.952 0.0481
# … with abbreviated variable names ¹automl_female, ²automl_male,
# ³glm_female, ⁴glm_male
Now on to the autoML DeepLearning leader, let’s look at data point 3 which is incorrectly predicted by both the DL and GLM models.
test_data %>%
slice(3) %>%
mutate(bill_volume = 1/3 * bill_length_mm * bill_depth_mm^2) %>%
glimpse()
Rows: 1
Columns: 9
$ species <fct> Adelie
$ island <fct> Biscoe
$ bill_length_mm <dbl> 37.7
$ bill_depth_mm <dbl> 18.7
$ flipper_length_mm <int> 180
$ body_mass_g <int> 3600
$ sex <fct> male
$ year <int> 2007
$ bill_volume <dbl> 4394.438
Looking back at the plots, this is a smaller bird, suggesting perhaps why the algorithm would have difficulties.
## ok, now for the fun, lets look at what the importance of each feature is
## build the explainer
explainer_penguins_autoML <- train_data_baked %>%
select(-sex) %>%
lime::lime(
model = best_model_results,
bin_continous = TRUE,
n_bins = 4,
quantile_bins = TRUE
)
#summary(explainer_penguins)
# explain a data point, should look at one it failed on
explanation_autoML <- test_data_baked %>%
slice(3) %>%
select(-sex) %>%
lime::explain(
explainer = explainer_penguins_autoML,
n_labels = 1,
#labels = "female",
n_features = 4,
n_permutations = 5000,
kernel_width = 0.75
)
# plot the importance in explaining the decision
lime::plot_features(explanation_autoML)
Looking at the variable importance plot shows bill length and body mass contributed to the assignment as female while species contradicted the assignment. Looking at the heatmap shows body mass and species=Adelie contributed the most to assignment of sex in the first 20 test cases.
On to the GLM model:
## ok, now for the fun, lets look at what the importance of each feature is
## build the explainer
explainer_penguins_glm <- train_data_baked %>%
select(-sex) %>%
lime::lime(
model = pengiun_glm,
bin_continous = TRUE,
n_bins = 4,
quantile_bins = TRUE
)
#summary(explainer_penguins)
# explain a data point
explanation_glm <- test_data_baked %>% slice(3) %>% select(-sex) %>%
lime::explain(
explainer = explainer_penguins_glm,
n_labels = 1,
#labels = "female",
n_features = 4,
n_permutations = 5000,
kernel_width = 0.75
)
# plot the importance in explaining the decision
lime::plot_features(explanation_glm)
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rsettlage/rsettlage.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Settlage (2022, Nov. 4). Robert E. Settlage: Fun with H2O and LIME. Retrieved from https://rsettlage.github.io/deeplearning/2022-11-04-fun-with-h2o/
BibTeX citation
@misc{settlage2022fun, author = {Settlage, Robert}, title = {Robert E. Settlage: Fun with H2O and LIME}, url = {https://rsettlage.github.io/deeplearning/2022-11-04-fun-with-h2o/}, year = {2022} }