Robert E. Settlage: Fun with H2O and LIME

Robert Settlage

This is a quick post to show use of H2O as an automl tool. As stated in the docs (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html), H2O is an open source, distributed, fast and scalable machine learning platform written in Java. It supports many different supervised and unsupervised algorithms. I am interested in how optimum the generated models are. Here I will just demonstrate the use of the platform, compare to a quick GLM, and will end by showing LIME for explanation of model outcomes.

For the demo dataset, I am using the Palmer Penguins dataset. This dataset has 344 measurements of penguin physical characteristics along with labels for sex, species, location and year of the measurement. For this demonstration, I will see what H2O can accomplish in predicting sex from the other attributes. First, some EDA just to get a sense of what the dataset looks like:

EDA

# love the skimr package, quick look shows some missing values and distributions
skim(palmerpenguins::penguins)

Table 1: Data summary
Name	palmerpenguins::penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

# get dataset reformatted and filtered, dont want to deal with missing data
penguins_new <- palmerpenguins::penguins %>% 
  select(-year) %>%
  filter(complete.cases(.)) %>%
  mutate_if(is.factor, as_factor)


penguins_new %>%
  ggplot(aes(x=species, y=(bill_length_mm))) +
  geom_violin() +
  geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) + 
  facet_grid(~island) +
  labs(title="bill length by species, island, sex")

penguins_new %>%
  ggplot(aes(x=species, y=(bill_depth_mm))) +
  geom_violin() +
  geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) + 
  facet_grid(~island) +
  labs(title="bill depth by species, island, sex")

penguins_new %>%
  ggplot(aes(x=species, y=(flipper_length_mm))) +
  geom_violin() +
  geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) + 
  facet_grid(~island) +
  labs(title="flipper length by species, island, sex")

penguins_new %>%
  ggplot(aes(x=species, y=(body_mass_g))) +
  geom_violin() +
  geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) + 
  facet_grid(~island) +
  labs(title="body by species, island, sex")

pairs(x=penguins_new %>% select_if(is.numeric), 
      pch=20)

Feature Engineering

This is where an SME would add the secret sauce. I am definitely NOT a penguin biologist, but I am going to pretend I know bill volume is important and can be computed by assuming the bill is a pyramid with volume calculated as \[volume = \frac{1}{3} \ast length \ast depth^2\] I am also going to center and scale the numeric predictors. Bill_volume might actually do some good in separating sexes, so let’s keep it. This will make it’s way into the recipe in the next section.

penguins_new <- penguins_new %>%
  mutate(bill_volume = 1/3 * bill_length_mm * bill_depth_mm^2)

penguins_new %>%
  ggplot(aes(x=species, y=(bill_volume))) +
  geom_violin() +
  geom_jitter(aes(color=sex), alpha=0.4, width = 0.1) + 
  facet_grid(~island) +
  labs(title="bill volume by species, island, sex")

Modeling with H2O

Here we will do the normal 10-fold cross validation and have our holdout set for publication metrics. To get this, we will use rsample::initial_split to create the holdout set of 10% of the data. Additionally, I am creating a recipe for the data processing to assist in later predictions using new data.

# Put most of the data into the training set, want the other set more like a true
# test set as H2O is doing cross validation so the validation data set is pulled 
# from the training dataset

set.seed(34547)
penguins_split <- initial_split(
  palmerpenguins::penguins %>% filter(complete.cases(.)), 
  prop = 0.90)
# now do the split
train_data <- training(penguins_split)
test_data  <- testing(penguins_split)

# just want to center and scale, leave factors to h2o.
# feature engineering step should be in recipe so that new data is captured
# so adding the bill_volume calc here
pengiun_recipe <- recipe(sex~., data=train_data) %>%
  step_select(-year) %>%
  step_factor2string() %>%
  step_string2factor() %>%
  step_mutate(bill_volume = 1/3 * bill_length_mm * bill_depth_mm^2) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# printing the recipe gives a nice summary 
pengiun_recipe

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          7

Operations:

Variables selected -year
Character variables from <none>
Factor variables from <none>
Variable mutation for 1 / 3 * bill_length_mm * bill_de...
Centering for all_numeric_predictors()
Scaling for all_numeric_predictors()

# don't forget to bake the data

train_data_baked <- pengiun_recipe %>% 
    prep() %>% 
    bake(train_data) 

test_data_baked <- pengiun_recipe %>% 
    prep() %>% 
    bake(test_data)

With our data prepped, we are now ready to proceed to the modeling stage. In this case, we are creating a model to predict sex and have a labeled dataset for training. In machine learning speak, this is a supervised learning classification problem. H2O supports many different algorithms for performing this task. We are going to run H2O’s AutoML to get an idea of how it does in an automated algorithm search and follow that up by running the h2o.glm method to see what we can do manually.

Automl

H2O’s automl method is an attempt at a completely hands off machine learning excersize. H2O will create models using tree based, deep learning and regression methods. It will run until a user specified time limit or the number of models built reaches the user specified limit. By default, it will use 5-fold cross-validation. I am going to specify a 20 min time limit, change the cross-validation strategy to 10-fold, and set the seed for reproducibility.

train_h2o <- as.h2o(train_data_baked)
test_h2o  <- as.h2o(test_data_baked)

y <- "sex"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame   = train_h2o,
    max_runtime_secs = 1200, 
    nfolds           = 10,
    seed = 38981
)

From the automl training, we can look at the leaderboard and get a quick sense of how it did by looking at the confusion matrix for our holdout set. In the below, we see the StackedEnsemble and DeepLearning methods performed best using AUC as the metric. I want to focus on the DeepLearning model so will extract the top DeepLearning model.

automl_models_h2o@leaderboard

                                                 model_id       auc
1 StackedEnsemble_BestOfFamily_7_AutoML_1_20221110_205708 0.9723863
2 StackedEnsemble_BestOfFamily_6_AutoML_1_20221110_205708 0.9721625
3   DeepLearning_grid_1_AutoML_1_20221110_205708_model_11 0.9715807
4    DeepLearning_grid_1_AutoML_1_20221110_205708_model_4 0.9715360
5 StackedEnsemble_BestOfFamily_4_AutoML_1_20221110_205708 0.9709094
6   DeepLearning_grid_1_AutoML_1_20221110_205708_model_12 0.9708199
    logloss     aucpr mean_per_class_error      rmse        mse
1 0.2083160 0.9655582           0.07109291 0.2427961 0.05894995
2 0.2132914 0.9653595           0.06429019 0.2455755 0.06030734
3 0.2500111 0.9702776           0.07075725 0.2529788 0.06399825
4 0.2410589 0.9713846           0.08458647 0.2603849 0.06780029
5 0.2176247 0.9617852           0.07086914 0.2462841 0.06065588
6 0.2179888 0.9703479           0.08096133 0.2485852 0.06179461

[166 rows x 7 columns]

# grab first model hash
best_model <- automl_models_h2o@leaderboard %>% 
  as_tibble() %>%
  filter(str_detect(model_id,"DeepLearning")) %>%
  slice(1)
best_model

# A tibble: 1 × 7
  model_id                      auc logloss aucpr mean_…¹  rmse    mse
  <chr>                       <dbl>   <dbl> <dbl>   <dbl> <dbl>  <dbl>
1 DeepLearning_grid_1_AutoML… 0.972   0.250 0.970  0.0708 0.253 0.0640
# … with abbreviated variable name ¹mean_per_class_error

# save the results in its own object
best_model_results <- h2o.getModel(paste0(best_model[1]))

Manually specifying a model method

Suppose, we are interested specifically in a GLM, H2O allows one to create only the GLM models using the glm method. This allows us to fine tune the model. For instance, in the below I specify lambda = 0 to turn off regularization and again use 10-fold cross validation.

pengiun_glm <- h2o.glm(family = "binomial",
                       x = x,
                       y = y,
                       training_frame = train_h2o,
                       nfolds = 10,
                       lambda = 0.0,
                       seed = 58931)

Explore results

Exploring H2O model results is facilitated through a bunch of helper functions. Here I will show functions for looking at performance, prediction and explainability.

Model performance

Model performance is a highly technical discussion I am avoiding here. What I instead want to highlight is how to get to the performance metrics H2O gives access to. First, for classification problems like our example, the confusion matrix is a good first indicator of performance.

AutoML best model:

h2o.confusionMatrix(best_model_results)

Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.862739508209633:
       female male    Error    Rate
female    146    1 0.006803  =1/147
male        1  151 0.006579  =1/152
Totals    147  152 0.006689  =2/299

GLM model:

h2o.confusionMatrix(pengiun_glm)

Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.463943952801159:
       female male    Error     Rate
female    136   11 0.074830  =11/147
male        7  145 0.046053   =7/152
Totals    143  156 0.060201  =18/299

Additionally, given a trained model object, depending on the metric desired, one would use the metric method or retrieve the value from a derived performance object. For classification, generally you would operate on the performance object. In this example, incorrect predictions in either direction are equally undesirable such that the metrics we may be concerned with are F1, accuracy or AUC.

# make performance objects
best_model_performance <- h2o.performance(best_model_results)
glm_performance <- h2o.performance(pengiun_glm)

# F1, just retrieving max irrespective of threshold
best_model_F1 <- max(h2o.F1(best_model_performance)$f1)
glm_F1 <- max(h2o.F1(glm_performance)$f1)

# accuracy, just retrieving max irrespective of threshold
best_model_accuracy <- max(h2o.accuracy(best_model_performance)$accuracy)
glm_accuracy <- max(h2o.accuracy(glm_performance)$accuracy)

# AUC
best_model_auc <- h2o.auc(best_model_performance)
glm_auc <- h2o.auc(glm_performance)

results <- as.data.frame(cbind(F1 = c(best_model_F1, glm_F1), 
                               Accuracy = c(best_model_accuracy, glm_accuracy), 
                               AUC = c(best_model_auc, glm_auc)))
rownames(results) <- c("autoML", "GLM")
knitr::kable(round(results, 3), caption = "Classification metrics")

Table 2: Classification metrics
	F1	Accuracy	AUC
autoML	0.993	0.993	0.999
GLM	0.942	0.940	0.977

Model prediction

The goal of modeling is to create a model that can predict on new data. H2O offers a method to give predictions on new data. Let’s make the predictions and plot the confusion matrix on this. Could have done it using H2O’s confusionMatrix function, but in preperation for LIME, let’s grab the predictions first for autoML and then the GLM:.

# make some quick predictions to see how we did
automl_best_model_predictions <- 
  h2o.predict(best_model_results, newdata = test_h2o) %>% 
  as_tibble() %>% 
  bind_cols(
    test_data_baked %>% select(sex)
    ) 

glm_predictions <- 
  h2o.predict(pengiun_glm, newdata = test_h2o) %>% 
  as_tibble() %>% 
  bind_cols(
    test_data_baked %>% select(sex)
    ) 

both_predictions <-
  bind_cols(automl_best_model_predictions[,1:3],
            glm_predictions)

names(both_predictions) <- c(paste0("automl_",names(automl_best_model_predictions[,1:3])), 
         paste0("glm_",names(glm_predictions[1:3])),"sex")

both_predictions <- both_predictions %>%
  select(sex, contains("predict"), everything())

both_predictions %>% 
  yardstick::conf_mat(sex, automl_predict) %>% 
  autoplot(type="heatmap") + labs(title="autoML confusion")

both_predictions %>% 
  yardstick::conf_mat(sex, glm_predict) %>% 
  autoplot(type="heatmap") + labs(title="GLM confusion")

From this, it looks like the GLM performs better on new data. This suggests there may be some over fitting. We can look at the learning curve for the top model. When we do this, it does indeed appear as though the model was overfit as evident by the increasing logloss for the CV training datasets starting around epoc 1000.

h2o.learning_curve_plot(best_model_results)

Model explainability

In many cases, explaining a model is important, for instance, in mechanistic studies or in model improvement efforts. H2O offers a single function for a global look at model factor importance, explain. Limiting to the classification example, the output includes a confusion matrix, variable importance plots and partial dependence plots as show below. Each of these and more can be generated through individual calls. For the autoML leader:

h2o.explain(best_model_results, test_h2o)



Confusion Matrix
================

> Confusion matrix shows a predicted class vs an actual class.



DeepLearning_grid_1_AutoML_1_20221110_205708_model_11
-----------------------------------------------------

|  | female | male | Error | Rate
|:---:|:---:|:---:|:---:|:---:|
| **female** |15 | 3 | 0.166666666666667 |  =3/18 | 
| **male** |0 | 16 | 0 |  =0/16 | 
| **Totals** |15 | 19 | 0.0882352941176471 |  =3/34 | 


Variable Importance
===================

> The variable importance plot shows the relative importance of the most important variables in the model.



Partial Dependence Plots
========================

> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.

From this, it appears species and bill length are the top variables leading to assignment of sex. Note that sex=male is a positive outcome, so response is 1 in the continuous plots.

Explanations for the glm model:

h2o.explain(pengiun_glm, test_h2o)



Confusion Matrix
================

> Confusion matrix shows a predicted class vs an actual class.



GLM_model_R_1668131819232_38796
-------------------------------

|  | female | male | Error | Rate
|:---:|:---:|:---:|:---:|:---:|
| **female** |18 | 0 | 0 |  =0/18 | 
| **male** |1 | 15 | 0.0625 |  =1/16 | 
| **Totals** |19 | 15 | 0.0294117647058824 |  =1/34 | 


Variable Importance
===================

> The variable importance plot shows the relative importance of the most important variables in the model.



Partial Dependence Plots
========================

> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.

The partial dependence plots for the continuous features look a lot more certain as the data approaches the extremes, ie large bill_volume values suggest with high probability a male penguin.

Feature importance and LIME

The interesting features are often the ones where the predictions are incorrect. To look at feature importance on individual predictions, H2O does offer methods for tree based algorithms, but not currently on regression or deep learning models. For this, I am turning to LIME for explanation of individual observations. To use LIME, you first build an explainer using the model and training data, then create an explanation of the data of interest. The explaination uses permutations of the observations variable values to get an idea of the local influence of the variables.

First, let’s remember what our prediction set is along with a filter for incorrect predictions.

both_predictions

# A tibble: 34 × 7
   sex    automl_predict glm_predict automl…¹ automl…² glm_f…³ glm_m…⁴
   <fct>  <fct>          <fct>          <dbl>    <dbl>   <dbl>   <dbl>
 1 female female         female      1.00e+ 0 4.96e- 6 6.04e-1 0.396  
 2 male   male           male        9.78e-65 1   e+ 0 6.59e-3 0.993  
 3 male   female         female      1.00e+ 0 1.27e- 4 7.74e-1 0.226  
 4 male   male           male        8.56e- 8 1.00e+ 0 3.81e-2 0.962  
 5 female female         female      1   e+ 0 2.50e-28 9.94e-1 0.00637
 6 male   male           male        9.16e-11 1.00e+ 0 8.86e-4 0.999  
 7 female female         female      1.00e+ 0 4.41e- 6 8.69e-1 0.131  
 8 male   male           male        4.75e-11 1.00e+ 0 1.39e-1 0.861  
 9 female female         female      1   e+ 0 5.89e-24 9.57e-1 0.0434 
10 female female         female      1.00e+ 0 6.61e- 7 8.93e-1 0.107  
# … with 24 more rows, and abbreviated variable names ¹automl_female,
#   ²automl_male, ³glm_female, ⁴glm_male

both_predictions %>% 
  mutate_if(is.factor, as.character) %>%
  filter(!(sex == automl_predict) | !(sex == glm_predict))

# A tibble: 4 × 7
  sex    automl_predict glm_predict automl_f…¹ autom…² glm_f…³ glm_m…⁴
  <chr>  <chr>          <chr>            <dbl>   <dbl>   <dbl>   <dbl>
1 male   female         female        1.00     1.27e-4   0.774  0.226 
2 female male           female        0.000153 1.00e+0   0.557  0.443 
3 female male           female        0.0117   9.88e-1   0.623  0.377 
4 female male           female        0.00211  9.98e-1   0.952  0.0481
# … with abbreviated variable names ¹automl_female, ²automl_male,
#   ³glm_female, ⁴glm_male

Now on to the autoML DeepLearning leader, let’s look at data point 3 which is incorrectly predicted by both the DL and GLM models.

test_data %>% 
  slice(3) %>% 
  mutate(bill_volume = 1/3 * bill_length_mm * bill_depth_mm^2) %>%
  glimpse()

Rows: 1
Columns: 9
$ species           <fct> Adelie
$ island            <fct> Biscoe
$ bill_length_mm    <dbl> 37.7
$ bill_depth_mm     <dbl> 18.7
$ flipper_length_mm <int> 180
$ body_mass_g       <int> 3600
$ sex               <fct> male
$ year              <int> 2007
$ bill_volume       <dbl> 4394.438

Looking back at the plots, this is a smaller bird, suggesting perhaps why the algorithm would have difficulties.

## ok, now for the fun, lets look at what the importance of each feature is
## build the explainer
explainer_penguins_autoML <- train_data_baked %>% 
  select(-sex) %>% 
  lime::lime(
        model         = best_model_results,
        bin_continous = TRUE,
        n_bins        = 4,
        quantile_bins = TRUE
)
#summary(explainer_penguins)

# explain a data point, should look at one it failed on
explanation_autoML <- test_data_baked %>% 
  slice(3) %>% 
  select(-sex) %>%
  lime::explain(
    explainer = explainer_penguins_autoML,
    n_labels = 1,
    #labels = "female",
    n_features = 4,
    n_permutations = 5000,
    kernel_width = 0.75
)

# plot the importance in explaining the decision
lime::plot_features(explanation_autoML)

# can do it for many predictions
explanations_autoML <- test_data_baked %>% slice(1:20) %>% select(-sex) %>%
  lime::explain(
      explainer = explainer_penguins_autoML,
      n_labels = 1,
      n_features = 8,
      #n_permutations = 5000,
      kernel_width = 0.9
)
lime::plot_explanations(explanations_autoML)

Looking at the variable importance plot shows bill length and body mass contributed to the assignment as female while species contradicted the assignment. Looking at the heatmap shows body mass and species=Adelie contributed the most to assignment of sex in the first 20 test cases.

On to the GLM model:

## ok, now for the fun, lets look at what the importance of each feature is
## build the explainer
explainer_penguins_glm <- train_data_baked %>% 
  select(-sex) %>% 
  lime::lime(
        model         = pengiun_glm,
        bin_continous = TRUE,
        n_bins        = 4,
        quantile_bins = TRUE
)
#summary(explainer_penguins)

# explain a data point
explanation_glm <- test_data_baked %>% slice(3) %>% select(-sex) %>%
  lime::explain(
    explainer = explainer_penguins_glm,
    n_labels = 1,
    #labels = "female",
    n_features = 4,
    n_permutations = 5000,
    kernel_width = 0.75
)

# plot the importance in explaining the decision
lime::plot_features(explanation_glm)

# can do it for many predictions
explanations_glm <- test_data_baked %>% slice(1:20) %>% select(-sex) %>%
  lime::explain(
      explainer = explainer_penguins_glm,
      n_labels = 1,
      n_features = 8,
      #n_permutations = 5000,
      kernel_width = 0.9
)
lime::plot_explanations(explanations_glm)

Fun with H2O and LIME

EDA