⚠️ work-in-progress

A score class object

Predictor importance can be assessed using three different random forest models via the following objects:

score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique

These models are powered by the following packages:

score_imp_rf@engine
#> [1] "ranger"
score_imp_rf_conditional@engine
#> [1] "partykit"
score_imp_rf_oblique@engine
#> [1] "aorsf"

Regarding score types:

  • The {ranger} random forest computes the importance scores.

  • The {partykit} conditional random forest computes the conditional importance scores.

  • The {aorsf} oblique random forest computes the permutation importance scores.

A random forest scoring example

The {modeldata} package contains a data set used to predict which cells in a high content screen were well segmented. It has 56 numeric predictor columns, an indicator column case, and a factor variable class (the outcome).

Since case only indicates the original train/test split and is not used in the analysis, it will be set to NULL. For efficiency, we will also use a small sample of 50 of the original 2019 observations.

cells_subset <- modeldata::cells |> 
  # Use a small example for efficiency
  dplyr::slice(1:50)
cells_subset$case <- NULL

# cells_subset |> str() # uncomment to see the structure of the data

First, we create a score class object to specify a {ranger} random forest, and then use the fit() method with the standard formula interface to compute the importance scores.

The data frame of results can be accessed via object@results.

# Specify random forest and fit score
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset, 
    seed = 42 
  )
cells_imp_rf_res@results
#> # A tibble: 56 × 4
#>    name       score outcome predictor                   
#>    <chr>      <dbl> <chr>   <chr>                       
#>  1 imp_rf -0.000441 class   angle_ch_1                  
#>  2 imp_rf  0.00114  class   area_ch_1                   
#>  3 imp_rf  0.00428  class   avg_inten_ch_1              
#>  4 imp_rf  0.00663  class   avg_inten_ch_2              
#>  5 imp_rf -0.000641 class   avg_inten_ch_3              
#>  6 imp_rf  0.00199  class   avg_inten_ch_4              
#>  7 imp_rf  0.00769  class   convex_hull_area_ratio_ch_1 
#>  8 imp_rf  0.000719 class   convex_hull_perim_ratio_ch_1
#>  9 imp_rf  0.000438 class   diff_inten_density_ch_1     
#> 10 imp_rf -0.000265 class   diff_inten_density_ch_3     
#> # ℹ 46 more rows
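Because object@results is an ordinary tibble, the scores can be post-processed with standard {dplyr} verbs. As a small sketch (assuming the fitted object from above), ranking the predictors from most to least important:

```r
# Sort predictors by decreasing importance score (sketch)
cells_imp_rf_res@results |>
  dplyr::arrange(dplyr::desc(score))
```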

As in {parsnip}, argument names are harmonized across engines. For example, the number of trees is num.trees in {ranger}, ntree in {partykit}, and n_tree in {aorsf}; all three are standardized to the single name trees, so users only need to remember one. The same applies to mtry, the number of variables to split on at each node, and min_n, the minimum node size for splitting.

# Set hyperparameters
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    trees = 100,
    mtry = 2,
    min_n = 1, 
    seed = 42 
  )
cells_imp_rf_res@results
#> # A tibble: 56 × 4
#>    name       score outcome predictor                   
#>    <chr>      <dbl> <chr>   <chr>                       
#>  1 imp_rf -0.00159  class   angle_ch_1                  
#>  2 imp_rf  0.00261  class   area_ch_1                   
#>  3 imp_rf  0.0100   class   avg_inten_ch_1              
#>  4 imp_rf -0.000803 class   avg_inten_ch_2              
#>  5 imp_rf  0.000801 class   avg_inten_ch_3              
#>  6 imp_rf  0.000823 class   avg_inten_ch_4              
#>  7 imp_rf  0.00410  class   convex_hull_area_ratio_ch_1 
#>  8 imp_rf  0.00577  class   convex_hull_perim_ratio_ch_1
#>  9 imp_rf  0.00255  class   diff_inten_density_ch_1     
#> 10 imp_rf -0.00263  class   diff_inten_density_ch_3     
#> # ℹ 46 more rows

However, one argument name is specific to {ranger}. For reproducibility, use its seed argument instead of the usual set.seed() function.

cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    trees = 100,
    mtry = 2,
    min_n = 1, 
    seed = 42 # set seed for reproducibility
  )

Using the native argument names from {ranger} is also fine; the necessary translation is handled automatically. The following code chunk obtains the same fitted score:

cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,     
    num.trees = 100,
    mtry = 2,
    min.node.size = 1, 
    seed = 42 
  )
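Since both spellings map to the same underlying {ranger} call and the same seed is supplied, the two fits should produce identical score tables. A quick sketch to check this, assuming the objects above:

```r
# The harmonized and engine-specific spellings should agree (sketch)
fit_harmonized <- score_imp_rf |>
  fit(class ~ ., data = cells_subset,
      trees = 100, mtry = 2, min_n = 1, seed = 42)
fit_native <- score_imp_rf |>
  fit(class ~ ., data = cells_subset,
      num.trees = 100, mtry = 2, min.node.size = 1, seed = 42)
identical(fit_harmonized@results, fit_native@results)
```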

A conditional random forest scoring example

For the {partykit} conditional random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores.

The data frame of results can be accessed via object@results.

# Set seed for reproducibility
set.seed(42)

# Specify conditional random forest and fit score
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 100)
cells_imp_rf_conditional_res@results
#> # A tibble: 56 × 4
#>    name                 score outcome predictor                   
#>    <chr>                <dbl> <chr>   <chr>                       
#>  1 imp_rf_conditional -0.0306 class   angle_ch_1                  
#>  2 imp_rf_conditional  0.178  class   area_ch_1                   
#>  3 imp_rf_conditional  0.158  class   avg_inten_ch_1              
#>  4 imp_rf_conditional  0.132  class   avg_inten_ch_2              
#>  5 imp_rf_conditional  0      class   avg_inten_ch_3              
#>  6 imp_rf_conditional  0      class   avg_inten_ch_4              
#>  7 imp_rf_conditional  0.0927 class   convex_hull_area_ratio_ch_1 
#>  8 imp_rf_conditional  0.963  class   convex_hull_perim_ratio_ch_1
#>  9 imp_rf_conditional -0.0842 class   diff_inten_density_ch_1     
#> 10 imp_rf_conditional  0.0688 class   diff_inten_density_ch_3     
#> # ℹ 46 more rows

Note that when a predictor’s importance score is 0, partykit::cforest() may omit its name from the output. In such cases, a score of 0 is assigned to the missing predictors automatically.
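To see which predictors received a zero score (whether reported or filled in), the results can be filtered directly — a sketch assuming the fitted object from above:

```r
# Predictors whose conditional importance is 0 (sketch)
cells_imp_rf_conditional_res@results |>
  dplyr::filter(score == 0)
```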

An oblique random forest scoring example

For the {aorsf} oblique random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores.

The data frame of results can be accessed via object@results.

# Set seed for reproducibility
set.seed(42)

# Specify oblique random forest and fit score
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, trees = 100, mtry = 2)
cells_imp_rf_oblique_res@results
#> # A tibble: 56 × 4
#>    name               score outcome predictor                   
#>    <chr>              <dbl> <chr>   <chr>                       
#>  1 imp_rf_oblique -0.00140  class   angle_ch_1                  
#>  2 imp_rf_oblique  0.00334  class   area_ch_1                   
#>  3 imp_rf_oblique  0.00344  class   avg_inten_ch_1              
#>  4 imp_rf_oblique  0.00516  class   avg_inten_ch_2              
#>  5 imp_rf_oblique -0.00135  class   avg_inten_ch_3              
#>  6 imp_rf_oblique -0.000552 class   avg_inten_ch_4              
#>  7 imp_rf_oblique  0.00159  class   convex_hull_area_ratio_ch_1 
#>  8 imp_rf_oblique  0.00406  class   convex_hull_perim_ratio_ch_1
#>  9 imp_rf_oblique  0.00526  class   diff_inten_density_ch_1     
#> 10 imp_rf_oblique  0.00284  class   diff_inten_density_ch_3     
#> # ℹ 46 more rows
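To compare the three importance measures side by side, the result tibbles can be stacked and reshaped so each scoring method becomes a column. A sketch assuming all three fitted objects from above and the {tidyr} package:

```r
# One row per predictor, one column per scoring method (sketch)
dplyr::bind_rows(
  cells_imp_rf_res@results,
  cells_imp_rf_conditional_res@results,
  cells_imp_rf_oblique_res@results
) |>
  tidyr::pivot_wider(names_from = name, values_from = score)
```

Note that the three scores are on different scales, so they are better compared by rank than by magnitude.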