⚠️ work-in-progress
A score class object
Predictor importance can be assessed using three different random forest models via the following objects:
score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique
These models are powered by the following packages:
score_imp_rf@engine
#> [1] "ranger"
score_imp_rf_conditional@engine
#> [1] "partykit"
score_imp_rf_oblique@engine
#> [1] "aorsf"
Regarding score types:
The {ranger} random forest computes the importance scores.
The {partykit} conditional random forest computes the conditional importance scores.
The {aorsf} oblique random forest computes the permutation importance scores.
A random forest scoring example
The {modeldata} package contains a data set used to predict which cells in a high content screen were well segmented. It has 57 predictor columns and a factor outcome variable, class.
Since case only indicates the training/test split and is not used in the analysis, it will be set to NULL. Furthermore, for efficiency, we will use a small sample of 50 of the original 2019 observations.
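The preparation steps above can be sketched as follows. This is an assumption about how cells_subset was built; in particular, the seed used for sampling is not shown in the text and is chosen here only for illustration.

```r
library(modeldata)

data(cells)
cells$case <- NULL  # drop the Train/Test indicator; not used for analysis

set.seed(42)  # illustrative seed; the actual sampling seed is not shown
cells_subset <- cells[sample(nrow(cells), 50), ]
```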
First, we create a score class object to specify a {ranger} random forest, and then use the fit() method with the standard formula to compute the importance scores. The data frame of results can be accessed via object@results.
# Specify random forest and fit score
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    seed = 42
  )
cells_imp_rf_res@results
#> # A tibble: 56 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf -0.000441 class angle_ch_1
#> 2 imp_rf 0.00114 class area_ch_1
#> 3 imp_rf 0.00428 class avg_inten_ch_1
#> 4 imp_rf 0.00663 class avg_inten_ch_2
#> 5 imp_rf -0.000641 class avg_inten_ch_3
#> 6 imp_rf 0.00199 class avg_inten_ch_4
#> 7 imp_rf 0.00769 class convex_hull_area_ratio_ch_1
#> 8 imp_rf 0.000719 class convex_hull_perim_ratio_ch_1
#> 9 imp_rf 0.000438 class diff_inten_density_ch_1
#> 10 imp_rf -0.000265 class diff_inten_density_ch_3
#> # ℹ 46 more rows
Like {parsnip}, the argument names are harmonized. For example, the number of trees (num.trees in {ranger}, ntree in {partykit}, and n_tree in {aorsf}) is standardized to a single name, trees, so users only need to remember one name. The same applies to the number of variables to sample at each split, mtry, and the minimum node size for splitting, min_n.
# Set hyperparameters
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    trees = 100,
    mtry = 2,
    min_n = 1,
    seed = 42
  )
cells_imp_rf_res@results
#> # A tibble: 56 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf -0.00159 class angle_ch_1
#> 2 imp_rf 0.00261 class area_ch_1
#> 3 imp_rf 0.0100 class avg_inten_ch_1
#> 4 imp_rf -0.000803 class avg_inten_ch_2
#> 5 imp_rf 0.000801 class avg_inten_ch_3
#> 6 imp_rf 0.000823 class avg_inten_ch_4
#> 7 imp_rf 0.00410 class convex_hull_area_ratio_ch_1
#> 8 imp_rf 0.00577 class convex_hull_perim_ratio_ch_1
#> 9 imp_rf 0.00255 class diff_inten_density_ch_1
#> 10 imp_rf -0.00263 class diff_inten_density_ch_3
#> # ℹ 46 more rows
However, one argument is specific to {ranger}: for reproducibility, instead of calling the usual set.seed(), users must pass the seed argument.
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    trees = 100,
    mtry = 2,
    min_n = 1,
    seed = 42 # set seed for reproducibility
  )
If users prefer the argument names from {ranger}, that's fine too; the necessary adjustments are handled automatically. The following code chunk produces the same fitted score:
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    num.trees = 100,
    mtry = 2,
    min.node.size = 1,
    seed = 42
  )
A conditional random forest scoring example
For the {partykit} conditional random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores. The data frame of results can be accessed via object@results.
# Set seed for reproducibility
set.seed(42)
# Specify conditional random forest and fit score
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 100)
cells_imp_rf_conditional_res@results
#> # A tibble: 56 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf_conditional -0.0306 class angle_ch_1
#> 2 imp_rf_conditional 0.178 class area_ch_1
#> 3 imp_rf_conditional 0.158 class avg_inten_ch_1
#> 4 imp_rf_conditional 0.132 class avg_inten_ch_2
#> 5 imp_rf_conditional 0 class avg_inten_ch_3
#> 6 imp_rf_conditional 0 class avg_inten_ch_4
#> 7 imp_rf_conditional 0.0927 class convex_hull_area_ratio_ch_1
#> 8 imp_rf_conditional 0.963 class convex_hull_perim_ratio_ch_1
#> 9 imp_rf_conditional -0.0842 class diff_inten_density_ch_1
#> 10 imp_rf_conditional 0.0688 class diff_inten_density_ch_3
#> # ℹ 46 more rows
Note that when a predictor’s importance score is 0, partykit::cforest() may omit that predictor’s name from its output. In such cases, a score of 0 is assigned to the missing predictors automatically.
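The zero-filling behavior can be illustrated with a small sketch. The vectors below are hypothetical stand-ins, not actual {partykit} output; the package performs this step internally.

```r
# Suppose cforest() returned scores for only two of three predictors:
returned <- c(area_ch_1 = 0.178, avg_inten_ch_1 = 0.158)
all_predictors <- c("angle_ch_1", "area_ch_1", "avg_inten_ch_1")

# Start every predictor at 0, then overwrite the scores that were returned;
# omitted predictors (here angle_ch_1) keep a score of 0.
scores <- setNames(rep(0, length(all_predictors)), all_predictors)
scores[names(returned)] <- returned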
An oblique random forest scoring example
For the {aorsf} oblique random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores. The data frame of results can be accessed via object@results.
# Set seed for reproducibility
set.seed(42)
# Specify oblique random forest and fit score
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, trees = 100, mtry = 2)
cells_imp_rf_oblique_res@results
#> # A tibble: 56 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf_oblique -0.00140 class angle_ch_1
#> 2 imp_rf_oblique 0.00334 class area_ch_1
#> 3 imp_rf_oblique 0.00344 class avg_inten_ch_1
#> 4 imp_rf_oblique 0.00516 class avg_inten_ch_2
#> 5 imp_rf_oblique -0.00135 class avg_inten_ch_3
#> 6 imp_rf_oblique -0.000552 class avg_inten_ch_4
#> 7 imp_rf_oblique 0.00159 class convex_hull_area_ratio_ch_1
#> 8 imp_rf_oblique 0.00406 class convex_hull_perim_ratio_ch_1
#> 9 imp_rf_oblique 0.00526 class diff_inten_density_ch_1
#> 10 imp_rf_oblique 0.00284 class diff_inten_density_ch_3
#> # ℹ 46 more rows
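Since each results object is an ordinary data frame, the scores can be ranked directly. A sketch, assuming the fitted cells_imp_rf_oblique_res object from the example above:

```r
# Rank predictors by importance, largest score first, and keep the top five
res <- cells_imp_rf_oblique_res@results
head(res[order(res$score, decreasing = TRUE), c("predictor", "score")], 5)
```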