⚠️ work-in-progress
This document demonstrates some basic uses of filtro.
A scoring example
The {modeldata} package contains the Ames housing data set (ames), used to predict housing sale price. It has 73 predictor columns and a numeric outcome, Sale_Price. Since the outcome is right-skewed, we apply a log (base 10) transformation.
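A minimal setup sketch, assuming the packages are attached as below (the original run's preprocessing code is not shown, so the exact steps are an assumption):
library(filtro)
library(modeldata)
# Log-transform (base 10) the right-skewed outcome
ames <- modeldata::ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))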
To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the fit() method with the standard formula interface to compute the scores.
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)
The data frame of results can be accessed via object@results.
ames_aov_pval_res@results
#> # A tibble: 73 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 aov_pval 237. Sale_Price MS_SubClass
#> 2 aov_pval 130. Sale_Price MS_Zoning
#> 3 aov_pval NA Sale_Price Lot_Frontage
#> 4 aov_pval NA Sale_Price Lot_Area
#> 5 aov_pval 5.75 Sale_Price Street
#> 6 aov_pval 19.2 Sale_Price Alley
#> 7 aov_pval 71.3 Sale_Price Lot_Shape
#> 8 aov_pval 21.4 Sale_Price Land_Contour
#> 9 aov_pval 1.38 Sale_Price Utilities
#> 10 aov_pval 12.0 Sale_Price Lot_Config
#> # ℹ 63 more rows
A couple of notes here:
Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both cases (a sketch of the first case follows this list):
- The predictors are numeric and the outcome is categorical, or
- The predictors are categorical and the outcome is numeric.
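To illustrate the first case, a quick sketch that scores predictors against a categorical outcome, reusing the Bldg_Type factor from ames purely for illustration:
# With a factor outcome, numeric predictors are scored;
# the remaining factor predictors should yield NA
ames_aov_class_res <-
  score_aov_pval |>
  fit(Bldg_Type ~ ., data = ames)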
In our example, because the outcome (Sale_Price) is numeric, any predictor that is not a factor will result in an NA. In cases where an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.
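For example, to inspect the safe value for this filter:
# The fallback substituted when a score is NA
ames_aov_pval_res@fallback_value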
By default, the filter computes -log10(p_value), so that larger values indicate more important predictors. If users prefer raw p-values, a helper function dont_log_pvalues() is available (demonstrated in the multiple-scores example below).
For this specific filter, i.e., score_aov_*, case weights are supported.
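As a sketch of how that might look, assuming fit() follows the usual tidymodels convention of a case_weights argument that accepts a hardhat weights vector (the argument name and helper pairing are assumptions; consult the package reference):
# Hypothetical: equal frequency weights for every row
wts <- hardhat::frequency_weights(rep(1L, nrow(ames)))
ames_aov_pval_wtd_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames, case_weights = wts)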
Ranking and filtering
There are two ways to rank and select a top proportion or number of features.
For a single score, built-in methods are available:
- show_best_score_*()
- rank_best_score_*()
For multiple scores, users can use API calls adapted from {desirability2} for multi-parameter optimization:
- show_best_desirability_*()
A filtering example for score singular
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
#> # A tibble: 14 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 aov_pval Inf Sale_Price Neighborhood
#> 2 aov_pval 288. Sale_Price Garage_Finish
#> 3 aov_pval 243. Sale_Price Garage_Type
#> 4 aov_pval 242. Sale_Price Foundation
#> 5 aov_pval 237. Sale_Price MS_SubClass
#> 6 aov_pval 183. Sale_Price Heating_QC
#> 7 aov_pval 173. Sale_Price BsmtFin_Type_1
#> 8 aov_pval 132. Sale_Price Mas_Vnr_Type
#> 9 aov_pval 130. Sale_Price Overall_Cond
#> 10 aov_pval 130. Sale_Price MS_Zoning
#> 11 aov_pval 127. Sale_Price Exterior_1st
#> 12 aov_pval 116. Sale_Price Exterior_2nd
#> 13 aov_pval 116. Sale_Price Bsmt_Exposure
#> 14 aov_pval 100.0 Sale_Price Garage_Cond
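The show_best_score_*() wildcard implies companion variants beyond the proportion-based one; a hypothetical sketch assuming a count-based *_num() counterpart (the name and its num_terms argument are assumptions; check the package reference):
# Hypothetical: keep the top 10 predictors by score
ames_aov_pval_res |> show_best_score_num(num_terms = 10)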
A filtering example for scores plural
To handle multiple scores, we first create multiple score class objects, and then use the fit() method with the standard formula interface to compute the scores.
# ANOVA raw p-value
natural_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natural_res <-
  natural_units |>
  fit(Sale_Price ~ ., data = ames)
# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)
# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)
# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)
Next, we create a list to store these score class objects, including their associated metadata and scores.
# Create a list
class_score_list <- list(
  ames_aov_pval_natural_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
Then, we fill in the safe value specific to each method and remove the outcome column.
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
#> # A tibble: 73 × 5
#> predictor aov_pval cor_pearson imp_rf infogain
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 MS_SubClass 1.68e-237 1 0.000449 0.266
#> 2 MS_Zoning 2.75e-130 1 0.000386 0.113
#> 3 Lot_Frontage 1.11e- 16 0.165 0.000194 0.146
#> 4 Lot_Area 1.11e- 16 0.255 0.000736 0.140
#> 5 Street 1.77e- 6 1 0.00000263 0.00365
#> 6 Alley 6.06e- 20 1 0.00000782 0.0254
#> 7 Lot_Shape 5.17e- 72 1 0.0000880 0.0675
#> 8 Land_Contour 3.79e- 22 1 0.0000480 0.0212
#> 9 Utilities 4.16e- 2 1 0 0.00165
#> 10 Lot_Config 1.04e- 12 1 0.0000138 0.0133
#> # ℹ 63 more rows
Analogous to show_best_desirability(), we can jointly optimize multiple scores.
A desirability function maps values of a metric to a [0, 1] range, where 1 is most desirable and 0 is unacceptable. When the verb maximize() is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain. For example:
# Single and multi-parameter optimization using desirability functions
# Optimize correlation
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_max_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Street 1 1
#> 4 Alley 1 1
#> 5 Lot_Shape 1 1
#> 6 Land_Contour 1 1
#> 7 Utilities 1 1
#> 8 Lot_Config 1 1
#> 9 Land_Slope 1 1
#> 10 Neighborhood 1 1
#> # ℹ 63 more rows
# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 4
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_overall
#> <chr> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.834
#> 2 Year_Built 0.615 0.877 0.735
#> 3 Total_Bsmt_SF 0.626 0.594 0.610
#> 4 Year_Remod_Add 0.586 0.549 0.567
#> 5 Garage_Type 1 0.308 0.555
#> 6 First_Flr_SF 0.603 0.474 0.534
#> 7 Garage_Cars 0.675 0.417 0.530
#> 8 Garage_Area 0.651 0.432 0.530
#> 9 Full_Bath 0.577 0.308 0.421
#> 10 Foundation 1 0.151 0.388
#> # ℹ 63 more rows
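The combination rule behind .d_overall is the geometric mean of the individual desirabilities, with each maximize() term rescaled linearly between low and high; this reproduces Gr_Liv_Area's 0.834 above. A hand-rolled sketch for intuition (not the package's internals):
# Linear "maximize" ramp: 0 at or below `low`, 1 at or above `high`
d_max <- function(x, low, high) pmin(pmax((x - low) / (high - low), 0), 1)
# Gr_Liv_Area's two desirabilities from the example above
d_vals <- c(d_max(0.696, low = 0, high = 1), 1)
# Geometric mean
prod(d_vals)^(1 / length(d_vals))
#> [1] 0.8342661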
# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 5
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.832 0.833
#> 2 Year_Built 0.615 0.877 0.709 0.726
#> 3 Total_Bsmt_SF 0.626 0.594 0.625 0.615
#> 4 Garage_Cars 0.675 0.417 0.708 0.584
#> 5 Garage_Area 0.651 0.432 0.684 0.577
#> 6 Year_Remod_Add 0.586 0.549 0.514 0.549
#> 7 First_Flr_SF 0.603 0.474 0.551 0.540
#> 8 Garage_Type 1 0.308 0.453 0.519
#> 9 Neighborhood 1 0.127 1 0.503
#> 10 Full_Bath 0.577 0.308 0.527 0.454
#> # ℹ 63 more rows
Additionally, show_best_desirability_prop() has an argument called prop_terms that lets us control the proportion of predictors to keep.
# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 14 × 5
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.832 0.833
#> 2 Year_Built 0.615 0.877 0.709 0.726
#> 3 Total_Bsmt_SF 0.626 0.594 0.625 0.615
#> 4 Garage_Cars 0.675 0.417 0.708 0.584
#> 5 Garage_Area 0.651 0.432 0.684 0.577
#> 6 Year_Remod_Add 0.586 0.549 0.514 0.549
#> 7 First_Flr_SF 0.603 0.474 0.551 0.540
#> 8 Garage_Type 1 0.308 0.453 0.519
#> 9 Neighborhood 1 0.127 1 0.503
#> 10 Full_Bath 0.577 0.308 0.527 0.454
#> 11 Foundation 1 0.151 0.454 0.409
#> 12 MS_SubClass 1 0.109 0.576 0.398
#> 13 Garage_Finish 1 0.0837 0.501 0.347
#> 14 Fireplaces 0.489 0.241 0.331 0.339
Besides maximize(), additional verbs are available: minimize(), target(), and constrain(). They are used in different situations:
- maximize() when larger values are better.
- minimize() when smaller values are better.
- target() when a specific value of the metric is important.
- constrain() when a range of values is equally desirable.
For example:
ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_min_aov_pval .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Alley 1 1
#> 4 Lot_Shape 1 1
#> 5 Land_Contour 1 1
#> 6 Neighborhood 1 1
#> 7 Condition_1 1 1
#> 8 Bldg_Type 1 1
#> 9 House_Style 1 1
#> 10 Overall_Cond 1 1
#> # ℹ 63 more rows
ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_target_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 Lot_Area 1.000 1.000
#> 2 Second_Flr_SF 0.969 0.969
#> 3 Bsmt_Full_Bath 0.969 0.969
#> 4 Latitude 0.952 0.952
#> 5 Half_Bath 0.921 0.921
#> 6 Open_Porch_SF 0.899 0.899
#> 7 Wood_Deck_SF 0.879 0.879
#> 8 Mas_Vnr_Area 0.709 0.709
#> 9 Fireplaces 0.637 0.637
#> 10 TotRms_AbvGrd 0.632 0.632
#> # ℹ 63 more rows
ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_box_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Lot_Area 1 1
#> 4 Street 1 1
#> 5 Alley 1 1
#> 6 Lot_Shape 1 1
#> 7 Land_Contour 1 1
#> 8 Utilities 1 1
#> 9 Lot_Config 1 1
#> 10 Land_Slope 1 1
#> # ℹ 63 more rows
List of score objects and filter methods provided by the package
A comprehensive list of score class objects included in the package:
- score_aov_pval
- score_aov_fstat
- score_cor_pearson
- score_cor_spearman
- score_imp_rf
- score_imp_rf_conditional
- score_imp_rf_oblique
- score_info_gain
- score_gain_ratio
- score_sym_uncert
- score_roc_auc
- score_xtab_pval_chisq
- score_xtab_pval_fisher
The filter methods for score singular:
- show_best_score_*()
- rank_best_score_*()
The filter methods for scores plural:
- show_best_desirability_*()