⚠️ work-in-progress

This document demonstrates some basic uses of filtro. We’ll need to load a few packages:
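A minimal setup might look like this (the exact package list is an assumption based on the calls used below; {modeldata} is accessed via modeldata::):

library(filtro)
library(dplyr)  # mutate()/select() are also called via dplyr:: below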

A scoring example

The {modeldata} package contains a data set used to predict housing sale prices. It has 73 predictor columns and a numeric variable Sale_Price (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.

ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ames |> str() # uncomment to see the structure of the data

To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the fit() method with the standard formula to compute the scores.

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)

The data frame of results can be accessed via object@results.

ames_aov_pval_res@results
#> # A tibble: 73 × 4
#>    name      score outcome    predictor   
#>    <chr>     <dbl> <chr>      <chr>       
#>  1 aov_pval 237.   Sale_Price MS_SubClass 
#>  2 aov_pval 130.   Sale_Price MS_Zoning   
#>  3 aov_pval  NA    Sale_Price Lot_Frontage
#>  4 aov_pval  NA    Sale_Price Lot_Area    
#>  5 aov_pval   5.75 Sale_Price Street      
#>  6 aov_pval  19.2  Sale_Price Alley       
#>  7 aov_pval  71.3  Sale_Price Lot_Shape   
#>  8 aov_pval  21.4  Sale_Price Land_Contour
#>  9 aov_pval   1.38 Sale_Price Utilities   
#> 10 aov_pval  12.0  Sale_Price Lot_Config  
#> # ℹ 63 more rows

A couple of notes here:

Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both of the following cases:

  • The predictors are numeric and the outcome is categorical, or

  • The predictors are categorical and the outcome is numeric.

Because the outcome is numeric, any predictor that is not a factor will result in an NA. In cases where an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.

By default, this filter computes -log10(p_value), so that larger values indicate more important predictors. If users prefer raw p-values, a helper function dont_log_pvalues() is available.

For this specific filter, i.e., score_aov_*, case weights are supported. For other filters, check the object@case_weights property to see whether they can use case weights.
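For instance, we can inspect these properties on the fitted object and refit with raw p-values (a small sketch based only on the attributes and helper mentioned above):

# Safe value substituted when a score is NA
ames_aov_pval_res@fallback_value

# Whether this scoring method can use case weights
ames_aov_pval_res@case_weights

# Keep raw p-values instead of -log10(p)
score_aov_pval |>
  dont_log_pvalues() |>
  fit(Sale_Price ~ ., data = ames)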

Filtering and ranking

There are two main ways to rank and select a top proportion or number of features.

To filter or rank predictors using a single score, we can use built-in methods:

  • show_best_score_*()

  • rank_best_score_*()

For multi-parameter optimization, we can use API calls adapted from {desirability}:

  • show_best_desirability_*()

A filtering example for score singular

The show_best_score_prop() function returns the top-scoring predictors for a single metric. The prop_terms argument lets us control the proportion of predictors to keep.

# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
#> # A tibble: 14 × 4
#>    name     score outcome    predictor     
#>    <chr>    <dbl> <chr>      <chr>         
#>  1 aov_pval Inf   Sale_Price Neighborhood  
#>  2 aov_pval 288.  Sale_Price Garage_Finish 
#>  3 aov_pval 243.  Sale_Price Garage_Type   
#>  4 aov_pval 242.  Sale_Price Foundation    
#>  5 aov_pval 237.  Sale_Price MS_SubClass   
#>  6 aov_pval 183.  Sale_Price Heating_QC    
#>  7 aov_pval 173.  Sale_Price BsmtFin_Type_1
#>  8 aov_pval 132.  Sale_Price Mas_Vnr_Type  
#>  9 aov_pval 130.  Sale_Price Overall_Cond  
#> 10 aov_pval 130.  Sale_Price MS_Zoning     
#> 11 aov_pval 127.  Sale_Price Exterior_1st  
#> 12 aov_pval 116.  Sale_Price Exterior_2nd  
#> 13 aov_pval 116.  Sale_Price Bsmt_Exposure 
#> 14 aov_pval 100.0 Sale_Price Garage_Cond

A filtering example for scores plural

To handle multiple scores, we first create multiple score class objects, and then use the fit() method with the standard formula to compute the scores.

# ANOVA raw p-value 
natural_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natural_res <-
  natural_units |>
  fit(Sale_Price ~ ., data = ames)

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)

# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)

Next, we create a list to collect these score class objects, including their associated metadata and scores.

# Create a list
class_score_list <- list(
  ames_aov_pval_natural_res, 
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)

Then, we fill in the safe values specific to each method and remove the outcome column.

# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
#> # A tibble: 73 × 5
#>    predictor     aov_pval cor_pearson     imp_rf infogain
#>    <chr>            <dbl>       <dbl>      <dbl>    <dbl>
#>  1 MS_SubClass  1.68e-237       1     0.000449    0.266  
#>  2 MS_Zoning    2.75e-130       1     0.000386    0.113  
#>  3 Lot_Frontage 1.11e- 16       0.165 0.000194    0.146  
#>  4 Lot_Area     1.11e- 16       0.255 0.000736    0.140  
#>  5 Street       1.77e-  6       1     0.00000263  0.00365
#>  6 Alley        6.06e- 20       1     0.00000782  0.0254 
#>  7 Lot_Shape    5.17e- 72       1     0.0000880   0.0675 
#>  8 Land_Contour 3.79e- 22       1     0.0000480   0.0212 
#>  9 Utilities    4.16e-  2       1     0           0.00165
#> 10 Lot_Config   1.04e- 12       1     0.0000138   0.0133 
#> # ℹ 63 more rows

Analogous to show_best_desirability(), the show_best_desirability_prop() function allows joint optimization of multiple metrics using desirability functions.

A desirability function maps values of a metric to a [0, 1] range, where 1 is most desirable and 0 is unacceptable. When the verb maximize() is used, it means that larger values are better. This is the case for Pearson correlation, forest importance, and information gain.
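As a rough illustration of the mapping (a minimal sketch, not the package's implementation: it assumes the standard linear desirability ramp, and the overall value appears to be the geometric mean of the individual desirabilities, consistent with the .d_overall columns shown below):

# Hypothetical linear maximize() desirability: values at or below `low`
# map to 0, values at or above `high` map to 1
d_maximize <- function(x, low, high) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Combine several desirabilities into one overall value (geometric mean)
d_overall <- function(...) exp(rowMeans(log(cbind(...))))

d_maximize(c(-0.2, 0.5, 1.3), low = 0, high = 1)  # 0.0 0.5 1.0
d_overall(0.696, 1)                               # ~0.834, cf. Gr_Liv_Area below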

For example:

# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |> 
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_max_cor_pearson .d_overall
#>    <chr>                     <dbl>      <dbl>
#>  1 MS_SubClass                   1          1
#>  2 MS_Zoning                     1          1
#>  3 Street                        1          1
#>  4 Alley                         1          1
#>  5 Lot_Shape                     1          1
#>  6 Land_Contour                  1          1
#>  7 Utilities                     1          1
#>  8 Lot_Config                    1          1
#>  9 Land_Slope                    1          1
#> 10 Neighborhood                  1          1
#> # ℹ 63 more rows

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 4
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_overall
#>    <chr>                       <dbl>         <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696         1          0.834
#>  2 Year_Built                  0.615         0.877      0.735
#>  3 Total_Bsmt_SF               0.626         0.594      0.610
#>  4 Year_Remod_Add              0.586         0.549      0.567
#>  5 Garage_Type                 1             0.308      0.555
#>  6 First_Flr_SF                0.603         0.474      0.534
#>  7 Garage_Cars                 0.675         0.417      0.530
#>  8 Garage_Area                 0.651         0.432      0.530
#>  9 Full_Bath                   0.577         0.308      0.421
#> 10 Foundation                  1             0.151      0.388
#> # ℹ 63 more rows

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 5
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#>    <chr>                       <dbl>         <dbl>           <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696         1               0.832      0.833
#>  2 Year_Built                  0.615         0.877           0.709      0.726
#>  3 Total_Bsmt_SF               0.626         0.594           0.625      0.615
#>  4 Garage_Cars                 0.675         0.417           0.708      0.584
#>  5 Garage_Area                 0.651         0.432           0.684      0.577
#>  6 Year_Remod_Add              0.586         0.549           0.514      0.549
#>  7 First_Flr_SF                0.603         0.474           0.551      0.540
#>  8 Garage_Type                 1             0.308           0.453      0.519
#>  9 Neighborhood                1             0.127           1          0.503
#> 10 Full_Bath                   0.577         0.308           0.527      0.454
#> # ℹ 63 more rows

In show_best_desirability_prop(), there is an argument called prop_terms that lets us control the proportion of predictors to keep.

# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 14 × 5
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#>    <chr>                       <dbl>         <dbl>           <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696        1                0.832      0.833
#>  2 Year_Built                  0.615        0.877            0.709      0.726
#>  3 Total_Bsmt_SF               0.626        0.594            0.625      0.615
#>  4 Garage_Cars                 0.675        0.417            0.708      0.584
#>  5 Garage_Area                 0.651        0.432            0.684      0.577
#>  6 Year_Remod_Add              0.586        0.549            0.514      0.549
#>  7 First_Flr_SF                0.603        0.474            0.551      0.540
#>  8 Garage_Type                 1            0.308            0.453      0.519
#>  9 Neighborhood                1            0.127            1          0.503
#> 10 Full_Bath                   0.577        0.308            0.527      0.454
#> 11 Foundation                  1            0.151            0.454      0.409
#> 12 MS_SubClass                 1            0.109            0.576      0.398
#> 13 Garage_Finish               1            0.0837           0.501      0.347
#> 14 Fireplaces                  0.489        0.241            0.331      0.339

Besides maximize(), the additional verbs available are minimize(), target(), and constrain(). They are used in different situations:

  • maximize() when larger values are better.

  • minimize() when smaller values are better.

  • target() when a specific value of the metric is important.

  • constrain() when a range of values is equally desirable.

For example:

ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_min_aov_pval .d_overall
#>    <chr>                  <dbl>      <dbl>
#>  1 MS_SubClass                1          1
#>  2 MS_Zoning                  1          1
#>  3 Alley                      1          1
#>  4 Lot_Shape                  1          1
#>  5 Land_Contour               1          1
#>  6 Neighborhood               1          1
#>  7 Condition_1                1          1
#>  8 Bldg_Type                  1          1
#>  9 House_Style                1          1
#> 10 Overall_Cond               1          1
#> # ℹ 63 more rows

ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor      .d_target_cor_pearson .d_overall
#>    <chr>                          <dbl>      <dbl>
#>  1 Lot_Area                       1.000      1.000
#>  2 Second_Flr_SF                  0.969      0.969
#>  3 Bsmt_Full_Bath                 0.969      0.969
#>  4 Latitude                       0.952      0.952
#>  5 Half_Bath                      0.921      0.921
#>  6 Open_Porch_SF                  0.899      0.899
#>  7 Wood_Deck_SF                   0.879      0.879
#>  8 Mas_Vnr_Area                   0.709      0.709
#>  9 Fireplaces                     0.637      0.637
#> 10 TotRms_AbvGrd                  0.632      0.632
#> # ℹ 63 more rows

ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_box_cor_pearson .d_overall
#>    <chr>                     <dbl>      <dbl>
#>  1 MS_SubClass                   1          1
#>  2 MS_Zoning                     1          1
#>  3 Lot_Area                      1          1
#>  4 Street                        1          1
#>  5 Alley                         1          1
#>  6 Lot_Shape                     1          1
#>  7 Land_Contour                  1          1
#>  8 Utilities                     1          1
#>  9 Lot_Config                    1          1
#> 10 Land_Slope                    1          1
#> # ℹ 63 more rows

Available score objects and filter methods

The list of score class objects included in the package:

#>  [1] "score_aov_fstat"          "score_aov_pval"
#>  [3] "score_cor_pearson"        "score_cor_spearman"
#>  [5] "score_gain_ratio"         "score_imp_rf"
#>  [7] "score_imp_rf_conditional" "score_imp_rf_oblique"
#>  [9] "score_info_gain"          "score_roc_auc"
#> [11] "score_sym_uncert"         "score_xtab_pval_chisq"
#> [13] "score_xtab_pval_fisher"

The list of filter methods for score singular:

#> [1] "show_best_score_cutoff" "show_best_score_dual"   "show_best_score_num"
#> [4] "show_best_score_prop"

The list of filter methods for scores plural:

#> [1] "show_best_desirability_num"  "show_best_desirability_prop"