--- title: "Exploring variable importance with varPro" author: "John Ehrlinger" date: today format: html: fig-format: png fig-dpi: 96 toc: true toc-depth: 3 html-math-method: mathjax code-fold: false bibliography: ggRandomForests.bib vignette: > %\VignetteIndexEntry{Exploring variable importance with varPro} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( message = FALSE, warning = FALSE, fig.width = 7, fig.height = 4.5, cache = TRUE, cache.path = "varpro_cache/" ) options(mc.cores = 1, rf.cores = 1) ``` ```{r libs} library(ggplot2) # Match the pattern used by the regression/survival vignettes: try the # installed package first, fall back to pkgload::load_all() for the # R CMD check vignette rebuild where the package isn't yet on # .libPaths(). All varPro calls below are ::-qualified, so no # library(varPro) is needed. if (requireNamespace("ggRandomForests", quietly = TRUE)) { library(ggRandomForests) } else if (requireNamespace("pkgload", quietly = TRUE)) { pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE) } else { stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.") } ``` ```{r precomputed, include=FALSE, cache=FALSE} # The three gg_partial_varpro() calls and the Boston beta.varpro() fit are # the only expensive parts of this vignette (~34s combined). They are # precomputed by precompute_varpro.R and loaded here so the R CMD check # vignette rebuild stays fast. If the file is absent (e.g. a fresh clone # that hasn't run the script), each chunk falls back to live computation, # so the vignette is reproducible either way. .vp <- tryCatch({ # R CMD check rebuilds the vignette in a copy whose working directory is the # vignette dir; an interactive knit from the package root sees vignettes/ # instead. Try both locations, and fall back to live computation below if # neither is present or readRDS() fails. .cand <- c("varpro_precomputed.rds", file.path("vignettes", "varpro_precomputed.rds")) .hit <- .cand[file.exists(.cand)] if (length(.hit)) readRDS(.hit[[1]]) else list() }, error = function(e) list()) # missing or unreadable -> live fallback below ``` # What varPro is Random-forest variable importance has, for two decades, mostly meant *permutation importance*: shuffle a column, measure the drop in OOB accuracy, repeat. It works, but the score is computed against artificially modified data that never appeared in the training set, and when predictors are correlated the shuffled column is an implausible counterfactual, not a neutral baseline. Knockoff-style methods clean up some of that, but they introduce their own synthetic features and rely on explicit distributional assumptions about the predictors. varPro [@Lu2024varpro] takes a different path: one grounded entirely in *observed* data. Think of each decision tree as a long chain of "if/then" clauses. varPro harvests those clauses as *rules*, and for each rule it identifies a specific region of the predictor space where a handful of variables jointly constrain the response. To measure one variable's contribution, it compares a *local estimator* (the response summary restricted to that rule's region) against a "released" estimator where the constraint on the tested variable has been lifted. That comparison, averaged over rules and trees, is the variable's importance z-score. Every observation used in the contrast is real training data; nothing is permuted, nothing is synthesised. The "release" metaphor is the right one. Imagine holding a rope at several points along its length (the rule constraints). Releasing your grip at one point (while keeping the others in place) tells you how taut that section was. Variables that go slack when released were carrying load; variables that stay taut were redundant given the rest. Operationally, varPro's *guided splitting* steers tree growth so that rules concentrate in informative regions, and the subsequent *rule harvesting* step selects the rules that survived the importance pre-filter. Variables below a noise threshold are dropped before reporting (the importance z is pre-filtered), so the final ranking contains only variables that earned a seat at the table (see the [varPro tools reference](https://www.varprotools.org/articles/getstarted.html) for worked implementation details). Split-weights encode each variable's relative importance at each tree node, propagating that signal through the release-rule contrasts and into the partial-dependence curves. Cross-validation via `cv.varpro()` returns three views of the importance ranking: `$imp` (the default, point-estimate ranking), `$imp.conserve` (one-SD rule: only variables whose lower confidence bound is positive), and `$imp.liberal` (any variable with a positive point estimate). The conservative cut is the right default when you want to hand a short list to a collaborator; the liberal cut is better for exploratory screening when you'd rather miss nothing. For classification, the importance decomposes into an `$unconditional` score (averaged over all classes) and per-class `$conditional.z` scores. When a variable separates *one* class from the others but is uninformative overall, the unconditional z can be near zero while a single conditional z is large; the conditional view catches that. One-hot consolidation via `get.orgvimp()` and `get.topvars()` maps the per-level scores back to the original factor, so downstream reporting stays on familiar ground. That core machinery feeds six ggRandomForests wrappers. **`gg_varpro()`** summarises the per-tree importance distribution. **`gg_beta_varpro()`** refines those release-rule contrasts with a per-rule lasso. **`gg_partial_varpro()`** turns the release machinery into partial-dependence curves. **`gg_udependent()`** reads cross-variable dependency off a `uvarpro()` fit. **`gg_isopro()`** scores observations for anomaly using an isolation-forest variant. **`gg_ivarpro()`** computes per-observation local importance. This vignette walks all six wrappers on three worked examples: a regression problem (Boston housing), a classification problem (iris, binary and multi-class), and a survival problem (PBC). The closing section is a one-page reference matrix mapping each wrapper to the forest families it supports. # Regression: Boston housing We start with the Boston housing data: 506 census tracts, 13 numeric predictors, median home value as the response. ```{r boston-fit} data("Boston", package = "MASS") set.seed(20260527L) v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) v_boston ``` `varpro()` builds the importance machinery (release rules, per-tree scores, the splitweight vector) once. Every subsequent figure in this section reads off that single fit. The formula interface is intentional: varPro is model-independent in the sense that it doesn't assume a linear or parametric relationship between predictors and response, but it does require you to name a response so the guided splitting has something to optimise toward. ## Per-tree importance with `gg_varpro()` The first question varPro answers is the same one permutation importance answers: which variables matter, ranked. `gg_varpro()` extracts the per-tree importance scores and draws their distribution as a horizontal boxplot, one row per variable, sorted by median z-score. Variables above a configurable cutoff (default 0.79) are highlighted. ```{r boston-gg-varpro} plot(gg_varpro(v_boston)) ``` The narrow boxes near the top are the variables varPro is confident about: every tree agrees they matter. Wide boxes that straddle the cutoff line are the ones to look at twice; the forest disagrees with itself. That disagreement isn't noise to suppress: it can mean the variable matters in some rule regions but not others, which is exactly the kind of structured heterogeneity that model-independent methods are built to detect. Variables that fall entirely below the cutoff were pre-filtered by the importance z threshold; they're gone before the plot is drawn. ## Partial dependence with `gg_partial_varpro()` The ranking view tells you *which* variables matter. Partial dependence asks the next question: *how* does the response change with a variable, holding the others fixed? `gg_partial_varpro()` wraps `varPro::partialpro()` and returns a tidy frame of parametric, non-parametric, and causal partial-dependence curves. ```{r boston-gg-partial-varpro} # Precomputed offline (see precompute_varpro.R); falls back to a live fit. gg_pd <- if (is.null(.vp$pd_boston)) { gg_partial_varpro(object = v_boston) } else { .vp$pd_boston } plot(gg_pd) ``` Each panel is a single predictor. The three curves correspond to the three estimators varPro carries (parametric, non-parametric, and causal); read them as a sensitivity analysis. When all three agree, you have a stable signal; when the causal curve diverges from the others, that's a hint that the variable's observed relationship with the response may be partly driven by its correlation with other predictors in the rule regions, not by a direct effect. Because `partialpro()` operates within the *local* neighborhoods defined by the release rules, these curves reflect what happens in the part of the predictor space the forest actually visited, not a global extrapolation from a fitted model. That is both a strength (no out-of-distribution extrapolation) and a limit (sparse regions of the predictor space will have wider effective intervals, even if the plot doesn't always show them explicitly). ## Per-rule lasso refinement with `gg_beta_varpro()` `gg_varpro()`'s ranking is built from the release-rule contrast. `gg_beta_varpro()` re-asks the question variable by variable inside each rule: it fits a one-predictor lasso of the response on the released variable, restricted to the OOB observations in that rule's region. The fitted coefficient (call it a "local β") captures the variable's linear effect *within that rule's neighborhood*, not globally. Aggregating `mean(|β|)` across rules gives one number per variable: a regression-coefficient-flavoured importance, not a VIMP score, and not a global slope. That distinction matters in practice. A variable with a strong nonlinear global relationship may have locally small β values inside any single rule (the local-standardisation step within each rule normalises the scale), but many rules will fire on it, so the aggregated mean is still large. Conversely, a variable with a nearly linear global effect will concentrate most of its weight in a handful of rules, and the between-rule variability in β will be low. Because `beta.varpro()` is expensive (a `glmnet` per rule), the wrapper accepts a pre-computed `beta_fit` so you can iterate on selection, cutoff, or class choice without re-fitting. ```{r boston-beta-fit, cache=TRUE} # Precomputed offline (see precompute_varpro.R); falls back to a live fit. b_boston <- if (is.null(.vp$b_boston)) { varPro::beta.varpro(v_boston) } else { .vp$b_boston } ``` ```{r boston-gg-beta-varpro} plot(gg_beta_varpro(v_boston, beta_fit = b_boston)) ``` Compare against `gg_varpro()` above. Disagreement is the diagnostic signal: a variable that ranks high here but low there is one whose local linear effect inside many rules is real even when the release contrast is modest. ## Cross-variable dependency with `gg_udependent()` The three views so far take one variable at a time. `gg_udependent()` reads cross-variable structure off a `uvarpro()` fit and draws the result as a network: nodes are variables, edges are dependencies above a configurable threshold. The visual is built with `ggraph`, which is in `Suggests` rather than `Imports`; install it if you want this view. ```{r boston-uvarpro, cache=TRUE} u_boston <- varPro::uvarpro(Boston[, setdiff(names(Boston), "medv")], ntree = 50) ``` ```{r boston-gg-udependent, eval = requireNamespace("ggraph", quietly = TRUE)} plot(gg_udependent(u_boston)) ``` Clusters of mutually-connected variables are worth checking for redundancy: they may be several views of the same underlying quantity. `uvarpro()` operates on the predictor matrix alone; there is no response in the fit, which is what makes the network genuinely unsupervised. Variables that cluster here are correlated in feature space regardless of whether any of them matters for prediction. That information is complementary to the ranking from `gg_varpro()`: a variable can be important for prediction and still sit in a tight cluster with a near-duplicate; in that situation, model parsimony may favour dropping one of the cluster members without losing much predictive accuracy. ## Anomaly scoring with `gg_isopro()` Variable importance is one axis; *observation* outlierness is another. `gg_isopro()` wraps `varPro::isopro()` (an isolation-forest variant that scores how anomalous each training row looks) and renders the result as a ranked elbow plus a density of the scores. The score is on `[0, 1]`; the wrapper's convention is "higher = more anomalous" (opposite of varPro's native polarity; the wrapper flips it for consistency). ```{r boston-isopro, cache=TRUE} iso_boston <- varPro::isopro(data = Boston[, setdiff(names(Boston), "medv")], method = "rnd", sampsize = 256, ntree = 50) ``` ```{r boston-gg-isopro} plot(gg_isopro(iso_boston)) ``` The elbow flags rows that diverge from the bulk. Pair with the domain: anomalous in feature space is not the same as wrong, but it's often where the most interesting cases live. In a regression context, a high-scoring row is one that the isolation forest isolated quickly because its feature combination is rare; it may or may not be an outlier in the response. Anomaly scoring and residual analysis answer different questions, and it's worth doing both. Note that `isopro()` scores are not calibrated to a universal scale: a score of 0.7 in one dataset is not comparable to 0.7 in another. What matters is the relative ordering within a dataset and the shape of the elbow: a sharp kink at a small number of observations is a cleaner signal than a gradual slope that never levels off. ## Local importance with `gg_ivarpro()` The wrappers so far aggregate across observations. `gg_ivarpro()` does the opposite: it returns one value per (observation, variable) pair, capturing how much variable *v* contributed to predicting observation *i*. The aggregate view is a distribution of those local importances per variable; the per-observation view is a horizontal bar of one row's local importances across variables. `ivarpro()` is the most expensive call in varPro, so the wrapper accepts a pre-computed `ivarpro_fit` for reuse across views. ```{r boston-ivarpro-fit, cache=TRUE} iv_boston <- varPro::ivarpro(v_boston) ``` ```{r boston-gg-ivarpro-distribution} plot(gg_ivarpro(v_boston, ivarpro_fit = iv_boston)) ``` ```{r boston-gg-ivarpro-which-obs} plot(gg_ivarpro(v_boston, ivarpro_fit = iv_boston, which_obs = 1L)) ``` The distribution view tells you which variables drive predictions *across* observations. The per-observation view answers the same question for a specific case, useful for explaining one prediction back to whoever asked. The per-observation importance is computed by replaying the release-rule contrasts restricted to the rules that fire for that observation. It is not a Shapley value, but it shares the local-attribution spirit: each bar shows how much that variable's release shifted the local estimator for this specific row, averaged over the rules that covered it. Cases that are well-represented in the training data (dense rule coverage) will have sharper attribution than cases in sparse regions of the predictor space. # Classification: iris Iris is a small data set (150 rows, four predictors, three response classes), and that's a feature here, not a flaw: every figure renders in under a second, and the structure is well-understood enough that any strange behaviour stands out. It is also a good stress-test for the conditional importance path: petal length and petal width separate *setosa* from everything else very cleanly, but the *versicolor*/*virginica* boundary is much softer. A method that only reports unconditional importance would lump both cases together; the conditional decomposition should show the asymmetry. Two fits: a binary problem (drop *setosa*, positive class = *virginica*) and the full three-class problem. ```{r iris-fit-binary} iris_binary <- iris[iris$Species != "setosa", ] iris_binary$Species <- droplevels(iris_binary$Species) set.seed(20260527L) v_iris_binary <- varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) ``` ```{r iris-fit-multi} set.seed(20260527L) v_iris_multi <- varPro::varpro(Species ~ ., data = iris, ntree = 50) ``` ## Class-conditional importance with `gg_varpro(conditional = TRUE)` For a classification forest, `gg_varpro()` can split the importance view into one facet per class. Variables keep the unconditional sort order so rows line up across facets; read along a row to see which class a variable is informative for. The conditional importance uses `$conditional.z` from the varPro fit: per-class release-rule contrasts computed within the same rule regions as the unconditional score, but with the local estimator replaced by a per-class probability. A variable that separates one class from the rest will have a high `$conditional.z` for that class and a near-zero or even negative value for the others. The unconditional score in `$unconditional` is the average, and can mask class-specific effects, which is why the `conditional = TRUE` faceted view exists. ```{r iris-multi-gg-varpro-conditional} plot(gg_varpro(v_iris_multi, conditional = TRUE)) ``` ## Partial dependence: `gg_partial_varpro()` on classification `gg_partial_varpro()` works the same way on a classification fit, but each predictor's curve becomes per-class probabilities. The plot panels are organised by predictor with class encoded as colour or facet (the wrapper defaults handle this). ```{r iris-multi-gg-partial-varpro} # Precomputed offline (see precompute_varpro.R); falls back to a live fit. gg_pd_iris <- if (is.null(.vp$pd_iris_multi)) { gg_partial_varpro(object = v_iris_multi) } else { .vp$pd_iris_multi } plot(gg_pd_iris) ``` Read each panel as: "as predictor X changes from low to high, how does the probability of each class shift?" Patterns that match the underlying biology (e.g. petal length separating setosa from the others) act as a sanity check on the forest. ## Per-class lasso refinement with `gg_beta_varpro()` On a classification fit, `gg_beta_varpro()` returns one row per (variable, class) pair. For a binary fit, `which_class = NULL` defaults to the last factor level (the positive class), so the headline view is a single panel of that class.[^mortality] [^mortality]: The same code pattern applies to clinical binary outcomes such as 30-day mortality: drop the negative class, set the event class as the last factor level, and read the figure for the event panel. ```{r iris-binary-beta-fit, cache=TRUE} b_iris_binary <- varPro::beta.varpro(v_iris_binary) ``` ```{r iris-binary-gg-beta-varpro} plot(gg_beta_varpro(v_iris_binary, beta_fit = b_iris_binary)) ``` For a multi-class fit, the default view is faceted by class with each class sharing the row order set by the unified ranking, same trick as `gg_varpro(conditional = TRUE)`. ```{r iris-multi-beta-fit, cache=TRUE} b_iris_multi <- varPro::beta.varpro(v_iris_multi) ``` ```{r iris-multi-gg-beta-varpro} plot(gg_beta_varpro(v_iris_multi, beta_fit = b_iris_multi)) ``` `which_class = ""` collapses the faceted view to a single class when you want it. # Survival: PBC Survival is the family where the varPro toolchain shows its limits, and being explicit about those limits is more useful than pretending they don't exist. The forest-fitting and the family-agnostic wrappers all work; the lasso-refined and individual-importance wrappers don't, because the underlying `varPro::beta.varpro()` and `varPro::ivarpro()` calls don't yet extend to right-censored outcomes. A per-rule local lasso for a censored response requires a local partial-likelihood or Nelson-Aalen estimator in place of the regression/classification local estimator; the design work for that is tracked but not yet landed. What does work is the core release-rule importance, the C-path partial dependence (routed through the embedded `$rf` random survival forest), and anomaly scoring on the predictor matrix. For many applied problems, those three views cover the questions you actually want to answer. The PBC (primary biliary cirrhosis) dataset from `randomForestSRC` has 418 patients, seven predictors, and a Surv-encoded outcome of days to event (death or transplant, status ∈ {0, 1, 2}). We use a small seven-variable subset so the vignette fits quickly. For a full analysis including time-dependent covariates, @Lee:2021 demonstrates the boosted nonparametric hazard framework that varPro's survival path draws on. ```{r pbc-data} library(survival) data(pbc, package = "randomForestSRC") pbc_small <- pbc[, c("days", "status", "age", "albumin", "bili", "edema", "platelet")] pbc_small <- na.omit(pbc_small) set.seed(20260527L) v_pbc <- varPro::varpro( Surv(days, status) ~ ., data = pbc_small, ntree = 50 ) ``` ## Variable importance: `gg_varpro()` ```{r pbc-gg-varpro} plot(gg_varpro(v_pbc)) ``` ## Partial dependence: `gg_partial_varpro()` (C-path) For survival, `gg_partial_varpro()` dispatches into the `randomForestSRC`-backed partial-dependence machinery via the underlying `$rf` survival forest (what the package calls the *C-path* because it routes through the `randomForestSRC` C-code layer rather than the varPro release-rule engine). The output shape is the same as the regression case (parametric / non-parametric / causal panels); the y-axis carries an "ensemble mortality" interpretation [@Ishwaran:2007a], not a survival probability. Higher values mean higher predicted mortality at a given predictor value, integrated over the event-time horizon in the training data. ```{r pbc-gg-partial-varpro} # Precomputed offline (see precompute_varpro.R); falls back to a live fit. gg_pd_pbc <- if (is.null(.vp$pd_pbc)) { gg_partial_varpro(object = v_pbc) } else { .vp$pd_pbc } plot(gg_pd_pbc) ``` ## Anomaly scoring: `gg_isopro()` on the X-matrix Because `isopro()` only sees the predictor matrix, it doesn't care about the family. The same call from section 3 works here. ```{r pbc-isopro, cache=TRUE} iso_pbc <- varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", "platelet")], method = "rnd", sampsize = 256, ntree = 50) plot(gg_isopro(iso_pbc)) ``` ## Not available for survival: `gg_beta_varpro`, `gg_ivarpro` `varPro::beta.varpro()` errors on survival fits in the current release (it only supports `regr` and `class`). `gg_ivarpro()` for survival is similarly deferred pending design work on the per-rule risk-scaling story. Both are tracked for v3.1.0. If you call either on a survival fit you'll get a clear error message pointing at the deferred work, not a silent miscalculation. The family-support matrix in §6 records this; the rest of the toolkit that *does* work on survival (`gg_varpro`, `gg_partial_varpro`, `gg_isopro` above; `gg_udependent` shown in §3 on the X-matrix). # Cross-cutting reference ## Family-support matrix | Wrapper | regr | class | surv | regr+ | |---|---|---|---|---| | `gg_partial_varpro` | ✓ | ✓ | ✓ (C-path via `$rf`) | ✗ (not audited) | | `gg_varpro` | ✓ | ✓ (`conditional = TRUE`) | ✓ | ✗ (errors) | | `gg_udependent` | ✓ (uvarpro on X) | ✓ (X) | ✓ (X) | ✓ (X) | | `gg_isopro` | ✓ (X) | ✓ (X) | ✓ (X) | ✓ (X) | | `gg_beta_varpro` | ✓ | ✓ | ✗ (upstream stop) | ✗ (deferred) | | `gg_ivarpro` | ✓ | ✓ | ✗ (deferred) | ✗ (deferred) | The four wrappers in the lower-right are the v3.1.0 work surface. ## Factor-level ordering Across `gg_beta_varpro()` and `gg_ivarpro()`, the `variable` column is stored as a factor whose levels are set by descending aggregate importance (`mean(|imp|)` summed across classes for classification). The default plot inherits that ordering, so faceted views show variables in the same row order across panels. If you re-shape the frame downstream and want the order preserved, keep `variable` as a factor rather than coercing to character. This convention will be propagated to `gg_vimp` and `plot.gg_varpro(conditional = TRUE)` in a follow-up release. ## Caching the expensive calls `varPro::beta.varpro()` and `varPro::ivarpro()` are the two heavy calls. Both wrappers accept a pre-computed fit (`beta_fit`, `ivarpro_fit`) so you can iterate on selection, observation index, or cutoff without re-fitting the lasso or the local-importance machinery. The vignette uses this throughout: every section computes the heavy fit once in a `cache: true` chunk and re-uses it for every figure. Provenance carries `precomputed = TRUE` when the cached path was used, so downstream tooling can tell the two paths apart. ## Provenance shape `attr(., "provenance")$cutoff` is always a *named numeric vector* across the toolkit: - regression: length 1, named `"regr"` - classification: length K, named with the response factor levels Downstream code that picks a value should read it as a vector (`prov$cutoff[[class_name]]` or `prov$cutoff[[1]]`), not as a scalar. This contract was established in v2.7.3.9012 (PR #98). # Further reading The release-rule framework is laid out in @Lu2024varpro. Worked implementation examples and the full API are at the [varPro tools reference site](https://www.varprotools.org/articles/getstarted.html). The ensemble mortality scale used by survival partial dependence is introduced in @Ishwaran:2007a (the Annals of Applied Statistics methods paper for random survival forests); the R-package side is described in @Ishwaran:2008. The boosted nonparametric hazard framework that informs varPro's survival path (including the treatment of time-dependent covariates) is in @Lee:2021. Each wrapper's help page carries a "What this is doing" section that goes one level deeper than this vignette. The cross-cutting reference in §6 maps each wrapper to the forest families it supports and notes which capabilities are deferred to v3.1.0. # References