--- title: "Random Forest Regression with ggRandomForests" author: "John Ehrlinger" date: today format: html: fig-format: png fig-dpi: 96 toc: true toc-depth: 3 html-math-method: mathjax code-fold: false bibliography: ggRandomForests.bib vignette: > %\VignetteIndexEntry{Random Forest Regression with ggRandomForests} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ::: {.callout-warning} ## Work in progress This vignette is under active development. Code examples and narrative may change before the next release. ::: ```{r setup, include=FALSE} knitr::opts_chunk$set( fig.align = "center", fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE, comment = "#>" ) options(mc.cores = 1, rf.cores = 1) ``` # Introduction Random forests [@Breiman:2001] are a non-parametric ensemble method that requires no distributional or functional assumptions on how covariates relate to the response. The method builds a large collection of de-correlated decision trees via bootstrap aggregation (bagging) and random feature selection, then averages their predictions to produce a stable, low-variance estimator. The **randomForestSRC** package [@Ishwaran:RFSRC:2014] provides a unified implementation for survival, regression, and classification forests. **ggRandomForests** extracts tidy data objects from `rfsrc` fits and renders them with **ggplot2** [@Wickham:2009], making it straightforward to explore how a forest is constructed, which variables matter, and how the response depends on individual predictors. This vignette demonstrates a complete random forest regression workflow on the Boston Housing data set [@Harrison:1978; @Belsley:1980]: 1. **Data exploration** --- EDA scatter panels, variable descriptions 2. **Growing the forest** --- fitting an RF, checking OOB error convergence 3. **Variable selection** --- VIMP and minimal depth via `max.subtree()` 4. **Dependence plots** --- variable dependence and partial dependence via `gg_variable()` and `gg_partial_rfsrc()` 5. **Variable interactions** --- conditioning plots and partial dependence surfaces ```{r packages} library(ggplot2) library(dplyr) library(tidyr) library(randomForestSRC) if (requireNamespace("ggRandomForests", quietly = TRUE)) { library(ggRandomForests) } else if (requireNamespace("pkgload", quietly = TRUE)) { pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE) } else { stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.") } theme_set(theme_bw()) ``` # Data: Boston Housing Values The Boston Housing data [@Harrison:1978; @Belsley:1980] is a standard benchmark for regression. It contains data for 506 census tracts of Boston from the 1970 census. The objective is to predict the median value of owner-occupied homes (`medv`, in \$1000s) from 13 predictors covering crime, zoning, industry, environmental quality, and housing characteristics. We use the copy from the **MASS** package [@mass:2002]. ```{r data-load} data(Boston, package = "MASS") Boston$chas <- as.logical(Boston$chas) # nolint: object_name_linter ``` ```{r variable-labels} st_labs <- c( crim = "Crime rate by town", zn = "Residential land zoned > 25k sq ft (%)", indus = "Non-retail business acres (%)", chas = "Borders Charles River", nox = "Nitrogen oxides (10 ppm)", rm = "Rooms per dwelling", age = "Units built before 1940 (%)", dis = "Distance to employment centers", rad = "Highway accessibility index", tax = "Property tax rate per $10,000", ptratio = "Pupil-teacher ratio", black = "Proportion of Black residents", lstat = "Lower status population (%)", medv = "Median home value ($1000s)" ) ``` ## Exploratory data analysis We plot each predictor against the response (`medv`), coloring by the sole categorical variable (`chas`, whether the tract borders the Charles River). A loess smooth highlights the marginal trend. ```{r eda, fig.cap="EDA: each predictor vs. median home value, colored by Charles River indicator."} dta <- Boston |> pivot_longer(c(-medv, -chas), names_to = "variable", values_to = "value") ggplot(dta, aes(x = medv, y = value, color = chas)) + geom_point(alpha = 0.4) + geom_smooth(aes(x = medv, y = value), color = "grey30", inherit.aes = FALSE, se = FALSE) + labs(y = "", x = st_labs["medv"]) + scale_color_brewer(palette = "Set2") + facet_wrap(~variable, scales = "free_y", ncol = 3) ``` Even from this simple view, two relationships stand out: `medv` against `lstat` (lower status %) and `medv` against `rm` (rooms per dwelling). Keep those two in mind --- we expect the random forest to rank them as the most important predictors, and the rest of the vignette comes back to check. # Growing a Random Forest We grow a regression forest using all 13 predictors. The `rfsrc()` function detects the regression family from the continuous response. ```{r rfsrc-fit} rfsrc_Boston <- rfsrc(medv ~ ., data = Boston, # nolint: object_name_linter ntree = 200, importance = TRUE, err.block = 5) rfsrc_Boston ``` The forest grew `r rfsrc_Boston$ntree` trees, splitting on `r rfsrc_Boston$mtry` randomly selected candidate variables at each node, and stopping at a minimum terminal node size of `r rfsrc_Boston$nodesize`. ## OOB error convergence ```{r error-plot, fig.height=4, fig.cap="OOB mean squared error vs. number of trees."} gg_e <- gg_error(rfsrc_Boston) gg_e <- gg_e |> filter(!is.na(error)) class(gg_e) <- c("gg_error", class(gg_e)) plot(gg_e) ``` The error stabilizes well before 500 trees, indicating the forest is large enough for reliable predictions. ## OOB predictions ```{r rfsrc-predicted, fig.height=4, fig.cap="OOB predicted median home values. Points are jittered; boxplot shows the distribution."} plot(gg_rfsrc(rfsrc_Boston), alpha = 0.5) + coord_cartesian(ylim = c(5, 49)) ``` Each point is a single tract's OOB prediction. The distribution is a sanity check --- we are more interested in the *why* behind these predictions. # Variable Selection ## Variable importance (VIMP) VIMP, computed by [randomForestSRC](https://www.randomforestsrc.org), measures the increase in OOB prediction error when a variable's values are randomly permuted across the out-of-bag observations [@Breiman:2001]. Permutation severs the variable's link to the response on purpose: if breaking that link hurts accuracy, the variable is carrying real signal. Large positive values mean the variable is essential; negative values suggest it is no more informative than noise. ```{r vimp-plot, fig.cap="VIMP ranking. Longer blue bars indicate more important variables."} plot(gg_vimp(rfsrc_Boston), lbls = st_labs) ``` `lstat` and `rm` dominate, with a clear gap to the remaining predictors. All VIMP values are positive, indicating every predictor contributes at least marginally. The permutation approach contrasts with varPro release-rule importance [@Lu2024varpro], available through `gg_varpro()`. Rather than perturbing data synthetically, varPro compares local estimators on the observed data directly: no permutation, no manufactured feature values. Because the two methods measure fundamentally different things, a variable can rank high under one and low under the other. When they agree, the evidence is strong; when they disagree, that disagreement itself is worth investigating, pointing either to a variable whose effect is highly non-linear or to one that matters only in combination with others. ## Minimal depth Minimal depth [@Ishwaran:2010] ranks variables by how close to the root node they first split, on average. Variables that partition large portions of the population early are considered most important. ```{r varsel} md_Boston <- max.subtree(rfsrc_Boston) # nolint: object_name_linter ``` The threshold is `r round(md_Boston$threshold, 2)`, selecting `r length(md_Boston$topvars)` variables: `r paste(md_Boston$topvars, collapse = ", ")`. Both VIMP and minimal depth agree on the dominance of `lstat` and `rm`. We use the minimal depth top variables for the remainder of the analysis. ```{r select-vars} xvar <- md_Boston$topvars ``` # Variable Dependence ## Variable dependence plots Variable dependence shows each tract's OOB predicted `medv` plotted against a predictor, with a loess smooth indicating the trend. ```{r vardep-panel, fig.cap="Variable dependence for top predictors (minimal depth rank order)."} gg_v <- gg_variable(rfsrc_Boston) plot(gg_v, xvar = xvar, panel = TRUE, alpha = 0.5) + labs(y = st_labs["medv"], x = "") ``` The panels confirm what EDA suggested: `medv` decreases sharply with `lstat` and increases with `rm`, both in strongly non-linear ways. The remaining variables show weaker but still discernible trends. ```{r vardep-chas, fig.height=4, fig.cap="Variable dependence for Charles River (categorical)."} plot(gg_v, xvar = "chas", alpha = 0.4) + labs(y = st_labs["medv"]) ``` Most tracts do not border the Charles River, and the predicted value distributions largely overlap --- consistent with `chas` ranking last in both VIMP and minimal depth. ## Partial dependence Partial dependence integrates out the effects of all other covariates, giving a risk-adjusted view of each predictor's marginal effect [@Friedman:2000]: $$ \tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n \hat{f}(x, x_{i,o}) $$ We use `gg_partial_rfsrc()`, which calls `randomForestSRC::partial.rfsrc()` directly and returns a `gg_partial_rfsrc` object. The quickest path to a figure is `plot(pd)`, which sorts out continuous and categorical variables for you. When you want to control the layout yourself, the underlying data is in `pd$continuous`. ```{r partial-dep, fig.cap="Partial dependence for top predictors."} pd <- gg_partial_rfsrc(rfsrc_Boston, xvar.names = xvar) # Quick S3 plot — works out of the box for the standard regression case plot(pd) ``` For a publication-ready layout with custom axis labels, access the underlying data frame directly: ```{r partial-dep-custom, fig.cap="Partial dependence (custom styling)."} ggplot(pd$continuous, aes(x = x, y = yhat)) + geom_line(color = "steelblue", linewidth = 1) + facet_wrap(~name, scales = "free_x") + labs(y = st_labs["medv"], x = "") + theme_bw() ``` `lstat` shows a strongly concave relationship, while `rm` stays flat below about 6 rooms and then climbs sharply. Shapes like these are awkward to capture with a simple parametric transform, since you would have to guess the form in advance, but the random forest picks them up on its own. # Variable Interactions and Conditioning Plots ## Conditioning on a categorical variable The simplest coplot conditions on a categorical variable. Here we examine `medv` vs. `lstat`, split by Charles River status: ```{r coplot-chas, fig.height=4, fig.cap="Variable dependence on lstat, conditional on Charles River."} gg_v$chas_label <- ifelse(gg_v$chas, "Borders Charles River", "Does not border") plot(gg_v, xvar = "lstat", alpha = 0.5) + labs(y = st_labs["medv"], x = st_labs["lstat"]) + theme(legend.position = "none") + facet_wrap(~chas_label) ``` The decreasing trend holds in both groups, with slightly higher values along the Charles River at every `lstat` level. ## Conditioning on a continuous variable To investigate the `lstat`--`rm` interaction, we bin `rm` into six quantile groups using `quantile_pts()` and facet: ```{r coplot-rm, fig.cap="Predicted medv vs. lstat, conditional on rooms-per-dwelling groups."} rm_pts <- quantile_pts(rfsrc_Boston$xvar$rm, groups = 6, intervals = TRUE) gg_v$rm_grp <- cut(rfsrc_Boston$xvar$rm, breaks = rm_pts) levels(gg_v$rm_grp) <- paste("rm in", levels(gg_v$rm_grp)) plot(gg_v, xvar = "lstat", alpha = 0.5) + labs(y = st_labs["medv"], x = st_labs["lstat"]) + theme(legend.position = "none") + scale_color_brewer(palette = "Set3") + facet_wrap(~rm_grp) ``` Median values decrease with `lstat` within every `rm` group, but the intercept shifts upward with more rooms. Smaller homes in low-`lstat` (high-status) neighborhoods still command high prices. The complement view --- `medv` vs. `rm`, conditional on `lstat` groups --- completes the picture: ```{r coplot-lstat, fig.cap="Predicted medv vs. rooms, conditional on lower-status groups."} lstat_pts <- quantile_pts(rfsrc_Boston$xvar$lstat, groups = 6, intervals = TRUE) gg_v$lstat_grp <- cut(rfsrc_Boston$xvar$lstat, breaks = lstat_pts) levels(gg_v$lstat_grp) <- paste("lstat in", levels(gg_v$lstat_grp)) plot(gg_v, xvar = "rm", alpha = 0.5) + labs(y = st_labs["medv"], x = st_labs["rm"]) + theme(legend.position = "none") + scale_color_brewer(palette = "Set3") + facet_wrap(~lstat_grp) ``` The `rm` effect is strongest in low-`lstat` tracts (bottom-left panels) and nearly flat in high-`lstat` tracts, confirming a meaningful interaction. # Partial Dependence Surface To visualize the joint partial dependence of `medv` on `lstat` and `rm`, we compute partial dependence on a grid: 10 values of `rm`, each evaluated at 25 points along `lstat`. ```{r surface-data, cache=TRUE} rm_grid <- quantile_pts(rfsrc_Boston$xvar$rm, groups = 10) surface_list <- lapply(rm_grid, function(rm_val) { newx <- rfsrc_Boston$xvar newx$rm <- rm_val pd_rm <- gg_partial_rfsrc(rfsrc_Boston, xvar.names = "lstat", newx = newx) df <- pd_rm$continuous df$rm <- rm_val df }) surface_df <- bind_rows(surface_list) ``` ```{r pd-surface, fig.cap="Partial dependence surface: median home value as a function of lstat and rm. Fill colour is the predicted median value."} ggplot(surface_df, aes(x = x, y = rm, fill = yhat)) + geom_tile() + scale_fill_viridis_c(name = "Median Value\n($1000s)") + labs(x = "Lower Status (%)", y = "Rooms per Dwelling") + theme_bw() ``` The surface confirms the strong interaction: home values are highest when `lstat` is low and `rm` is high (upper-left corner), dropping steeply along both axes. The non-planar curvature --- particularly the sharp step near `rm` = 7 --- demonstrates the kind of complex, non-linear structure that random forests capture naturally. # Conclusion We have walked a full random forest regression analysis with **randomForestSRC** and **ggRandomForests**, and the pieces line up: - `gg_error()` showed the OOB error settling well before 500 trees. - VIMP (`gg_vimp()`) and minimal depth (`max.subtree()`) agreed on the same story: `lstat` and `rm` dominate, with a clear gap to everything else. - `gg_variable()` traced strongly non-linear predictor--response curves, the same shapes the raw-data EDA hinted at. - Partial dependence from `gg_partial_rfsrc()` gave the risk-adjusted version of those curves: concave for `lstat`, threshold-like for `rm`. - Conditioning plots and the partial dependence surface pulled out the `lstat`--`rm` interaction, with the room-size effect strongest in high-status tracts. Notice the pattern in all of this. Each `gg_*()` function returns a tidy object (often a data frame, sometimes a small list of data frames); the plotting is a separate step. Use the package's `plot()` methods when the default figure is what you want, and reach for `ggplot2` directly when it is not. # References