---
title: "Exploring Random Forests with ggRandomForests"
author: "John Ehrlinger"
date: today
format:
  html:
    fig-format: png
    fig-dpi: 96
    toc: true
    html-math-method: mathjax
editor:
  markdown:
    wrap: 80
vignette: >
  %\VignetteIndexEntry{Exploring Random Forests with ggRandomForests}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

A fitted random forest carries a lot of information, but getting at it usually
means digging through list structures that were never meant to be plotted
directly. **ggRandomForests** does that digging for you: it pulls tidy data
objects out of a `randomForestSRC` or `randomForest` fit, and those objects drop
straight into the `ggplot2` workflows you already know. A second engine,
`varPro`, powers a parallel family of functions for release-rule importance and
related diagnostics; that family is covered in the companion vignette referenced
at the end.

This vignette walks through the three objects you will reach for most often
(`gg_error`, `gg_variable`, and `gg_vimp`), plus a small helper for cutting a
predictor into evenly populated groups.

```{r pkg-setup, include=FALSE}
if (requireNamespace("ggRandomForests", quietly = TRUE)) {
  library(ggRandomForests)
} else if (requireNamespace("pkgload", quietly = TRUE)) {
  pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE)
} else {
  stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.")
}
```

## Error trajectories with `gg_error()`

```{r error-demo}
library(randomForest)
set.seed(42)
rf_iris <- randomForest(Species ~ ., data = iris, ntree = 200, keep.forest = TRUE)
err_df <- ggRandomForests::gg_error(rf_iris, training = TRUE)
head(err_df)
```

A forest's error rate settles down as trees are added, and the `gg_error()`
object lets you watch that happen. It holds the cumulative out-of-bag (OOB)
error rate for each outcome column, indexed by the `ntree` counter.
Ask for `training = TRUE` and the function reconstructs the original model frame
and adds the in-bag error trajectory (`train`) as well, so you can see both
curves at once:

```{r error-plot, fig.height=4}
plot(err_df)
```

## Marginal dependence via `gg_variable()`

```{r variable-demo}
set.seed(99)
boston <- MASS::Boston
rf_boston <- randomForest(medv ~ ., data = boston, ntree = 150)
var_df <- ggRandomForests::gg_variable(rf_boston)
str(var_df[, c("lstat", "yhat")])
```

`gg_variable()` recovers the training data straight from the model call, so it
still works when the forest was fit inside a helper function or against a
`subset()` expression, cases where the data is not sitting in the global
environment. The object you get back keeps the raw predictors alongside the
prediction: a single `yhat` column for regression, or one `yhat.` column per
class for classification. To plot one predictor, name it with `xvar`:

```{r variable-plot, fig.height=4}
plot(var_df, xvar = "lstat")
```

Survival forests can request multiple horizons using the `time` argument;
non-OOB predictions are available by setting `oob = FALSE`.

## Variable importance with `gg_vimp()`

```{r vimp-demo}
vimp_df <- ggRandomForests::gg_vimp(rf_boston)
head(vimp_df)
plot(vimp_df)
```

`gg_vimp()` measures permutation importance: each predictor is permuted in turn,
and the drop in OOB accuracy gives its score. This contrasts with the
`gg_varpro` family, which uses release-rule importance from the `varPro` engine.

Variable importance is not always stored on the fitted object. If a
`randomForest` fit is missing its importance scores, `gg_vimp()` will try to
compute them for you. When even that is not possible (the forest was grown with
`importance = FALSE` and the predictors are no longer reachable), the function
warns and returns `NA` in place of the scores, so a plot still draws rather than
failing outright.
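
To make the permutation idea concrete, here is a minimal base-R sketch of the
mechanism. It is an illustration under stated assumptions, not how `gg_vimp()`
or `randomForest` compute their scores: a linear model on `mtcars` stands in for
the forest, in-sample mean squared error stands in for OOB error, and
`perm_importance()` is a hypothetical helper written for this example.

```r
# Hypothetical helper: permutation importance for any model with a predict()
# method. Assumption: in-sample MSE stands in for the OOB error a forest uses.
perm_importance <- function(fit, data, response, n_rep = 20) {
  mse <- function(d) mean((d[[response]] - predict(fit, newdata = d))^2)
  baseline <- mse(data)
  predictors <- setdiff(names(data), response)

  scores <- vapply(predictors, function(v) {
    increases <- replicate(n_rep, {
      shuffled <- data
      # Shuffle one column to break its link with the response
      shuffled[[v]] <- sample(shuffled[[v]])
      mse(shuffled) - baseline
    })
    mean(increases)  # average error increase over the repeated shuffles
  }, numeric(1))

  sort(scores, decreasing = TRUE)
}

set.seed(1)
cars <- mtcars[, c("mpg", "wt", "hp", "qsec")]
fit <- lm(mpg ~ ., data = cars)
imp <- perm_importance(fit, cars, response = "mpg")
imp
```

A predictor whose shuffling barely moves the error gets a score near zero; one
the model leans on heavily gets a large score. That ordering is what the
`gg_vimp()` plot displays.
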
## Balanced conditioning cuts with `quantile_pts()`

```{r quantile-demo}
rm_breaks <- ggRandomForests::quantile_pts(boston$rm, groups = 6, intervals = TRUE)
rm_groups <- cut(boston$rm, breaks = rm_breaks)
table(rm_groups)
```

When you build a coplot, you want each conditioning group to hold a roughly
equal share of the data, because equal-width bins leave the sparse tails nearly
empty. `quantile_pts()` wraps `stats::quantile()` to give you break points that
do exactly that, and they pass straight to `cut()` for the grouping or facet
labels.

## Next steps

* The full API reference lives at .
* `?gg_error`, `?gg_variable`, `?gg_vimp`, and `?quantile_pts` cover the
  remaining arguments and have their own examples.
* The `gg_error`, `gg_variable`, and `gg_vimp` objects shown here are tidy data
  frames underneath, so you can skip the `plot()` methods entirely and build the
  figure yourself with `ggplot2`.
* For the full varPro toolkit (release-rule importance, lasso-refined
  importance, per-observation local importance, anomaly scores, and the
  dependency graph) walked across regression, classification, and survival
  examples, see `vignette("varpro", package = "ggRandomForests")`.
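
As a closing aside, the balanced-cut idea from the `quantile_pts()` section can
be checked with base R alone. The sketch below calls `stats::quantile()`
directly on simulated right-skewed data and compares the group counts against
equal-width bins. The simulated `x` and the variable names are assumptions made
for illustration, and the code does not reproduce the internals of
`quantile_pts()`.

```r
set.seed(7)
x <- rexp(500)  # right-skewed stand-in for a predictor with a sparse upper tail

groups <- 6

# Equal-count breaks: quantiles at evenly spaced probabilities
q_breaks <- quantile(x, probs = seq(0, 1, length.out = groups + 1))
q_counts <- table(cut(x, breaks = q_breaks, include.lowest = TRUE))

# Equal-width breaks over the same range, for comparison
w_counts <- table(cut(x, breaks = groups, include.lowest = TRUE))

q_counts
w_counts
```

The quantile-based groups hold nearly identical counts, while the equal-width
groups pile most observations into the first bin and leave the upper bins
sparse.
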