---
title: "Backend Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Backend Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ReproStat)
set.seed(20260324)
```

## Overview

ReproStat supports multiple model-fitting backends through the same high-level API. That means you can often keep the same reproducibility workflow while changing only the modeling engine.

Supported backends are:

- `"lm"` for ordinary least squares
- `"glm"` for generalized linear models
- `"rlm"` for robust regression via `MASS`
- `"glmnet"` for penalized regression via `glmnet`

This article explains when to use each one and what changes in the returned diagnostics.

## Common interface

The same entry point is used across backends:

```r
run_diagnostics(
  formula,
  data,
  B = 200,
  method = "bootstrap",
  backend = "lm"
)
```

The key differences are in:

- how the model is fit
- which quantities are available
- how to interpret selection-related outputs

## Backend: lm

`"lm"` is the default backend and is the best place to start for standard linear regression.

```{r lm-example}
diag_lm <- run_diagnostics(
  mpg ~ wt + hp + disp,
  data = mtcars,
  B = 100,
  backend = "lm"
)
reproducibility_index(diag_lm)
```

Use `"lm"` when:

- the response is continuous
- ordinary least squares is the intended analysis
- you want the simplest interpretation of all components

## Backend: glm

Use `"glm"` when you need a generalized linear model, such as logistic or Poisson regression.

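As a rough illustration of the two model families named above, the sketch below fits a logistic and a Poisson model with base R's `stats::glm()` on `mtcars`. It is independent of ReproStat: how the `"glm"` backend fits models internally is not described in this guide, and treating `carb` as a count response is an assumption made only for the example.

```r
# Base-R sketch only: stats::glm() stands in for a generic GLM fit and is
# not a description of ReproStat internals.

# Logistic regression: binary response (am is coded 0/1 in mtcars).
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

# Poisson regression: count response (carb is used purely for illustration).
fit_pois <- glm(carb ~ wt + hp, data = mtcars, family = poisson())

# Response-scale predictions: probabilities for the logistic fit,
# expected counts for the Poisson fit.
p_logit <- predict(fit_logit, type = "response")
mu_pois <- predict(fit_pois, type = "response")

range(p_logit)  # probabilities, so within [0, 1]
range(mu_pois)  # expected counts, so strictly positive
```

In a `run_diagnostics()` call, the same kind of family object is supplied through the `family` argument.
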
```{r glm-example}
diag_glm <- run_diagnostics(
  am ~ wt + hp + qsec,
  data = mtcars,
  B = 100,
  backend = "glm",
  family = stats::binomial()
)
reproducibility_index(diag_glm)
```

Notes:

- if you provide `family = ...` while leaving `backend = "lm"`, the function promotes the fit to `"glm"`
- prediction stability for GLMs uses response-scale predictions
- p-value and selection summaries remain available

## Backend: rlm

Use `"rlm"` when you want robustness against outliers or heavy-tailed error behavior.

```{r rlm-example, eval = requireNamespace("MASS", quietly = TRUE)}
if (requireNamespace("MASS", quietly = TRUE)) {
  diag_rlm <- run_diagnostics(
    mpg ~ wt + hp + disp,
    data = mtcars,
    B = 100,
    backend = "rlm"
  )
  reproducibility_index(diag_rlm)
}
```

Use `"rlm"` when:

- a few influential observations may distort OLS results
- you want a more robust regression baseline
- you still want coefficient, selection, prediction, and RI summaries in a familiar regression framework

## Backend: glmnet

Use `"glmnet"` when you want penalized regression such as LASSO, ridge, or elastic net.

```{r glmnet-example, eval = requireNamespace("glmnet", quietly = TRUE)}
if (requireNamespace("glmnet", quietly = TRUE)) {
  diag_glmnet <- run_diagnostics(
    mpg ~ wt + hp + disp + qsec,
    data = mtcars,
    B = 100,
    backend = "glmnet",
    en_alpha = 1
  )
  reproducibility_index(diag_glmnet)
}
```

The `en_alpha` argument controls the penalty mix:

- `1` gives LASSO
- `0` gives ridge
- values in between give elastic net

Important differences for `"glmnet"`:

- p-values are not defined, so the `pvalue` component is `NA`
- selection stability measures non-zero selection frequency
- RI values are therefore based on a different component set than the non-penalized backends

## Backend comparison summary

| Backend | Best for | P-values available? | Selection meaning |
|---------|----------|---------------------|-------------------|
| `"lm"` | standard linear regression | yes | sign consistency |
| `"glm"` | logistic / GLM use cases | yes | sign consistency |
| `"rlm"` | robust regression | yes | sign consistency |
| `"glmnet"` | penalized regression | no | non-zero frequency |

## Choosing a backend in practice

A simple decision pattern is:

1. Start with `"lm"` if a standard linear model is appropriate.
2. Move to `"glm"` when the response distribution requires it.
3. Use `"rlm"` when outlier resistance matters.
4. Use `"glmnet"` when shrinkage, regularization, or sparse selection is the main modeling goal.

## Comparing RI values across backends

Be careful when comparing RI values between penalized and non-penalized backends.

For `"glmnet"`, the p-value component is unavailable, so the composite score is formed from a different set of ingredients. That makes cross-backend RI comparisons descriptive at best, not strictly apples-to-apples.

## Model comparison with repeated CV

All backends can also be used in `cv_ranking_stability()`:

```{r cv-example}
models <- list(
  compact = mpg ~ wt + hp,
  fuller = mpg ~ wt + hp + disp
)

cv_obj <- cv_ranking_stability(
  models,
  mtcars,
  v = 5,
  R = 20,
  backend = "lm"
)

cv_obj$summary
```

This is especially valuable when you are choosing between competing formulas and want to know not just which model is best on average, but which one is consistently best.

## Next steps

For a broader conceptual explanation, read the interpretation article.

For a complete first analysis, start with `vignette("ReproStat-intro")`.