--- title: "4. Fit and Explore Selected Models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{4. Fit and Explore Selected Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 6 ) ``` # Summary - [Description](#description) - [Getting ready](#getting-ready) - [Fitting selected models](#fitting-selected-models) - [Response curves](#response-curves) - [All response curves](#all-response-curves) - [Customized response curves](#customized-response-curves) - [Bivariate response curves](#bivariate-response-curves) - [Variable importance](#variable-importance) - [Evaluation with independent data](#evaluation-with-independent-data) - [Saving a fitted_models object](#saving-a-fitted_models-object)
# Description After the best performing models have been selected, users need to fit this models (using `fit_selected()`) in order to explore their characteristics and continue with the next steps. Fitted models can then be used to assess variable importance in models, as well as to explore variable response curves. Selected models can also be evaluated using independent records that were not used during calibration. This vignette contains examples to explore the multiple options available to fit and explore selected models.
# Getting ready At this point it is assumed that `kuenm2` is installed (if not, see the [Main guide](index.html)). Load `kuenm2` and any other required packages, and define a working directory (if needed). Note: functions from other packages (i.e., not from base R or `kuenm2`) used in this guide will be displayed as `package::function()`. ```{r, results='hide', message=FALSE, warning=FALSE} # Load packages library(kuenm2) library(terra) # Current directory getwd() # Define new directory #setwd("YOUR/DIRECTORY") # uncomment and modify if setting a new directory # Saving original plotting parameters original_par <- par(no.readonly = TRUE) ```
# Fitting selected models To fit the selected models, we need a `calibration_results` object. For more details in model calibration, please refer to the vignette [Model Calibration](model_calibration.html). The `calibration_results` object generated in this vignette is available as a data example in the package. Let's load it. ```{r import calibration results, warning=FALSE} # Import an example of calibration results data("calib_results_maxnet", package = "kuenm2") # Print calibration result calib_results_maxnet ```
This object contains the results of candidate models calibrated using the `maxnet` algorithm. The package also provides a similar example wit models created using the `glm` algorithm. See below how to load the `calib_results_glm` object in case you want to explore it. ```{r import glm} # Import calib_results_glm data("calib_results_glm", package = "kuenm2") # Print calibration result calib_results_glm ```
Note that the `calibration_results` object stores only the information related to the calibration process and model evaluation, it does not include the fitted `maxnet` (or `glm`) models. To obtain fitted models, we need to use the `fit_selected()` function. By default, this function fits a full model (i.e., without replicates and without splitting the data into training and testing sets). However, you can configure it to fit final models with replicates if desired. In this example, we'll fit the final models with 4 k-fold replicates (leaving one fold out), as in the [Model Calibration](model_calibration.html) vignette. ```{r fit_selected not echo, warning=FALSE, echo=FALSE} # Fit models using calibration results fm <- fit_selected(calibration_results = calib_results_maxnet, replicate_method = "kfolds", n_replicates = 4, verbose = FALSE, progress_bar = FALSE) ``` ```{r fit_selected not eval, warning=FALSE, eval=FALSE} # Fit selected models using calibration results fm <- fit_selected(calibration_results = calib_results_maxnet, replicate_method = "kfolds", n_replicates = 4) # Fitting replicates... # |========================================================================| 100% # Fitting full models... # |========================================================================| 100% ```
The `fit_selected()` function returns a `fitted_models` object, a list that contains essential information from the fitted models, which is required for the subsequent steps. You can explore the contents of the `fitted_models` object by indexing its elements. For example, the fitted `maxnet` (or `glm`) model objects are stored within the `Models` element. Note that `Models` is a nested list: for each selected model (in this case, models 192 and 219), it includes both the replicates (if fitted with replicates) and the full model. ```{r explore fitted} # See names of selected models names(fm$Models) # See models inside Model 192 names(fm$Models$Model_192) ```
The `fitted_models` object also stores the thresholds that can be used to binarize the models into suitable and unsuitable areas. These thresholds correspond to the omission error used during model selection (e.g., 5% or 10%). You can access the omission error used to calculate the thresholds directly from the object: ```{r omission error} # Get omission error used to select models and calculate the theshold values fm$omission_rate ```
The omission error used to calculate the thresholds was 10%, meaning that when the predictions are binarized, approximately 10% of the presence records used to calibrate the models will fall into cells with predicted values below the threshold. The thresholds are summarized in two ways: the mean and median across replicates for each model, and the consensus mean and median across all selected models (when more than one model is selected): ```{r thresholds} fm$thresholds ```
# Response curves The selected models in the `fitted_models` object can be used to generate response curves. Individual variable response curves illustrate how each environmental variables influences suitability, while keeping all other variables constant. By default, the curves are generated with all other variables set to their mean values (or the mode for categorical variables). The mean or mode are calculated from the combined set of presence and background data (`averages_from = "pr_bg"`). You can change this behavior to use only the presence localities by setting `averages_from = "pr"`.
## All response curves An easy way to explore response curves for all variables is as follows: ```{r all var resps} # Produce response curves for all variables in all fitted models all_response_curves(fm) ```
In the previous plot, the dashed lines indicate the range of variables values in the data used to fit models. If multiple models were fitted, variability is shown via a Generalized Additive Model (GAM) using the mean and the 95% confidence interval. For a more detailed view of what the response curve for each of the models fitted look like, the argument `show_lines` can be set to `TRUE`. ```{r all var resps1} # All response curves showing each model as a different line all_response_curves(fm, show_lines = TRUE) ```
To explore response curves of all variables for each model independently, the argument `model_ID` can be used. The plot will show variable response curves for the full model. ```{r all var resps2} # All response curves for model 219 all_response_curves(fm, modelID = "Model_219") ```
To get a view of variability in response curves for a model produced with replicates, use `show_variability = TRUE`. Both, the GAM or the multiple-line approaches can be used to show variability. ```{r all var resps3} # All response curves for model 219 (GAM for variability) all_response_curves(fm, modelID = "Model_219", show_variability = TRUE) # All response curves for model 219 (each curve is a replicate) all_response_curves(fm, modelID = "Model_219", show_variability = TRUE, show_lines = TRUE) ```
## Customized response curves The previous plots help to gain a quick understanding of response curves for fitted models, but little can be done make each panel look better. To produce nicer plots, response curves for each variable at a time can be produced. Let's check which variables are available to plot by examining the coefficients of the full models: ```{r Get variables} # Get variables with non-zero coefficients in the models fm$Models[[1]]$Full_model$betas # From the first model selected fm$Models[[2]]$Full_model$betas # From the second model selected ```
The variables `bio_1`, `bio_7`, `bio_12` and `bio_15 `have non-zero coefficient values, which means they contribute to the model and are available for generating response curves. Remember that response curves are computed using all selected models. This time let's change the margins and labels for each plot ```{r response curve} par(mfrow = c(2, 2), mar = c(4, 4, 1, 0.5)) # Set grid of plot response_curve(models = fm, variable = "bio_1", las = 1) response_curve(models = fm, variable = "bio_7", ylab = "", las = 1) response_curve(models = fm, variable = "bio_12", las = 1) response_curve(models = fm, variable = "bio_15", ylab = "", las = 1) ```
We can also specify which of the selected models should be used to generate the response curves: ```{r response ID} par(mfrow = c(2, 2), mar = c(4, 4, 2.5, 0.5)) # Set grid of plot response_curve(models = fm, variable = "bio_1", modelID = "Model_192", main = "Model_192", las = 1) response_curve(models = fm, variable = "bio_1", modelID = "Model_219", main = "Model_219", ylab = "", las = 1) response_curve(models = fm, variable = "bio_7", modelID = "Model_192", main = "Model_192", las = 1) response_curve(models = fm, variable = "bio_7", modelID = "Model_219", main = "Model_219", ylab = "", las = 1) ```
By default, the response curve extends beyond the range of model fitting limits (dashed lines) based on the observed range of values and an `extrapolation_factor` (when `extrapolation = TRUE`). The default extrapolation factor is set to 10% of the range. If `extrapolation = FALSE`, no extrapolation occurs, and the plot limits match the fitting data range exactly. We can increase the extrapolation factor to allow a broader range beyond the observed data. Below is the response curve plotted with an extrapolation factor of 2: ```{r extrapolation factor} par(mfrow = c(2, 2), mar = c(4, 4, 1, 0.5)) # Set grid of plot response_curve(models = fm, variable = "bio_1", extrapolation_factor = 2, las = 1) response_curve(models = fm, variable = "bio_7", extrapolation_factor = 2, ylab = "", las = 1) response_curve(models = fm, variable = "bio_12", extrapolation_factor = 2, las = 1) response_curve(models = fm, variable = "bio_15", extrapolation_factor = 2, ylab = "", las = 1) ```
Note that the response curve now extends further beyond the observed data range. Optionally, we can manually set the lower and upper limits of the variables. For example, since `bio_12` represents annual precipitation and negative values are not realistic, we can set its lower limit to 0: ```{r lower limit} response_curve(models = fm, variable = "bio_12", extrapolation_factor = 0.1, l_limit = 0) ```
Now, the lower limit of the plot for `bio_12` is set to 0. Since we did not specify an upper limit, the plot uses the extrapolation factor (here, 0.1) to define the upper limit. The other aspect to notice after increasing the extrapolation factor is that the curves looked like they were not perfectly unimodal. Unfortunately, using a GAM to summarize and represent variability can some times generate this effect. However, plotting responses for each model as independent curves will show the real shape of the curve, see the differences after setting `show_lines = TRUE`. ```{r GAM vs lines} par(mfrow = c(1, 2), mar = c(4, 4, 1, 0.5)) # Set grid of plot response_curve(models = fm, variable = "bio_1", extrapolation_factor = 0.5, las = 1) response_curve(models = fm, variable = "bio_1", extrapolation_factor = 0.5, show_lines = TRUE, ylab = "", las = 1) ```
Optionally, we can add the presence and background points used to fit the models to the response curve plot by setting `add_points = TRUE`: ```{r add points} response_curve(models = fm, variable = "bio_1", show_lines = TRUE, add_points = TRUE) ```
## Bivariate response curves All previous curve plots are showing responses to individual variables. Exploring responses to two variables at a time can help understand the small differences between models fitted with distinct terms. Let's see an example below: ```{r bivariate} # First, let's check the terms in the models fitted ## Model 192 fm$Models$Model_192$Full_model$betas ## Model 219 fm$Models$Model_219$Full_model$betas # Example of a bivariate response plot bivariate_response(models = fm, variable1 = "bio_1", variable2 = "bio_15", modelID = "Model_192") ```
In the previous plot, the dashed lines represent the limits of the data used for model fitting, and darker colors represent higher suitability. Let's now compare the two models and check if there are any differences. Multiple bivariate response plots can be put together in a single plot if the suitability bar legend is not used. ```{r bivariate1} par(mfrow = c(1, 2), mar = c(4, 4, 2.5, 0.5)) # Bivariate response model 192 bivariate_response(models = fm, variable1 = "bio_1", variable2 = "bio_15", modelID = "Model_192", add_bar = FALSE, main = "Model 192") # Bivariate response model 219 bivariate_response(models = fm, variable1 = "bio_1", variable2 = "bio_15", modelID = "Model_219", add_bar = FALSE, main = "Model 219", ylab = "") ```
These two models look very similar, the only noticeable difference is that suitability values above the threshold extend farther on variable bio_15 for model 219. More noticeable effects can be seen in bivariate representations when fitted models include or not product terms. However, as in this example, even a small variation on the number or type of terms included in a model can change how predictions look like.
# Variable importance The relative importance of model predictors can be calculated exploring the deviance using the `var_importance()` function. This process starts by fitting the full model (`maxnet` or `glm`), which includes all predictors. Then, the function fits separate models excluding one predictor at a time, and quantifies how the removal affects model performance. The values of contribution are computed for variables by default, but explorations by terms are also possible using the argument `by_terms` (i.e., if the model includes linear and quadratic terms of a variable, the process will run for each of them separately). By default, the function runs on a single core. You can enable parallel processing by setting `parallel = TRUE` and specifying the number of cores with `ncores`. Note that parallelization only speeds up the computation when there are many variables (e.g., >17) and the calibration dataset is large (e.g., over 15,000 total points). Variable importance is computed for all selected models by default: ```{r var importance not echo, echo=FALSE} # Calculate variable importance imp <- variable_importance(models = fm, progress_bar = FALSE, verbose = FALSE) ``` ```{r var importance not eval, eval=FALSE} # Calculate variable importance imp <- variable_importance(models = fm) # Calculating variable contribution for model 1 of 2 # |======================================================================| 100% # Calculating variable contribution for model 2 of 2 # |======================================================================| 100% ```
The function returns a `data.frame` with the relative contribution of each variable. If multiple models are included in the `fitted` object, an additional column identifies each model. ```{r print importance} imp ```
We can visualize variable importance using the `plot_importance()` function. When the `fitted_models` object contains more than one selected model, contributions are displayed as a boxplot, along with the mean contribution, and the number (N) of fitted models with that predictor. ```{r plot importance} plot_importance(imp) ```
If variable importance is computed for a single model, the plot displays a bars instead of boxes: ```{r one model importance} # Calculate variable importance for a specific selected Model imp_192 <- variable_importance(models = fm, modelID = "Model_192", progress_bar = FALSE) # Plot variable contribution for model 192 plot_importance(imp_192, main = "Variable importance: Model 192") ```
The previous examples show how important are variables by checking the change in models if we remove a variable completely from the equation. To compute contribution of all variable terms (i.e., how important is to keep distinct terms of the same variable) set `by_terms = TRUE`: ```{r term importance} # Calculate variable importance for a specific selected Model imp_terms <- variable_importance(models = fm, by_terms = TRUE, progress_bar = FALSE) # Check results imp_terms # Plot variable contribution for model 192 par(cex = 0.7, mar = c(3, 4, 2.5, 0.5)) # Set grid of plot plot_importance(imp_terms, main = "Importance of variable terms") ```
# Evaluation with independent data Selected models can be evaluated using an independent set of presence records (not used when fitting selected models). This approach is especially useful when new records become available. This can be useful to assess models for invasive species, for which model fitting is usually done in native areas with transfers to potential invasive areas. The `independent_evaluation()` function computes omission rates (i.e., the proportion of independent records predicted as below the suitability threshold) and partial ROC. The function also assesses whether environmental conditions in independent data are analogous those under which models were fit using the mobility oriented-parity metric (MOP; [Cobos et al. 2024](https://biogeography.pensoft.net/article/132916/)). Let's use the `new_occ` as an example of independent data (provided in the package). This dataset contains coordinates of *Myrcia hatschbachii* sourced from NeotropicTree ([Oliveira-Filho, 2017](http://www.neotroptree.info/)), and they were not part of the data used to fit the models. ```{r import new occ} # Import independent records data("new_occ", package = "kuenm2") # See data structure str(new_occ) ```
In order to evaluate predictive ability of models and how analogous conditions in independent data are to fitting conditions, environmental values need to be extracted to these new records. Let's import the raster variables and extract values for `new_occ`: ```{r extract var} # Import raster layers var <- terra::rast(system.file("extdata", "Current_variables.tif", package = "kuenm2")) # Extract variables to occurrences new_data <- extract_occurrence_variables(occ = new_occ, x = "x", y = "y", raster_variables = var) # See data structure str(new_data) ```
Finding non-analogous conditions in independent records is not uncommon, especially in regions outside model calibration areas. To better illustrate this case, let's add three fake records, in which some variables have non-analogous values, either higher than the upper limit or lower than the lower limit observed in the data used to fit models. ```{r add fake data} # Add some fake data beyond the limits of calibration ranges fake_data <- data.frame("pr_bg" = c(1, 1, 1), "x" = c(NA, NA, NA), "y" = c(NA, NA, NA), "bio_1" = c(10, 15, 23), "bio_7" = c(12, 16, 20), "bio_12" = c(2300, 2000, 1000), "bio_15" = c(30, 40, 50), "SoilType" = c(1, 1, 1)) # Bind data new_data <- rbind(new_data, fake_data) ```
Now, let's evaluate models with this independent dataset (keep in mind that the last three records are fake): ```{r} # Evaluate models with independent data res_ind <- independent_evaluation(fitted_models = fm, new_data = new_data) ```
The output is a list with three elements: 1. **evaluation** metrics (omission rate and pROC) for each selected model, as well as for the overall consensus. 2. **mop_results**, a list containing the output of MOP analysis comparing conditions in independent data and those used to fit models. This happens because the `perform_mop` argument is set to `TRUE` (the default). 3. **predictions** of continuous and binary suitability for each of the independent records (the default; `return_predictions = TRUE`). In addition, MOP results are included for all the records. ```{r} res_ind$evaluation ```
All selected models have significant pROC values, but show higher omission rates than expected based on the 10% omission threshold used for calibration (~35% of the independent records are predicted as unsuitable) . For each records, the following MOP results are provided: - **mop_distance**: the environmental distance (i.e., dissimilarity) of the independent record to the nearest set of conditions considered to fit the models. - **inside_range**: whether the environmental conditions at the location of the independent record fall within the model fitting range. - **n_var_out**: the number of variables for which the independent record is outside the model fitting range. - **towards_low**: the names of variables with values lower than the minimum observed in the model fitting data. - **towards_high**: the names of variables with values higher than the maximum observed in the model fitting data. ```{r} # Show the mop results for the last 5 independent records res_ind$predictions$continuous[81:85 , c("mop_distance", "inside_range", "n_var_out", "towards_low", "towards_high")] ```
Note that two of the three fake records we added to `new_data` have non-analogous environmental conditions. One of them falls in a location where *bio_7* and *bio_1* have values lower than the minimum observed in the model fitting data, whereas *bio_12* has a value higher than the maximum. Another record is in a location where *bio_12* is below the model fitting range, and *bio_1* exceeds the upper limit.
When we set `return_predictions = TRUE` (the default), the function also returns the predictions for each selected model and for the general consensus: ```{r} # Show the continuous predictions for the last 5 independent records # Round to two decimal places round(res_ind$predictions$continuous[81:85, 1:6], 2) # Show the binary predictions for the last 5 independent records res_ind$predictions$binary[81:85, 1:6] ```
These results can help determine whether independent data should be incorporated into the calibration dataset to re-run models. If omission rates for independent records are too high, pROC values are non-significant, or most of the records show non-analogous environmental conditions, it might be a good idea to re-run the models including independent records.
```{r par_reset} # Reset plotting parameters par(original_par) ```
# Saving a fitted_models object After fitting the best-performing models with `fit_selected()`, we can proceed to predict the models for single or multiple scenarios. As this object is essentially a list, users can save it to a local directory using `saveRDS()`. Saving the object facilitates loading it back into your R session later using `readRDS()`. See an example below: ```{r save data, eval=FALSE} # Set directory to save (here, in a temporary directory) dir_to_save <- file.path(tempdir()) # Save the data: saveRDS(fm, file.path(dir_to_save, "fitted_models.rds")) # Import data fm <- readRDS(file.path(dir_to_save, "fitted_models.rds")) ```