--- title: "2. Prepare Data for Model Calibration" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{2. Prepare Data for Model Calibration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 6, dpi=100, out.width = "95%" ) ``` # Summary - [Description](#description) - [Getting ready](#getting-ready) - [Preparing data](#preparing-data) - [Example data](#example-data) - [First steps in preparing data](#first-steps-in-preparing-data) - [Using pre-processed data](#using-pre-processed-data) - [Exploring prepared data](#exploring-prepared-data) - [Comparative histograms](#comparative-histograms) - [Distribution of data for models](#distribution-of-data-for-models) - [Similarity assessment in partitions](#similarity-assessment-in-partitions) - [More options to prepare data](#more-options-to-prepare-data) - [Custom formulas](#custom-formulas) - [Using a bias file](#using-a-bias-file) - [PCA for variables](#pca-for-variables) - [Internal PCA](#internal-pca) - [External PCA](#external-pca) - [Using external data partitions](#using-external-data-partitions) - [ENMeval partitions](#enmeval-partitions) - [flexsdm partitions](#flexsdm-partitions) - [Saving prepared data](#saving-prepared-data)

# Description Before starting the ENM process, data must be formatted in a specific structure required by functions in `kuenm2`. This vignette guides users through the steps necessary to prepare occurrence data and environmental predictors using built-in tools. It covers the use of the main functions, `prepare_data()` and `prepare_user_data()`, to generate standardized objects that are essential for model calibration. The guide also demonstrates options to compute principal components from variables (PCA), incorporating sampling bias, integrating data partitioning schemes from external methods, exploring prepared data, and saving the data object for later use.
# Getting ready If `kuenm2` has not been installed yet, please do so. See the [Main guide](index.html) for installation instructions. See also the [basic data cleaning guide](basic_data_cleaning.html) for some data cleaning steps. Use the following lines of code to load `kuenm2` and any other required packages, and define a working directory (if needed). In general, setting a working directory in R is considered good practice, as it provides better control over where files are read from or saved to. If users are not working within an R project, we recommend setting a working directory, since at least one file will be saved at later stages of this guide. Note: functions from other packages (i.e., not from base R or `kuenm2` ) used in this guide will be displayed as `package::function()`. ```{r, results='hide', message=FALSE, warning=FALSE} # Load packages library(kuenm2) library(terra) # Current directory getwd() # Define new directory #setwd("YOUR/DIRECTORY") # uncomment and modify if setting a new directory # Saving original plotting parameters original_par <- par(no.readonly = TRUE) ```
# Preparing data ## Example data We will use occurrence records provided within the `kuenm2` package. Most example data in the package is derived from [Trindade & Marques (2024)](https://doi.org/10.1111/ddi.13931). The `occ_data` object contains 51 occurrences of *Myrcia hatschbachii*, a tree endemic to Southern Brazil. Although this example data set has three columns (species, x, and y), only two numeric columns with longitude and latitude coordinates are required. ```{r Import occurrence data} # Import occurrences data(occ_data, package = "kuenm2") # Check data structure str(occ_data) ```
As predictor variables, we will use another dataset included in the package. This dataset comprises four bioclimatic variables from [WorldClim 2.1](https://worldclim.org/data/bioclim.html) at 10 arc-minute resolution, and a categorical variable (SoilType) from [SoilGrids](https://soilgrids.org/) aggregated to 10 arc-minutes. All variables have been masked using a polygon that delimits the area for model calibration. This polygon was generated by drawing a minimum convex polygon around the records, with a 300 km buffer. Note: At later steps, `kuenm2` helps users automate model transfers to multiple future and/or past scenarios. If using bioclimatic variables from WorldClim, part of that process involves renaming variables, and renaming options are limited. Try to limit variable names to the following formats: `"bio_1", "bio_12"`; `"bio_01", "bio_12"`; `"Bio_1", "Bio_12"`; or `"Bio_01", "Bio_12"`. If variable names have other formats, using automated steps are still possible but require a few extra steps. ```{r Load variables, dpi = 80, out.width = "70%"} # Import raster layers var <- terra::rast(system.file("extdata", "Current_variables.tif", package = "kuenm2")) # Check variables terra::plot(var) ```
Visualize occurrences records in geography: ```{r, dpi = 80, out.width = "70%"} # Visualize occurrences on one variable terra::plot(var[["bio_1"]], main = "Bio 1") points(occ_data[, c("x", "y")], col = "black") ```
## First steps in preparing data The functions `prepare_data()` and `prepare_user_data()` are central to getting data ready for model calibration. They handles several key steps: * **Defining the algorithm**: Users can choose between `maxnet` or `glm`. * **Generating background points**: Background points are sampled from raster layers (`prepare_data()`), unless provided by the user (`prepare_user_data()`). Background points serve as a reference to contrast presence records. The default number of points is 1000; we encourage users to define appropriate numbers considering the calibration area extension, the number of presence records, and the number of pixels available (this can change depending on layer resolution). * **Principal component analysis (PCA)**: An optional step that can be done with the variables provided. * **Preparing calibration data**: Presence records and background points are associate with predictor values and put together in a `data.frame` to be used in the ENM. Calibration data is provided by the user in `prepare_user_data()`. * **Data partitioning**: The function divides your data to prepare training and testing sets via a cross-validation process. The partitioning methods directly available include `kfolds`, `subsample`, and `bootstrap`. * **Defining grid of model parameters**: This helps setting up combinations of feature classes (FCs), regularization multiplier (RM) values (for Maxnet), and sets of predictor variables. An explanation of the roles of RMs and FCs in Maxent models see [Merow et al. 2013](https://doi.org/10.1111/j.1600-0587.2013.07872.x). As with any function, we recommend checking the documentation with `help(prepare_data)` for more detailed explanations. Now, let's prepare the data for model calibration, using 4 k-folds to partition training and testing datasets: ```{r simple prepare data} # Prepare data for maxnet model d <- prepare_data(algorithm = "maxnet", occ = occ_data, x = "x", y = "y", raster_variables = var, species = "Myrcia hatschbachii", categorical_variables = "SoilType", partition_method = "kfolds", n_partitions = 4, n_background = 1000, features = c("l", "q", "lq", "lqp"), r_multiplier = c(0.1, 1, 2)) ```
The `prepare_data()` function returns a `prepared_data` object, a list that contains several essential components for model calibration. Below is an example of the object’s printed output, which provides a summary of its contents. ```{r print prepared data} print(d) ```
The parts of the `prepared_data` object can be explored in further detail by indexing them as in the following example. ```{r explore some data} # Check the algorithm selected d$algorithm # See first rows of calibration data head(d$calibration_data) # See first rows of formula grid head(d$formula_grid) ```
The algorithms that can be selected are `"maxnet"` or `"glm"`. When using GLMs, regularization multipliers are not used. Now, let's run an example using `glm`, this time using the `subsample` partitioning method, with a total of 10 partitions, and 70% of the dataset used for training in every iteration. ```{r prepare glm} # Prepare data selecting GLM as the algorithm d_glm <- prepare_data(algorithm = "glm", occ = occ_data, x = "x", y = "y", raster_variables = var, species = "Myrcia hatschbachii", categorical_variables = "SoilType", partition_method = "subsample", n_partitions = 10, train_proportion = 0.7, n_background = 300, features = c("l", "q", "p", "lq", "lqp"), r_multiplier = NULL) # Not necessary with GLMs # Print object d_glm ```
## Using pre-processed data In some cases, users already have data that has been prepared for model calibration (e.g., when data is prepared for time-specific applications). When that is the case, the function `prepare_user_data()` can take the pre-processed data and get them ready for the next analyses. User-prepared data must be a `data.frame` that includes a column with zeros and ones, indicating **presence (1)** and **background (0)** records, along with columns with values for each of the **variables**. The package includes an example of such a `data.frame` for reference (see below). ```{r import user data} data("user_data", package = "kuenm2") head(user_data) ```
The `prepare_user_data()` function operates similarly to `prepare_data()`, but with key differences. The main difference is that instead of requiring a table with coordinates and a `SpatRaster` of variables, it takes the already prepared `data.frame`. See full documentation with `help(prepare_user_data)`. ```{r prepare user data} # Prepare data for maxnet model data_user <- prepare_user_data(algorithm = "maxnet", user_data = user_data, # user-prepared data.frame pr_bg = "pr_bg", species = "Myrcia hatschbachii", categorical_variables = "SoilType", partition_method = "bootstrap", features = c("l", "q", "p", "lq", "lqp"), r_multiplier = c(0.1, 1, 2, 3, 5)) data_user ```
This function also allows users to provide a list identifying which points should be used for testing in each cross-validation iteration. This can be useful to keep data partitions stable among distinct model calibration routines. If `user_folds` is `NULL` (the default), the function partitions the data according to `partition_method`, `n_partitions`, and `train_proportion`.
# Exploring prepared data In the following examples, we'll use the object `d`, prepared using the `maxnet` algorithm. The same can be done with `prepared_data` using `glm` as de algorithm.
## Comparative histograms Users can visualize the distribution of predictor values for occurrence records, background points, and the entire calibration area using histograms. An example is presented below. See full documentation with `help(explore_calibration_hist)` and `help(plot_explore_calibration)`. ```{r explore histogram, out.width = "70%", fig.width = 7, fig.height = 4.5} # Prepare histogram data calib_hist <- explore_calibration_hist(data = d, raster_variables = var, include_m = TRUE) # Plot histograms plot_calibration_hist(explore_calibration = calib_hist) ```
In the previous plot, gray represents values across the entire calibration area, blue background values, and green values at presence records (magnified by a factor of 2 to enhance visualization). Both the colors and the magnification factor can be customized. If `raster_variables` are not available, exclude that argument and `include_m` when running the function `explore_calibration_hist()`. This could happen when users have pre-processed data and run `prepare_user_data()`.
## Distribution of data for models Additionally, users can explore the geographic distribution of occurrences and background, as well as how they were partitioned. See full documentation with `help(explore_partition_geo)`. ```{r explore spatial, out.width = "75%"} # Explore spatial distribution pbg <- explore_partition_geo(data = d, raster_variables = var[[1]]) # Plot exploration results in geography terra::plot(pbg) ```
Note that, by default, background points are selected randomly within the calibration area. However, users can influence this selection by providing a bias file (see section [More options to prepare data](#more-options-to-prepare-data)).
## Similarity assessment in partitions When partitioning data, some testing points may fall into environments that are different from those in which training points are. This forces the model to evaluate under extrapolation, testing its predictions on conditions outside its training range. The `explore_partition_extrapolation()` function calculates dissimilarity and detects non-analogous conditions in testing points after comparing them to the training data for each partition. Dissimilarity tests are performed using the mobility-oriented parity metric (MOP; [Owens et al. 2013](https://doi.org/10.1016/j.ecolmodel.2013.04.011)) as in [Cobos et al. (2024)](https://biogeography.pensoft.net/article/132916/). This analysis only requires a `prepared_data` object. By default, the function `explore_partition_extrapolation()` uses all training data (presences and backgrounds) to define the environmental space of reference, and computes MOP for testing records. If computing MOP for test background points is needed, set `include_test_background = TRUE`. ```{r} # Run extrapolation risk analysis mop_partition <- explore_partition_extrapolation(data = d, include_test_background = TRUE) ```
The previous run returns a list in which the main outcome is `Mop_results`. This is a `data.frame`, in which each row is a testing record; the columns are: - **mop_distance**: MOP distances; - **inside_range** an indicator of whether environmental conditions at each testing record fall within the training range; - **n_var_out**: number of variables outside the training range; - **towards_low** and **towards_high **: names of variables with values lower or higher than the training range; - **SoilType**: because the `prepared_data` object includes a categorical variable, it also contains a column indicating which categories in testing data were not present in training data. ```{r mop_results} # Check some of the results head(mop_partition$Mop_results) # Number of testing records with values outside training ranges nrow(mop_partition$Mop_results[mop_partition$Mop_results$n_var_out > 1, ]) ```
Now we can use the function `plot_explore_partition()` to visualize the records from each partition in both geographic and environmental spaces. As comparisons were performed in environmental space, to visualize results in geographic space the plotting functions uses a simplified map of the world, but another spatial object can be defined in `calibration_area` if needed. The type MOP result to plot can be specified as: "simple" to show records in a partition within or out of environmental range of the other partitions; or "distance" to display the distance of each record to the nearest set of conditions in the other partitions. ```{r, fig.width = 8, fig.height = 3.5, out.width = "75%"} # Simple plot in geographic space plot_explore_partition(explore_partition = mop_partition, space = "G", type_of_plot = "simple") # Distance plot in geographic space plot_explore_partition(explore_partition = mop_partition, space = "G", type_of_plot = "distance", lwd_legend = 4) # Simple plot in environmental space plot_explore_partition(explore_partition = mop_partition, space = "E", type_of_plot = "simple", variables = c("bio_7", "bio_15")) # Distance plot in environmental space plot_explore_partition(explore_partition = mop_partition, space = "E", type_of_plot = "distance", variables = c("bio_7", "bio_15"), lwd_legend = 4) ```
The data used in this example was partitioned using k-folds which reduces the chances of finding novel conditions when comparing testing vs training sets of data. However, that might not be the case when using data partitioning methods such as spatial blocks (see [Using external data partitions](#using-external-data-partitions)).
# More options to prepare data The examples of data preparation above show a few of the options that can be used to get data ready to start the ENM process. The next sections demonstrate other options available for data preparation, including: (1) customizing the list of model formulas to test in model calibration; (2) principal component analysis for variables; and (3) using external data partitioning methods.
## Custom formulas By default, `kuenm2` builds the formula grid automatically using the variables provided and the feature classes defined. For instance, if `raster_variables` or `user_data` contain *bio_1* and *bio_12*, and you set the features to `lq` (linear + quadratic), the formula will include linear and quadratic terms for each variable. In this example, the resulting formula would be: ```{r formula example} "~ bio_1 + bio_12 + I(bio_1^2) + I(bio_12^2)" ```
Instead of letting the functions build formulas based on selected features, users can provide custom formulas. This is useful when full control over which terms are included is needed (for example, including the quadratic version of specific variables): ```{r user_formula} # Set custom formulas my_formulas <- c("~ bio_1 + bio_12 + I(bio_1^2) + I(bio_12^2)", "~ bio_1 + bio_12 + I(bio_1^2)", "~ bio_1 + bio_12 + I(bio_12^2)", "~ bio_1 + I(bio_1^2) + I(bio_12^2)") # Prepare data using custom formulas d_custom_formula <- prepare_data(algorithm = "maxnet", occ = occ_data, x = "x", y = "y", raster_variables = var, species = "Myrcia hatschbachii", categorical_variables = "SoilType", partition_method = "kfolds", n_partitions = 4, n_background = 1000, user_formulas = my_formulas, # Custom formulas r_multiplier = c(0.1, 1, 2)) # Check formula grid d_custom_formula$formula_grid ```
## Using a bias file A bias file is a `SpatRaster` object that contains values that influence the selection of background points within the calibration area. This can be particularly useful for mitigating sampling bias. For instance, if we want the selection of background points to be informed by how historical sampling has been, a layer of the density of records from a target group can be used (see [Ponder et al. 2001](https://doi.org/10.1046/j.1523-1739.2001.015003648.x), [Anderson et al. 2003](https://doi.org/10.1046/j.1365-2699.2003.00867.x), and [Barber et al. 2020](https://doi.org/10.1111/ddi.13442)). The bias file must have the same extent, resolution, and number of cells as your raster variables. Let's illustrate this process with an example bias file included in the package. This `SpatRaster` has lower values in the center and higher values towards the borders of the area: ```{r import bias file, out.width = "70%"} # Import a bias file bias <- terra::rast(system.file("extdata", "bias_file.tif", package = "kuenm2")) # Check the bias values terra::plot(bias) ```
This bias layer will be used to prepare two new datasets: one with a "direct" bias effect (with higher probability of selecting background points in regions with higher bias values) and another with an "inverse" effect (the opposite). ```{r bias data, results='hide'} # Using a direct bias effect in sampling d_bias_direct <- prepare_data(algorithm = "maxnet", occ = occ_data, x = "x", y = "y", raster_variables = var, species = "Myrcia hatschbachii", categorical_variables = "SoilType", n_background = 1000, partition_method = "kfolds", bias_file = bias, bias_effect = "direct", # bias parameters features = c("l", "q", "p", "lq", "lqp"), r_multiplier = c(0.1, 1, 2, 3, 5)) # Using an indirect bias effect d_bias_inverse <- prepare_data(algorithm = "maxnet", occ = occ_data, x = "x", y = "y", raster_variables = var, species = "Myrcia hatschbachii", categorical_variables = "SoilType", n_background = 1000, partition_method = "kfolds", bias_file = bias, bias_effect = "inverse", # bias parameters features = c("l", "q", "p", "lq", "lqp"), r_multiplier = c(0.1, 1, 2, 3, 5)) ```
Let's use the `explore_partition_geo` function to see the effect of using a bias file. ```{r exp_part_geobias} # Explore spatial distribution of points ## No bias geo_dist <- explore_partition_geo(data = d, raster_variables = var) ## Direct bias effect geo_dist_bias <- explore_partition_geo(data = d_bias_direct, raster_variables = var) ## Inverse bias effect geo_dist_bias_inv <- explore_partition_geo(data = d_bias_inverse, raster_variables = var) ## Adjusting plotting grid par(mfrow = c(2, 2)) ## The plots to show sampling bias effects terra::plot(bias, main = "Bias file") terra::plot(geo_dist$Background, main = "Random Background", plg = list(cex = 0.75)) # Decrease size of legend text) terra::plot(geo_dist_bias$Background, main = "Direct Bias Effect", plg = list(cex = 0.75)) # Decrease size of legend text) terra::plot(geo_dist_bias_inv$Background, main = "Inverse Bias Effect", plg = list(cex = 0.75)) # Decrease size of legend text) ```
## PCA for variables A common approach in ENM involves summarizing the information from a set of variables into a smaller set of orthogonal variables using Principal Component Analysis (PCA) (see [Trindade et al. 2025](https://onlinelibrary.wiley.com/doi/epdf/10.1111/jbi.70103) for an example). `kuenm2` has options to perform a PCA internally or use variables that have been externally prepared as PCs.
### Internal PCA `kuenm2` can perform all PCA transformations internally, which eliminates the need of transforming raw variables into PCs when producing projections later on. This is particularly advantageous when projecting model results across multiple scenarios (e.g., various Global Climate Models for different future periods). By performing PCA internally, you only need to provide the raw environmental variables (e.g., `bio_1`, `bio_2`, etc.) when making projections; helper functions will handle the PCA transformation internally. Let's explore how to implement this: ```{r prepare data pca} # Prepare data for maxnet models using PCA parameters d_pca <- prepare_data(algorithm = "maxnet", occ = occ_data, x = "x", y = "y", raster_variables = var, do_pca = TRUE, center = TRUE, scale = TRUE, # PCA parameters species = "Myrcia hatschbachii", categorical_variables = "SoilType", n_background = 1000, partition_method = "kfolds", features = c("l", "q", "p", "lq", "lqp"), r_multiplier = c(0.1, 1, 2, 3, 5)) print(d_pca) ```
The elements `calibration_data` and `formula_grid` have now been generated considering the principal components (PCs). By default, all continuous variables were included in the PCA, while categorical variables (e.g., "SoilType") were excluded. The default settings to define the number of PCs to retain keeps as many PCs as needed to collectively explain 95% of the total variance, and then further filter these, keeping only those axes that individually explain at least 5% of the variance. These parameters can be changed using the arguments in the function `prepare_data()`. ```{r explore hist, out.width = "70%", fig.width = 7, fig.height = 4.5} # Check calibration data head(d_pca$calibration_data) # Check formula grid head(d_pca$formula_grid) # Explore variables distribution calib_hist_pca <- explore_calibration_hist(data = d_pca, raster_variables = var, include_m = TRUE, breaks = 7) plot_calibration_hist(explore_calibration = calib_hist_pca) ```
### External PCA Users can choose to perform a PCA with their data by using the `perform_pca()` function, or one of their preference. Se an example with `perform_pca()` below: ```{r do PCA, out.width = "70%"} # PCA with raw raster variables pca_var <- perform_pca(raster_variables = var, exclude_from_pca = "SoilType", center = TRUE, scale = TRUE) # Plot terra::plot(pca_var$env) ```
Now, let's use the PCs generated by `perform_pca()` to prepare the data: ```{r prepare data pca external} # Prepare data for maxnet model using PCA variables d_pca_extern <- prepare_data(algorithm = "maxnet", occ = occ_data, x = "x", y = "y", raster_variables = pca_var$env, # Output of perform_pca() do_pca = FALSE, # Set to FALSE because variables are PCs species = "Myrcia hatschbachii", categorical_variables = "SoilType", n_background = 1000, partition_method = "kfolds", features = c("l", "q", "p", "lq", "lqp"), r_multiplier = c(0.1, 1, 2, 3, 5)) # Check the object print(d_pca_extern) # Check formula grid head(d_pca_extern$formula_grid) ```
Note that since PCA was performed externally, `do_pca = FALSE` is set in `prepare_data()`. This is crucial because setting it to `TRUE` would incorrectly apply PCA to variables that are *already* PCs. The `prepared_data` object in this scenario does not store any PCA-related information. Therefore, users **must provide PCs** instead of raw raster variables when projecting models.
## Using external data partitions The functions `prepare_data()` and `prepare_user_data()` in the `kuenm2` package include four built-in methods for data partitioning: - **"kfolds"**: Splits the dataset into *K* subsets (folds) of approximately equal size. In each partition, one fold is used as the test set, while the remaining folds are combined to form the training set. - **"bootstrap"**: Creates the training dataset by sampling observations from the original dataset *with replacement* (i.e., the same observation can be selected multiple times). The test set consists of the observations that were not selected in that specific replicate. - **"subsample"**: Similar to bootstrap, but the training set is created by sampling *without replacement* (i.e., each observation is selected once). The test set includes the observations not selected for training. Other methods for data partitioning are also available, including those based on spatial rules ([Radosavljevic and Anderson, 2014](https://onlinelibrary.wiley.com/doi/pdf/10.1111/jbi.12227)). Although `kuenm2` does not currently implement spatial partitioning techniques, it is possible to use the ones implemented in other R packages. After partitioning data using other packages, those results can be used to replace the `part_data` section in the `prepared_data` object from `kuenm2`.
### ENMeval partitions The ENMeval package ([Kass et al. 2021](https://besjournals.onlinelibrary.wiley.com/doi/pdfdirect/10.1111/2041-210X.13628)) provides three spatial partitioning methods: - **Spatial block**: Divides occurrence data into four groups based on latitude and longitude, aiming for groups of roughly equal number of occurrences. - **Checkerboard 1 (basic)**: Generates a checkerboard grid over the study area and assigns points to groups based on their location in the grid. - **Checkerboard 2 (hierarchical)**: Aggregates the input raster at two hierarchical spatial scales using specified aggregation factors. Points are then grouped based on their position in the resulting hierarchical checkerboard. Let's use the **spatial block** method as an example. We will use the object `d`, `prepared_data` created in previous steps. ```{r import enmeval_block, echo=FALSE} data(enmeval_block, package = "kuenm2") ``` ```{r spatial block enmeval, eval=FALSE} # Install ENMeval if not already installed if(!require("ENMeval")){ install.packages("ENMeval") } # Extract calibration data from the prepared_data object and # separate presence and background records calib_occ <- d$data_xy[d$calibration_data$pr_bg == 1, ] # Presences calib_bg <- d$data_xy[d$calibration_data$pr_bg == 0, ] # Background # Apply spatial block partitioning using ENMeval enmeval_block <- ENMeval::get.block(occs = calib_occ, bg = calib_bg) # Inspect the structure of the output str(enmeval_block) #> List of 2 #> $ occs.grp: num [1:51] 1 1 2 2 1 4 2 2 4 2 ... #> $ bg.grp : num [1:957] 2 4 4 1 3 3 3 1 1 3 ... ```
The output of `get.block()` is a list with two elements: `occs.grp` and `bg.grp`. The `occs.grp` vector is for occurrence records and `bg.grp` for background points. Both vectors contain numeric values from 1 to 4, indicating the spatial group. `kuenm2` stores partitioned data as a **list of vectors**, in which each vector is a partition, containing the **indices of points left out for testing**. The indices include both presence and background points. ```{r index part kuenm2} str(d$part_data) ```
We can convert the spatial group information stored in `enmeval_block` into a list compatible with `kuenm2`: ```{r transform enmeval_block} # Identify unique spatial blocks id_blocks <- sort(unique(unlist(enmeval_block))) # Create a list of test indices for each spatial block new_part_data <- lapply(id_blocks, function(i) { # Indices of occurrence records in group i rep_i_presence <- which(enmeval_block$occs.grp == i) # Indices of background records in group i rep_i_bg <- which(enmeval_block$bg.grp == i) # To get the right indices for background, # we need to sum the total number of records to background indices rep_i_bg <- rep_i_bg + nrow(occ_data) # Combine presence and background indices for the test set c(rep_i_presence, rep_i_bg) }) # Assign names to each partition names(new_part_data) <- paste0("Partition_", id_blocks) # Inspect the structure of the new partitioned data str(new_part_data) ```
We now have a list containing four vectors, each storing the indices of test data defined using the `get.block()` function from the `ENMeval` package. The final step is to replace the original `part_data` in the `prepared_data` object with `new_part_data`. We should also update the partitioning method to reflect this change. ```{r replace part_data} # Replace the original partition data with the new spatial blocks d_spatial_block <- d d_spatial_block$part_data <- new_part_data # Update the partitioning method to reflect the new approach d_spatial_block$partition_method <- "Spatial block (ENMeval)" # Any name works # Print the updated prepared_data object print(d_spatial_block) ```
Let's check the spatial distribution of the partitioned data: ```{r plot spatial block, fig.width = 9, fig.height = 4} # Explore data partitioning in geography geo_block <- explore_partition_geo(d_spatial_block, raster_variables = var[[1]]) # Plot data partition in geography terra::plot(geo_block[[c("Presence", "Background")]]) ```
Because environmental conditions often vary by region, using spatial blocks increases the chances of having testing records outside training environmental ranges. Let's explore this effect using the `prepared_data` object, partitioned with `ENMeval` spatial blocks. ```{r, fig.width = 8, fig.height = 3.5, out.width = "75%"} # Run extrapolation risk analysis mop_blocks <- explore_partition_extrapolation(data = d_spatial_block, include_test_background = TRUE) # Check some testing records with values outside training ranges head(mop_blocks$Mop_results[mop_blocks$Mop_results$n_var_out > 1, ]) # Check simple extrapolation in geographic space plot_explore_partition(explore_partition = mop_blocks, space = "G", type_of_plot = "simple") # Now in environmental space plot_explore_partition(explore_partition = mop_blocks, space = "E", type_of_plot = "simple", variables = c("bio_7", "bio_15")) ```
Note that in partition 1, some occurrence records fall outside the environmental range of the other partitions (the same happens with many background records).
### flexsdm partitions The package `flexsdm` ([Velazco et al. 2022](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13874)) also offers multiple partitioning methods. We will use the method spatial block cross-validation. In this method, the number and size of blocks is optimized internally considering spatial autocorrelation, environmental similarity, and the number of presence and background records within each partition. For more details, see [Data partitioning at the package website](https://sjevelazco.github.io/flexsdm/articles/v01_pre_modeling.html#data-partitioning). ```{r import flexsdm_block, echo=FALSE} data(flexsdm_block, package = "kuenm2") ``` ```{r spatial block flexsdm, eval=FALSE} # Install flexsdm if not already installed if (!require("flexsdm")) { if (!require("remotes")) { install.packages("remotes") } remotes::install_github("sjevelazco/flexsdm") # needs compilation tools to work } # Combine calibration data with spatial coordinates calib_data <- cbind(d$data_xy, d$calibration_data) # Split data in test and train using flexsdm flexsdm_block <- flexsdm::part_sblock(env_layer = var, data = calib_data, x = "x", y = "y", pr_ab = "pr_bg", min_res_mult = 1, max_res_mult = 500, num_grids = 30, n_part = 4, prop = 0.5) #> The following grid cell sizes will be tested: #> 0.17 | 3.03 | 5.9 | 8.77 | 11.64 | 14.51 | 17.37 | 20.24 | 23.11 | 25.98 | #> 28.84 | 31.71 | 34.58 | 37.45 | 40.32 | 43.18 | 46.05 | 48.92 | 51.79 | #> 54.66 | 57.52 | 60.39 | 63.26 | 66.13 | 68.99 | 71.86 | 74.73 | 77.6 | #> 80.47 | 83.33 #> #> Creating basic raster mask... #> #> Searching for the optimal grid size... # See the structure of the object str(flexsdm_block$part) #> Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1008 obs. of 4 variables: #> $ x : num -51.3 -50.6 -49.3 -49.8 -50.2 ... #> $ y : num -29 -27.6 -27.8 -26.9 -28.2 ... #> $ pr_ab: num 1 1 1 1 1 1 1 1 1 1 ... #> $ .part: int 3 1 3 2 3 3 1 4 4 3 ... ```
The output from `flexsdm` differs from that of `ENMeval`. Instead of returning a list with separate vectors of spatial group IDs, `flexsdm` appends the group assignments as a new column in the `part` element of its output. As with the ENMeval example, we can transform the spatial group information stored in `flexsdm_block` into a format compatible with `kuenm2`: ```{r transform flexsdm_block} # Identify unique spatial blocks from flexsdm output id_blocks <- sort(unique(flexsdm_block$part$.part)) # Create a list of test indices for each spatial block new_part_data_flexsdm <- lapply(id_blocks, function(i) { # Indices of occurrences and background points in group i which(flexsdm_block$part$.part == i) }) # Assign names to each partition partition names(new_part_data_flexsdm) <- paste0("Partition_", id_blocks) # Inspect the structure of the new partitioned data str(new_part_data_flexsdm) # Replace the partition data in the prepared_data object d_block_flexsdm <- d d_block_flexsdm$part_data <- new_part_data_flexsdm # Update the partition method description d_block_flexsdm$partition_method <- "Spatial block (flexsdm)" # any name works # Display the updated prepared_data object print(d_block_flexsdm) ```
Let's check the spatial distribution of the partitioned data: ```{r plot spatial block flexsdm, fig.width = 9, fig.height = 4} # Explore data partitioning in geography geo_block_flexsdm <- explore_partition_geo(d_block_flexsdm, raster_variables = var[[1]]) # Plot data partition in geography terra::plot(geo_block_flexsdm[[c("Presence", "Background")]]) ```
```{r par_reset} # Reset plotting parameters par(original_par) ```
# Saving prepared data The `prepared_data` object is crucial for the next step in the ENM workflow in `kuenm2`, model calibration. As this object is essentially a list, users can save it to a local directory using `saveRDS()`. Saving the object facilitates loading it back into your R session later using `readRDS()`. See an example below: ```{r save data, eval=FALSE} # Set directory to save (here, in a temporary directory) dir_to_save <- file.path(tempdir()) # Save the data saveRDS(d, file.path(dir_to_save, "Data.rds")) # Import data d <- readRDS(file.path(dir_to_save, "Data.rds")) ```