---
title: "Getting started with mverse"
output: rmarkdown::html_vignette
link-citations: yes
vignette: >
  %\VignetteIndexEntry{Getting started with mverse}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
references:
- id: datasrc
  title: "Many analysts, one dataset: Making transparent how variations in analytical choices affect results"
  type: entry
  issued:
    year: 2014
    month: 4
    day: 24
  accessed:
    year: 2019
  URL: https://osf.io/gvm2z/
  author:
  - given: Raphael
    family: Silberzahn
  - given: "Eric Luis"
    family: Uhlmann
  - given: Dan
    family: Martin
  - given: Pasquale
    family: Anselmi
  - given: Frederik
    family: Aust
  - given: "Eli C."
    family: Awtrey
  - given: Štěpán
    family: Bahník
  - given: Feng
    family: Bai
  - given: Colin
    family: Bannard
  - given: Evelina
    family: Bonnier
  - given: Rickard
    family: Carlsson
  - given: Felix
    family: Cheung
  - given: Garret
    family: Christensen
  - given: Russ
    family: Clay
  - given: "Maureen A."
    family: Craig
  - given: Anna
    family: "Dalla Rosa"
  - given: Lammertjan
    family: Dam
  - given: "Mathew H."
    family: Evans
  - given: "Ismael Flores"
    family: Cervantes
  - given: Nathan
    family: Fong
  - given: Monica
    family: Gamez-Djokic
  - given: Andreas
    family: Glenz
  - given: Shauna
    family: Gordon-McKeon
  - given: Tim
    family: Heaton
  - given: "Karin Hederos"
    family: Eriksson
  - given: Moritz
    family: Heene
  - given: "Alicia Hofelich"
    family: Mohr
  - given: Kent
    family: Hui
  - given: Magnus
    family: Johannesson
  - given: Jonathan
    family: Kalodimos
  - given: Erikson
    family: Kaszubowski
  - given: Deanna
    family: Kennedy
  - given: Ryan
    family: Lei
  - given: "Thomas Andrew"
    family: Lindsay
  - given: Silvia
    family: Liverani
  - given: Christopher
    family: Madan
  - given: "Daniel C."
    family: Molden
  - given: Eric
    family: Molleman
  - given: "Richard D."
    family: Morey
  - given: Laetitia
    family: Mulder
  - given: "Bernard A."
    family: Nijstad
  - given: Bryson
    family: Pope
  - given: Nolan
    family: Pope
  - given: "Jason M."
    family: Prenoveau
  - given: Floor
    family: Rink
  - given: Egidio
    family: Robusto
  - given: Hadiya
    family: Roderique
  - given: Anna
    family: Sandberg
  - given: Elmar
    family: Schlueter
  - given: Felix
    family: S
  - given: "Martin F."
    family: Sherman
  - given: "S. Amy"
    family: Sommer
  - given: "Kristin Lee"
    family: Sotak
  - given: "Seth M."
    family: Spain
  - given: Christoph
    family: Spörlein
  - given: Tom
    family: Stafford
  - given: Luca
    family: Stefanutti
  - given: Susanne
    family: Täuber
  - given: Johannes
    family: Ullrich
  - given: Michelangelo
    family: Vianello
  - given: Eric-Jan
    family: Wagenmakers
  - given: Maciej
    family: Witkowiak
  - given: Sangsuk
    family: Yoon
  - given: "Brian A."
    family: Nosek
---

```{r setup, include = FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(collapse = TRUE, fig.width = 7, fig.height = 4)
```

## Simple Example: Transform and Summarise One Numeric Column

Suppose that we have a column `col1` that we wish to transform in three different ways, and that we want to compute summary statistics (count, mean, standard deviation, median, and interquartile range) of the column after each transformation.

```{r}
library(mverse)
library(tibble)
library(dplyr)
library(ggplot2)
set.seed(6)
df <- tibble(col1 = rnorm(5, 0, 1), col2 = col1 + runif(5))
```

### Step 1: `create_multiverse` of the data frame

```{r}
mv <- create_multiverse(df)
```

### Step 2: `mutate_branch` to transform `col1`

```{r}
# Step 2: create a branch - each option in the branch defines a universe
transformation_branch <- mutate_branch(col1 = col1,
                                       col1_t1 = log(abs(col1 + 1)),
                                       col1_t2 = abs(col1))
```

### Step 3: `add_mutate_branch` to `mv`

```{r}
mv <- mv |> add_mutate_branch(transformation_branch)
```

### Step 4: `execute_multiverse` to execute the transformations

```{r}
mv <- execute_multiverse(mv)
```

### Step 5: Extract Transformed Values from `mv`

Use `extract` to retrieve the transformed values as `df_transformed`. The result also includes a column that labels the transformation applied in each universe.

```{r echo=FALSE}
extract <- mverse::extract
```

```{r}
df_transformed <- extract(mv)
df_transformed |> head()
```

### Step 6: Use `tidyverse` to compute the summary and plot the distribution of each transformation (universe)

```{r}
df_transformed |>
  group_by(transformation_branch_branch) |>
  summarise(n = n(),
            mean = mean(transformation_branch),
            sd = sd(transformation_branch),
            median = median(transformation_branch),
            IQR = IQR(transformation_branch))

df_transformed |>
  ggplot(aes(x = transformation_branch)) +
  geom_histogram(bins = 3) +
  facet_wrap(vars(transformation_branch_branch))
```

## Simple Example: Using `mverse` to Fit Three Simple Linear Regressions of a Transformed Column

### Step 1: `create_multiverse` of the data frame

```{r}
mv1 <- create_multiverse(df)
```

### Step 2: Create `formula_branch` of the linear regression models

```{r}
formulas <- formula_branch(col2 ~ col1,
                           col2 ~ log(abs(col1 + 1)),
                           col2 ~ abs(col1))
```

### Step 3: `add_formula_branch` to multiverse of data frame

```{r}
mv1 <- mv1 |> add_formula_branch(formulas)
```

### Step 4: `lm_mverse` to compute linear regression models across the multiverse

```{r}
lm_mverse(mv1)
```

### Step 5: Use `summary` to extract regression output

```{r}
summary(mv1)
```

Let's compare using `mverse` to using `tidyverse` and base R to fit the three models.

One way to do this using `tidyverse` is to create a list of the model formulas and then map `lm` over the list.

```{r}
mod1 <- formula(col2 ~ col1)
mod2 <- formula(col2 ~ log(abs(col1 + 1)))
mod3 <- formula(col2 ~ abs(col1))
models <- list(mod1, mod2, mod3)
models |>
  purrr::map(lm, data = df) |>
  purrr::map(broom::tidy) |>
  bind_rows()
```

Using base R, we can use `lapply` instead of `purrr::map`.

```{r}
modfit <- lapply(models, function(x) lm(x, data = df))
# element 4 of each model summary is the coefficient table
lapply(modfit, function(x) summary(x)[4])
```

## Are Soccer Referees Biased?

In this example, we use a real dataset to demonstrate how `mverse` makes it easy to specify multiple definitions for a column and compare the results of the different definitions. We combine soccer player skin colour ratings by two independent raters (`rater1` and `rater2`) from the `soccer` dataset included in `mverse`.

The data comes from @datasrc and contains `r format(nrow(soccer), big.mark = ",")` rows of player-referee pairs. For each player, two independent raters coded their skin tones on a 5-point scale ranging from _very light skin_ (`0.0`) to _very dark skin_ (`1.0`). For the purpose of demonstration, we keep a single record per player and consider only players with both ratings.

```{r load, message=FALSE}
library(mverse)
soccer_bias <- soccer[!is.na(soccer$rater1) & !is.na(soccer$rater2),
                      c("playerShort", "rater1", "rater2")]
soccer_bias <- unique(soccer_bias)
head(soccer_bias)
```

We would like to study the distribution of the player skin tones, but the two independent ratings do not always match. To combine the two ratings, we may choose to consider the following options:

1. the mean numeric value
2. the darker rating of the two
3. the lighter rating of the two
4. the first rating only
5. the second rating only

## Analysis Using Base R and `tidyverse`

Let's first consider how you might study the five options using R without `mverse`. First, we define the five options as separate variables in R.

```{r base_r}
skin_option_1 <- (soccer_bias$rater1 + soccer_bias$rater2) / 2
skin_option_2 <- ifelse(soccer_bias$rater1 > soccer_bias$rater2,
                        soccer_bias$rater1, soccer_bias$rater2)
skin_option_3 <- ifelse(soccer_bias$rater1 < soccer_bias$rater2,
                        soccer_bias$rater1, soccer_bias$rater2)
skin_option_4 <- soccer_bias$rater1
skin_option_5 <- soccer_bias$rater2
```

We can plot a histogram to study the distribution of the resulting skin tone value for each option. Below is the histogram for the first option (`skin_option_1`).

```{r hist_base}
library(ggplot2)
ggplot(mapping = aes(x = skin_option_1)) +
  geom_histogram(breaks = seq(0, 1, 0.2), colour = "white") +
  labs(title = "Histogram of player skin tones (Option 1: Mean).",
       x = "Skin Tone", y = "Count")
```

For the remaining four options, we can repeat the step above to examine the distributions, or create a new data frame combining all five options to use in a ggplot as shown below. In both cases, the user has to set up the plots for all five options manually.

```{r hist_base_overlaid}
skin_option_all <- data.frame(
  x = c(skin_option_1, skin_option_2, skin_option_3,
        skin_option_4, skin_option_5),
  Option = rep(
    c("Option 1: Mean", "Option 2: Max", "Option 3: Min",
      "Option 4: Rater 1", "Option 5: Rater 2"),
    # one label per player for each option
    each = nrow(soccer_bias)
  )
)
ggplot(data = skin_option_all) +
  geom_histogram(aes(x = x), binwidth = 0.1) +
  labs(title = "Histogram of player skin tones for each option.",
       x = "Skin Tone", y = "Count") +
  facet_wrap(. ~ Option)
```

## Analysis Using `mverse`

### Branching Using `mverse`

We now turn to `mverse` to create the five options above. First, we define an `mverse` object with the dataset. Note that `mverse` assumes a single dataset for each multiverse analysis.

```{r create_mv}
soccer_bias_mv <- create_multiverse(soccer_bias)
```

A _branch_ in `mverse` refers to a set of alternative modelling or data wrangling decisions. For example, a mutate branch, analogous to the `mutate` method in `tidyverse`'s data manipulation grammar, lets you define a set of options for defining a new column in your dataset.

You can create a mutate branch with `mutate_branch()`. The syntax for defining the options inside `mutate_branch()` follows the `tidyverse`'s grammar as well.

```{r mutate_branch}
skin_tone <- mutate_branch(
  (rater1 + rater2) / 2,
  ifelse(rater1 > rater2, rater1, rater2),
  ifelse(rater1 < rater2, rater1, rater2),
  rater1,
  rater2
)
```

Then add the newly defined mutate branch to the `soccer_bias_mv` object using `add_mutate_branch()`.

```{r add_vb}
soccer_bias_mv <- soccer_bias_mv |> add_mutate_branch(skin_tone)
```

Adding a branch to an `mverse` object multiplies the number of environments defined inside the object so that the environments capture all unique analysis paths. Without any branches, an `mverse` object has a single environment. We call these environments _universes_. For example, adding the `skin_tone` mutate branch to `soccer_bias_mv` results in $1 \times 5 = 5$ universes inside `soccer_bias_mv`. In each universe, the analysis dataset now has a new column named `skin_tone`, the name of the mutate branch object.

You can check that the mutate branch was added with the `summary()` method for the `soccer_bias_mv` object. The method prints a _multiverse table_ that lists all universes defined in the object, with branches as columns and the corresponding options as values.

```{r check_multiverse}
summary(soccer_bias_mv)
```

At this point, the values of the new column `skin_tone` are only populated in the first universe. To populate the values for all universes, we call `execute_multiverse`.

```{r exec}
execute_multiverse(soccer_bias_mv)
```

### Summarizing the Distribution of Each Branch Option

In this section, we examine and compare the distributions of `skin_tone` values between different options. You can extract the values in each universe using `extract()`. By default, the method returns all columns created by a mutate branch across all universes. In this example, we only have one column, `skin_tone`.

```{r extract_multiverse}
branched <- mverse::extract(soccer_bias_mv)
```

`branched` is a dataset with `skin_tone` values. If we want to extract the `skin_tone` values that were computed using the average of the two raters, we can filter `branched` by `skin_tone_branch` values equal to `(rater1 + rater2) / 2`. Alternatively, we could filter by `universe == 1`.

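
The alternative just mentioned, filtering on `universe`, can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the `toy` data, the `tone` branch, and its two options are hypothetical and not part of the soccer analysis, and the sketch assumes, as described above, that the extracted data carries a `universe` column in which the first option is numbered `1`. The chunk is not evaluated when the vignette is built.

```{r universe_filter_sketch, eval=FALSE}
library(mverse)
library(dplyr)

# Hypothetical ratings, for illustration only.
toy <- data.frame(rater1 = c(0, 0.25, 0.5), rater2 = c(0.25, 0.25, 1))

# A mutate branch with two options: the mean rating and the first rating.
tone <- mutate_branch((rater1 + rater2) / 2, rater1)

# Build, branch, and execute the multiverse step by step.
toy_mv <- create_multiverse(toy)
toy_mv <- toy_mv |> add_mutate_branch(tone)
toy_mv <- execute_multiverse(toy_mv)

# Extract the branched column across all universes.
toy_branched <- mverse::extract(toy_mv)

# Keep only the rows that belong to the first universe.
first_universe <- toy_branched |> filter(universe == 1)
head(first_universe)
```

Filtering on `universe` avoids typing the option's expression as a string, at the cost of having to remember which number belongs to which option. Filtering on the branch label instead looks as follows.
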
```{r head_skin_tone}
branched |>
  filter(skin_tone_branch == "(rater1 + rater2) / 2") |>
  head()
```

The distribution of each method for calculating skin tone can be summarized by grouping on the levels of `skin_tone_branch`.

```{r}
branched |>
  group_by(skin_tone_branch) |>
  summarise(n = n(),
            mean = mean(skin_tone),
            sd = sd(skin_tone),
            median = median(skin_tone),
            IQR = IQR(skin_tone))
```

Selecting a random subset of rows is useful when the multiverse is large. The `frow` parameter in `extract()` provides the option to extract a random subset of rows in each universe. It takes a value between 0 and 1 that represents the fraction of values to extract from each universe. For example, setting `frow = 0.05` returns approximately 5% of values from each universe (i.e., each `skin_tone_branch` option in this case).

```{r extract_fraction}
frac <- extract(soccer_bias_mv, frow = 0.05)
```

Since the same fraction is drawn from each of the five universes, each universe makes up 20% of the random sample.

```{r}
frac |>
  group_by(universe) |>
  tally() |>
  mutate(percent = (n / sum(n)) * 100)
```

Finally, we can construct plots to compare the distributions of `skin_tone` in different universes. For example, you can overlay density lines on a single plot.

```{r compare_universe, warning=FALSE}
branched |>
  ggplot(mapping = aes(x = skin_tone, color = universe)) +
  geom_density(alpha = 0.2) +
  labs(title = "Density of player skin tones for each option.",
       x = "Skin Tone", y = "Density") +
  scale_color_discrete(
    labels = c("Option 1: Mean", "Option 2: Max", "Option 3: Min",
               "Option 4: Rater 1", "Option 5: Rater 2"),
    name = NULL
  )
```

Another option is to use `ggplot2`'s `facet_wrap` function to generate multiple plots in a grid. `facet_wrap(. ~ universe)` generates an individual plot for each universe.

```{r compare_universe_hist, warning=FALSE}
branched |>
  ggplot(mapping = aes(x = skin_tone)) +
  geom_histogram(position = "dodge", bins = 21) +
  labs(title = "Histogram of player skin tones for each option.",
       y = "Count", x = "Skin Tone") +
  facet_wrap(
    . ~ universe,
    labeller = labeller(
      universe = c(`1` = "Option 1: Mean", `2` = "Option 2: Max",
                   `3` = "Option 3: Min", `4` = "Option 4: Rater 1",
                   `5` = "Option 5: Rater 2")
    )
  )
```

## References