---
title: "Getting started with mverse"
output: rmarkdown::html_vignette
link-citations: yes
vignette: >
  %\VignetteIndexEntry{Getting started with mverse}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
references:
  - id: datasrc
    title: "Many analysts, one dataset: Making transparent how variations in anlytical choices affect results"
    type: entry
    issued:
      year: 2014
      month: 4
      day: 24
    accessed:
      year: 2019
    URL: https://osf.io/gvm2z/
    author:
    - given: Raphael 
      family: Silberzahn 
    - given: "Eric Luis" 
      family: Uhlmann 
    - given: Dan 
      family: Martin 
    - given: Pasquale 
      family: Anselmi 
    - given: Frederik 
      family: Aust 
    - given: "Eli C."
      family: Awtrey
    - given: Štěpán
      family: Bahník 
    - given: Feng
      family: Bai 
    - given: Colin
      family: Bannard
    - given: Evelina
      family: Bonnier
    - given: Rickard
      family: Carlsson
    - given: Felix
      family: Cheung
    - given: Garret
      family: Christensen
    - given: Russ
      family: Clay
    - given: "Maureen A."
      family: Craig
    - given: Anna 
      family: "Dalla Rosa"
    - given: Lammertjan
      family: Dam
    - given: "Mathew H."
      family: Evans
    - given: "Ismael Flores"
      family: Cervantes
    - given: Nathan
      family: Fong
    - given: Monica
      family: Gamez-Djokic
    - given: Andreas
      family: Glenz
    - given: Shauna
      family: Gordon-McKeon
    - given: Tim
      family: Heaton
    - given: "Karin Hederos" 
      family: Eriksson 
    - given: Moritz
      family: Heene
    - given: "Alicia Hofelich"
      family: Mohr 
    - given: Kent
      family: Hui
    - given: Magnus
      family: Johannesson
    - given: Jonathan
      family: Kalodimos
    - given: Erikson
      family: Kaszubowski
    - given: Deanna
      family: Kennedy
    - given: Ryan
      family: Lei
    - given: "Thomas Andrew"
      family: Lindsay
    - given: Silvia
      family: Liverani
    - given: Christopher
      family: Madan
    - given: "Daniel C."
      family: Molden 
    - given: Eric 
      family: Molleman
    - given: "Richard D."
      family: Morey
    - given: Laetitia
      family: Mulder
    - given: "Bernard A." 
      family: Nijstad
    - given: Bryson
      family: Pope
    - given: Nolan
      family: Pope
    - given: "Jason M."
      family: Prenoveau
    - given: Floor
      family: Rink
    - given: Egidio
      family: Robusto
    - given: Hadiya
      family: Roderique
    - given: Anna
      family: Sandberg
    - given: Elmar
      family: Schlueter
    - given: Felix
      family: S
    - given: "Martin F." 
      family: Sherman
    - given: "S. Amy"
      family: Sommer
    - given: "Kristin Lee"
      family: Sotak
    - given: "Seth M."
      family: Spain
    - given: Christoph
      family: Spörlein 
    - given: Tom
      family: Stafford
    - given: Luca
      family: Stefanutti
    - given: Susanne
      family: Täuber
    - given: Johannes
      family: Ullrich
    - given: Michelangelo 
      family: Vianello 
    - given: Eric-Jan 
      family: Wagenmakers 
    - given: Maciej 
      family: Witkowiak 
    - given: Sangsuk 
      family: Yoon 
    - given: Brian A. 
      family: Nosek
---

```{r setup, include = FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(collapse = TRUE, fig.width = 7, fig.height = 4)
```

## Simple Example: Transform and Summarise One Numeric Column

Suppose that we have a column `col1` that we wish to transform in three different ways and compute the five number summary of the column after the transformations.

```{r}
library(mverse)
library(tibble)
library(dplyr)
library(ggplot2)
set.seed(6)
df <- tibble(col1 = rnorm(5, 0, 1), col2 = col1 + runif(5))
```


### Step 1: `create_multiverse` of the data frame

```{r}
mv <- create_multiverse(df)
```


### Step 2: `mutate_branch` to transform `col1`

```{r}
# Step 2: create a branch - each branch corresponds to a universe

transformation_branch <- mutate_branch(col1 = col1,
                                       col1_t1 = log(abs(col1 + 1)),
                                       col1_t2 = abs(col1))
```


### Step 3: `add_mutate_branch` to `mv`

```{r}
mv <- mv |> add_mutate_branch(transformation_branch)
```


### Step 4: `execute_multiverse` to execute the transformations

```{r}
mv <- execute_multiverse(mv)
```

### Step 5: Extract Transformed Values from `mv`

`extract` to add the column to `df_transformed` that labels transformations.

```{r echo=FALSE}
extract <- mverse::extract
```

```{r}
df_transformed <- extract(mv)

df_transformed |> head()
```


### Step 5: use `tidyverse` to compute the summary and plot the distribution of each transformation (universe)

```{r}
df_transformed |>
  group_by(transformation_branch_branch) |>
  summarise(n = n(),
            mean = mean(transformation_branch),
            sd = sd(transformation_branch),
            median = median(transformation_branch),
            IQR = IQR(transformation_branch))

df_transformed |>
  ggplot(aes(x = transformation_branch)) + geom_histogram(bins = 3) +
  facet_wrap(vars(transformation_branch_branch))
```

## Simple Example: Using `mverse` to Fit Three Simple Linear Regression of a Transformed Column


### Step 1: `create_multiverse` of the data frame

```{r}
mv1 <- create_multiverse(df)
```


### Step 2: Create `formula_branch` of the linear regression models

```{r}
formulas <- formula_branch(col2 ~ col1,
                           col2 ~ log(abs(col1 + 1)),
                           col2 ~ abs(col1))
```


### Step 3: `add_formula_branch` to multiverse of data frame

```{r}
mv1 <- mv1 |> add_formula_branch(formulas)
```

### Step 3: `lm_mverse` to compute linear regression models across the multiverse

```{r}
lm_mverse(mv1)
```


### Step 4: Use `summary` to extract regression output

```{r}
summary(mv1)
```

Let's compare using `mverse` to using `tidyverse` and base R to fit the three models.

One way to do this using `tidyverse` is to create a list of the model formulas then map the list to `lm`.

```{r }
mod1 <- formula(col2 ~ col1)

mod2 <- formula(col2 ~ log(abs(col1 + 1)))

mod3 <- formula(col2 ~ abs(col1))

models <- list(mod1, mod2, mod3)

models |>
  purrr::map(lm, data = df) |>
  purrr::map(broom::tidy) |>
  bind_rows()
```

Using base R we can use `lappy` instead of 

```{r}
modfit <- lapply(models, function(x) lm(x, data = df))

lapply(modfit, function(x) summary(x)[4])
```

## Are Soccer Referees Biased?

In this example, we use a real dataset that demonstrates how `mverse` makes it easy to define multiple definitions for a column and compare the results of the different definitions. We combine soccer player skin colour ratings by two independent raters (`rater1` and `rater2`) from `soccer` dataset included in `mverse`. 

The data comes from @datasrc and contains `r format(nrow(soccer), big.mark = ",")` rows of player-referee pairs. For each player, two independent raters coded their skin tones on a 5-point scale ranging from _very light skin_ (`0.0`) to _very dark skin_ (`1.0`). For the purpose of demonstration, we only use a unique record per player and consider only those with both ratings.

```{r load, message=FALSE}
library(mverse)
soccer_bias <- soccer[!is.na(soccer$rater1) & !is.na(soccer$rater2),
                      c("playerShort", "rater1", "rater2")]
soccer_bias <- unique(soccer_bias)
head(soccer_bias)
```

We would like to study the distribution of the player skin tones but the two independent rating do not always match. To combine the two ratings, we may choose to consider the following options:

1.  the mean numeric value
2.  the darker rating of the two
3.  the lighter rating of the two
4.  the first rating only
5.  the second rating only

## Analysis using Base R and `Tidyverse`

Let's first consider how you might study the five options using R without `mverse`. First, we define the five options as separate variables in R.

```{r base_r}
skin_option_1 <- (soccer_bias$rater1 + soccer_bias$rater2) / 2
skin_option_2 <- ifelse(soccer_bias$rater1 > soccer_bias$rater2,
                        soccer_bias$rater1,
                        soccer_bias$rater2)
skin_option_3 <- ifelse(soccer_bias$rater1 < soccer_bias$rater2,
                        soccer_bias$rater1,
                        soccer_bias$rater2)
skin_option_4 <- soccer_bias$rater1
skin_option_5 <- soccer_bias$rater2
```

We can plot a histogram to study the distribution of the resulting skin tone value for each option. Below is the histogram for the first option (`skin_option_1`).

```{r hist_base}
library(ggplot2)
ggplot(mapping = aes(x = skin_option_1)) +
  geom_histogram(breaks = seq(0, 1, 0.2),
                 colour = "white") +
  labs(title = "Histogram of player skin tones (Option 1: Mean).",
       x = "Skin Tone", y = "Count")
```


For the remaining four options, we can repeat the step above to examine the distributions, or create a new data frame combining all five options to use in a ggplot as shown below. In both cases, users need to take care of plotting all five manually.

```{r hist_base_overlaid}
skin_option_all <- data.frame(
  x = c(skin_option_1,
        skin_option_2,
        skin_option_3,
        skin_option_4,
        skin_option_5),
  Option = rep(
    c("Option 1: Mean",
      "Option 2: Max",
      "Option 3: Min",
      "Option 4: Rater 1",
      "Option 5: Rater 2"),
    each = nrow(df)
  )
)
ggplot(data = skin_option_all) +
  geom_histogram(aes(x = x), binwidth = 0.1) +
  labs(title = "Histogram of player skin tones for each option.",
       x = "Skin Tone", y = "Count") +
  facet_wrap(. ~ Option)
```

## Analysis Using `mverse`

### Branching Using `mverse`

We now turn to `mverse` to create the five options above. First, we define an `mverse` object with the dataset. Note that `mverse` assumes a single dataset for each multiverse analysis.

```{r create_mv}
soccer_bias_mv <- create_multiverse(soccer_bias)
```

A _branch_ in `mverse` refers to different modelling or data wrangling decisions.  For example, a mutate branch - analogous to `mutate` method in `tidyverse`'s data manipulation grammar, lets you define a set of options for defining a new column in your dataset. 

You can create a mutate branch with `mutate_branch()`. The syntax for defining the options inside `mutate_branch()` follows the `tidyverse`'s grammar as well. 

```{r mutate_branch}
skin_tone <- mutate_branch(
  (rater1 + rater2) / 2,
  ifelse(rater1 > rater2, rater1, rater2),
  ifelse(rater1 < rater2, rater1, rater2),
  rater1,
  rater2
)
```

Then add the newly defined mutate branch to the `mv` object using `add_mutate_branch()`. 

```{r add_vb}
soccer_bias_mv <- soccer_bias_mv |> add_mutate_branch(skin_tone)
```

Adding a branch to a `mverse` object multiplies the number of environments defined inside the object so that the environments capture all unique analysis paths. Without any branches, a `mverse` object has a single environment. We call these environments _universes_. For example, adding the `skin_tone` mutate branch to `mv` results in $1 \times 5 = 5$  universes inside `mv`. In each universe, the analysis dataset now has a new column named `skin_tone` -  the name of the mutate branch object. 

You can check that the mutate branch was added with `summary()` method for the `mv` object. The method prints a _multiverse table_ that lists all universes with branches as columns and corresponding options as values defined in the `mv` object.

```{r check_multiverse}
summary(soccer_bias_mv)
```

At this point, the values of the new column `skin_tone` are only populated in the first universe. To populate the values for all universes, we call `execute_multiverse`.

```{r exec}
execute_multiverse(soccer_bias_mv)
```

### Summarizing The Distribution Of Each Branch Option

In this section, we now examine and compare the distributions of `skin_tone` values between different options. You can extract the values in each universe using `extract()`. By default, the method returns all columns created by a mutate branch across all universes. In this example, we only have one column - `skin_tone`.

```{r extract_multiverse}
branched <- mverse::extract(soccer_bias_mv)
```

`branched` is a dataset with `skin_tone` values. If we want to extract the `skin_tone` values that were computed using the average of the two raters then we can filter `branched` by `skin_tone_branch` values equal to `(rater1 + rater2) / 2`.  Alternatively, we could filter by `universe == 1`.

```{r head_skin_tone}
branched |>
  filter(skin_tone_branch == "(rater1 + rater2) / 2") |>
  head()
```

The distribution of each method for calculating skin tone can be computed by grouping the levels of `skin_tone_branch`.

```{r}
branched |>
  group_by(skin_tone_branch) |>
  summarise(n = n(),
            mean = mean(skin_tone),
            sd = sd(skin_tone),
            median = median(skin_tone),
            IQR = IQR(skin_tone))
```


Selecting a random subset of rows data is useful when the multiverse is large.  The `frow` parameter in `extract()` provides the option to extract a random subset of rows in each universe. It takes a value between 0 and 1 that represent the fraction of values to extract from each universe. For example, setting `frow = 0.05` returns approximately 5\% of values from each universe (i.e., `skin_tone_branch` in this case).

```{r extract_fraction}
frac <- extract(soccer_bias_mv, frow =  0.05)
```

So, each universe is a 20% of the random sample.

```{r}
frac |>
  group_by(universe) |>
  tally() |>
  mutate(percent = (n / sum(n)) * 100)
```


Finally, we can construct plots to compare the distributions of `skin_tone` in different universes. For example, you can overlay density lines on a single plot.

```{r compare_universe, warning=FALSE}
branched |>
  ggplot(mapping = aes(x = skin_tone, color = universe)) +
  geom_density(alpha = 0.2) +
  labs(title = "Density of player skin tones for each option.",
       x = "Skin Tone", y = "Density") +
  scale_color_discrete(
    labels = c("Option 1: Mean",
               "Option 2: Max",
               "Option 3: Min",
               "Option 4: Rater 1",
               "Option 5: Rater 2"),
    name = NULL
  )
```

Another option is the use `ggplot`'s `facet_grid` function to generate multiple plots in a grid. `facet_wrap(. ~ universe)` generates individual plots for each universe.

```{r compare_universe_hist, warning=FALSE}
branched |>
  ggplot(mapping = aes(x = skin_tone)) +
  geom_histogram(position = "dodge", bins = 21) +
  labs(title = "Histogram of player skin tones for each option.",
       y = "Count", x = "Skin Tone") +
  facet_wrap(
    . ~ universe,
    labeller = labeller(
      universe = c(`1` = "Option 1: Mean",
                   `2` = "Option 2: Max",
                   `3` = "Option 3: Min",
                   `4` = "Option 4: Rater 1",
                   `5` = "Option 5: Rater 2")
    )
  )
```

## References