---
title: "Processing the output of outbreaker2 with o2ools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{my-vignette-name}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  out.width = "100%",
  fig.height = 7,
  dpi = 300
)
```

```{r setup}
library(outbreaker2)
library(o2ools)
```

# Introduction

The *o2ools* package provides helper functions to process and summarise the results from [outbreaker2](https://www.repidemicsconsortium.org/outbreaker2/).\

Below we use the default `fake_outbreak` example in the *outbreaker2* vignette.

```{r inputs}
# The input data 
head(linelist)

# outbreaker2 result
out
```

## Identify

By default *outbreaker2* will return a dataframe where each case is referred to by an integer based on the order of the cases in the linelist. The columns of this dataframe are named according to the parameters of interest. For example, `t_inf_1` refers to the infection time of case 1, `alpha_1` refers to the ancestor of case 1, etc.

To swap-in our real case IDs, we use `identify()`:

```{r}
out_id <- identify(out, ids = linelist$name)
head(out_id)
```

## Get individual trees

To extract the individual trees from the posterior sample, we use `get_trees()`. This function will return a list of trees, one for each sample in the posterior. You can add additional *outbreaker2* results to the trees, such as the infection times (`t_inf`), the number of generations (`kappa`) and other attributes from the linelist.

```{r}
trees <- get_trees(out_id, kappa = TRUE, t_inf = TRUE, group = linelist$group)
#100th sample
head(trees[[100]])
```

## Performance

### Accuracy

Accuracy is the proportion of cases that are correctly identified across the posterior sample. **You can only use this function if you have the true ancestry information for each case**.

```{r}
true_tree <- data.frame(
  from = as.character(outbreaker2::fake_outbreak$ances),
  to   = linelist$id,
  stringsAsFactors = FALSE
)

accuracy <- get_accuracy(out, true_tree)
hist(accuracy)
```

### Entropy

Entropy measures the uncertainty in a case's ancestry. The more potential ancestors a case has, the greater its entropy. By default, entropy is bounded between 0 and 1, where 0 indicates no uncertainty about a case's ancestry, and 1 indicates maximum uncertainty.

```{r}
entropy <- get_entropy(out_id)
barplot(
  entropy,
  horiz = TRUE,
  las = 1
  )

```

## Visualisation
### Augment the linelist with *outbreaker2*'s results

To add the results from *outbreaker2* to the linelist, we can use `augment_linelist()`. This function will add summary statistics for each case, such as the infection times inferred by *outbreaker2*.

```{r}
# augment linelist with summaries of t_inf and kappa
o2linelist <- augment_linelist(
  out, linelist,
  params  = c("t_inf","kappa"),
  summary_fns = list(
    mean = function(x) mean(x, na.rm = TRUE),
    q25  = function(x) quantile(x, .25, na.rm = TRUE),
    q75  = function(x) quantile(x, .75, na.rm = TRUE)
  )
)
head(o2linelist)
```

### Consensus tree

A consensus tree represents, for each case, their most frequent ancestor across the posterior sample. It provides a useful summary of the results but may violate the strict topological constraints of a tree, as it can include cycles (e.g. A is the most frequent ancestor of B, but B is also the most frequent ancestor of A). Here, `frequency` refers to the proportion of posterior samples in which the infector–infectee pair appears.

```{r}
consensus_tree <- get_consensus(out)
head(consensus_tree)
```

We can plot the consensus tree using the augmented linelist and the *epicontacts* package.

```{r}
library(epicontacts)
epi <- make_epicontacts(
  linelist = o2linelist,
  contacts = subset(consensus_tree, !is.na(from)),
  #remove NA edges (i.e. imports)
  directed = TRUE
)
# plot(epi) basic plot

# colour setting
epi$linelist$group <- factor(epi$linelist$group, levels = c("hcw", "patient"))
my_pal <- function(n) {
  c("orange", "purple")[1:n]
}

vis_epicontacts(
  epi,
  thin = FALSE, #show imports
  label = "name",
  node_shape = "group",
  shapes = c("hcw" = "user-md", "patient" = "bed"), # https://fontawesome.com/v4/icons/
  node_color = "group",
  edge_arrow = "to",
  col_pal       = my_pal,
  edge_col_pal  = my_pal,
)


```

Or we can convert the tree to an *igraph*/*tidygraph* object and plot it using the *ggraph* package.

```{r}
library(igraph)
library(tidygraph)
library(ggraph)

g <- epicontacts:::as.igraph.epicontacts(epi) |>
  as_tbl_graph()

layout_data <- create_layout(g, layout = 'kk')
layout_data$x <- layout_data$onset

p <- ggraph(layout_data)+
  geom_edge_link(
    aes(
      edge_width = frequency,
      color = .N()$group[from] # .N() to accesses node data
    ),
    arrow = arrow(length = unit(2.5, 'mm')),
    end_cap = circle(3, 'mm')
  ) +
  geom_node_point(
    aes(fill = group), 
    shape = 21,
    colour = "black",
    size = 5
  ) +
  geom_node_text(
    aes(label = epicontacts_name),
    repel = TRUE, size = 3, nudge_y = 0.1, bg.color = "white", bg.r = 0.1
  ) +
  scale_edge_width("Posterior support",
                   range = c(0.1, 1),
                   breaks = c(0.8, 0.9, 1)) +
  scale_fill_manual(
    "",
    values = c("hcw" = "orange", "patient" = "purple"),
    breaks = c("hcw", "patient")
  ) +
  scale_edge_colour_manual(
    "",
    values = c("hcw" = "orange", "patient" = "purple"),
    breaks = c("hcw", "patient")
  )+
  theme_bw() +
  labs(x = "Onset date",
       y = "")+
  theme(
    axis.line.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "bottom"
  )

print(p)
```

### Maximum posterior tree

The maximum posterior tree is the single most likely transmission tree, corresponding to the MCMC iteration with the highest posterior probability. You can extract it as follows:

```{r}
max_post <- trees[[which.max(out$post)]]

#Same as before, create and plot an epicontacts object
epi <- make_epicontacts(
  linelist = o2linelist,
  contacts = subset(max_post, !is.na(from)), #remove NA edges (i.e. imports)
  directed = TRUE
)
```

You can also use `filter_chain()` to restrict the output to "trees" that meet a specific condition. For instance, to keep only direct transmissions (i.e. `kappa == 1`), run:

```{r}
out_direct <- filter_chain(out, param = "kappa", thresh = 1, comparator = "==", target = "alpha")
```
This **masks** any transmission links involving intermediate cases, allowing you to focus on direct infector–infectee pairs.

# Additional analyses
For additional analyses check the other vignettes on the package's website.