---
title: "Introduction to mlstm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to mlstm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(123)
```

# Overview

`mlstm` provides tools for fitting:

- Latent Dirichlet Allocation (LDA)
- Supervised Topic Models (STM)
- Multi-output supervised topic models (MLSTM)

This vignette shows a minimal end-to-end workflow using simulated data.

# Simulated corpus

We generate a small document-term representation in triplet form. Each row of `count` is `(d, v, c)`, where:

- `d`: document index (0-based)
- `v`: vocabulary index (0-based)
- `c`: token count

```{r}
library(mlstm)

D <- 50
V <- 200
K <- 5
NZ_per_doc <- 20
NZ <- D * NZ_per_doc

count <- cbind(
  d = as.integer(rep(0:(D - 1), each = NZ_per_doc)),
  v = as.integer(sample.int(V, NZ, replace = TRUE) - 1L),
  c = as.integer(rpois(NZ, 3) + 1L)
)

Y <- cbind(
  y1 = rnorm(D),
  y2 = rnorm(D)
)

dim(count)
head(count)
dim(Y)
```

# LDA

We first fit an unsupervised LDA model.

```{r}
mod_lda <- run_lda_gibbs(
  count = count,
  K = K,
  alpha = 0.1,
  beta = 0.01,
  n_iter = 20,
  verbose = FALSE
)

str(mod_lda$theta)
str(mod_lda$phi)
```

The output typically includes:

- `theta`: document-topic proportions
- `phi`: topic-word distributions
- additional trace information, depending on the implementation

# STM

Next, we fit a supervised topic model using a single response variable.

```{r}
y <- Y[, 1]

set_threads(2)

mod_stm <- run_stm_vi(
  count = count,
  y = y,
  K = K,
  alpha = 0.1,
  beta = 0.01,
  max_iter = 50,
  min_iter = 10,
  verbose = FALSE
)

y_hat <- ((mod_stm$nd / mod_stm$ndsum) %*% mod_stm$eta)[, 1]
cor(y, y_hat)
```

If available in the returned object, you can also inspect optimization traces such as the ELBO:

```{r, eval = FALSE}
plot(mod_stm$elbo_trace, type = "l")
plot(mod_stm$label_loglik_trace, type = "l")
```

# MLSTM

Finally, we fit a multi-output supervised topic model.
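As in the STM section, fitted responses are formed by multiplying the expected topic proportions `nd / ndsum` by the coefficient matrix `eta`. The helper below makes that construction explicit; it is illustrative only (not part of `mlstm`) and assumes the fitted object stores `nd` as a D x K topic-count matrix, `ndsum` as a length-D vector of document totals, and `eta` as a K x M coefficient matrix.

```{r}
# Illustrative helper (not part of mlstm): expected responses from a
# fitted object's topic counts and regression coefficients.
fitted_responses <- function(fit) {
  theta_hat <- fit$nd / fit$ndsum  # D x K expected topic proportions
  theta_hat %*% fit$eta            # D x M fitted responses
}
```

Dividing a D x K matrix by a length-D vector recycles column-wise in R, so each row `d` is scaled by `ndsum[d]`, matching the expression used in the chunks above and below.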
```{r}
mu <- rep(0, K)
upsilon <- K + 2
Omega <- diag(K)

mod_mlstm <- run_mlstm_vi(
  count = count,
  Y = Y,
  K = K,
  alpha = 0.1,
  beta = 0.01,
  mu = mu,
  upsilon = upsilon,
  Omega = Omega,
  max_iter = 50,
  min_iter = 10,
  verbose = FALSE
)

Y_hat <- (mod_mlstm$nd / mod_mlstm$ndsum) %*% mod_mlstm$eta
diag(cor(Y, Y_hat))  # per-response correlation between observed and fitted values
```

As with STM, you can inspect fitting diagnostics if they are stored in the returned object.

```{r, eval = FALSE}
plot(mod_mlstm$elbo_trace, type = "l")
plot(mod_mlstm$label_loglik_trace, type = "l")
```

# Notes

For package checks and documentation builds, it is best to keep examples and vignettes lightweight:

- use small synthetic datasets
- keep the number of iterations modest
- avoid verbose console output

This makes the vignette suitable for local builds, GitHub, and CRAN workflows.

# Session info

```{r}
sessionInfo()
```