--- title: "Joins in Production" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Joins in Production} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) library(joinspy) ``` At the console, we see the wrong row count, fix the key, and re-run. In a scheduled script, the same issue goes unnoticed until someone looks at the output. This vignette wires joinspy into automated pipelines so that join problems surface as errors or log entries. Every example uses synthetic data and `tempfile()` paths, so the whole thing runs end to end. ## Assertions with key_check() `key_check()` returns a single logical -- `TRUE` if no issues were detected, `FALSE` otherwise. The simplest assertion wraps it in `stopifnot()`: ```{r} orders <- data.frame( customer_id = c("C01", "C02", "C03", "C04"), amount = c(100, 200, 150, 300), stringsAsFactors = FALSE ) customers <- data.frame( customer_id = c("C01", "C02", "C03", "C04"), region = c("East", "West", "East", "North"), stringsAsFactors = FALSE ) # This passes -- keys are clean stopifnot(key_check(orders, customers, by = "customer_id", warn = FALSE)) ``` When the keys have problems, the script halts: ```{r error = TRUE} orders_dirty <- data.frame( customer_id = c("C01", "C02 ", "C03 ", "C04"), amount = c(100, 200, 150, 300), stringsAsFactors = FALSE ) stopifnot(key_check(orders_dirty, customers, by = "customer_id", warn = FALSE)) ``` With `warn = FALSE`, the printed diagnostics are suppressed -- in a cron job or CI pipeline, we want the script to fail hard rather than print warnings. An explicit `if`/`stop` gives us a custom error message: ```{r error = TRUE} if (!key_check(orders_dirty, customers, by = "customer_id", warn = FALSE)) { stop("Key quality check failed for orders-customers join. ", "Run join_spy() interactively for details.", call. 
       = FALSE)
}
```

We can also chain the assertion with a repair step -- run `key_check()` first, repair on failure, then re-check:

```{r}
ok <- key_check(orders_dirty, customers, by = "customer_id", warn = FALSE)

if (!ok) {
  repaired <- join_repair(
    orders_dirty, customers,
    by = "customer_id",
    trim_whitespace = TRUE,
    remove_invisible = TRUE
  )
  orders_clean <- repaired$x
  customers_clean <- repaired$y

  # Re-check after repair
  stopifnot(key_check(orders_clean, customers_clean,
                      by = "customer_id", warn = FALSE))
}
```

`key_check()` is a binary pass/fail gate; `join_spy()` builds a full report with match rates, expected row counts, and categorized issues. In production, `key_check()` runs on every execution. `join_spy()` is what we reach for when an assertion fails and we need to understand why.

## Silent Joins in Pipelines

The `*_join_spy()` wrappers print diagnostic output by default. In a scheduled script, `.quiet = TRUE` suppresses all printed output while still computing the report internally.

```{r}
sensors <- data.frame(
  sensor_id = c("S01", "S02", "S03", "S04"),
  location = c("Roof", "Basement", "Lobby", "Garage"),
  stringsAsFactors = FALSE
)

readings <- data.frame(
  sensor_id = c("S01", "S02", "S03", "S05"),
  temperature = c(22.1, 18.5, 21.0, 19.3),
  stringsAsFactors = FALSE
)

# Nothing printed
result <- left_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
```

The report is still available via `last_report()`:

```{r}
rpt <- last_report()
rpt$match_analysis$match_rate
```

The join runs silently; later, we pull the report and check its contents programmatically:

```{r}
rpt <- last_report()
if (rpt$match_analysis$match_rate < 0.95) {
  warning(sprintf(
    "Low match rate (%.1f%%) in sensor join -- check for missing sensor IDs",
    rpt$match_analysis$match_rate * 100
  ))
}
```

The report object is a plain list, so standard R subsetting works for arbitrarily complex validation logic.

One caveat: `last_report()` stores only the most recent report.
If a script performs three joins in sequence, only the third report survives. To retain earlier reports, capture them explicitly:

```{r}
result1 <- left_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
report1 <- last_report()

# ... later ...
result2 <- inner_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
report2 <- last_report()
```

## Cardinality Guards

A join that was one-to-one in development can become many-to-many in production when upstream data changes. `join_strict()` enforces a cardinality constraint and throws an error if it is violated.

In development, we use `detect_cardinality()` to understand the actual relationship:

```{r}
products <- data.frame(
  product_id = c("P1", "P2", "P3"),
  name = c("Widget", "Gadget", "Gizmo"),
  stringsAsFactors = FALSE
)

line_items <- data.frame(
  product_id = c("P1", "P1", "P2", "P3", "P3"),
  order_id = c(101, 102, 103, 104, 105),
  stringsAsFactors = FALSE
)

detect_cardinality(products, line_items, by = "product_id")
```

One-to-many: each product appears once in `products` but can appear multiple times in `line_items`. We encode that expectation in production:

```{r}
result <- join_strict(
  products, line_items,
  by = "product_id",
  type = "left",
  expect = "1:m"
)
nrow(result)
```

If someone loads a `products` table with duplicate product IDs, the script fails immediately:

```{r error = TRUE}
products_bad <- data.frame(
  product_id = c("P1", "P1", "P2", "P3"),
  name = c("Widget", "Widget v2", "Gadget", "Gizmo"),
  stringsAsFactors = FALSE
)

join_strict(
  products_bad, line_items,
  by = "product_id",
  type = "left",
  expect = "1:m"
)
```

The four cardinality levels:

- **1:1** -- lookup to lookup. Each key appears exactly once on both sides.
- **1:m** -- reference on the left, transactions on the right (products to line items, stations to hourly readings).
- **m:1** -- transactions on the left, lookup on the right (sales joined to a region table).
- **m:m** -- duplicates on both sides.
  Almost always a bug; requiring an explicit `expect = "m:m"` acts as a speed bump.

In practice, `"1:m"` and `"m:1"` cover most production joins. `detect_cardinality()` confirms the relationship during development; the `expect` value is then hard-coded in the production script.

`check_cartesian()` solves a different problem: it warns about Cartesian product explosion when a key has many duplicates on *both* sides. A join can violate a `"1:1"` constraint without triggering a Cartesian explosion (one extra duplicate is enough), and a `"m:m"` join can produce a massive product that `join_strict()` would allow. The two functions complement each other.

## Logging and Audit Trails

### Manual logging

`log_report()` writes a single report to a file. The format depends on the file extension:

```{r}
report <- join_spy(sensors, readings, by = "sensor_id")

# Text format -- human-readable
txt_log <- tempfile(fileext = ".log")
log_report(report, txt_log)
cat(readLines(txt_log), sep = "\n")
unlink(txt_log)
```

```{r}
# JSON format -- machine-readable
json_log <- tempfile(fileext = ".json")
log_report(report, json_log)
cat(readLines(json_log), sep = "\n")
unlink(json_log)
```

Text format works for tailing logs during a batch run; JSON format feeds into monitoring systems or downstream scripts. Reports can also be saved as `.rds` files, which preserves the full R object for later interactive inspection.

### Automatic logging

For scripts with many joins, `set_log_file()` at the top is cleaner than calling `log_report()` after each one. Every subsequent `*_join_spy()` call appends its report to the file.

```{r}
auto_log <- tempfile(fileext = ".log")
set_log_file(auto_log, format = "text")

# These joins are automatically logged
result1 <- left_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
result2 <- inner_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)

# Check what got logged
cat(readLines(auto_log), sep = "\n")

# Clean up
set_log_file(NULL)
unlink(auto_log)
```

Automatic logging only triggers from `*_join_spy()` wrappers. `join_strict()` and bare `merge()` calls are not logged -- the wrappers are the instrumented path. To combine cardinality enforcement with logging, run `detect_cardinality()` as a separate check and use a `*_join_spy()` wrapper for the actual join.

`get_log_file()` returns the current log path (or `NULL` if logging is disabled):

```{r}
# Only act if logging is configured
if (!is.null(get_log_file())) {
  message("Logging is active at: ", get_log_file())
}
```

## Sampling for Large Datasets

The `sample` parameter in `join_spy()` runs the analysis on a random subset while the actual join (via a `*_join_spy()` wrapper) still operates on the full data.

```{r}
# Simulate a large dataset
set.seed(42)
big_orders <- data.frame(
  customer_id = sample(paste0("C", sprintf("%04d", 1:5000)), 50000, replace = TRUE),
  amount = round(runif(50000, 10, 500), 2),
  stringsAsFactors = FALSE
)

big_customers <- data.frame(
  customer_id = paste0("C", sprintf("%04d", 1:6000)),
  region = sample(c("North", "South", "East", "West"), 6000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Full analysis
system.time(report_full <- join_spy(big_orders, big_customers, by = "customer_id"))

# Sampled analysis
system.time(report_sampled <- join_spy(big_orders, big_customers,
                                       by = "customer_id", sample = 5000))
```

The sampled report is approximate -- match rates and duplicate counts are estimated from the subset. For production monitoring, we typically care whether the match rate is roughly 95% or roughly 60%, not whether it is 94.7% or 95.1%.
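Because the report object is a plain list, a coarse gate on the sampled numbers needs nothing beyond base R. A minimal sketch, assuming only the `match_analysis$match_rate` field shown earlier -- the `check_match_rate()` helper and its default floor are illustrative, not part of joinspy:

```{r}
# Hypothetical helper: warn when a (possibly sampled) report's
# match rate drops below a chosen floor; returns TRUE/FALSE invisibly.
check_match_rate <- function(report, floor = 0.9) {
  rate <- report$match_analysis$match_rate
  if (is.null(rate) || is.na(rate)) {
    stop("Report carries no match rate -- was the analysis run?", call. = FALSE)
  }
  if (rate < floor) {
    warning(sprintf("Match rate %.1f%% is below the %.1f%% floor",
                    rate * 100, floor * 100), call. = FALSE)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}
```

With sampling in play, the floor should leave slack for sampling noise; at 5,000 sampled rows a 90% floor against a true rate near 95% is comfortable.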

Sampling catches systemic problems (wrong key column, widespread encoding issues, duplicate explosion) with a fraction of the runtime.

Sampling can miss rare issues. If 0.1% of keys have a zero-width space, a 5,000-row sample from a 10-million-row table might not include any. Running full diagnostics periodically (weekly, or when the upstream source changes) alongside sampled daily runs covers both speed and thoroughness.

## A Complete Production Pattern

Here is a realistic production workflow: a nightly job loads order and customer data, validates keys, repairs if needed, joins with cardinality enforcement, and logs everything.

```{r}
# ============================================================
# Nightly order enrichment pipeline
# ============================================================

# --- Setup logging ---
pipeline_log <- tempfile(fileext = ".log")
set_log_file(pipeline_log, format = "text")

# --- Load data (simulated) ---
orders <- data.frame(
  order_id = 1:6,
  customer_id = c("C001", "C002 ", "C003", "C003", "C004", "C005"),
  amount = c(150, 230, 89, 410, 320, 175),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  customer_id = c("C001", "C002", "C003", "C004", "C005", "C006"),
  name = c("Acme Corp", "Globex", "Initech", "Umbrella", "Soylent", "Wonka"),
  tier = c("gold", "silver", "gold", "bronze", "silver", "gold"),
  stringsAsFactors = FALSE
)

# --- Gate 1: key quality assertion ---
keys_ok <- key_check(orders, customers, by = "customer_id", warn = FALSE)

if (!keys_ok) {
  message("Key issues detected -- attempting repair")
  repaired <- join_repair(
    orders, customers,
    by = "customer_id",
    trim_whitespace = TRUE,
    remove_invisible = TRUE
  )
  orders <- repaired$x
  customers <- repaired$y
}

# --- Gate 2: cardinality check ---
card <- detect_cardinality(orders, customers, by = "customer_id")
if (card == "m:m") {
  set_log_file(NULL)
  unlink(pipeline_log)
  stop("Unexpected m:m cardinality in orders-customers join", call.
       = FALSE)
}

# --- Join (with auto-logging via *_join_spy) ---
enriched <- left_join_spy(orders, customers, by = "customer_id", .quiet = TRUE)

# --- Gate 3: row count sanity check ---
# A left join should never lose rows from the left table
if (nrow(enriched) < nrow(orders)) {
  set_log_file(NULL)
  unlink(pipeline_log)
  stop("Row count decreased after left join -- possible data corruption",
       call. = FALSE)
}

# --- Output ---
message(sprintf("Pipeline complete: %d enriched orders", nrow(enriched)))
head(enriched)

# --- Review the log ---
if (file.exists(pipeline_log)) {
  cat(readLines(pipeline_log), sep = "\n")
}

# --- Cleanup ---
set_log_file(NULL)
unlink(pipeline_log)
```

The three gates catch different failure modes: `key_check()` catches string-level problems and attempts repair, `detect_cardinality()` halts on unexpected many-to-many relationships, and the row count check guards against anything the first two gates missed. Logging runs throughout because `set_log_file()` was called at the top.

The structure scales -- more tables, more gates, more joins, same pattern.
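Under a scheduler, one more layer is common: a wrapper that records the failure before the process exits non-zero, so a crashed run leaves a trace even when the pipeline's own log was cleaned up. A minimal base-R sketch -- `run_gated()` and its `notify` argument are illustrative names, with `notify` standing in for whatever alerting the environment provides:

```{r}
# Run a pipeline body; on error, emit a timestamped notification and
# re-raise so Rscript exits with a non-zero status.
run_gated <- function(body, notify = message) {
  tryCatch(
    body(),
    error = function(e) {
      notify(sprintf("[%s] pipeline failed: %s",
                     format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                     conditionMessage(e)))
      stop(e)  # re-raise for the scheduler
    }
  )
}

# Usage: run_gated(function() { ...gates and join from above... })
```

Because the error is re-raised rather than swallowed, the scheduler still sees the failure; the notification is in addition to, not instead of, the non-zero exit.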