---
title: "Validation and Sharing Indexes"
output:
  litedown::html_format:
    meta:
      css: ["@default"]
---

<!--
%\VignetteEngine{litedown::vignette}
%\VignetteIndexEntry{Validation and Sharing Indexes}
%\VignetteEncoding{UTF-8}
-->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(bigANNOY.progress = FALSE)
set.seed(20260326)
```

Persisted indexes are most useful when they can be reopened safely later or
shared with collaborators without guessing how they were created.

`bigANNOY` v3 addresses that problem with two ideas:

- each Annoy index has a sidecar metadata file
- persisted indexes can be checked with `annoy_validate_index()` before use

This vignette focuses on those operational safeguards.

## Load the Packages

```{r}
library(bigANNOY)
library(bigmemory)
```

## Create a Small Persisted Example

We will build a small Euclidean Annoy index and keep all of its files inside a
temporary working directory.

```{r}
share_dir <- tempfile("bigannoy-share-")
dir.create(share_dir, recursive = TRUE, showWarnings = FALSE)

ref_dense <- matrix(
  c(
    0.0, 0.0,
    1.0, 0.0,
    0.0, 1.0,
    1.0, 1.0
  ),
  ncol = 2,
  byrow = TRUE
)

ref_big <- as.big.matrix(ref_dense)
index_path <- file.path(share_dir, "ref.ann")

index <- annoy_build_bigmatrix(
  ref_big,
  path = index_path,
  n_trees = 20L,
  metric = "euclidean",
  seed = 77L,
  load_mode = "lazy"
)

index
```

At this point the key persisted assets are:

- the Annoy index file at `index$path`
- the sidecar metadata file at `index$metadata_path`

## What the Metadata Records

The metadata file is a small DCF document that records enough information to
make later reopen and validation steps safer.

```{r}
metadata <- read.dcf(index$metadata_path)
metadata[, c(
  "metadata_version",
  "package_version",
  "annoy_version",
  "index_id",
  "metric",
  "n_dim",
  "n_ref",
  "n_trees",
  "build_seed",
  "build_threads",
  "build_backend",
  "file_size",
  "file_mtime",
  "file_md5",
  "load_mode",
  "index_file"
)]
```

The most important fields operationally are:

- `metric`, `n_dim`, and `n_ref`, which describe what the index represents
- `file_size`, `file_mtime`, and `file_md5`, which record the size,
  modification time, and MD5 checksum of the Annoy file
- `index_file`, which records the expected basename of the `.ann` file
- `index_id`, which gives the persisted artifact a stable identifier
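
The file-signature fields can be reproduced with base R, which is handy when
you want to spot-check a file without loading the index. The sketch below is a
minimal illustration on a stand-in file, not the package's own validation code.
The helper `file_signature()` is hypothetical, and the sketch assumes that
`file_size` is a byte count and that `file_md5` is the hex digest returned by
`tools::md5sum()`.

```{r}
# Hypothetical helper: compute the size and MD5 digest of a file with base R.
# Assumption: the sidecar stores the same representation as these two values.
file_signature <- function(path) {
  c(
    file_size = as.character(file.size(path)),
    file_md5 = unname(tools::md5sum(path))
  )
}

# Stand-in for an index file, plus a sidecar written in DCF format.
demo_file <- tempfile(fileext = ".ann")
writeBin(as.raw(1:16), demo_file)
demo_meta <- tempfile(fileext = ".meta")
write.dcf(as.data.frame(as.list(file_signature(demo_file))), file = demo_meta)

# Later: re-read the sidecar and compare it with the file on disk.
recorded <- read.dcf(demo_meta)[1L, ]
current <- file_signature(demo_file)
identical(unname(recorded[names(current)]), unname(current))
```

If the comparison is `FALSE`, the file on disk no longer matches what the
sidecar describes.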

## Validate Before You Use a Persisted Index

The safest default is to validate a reopened or long-lived index before using
it for important downstream work.

```{r}
validation <- annoy_validate_index(
  index,
  strict = TRUE,
  load = TRUE
)

validation$valid
validation$checks[, c("check", "passed", "severity")]
```

With `strict = TRUE`, validation stops with an error as soon as an
error-severity check fails. With `load = TRUE`, it also confirms that the index
can actually be opened.

## What Counts as an Error Versus a Warning

Not every check has the same severity:

- checksum and file-size mismatches are treated as errors
- metric, dimension, and item-count mismatches are treated as errors
- file modification time is currently treated as a warning

That distinction appears in the `severity` column of the validation report,
alongside the `passed` flag for each check.
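
One way to act on the severity of each check is to split the failed checks into
blocking and advisory groups. The data frame below is a mock with the same
three columns selected earlier. The check names and the `"error"` and
`"warning"` labels are assumptions for illustration, so inspect a real
`validation$checks` for the actual values.

```{r}
# Mock report using the column names shown above; the rows are hypothetical.
mock_checks <- data.frame(
  check = c("file_md5", "file_size", "file_mtime"),
  passed = c(TRUE, TRUE, FALSE),
  severity = c("error", "error", "warning")
)

# Keep only the failed checks, then separate blocking from advisory ones.
failed <- mock_checks[!mock_checks$passed, , drop = FALSE]
blocking <- failed$check[failed$severity == "error"]
advisory <- failed$check[failed$severity == "warning"]

# No blocking failures means the index is usable; advisory ones are worth a look.
length(blocking) == 0L
advisory
```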

## Reopen the Index as a Separate Session Object

In a later R session, you would normally reattach the persisted index with
`annoy_open_index()` or `annoy_load_bigmatrix()`.

```{r}
reopened <- annoy_open_index(
  path = index$path,
  load_mode = "lazy"
)

annoy_is_loaded(reopened)
annoy_validate_index(reopened, strict = TRUE, load = TRUE)$valid
annoy_is_loaded(reopened)
```

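This gives you a clean session-level controller around the same persisted
files. The two `annoy_is_loaded()` calls bracket the validation step, so the
output shows whether validating with `load = TRUE` loaded the lazily opened
index. The reopened object can now be searched, validated again, or explicitly
closed.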

## Sharing Checklist

When sharing an index with another user, machine, or later analysis step, keep
the following artifacts together:

- the `.ann` file
- the `.meta` sidecar file
- any `bigmemory` descriptor files needed to reconstruct the reference or query
  workflow around the index

In practice, it is best to think of the `.ann` and `.meta` files as one unit.
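
A small helper can enforce that pairing when files are moved. The function
below is a hypothetical base R sketch, not part of `bigANNOY`. It takes both
paths explicitly, so it makes no assumption about how the sidecar is named, and
it refuses to copy anything unless both files exist.

```{r}
# Hypothetical helper: copy an index file and its sidecar as one unit.
copy_index_pair <- function(index_file, metadata_file, dest_dir) {
  sources <- c(index_file, metadata_file)

  # Stop before copying anything if either half of the pair is missing.
  missing <- sources[!file.exists(sources)]
  if (length(missing) > 0L) {
    stop("Missing file(s): ", paste(missing, collapse = ", "))
  }

  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  targets <- file.path(dest_dir, basename(sources))

  # file.copy() returns one logical per file; treat any FALSE as a failure.
  copied <- file.copy(sources, targets, overwrite = TRUE)
  if (!all(copied)) {
    stop("Failed to copy: ", paste(sources[!copied], collapse = ", "))
  }
  invisible(targets)
}
```

With the objects created earlier, a call would look like
`copy_index_pair(index$path, index$metadata_path, dest_dir)`, where `dest_dir`
is whatever target directory you choose.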

## Simulate Sharing by Copying the Persisted Files

To mimic transferring an index to another location, we will copy both files
into a separate directory and reopen the copy.

```{r}
shared_dir <- tempfile("bigannoy-shared-copy-")
dir.create(shared_dir, recursive = TRUE, showWarnings = FALSE)

shared_index_path <- file.path(shared_dir, basename(index$path))
shared_metadata_path <- file.path(shared_dir, basename(index$metadata_path))

file.copy(index$path, shared_index_path, overwrite = TRUE)
file.copy(index$metadata_path, shared_metadata_path, overwrite = TRUE)

shared <- annoy_open_index(
  path = shared_index_path,
  load_mode = "lazy"
)

shared_report <- annoy_validate_index(
  shared,
  strict = TRUE,
  load = TRUE
)

shared_report$valid
```

This is the basic "ship the index and reopen it elsewhere" workflow.

## Non-Strict Validation for Diagnostics

Sometimes you do not want an immediate error. You want a report first so you
can inspect what failed and decide whether to stop, rebuild, or repair the
metadata.

To demonstrate that path, we will deliberately corrupt the copied metadata by
replacing the recorded checksum with a wrong value.

```{r}
bad_metadata <- read.dcf(shared_metadata_path)
bad_metadata[1L, "file_md5"] <- "corrupted"
write.dcf(
  as.data.frame(bad_metadata, stringsAsFactors = FALSE),
  file = shared_metadata_path
)

shared_bad <- annoy_open_index(shared_index_path, load_mode = "lazy")
bad_report <- annoy_validate_index(
  shared_bad,
  strict = FALSE,
  load = FALSE
)

bad_report$valid
bad_report$checks[, c("check", "passed", "severity")]
```

This pattern is especially helpful in higher-level tools that want to show a
validation report instead of terminating immediately.

## Strict Validation as a Gate

For production-style workflows, `strict = TRUE` is usually the better default
because it turns a failed validation into an immediate hard stop.

```{r}
strict_error <- tryCatch(
  {
    annoy_validate_index(shared_bad, strict = TRUE, load = FALSE)
    NULL
  },
  error = function(e) conditionMessage(e)
)

strict_error
```

The exact message may vary depending on which error-severity check fails first,
but the key point is that the corrupted metadata is no longer silently accepted.

## A Common Sharing Pitfall: Renaming Only the .ann File

The metadata records the expected basename of the Annoy file in `index_file`.
That means you should generally keep the `.ann` file and the `.meta` file
paired and consistent.

If you rename the `.ann` file without updating or regenerating the metadata,
`annoy_open_index()` will reject the mismatch.

```{r}
renamed_path <- file.path(shared_dir, "renamed.ann")
file.copy(shared_index_path, renamed_path, overwrite = TRUE)

rename_error <- tryCatch(
  {
    annoy_open_index(renamed_path, metadata_path = shared_metadata_path)
    NULL
  },
  error = function(e) conditionMessage(e)
)

rename_error
```

That guard is useful because it prevents you from accidentally pairing an Annoy
file with metadata that describes a different file.
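
The same basename comparison can be done by hand before opening anything. The
sketch below is an assumption about how such a guard could work, not the
package's implementation. It compares `basename()` of a candidate path with the
`index_file` field of a stand-in DCF sidecar, and the helper name
`matches_sidecar()` is hypothetical.

```{r}
# Stand-in sidecar that records the expected basename of the index file.
demo_sidecar <- tempfile(fileext = ".meta")
write.dcf(data.frame(index_file = "ref.ann"), file = demo_sidecar)

# Hypothetical helper: does this path match the basename in the sidecar?
matches_sidecar <- function(index_file, metadata_file) {
  recorded <- read.dcf(metadata_file, fields = "index_file")[1L, 1L]
  identical(basename(index_file), unname(recorded))
}

# The original name matches; a renamed copy does not.
matches_sidecar(file.path(tempdir(), "ref.ann"), demo_sidecar)
matches_sidecar(file.path(tempdir(), "renamed.ann"), demo_sidecar)
```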

## Recommended Sharing Pattern

For practical collaboration, a good pattern is:

1. build the index with `annoy_build_bigmatrix()`
2. keep the generated `.ann` file and `.meta` file together
3. move or copy them as a pair
4. reopen with `annoy_open_index()` or `annoy_load_bigmatrix()`
5. run `annoy_validate_index()` before important analysis
6. only trust the index for downstream search once validation passes

If your larger workflow depends on file-backed `bigmemory` data, keep the
descriptor files alongside the matrices they describe as well.
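
Steps 4 to 6 can be wrapped in one function so that callers never receive an
unvalidated index. This is a sketch built only from the calls used earlier in
this vignette. The wrapper name `open_validated_index()` is hypothetical, and
it assumes that `bigANNOY` is attached and that the `.meta` file sits where
`annoy_open_index()` expects to find it.

```{r}
# Hypothetical wrapper: reopen a persisted index and validate it strictly.
# Assumes library(bigANNOY) is attached, as at the top of this vignette.
open_validated_index <- function(path) {
  idx <- annoy_open_index(path = path, load_mode = "lazy")

  # strict = TRUE raises an error if an error-severity check fails,
  # so the index is only returned after validation has passed.
  annoy_validate_index(idx, strict = TRUE, load = TRUE)

  idx
}
```

A call such as `open_validated_index(index_path)` then either returns an index
that has passed validation or stops with the validation error.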

## Recap

`bigANNOY` v3 makes persisted indexes safer to reuse and share by giving them:

- a sidecar metadata file
- a stable index identifier
- recorded file signatures and build settings
- explicit validation with strict and non-strict modes

The practical takeaway is simple: treat the `.ann` file and the `.meta` file as
a pair, reopen them intentionally, and validate before you trust them.