---
title: "Getting Started with bigANNOY"
output:
  litedown::html_format:
    meta:
      css: ["@default"]
---

<!--
%\VignetteEngine{litedown::vignette}
%\VignetteIndexEntry{Getting Started with bigANNOY}
%\VignetteEncoding{UTF-8}
-->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(bigANNOY.progress = FALSE)
set.seed(20260326)
```

`bigANNOY` is an approximate nearest-neighbour package for
`bigmemory::big.matrix` data. It builds a persisted Annoy index from a
reference matrix, searches that index with either self-search or external
queries, and returns results in a shape aligned with `bigKNN`.

This vignette walks through the first workflow most users need:

1. create a small reference matrix
2. build an index on disk
3. run self-search and external-query search
4. inspect the returned neighbours and distances
5. reopen and validate the index in a later step

The examples are intentionally small, but the same API is designed for larger
file-backed `big.matrix` inputs.

## Load the Packages

```{r}
library(bigANNOY)
library(bigmemory)
```

## Create a Small Reference Matrix

`bigANNOY` is built around `bigmemory::big.matrix`, so we will start from a
dense matrix and convert it into a `big.matrix`.

```{r}
ref_dense <- matrix(
  c(
    0.0, 0.1, 0.2, 0.3,
    0.1, 0.0, 0.1, 0.2,
    0.2, 0.1, 0.0, 0.1,
    1.0, 1.1, 1.2, 1.3,
    1.1, 1.0, 1.1, 1.2,
    1.2, 1.1, 1.0, 1.1,
    3.0, 3.1, 3.2, 3.3,
    3.1, 3.0, 3.1, 3.2
  ),
  ncol = 4,
  byrow = TRUE
)

ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
```

The reference matrix has `r nrow(ref_dense)` rows and `r ncol(ref_dense)`
columns. Each row is a candidate neighbour in the final search results.

## Build the First Annoy Index

`annoy_build_bigmatrix()` streams the reference rows into a persisted Annoy
index and writes a sidecar metadata file next to it.

```{r}
index_path <- tempfile(fileext = ".ann")

index <- annoy_build_bigmatrix(
  ref_big,
  path = index_path,
  n_trees = 20L,
  metric = "euclidean",
  seed = 123L,
  load_mode = "lazy"
)

index
```

A few details are worth noticing:

- the Annoy index lives on disk at `index$path`
- metadata is written to `index$metadata_path`
- `load_mode = "lazy"` means the object is initially metadata-only
- the native handle is loaded automatically on first search

You can check the current loaded state directly.

```{r}
annoy_is_loaded(index)
```

## Run a Self-Search

With `query = NULL`, `annoy_search_bigmatrix()` searches the indexed reference
rows against themselves. In self-search mode, the nearest neighbour for each
row is another row, not the row itself.

```{r}
self_result <- annoy_search_bigmatrix(
  index,
  k = 2L,
  search_k = 100L
)

self_result$index
round(self_result$distance, 3)
```

Because the first search loads the lazy index, the handle is now available for
reuse.

```{r}
annoy_is_loaded(index)
```

The result object follows the same high-level shape as `bigKNN`:

```{r}
str(self_result, max.level = 1)
```

In particular:

- `index` is a 1-based integer matrix
- `distance` is a double matrix
- `k`, `metric`, `n_ref`, and `n_query` describe the search
- `exact` is always `FALSE` for `bigANNOY`
- `backend` is `"annoy"`

## Search with an External Query Matrix

External queries are often the more common workflow in practice. Here we build
a small dense query matrix with rows close to the first, middle, and final
clusters in the reference data.

```{r}
query_dense <- matrix(
  c(
    0.05, 0.05, 0.15, 0.25,
    1.05, 1.05, 1.10, 1.25,
    3.05, 3.05, 3.15, 3.25
  ),
  ncol = 4,
  byrow = TRUE
)

query_result <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 3L,
  search_k = 100L
)

query_result$index
round(query_result$distance, 3)
```

The three query rows each return three approximate neighbours from the indexed
reference matrix. For small examples like this one, the results will typically
look exact, but the important point is that the API stays the same for larger
problems where approximate search is preferable.

## Tune the Main Search Controls

Two arguments matter most when you begin tuning:

- `n_trees` controls index quality and index size at build time
- `search_k` controls search effort at query time

As a starting point:

- increase `search_k` first if recall looks too low
- rebuild with more `n_trees` when query-time tuning alone is not enough
- keep `metric = "euclidean"` when you want the most direct comparison with
  `bigKNN`

The package also supports `"angular"`, `"manhattan"`, and `"dot"` metrics,
but Euclidean is usually the easiest place to begin.

## Stream Results into big.matrix Outputs

For larger workloads, you may not want to keep neighbour matrices in ordinary
R memory. `bigANNOY` can write directly into destination `big.matrix` objects.

```{r}
index_out <- big.matrix(nrow(query_dense), 2L, type = "integer")
distance_out <- big.matrix(nrow(query_dense), 2L, type = "double")

streamed <- annoy_search_bigmatrix(
  index,
  query = query_dense,
  k = 2L,
  xpIndex = index_out,
  xpDistance = distance_out
)

bigmemory::as.matrix(index_out)
round(bigmemory::as.matrix(distance_out), 3)
```

The returned object still reports the same metadata, but the actual neighbour
matrices live in the destination `big.matrix` containers.

## Reopen and Validate a Persisted Index

One of the main v3 improvements is explicit index lifecycle support. You can
close a loaded handle, reopen the same index from disk, and validate its
metadata before reuse.

```{r}
annoy_close_index(index)
annoy_is_loaded(index)

reopened <- annoy_open_index(index$path, load_mode = "eager")
annoy_is_loaded(reopened)
```

Validation checks the recorded metadata against the current Annoy file and can
also verify that the index loads successfully.

```{r}
validation <- annoy_validate_index(reopened, strict = TRUE, load = TRUE)

validation$valid
validation$checks[, c("check", "passed", "severity")]
```

This is especially helpful when you want to reuse an index across sessions or
share the `.ann` file and its `.meta` sidecar with someone else.

## What Inputs Are Accepted?

For the quick start above we used:

- a `big.matrix` reference
- a dense matrix query
- in-memory `big.matrix` destinations for streamed outputs

The package also accepts:

- external pointers to `big.matrix` objects
- `big.matrix` descriptor objects
- descriptor file paths
- `query = NULL` for self-search

That broader file-backed workflow is covered in the dedicated vignette on
`bigmemory` persistence and descriptors.

## Recap

You have now seen the full first-run workflow:

1. create a `big.matrix` reference
2. build a persisted Annoy index
3. search it in self-search and external-query modes
4. stream results into destination `big.matrix` objects when needed
5. reopen, validate, and reuse the index

From here, the most useful next steps are:

- *Persistent Indexes and Lifecycle* for eager/lazy loading and explicit close
  and reopen workflows
- *File-Backed bigmemory Workflows* for descriptor files and on-disk matrices
- *Benchmarking Recall and Latency* for `benchmark_annoy_bigmatrix()` and
  `benchmark_annoy_recall_suite()`