---
title: "Quick Start with bigKNN"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quick Start with bigKNN}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`bigKNN` provides exact nearest-neighbour and radius-search routines that work
directly on `bigmemory::big.matrix` references. This quickstart walks through a
small end-to-end example: create a reference matrix, run exact `k`-nearest
neighbour and radius queries, and interpret the objects that come back.

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

if (!requireNamespace("bigmemory", quietly = TRUE)) {
  cat("This vignette requires the 'bigmemory' package.\n")
  knitr::knit_exit()
}

library(bigKNN)
library(bigmemory)
```

```{r helpers, include=FALSE}
knn_table <- function(result, query_ids, ref_ids) {
  do.call(rbind, lapply(seq_along(query_ids), function(i) {
    data.frame(
      query = query_ids[i],
      rank = seq_len(result$k),
      neighbor = ref_ids[result$index[i, ]],
      distance = signif(result$distance[i, ], 4),
      row.names = NULL
    )
  }))
}

radius_slice <- function(result, i, ref_ids) {
  start <- result$offset[i]
  end <- result$offset[i + 1L] - 1L
  if (start > end) {
    return(data.frame(neighbor = character(0), distance = numeric(0)))
  }

  data.frame(
    neighbor = ref_ids[result$index[start:end]],
    distance = signif(result$distance[start:end], 4),
    row.names = NULL
  )
}
```

# Create a Small Reference Matrix

The reference data for `bigKNN` lives in a `bigmemory::big.matrix`. For this
quickstart we will use six points in two dimensions and keep separate labels so
that the returned row indices are easy to read.


```{r create-reference}
reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(1, 2, 1, 2, 3, 4),
  x2 = c(1, 1, 2, 2, 2, 3)
)

query_points <- data.frame(
  id = c("q1", "q2"),
  x1 = c(1.2, 2.8),
  x2 = c(1.1, 2.2)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
query_matrix <- as.matrix(query_points[c("x1", "x2")])

reference_points
query_points
```

In the examples below, `reference` is the `big.matrix` searched by `bigKNN`,
while `query_matrix` is an ordinary dense R matrix. The same APIs also accept
queries stored in another `big.matrix`, and the v3 search paths accept sparse
query matrices as well.

# First Exact `k`-Nearest-Neighbour Search

With `query = NULL`, `knn_bigmatrix()` performs a self-search. By default
`exclude_self = TRUE` in that case, so each row does not return itself as its
own nearest neighbour.

```{r self-knn}
self_knn <- knn_bigmatrix(reference, k = 2)
self_knn
```

The raw result stores two `n_query x k` matrices:

- `index`: 1-based row indices into the reference matrix
- `distance`: the corresponding exact distances

```{r self-knn-components}
self_knn$index
round(self_knn$distance, 3)
```

A small helper table makes the same result easier to read:

```{r self-knn-table}
knn_table(self_knn, query_ids = reference_points$id, ref_ids = reference_points$id)
```

# Searching New Query Points

To search new observations against the same reference, pass them through the
`query` argument. Here we ask for the three nearest reference rows for two new
points.

```{r query-knn}
query_knn <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  exclude_self = FALSE
)

query_knn
knn_table(query_knn, query_ids = query_points$id, ref_ids = reference_points$id)
```

The returned indices are still row numbers in `reference`. That makes the
object compact and easy to use in later workflows, while a lookup table like
`reference_points` can translate those indices into human-readable labels when
needed.

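
On a dataset this small, an exact search can be cross-checked by brute force. The block below is a minimal base R sketch, not part of `bigKNN`: `brute_force_knn()` is a hypothetical helper, and the coordinates are copied from the example data so that the block runs on its own. It computes every Euclidean distance, keeps the `k` smallest per query, and then translates the row indices into labels.

```r
# Brute-force Euclidean k-NN in base R (hypothetical helper, not a bigKNN function).
reference_xy <- rbind(c(1, 1), c(2, 1), c(1, 2), c(2, 2), c(3, 2), c(4, 3))
query_xy <- rbind(q1 = c(1.2, 1.1), q2 = c(2.8, 2.2))
ref_ids <- paste0("p", 1:6)

brute_force_knn <- function(reference, query, k) {
  index <- matrix(NA_integer_, nrow(query), k)
  distance <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Distance from query row i to every reference row.
    d <- sqrt(colSums((t(reference) - query[i, ])^2))
    # order() sorts ascending and breaks ties by the lower row index.
    nearest <- order(d)[seq_len(k)]
    index[i, ] <- nearest
    distance[i, ] <- d[nearest]
  }
  list(index = index, distance = distance)
}

expected <- brute_force_knn(reference_xy, query_xy, k = 2)

# Translate 1-based row indices into labels, keeping the n_query x k shape.
expected_labels <- matrix(ref_ids[expected$index], nrow = nrow(expected$index))

expected$index
round(expected$distance, 4)
expected_labels
```

The sketch uses `k = 2` on purpose: `p2` and `p6` are exactly equidistant from `q2`, so the third neighbour of `q2` depends on how ties are broken, which may differ between implementations. Assuming neighbours are returned in increasing order of distance, the first two columns of `query_knn$index` can be compared with `expected$index`.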

# First Radius Search

Radius search returns every neighbour within a fixed distance threshold instead
of a fixed `k`. This is useful when the local neighbourhood size should vary by
query.

```{r radius-search}
radius_result <- radius_bigmatrix(
  reference,
  query = query_matrix,
  radius = 1.15,
  exclude_self = FALSE
)

radius_result
radius_result$n_match
radius_result$offset
```

`radius_bigmatrix()` uses a flattened output format:

- `index` and `distance` hold all matches back-to-back
- `n_match` gives the number of matches per query
- `offset` tells you where each query's slice starts and ends

For query `i`, the matching rows live in
`index[offset[i]:(offset[i + 1] - 1)]`, with the same slice in `distance`.

```{r radius-slices}
radius_slice(radius_result, 1, reference_points$id)
radius_slice(radius_result, 2, reference_points$id)
```

If you only need the counts, `count_within_radius_bigmatrix()` avoids returning
the flattened neighbour vectors.

```{r radius-counts}
count_within_radius_bigmatrix(
  reference,
  query = query_matrix,
  radius = 1.15,
  exclude_self = FALSE
)
```

# Choosing a Metric

`bigKNN` currently supports three exact metrics:

- `"euclidean"` for ordinary Euclidean distance
- `"sqeuclidean"` for squared Euclidean distance
- `"cosine"` for cosine distance, defined as `1 - cosine similarity`

The squared Euclidean metric preserves the same neighbour ordering as ordinary
Euclidean distance, but reports squared values. Cosine distance can choose a
different neighbour because it prefers similar direction rather than similar
absolute location.

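
The three definitions can be written out directly in base R. The following is a sketch for illustration only, not a description of how `bigKNN` computes distances: the helper names `euclidean()`, `sqeuclidean()` and `cosine()` are hypothetical, and the coordinates are copied from the example data so that the block is self-contained.

```r
# Base R sketch of the three metrics (an illustration, not bigKNN's internals).
reference_xy <- rbind(
  p1 = c(1, 1), p2 = c(2, 1), p3 = c(1, 2),
  p4 = c(2, 2), p5 = c(3, 2), p6 = c(4, 3)
)
q2 <- c(2.8, 2.2)

# Hypothetical helper names; each takes two numeric vectors of equal length.
euclidean <- function(a, b) sqrt(sum((a - b)^2))
sqeuclidean <- function(a, b) sum((a - b)^2)
cosine <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Distance from q2 to every reference row under each metric (6 x 3 matrix).
metric_distances <- sapply(
  list(euclidean = euclidean, sqeuclidean = sqeuclidean, cosine = cosine),
  function(metric) apply(reference_xy, 1, metric, b = q2)
)

# Label of the closest reference row under each metric.
nearest_by_metric <- rownames(reference_xy)[apply(metric_distances, 2, which.min)]
names(nearest_by_metric) <- colnames(metric_distances)

# Squaring is monotone, so both Euclidean variants rank the rows identically.
same_ranking <- identical(
  order(metric_distances[, "euclidean"]),
  order(metric_distances[, "sqeuclidean"])
)

round(metric_distances, 4)
nearest_by_metric
same_ranking
```

For `q2`, both Euclidean variants pick `p5`, the closest point in the plane, while the cosine helper picks `p6`, because `(4, 3)` points in almost the same direction as `(2.8, 2.2)` despite lying further away.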

```{r metric-comparison}
metric_summary <- do.call(rbind, lapply(
  c("euclidean", "sqeuclidean", "cosine"),
  function(metric) {
    result <- knn_bigmatrix(
      reference,
      query = query_matrix,
      k = 1,
      metric = metric,
      exclude_self = FALSE
    )

    data.frame(
      metric = metric,
      query = query_points$id,
      nearest = reference_points$id[result$index[, 1]],
      distance = signif(result$distance[, 1], 4),
      row.names = NULL
    )
  }
))

metric_summary
```

In this example, Euclidean and squared Euclidean agree on the nearest row for
each query, while cosine distance can favour a different point whose direction
is more similar to the query's. Cosine distance requires non-zero rows in both
the reference and the query data.

# Where to Go Next

This vignette focused on the smallest useful workflow. For larger or repeated
jobs, the next places to look are:

- `knn_prepare_bigmatrix()` and `knn_search_prepared()` for repeated exact
  queries against the same reference
- `knn_plan_bigmatrix()`, `knn_stream_bigmatrix()`, and
  `radius_stream_bigmatrix()` for memory-aware and streamed workflows
- `knn_graph_bigmatrix()`, `mutual_knn_graph_bigmatrix()`,
  `snn_graph_bigmatrix()`, and `radius_graph_bigmatrix()` for exact graph
  construction
- `recall_against_exact()` and `rerank_candidates_bigmatrix()` when `bigKNN` is
  being used as the exact ground-truth engine for approximate search