--- title: "Why Your Keys Don't Match" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Why Your Keys Don't Match} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) library(joinspy) ``` A join runs without error but the row count is off -- rows vanished, or they multiplied. The columns look fine. The key values look identical in the console. `str()` reveals nothing useful. The problem is almost never the join logic. It is the strings. R's `merge()` and dplyr's `*_join()` compare key values byte-for-byte. When keys fail to match, they are genuinely different -- the difference is just invisible: a trailing space, a case mismatch, a zero-width Unicode character with no visual representation. This vignette walks through five scenarios where joins fail for string-level reasons that resist casual inspection. The data is synthetic; the patterns come from real pipelines. ## Scenario 1: The Excel Export A retail analytics team receives monthly sales data from a distribution partner as a CSV exported from Excel. They join it against their internal customer database on `customer_id`. For two quarters, everything works. Then one month, 30% of the sales records stop matching. Nobody changed the code or the customer database. The partner's IDs are all present in the internal system -- or so it appears. ```{r} partner_sales <- data.frame( customer_id = c("CUST-1001", "CUST-1002 ", "CUST-1003", " CUST-1004", "CUST-1005 ", "CUST-1006"), amount = c(2500, 1800, 3200, 950, 4100, 1600), stringsAsFactors = FALSE ) internal_db <- data.frame( customer_id = c("CUST-1001", "CUST-1002", "CUST-1003", "CUST-1004", "CUST-1005", "CUST-1006", "CUST-1007"), region = c("West", "East", "West", "South", "East", "North", "West"), stringsAsFactors = FALSE ) ``` Nothing in `str()` or `print()` reveals the issue -- trailing spaces are invisible in console output. `join_spy()` catches it: ```{r} report <- join_spy(partner_sales, internal_db, by = "customer_id") ``` Three of the six partner IDs carry whitespace. `"CUST-1002 "` is a different string from `"CUST-1002"` as far as R is concerned. `join_repair()` trims both tables at once: ```{r} repaired <- join_repair(partner_sales, internal_db, by = "customer_id") partner_fixed <- repaired$x internal_fixed <- repaired$y ``` We can verify the repair worked: ```{r} key_check(partner_fixed, internal_fixed, by = "customer_id") ``` And now the join gives us what we expected: ```{r} result <- merge(partner_fixed, internal_fixed, by = "customer_id") nrow(result) ``` The root cause was an Excel `CONCATENATE` formula that preserved trailing spaces from a variable-width source column. Excel renders `"CUST-1002"` and `"CUST-1002 "` identically, so nobody noticed. Trailing whitespace is the single most common join failure we see in practice. ## Scenario 2: Two Databases, Two Conventions A SaaS company wants to join CRM profiles to clickstream events for a churn analysis. The CRM stores email addresses in uppercase (a database migration decision from the late 1990s). The web app stores them in lowercase. Both systems are internally consistent. ```{r} crm_profiles <- data.frame( email = c("ALICE@ACME.COM", "BOB@ACME.COM", "CAROL@ACME.COM", "DAVE@ACME.COM", "EVE@ACME.COM"), plan = c("enterprise", "starter", "pro", "enterprise", "starter"), stringsAsFactors = FALSE ) click_events <- data.frame( email = c("alice@acme.com", "bob@acme.com", "carol@acme.com", "dave@acme.com", "frank@acme.com"), page_views = c(47, 12, 89, 33, 5), stringsAsFactors = FALSE ) ``` An inner join returns zero rows. R's string comparison is case-sensitive, so every key pair fails. `join_spy()` flags the situation before the join: ```{r} report <- join_spy(crm_profiles, click_events, by = "email") ``` `suggest_repairs()` generates the fix: ```{r} suggest_repairs(report) ``` Or we can use `join_repair()` directly, specifying case standardization: ```{r} repaired <- join_repair( crm_profiles, click_events, by = "email", standardize_case = "lower" ) ``` After lowercasing both sides, the inner join returns four matched rows (everyone except Eve, who has no click data, and Frank, who is not in the CRM): ```{r} result <- merge(repaired$x, repaired$y, by = "email") nrow(result) result ``` Email addresses are case-insensitive by RFC 5321, so lowercasing is the right normalization here. For other identifier types (product codes, country abbreviations), `"upper"` may be more appropriate. ## Scenario 3: The PDF Copy-Paste A public health researcher compiles data from multiple sources for a systematic review. A few studies published supplementary tables only as PDF, so she copies the table from the PDF viewer, pastes into a spreadsheet, cleans up the columns, and reads the CSV into R. The data looks perfect -- every country name is spelled correctly. But half the countries fail to match a reference population table. ```{r} # Simulating PDF copy-paste artifacts: # \u00A0 is non-breaking space, \u200B is zero-width space pdf_data <- data.frame( country = c("Brazil", "India\u200B", "Germany", "Japan\u00A0", "Canada", "France\u200B"), prevalence = c(12.3, 8.7, 5.1, 3.9, 6.2, 4.8), stringsAsFactors = FALSE ) reference <- data.frame( country = c("Brazil", "India", "Germany", "Japan", "Canada", "France", "Italy"), population_m = c(214, 1408, 84, 125, 38, 68, 59), stringsAsFactors = FALSE ) ``` Printing the PDF data shows nothing wrong: ```{r} pdf_data$country ``` The zero-width space after "India" and "France" occupies zero pixels. The non-breaking space after "Japan" renders like a regular space but is U+00A0, not U+0020 -- `trimws()` will not always remove it. The merge shows the damage: ```{r} nrow(merge(pdf_data, reference, by = "country")) ``` Three of six countries match. `join_spy()` detects the invisible characters: ```{r} report <- join_spy(pdf_data, reference, by = "country") ``` ```{r} repaired <- join_repair(pdf_data, reference, by = "country") nrow(merge(repaired$x, repaired$y, by = "country")) ``` Six matches. PDF copy-paste is the most common source of these artifacts, but web scraping, OCR output, and legacy mainframe exports can produce them too. One useful debugging trick outside of joinspy: `nchar("India\u200B")` returns 6, not 5. But that requires already suspecting the problem. ## Scenario 4: The Slowly Growing Mismatch An e-commerce pipeline joins transaction records to a product catalogue. The pipeline ran cleanly for months, then match rates started drifting: 99% in January, 97% in February, 94% in March. Nobody noticed until finance flagged a margin discrepancy in April. The code had not changed. A new data entry clerk had joined the warehouse team in December. The canonical product code format was `"WDG-100"` -- uppercase prefix, dash, three-digit suffix. The new clerk sometimes omitted the dash, sometimes typed lowercase. The warehouse system did fuzzy matching internally, so it accepted the codes. The ETL join did not. ```{r} # Product catalogue (canonical format) catalogue <- data.frame( product_code = c("WDG-100", "WDG-101", "WDG-102", "WDG-103", "WDG-104", "WDG-105"), product_name = c("Widget A", "Widget B", "Widget C", "Widget D", "Widget E", "Widget F"), margin = c(0.35, 0.28, 0.42, 0.31, 0.39, 0.25), stringsAsFactors = FALSE ) # Recent transactions (mix of old and new clerk entries) transactions <- data.frame( product_code = c("WDG-100", "WDG-101", "WDG102", "wdg-103", "WDG-104", "wdg105", "WDG-100", "WDG103"), quantity = c(5, 3, 7, 2, 4, 6, 1, 8), stringsAsFactors = FALSE ) ``` The damage is partial -- some codes match, some do not -- which makes it harder to spot than a complete failure. ```{r} report <- join_spy(transactions, catalogue, by = "product_code") ``` Here is where this scenario differs from the previous ones. `join_repair()` can fix the case issue, but it cannot insert the missing dashes -- that requires domain knowledge about the code format. We can do a dry run to see what `join_repair()` would fix: ```{r} join_repair(transactions, catalogue, by = "product_code", standardize_case = "upper", dry_run = TRUE) ``` After applying the mechanical fixes: ```{r} repaired <- join_repair(transactions, catalogue, by = "product_code", standardize_case = "upper") ``` The case issues are resolved, but the missing dashes remain. A manual transformation handles those: ```{r} # Manual fix: insert dash if missing in product codes matching the pattern fix_codes <- function(codes) { gsub("^([A-Z]{3})(\\d)", "\\1-\\2", codes) } repaired$x$product_code <- fix_codes(repaired$x$product_code) ``` ```{r} result <- merge(repaired$x, repaired$y, by = "product_code") nrow(result) ``` All eight transactions match. `join_repair()` handles context-free transformations (trimming, case normalization, stripping invisible characters). Inserting a dash into `"WDG102"` requires knowing the canonical format -- that fix has to come from someone who understands the data. ## Scenario 5: The Compound Key Trap Two government datasets need to be linked: regional economic indicators and regional population estimates, keyed on region and year. The year column is numeric and matches without trouble. The region column has a whitespace problem that affects only some records. ```{r} economics <- data.frame( region = c("North America", "Europe", "Asia Pacific ", "North America", "Europe", "Asia Pacific "), year = c(2022, 2022, 2022, 2023, 2023, 2023), gdp_growth = c(2.1, 1.8, 4.2, 1.9, 0.9, 3.8), stringsAsFactors = FALSE ) population <- data.frame( region = c("North America", "Europe", "Asia Pacific", "North America", "Europe", "Asia Pacific"), year = c(2022, 2022, 2022, 2023, 2023, 2023), pop_millions = c(580, 450, 4300, 585, 448, 4350), stringsAsFactors = FALSE ) ``` In a compound key join, both columns must match. The year column is fine everywhere, but `"Asia Pacific "` with a trailing space will not match `"Asia Pacific"`. ```{r} merged <- merge(economics, population, by = c("region", "year")) nrow(merged) ``` Four rows instead of six. North America and Europe link correctly; Asia Pacific silently vanishes. Two-thirds of the data is present -- just enough to look plausible. `join_spy()` pinpoints which column in the compound key has the problem: ```{r} report <- join_spy(economics, population, by = c("region", "year")) ``` ```{r} repaired <- join_repair(economics, population, by = c("region", "year")) result <- merge(repaired$x, repaired$y, by = c("region", "year")) nrow(result) ``` Six rows. Compound keys are more fragile than single keys because the failure is partial: one clean column and one dirty column produces a result that looks reasonable but is quietly incomplete. The more columns in the key, the more places an invisible character can hide. ## The Pattern These five scenarios share three properties. The data looks correct to standard inspection tools -- `str()`, `summary()`, `print()` all render the values identically. The failure is silent -- R does not warn when a merge drops rows because of invisible whitespace; it just returns fewer rows. And the fix is mechanical once the cause is known -- trimming, lowercasing, or stripping invisible Unicode are all one-line operations. `join_spy()` closes the gap between a thirty-second fix and the hours of confusion that precede it, particularly with data from external sources, manual entry, PDF extraction, or cross-system integrations.