--- title: "Common Join Problems" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Common Join Problems} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) library(joinspy) ``` Most join bugs are not about cardinality. They are about strings. A trailing space, a flipped case, a zero-width Unicode character: any of these will cause two keys that *look* identical to fail matching without a warning from R. We will walk through the problems joinspy catches, ordered roughly by frequency. String-level issues come first, then structural issues (duplicates, NAs, type mismatches, Cartesian explosions). ## Trailing and leading whitespace The classic. Someone exports a CSV from Excel, and now half the keys carry a trailing space. Everything looks fine when we print the data frame. Nothing matches when we join. ```{r} sales <- data.frame( product = c("Widget", "Gadget ", " Gizmo"), units = c(10, 20, 30), stringsAsFactors = FALSE ) inventory <- data.frame( product = c("Widget", "Gadget", "Gizmo"), stock = c(100, 200, 300), stringsAsFactors = FALSE ) join_spy(sales, inventory, by = "product") ``` Two of three keys carry whitespace that prevents matching. `join_repair()` strips it: ```{r} sales_clean <- join_repair(sales, by = "product") key_check(sales_clean, inventory, by = "product") ``` Passing `dry_run = TRUE` previews the repair without applying it. The problem compounds with composite keys -- whitespace in *any* column is enough to break the match: ```{r} shipments <- data.frame( warehouse = c("East ", "West", "East "), product = c("Widget", "Gadget ", "Gizmo"), shipped = c(50, 80, 35), stringsAsFactors = FALSE ) stock <- data.frame( warehouse = c("East", "West", "East"), product = c("Widget", "Gadget", "Gizmo"), on_hand = c(200, 150, 90), stringsAsFactors = FALSE ) join_spy(shipments, stock, by = c("warehouse", "product")) ``` Both `warehouse` and `product` carry trailing spaces in `shipments`, so all three rows fail to match. A single `join_repair()` call cleans every key column at once: ```{r} shipments_clean <- join_repair(shipments, by = c("warehouse", "product")) key_check(shipments_clean, stock, by = c("warehouse", "product")) ``` ## Case mismatches Databases are often case-insensitive; R is not. When we pull tables from two different systems, one might store `"ABC"` and the other `"abc"`. ```{r} sensors <- data.frame( station = c("AWS-01", "aws-02", "Aws-03"), temp = c(22.1, 18.4, 25.7), stringsAsFactors = FALSE ) metadata <- data.frame( station = c("aws-01", "AWS-02", "AWS-03"), region = c("North", "South", "East"), stringsAsFactors = FALSE ) ``` None of these keys match as-is: ```{r} join_spy(sensors, metadata, by = "station") ``` We can repair both sides to a common case: ```{r} repaired <- join_repair(sensors, metadata, by = "station", standardize_case = "lower") key_check(repaired$x, repaired$y, by = "station") ``` `"lower"`, `"upper"`, and `"title"` all work. One thing to watch for: if the key column is a `factor`, case standardization changes the level labels but not the underlying integer codes. Converting to character first avoids silently merging distinct factor levels. ## Encoding and invisible characters A key contains a non-breaking space (U+00A0) instead of a regular space, or a zero-width joiner crept in during a copy-paste from a PDF. The strings print identically but do not match. ```{r} # Simulate a non-breaking space in one key left <- data.frame( city = c("New York", "Los\u00a0Angeles", "Chicago"), pop = c(8.3, 3.9, 2.7), stringsAsFactors = FALSE ) right <- data.frame( city = c("New York", "Los Angeles", "Chicago"), area = c(302, 469, 227), stringsAsFactors = FALSE ) join_spy(left, right, by = "city") ``` The `"Los\u00a0Angeles"` key in `left` looks like `"Los Angeles"` in `right`, but the non-breaking space makes them different byte sequences. `join_repair()` with `remove_invisible = TRUE` (the default) strips these out: ```{r} left_fixed <- join_repair(left, by = "city") key_check(left_fixed, right, by = "city") ``` Common sources include PDF extraction, web scraping, and cross-platform file transfers. `join_repair()` handles the most common offenders: non-breaking spaces, zero-width joiners, BOM markers, and soft hyphens. It does not attempt full Unicode normalization (NFC vs. NFD); for that we would reach for `stringi::stri_trans_nfc()`. If `join_spy()` still flags encoding issues after repair, `suggest_repairs()` on a report will print the exact R code needed. ## Empty strings masquerading as data Empty strings (`""`) are valid character values in R. They will match other empty strings in a join, which is almost never what we want: two rows with missing identifiers get joined as though they refer to the same entity. ```{r} patients <- data.frame( mrn = c("P001", "", "P003"), age = c(34, 56, 29), stringsAsFactors = FALSE ) visits <- data.frame( mrn = c("P001", "P002", ""), date = c("2024-01-10", "2024-02-15", "2024-03-20"), stringsAsFactors = FALSE ) join_spy(patients, visits, by = "mrn") ``` Converting empties to `NA` before joining fixes this, since NAs never match in R: ```{r} patients_fixed <- join_repair(patients, by = "mrn", empty_to_na = TRUE) patients_fixed$mrn ``` Note that `data.table` treats `""` and `NA_character_` as distinct in keyed joins, so when using a data.table backend we need to convert empty strings to `NA` on both sides. ## Near-matches and typos Sometimes keys are close but not identical. These are not whitespace or case problems -- they are genuine mismatches that `join_spy()` flags when it finds keys in one table with no counterpart in the other. ```{r} orders <- data.frame( sku = c("WDG-100", "GDG-200", "GZM-300"), qty = c(5, 12, 8), stringsAsFactors = FALSE ) catalog <- data.frame( sku = c("WDG-100", "GDG-200", "GZM-301"), price = c(9.99, 14.99, 7.50), stringsAsFactors = FALSE ) report <- join_spy(orders, catalog, by = "sku") ``` Internally, `join_spy()` computes Levenshtein distances between unmatched keys. When two keys differ by only one or two characters, the report flags them as near-matches. Here is a clearer example with multiple near-matches: ```{r} employees <- data.frame( name = c("Johnson", "Smithe", "O'Brian", "Williams"), dept = c("Sales", "R&D", "Ops", "HR"), stringsAsFactors = FALSE ) payroll <- data.frame( name = c("Jonhson", "Smith", "O'Brien", "Williams"), salary = c(55000, 62000, 48000, 71000), stringsAsFactors = FALSE ) report <- join_spy(employees, payroll, by = "name") ``` `"Johnson"` vs. `"Jonhson"` (transposition), `"Smithe"` vs. `"Smith"` (extra character), and `"O'Brian"` vs. `"O'Brien"` (vowel swap) are all within edit distance 2. `"Williams"` matches exactly. There is no automated fix here since joinspy cannot know which side is correct, but the near-match list gives a concrete starting point for building a lookup table. ## Duplicate keys Duplicate keys cause row multiplication. A left join on a key that appears twice in the right table doubles the corresponding rows from the left. ```{r} orders <- data.frame( customer_id = c(1, 2, 3), amount = c(100, 250, 75) ) addresses <- data.frame( customer_id = c(1, 2, 2, 3), address = c("NYC", "LA", "SF", "Chicago"), stringsAsFactors = FALSE ) join_spy(orders, addresses, by = "customer_id") ``` `key_duplicates()` shows which rows are responsible: ```{r} key_duplicates(addresses, by = "customer_id") ``` If each customer should have one address, we deduplicate first. If we genuinely need all combinations, the multiplication is correct -- we just need to know it will happen. When *both* sides have duplicates, each key group produces a Cartesian product: ```{r} orders_dup <- data.frame( product = c("A", "A", "B", "B"), qty = c(10, 20, 5, 15) ) prices_dup <- data.frame( product = c("A", "A", "A", "B", "B"), price = c(1.0, 1.1, 1.2, 2.0, 2.5) ) join_spy(orders_dup, prices_dup, by = "product") ``` Product `"A"` has 2 rows on the left and 3 on the right, so a join produces 2 x 3 = 6 rows for that key alone. `check_cartesian()` quantifies the total expansion before we run the join: ```{r} check_cartesian(orders_dup, prices_dup, by = "product") ``` ## NA keys `NA` never equals `NA` in R. This is by design, but it surprises people who expect two missing values to match. ```{r} orders <- data.frame( customer_id = c(1, NA, 3, NA), amount = c(100, 200, 300, 400) ) customers <- data.frame( customer_id = c(1, 2, 3, NA), name = c("Alice", "Bob", "Carol", "Unknown"), stringsAsFactors = FALSE ) join_spy(orders, customers, by = "customer_id") ``` We can either remove rows with NA keys before joining, or replace NAs with a sentinel value if we actually want them to match: ```{r} # Remove orders_clean <- orders[!is.na(orders$customer_id), ] key_check(orders_clean, customers, by = "customer_id") ``` ## Type mismatches One table stores IDs as integers, the other as character strings. `merge()` coerces silently; `dplyr::left_join()` refuses. Either way, we want to know about it before the join. ```{r} invoices <- data.frame( product_id = c(1, 2, 3), total = c(500, 300, 150) ) products <- data.frame( product_id = c("1", "2", "3"), name = c("Widget", "Gadget", "Gizmo"), stringsAsFactors = FALSE ) join_spy(invoices, products, by = "product_id") ``` A subtler variant occurs with `Date` vs. character, or `POSIXct` vs. `Date`, where the join either fails or coerces through numeric intermediaries. `join_spy()` flags the type mismatch regardless of the types involved. ```{r} invoices$product_id <- as.character(invoices$product_id) key_check(invoices, products, by = "product_id") ``` ## Many-to-many explosions When both tables have duplicate keys, we get a Cartesian product within each key group. With real data this can turn a 10,000-row join into a million-row table. ```{r} items <- data.frame( order_id = c(1, 1, 2, 2, 2), item = c("A", "B", "C", "D", "E"), stringsAsFactors = FALSE ) payments <- data.frame( order_id = c(1, 1, 2, 2), method = c("Card", "Cash", "Card", "Wire"), stringsAsFactors = FALSE ) check_cartesian(items, payments, by = "order_id") ``` `detect_cardinality()` tells us the relationship type: ```{r} detect_cardinality(items, payments, by = "order_id") ``` If we expected a one-to-many relationship, `join_strict()` will stop us before the explosion happens: ```{r error = TRUE} join_strict(items, payments, by = "order_id", type = "left", expect = "1:m") ``` ## No matches at all An inner join returns zero rows, and downstream code may not check for an empty data frame. ```{r} system_a <- data.frame( user_id = c("USR-001", "USR-002", "USR-003"), score = c(85, 90, 78), stringsAsFactors = FALSE ) system_b <- data.frame( user_id = c("1", "2", "3"), dept = c("Sales", "R&D", "Ops"), stringsAsFactors = FALSE ) join_spy(system_a, system_b, by = "user_id") ``` Zero overlap -- the keys use completely different formats, and no amount of trimming or case-folding will help. We need a mapping table or a transformation that extracts the numeric part: ```{r} system_a$user_num <- gsub("^USR-0*", "", system_a$user_id) key_check(system_a, system_b, by = c("user_num" = "user_id")) ``` ## Troubleshooting workflow When a join goes wrong, we typically start with `join_spy()`. It checks string quality, key overlap, cardinality, and predicted row counts in one call: ```{r eval = FALSE} report <- join_spy(x, y, by = "key_col") ``` For large datasets, passing `sample = 1000` runs the check on a random subset first. If the report flags whitespace, case, encoding, or empty strings, `join_repair()` handles them: ```{r eval = FALSE} x_clean <- join_repair(x, by = "key_col") # or both sides: repaired <- join_repair(x, y, by = "key_col", standardize_case = "lower") ``` If the predicted row count is higher than expected, we inspect duplicates and decide whether to aggregate or deduplicate: ```{r eval = FALSE} key_duplicates(y, by = "key_col") ``` Once keys are clean, `join_strict()` enforces the cardinality we expect: ```{r eval = FALSE} result <- join_strict(x_clean, y_clean, by = "key_col", type = "left", expect = "1:1") ``` If the joined output still looks wrong, `join_explain()` gives a breakdown of what happened: ```{r eval = FALSE} result <- left_join_spy(x_clean, y_clean, by = "key_col") join_explain(result, x_clean, y_clean, by = "key_col") ``` In production pipelines, `set_log_file()` routes all subsequent reports to a log file, which is useful for debugging issues after the fact: ```{r eval = FALSE} set_log_file("logs/join_diagnostics.log") # All join_spy() / join_explain() calls now append to this file ``` For pipelines with multiple joins, `analyze_join_chain()` diagnoses an entire sequence at once, reporting where the first problem enters the chain. ## See Also - `vignette("quickstart")` for a quick introduction to joinspy - `?join_spy`, `?join_repair`, `?key_check`, `?join_strict` - `?check_cartesian`, `?detect_cardinality`, `?join_explain`