---
title: "Citation Snowballing for Literature Discovery"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Citation Snowballing for Literature Discovery}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  comment = "#>"
)
```

A keyword search only finds papers that use your keywords; it misses related work phrased in different terms. *Citation snowballing* follows citation links outward from a set of seed papers to surface those neighbors regardless of vocabulary -- a standard supplementary search step in evidence synthesis.

This vignette expands a seed corpus with `citation_snowball()`, inspects why each paper was admitted, and characterizes the *expansion space* (the citation-adjacent literature the query never returned) with MeSH keyness. Snowballing proposes candidates to screen; it does not replace manual review. Because iCite links cover PubMed-indexed articles only, snowballing also inherits PubMed's coverage.

```{r libs}
library(puremoe)
library(dplyr)
library(DT)
```

## Seed corpus

Search PubMed, then pull iCite records for the hits. Snowballing uses the `icites` endpoint specifically: each record carries a `citation_net` of that paper's references and citing papers.

```{r search}
pmids <- search_pubmed('"political ideology"[TiAb]')

length(pmids)

seed_icites <- get_records(pmids, endpoint = "icites", cores = 1L, sleep = 0.25)
```

## Expand by snowballing

`citation_snowball()` walks the links already in the iCite response (no extra API call). `direction = "both"` looks backward (papers the seeds cite) and forward (papers that cite the seeds); `min_links` admits a candidate only if it connects to at least that many seeds.

```{r snowball}
snowball <- seed_icites |>
  citation_snowball(direction = "both", min_links = 2)

snowball |>
  count(seed)   # seeds vs newly discovered candidates
```
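To make the `min_links` gate concrete, the following toy sketch counts how many distinct seeds link to each non-seed paper and keeps those that reach the threshold. It illustrates the idea only and is not the package's implementation; the edge table and its column names (`seed_pmid`, `linked_pmid`) are hypothetical.

```{r toy-min-links}
library(dplyr)

# Hypothetical toy citation edges: each row links one seed to one other paper.
toy_edges <- data.frame(
  seed_pmid   = c("s1", "s1", "s2", "s2", "s3"),
  linked_pmid = c("a",  "b",  "a",  "c",  "a")
)
toy_seeds <- unique(toy_edges$seed_pmid)

# Drop links that point at seeds, count distinct seeds per linked paper,
# then apply the min_links gate (here, 2) and rank by link count.
toy_candidates <- toy_edges |>
  filter(!linked_pmid %in% toy_seeds) |>
  distinct(seed_pmid, linked_pmid) |>
  count(linked_pmid, name = "link_count") |>
  filter(link_count >= 2) |>
  arrange(desc(link_count))

toy_candidates
```

Only paper `a` survives: it connects to three seeds, while `b` and `c` each connect to one.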

## The audit trail

Every row carries its provenance: `seed` (whether the paper came from the original search), `cited_links` (seeds that cite it), `citing_links` (seeds it cites), and `link_count` (the total used both for ranking and for the `min_links` gate). Candidates are, by construction, papers the keyword query did not return.

```{r audit}
candidates <- snowball |>
  filter(!seed) |>
  arrange(desc(link_count))

candidates |>
  head(25) |>
  DT::datatable(rownames = FALSE)
```

## The expansion space

Fetch metadata -- including MeSH -- once for the whole snowballed corpus (seeds plus candidates) and reuse it below. Joining the candidates back to their titles shows what the snowball surfaced.

```{r corpus-meta}
corpus_meta <- snowball$pmid |>
  get_records(endpoint = "pubmed_abstracts", cores = 1L, sleep = 0.25)

corpus_meta |>
  left_join(snowball, by = "pmid") |>
  filter(!seed) |>
  arrange(desc(link_count)) |>
  select(pmid, year, journal, articletitle,
         cited_links, citing_links, link_count) |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```

### Keyness: how the corpus profile shifts

Raw MeSH counts are dominated by terms common everywhere (`Humans`). *Keyness* compares each descriptor's rate in a corpus to its PubMed-wide rate from `data_mesh_frequencies` (log2 ratio; 3 means 8x over-represented), counted per document to match the baseline.

```{r keyness-helpers}
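baseline <- puremoe::data_mesh_frequencies
# Recover the PubMed-wide document total from the most frequent descriptor,
# assuming prop_total is that descriptor's n_pmids divided by all PMIDs.
i <- which.max(baseline$n_pmids)
total_pubmed <- round(baseline$n_pmids[i] / baseline$prop_total[i])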

# document frequency of each MeSH descriptor within a set of PMIDs
# (uses data.table `[` syntax, so `meta` must be a data.table)
mesh_df <- function(meta, ids) {
  ann <- data.table::rbindlist(meta[pmid %in% ids]$annotations, fill = TRUE)
  if (!"type" %in% names(ann) || nrow(ann) == 0L) {
    return(data.table::data.table(DescriptorUI = character(),
                                  DescriptorName = character(), docs = integer()))
  }
  ann[type == "MeSH" & !is.na(DescriptorUI)] |>
    distinct(pmid, DescriptorUI, DescriptorName) |>
    count(DescriptorUI, DescriptorName, name = "docs")
}

# log-ratio keyness vs the PubMed baseline, with a document-frequency floor
keyness <- function(meta, ids, min_docs = 3L) {
  n_docs <- length(unique(ids))
  mesh_df(meta, ids) |>
    filter(docs >= min_docs) |>
    inner_join(select(baseline, DescriptorUI, n_pmids), by = "DescriptorUI") |>
    mutate(log_ratio = round(log2((docs / n_docs) / (n_pmids / total_pubmed)), 2))
}
```
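As a quick check of the arithmetic inside `keyness()`, the same log2 ratio can be worked by hand. The counts below are made up for illustration and are not real PubMed figures; all `toy_*` names are hypothetical.

```{r toy-keyness}
# Made-up counts, for illustration only.
toy_docs         <- 16        # corpus documents tagged with the descriptor
toy_n_docs       <- 200       # documents in the corpus
toy_n_pmids      <- 360000    # baseline documents tagged with the descriptor
toy_total_pubmed <- 36000000  # baseline documents overall

corpus_rate   <- toy_docs / toy_n_docs            # 0.08
baseline_rate <- toy_n_pmids / toy_total_pubmed   # 0.01

# Same expression as in keyness(): log2 of the rate ratio, rounded to 2 places.
toy_log_ratio <- round(log2(corpus_rate / baseline_rate), 2)
toy_log_ratio
```

A descriptor found in 8% of the corpus against 1% of the baseline scores 3, that is, 2^3 = 8 times over-represented.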

Pass 1 -- the seed corpus alone -- should surface the obvious topic terms, a sanity check that keyness behaves.

```{r keyness-seed}
pass1 <- keyness(corpus_meta, pmids)

pass1 |>
  arrange(desc(log_ratio)) |>
  select(DescriptorName, docs, log_ratio) |>
  head(15) |>
  DT::datatable(rownames = FALSE)
```

Pass 2 repeats the calculation on the expanded corpus. `shift` is the expanded keyness minus the seed keyness, so ranking by it puts the descriptors that snowballing amplified most at the top: concepts barely present in the seeds rise to the surface -- the expansion space, quantified.

```{r keyness-shift}
pass2 <- keyness(corpus_meta, snowball$pmid)

pass1 |>
  select(DescriptorUI, name1 = DescriptorName, keyness_seed = log_ratio) |>
  full_join(
    pass2 |> select(DescriptorUI, name2 = DescriptorName,
                    keyness_expanded = log_ratio),
    by = "DescriptorUI"
  ) |>
  mutate(
    DescriptorName   = coalesce(name1, name2),
    keyness_seed     = coalesce(keyness_seed, 0),
    keyness_expanded = coalesce(keyness_expanded, 0),
    shift            = round(keyness_expanded - keyness_seed, 2)
  ) |>
  arrange(desc(shift)) |>
  select(DescriptorName, keyness_seed, keyness_expanded, shift) |>
  head(20) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```
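The join-and-fill step is easier to read on a small example. In this sketch, with hypothetical descriptor IDs and made-up keyness values, a descriptor missing from one pass (for instance because it fell below the `min_docs` floor) is filled with 0, which corresponds to the baseline rate.

```{r toy-shift}
library(dplyr)

# Made-up keyness values; D1-D3 are placeholder descriptor IDs.
toy_pass1 <- data.frame(DescriptorUI = c("D1", "D2"),
                        keyness_seed = c(5.0, 1.0))
toy_pass2 <- data.frame(DescriptorUI = c("D1", "D3"),
                        keyness_expanded = c(4.5, 3.0))

toy_shift <- full_join(toy_pass1, toy_pass2, by = "DescriptorUI") |>
  mutate(
    keyness_seed     = coalesce(keyness_seed, 0),      # absent from the seeds
    keyness_expanded = coalesce(keyness_expanded, 0),  # absent after expansion
    shift            = keyness_expanded - keyness_seed
  ) |>
  arrange(desc(shift))

toy_shift
```

`D3`, absent from the seed pass, leads with a shift of 3. `D2` shows a shift of -1 only because it is missing from the expanded pass, so negative shifts near the `min_docs` floor are best read with some caution.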

Two caveats: recent papers may not be MeSH-indexed yet, and the baseline is a snapshot. Keyness describes the corpus; it does not judge relevance.

## Tuning precision and recall

`min_links` trades recall for precision -- higher thresholds keep only papers tightly woven into the corpus. Sweeping it shows the trade-off directly.

```{r min-links-sweep}
data.frame(
  min_links = 1:5,
  n_candidates = sapply(1:5, function(k) {
    sum(!citation_snowball(seed_icites, direction = "both", min_links = k)$seed)
  })
)
```

`max_nodes` is a hard ceiling on corpus size: candidates are ranked by `link_count` and truncated to fit, bounding the screening workload.

```{r max-nodes}
seed_icites |>
  citation_snowball(direction = "both", min_links = 1, max_nodes = 50) |>
  nrow()
```

## Choosing a direction

`direction` matches the search intent: `"cited"` finds shared foundational references, `"citing"` finds later work building on the corpus, and `"both"` casts the widest net.

```{r direction}
backward <- citation_snowball(seed_icites, direction = "cited",  min_links = 2)
forward  <- citation_snowball(seed_icites, direction = "citing", min_links = 2)

data.frame(
  direction    = c("cited (foundational)", "citing (downstream)"),
  n_candidates = c(sum(!backward$seed), sum(!forward$seed))
)
```

## Iterating

Re-seed by feeding the expanded PMIDs back through the iCite endpoint and snowballing again; each hop keeps the same audit columns.

```{r reseed, eval=FALSE}
hop2 <- snowball$pmid |>
  get_records(endpoint = "icites", cores = 1L, sleep = 0.25) |>
  citation_snowball(direction = "both", min_links = 3, max_nodes = 500)
```

## Summary

`citation_snowball()` turns an iCite response into a ranked, auditable candidate set: it finds citation-adjacent papers a keyword query misses, the audit columns document why each was admitted, and MeSH keyness against `data_mesh_frequencies` characterizes the expansion space. It complements keyword search and manual screening rather than replacing them.