Citation Snowballing for Literature Discovery

A keyword search only finds papers that use your keywords; it misses related work phrased in different terms. Citation snowballing follows citation links outward from a set of seed papers to surface those neighbors regardless of vocabulary – a standard supplementary search step in evidence synthesis.

This vignette expands a seed corpus with citation_snowball(), inspects why each paper was admitted, and characterizes the expansion space (the citation-adjacent literature the query never returned) with MeSH keyness. It proposes candidates to screen; it does not replace manual review. Note that iCite links cover PubMed-indexed articles only, so snowballing inherits PubMed’s coverage.

library(puremoe)
library(dplyr)
library(DT)

Seed corpus

Search PubMed, then pull iCite records for the hits. Snowballing uses the icites endpoint specifically: each record carries a citation_net of that paper’s references and citing papers.

pmids <- search_pubmed('"political ideology"[TiAb]')

length(pmids)

#> [1] 954

seed_icites <- get_records(pmids, endpoint = "icites", cores = 1L, sleep = 0.25)

Expand by snowballing

citation_snowball() walks the links already in the iCite response (no extra API call). direction = "both" looks backward (papers the seeds cite) and forward (papers that cite the seeds); min_links admits a candidate only if it connects to at least that many seeds.

snowball <- seed_icites |>
  citation_snowball(direction = "both", min_links = 2)

snowball |>
  count(seed)   # seeds vs newly discovered candidates

#>      seed     n
#>    <lgcl> <int>
#> 1:  FALSE  1049
#> 2:   TRUE   951

The audit trail

Every row carries its provenance: seed, cited_links (seeds that cite it), citing_links (seeds it cites), and link_count (the ranking total and min_links gate). Candidates are, by construction, papers the keyword query did not return.

candidates <- snowball |>
  filter(!seed) |>
  arrange(desc(link_count))

candidates |>
  head(25) |>
  DT::datatable(rownames = FALSE)

The expansion space

Fetch metadata – including MeSH – once for the whole snowballed corpus (seeds plus candidates) and reuse it below. Joining the candidates back to their titles shows what the snowball surfaced.

corpus_meta <- snowball$pmid |>
  get_records(endpoint = "pubmed_abstracts", cores = 1L, sleep = 0.25)

corpus_meta |>
  left_join(snowball, by = "pmid") |>
  filter(!seed) |>
  arrange(desc(link_count)) |>
  select(pmid, year, journal, articletitle,
         cited_links, citing_links, link_count) |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Keyness: how the corpus profile shifts

Raw MeSH counts are dominated by terms common everywhere (Humans). Keyness compares each descriptor’s rate in a corpus to its PubMed-wide rate from data_mesh_frequencies (log2 ratio; 3 means 8x over-represented), counted per document to match the baseline.

baseline <- puremoe::data_mesh_frequencies
i <- which.max(baseline$n_pmids)
total_pubmed <- round(baseline$n_pmids[i] / baseline$prop_total[i])

# document frequency of each MeSH descriptor within a set of PMIDs
mesh_df <- function(meta, ids) {
  ann <- data.table::rbindlist(meta[pmid %in% ids]$annotations, fill = TRUE)
  if (!"type" %in% names(ann) || nrow(ann) == 0L) {
    return(data.table::data.table(DescriptorUI = character(),
                                  DescriptorName = character(), docs = integer()))
  }
  ann[type == "MeSH" & !is.na(DescriptorUI)] |>
    distinct(pmid, DescriptorUI, DescriptorName) |>
    count(DescriptorUI, DescriptorName, name = "docs")
}

# log-ratio keyness vs the PubMed baseline, with a document-frequency floor
keyness <- function(meta, ids, min_docs = 3L) {
  n_docs <- length(unique(ids))
  mesh_df(meta, ids) |>
    filter(docs >= min_docs) |>
    inner_join(select(baseline, DescriptorUI, n_pmids), by = "DescriptorUI") |>
    mutate(log_ratio = round(log2((docs / n_docs) / (n_pmids / total_pubmed)), 2))
}

Pass 1 – the seed corpus alone – should surface the obvious topic terms, a sanity check that keyness behaves.

pass1 <- keyness(corpus_meta, pmids)

pass1 |>
  arrange(desc(log_ratio)) |>
  select(DescriptorName, docs, log_ratio) |>
  head(15) |>
  DT::datatable(rownames = FALSE)

Pass 2 repeats it on the expanded corpus; the shift ranks descriptors by how much snowballing amplified them, so concepts barely present in the seeds rise to the top – the expansion space, quantified.

pass2 <- keyness(corpus_meta, snowball$pmid)

pass1 |>
  select(DescriptorUI, name1 = DescriptorName, keyness_seed = log_ratio) |>
  full_join(
    pass2 |> select(DescriptorUI, name2 = DescriptorName,
                    keyness_expanded = log_ratio),
    by = "DescriptorUI"
  ) |>
  mutate(
    DescriptorName   = coalesce(name1, name2),
    keyness_seed     = coalesce(keyness_seed, 0),
    keyness_expanded = coalesce(keyness_expanded, 0),
    shift            = round(keyness_expanded - keyness_seed, 2)
  ) |>
  arrange(desc(shift)) |>
  select(DescriptorName, keyness_seed, keyness_expanded, shift) |>
  head(20) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Recent papers may not be MeSH-indexed yet and the baseline is a snapshot; keyness describes the corpus, it does not judge relevance.

Tuning precision and recall

min_links trades recall for precision – higher thresholds keep only papers tightly woven into the corpus. Sweeping it shows the trade-off directly.

data.frame(
  min_links = 1:5,
  n_candidates = sapply(1:5, function(k) {
    sum(!citation_snowball(seed_icites, direction = "both", min_links = k)$seed)
  })
)

#>   min_links n_candidates
#> 1         1         1049
#> 2         2         1049
#> 3         3         1049
#> 4         4          652
#> 5         5          420

max_nodes is a hard ceiling on corpus size: candidates are ranked by link_count and truncated to fit, bounding the screening workload.

seed_icites |>
  citation_snowball(direction = "both", min_links = 1, max_nodes = 50) |>
  nrow()

#> [1] 951

Choosing a direction

direction matches the search intent: "cited" finds shared foundational references, "citing" finds later work building on the corpus, and "both" casts the widest net.

backward <- citation_snowball(seed_icites, direction = "cited",  min_links = 2)
forward  <- citation_snowball(seed_icites, direction = "citing", min_links = 2)

data.frame(
  direction    = c("cited (foundational)", "citing (downstream)"),
  n_candidates = c(sum(!backward$seed), sum(!forward$seed))
)

#>              direction n_candidates
#> 1 cited (foundational)         1049
#> 2  citing (downstream)         1049

Iterating

Re-seed by feeding the expanded PMIDs back through the iCite endpoint and snowballing again; each hop keeps the same audit columns.

hop2 <- snowball$pmid |>
  get_records(endpoint = "icites", cores = 1L, sleep = 0.25) |>
  citation_snowball(direction = "both", min_links = 3, max_nodes = 500)

Summary

citation_snowball() turns an iCite response into a ranked, auditable candidate set: it finds citation-adjacent papers a keyword query misses, the audit columns document why each was admitted, and MeSH keyness against data_mesh_frequencies characterizes the expansion space. It complements keyword search and manual screening rather than replacing them.