PubTator Sentence Mapping

puremoe retrieves PubTator3 annotations as one row per entity span. pubtator_sentences() attaches the sentence each annotation falls in. pubtator_cooccurrence() then counts entity pairs that share a sentence (or fall within a few sentences of each other), optionally returning the sentence context so that every count traces back to text.

library(puremoe)
library(dplyr)
library(DT)

Retrieve PubTator annotations

Search PubMed and retrieve PubTator3 annotations for a small subset of hits.

pmids <- search_pubmed(
  '"biomarker"[TiAb] AND "cancer"[TiAb]',
  start_year    = 2022,
  end_year      = 2024,
  use_pub_years = TRUE
)

pubtations <- get_records(head(pmids, 40L), endpoint = "pubtations",
                          cores = 1L, sleep = 0.5)

The raw table carries PubTator’s passage text and offsets, which pubtator_sentences() uses to align spans against the exact annotated text.

pubtations |>
  select(pmid, tiab, text, type, identifier, start, end) |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Map annotations to sentences

pubtator_sentences() adds sentence_id, sentence, and the entity’s zero-based offsets within that sentence (sentence_start, sentence_end). Empty placeholder rows – passages with no annotations – are kept with missing sentence fields.

mapped <- pubtator_sentences(pubtations)

mapped |>
  filter(!is.na(text)) |>
  select(pmid, tiab, sentence_id, text, type, identifier,
         sentence_start, sentence_end, sentence) |>
  head(30) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
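To make the offset arithmetic concrete, here is a minimal base R sketch of the general idea behind mapping a passage-level span to a sentence. It is not puremoe's implementation: the toy passage, the naive punctuation-based sentence splitter, and the zero-based, end-exclusive offset convention are all assumptions made for illustration.

```r
# Toy passage; entity offsets are derived from the string itself so they
# cannot drift out of sync with the text.
passage <- "TP53 is mutated in many tumors. PD-L1 predicts response to therapy."
ent <- c("TP53", "tumors", "PD-L1", "therapy")
start <- vapply(ent, function(x) regexpr(x, passage, fixed = TRUE)[1] - 1L,
                integer(1), USE.NAMES = FALSE)
spans <- data.frame(text = ent, start = start, end = start + nchar(ent),
                    stringsAsFactors = FALSE)

# Naive sentence splitter (assumption): a run of non-terminators, then
# terminators, then any trailing whitespace.
mm <- gregexpr("[^.!?]+[.!?]+\\s*", passage)
sent_text  <- regmatches(passage, mm)[[1]]
sent_start <- as.integer(mm[[1]]) - 1L   # zero-based start of each sentence

# A span belongs to the last sentence that starts at or before its start offset.
spans$sentence_id    <- findInterval(spans$start, sent_start)
spans$sentence       <- trimws(sent_text[spans$sentence_id])
# Re-base the passage offsets so they are relative to the sentence.
spans$sentence_start <- spans$start - sent_start[spans$sentence_id]
spans$sentence_end   <- spans$end   - sent_start[spans$sentence_id]

# Sanity check: the sentence-relative offsets recover the entity text.
stopifnot(substr(spans$sentence, spans$sentence_start + 1L,
                 spans$sentence_end) == spans$text)
spans
```

The final check is the useful part of the sketch: if sentence-relative offsets do not slice the entity text back out of the sentence, the span and the sentence are misaligned.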

Entity co-occurrence

With a sentence_id on every annotation, pubtator_cooccurrence() counts co-occurring entity pairs directly. window = 0 counts same-sentence pairs; by = "type" aggregates by entity type, reporting co-occurrence instances (n) and distinct documents (n_pmids).

mapped |>
  pubtator_cooccurrence(window = 0, by = "type") |>
  DT::datatable(rownames = FALSE)

by = "entity" keeps specific labels and identifiers. Entities are de-duplicated per sentence and same-entity pairs are dropped, so same-type pairs survive only between two distinct entities (e.g. two different genes).

mapped |>
  pubtator_cooccurrence(window = 0, by = "entity") |>
  head(30) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
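The counting rules described above (de-duplicate per sentence, drop same-entity pairs, report instances and distinct documents) can be sketched in base R on a toy table. This illustrates the logic only and is not the package's code: the toy rows, the `key_*` and `pair` helper columns, and the alphabetical ordering used to keep one row per unordered pair are assumptions.

```r
# Toy annotations: the repeated TP53 mention in PMID 1, sentence 1 should
# count once, and the two distinct genes in sentence 2 should form a pair.
ann <- data.frame(
  pmid        = c("1", "1", "1", "1", "1", "2", "2"),
  sentence_id = c(1L, 1L, 1L, 2L, 2L, 1L, 1L),
  type        = c("Gene", "Gene", "Disease", "Gene", "Gene", "Gene", "Disease"),
  identifier  = c("7157", "7157", "D009369", "672", "675", "7157", "D009369"),
  stringsAsFactors = FALSE
)

# 1. De-duplicate entities within each sentence.
ents <- unique(ann)

# 2. Self-join on document and sentence to enumerate candidate pairs.
pairs <- merge(ents, ents, by = c("pmid", "sentence_id"),
               suffixes = c("_x", "_y"))

# 3. Keep each unordered pair once; this also drops same-entity pairs.
pairs$key_x <- paste(pairs$type_x, pairs$identifier_x)
pairs$key_y <- paste(pairs$type_y, pairs$identifier_y)
pairs <- pairs[pairs$key_x < pairs$key_y, ]
pairs$pair <- paste(pairs$key_x, pairs$key_y, sep = " | ")

# 4. Count co-occurrence instances (n) and distinct documents (n_pmids).
n       <- table(pairs$pair)
n_pmids <- tapply(pairs$pmid, pairs$pair, function(p) length(unique(p)))
result  <- data.frame(pair    = names(n),
                      n       = as.integer(n),
                      n_pmids = as.integer(n_pmids[names(n)]),
                      stringsAsFactors = FALSE)
result
```

In this toy input the gene-disease pair appears in one sentence in each of two documents, while the gene-gene pair appears once, showing how a same-type pair survives only when the two entities differ.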

window widens the scope to nearby sentences within a passage (title and abstract are kept separate); window = 1 reaches one sentence on either side.

mapped |>
  pubtator_cooccurrence(window = 1, by = "entity") |>
  DT::datatable(rownames = FALSE)
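The effect of window can be sketched as a distance test on sentence_id within a passage. The helper below, `count_pairs()`, is hypothetical and written in base R on toy data; the "title"/"abstract" values for tiab and the single-letter identifiers are assumptions, and the real function may differ in detail.

```r
# Toy entities: one in the title, three spread across abstract sentences 1, 2, 4.
ents <- data.frame(
  pmid        = "1",
  tiab        = c("title", "abstract", "abstract", "abstract"),
  sentence_id = c(1L, 1L, 2L, 4L),
  identifier  = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count unordered pairs of distinct entities whose
# sentences lie at most `window` apart within the same passage.
count_pairs <- function(ents, window) {
  # Joining on tiab keeps title and abstract separate.
  p <- merge(ents, ents, by = c("pmid", "tiab"), suffixes = c("_x", "_y"))
  keep <- p$identifier_x < p$identifier_y &
    abs(p$sentence_id_x - p$sentence_id_y) <= window
  sum(keep)
}

counts <- vapply(0:3, function(w) count_pairs(ents, w), integer(1))
names(counts) <- paste0("window=", 0:3)
counts
```

Here no two entities share a sentence, so window = 0 finds nothing; each wider window admits one more abstract pair, and the title entity never pairs with anything in the abstract.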

Traceable evidence

evidence = TRUE returns the supporting sentence context for each co-occurring pair (identical contexts are de-duplicated), so every count can be traced back to the text behind it.

evidence <- mapped |>
  pubtator_cooccurrence(window = 0, evidence = TRUE)

evidence |>
  head(20) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Filtering the evidence table to the top-ranked pair (here, the first row of the by = "entity" output) pulls up the sentences behind it.

top_pair <- mapped |>
  pubtator_cooccurrence(window = 0, by = "entity") |>
  slice_head(n = 1)

evidence |>
  semi_join(
    top_pair,
    by = c("type_x", "identifier_x", "text_x",
           "type_y", "identifier_y", "text_y")
  ) |>
  select(pmid, tiab, context) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))