puremoe retrieves PubTator3 annotations as one row per entity span.
pubtator_sentences() attaches the sentence each annotation
falls in, and pubtator_cooccurrence() then counts entity
pairs that share a sentence (or fall within a few sentences) –
optionally returning the sentence context so every count traces back to
text.
Search PubMed and retrieve PubTator3 annotations for a small subset of hits.

library(puremoe)  # provides search_pubmed() and get_records()

pmids <- search_pubmed(
  '"biomarker"[TiAb] AND "cancer"[TiAb]',
  start_year = 2022,
  end_year = 2024,
  use_pub_years = TRUE
)
# Fetch PubTator3 annotations for the first 40 hits only.
pubtations <- get_records(head(pmids, 40L), endpoint = "pubtations",
                          cores = 1L, sleep = 0.5)

The raw table carries PubTator’s passage text and offsets, which
pubtator_sentences() uses to align spans against the exact
annotated text.
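To make the role of the offsets concrete, here is a minimal base-R sketch on made-up data. The passage, the `ann` data frame and its column names are illustrative assumptions, not puremoe's actual schema, and the offsets are assumed to be zero-based with an exclusive end. It only checks that each span reproduces its annotated text.

```r
# Toy passage and annotations; all names and values here are hypothetical.
passage <- "BRCA1 mutations raise breast cancer risk. TP53 is also implicated."
ann <- data.frame(
  text  = c("BRCA1", "breast cancer", "TP53"),
  start = c(0L, 22L, 42L),   # zero-based start offset (assumed)
  end   = c(5L, 35L, 46L),   # end offset, assumed exclusive
  stringsAsFactors = FALSE
)

# substr() is one-based and end-inclusive, so shift only the start by one.
ann$extracted <- substr(rep(passage, nrow(ann)), ann$start + 1L, ann$end)
# A span is aligned when the extracted text equals the annotated text.
ann$aligned <- ann$extracted == ann$text
print(ann)
```

A row with aligned = FALSE would indicate that the offsets and the passage text refer to different versions of the text.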
pubtator_sentences() adds sentence_id,
sentence, and the entity’s zero-based offsets within that
sentence (sentence_start, sentence_end). Empty
placeholder rows – passages with no annotations – are kept with missing
sentence fields.
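The mapping from passage-level offsets to sentence-relative ones can be sketched in base R as follows. This is an illustration under stated assumptions, not puremoe's implementation: the splitter is deliberately naive (it breaks after ". "), `locate_span()` is a hypothetical helper, sentence ids are one-based here, and span ends are treated as exclusive.

```r
passage <- "BRCA1 mutations raise breast cancer risk. TP53 is also implicated."

# Naive splitter for illustration only: break on the space after a period.
sentences <- strsplit(passage, "(?<=\\.) ", perl = TRUE)[[1]]
# Zero-based passage offset at which each sentence begins
# (+1 per preceding sentence for the separating space).
sent_start <- cumsum(c(0L, head(nchar(sentences) + 1L, -1L)))

# Hypothetical helper: locate a passage-level span (zero-based start,
# exclusive end) and express it relative to its containing sentence.
locate_span <- function(start, end) {
  id <- findInterval(start, sent_start)  # index of the containing sentence
  list(
    sentence_id    = id,
    sentence       = sentences[id],
    sentence_start = start - sent_start[id],
    sentence_end   = end - sent_start[id]
  )
}

tp53 <- locate_span(42L, 46L)
str(tp53)
```

Here the span for "TP53" sits at passage offsets 42 to 46 and lands at offsets 0 to 4 of the second sentence.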
With a sentence_id on every annotation in the sentence-mapped table (mapped in the code below),
pubtator_cooccurrence() counts co-occurring entity pairs
directly. window = 0 counts same-sentence pairs;
by = "type" aggregates by entity type, reporting
co-occurrence instances (n) and distinct documents
(n_pmids).
by = "entity" keeps specific labels and identifiers.
Entities are de-duplicated per sentence and same-entity pairs are
dropped, so same-type pairs survive only between two distinct entities
(e.g. two different genes).
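The counting rules just described can be illustrated on a toy table in base R. This is a sketch of the idea only, not the package's code: the `ann` data frame is invented, entities are identified by `identifier` alone, and pairs are aggregated at the entity level.

```r
# Hypothetical sentence-mapped annotations (not real PubTator output).
ann <- data.frame(
  pmid        = c(1, 1, 1, 1, 2, 2),
  sentence_id = c(1, 1, 1, 2, 1, 1),
  type        = c("Gene", "Gene", "Disease", "Gene", "Gene", "Disease"),
  identifier  = c("672", "672", "D001943", "7157", "672", "D001943"),
  stringsAsFactors = FALSE
)

# De-duplicate entities within each sentence.
ann <- unique(ann)

# Self-join on (pmid, sentence_id) to enumerate same-sentence pairs.
pairs <- merge(ann, ann, by = c("pmid", "sentence_id"),
               suffixes = c("_x", "_y"))
# Keep each unordered pair once; this also drops same-entity pairs.
pairs <- pairs[pairs$identifier_x < pairs$identifier_y, ]

# n = co-occurrence instances, n_pmids = distinct documents.
key <- paste(pairs$identifier_x, pairs$identifier_y, sep = " | ")
counts <- data.frame(
  pair    = names(table(key)),
  n       = as.integer(table(key)),
  n_pmids = as.integer(tapply(pairs$pmid, key,
                              function(p) length(unique(p)))),
  stringsAsFactors = FALSE
)
print(counts)
```

The repeated gene mention in the first sentence is counted once, and the lone gene in the second sentence forms no pair, so the single surviving pair has n = 2 across two documents.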

mapped |>
  pubtator_cooccurrence(window = 0, by = "entity") |>
  head(30) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

window widens the scope to nearby sentences within a
passage (title and abstract are kept separate); window = 1
reaches one sentence on either side.
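A toy version of the window logic, again in base R, may help. The `cooccur_within()` helper, the `passage` column and the assumption that sentences are numbered within each passage are all hypothetical and are not taken from puremoe.

```r
# Hypothetical annotations: one entity per row.
ann <- data.frame(
  pmid        = c(1, 1, 1, 1),
  passage     = c("abstract", "abstract", "abstract", "title"),
  sentence_id = c(1, 2, 4, 1),
  entity      = c("BRCA1", "breast cancer", "TP53", "biomarker"),
  stringsAsFactors = FALSE
)

cooccur_within <- function(ann, window = 0) {
  # Pair entities only within the same document and the same passage.
  p <- merge(ann, ann, by = c("pmid", "passage"), suffixes = c("_x", "_y"))
  # Keep each unordered pair of distinct entities once.
  p <- p[p$entity_x < p$entity_y, ]
  # Keep pairs whose sentences are at most `window` apart.
  p[abs(p$sentence_id_x - p$sentence_id_y) <= window,
    c("pmid", "passage", "entity_x", "entity_y")]
}

nrow(cooccur_within(ann, window = 0))  # no two entities share a sentence
cooccur_within(ann, window = 1)        # BRCA1 and breast cancer are adjacent
```

The title entity never pairs with abstract entities because the join key includes the passage, which mirrors the rule that title and abstract are kept separate.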
evidence = TRUE returns the supporting sentence
context for each co-occurring pair (identical contexts
de-duplicated), so a count is never a dead end.

evidence <- mapped |>
  pubtator_cooccurrence(window = 0, evidence = TRUE)

evidence |>
  head(20) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))

Filtering to the top-ranked pair pulls up the sentences behind it.
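
library(dplyr)  # slice_head(), semi_join(), select()

# Take the first row of the entity-level table as the top-ranked pair.
top_pair <- mapped |>
  pubtator_cooccurrence(window = 0, by = "entity") |>
  slice_head(n = 1)

# Keep only the evidence rows that match that pair on both entities.
evidence |>
  semi_join(
    top_pair,
    by = c("type_x", "identifier_x", "text_x",
           "type_y", "identifier_y", "text_y")
  ) |>
  select(pmid, tiab, context) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))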