---
title: "PubTator Sentence Mapping"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{PubTator Sentence Mapping}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  comment = "#>"
)
```

`puremoe` retrieves PubTator3 annotations as one row per entity span. `pubtator_sentences()` attaches the sentence each annotation falls in. `pubtator_cooccurrence()` then counts entity pairs that share a sentence or fall within a few sentences of each other, and can optionally return the sentence context so that every count traces back to text.

```{r libs}
library(puremoe)
library(dplyr)
library(DT)
```

## Retrieve PubTator annotations

Search PubMed, then retrieve PubTator3 annotations for the first 40 hits.

```{r search}
pmids <- search_pubmed(
  '"biomarker"[TiAb] AND "cancer"[TiAb]',
  start_year    = 2022,
  end_year      = 2024,
  use_pub_years = TRUE
)

pubtations <- get_records(head(pmids, 40L), endpoint = "pubtations",
                          cores = 1L, sleep = 0.5)
```

The raw table carries PubTator's passage text and offsets, which `pubtator_sentences()` uses to align spans against the exact annotated text.

```{r pubtations-table}
pubtations |>
  select(pmid, tiab, text, type, identifier, start, end) |>
  head(25) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```

## Map annotations to sentences

`pubtator_sentences()` adds `sentence_id`, `sentence`, and the entity's zero-based offsets within that sentence (`sentence_start`, `sentence_end`). Empty placeholder rows -- passages with no annotations -- are kept with missing sentence fields.

```{r mapped}
mapped <- pubtator_sentences(pubtations)

mapped |>
  # Hide rows with no entity text, such as placeholders for unannotated passages
  filter(!is.na(text)) |>
  select(pmid, tiab, sentence_id, text, type, identifier,
         sentence_start, sentence_end, sentence) |>
  head(30) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```
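
The sentence-relative offsets can be checked by slicing an entity back out of its sentence. The chunk below is a minimal sketch on made-up values rather than PubTator output. It assumes the end offset is exclusive (one past the last character), which is worth confirming against a few rows of `mapped`.

```{r sketch-offsets}
# Toy values, not PubTator output.
# Assumption: offsets are zero-based and the end offset is exclusive.
sentence       <- "TP53 is a prognostic biomarker in breast cancer."
sentence_start <- 0L
sentence_end   <- 4L

# substr() is one-based and end-inclusive, so only the start needs shifting.
recovered <- substr(sentence, sentence_start + 1L, sentence_end)
recovered
```

If the recovered string comes back one character short of `text` on real rows, the end offset is probably inclusive, and `sentence_end + 1L` is the upper bound to use instead.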

## Entity co-occurrence

With a `sentence_id` on every annotation, `pubtator_cooccurrence()` counts co-occurring entity pairs directly. `window = 0` counts same-sentence pairs; `by = "type"` aggregates by entity type, reporting co-occurrence instances (`n`) and distinct documents (`n_pmids`).

```{r cooccur-type}
mapped |>
  pubtator_cooccurrence(window = 0, by = "type") |>
  DT::datatable(rownames = FALSE)
```

`by = "entity"` keeps specific labels and identifiers. Entities are de-duplicated per sentence and same-entity pairs are dropped, so same-type pairs survive only between two distinct entities (e.g. two different genes).

```{r cooccur-entity}
mapped |>
  pubtator_cooccurrence(window = 0, by = "entity") |>
  head(30) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```
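
The de-duplication and pair logic described above can be approximated in plain `dplyr`. This is a sketch on a hypothetical toy table, not the implementation inside `pubtator_cooccurrence()`. The `key` helper column and all values are made up for illustration, and the `relationship` argument needs dplyr 1.1.0 or later.

```{r sketch-pairs}
library(dplyr)

# Hypothetical annotations, one row per entity span.
toy <- tibble(
  pmid        = c("1", "1", "1", "1", "2", "2"),
  sentence_id = c(1L, 1L, 1L, 1L, 1L, 1L),
  type        = c("Gene", "Gene", "Gene", "Disease", "Gene", "Disease"),
  identifier  = c("7157", "7157", "672", "MESH:D001943",
                  "7157", "MESH:D001943")
)

# Step 1: keep each entity once per sentence (drops the repeated 7157 mention).
ents <- toy |>
  distinct(pmid, sentence_id, type, identifier) |>
  mutate(key = paste(type, identifier, sep = "|"))

# Step 2: self-join within a sentence; `key_x < key_y` drops same-entity pairs
# and keeps each unordered pair once.
pairs <- ents |>
  inner_join(ents, by = c("pmid", "sentence_id"),
             suffix = c("_x", "_y"), relationship = "many-to-many") |>
  filter(key_x < key_y)

# Step 3: count co-occurrence instances and distinct documents per pair.
pair_counts <- pairs |>
  group_by(type_x, identifier_x, type_y, identifier_y) |>
  summarise(n = n(), n_pmids = n_distinct(pmid), .groups = "drop") |>
  arrange(desc(n))

pair_counts
```

In this toy table the two genes in PMID 1 form a valid same-type pair, while the repeated mention of gene 7157 contributes nothing extra.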

`window` widens the scope to nearby sentences within a passage (title and abstract are kept separate); `window = 1` reaches one sentence on either side.

```{r cooccur-window}
mapped |>
  pubtator_cooccurrence(window = 1, by = "entity") |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```
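
One way to picture the window is as a distance filter on `sentence_id` applied within each passage. The sketch below uses a hypothetical toy table; the `tiab` labels, the sentence numbering, and the join itself are assumptions for illustration and may not match how `pubtator_cooccurrence()` works internally.

```{r sketch-window}
library(dplyr)

# Hypothetical annotations: one title entity and three abstract entities.
toy_win <- tibble(
  pmid        = "1",
  tiab        = c("title", "abstract", "abstract", "abstract"),
  sentence_id = c(1L, 2L, 3L, 5L),
  identifier  = c("A", "B", "C", "D")
)

window <- 1L

nearby <- toy_win |>
  # Joining on `tiab` keeps title and abstract apart.
  inner_join(toy_win, by = c("pmid", "tiab"),
             suffix = c("_x", "_y"), relationship = "many-to-many") |>
  filter(
    identifier_x < identifier_y,                      # each unordered pair once
    abs(sentence_id_x - sentence_id_y) <= window      # within the window
  )

nearby
```

Only the B and C pair survives here: A sits in the title, and D is two sentences away from C. Setting `window <- 0L` would leave no pairs, since no two toy entities share a sentence.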

## Traceable evidence

`evidence = TRUE` returns the supporting sentence `context` for each co-occurring pair (identical contexts de-duplicated), so a count is never a dead end.

```{r evidence}
evidence <- mapped |>
  pubtator_cooccurrence(window = 0, evidence = TRUE)

evidence |>
  head(20) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```

Filtering to the top-ranked pair pulls up the sentences behind it.

```{r evidence-examples}
# Take the first row, assuming the result is sorted with the most frequent pair first
top_pair <- mapped |>
  pubtator_cooccurrence(window = 0, by = "entity") |>
  slice_head(n = 1)

evidence |>
  semi_join(
    top_pair,
    by = c("type_x", "identifier_x", "text_x",
           "type_y", "identifier_y", "text_y")
  ) |>
  select(pmid, tiab, context) |>
  DT::datatable(rownames = FALSE, options = list(scrollX = TRUE))
```