| Title: | Unified Retrieval of 'PubMed' and 'NIH' Literature Data |
|---|---|
| Description: | Access a variety of 'PubMed' data through a single, user-friendly interface, including abstracts, bibliometrics from 'iCite', pubtations from 'PubTator3', and full-text records from 'PMC'. |
| Authors: | Jason Timm [aut, cre] |
| Maintainer: | Jason Timm <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-06-03 03:25:21 UTC |
| Source: | https://github.com/jaytimm/puremoe |
Converts an icites data.table into a tidy graph representation
(nodes + edges) suitable for igraph or tidygraph. Only edges
where both endpoints are present in the corpus are retained, so the
graph is bounded to the papers you already have metadata for.
citation_network(icites)
icites |
A |
RCR and is_clinical are carried as node attributes, making the
resulting graph immediately weighted by field-normalized impact and enabling
bench-to-bedside edge filtering without any additional API calls.
A named list with two data.tables:
nodes: One row per PMID. Contains all iCite metadata
columns except citation_net. Key columns: pmid,
relative_citation_ratio, nih_percentile,
is_clinical.
edges: One row per within-corpus directed citation.
Columns: from_pmid (the citing paper),
to_pmid (the cited paper).
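The "both endpoints in the corpus" rule can be pictured with a small base R sketch. This is not the package's implementation: the PMIDs and links below are invented, and only the from_pmid/to_pmid column names are taken from the description above.

```r
# Invented corpus and citation links, for illustration only.
corpus_pmids <- c("101", "102", "103")

all_links <- data.frame(
  from_pmid = c("101", "102", "103", "999"),  # citing paper
  to_pmid   = c("102", "103", "888", "101"),  # cited paper
  stringsAsFactors = FALSE
)

# Keep a link only when both the citing and the cited paper are in the corpus.
in_corpus <- all_links$from_pmid %in% corpus_pmids &
  all_links$to_pmid %in% corpus_pmids
edges <- all_links[in_corpus, , drop = FALSE]
rownames(edges) <- NULL

# A nodes + edges list, mirroring the shape of the return value described above.
nodes <- data.frame(pmid = corpus_pmids, stringsAsFactors = FALSE)
graph <- list(nodes = nodes, edges = edges)
print(graph$edges)
```

Links to PMIDs 888 and 999 are dropped because those papers are outside the corpus, which is why expanding the corpus first changes the resulting graph.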
## Not run:
# network from a seed corpus
pmids |>
  get_records(endpoint = "icites") |>
  citation_network()

# expand first, then fetch iCite metadata for the full network
snowball <- pmids |>
  get_records(endpoint = "icites") |>
  citation_snowball()
snowball$pmid |>
  get_records(endpoint = "icites") |>
  citation_network()

# translational footprint: filter to bench -> clinical edges
snowball <- pmids |>
  get_records(endpoint = "icites") |>
  citation_snowball()
net <- snowball$pmid |>
  get_records(endpoint = "icites") |>
  citation_network()
clinical_edges <- net$edges |>
  merge(net$nodes[, .(pmid, is_clinical)], by.x = "to_pmid", by.y = "pmid") |>
  subset(is_clinical == TRUE)
## End(Not run)
Starting from an icites data.table returned by
get_records(endpoint = "icites"), follows the citation links
already present in the citation_net column and returns a candidate
table. The function does not call iCite again; use
get_records(endpoint = "icites") explicitly on the returned
PMIDs if metadata is needed for the expanded corpus.
citation_snowball(
  icites,
  max_nodes = 2000,
  direction = c("both", "citing", "cited"),
  min_links = 2
)
icites |
A |
max_nodes |
Hard ceiling on the total number of PMIDs in the returned
corpus (seed + discovered). Candidates are filtered by |
direction |
One of |
min_links |
Minimum number of seed papers a candidate must be linked
to in order to be included. Default |
A data.table with one row per seed or candidate PMID.
Columns are pmid, seed, cited_links,
citing_links, and link_count. cited_links counts seed
papers that cite the candidate; citing_links counts seed papers
cited by the candidate.
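As a rough illustration of the link counts and the min_links filter, the base R sketch below tallies seed links for a handful of invented papers. It makes two assumptions not stated in the documentation: that link_count is the sum of cited_links and citing_links, and that the filter is applied to link_count. The max_nodes cap and the direction argument are not modelled.

```r
seed <- c("1", "2", "3")   # invented seed PMIDs
min_links <- 2

# Invented directed links (from = citing paper, to = cited paper).
links <- unique(data.frame(
  from = c("1", "2", "3", "1", "50", "50", "60"),
  to   = c("10", "10", "10", "20", "1", "2", "3"),
  stringsAsFactors = FALSE
))

# Links from a seed paper to a non-seed paper, and the reverse.
out <- links[links$from %in% seed & !(links$to %in% seed), ]
inc <- links[links$to %in% seed & !(links$from %in% seed), ]

candidates <- union(out$to, inc$from)

# Count how often each candidate appears in a vector of linked PMIDs.
count_in <- function(ids, linked) {
  vapply(ids, function(id) sum(linked == id), integer(1), USE.NAMES = FALSE)
}

snowball_sketch <- data.frame(
  pmid = candidates,
  cited_links = count_in(candidates, out$to),     # seeds that cite the candidate
  citing_links = count_in(candidates, inc$from),  # seeds cited by the candidate
  stringsAsFactors = FALSE
)
# Assumption: link_count is the sum of the two directional counts.
snowball_sketch$link_count <-
  snowball_sketch$cited_links + snowball_sketch$citing_links

kept <- snowball_sketch[snowball_sketch$link_count >= min_links, ]
print(kept)
```

Raising min_links trades recall for precision: candidates tied to a single seed paper are the most likely to be off-topic.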
## Not run:
pmids <- search_pubmed("metformin AND PCOS [TiAb]")
snowball <- pmids |>
  get_records(endpoint = "icites") |>
  citation_snowball(direction = "cited", min_links = 2)
snowball$pmid |>
  get_records(endpoint = "pubmed_abstracts")
## End(Not run)
Baseline frequencies for MeSH descriptors computed from a local PostgreSQL mirror of PubMed (April 2026). For each descriptor, counts reflect the number of distinct PMIDs indexed with that term; proportions use the full PubMed corpus of 39,703,112 PMIDs as denominator. Descriptor UI and canonical name are joined from the NLM MeSH thesaurus. Intended as a baseline for MeSH term enrichment analyses against arbitrary PubMed subsets.
data_mesh_frequencies
A data.table with 30,521 rows and 4 columns:
MeSH descriptor unique identifier (e.g.,
D000001)
Canonical MeSH descriptor name
Number of distinct PubMed records indexed with this descriptor
Proportion of all 39,703,112 PubMed PMIDs indexed with this descriptor
Computed from the mesh_descriptor table in a local PubMed
PostgreSQL mirror; descriptor metadata from the NLM MeSH Thesaurus
(April 2026).
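One way such a baseline might be used for enrichment is sketched below in base R. The descriptor names, counts, and column names (descriptor_name, n_pmids, prop, n_subset) are hypothetical stand-ins, not the dataset's actual columns; only the corpus size of 39,703,112 comes from the description above. The one-sided binomial tail is one reasonable test among several.

```r
total_pmids <- 39703112  # corpus size stated in the dataset description

# Hypothetical baseline rows; names and counts are invented.
baseline <- data.frame(
  descriptor_name = c("Term A", "Term B"),
  n_pmids = c(397031, 3970311),
  stringsAsFactors = FALSE
)
baseline$prop <- baseline$n_pmids / total_pmids

# Hypothetical counts in a PubMed subset of 1,000 records.
subset_size <- 1000
subset_counts <- data.frame(
  descriptor_name = c("Term A", "Term B"),
  n_subset = c(50, 100),
  stringsAsFactors = FALSE
)

enrich <- merge(subset_counts, baseline, by = "descriptor_name")
enrich$observed_prop <- enrich$n_subset / subset_size
enrich$ratio <- enrich$observed_prop / enrich$prop

# One-sided binomial tail: P(X >= n_subset) under the baseline proportion.
enrich$p_value <- pbinom(enrich$n_subset - 1, subset_size, enrich$prop,
                         lower.tail = FALSE)
print(enrich[, c("descriptor_name", "ratio", "p_value")])
```

Here "Term A" appears about five times more often than the baseline predicts, while "Term B" sits at its baseline rate.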
This function downloads and combines the 'MeSH' (Medical Subject Headings) Thesaurus and a supplemental concept thesaurus. The data is sourced from specified URLs and stored locally for subsequent use. By default, the data is stored in a temporary directory. Users can opt into persistent storage by setting 'use_persistent_storage' to TRUE and optionally specifying a path.
data_mesh_thesaurus(
  path = NULL,
  use_persistent_storage = FALSE,
  force_install = FALSE
)
path |
A character string specifying the directory path where data should be stored. If not provided and persistent storage is requested, it defaults to a system-appropriate persistent location managed by 'rappdirs'. |
use_persistent_storage |
A logical value indicating whether to use persistent storage. If TRUE and no path is provided, data will be stored in a system-appropriate location. Defaults to FALSE, using a temporary directory. |
force_install |
A logical value indicating whether to force re-downloading of the data even if it already exists locally. |
A data.table containing the combined MeSH and supplemental thesaurus data.
if (interactive()) {
  data <- data_mesh_thesaurus()
}
This function downloads and loads the 'MeSH' (Medical Subject Headings) Trees data.
data_mesh_trees(
  path = NULL,
  use_persistent_storage = FALSE,
  force_install = FALSE
)
path |
A character string specifying the directory path where data should be stored. If not provided and persistent storage is requested, it defaults to a system-appropriate persistent location managed by 'rappdirs'. |
use_persistent_storage |
A logical value indicating whether to use persistent storage. If TRUE and no path is provided, data will be stored in a system-appropriate location. Defaults to FALSE, using a temporary directory. |
force_install |
A logical value indicating whether to force re-downloading of the data even if it already exists locally. |
The data is sourced from specified URLs and stored locally for subsequent use. By default, the data is stored in a temporary directory. Users can opt into persistent storage by setting 'use_persistent_storage' to TRUE and optionally specifying a path.
A data frame containing the MeSH Trees data.
if (interactive()) {
  data <- data_mesh_trees()
}
This function provides detailed information about the available endpoints in the package, including column descriptions, parameters, rate limits, and usage notes.
endpoint_info(endpoint = NULL, format = c("list", "json"))
endpoint |
Character string specifying which endpoint to get information about. If NULL (default), returns a list of all available endpoints. |
format |
Character string specifying the output format. Either "list" (default) or "json" for JSON-formatted output. |
If endpoint is NULL, returns a character vector of available endpoint names.
If endpoint is specified, returns a list (or JSON string) with detailed information
about that endpoint including description, columns, parameters, rate limits, and notes.
if (interactive()) {
  # List all available endpoints
  endpoint_info()

  # Get information about a specific endpoint
  endpoint_info("pubmed_abstracts")

  # Get information in JSON format
  endpoint_info("icites", format = "json")
}
This function retrieves different types of data (such as 'PubMed' records, affiliations, and 'iCite' data) from 'PubMed' based on the provided PMIDs. It supports parallel processing for efficiency.
get_records(
  pmids,
  endpoint = c("pubtations", "icites", "pubmed_affiliations", "pubmed_abstracts", "pmc_fulltext"),
  cores = 3,
  sleep = 1,
  ncbi_key = NULL,
  icite_timeout = getOption("puremoe.icite_timeout", 15)
)
pmids |
A vector of PMIDs for which data is to be retrieved. For 'pmc_fulltext' endpoint,
provide full URLs instead (e.g., from |
endpoint |
A character vector specifying the type of data to retrieve ('pubtations', 'icites', 'pubmed_affiliations', 'pubmed_abstracts', 'pmc_fulltext'). |
cores |
Number of cores to use for parallel processing (default is 3). |
sleep |
Duration (in seconds) to pause after each batch |
ncbi_key |
(Optional) NCBI API key for authenticated access. |
icite_timeout |
Maximum elapsed seconds to allow each iCite batch before
skipping it and returning PMID-only rows. Defaults to the
|
For the 'pmc_fulltext' endpoint, provide full URLs to PMC tar.gz files.
Use pmid_to_ftp to convert PMIDs to PMC IDs and full-text URLs first.
A data.table containing combined results from the specified endpoint.
pmids <- c("38136652")
results <- get_records(pmids, endpoint = "pubmed_abstracts", cores = 1)
This function converts PMIDs to PMC IDs, then fetches the full-text file URLs from the PMC Open Access service. It combines both steps into a single workflow.
pmid_to_ftp(
  pmids,
  batch_size = 200L,
  sleep = 0.5,
  verbose = FALSE,
  ncbi_key = NULL
)
pmids |
A character or numeric vector of PubMed IDs (PMIDs) to convert. |
batch_size |
An integer specifying the number of PMIDs to process per batch for ID conversion. Defaults to 200L. The NCBI API has limitations on batch sizes. |
sleep |
A numeric value specifying the number of seconds to pause between API requests for ID conversion (Step 1). Defaults to 0.5 seconds. For OA API calls (Step 2), sleep time is automatically adjusted based on rate limits: 0.11s with API key (10 req/sec), 0.34s without (3 req/sec). |
verbose |
Logical, whether to print progress messages. Defaults to FALSE. |
ncbi_key |
(Optional) NCBI API key for authenticated access. |
A data.table with columns:
pmid: The input PubMed ID
pmcid: The corresponding PMC ID
doi: The corresponding DOI (NA if not available)
url: The full HTTPS URL for downloading PMC full text
Results are filtered to only include rows with valid URLs (open access articles), ordered by PMID. Returns NULL with a message if the API is unavailable or returns invalid data.
if (interactive()) {
  # Convert PMIDs to PMC IDs and get full-text URLs
  result <- pmid_to_ftp(c("11250746", "11573492"))
}
This function converts a vector of PubMed IDs (PMIDs) to their corresponding PubMed Central (PMC) IDs and DOIs using the NCBI ID Converter API.
pmid_to_pmc(pmids, batch_size = 200L, sleep = 0.5)
pmids |
A character or numeric vector of PubMed IDs (PMIDs) to convert. |
batch_size |
An integer specifying the number of PMIDs to process per batch. Defaults to 200L. The NCBI API has limitations on batch sizes. |
sleep |
A numeric value specifying the number of seconds to pause between API requests. Defaults to 0.5 seconds to respect API rate limits. |
A data.table with columns:
pmid: The input PubMed ID
pmcid: The corresponding PMC ID (NA if not available in PMC)
doi: The corresponding DOI (NA if not available)
Results are ordered by PMID. Returns NULL with a message if the API is unavailable or returns invalid data.
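The batching implied by batch_size can be pictured as follows. This is a minimal base R sketch with invented PMIDs, not the function's actual internals; a real loop would also pause for the sleep interval between requests.

```r
pmids <- as.character(seq_len(450))  # invented IDs, for illustration only
batch_size <- 200L

# Assign each PMID to a batch number, then split into a list of batches.
batch_index <- ceiling(seq_along(pmids) / batch_size)
batches <- split(pmids, batch_index)

batch_sizes <- lengths(batches)
print(batch_sizes)
```

With 450 PMIDs and the default batch size, this yields two full batches and one partial batch, so three requests would be needed.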
if (interactive()) {
  # Convert a single PMID to PMC ID
  result <- pmid_to_pmc("12345678")

  # Convert multiple PMIDs
  pmids <- c("12345678", "23456789", "34567890")
  result <- pmid_to_pmc(pmids, batch_size = 10, sleep = 1)
}
Counts pairs of biomedical entities that co-occur within the same sentence
(window = 0) or within window sentences of each other, using the
sentence-mapped annotation table returned by pubtator_sentences.
Co-occurrence is computed within each pmid/tiab passage: title
and abstract are treated separately because their sentence offsets are
numbered independently.
pubtator_cooccurrence(
  mapped,
  window = 0L,
  by = c("type", "entity"),
  evidence = FALSE
)
mapped |
A |
window |
Non-negative integer sentence distance. |
by |
One of |
evidence |
Logical. When |
Entities are de-duplicated to one mention per sentence before pairing.
Pairs of the same entity (identical type, identifier,
and text) are dropped; same-type pairs between two distinct
entities (e.g. two different genes) are retained.
Counting follows windowed-collocation semantics: a pair contributes one
instance for each pair of mentions within window sentences of each
other. At window = 0 this is simply one instance per shared sentence,
but for window > 0 a pair recurring across several sentences yields
multiple instances, so counts scale with mention frequency. n_pmids
(distinct documents) is unaffected and is the more conservative signal.
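The counting rule can be traced on a toy passage. The base R sketch below is an illustration under simplifying assumptions, not the package's code: entities are reduced to a single invented label, there is one document and one passage, and count_pairs is a hypothetical helper.

```r
# Invented mentions in one passage; unique() keeps one mention per sentence.
mentions <- unique(data.frame(
  pmid = "p1",
  sentence_id = c(1, 1, 1, 2, 3),
  entity = c("GeneA", "GeneA", "DrugX", "GeneA", "DrugX"),
  stringsAsFactors = FALSE
))

# Hypothetical helper: count mention pairs within `window` sentences.
count_pairs <- function(m, window) {
  # All mention-by-mention combinations within the same document.
  pairs <- merge(m, m, by = "pmid", suffixes = c("_x", "_y"))
  # entity_x < entity_y drops same-entity pairs and keeps each
  # unordered pair of distinct entities exactly once.
  keep <- pairs$entity_x < pairs$entity_y &
    abs(pairs$sentence_id_x - pairs$sentence_id_y) <= window
  sum(keep)
}

counts <- vapply(0:2, function(w) count_pairs(mentions, w), integer(1))
print(counts)
```

The DrugX/GeneA pair is counted once at window = 0, three times at window = 1, and four times at window = 2, all from a single document, which is why the document count is the steadier signal.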
A data.table. With evidence = FALSE and
by = "type": type_x, type_y, n (co-occurrence
instances), and n_pmids (distinct documents), ordered by n.
With by = "entity": the same plus
identifier_x/text_x/identifier_y/text_y. With
evidence = TRUE: one row per distinct context string for an
entity pair (identical contexts de-duplicated), with pmid,
tiab, the two entities' type/identifier/text,
and context.
## Not run:
pmids <- search_pubmed('"biomarker"[TiAb] AND "cancer"[TiAb]')
mapped <- pmids |>
  get_records(endpoint = "pubtations") |>
  pubtator_sentences()

# same-sentence entity-type co-occurrence
mapped |> pubtator_cooccurrence(window = 0, by = "type")

# specific entity pairs within one sentence on either side
mapped |> pubtator_cooccurrence(window = 1, by = "entity")

# traceable evidence: every instance with its sentence context
mapped |> pubtator_cooccurrence(window = 0, evidence = TRUE)
## End(Not run)
Splits abstract text into sentences and assigns each PubTator3 entity annotation to its containing sentence via character-offset overlap. When available, the PubTator3 passage text and offsets are used directly.
pubtator_sentences(pubtations)
pubtations |
A data.table returned by
|
A data.table with annotation columns plus integer
sentence_id, sentence, sentence_start, and
sentence_end. sentence_start and sentence_end are
zero-based, end-exclusive entity offsets within sentence. PubTator passage metadata
columns are used for mapping but are not returned. Only passage annotations
that can be assigned to a sentence are returned.
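Because the offsets are zero-based and end-exclusive while R's substr() is one-based and end-inclusive, recovering the entity text from a sentence needs a shift on the start only. A minimal sketch with an invented sentence and offsets:

```r
sentence <- "Metformin improves insulin sensitivity."

# Invented zero-based, end-exclusive offsets for two entity mentions.
sentence_start <- c(0L, 19L)
sentence_end   <- c(9L, 26L)

# Shift the start by one; the exclusive end already equals the
# one-based inclusive end position.
mention <- substr(rep(sentence, 2), sentence_start + 1L, sentence_end)
print(mention)
```

The mention length is simply sentence_end minus sentence_start, with no off-by-one adjustment.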
## Not run:
pmids <- search_pubmed('"Biomarkers Consortium"')
pubtations <- get_records(pmids, endpoint = "pubtations")
mapped <- pubtator_sentences(pubtations)
## End(Not run)
Performs a 'PubMed' search based on a query, optionally filtered by publication years. Returns a unique set of 'PubMed' IDs matching the query.
search_pubmed(
  x,
  start_year = NULL,
  end_year = NULL,
  retmax = 9999,
  use_pub_years = FALSE
)
x |
Character string, the search query. |
start_year |
Integer, the start year of publication date range (used if 'use_pub_years' is TRUE). |
end_year |
Integer, the end year of publication date range (used if 'use_pub_years' is TRUE). |
retmax |
Integer, maximum number of records to retrieve, defaults to 9999. |
use_pub_years |
Logical, whether to filter search by publication years, defaults to FALSE. |
Numeric vector of unique PubMed IDs.
ethnob1 <- search_pubmed("ethnobotany", 2010, 2012)