| Title: | A Lightweight and Versatile NLP Toolkit |
|---|---|
| Description: | An R toolkit for building text corpora and searching them. No custom object classes, just plain data frames from start to finish. Covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all compatible with the native R pipe ('|>'). |
| Authors: | Jason Timm [aut, cre] |
| Maintainer: | Jason Timm <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.1 |
| Built: | 2026-05-17 08:50:05 UTC |
| Source: | https://github.com/jaytimm/textpress |
Common abbreviations for NLP (e.g. sentence splitting). Named list; used by
nlp_split_sentences.
abbreviations
A named list with the following components:
abbreviations: A character vector of common abbreviations, including titles, months, and standard abbreviations.
Internally compiled linguistic resource.
A small dictionary of generational cohort terms (Greatest, Silent, Boomers,
Gen X, Millennials, Gen Z, Alpha, etc.) and spelling/variant forms, for use
with search_dict. Built in-package (no data()).
dict_generations
A data frame with columns variant (surface form to match), TermName (standardized label), is_cusp (logical), start and end (birth year range; Pew definitions where applicable, see https://github.com/jaytimm/AmericanGenerations/blob/main/data/pew-generations.csv).
head(dict_generations) # use as term list: search_dict(corpus, by = "doc_id", terms = dict_generations$variant)
A small dictionary of political party and ideology terms (Democrat, Republican,
MAGA, Liberal, Conservative, Christian Nationalist, White Supremacist, etc.)
and spelling/variant forms, for use with search_dict. Built in-package (no data()).
dict_political
A data frame with columns variant (surface form to match) and TermName (standardized label).
head(dict_political) # search_dict(corpus, by = "doc_id", terms = dict_political$variant)
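
Both dictionaries pair a surface form (variant) with a standardized label (TermName), so matched forms can be collapsed to one label per concept. The following is a minimal base-R sketch of that lookup; the small `dict` and `hits` data frames are hypothetical stand-ins, not package data or actual search_dict output.

```r
# Hypothetical mini-dictionary with the same two columns as dict_political.
dict <- data.frame(
  variant  = c("democrat", "democrats", "republican", "gop"),
  TermName = c("Democrat", "Democrat", "Republican", "Republican"),
  stringsAsFactors = FALSE
)

# Hypothetical matches: surface forms found in two documents.
hits <- data.frame(
  doc_id = c("1", "1", "2"),
  ngram  = c("democrats", "gop", "republican"),
  stringsAsFactors = FALSE
)

# Look up the standardized label for each matched surface form.
hits$TermName <- dict$TermName[match(hits$ngram, dict$variant)]

# Count standardized labels per document, dropping empty combinations.
counts <- as.data.frame(table(doc_id = hits$doc_id, TermName = hits$TermName),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
```

Counting on TermName rather than on the raw variant keeps spelling differences from splitting one concept into several rows.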
Web (general). Queries a search engine and returns result URLs. Use
read_urls to get content from these URLs.
fetch_urls(query, n_pages = 1, date_filter = "w")

| Argument | Description |
|---|---|
| query | Search query string. |
| n_pages | Number of search result pages to fetch (default 1). Roughly 30 results per page. |
| date_filter | Recency filter (default "w"). |

A data.table with columns search_engine, url, is_excluded, and optionally path_depth.
## Not run:
urls_dt <- fetch_urls("R programming nlp", n_pages = 1)
urls_dt$url
## End(Not run)
Wikipedia. Extracts external citation URLs from the References section of one
or more Wikipedia article URLs. Use read_urls to scrape content
from those URLs.
fetch_wiki_refs(url, n = NULL)

| Argument | Description |
|---|---|
| url | Character vector of full Wikipedia article URLs (e.g. from fetch_wiki_urls). |
| n | Maximum number of citation URLs to return per source page. Default NULL. |

For one URL, a data.table with columns source_url, ref_id, and ref_url. For multiple URLs, a named list of such data.tables (names are the Wikipedia article titles); elements are NULL for pages with no refs.
## Not run:
wiki_urls <- fetch_wiki_urls("January 6 Capitol attack")
refs_dt <- fetch_wiki_refs(wiki_urls[1]) # single URL: data.table
refs_list <- fetch_wiki_refs(wiki_urls[1:3]) # multiple: named list
articles <- read_urls(refs_dt$ref_url)
## End(Not run)
Wikipedia. Uses the MediaWiki API to get Wikipedia article URLs matching a
search phrase. Does not search your local corpus. Use read_urls
to get article content from these URLs.
fetch_wiki_urls(query, limit = 10)

| Argument | Description |
|---|---|
| query | Search phrase (e.g. "117th Congress"). |
| limit | Number of page URLs to return (default 10). |

Character vector of full Wikipedia article URLs.
## Not run:
wiki_urls <- fetch_wiki_urls("January 6 Capitol attack")
corpus <- read_urls(wiki_urls[1])
## End(Not run)
Convert the token list returned by nlp_tokenize_text into a data
frame (long format), with identifiers and optional spans.
nlp_cast_tokens(tok)

| Argument | Description |
|---|---|
| tok | List with at least a tokens element (as returned by nlp_tokenize_text). |

Data frame with columns for unit id, token, and optionally start/end spans.
tok <- list(
  tokens = list(
    "1.1" = c("Hello", "world", "."),
    "1.2" = c("This", "is", "an", "example", "."),
    "2.1" = c("This", "is", "a", "party", "!")
  )
)
dtm <- nlp_cast_tokens(tok)
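
To make the long format concrete, here is a minimal base-R sketch of the same reshaping: one row per token, with the list name repeated as the unit id. It is an illustration only, not the package's implementation; the column name `uid` follows the default documented elsewhere in this manual, and spans are left out.

```r
tok <- list(
  tokens = list(
    "1.1" = c("Hello", "world", "."),
    "1.2" = c("This", "is", "an", "example", ".")
  )
)

# Repeat each unit id once per token, then flatten the token vectors.
long <- data.frame(
  uid   = rep(names(tok$tokens), lengths(tok$tokens)),
  token = unlist(tok$tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)
```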
Build a weighted BM25 index for ranked keyword search. Creates a searchable
index from a named list of token vectors. The unit-id column name is taken
from attr(tokens, "id_col") when present (e.g. from nlp_tokenize_text), else "uid".
nlp_index_tokens(tokens, k1 = 1.2, b = 0.75, stem = FALSE)

| Argument | Description |
|---|---|
| tokens | Named list of character vectors (e.g. from nlp_tokenize_text). |
| k1 | BM25 saturation parameter (default 1.2). |
| b | BM25 length normalization (default 0.75). |
| stem | Logical. If TRUE, stem tokens before indexing (default FALSE). |

Data.table with unit-id column, token, score; attr(., "id_col") set for search_index.
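
The role of k1 and b is easier to see with the weights written out. The sketch below computes standard Okapi BM25 token weights in base R with a common non-negative IDF variant; the manual does not state which IDF form the package uses, so treat the formula and the toy `tokens` list as assumptions.

```r
tokens <- list(
  "1.1" = c("hello", "world"),
  "1.2" = c("hello", "hello", "r"),
  "2.1" = c("text", "search", "in", "r")
)
k1 <- 1.2
b  <- 0.75

# One row per (unit, token) pair with its term frequency.
tf <- do.call(rbind, lapply(names(tokens), function(id) {
  counts <- table(tokens[[id]])
  data.frame(uid = id, token = names(counts), tf = as.integer(counts),
             stringsAsFactors = FALSE)
}))

n_units <- length(tokens)
doc_len <- lengths(tokens)      # tokens per unit, named by unit id
avg_len <- mean(doc_len)
df      <- table(tf$token)      # number of units containing each token

# Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
idf <- log(1 + (n_units - df + 0.5) / (df + 0.5))

# k1 caps the gain from repeated terms; b scales the length penalty.
len_norm <- 1 - b + b * doc_len[tf$uid] / avg_len
tf$score <- as.numeric(idf[tf$token]) * tf$tf * (k1 + 1) /
  (tf$tf + k1 * len_norm)
```

Within one unit, a rarer token ("world") receives a higher weight than a more common one ("hello") at the same term frequency.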
Roll units (e.g. sentences) into fixed-size chunks with optional context
(RAG-style). Groups consecutive rows at the finest by level into chunks
and optionally adds surrounding context.
nlp_roll_chunks(corpus, by, chunk_size, context_size, id_col = "uid")

| Argument | Description |
|---|---|
| corpus | Data frame or data.table with a text column and the identifier columns named in by. |
| by | Character vector of identifier columns that define the text unit. |
| chunk_size | Integer. Number of units per chunk. |
| context_size | Integer. Number of units of context around each chunk. |
| id_col | Character. Name of the column holding the unique chunk id (default "uid"). |

Data.table with id_col (pasted grouping + chunk index), grouping columns from by, and text (chunk plus context). Unique on by[1] and text.
corpus <- data.frame(doc_id = c('1', '1', '2'),
                     sentence_id = c('1', '2', '1'),
                     text = c("Hello world.", "This is an example.", "This is a party!"))
chunks <- nlp_roll_chunks(corpus, by = c('doc_id', 'sentence_id'),
                          chunk_size = 2, context_size = 1)
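
How chunk_size and context_size interact can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: chunks are taken as non-overlapping windows of chunk_size units, context extends each window by context_size units on both sides, and the id format (`doc_id.chunk_index`) and the helper `roll_one_doc` are hypothetical.

```r
corpus <- data.frame(
  doc_id      = c("1", "1", "1", "1", "1"),
  sentence_id = 1:5,
  text        = c("S1.", "S2.", "S3.", "S4.", "S5."),
  stringsAsFactors = FALSE
)
chunk_size   <- 2L
context_size <- 1L

# Hypothetical helper: roll the sentences of one document into chunks.
roll_one_doc <- function(d) {
  n <- nrow(d)
  starts <- seq(1L, n, by = chunk_size)
  do.call(rbind, lapply(seq_along(starts), function(i) {
    lo <- max(1L, starts[i] - context_size)                   # extend left
    hi <- min(n, starts[i] + chunk_size - 1L + context_size)  # extend right
    data.frame(uid    = paste(d$doc_id[1], i, sep = "."),
               doc_id = d$doc_id[1],
               text   = paste(d$text[lo:hi], collapse = " "),
               stringsAsFactors = FALSE)
  }))
}

chunks <- do.call(rbind, lapply(split(corpus, corpus$doc_id), roll_one_doc))
```

With five sentences, chunk_size = 2 and context_size = 1, the windows are sentences 1-3, 2-5 and 4-5, so neighbouring chunks share their context sentences.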
Break documents into structural blocks (paragraphs). Splits text from the
text column by a paragraph delimiter.
nlp_split_paragraphs(corpus, by = c("doc_id"), paragraph_delim = "\\n+")

| Argument | Description |
|---|---|
| corpus | Data frame or data.table with a text column and the identifier columns named in by. |
| by | Character vector of identifier columns that define the text unit (default c("doc_id")). |
| paragraph_delim | Regular expression used to split text into paragraphs (default "\\n+"). |

Data.table with the by columns, paragraph_id, and text. One row per paragraph.
corpus <- data.frame(doc_id = c('1', '2'),
                     text = c("Hello world.\n\nMind your business!",
                              "This is an example.\n\nThis is a party!"))
paragraphs <- nlp_split_paragraphs(corpus)
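
The splitting step itself can be approximated in base R with strsplit and the documented default delimiter. This is only a sketch of the idea (one row per paragraph, numbered within each document), not the package's implementation.

```r
corpus <- data.frame(
  doc_id = c("1", "2"),
  text   = c("Hello world.\n\nMind your business!", "One paragraph only."),
  stringsAsFactors = FALSE
)

# Split each document on runs of newlines (the documented default pattern).
parts <- strsplit(corpus$text, "\\n+")

paragraphs <- data.frame(
  doc_id       = rep(corpus$doc_id, lengths(parts)),
  paragraph_id = sequence(lengths(parts)),   # restart numbering per document
  text         = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)
```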
Refine blocks into individual sentences. Splits text into sentences with accurate start/end offsets; handles abbreviations (Wikipedia and web optimized).
nlp_split_sentences(
  corpus,
  by = c("doc_id"),
  abbreviations = textpress::abbreviations
)

| Argument | Description |
|---|---|
| corpus | Data frame or data.table with a text column and the identifier columns named in by. |
| by | Character vector of identifier columns that define the text unit (default c("doc_id")). |
| abbreviations | Character vector of abbreviations to protect (default textpress::abbreviations). |

Data.table with by columns, sentence_id, text, start, end.
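
Abbreviation protection can be illustrated with a small base-R sketch: find candidate boundaries (terminal punctuation followed by whitespace), discard those whose final word is a known abbreviation, then report 1-based start/end offsets. The function `split_sentences`, its three-item abbreviation list, and the boundary regex are hypothetical simplifications; the package's own rules are not described in this manual.

```r
# Hypothetical helper; offsets refer to the whitespace-trimmed input.
split_sentences <- function(text, abbrevs = c("Dr.", "Mr.", "e.g.")) {
  text <- trimws(text)
  # Candidate boundaries: terminal punctuation followed by whitespace.
  m <- gregexpr("[.!?]\\s+", text, perl = TRUE)[[1]]
  ends <- if (m[1] == -1) integer(0) else as.integer(m)
  # Word that ends at each candidate boundary, punctuation included.
  last_word <- vapply(ends, function(e) {
    sub(".*\\s", "", substr(text, 1, e))
  }, character(1))
  ends <- ends[!last_word %in% abbrevs]       # protect abbreviations
  ends <- c(ends, nchar(text))
  starts <- c(1L, head(ends, -1) + 1L)
  raw  <- substring(text, starts, ends)
  lead <- nchar(raw) - nchar(sub("^\\s+", "", raw))  # leading whitespace
  data.frame(sentence_id = seq_along(ends),
             text  = trimws(raw),
             start = starts + lead,
             end   = ends,
             stringsAsFactors = FALSE)
}

out <- split_sentences("Dr. Smith arrived. He sat down! Was it late?")
```

The period after "Dr" is not treated as a boundary, so the first sentence is "Dr. Smith arrived." and substring(text, start, end) recovers each sentence exactly.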
Normalize text into a clean token stream. Tokenizes corpus text, preserving
structure (capitalization, punctuation). The last column in by determines
the tokenization unit.
nlp_tokenize_text(
  corpus,
  by = c("doc_id", "paragraph_id", "sentence_id"),
  id_col = "uid",
  include_spans = TRUE,
  method = "word"
)

| Argument | Description |
|---|---|
| corpus | Data frame or data.table with a text column and the identifier columns named in by. |
| by | Character vector of identifier columns that define the text unit (default c("doc_id", "paragraph_id", "sentence_id")). |
| id_col | Character. Name of the column (and list names) used for the unit id (default "uid"). |
| include_spans | Logical. Include start/end character spans for each token (default TRUE). |
| method | Character. Tokenization method (default "word"). |

Named list of tokens; or list of tokens and spans if include_spans = TRUE.
corpus <- data.frame(doc_id = c('1', '1', '2'),
                     sentence_id = c('1', '2', '1'),
                     text = c("Hello world.", "This is an example.", "This is a party!"))
tokens <- nlp_tokenize_text(corpus, by = c('doc_id', 'sentence_id'))
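
What "tokens with spans" means can be shown with base R alone: each token is paired with the 1-based character offsets where it starts and ends in the source text. The regex below (word characters or a single punctuation mark) is an assumed simplification, not the package's tokenizer.

```r
text <- "Hello world. This is R!"

# Words (letters, digits, apostrophes) or single punctuation marks.
m <- gregexpr("[[:alnum:]']+|[[:punct:]]", text)

tokens <- regmatches(text, m)[[1]]
start  <- as.integer(m[[1]])
end    <- start + attr(m[[1]], "match.length") - 1L

spans <- data.frame(token = tokens, start = start, end = end,
                    stringsAsFactors = FALSE)
```

Because capitalization and punctuation are kept, substring(text, start, end) returns every token unchanged.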
Reads a character vector of URLs into a structured data frame with one row per
node (headings, paragraphs, lists). Like read_csv or read_html, it brings an
external resource into R. It follows fetch_urls or fetch_wiki_urls in the
pipeline: fetch gets locations, read gets text. Wikipedia pages use
high-fidelity selectors; use parent_heading to see which section each node
belongs to. External links and empty text rows are omitted, and the
References, See also, Bibliography, and Sources sections can optionally be
excluded for Wikipedia URLs.
read_urls(
  x,
  cores = 1,
  detect_boilerplate = TRUE,
  remove_boilerplate = TRUE,
  exclude_wiki_refs = TRUE
)

| Argument | Description |
|---|---|
| x | Character vector of URLs. |
| cores | Number of cores for parallel requests (default 1). |
| detect_boilerplate | Logical. Detect boilerplate (e.g. sign-up, related links). Default TRUE. |
| remove_boilerplate | Logical. If TRUE, remove rows detected as boilerplate (default TRUE). |
| exclude_wiki_refs | Logical. For Wikipedia URLs only, drop nodes whose parent_heading is References, See also, Bibliography, or Sources (default TRUE). |

A list with text (node-level data: doc_id, url, node_id, parent_heading, text, and optionally type, is_boilerplate) and meta (one row per URL: doc_id, url, h1_title, date, source). doc_id is an integer key (1 to number of distinct URLs) in first-appearance order of the input vector.
## Not run:
urls <- fetch_urls("R programming", n_pages = 1)$url
out <- read_urls(urls[1:3], cores = 1)
nodes <- out$text
meta <- out$meta
## End(Not run)
Exact phrase or multi-word expression (MWE) matcher; no partial-match risk.
Tokenizes corpus, builds n-grams, and exact-joins against terms. Word
boundaries respected. N-gram range is set from the min and max word count of
terms. Good for deterministic entity extraction (e.g. before an LLM call).
search_dict(corpus, by = c("doc_id"), terms)

| Argument | Description |
|---|---|
| corpus | Data frame or data.table with a text column and the identifier columns named in by. |
| by | Character vector of identifier columns that define the text unit (default c("doc_id")). |
| terms | Character vector of terms or phrases to match exactly. N-gram range derived from word counts of terms. |

Data.table with id, start, end, n, ngram, term.
corpus <- data.frame(doc_id = "1", text = "Gen Z and Millennials use social media.")
search_dict(corpus, by = "doc_id", terms = c("Gen Z", "Millennials", "social media"))
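
The tokenize, build n-grams, exact-join sequence described above can be sketched in base R. This is an illustration rather than the package's code: the word regex is an assumption, matching here is case-sensitive, and character offsets are replaced by a simple word position.

```r
text  <- "Gen Z and Millennials use social media."
terms <- c("Gen Z", "Millennials", "social media")

words   <- regmatches(text, gregexpr("[[:alnum:]']+", text))[[1]]
n_range <- range(lengths(strsplit(terms, "\\s+")))   # 1 to 2 words here

# Build every n-gram for each length in the range.
ngrams <- do.call(rbind, lapply(seq(n_range[1], n_range[2]), function(n) {
  if (length(words) < n) return(NULL)
  idx <- seq_len(length(words) - n + 1L)
  data.frame(
    n        = n,
    position = idx,
    ngram    = vapply(idx, function(i) {
      paste(words[i:(i + n - 1L)], collapse = " ")
    }, character(1)),
    stringsAsFactors = FALSE
  )
}))

# Exact join: an n-gram counts only if it equals a term in full.
hits <- ngrams[ngrams$ngram %in% terms, ]
```

Because whole n-grams are compared, "Gen" on its own never matches "Gen Z", which is the no-partial-match property noted above.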
BM25 ranked retrieval. Search the index produced by nlp_index_tokens
with a keyword query. The unit-id column in results is taken from attr(index, "id_col") when present, else "uid".
search_index(index, query, n = 10, stem = FALSE)

| Argument | Description |
|---|---|
| index | Object created by nlp_index_tokens. |
| query | Character string (keywords). |
| n | Number of results to return (default 10). |
| stem | Logical; must match the setting used during indexing (default FALSE). |

Data.table with columns query, method (“bm25”), score (3 significant figures), and the unit-id column (e.g. uid), ranked by score.
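
Ranked retrieval over an index of (unit, token, score) rows reduces to summing the weights of the query tokens within each unit and sorting. The base-R sketch below shows that step on a hypothetical index with made-up scores; splitting the query on whitespace and lowercasing it are assumptions, not documented behaviour.

```r
# Hypothetical index in the documented shape: unit id, token, score.
index <- data.frame(
  uid   = c("1.1", "1.1", "1.2", "2.1", "2.1"),
  token = c("text", "search", "text", "search", "ranking"),
  score = c(0.8, 1.1, 0.5, 0.9, 1.4),
  stringsAsFactors = FALSE
)
query <- "text search"
n     <- 2

q_tokens <- unique(strsplit(tolower(query), "\\s+")[[1]])
matched  <- index[index$token %in% q_tokens, ]

# Sum the per-token weights within each unit, then rank.
ranked <- aggregate(score ~ uid, data = matched, FUN = sum)
ranked <- ranked[order(-ranked$score), ]
ranked$score <- signif(ranked$score, 3)   # 3 significant figures
results <- head(ranked, n)
```

Unit "1.1" ranks first because it contains both query tokens; "ranking" contributes nothing since it is not in the query.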
Search corpus by regex. Specific strings/patterns; good for KWIC-style results. Returns matches with optional highlighting.
search_regex(corpus, query, by = c("doc_id"), highlight = c("<b>", "</b>"))

| Argument | Description |
|---|---|
| corpus | Data frame or data.table with a text column and the identifier columns named in by. |
| query | Search pattern (regex). |
| by | Character vector of identifier columns that define the text unit (default c("doc_id")). |
| highlight | Length-two character vector for wrapping matches (default c("<b>", "</b>")). |

Data.table with id, by columns, text, start, end, pattern.
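
The match-and-highlight idea can be reproduced in base R with gregexpr and gsub. This sketch returns the full highlighted text for each hit; whether the package returns the whole unit or a shorter context window is not stated here, so that choice is an assumption.

```r
corpus <- data.frame(
  doc_id = c("1", "2"),
  text   = c("Gen Z uses social media.", "No match here."),
  stringsAsFactors = FALSE
)
query     <- "social \\w+"
highlight <- c("<b>", "</b>")

m <- gregexpr(query, corpus$text, perl = TRUE)

hits <- do.call(rbind, lapply(seq_along(m), function(i) {
  if (m[[i]][1] == -1) return(NULL)           # no match in this row
  start <- as.integer(m[[i]])
  end   <- start + attr(m[[i]], "match.length") - 1L
  data.frame(
    doc_id  = corpus$doc_id[i],
    start   = start,
    end     = end,
    pattern = substring(corpus$text[i], start, end),
    # Wrap every match in the highlight pair.
    text    = gsub(paste0("(", query, ")"),
                   paste0(highlight[1], "\\1", highlight[2]),
                   corpus$text[i], perl = TRUE),
    stringsAsFactors = FALSE
  )
}))
```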
Semantic search by cosine similarity. Returns top-n matches from an
embedding matrix for one or more query vectors. Subject-first: embeddings
(haystack) then query (needle). Pipe-friendly.
search_vector(embeddings, query, n = 10)

| Argument | Description |
|---|---|
| embeddings | Numeric matrix of embeddings; rows are searchable units (row names used as identifiers). |
| query | Row name in embeddings. |
| n | Number of results to return per query (default 10). |

Data frame with columns query, method (“cosine”), score (3 significant figures), and the unit-id column (e.g. uid). For multiple queries, a list of such data frames.
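
Cosine-similarity search over a row-named matrix takes only a few lines of base R. The sketch below uses a toy three-dimensional embedding matrix and a row name as the query; dropping the query's own row from the results is an assumption made for the illustration, not documented behaviour.

```r
embeddings <- rbind(
  "1.1" = c(1, 0, 0),
  "1.2" = c(0.9, 0.1, 0),
  "2.1" = c(0, 1, 0)
)
query <- "1.1"   # a row name used as the query vector
n     <- 2

unit <- embeddings / sqrt(rowSums(embeddings^2))   # L2-normalize each row
sims <- as.numeric(unit %*% unit[query, ])         # cosine similarity

results <- data.frame(
  query  = query,
  method = "cosine",
  score  = signif(sims, 3),
  uid    = rownames(embeddings),
  stringsAsFactors = FALSE
)
results <- results[results$uid != query, ]             # drop the self-match
results <- head(results[order(-results$score), ], n)
```

Normalizing the rows first means the dot product equals the cosine, so one matrix product scores every unit at once.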
Builds a numeric matrix of embeddings for each text unit. Row names come from
by (data frame) or from names(corpus) / corpus (character vector).
Use the result with search_vector for semantic search.
util_fetch_embeddings(
  corpus,
  by = NULL,
  api_token,
  api_url = "https://router.huggingface.co/hf-inference/models/BAAI/bge-small-en-v1.5"
)

| Argument | Description |
|---|---|
| corpus | A data frame with a text column, or a character vector. |
| by | Character vector of identifier columns; required when corpus is a data frame. |
| api_token | Hugging Face API token. |
| api_url | Inference endpoint URL (default BAAI/bge-small-en-v1.5). |

Numeric matrix with row names (unit ids).