Package: textpress 1.1.1

Jason Timm

textpress: A Lightweight and Versatile NLP Toolkit

An R toolkit for building text corpora and searching them. No custom object classes, just plain data frames from start to finish. Covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all compatible with the native R pipe ('|>').

Authors: Jason Timm [aut, cre]

textpress_1.1.1.tar.gz
textpress_1.1.1.zip (r-4.7), textpress_1.1.1.zip (r-4.6), textpress_1.1.1.zip (r-4.5)
textpress_1.1.1.tgz (r-4.6-any), textpress_1.1.1.tgz (r-4.5-any)
textpress_1.1.1.tar.gz (r-4.7-any), textpress_1.1.1.tar.gz (r-4.6-any)
textpress_1.1.1.tgz (r-4.6-emscripten)

# Install 'textpress' in R:

install.packages('textpress', repos = c('https://jaytimm.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/jaytimm/textpress/issues

Pkgdown/docs site: https://jaytimm.github.io

On CRAN:

Topics: corpus-search, nlp, web-scraping

4.26 score, 3 stars, 1 packages, 6 scripts, 548 downloads, 18 exports, 31 dependencies

Last updated from: 33c73e76ae. Checks: 7 WARNING, 2 OK. Indexed: yes.

Target	Result	Time
linux-devel-x86_64	WARNING	132
source / vignettes	OK	213
linux-release-x86_64	WARNING	131
macos-release-arm64	WARNING	113
macos-oldrel-arm64	WARNING	110
windows-devel	WARNING	96
windows-release	WARNING	75
windows-oldrel	WARNING	87
wasm-release	OK	127

Exports: abbreviations dict_generations dict_political fetch_urls fetch_wiki_refs fetch_wiki_urls nlp_cast_tokens nlp_index_tokens nlp_roll_chunks nlp_split_paragraphs nlp_split_sentences nlp_tokenize_text read_urls search_dict search_index search_regex search_vector util_fetch_embeddings

Dependencies: askpass cli cpp11 curl data.table generics glue httr jsonlite lattice lifecycle lubridate magrittr Matrix mime openssl pbapply pillar pkgconfig R6 rlang rvest selectr stringi stringr sys tibble timechange utf8 vctrs xml2

Readme and manuals

Help Manual

Help page	Topics
Common abbreviations for NLP	abbreviations
Demo dictionary of generation-name variants for NER	dict_generations
Demo dictionary of political / partisan term variants for NER	dict_political
Fetch URLs from a search engine	fetch_urls
Fetch external citation URLs from Wikipedia article(s)	fetch_wiki_refs
Fetch Wikipedia page URLs by search query	fetch_wiki_urls
Convert token list to data frame	nlp_cast_tokens
Build a BM25 index for ranked keyword search	nlp_index_tokens
Roll units into fixed-size chunks with optional context	nlp_roll_chunks
Split text into paragraphs	nlp_split_paragraphs
Split text into sentences	nlp_split_sentences
Tokenize text into a clean token stream	nlp_tokenize_text
Read content from URLs	read_urls
Exact phrase / MWE matcher	search_dict
Search the BM25 index	search_index
Search corpus by regex	search_regex
Semantic search by cosine similarity	search_vector
Fetch embeddings from a Hugging Face inference endpoint	util_fetch_embeddings
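
The search_vector entry above describes semantic search by cosine similarity. As a rough illustration of that idea only, the base R sketch below ranks a few toy embedding vectors against a query vector. It does not call textpress and makes no claim about how the package implements the search; the helper name rank_by_cosine, the matrix layout (one row per text chunk), and the numbers are all hypothetical.

```r
# Hypothetical helper, not part of textpress: rank the rows of an
# embedding matrix by cosine similarity to a query vector.
rank_by_cosine <- function(embeddings, query, n = 3) {
  stopifnot(is.matrix(embeddings), length(query) == ncol(embeddings))

  # Cosine similarity = dot product divided by the product of the norms.
  dots  <- as.vector(embeddings %*% query)
  norms <- sqrt(rowSums(embeddings^2)) * sqrt(sum(query^2))

  out <- data.frame(
    id = rownames(embeddings),
    cos_sim = dots / norms,
    stringsAsFactors = FALSE
  )

  # Most similar rows first; keep the top n.
  out <- out[order(out$cos_sim, decreasing = TRUE), , drop = FALSE]
  rownames(out) <- NULL
  head(out, n)
}

# Toy 3-dimensional "embeddings" for three text chunks (made-up numbers).
emb <- rbind(
  chunk_a = c(1, 0, 0),
  chunk_b = c(0.8, 0.6, 0),
  chunk_c = c(0, 0, 1)
)

hits <- rank_by_cosine(emb, query = c(1, 0, 0), n = 2)
print(hits)
```

The result is a plain data frame, in keeping with the package's stated preference for data frames over custom classes. An all-zero row would produce NaN here, so real embeddings would need a guard for that case.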