Title: A Lightweight and Versatile NLP Toolkit
Description: A simple Natural Language Processing (NLP) toolkit focused on search-centric workflows with minimal dependencies. The package offers key features for web scraping, text processing, corpus search, and text embedding generation via the 'HuggingFace API' <https://huggingface.co/docs/api-inference/index>.
Authors: Jason Timm [aut, cre]
Maintainer: Jason Timm <[email protected]>
License: MIT + file LICENSE
Version: 1.0.0
Built: 2024-11-14 04:44:08 UTC
Source: https://github.com/jaytimm/textpress
This function decodes redirected DuckDuckGo search result URLs.
.decode_duckduckgo_urls(redirected_urls)
redirected_urls: A vector of DuckDuckGo search result URLs.
A vector of decoded URLs.
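An illustrative call (this is an internal helper, so it is accessed with ':::'; the redirect URL below is a hypothetical example of DuckDuckGo's 'uddg' redirect format):
## Not run:
redirected <- c("https://duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Farticle")
decoded <- textpress:::.decode_duckduckgo_urls(redirected)
## End(Not run)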
This function extracts all the links (href attributes) from a search engine result page.
.extract_links(search_url)
search_url: The URL of the search engine result page.
A character vector of URLs.
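An illustrative call (internal helper, accessed with ':::'; the result-page URL is a hypothetical Bing query):
## Not run:
links <- textpress:::.extract_links("https://www.bing.com/search?q=quantum+computing")
## End(Not run)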
This function attempts to retrieve the HTML content of a URL, extract specific HTML elements (e.g., paragraphs, headings), and extract publication date information using the extract_date function.
.get_site(x)
x: A URL to extract content and publication date from.
A data frame with columns for the URL, HTML element types, text content, extracted date, and date source.
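An illustrative call (internal helper, accessed with ':::'; the article URL is borrowed from the web_scrape_urls example below):
## Not run:
site_df <- textpress:::.get_site("https://www.nytimes.com/2024/03/25/nyregion/trump-bond-reduced.html")
## End(Not run)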
This function retrieves and processes search results from Bing.
.process_bing(search_term, num_pages, time_filter, insite, intitle, combined_pattern)
search_term: The search query.
num_pages: Number of result pages to retrieve.
time_filter: Optional time filter ("week", "month", "year").
insite: Restrict search to a specific domain.
intitle: Search within the title.
combined_pattern: A pattern for filtering out irrelevant URLs.
A 'data.table' of search results from Bing.
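An illustrative call (internal helper, accessed with ':::'; the query and the NULL/FALSE argument values are assumptions, not documented defaults):
## Not run:
bing_results <- textpress:::.process_bing(search_term = "quantum computing",
                                          num_pages = 1,
                                          time_filter = NULL,
                                          insite = NULL,
                                          intitle = FALSE,
                                          combined_pattern = NULL)
## End(Not run)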
This function handles the extraction of search results from DuckDuckGo.
.process_duckduckgo(search_term, num_pages, time_filter, insite, intitle, combined_pattern)
search_term: The search query.
num_pages: Number of result pages to retrieve.
time_filter: Optional time filter ("week", "month", "year").
insite: Restrict search to a specific domain.
intitle: Search within the title.
combined_pattern: A pattern for filtering out irrelevant URLs.
A 'data.table' of search results from DuckDuckGo.
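A parallel illustrative call (same caveats as the .process_bing example above: internal helper, assumed argument values):
## Not run:
ddg_results <- textpress:::.process_duckduckgo(search_term = "quantum computing",
                                               num_pages = 1,
                                               time_filter = NULL,
                                               insite = NULL,
                                               intitle = FALSE,
                                               combined_pattern = NULL)
## End(Not run)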
This function retrieves and processes search results from Yahoo News, automatically sorting by the most recent articles.
.process_yahoo(search_term, num_pages, combined_pattern = combined_pattern)
search_term: The search query.
num_pages: Number of result pages to retrieve.
combined_pattern: A pattern for filtering out irrelevant URLs.
A 'data.table' of search results from Yahoo News.
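An illustrative call (internal helper, accessed with ':::'; the query and the NULL filter pattern are assumptions):
## Not run:
yahoo_results <- textpress:::.process_yahoo(search_term = "quantum computing",
                                            num_pages = 1,
                                            combined_pattern = NULL)
## End(Not run)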
A character vector of common abbreviations used in English. These abbreviations are used to assist in sentence splitting, ensuring that sentence boundaries are not incorrectly identified at these abbreviations.
abbreviations
A character vector with some common English abbreviations.
Developed internally for sentence splitting functionality.
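The vector can be inspected directly:
head(textpress::abbreviations)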
Retrieves embeddings for text data using Hugging Face's API. It can process a batch of texts or a single query. Mostly for demo purposes.
api_huggingface_embeddings(tif, text_hierarchy, api_token, api_url = NULL, query = NULL, dims = 384, batch_size = 250, sleep_duration = 1, verbose = TRUE)
tif: A data frame containing text data.
text_hierarchy: A character vector indicating the columns used to create row names.
api_token: Token for accessing the Hugging Face API.
api_url: The URL of the Hugging Face API endpoint (default is all-MiniLM-L6-v2).
query: An optional single text query for which embeddings are required.
dims: The dimension of the output embeddings.
batch_size: Number of rows in each batch sent to the API.
sleep_duration: Duration in seconds to pause between processing batches.
verbose: A boolean specifying whether to include a progress bar.
A matrix containing embeddings, with each row corresponding to a text input.
## Not run:
tif <- data.frame(doc_id = c('1'), text = c("Hello world."))
embeddings <- api_huggingface_embeddings(tif,
                                         text_hierarchy = 'doc_id',
                                         api_token = api_token)
## End(Not run)
This function attempts to extract a publication date from the HTML content of a web page using various methods such as JSON-LD, OpenGraph meta tags, standard meta tags, and common HTML elements.
extract_date(site)
site: An HTML document (as parsed by xml2 or rvest) from which to extract the date.
A data.frame with two columns: 'date' and 'source', indicating the extracted date and the source from which it was extracted (e.g., JSON-LD, OpenGraph, etc.). If no date is found, returns NA for both fields.
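An illustrative call (the article URL is borrowed from the web_scrape_urls example below; reading the page requires a network connection):
## Not run:
site <- xml2::read_html("https://www.nytimes.com/2024/03/25/nyregion/trump-bond-reduced.html")
date_info <- extract_date(site)
## End(Not run)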
This function processes a data frame for NLP analysis by dividing text into chunks and providing context. It generates chunks of text with a specified size and includes context based on the specified context size.
nlp_build_chunks(tif, text_hierarchy, chunk_size, context_size)
tif: A data.table containing the text to be chunked.
text_hierarchy: A character vector specifying the columns used for grouping and chunking.
chunk_size: An integer specifying the size of each chunk.
context_size: An integer specifying the size of the context around each chunk.
A data.table with the chunked text and their respective contexts.
# Creating a data frame
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.",
                           "This is an example.",
                           "This is a party!"))
chunks <- nlp_build_chunks(tif,
                           chunk_size = 2,
                           context_size = 1,
                           text_hierarchy = c('doc_id', 'sentence_id'))
This function converts a list of tokens into a data frame, extracting and separating document and sentence identifiers if needed.
nlp_cast_tokens(tok)
tok: A list where each element contains tokens corresponding to a document or a sentence.
A data frame with columns for token name and token.
tokens <- list(c("Hello", "world", "."),
               c("This", "is", "an", "example", "."),
               c("This", "is", "a", "party", "!"))
names(tokens) <- c('1.1', '1.2', '2.1')
dtm <- nlp_cast_tokens(tokens)
This function converts a data frame of tokens into a list of token vectors, using a specified token column and grouping the data by one or more specified columns.
nlp_melt_tokens(df, melt_col = "token", parent_cols = c("doc_id", "sentence_id"))
df: A data frame containing the token data.
melt_col: The name of the column in 'df' that contains the tokens.
parent_cols: A character vector indicating the column(s) by which to group the data.
A list of vectors, each containing the tokens of a group defined by the 'parent_cols' parameter.
dtm <- data.frame(doc_id = as.character(c(1, 1, 1, 1, 1, 1, 1, 1)),
                  sentence_id = as.character(c(1, 1, 1, 2, 2, 2, 2, 2)),
                  token = c("Hello", "world", ".", "This", "is", "an", "example", "."))
tokens <- nlp_melt_tokens(dtm, melt_col = 'token', parent_cols = c('doc_id', 'sentence_id'))
Splits text from the 'text' column of a data frame into individual paragraphs, based on a specified paragraph delimiter.
nlp_split_paragraphs(tif, paragraph_delim = "\n+")
tif: A data frame with at least two columns: 'doc_id' and 'text'.
paragraph_delim: A regular expression pattern used to split text into paragraphs.
A data.table with columns: 'doc_id', 'paragraph_id', and 'text'. Each row represents a paragraph, along with its associated document and paragraph identifiers.
tif <- data.frame(doc_id = c('1', '2'),
                  text = c("Hello world.\n\nMind your business!",
                           "This is an example.\n\nThis is a party!"))
paragraphs <- nlp_split_paragraphs(tif)
This function splits text from a data frame into individual sentences based on specified columns and handles abbreviations effectively.
nlp_split_sentences(tif, text_hierarchy = c("doc_id"), abbreviations = textpress::abbreviations)
tif: A data frame containing text to be split into sentences.
text_hierarchy: A character vector specifying the columns to group by for sentence splitting, usually 'doc_id'.
abbreviations: A character vector of abbreviations to handle during sentence splitting, defaults to textpress::abbreviations.
A data.table with columns specified in 'text_hierarchy', 'sentence_id', and 'text'.
tif <- data.frame(doc_id = c('1'),
                  text = c("Hello world. This is an example. No, this is a party!"))
sentences <- nlp_split_sentences(tif)
This function tokenizes text data from a data frame using the 'tokenizers' package, preserving the original text structure like capitalization and punctuation.
nlp_tokenize_text(tif, text_hierarchy = c("doc_id", "paragraph_id", "sentence_id"))
tif: A data frame containing the text to be tokenized and a document identifier in 'doc_id'.
text_hierarchy: A character vector specifying the grouping column(s).
A named list of tokens, where each list item corresponds to a document.
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.", "This is an example.", "This is a party!"))
tokens <- nlp_tokenize_text(tif, text_hierarchy = c('doc_id', 'sentence_id'))
This function identifies the nearest neighbors of a given term or vector in a matrix based on cosine similarity.
sem_nearest_neighbors(x, matrix, n = 10)
x: A character or numeric vector representing the term or vector.
matrix: A numeric matrix or a sparse matrix against which the similarity is calculated.
n: Number of nearest neighbors to return.
A data frame with the ranks, terms, and their cosine similarity scores.
## Not run:
api_token <- ''
matrix <- api_huggingface_embeddings(tif,
                                     text_hierarchy = c('doc_id', 'sentence_id'),
                                     api_token = api_token)
query <- api_huggingface_embeddings(query = "Where's the party at?",
                                    api_token = api_token)
neighbors <- sem_nearest_neighbors(x = query, matrix = matrix)
## End(Not run)
Searches a text corpus for specified patterns, with support for parallel processing.
sem_search_corpus(tif, text_hierarchy = c("doc_id", "paragraph_id", "sentence_id"), search, context_size = 0, is_inline = FALSE, highlight = c("<b>", "</b>"), cores = 1)
tif: A data frame or data.table containing the text corpus.
text_hierarchy: A character vector indicating the column(s) by which to group the data.
search: The search pattern or query.
context_size: Numeric, default 0. Specifies the context size, in sentences, around the found patterns.
is_inline: Logical, default FALSE. Indicates if the search should be inline.
highlight: A character vector of length two, default c('<b>', '</b>'). Used to highlight the found patterns in the text.
cores: Numeric, default 1. The number of cores to use for parallel processing.
A data.table with the search results.
tif <- data.frame(doc_id = c('1', '1', '2'),
                  sentence_id = c('1', '2', '1'),
                  text = c("Hello world.", "This is an example.", "This is a party!"))
sem_search_corpus(tif, search = 'This is', text_hierarchy = c('doc_id', 'sentence_id'))
This function attempts to parse a date string using multiple formats and standardizes it to "YYYY-MM-DD". It first tries ISO 8601 formats, and then common formats like ymd, dmy, and mdy.
standardize_date(date_str)
date_str: A character string representing a date.
A character string representing the standardized date in "YYYY-MM-DD" format, or NA if the date cannot be parsed.
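A minimal sketch of expected usage; whether a given string parses depends on the formats the function recognizes:
standardize_date("2024-03-25T10:30:00Z")   # ISO 8601 input
standardize_date("25 March 2024")          # dmy-style input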
This function scrapes the content of a provided list of URLs.
web_scrape_urls(x, cores = 3)
x: A character vector of URLs.
cores: The number of cores to use for parallel processing.
A data frame containing scraped news data.
## Not run:
url <- 'https://www.nytimes.com/2024/03/25/nyregion/trump-bond-reduced.html'
article_tif <- web_scrape_urls(x = url, cores = 1)
## End(Not run)
This function allows you to query different search engines (DuckDuckGo, Bing, Yahoo News), retrieve search results, and filter them based on predefined patterns.
web_search(search_term, search_engine, num_pages = 1, time_filter = NULL, insite = NULL, intitle = FALSE)
search_term: The search query as a string.
search_engine: The search engine to use: "DuckDuckGo", "Bing", or "Yahoo News".
num_pages: The number of result pages to retrieve (default: 1).
time_filter: Optional time filter ("week", "month", "year").
insite: Restrict search to a specific domain (not supported for Yahoo).
intitle: Search within the title (relevant for DuckDuckGo and Bing).
A 'data.table' containing search engine results with columns 'search_engine' and 'raw_url'.
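An illustrative call (the query itself is arbitrary; running it requires a network connection):
## Not run:
results <- web_search(search_term = "quantum computing",
                      search_engine = "DuckDuckGo",
                      num_pages = 1)
## End(Not run)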