Utils#
- class stream.utils.DocumentCoherence(documents, column='tfidf_top_words', stopwords=None)[source]#
Calculates the coherence between documents from their top words, using Normalized Pointwise Mutual Information (NPMI).
- documents#
DataFrame containing documents and their top words.
- Type:
DataFrame
- column#
Column name in DataFrame that contains the top words for each document.
- Type:
str
- stopwords#
Set of stopwords to exclude from analysis.
- Type:
set
- word_index#
Dictionary mapping each unique word to a unique index.
- Type:
dict
- doc_word_matrix#
Sparse matrix representing the occurrence of words in documents.
- Type:
csr_matrix
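As an illustration of the metric named above, here is a minimal sketch of NPMI computed from document-level co-occurrence. It is not the code of `DocumentCoherence`: the helper name `npmi`, the toy documents, and the handling of the two edge cases are assumptions made for this example. With p(w) the fraction of documents containing w, the sketch computes NPMI(a, b) = log(p(a, b) / (p(a) p(b))) / (-log p(a, b)).

```python
import math


def npmi(docs, word_a, word_b):
    """NPMI of two words from document-level co-occurrence (hypothetical helper)."""
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]
    # Probabilities are estimated as fractions of documents.
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0.0:
        return -1.0  # assumed convention: the words never co-occur
    if p_ab == 1.0:
        return 1.0  # assumed convention: avoids dividing by log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


# Toy documents, each given as a list of top words.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "market", "trade"],
    ["stock", "market", "bank"],
]
print(npmi(docs, "cat", "dog"))    # words that always appear together
print(npmi(docs, "cat", "stock"))  # words that never appear together
```

Scores range from -1 (never co-occurring) through 0 (independent) to 1 (always co-occurring), which makes values comparable across word pairs with different frequencies.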
- class stream.utils.TMDataset(*args: Any, **kwargs: Any)[source]#
- static clean_text(text)[source]#
Clean the input text.
- Parameters:
text (str) – Input text to clean.
- Returns:
Cleaned text.
- Return type:
str
- create_load_save_dataset(data, dataset_name, save_dir, doc_column=None, label_column=None, **kwargs)[source]#
Create, load, and save a dataset.
- Parameters:
data (pd.DataFrame or list) – The data to create the dataset from.
dataset_name (str) – Name of the dataset.
save_dir (str) – Directory to save the dataset.
doc_column (str, optional) – Column name for documents if data is a DataFrame.
label_column (str, optional) – Column name for labels if data is a DataFrame.
**kwargs (dict) – Additional columns and their values to include in the dataset.
- Returns:
The preprocessed dataset.
- Return type:
Preprocessing
- fetch_dataset(name, dataset_path=None)[source]#
Fetch a dataset by name.
- Parameters:
name (str) – Name of the dataset to fetch.
dataset_path (str, optional) – Path to the dataset directory.
- get_corpus()[source]#
Get the corpus (tokens) from the dataframe.
- Returns:
Corpus tokens.
- Return type:
list of list of str
- get_data_loader(batch_size=32, shuffle=True, num_workers=0, pin_memory=False)[source]#
Get a data loader for the dataset.
- Parameters:
batch_size (int, optional) – Number of samples per batch, by default 32.
shuffle (bool, optional) – Whether to shuffle the data, by default True.
num_workers (int, optional) – Number of subprocesses to use for data loading, by default 0.
pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory, by default False.
- Returns:
Data loader for the dataset.
- Return type:
DataLoader
- get_data_loaders(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, batch_size=32, shuffle=True, num_workers=0, pin_memory=False, seed=None)[source]#
Get data loaders for train, validation, and test sets.
- Parameters:
train_ratio (float, optional) – Ratio of the training set, by default 0.8.
val_ratio (float, optional) – Ratio of the validation set, by default 0.1.
test_ratio (float, optional) – Ratio of the test set, by default 0.1.
batch_size (int, optional) – Number of samples per batch, by default 32.
shuffle (bool, optional) – Whether to shuffle the data, by default True.
num_workers (int, optional) – Number of subprocesses to use for data loading, by default 0.
pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory, by default False.
seed (int, optional) – Random seed for shuffling, by default None.
- Returns:
Data loaders for train, validation, and test sets.
- Return type:
tuple of DataLoader
- get_embeddings(embedding_model_name, path=None, file_name=None)[source]#
Get embeddings for the dataset.
- Parameters:
embedding_model_name (str) – Name of the embedding model to use.
path (str, optional) – Path to save the embeddings.
file_name (str, optional) – File name for the embeddings.
- Returns:
Embeddings for the dataset.
- Return type:
np.ndarray
- get_info(dataset_path=None)[source]#
Load and return the dataset information.
- Parameters:
dataset_path (str, optional) – Path to the dataset directory.
- Returns:
Dictionary containing the dataset information.
- Return type:
dict
- get_package_dataset_path(name)[source]#
Get the path to the package dataset.
- Parameters:
name (str) – Name of the dataset.
- Returns:
Path to the dataset.
- Return type:
str
- get_package_embeddings_path(name)[source]#
Get the path to the package embeddings.
- Parameters:
name (str) – Name of the dataset.
- Returns:
Path to the embeddings.
- Return type:
str
- get_vocabulary()[source]#
Get the vocabulary from the dataframe.
- Returns:
Vocabulary.
- Return type:
list of str
- has_embeddings(embedding_model_name, path=None, file_name=None)[source]#
Check if embeddings are available for the dataset.
- Parameters:
embedding_model_name (str) – Name of the embedding model used.
path (str, optional) – Path where embeddings are expected to be saved.
file_name (str, optional) – File name for the embeddings.
- Returns:
True if embeddings are available, False otherwise.
- Return type:
bool
- load_custom_dataset_from_folder(dataset_path)[source]#
Load a custom dataset from a folder.
- Parameters:
dataset_path (str) – Path to the dataset folder.
- load_dataset_from_parquet(load_path)[source]#
Load a dataset from a Parquet file.
- Parameters:
load_path (str) – Path to the Parquet file.
- load_model_preprocessing_steps(model_type, filepath=None)[source]#
Load the default preprocessing steps from a JSON file.
- Parameters:
filepath (str, optional) – The path to the JSON file containing the default preprocessing steps.
- Returns:
The default preprocessing steps.
- Return type:
dict
- preprocess(model_type=None, custom_stopwords=None, **preprocessing_steps)[source]#
Preprocess the dataset.
- Parameters:
language (str, optional) – The language to use for preprocessing (default is “english”).
remove_stopwords (bool, optional) – Whether to remove stopwords (default is False).
lowercase (bool, optional) – Whether to convert text to lowercase (default is True).
remove_punctuation (bool, optional) – Whether to remove punctuation (default is True).
remove_numbers (bool, optional) – Whether to remove numbers (default is True).
lemmatize (bool, optional) – Whether to lemmatize words (default is False).
stem (bool, optional) – Whether to stem words (default is False).
expand_contractions (bool, optional) – Whether to expand contractions (default is False).
remove_html_tags (bool, optional) – Whether to remove HTML tags (default is False).
remove_special_chars (bool, optional) – Whether to remove special characters (default is False).
remove_accents (bool, optional) – Whether to remove accents (default is False).
custom_stopwords (list of str, optional) – List of custom stopwords to remove (default is an empty list).
detokenize (bool, optional) – Whether to detokenize the text after processing (default is True).
- Returns:
This method modifies the object’s texts and dataframe attributes in place.
- Return type:
None
Notes
This function applies a series of preprocessing steps to the text data stored in the object’s texts attribute. The preprocessed text is then stored back into the texts attribute and updated in the dataframe["text"] column.
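The effect of a few of the flags listed above can be sketched on a single string. This is a hedged illustration only: `apply_steps` is a hypothetical helper, the order of the steps is an assumption, and the library's own pipeline may behave differently.

```python
import re
import string


def apply_steps(text, lowercase=True, remove_punctuation=True,
                remove_numbers=True, custom_stopwords=None):
    """Apply a few optional cleaning steps to one string (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    stopwords = set(custom_stopwords or [])
    tokens = [tok for tok in text.split() if tok not in stopwords]
    # Joining the tokens again plays the role of a detokenize step.
    return " ".join(tokens)


print(apply_steps("The 3 Cats, sat on 2 mats!", custom_stopwords=["the", "on"]))
```

Because the stopword comparison runs after lowercasing in this sketch, stopwords should be supplied in lowercase when `lowercase=True`.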
- save_embeddings(embeddings, embedding_model_name, path=None, file_name=None)[source]#
Save embeddings for the dataset.
- Parameters:
embeddings (np.ndarray) – Embeddings to save.
embedding_model_name (str) – Name of the embedding model used.
path (str, optional) – Path to save the embeddings.
file_name (str, optional) – File name for the embeddings.
- split_dataset(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=None)[source]#
Split the dataset into train, validation, and test sets.
- Parameters:
train_ratio (float, optional) – Ratio of the training set, by default 0.8.
val_ratio (float, optional) – Ratio of the validation set, by default 0.1.
test_ratio (float, optional) – Ratio of the test set, by default 0.1.
seed (int, optional) – Random seed for shuffling, by default None.
- Returns:
Train, validation, and test datasets.
- Return type:
tuple of Dataset
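A ratio-based split with a seed can be sketched as follows. This is an assumption-laden illustration, not the body of `split_dataset`: `split_indices` is a hypothetical helper, and both the check that the ratios sum to 1 and the choice to give rounding leftovers to the test set are choices made for this example.

```python
import random


def split_indices(n_items, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=None):
    """Shuffle range(n_items) and cut it into train/val/test index lists."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_items))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * train_ratio)
    n_val = int(n_items * val_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder absorbs any rounding
    return train, val, test


train, val, test = split_indices(100, seed=42)
print(len(train), len(val), len(test))
```

Passing the same `seed` reproduces the same partition, which is what makes results comparable across runs.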
- stream.utils.benchmarking(models, num_topics, metrics, model_args=None, metric_args=None, dataset=None, embedding_model_name='all-MiniLM-L6-v2')[source]#
Benchmark multiple models against specified metrics. Initialization parameters for models are handled dynamically to accommodate different model requirements.
- Parameters:
models (list) – Model instances to benchmark.
num_topics (int) – Number of topics for models that require it.
metrics (list) – Metric functions to evaluate the models.
model_args (list of dict, optional) – Initialization parameters for each model.
metric_args (list of dict, optional) – Arguments for each metric function.
dataset – The dataset to train the models on.
embedding_model_name (str, optional) – Default embedding model name, used if applicable.
- Returns:
A dictionary mapping model names to another dictionary of metric names and their corresponding scores.
- Return type:
Dict[str,Dict[str,float]]
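The nested return structure can be illustrated with a stand-in loop. This sketch does not call `benchmarking` and does not train anything: `run_benchmark`, the toy "models" (plain lists of topics), and the two toy metrics are all hypothetical, and the sketch takes dictionaries where the real function takes lists.

```python
def run_benchmark(models, metrics):
    """Score every model with every metric (hypothetical stand-in loop)."""
    results = {}
    for model_name, model_output in models.items():
        results[model_name] = {
            metric_name: float(metric_fn(model_output))
            for metric_name, metric_fn in metrics.items()
        }
    return results


def topic_count(topics):
    """Toy metric: number of topics."""
    return len(topics)


def unique_word_ratio(topics):
    """Toy metric: share of distinct words among all topic words."""
    words = [word for topic in topics for word in topic]
    return len(set(words)) / len(words)


# Toy "models": each is just its list of topics (lists of top words).
models = {
    "model_a": [["cat", "dog"], ["stock", "bank"]],
    "model_b": [["cat", "cat"], ["cat", "bank"]],
}
results = run_benchmark(
    models,
    {"topic_count": topic_count, "unique_word_ratio": unique_word_ratio},
)
print(results["model_b"]["unique_word_ratio"])
```

Indexing first by model name and then by metric name matches the documented `Dict[str,Dict[str,float]]` shape.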
- stream.utils.get_top_tfidf_words_per_document(corpus, n=10)[source]#
Get the top TF-IDF words per document in a corpus.
- Parameters:
corpus (list) – List of documents.
n (int, optional) – Number of top words to retrieve per document (default is 10).
- Returns:
A list of lists containing the top TF-IDF words for each document in the corpus.
- Return type:
list
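The idea of ranking each document's words by TF-IDF can be sketched in plain Python. This is not the library's implementation: `top_tfidf_words` is a hypothetical helper, it expects documents that are already tokenized (an assumption), it uses the unsmoothed weight tf × log(N / df), and it breaks ties alphabetically.

```python
import math
from collections import Counter


def top_tfidf_words(corpus, n=10):
    """Return the n highest-scoring TF-IDF words for each tokenized document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    top_words = []
    for doc in corpus:
        counts = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words.append(ranked[:n])
    return top_words


corpus = [
    ["topic", "model", "topic", "text"],
    ["neural", "model", "text"],
    ["topic", "neural", "network"],
]
print(top_tfidf_words(corpus, n=2))
```

A word that is frequent within one document but rare across the corpus ranks highest, which is why `network` leads the third document here.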