Utils

class stream.utils.DocumentCoherence(documents, column='tfidf_top_words', stopwords=None)[source]#

Calculates the coherence between documents from their top words, using Normalized Pointwise Mutual Information (NPMI).

documents#

DataFrame containing documents and their top words.

Type:: DataFrame

column#

Column name in DataFrame that contains the top words for each document.

Type:: str

stopwords#

Set of stopwords to exclude from analysis.

Type:: set

word_index#

Dictionary mapping each unique word to a unique index.

Type:: dict

doc_word_matrix#

Sparse matrix representing the occurrence of words in documents.

Type:: csr_matrix

calculate_document_coherence()[source]#

Calculate document coherence scores based on Normalized Pointwise Mutual Information (NPMI).

Returns:: A DataFrame containing coherence scores between each pair of documents.
Return type:: pd.DataFrame
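As a rough illustration of the NPMI idea, the sketch below scores a pair of words from document-level co-occurrence using only the standard library. The helper name `npmi` and the toy documents are assumptions made for this example; how the class itself computes and aggregates scores over `doc_word_matrix` may differ.

```python
import math

def npmi(word_a, word_b, documents):
    """NPMI of two words from document-level co-occurrence (hypothetical helper)."""
    n_docs = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n_docs
    p_b = sum(word_b in doc for doc in documents) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n_docs
    if p_ab == 0:
        return -1.0  # words never co-occur: conventional lower bound
    if p_ab == 1:
        return 1.0   # words co-occur in every document; avoids dividing by log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

# Toy corpus: each document is represented by the set of its top words.
docs = [
    {"topic", "model", "word"},
    {"topic", "model", "corpus"},
    {"neural", "network", "layer"},
    {"neural", "network", "model"},
]
score_related = npmi("topic", "model", docs)
score_unrelated = npmi("topic", "neural", docs)
```

NPMI ranges from -1 (the words never appear together) to 1 (they only appear together), so `score_unrelated` sits at the lower bound while `score_related` is positive.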

class stream.utils.TMDataset(*args: Any, **kwargs: Any)[source]#

static clean_text(text)[source]#

Clean the input text.

Parameters:: text (str) – Input text to clean.
Returns:: Cleaned text.
Return type:: str

create_load_save_dataset(data, dataset_name, save_dir, doc_column=None, label_column=None, **kwargs)[source]#

Create, load, and save a dataset.

Parameters:

data (pd.DataFrame or list) – The data to create the dataset from.
dataset_name (str) – Name of the dataset.
save_dir (str) – Directory to save the dataset.
doc_column (str, optional) – Column name for documents if data is a DataFrame.
label_column (str, optional) – Column name for labels if data is a DataFrame.
**kwargs (dict) – Additional columns and their values to include in the dataset.

Returns:

The preprocessed dataset.

Return type:

Preprocessing

fetch_dataset(name, dataset_path=None)[source]#

Fetch a dataset by name.

Parameters:

name (str) – Name of the dataset to fetch.
dataset_path (str, optional) – Path to the dataset directory.

get_corpus()[source]#

Get the corpus (tokens) from the dataframe.

Returns:: Corpus tokens.
Return type:: list of list of str

get_data_loader(batch_size=32, shuffle=True, num_workers=0, pin_memory=False)[source]#

Get a data loader for the dataset.

Parameters:

batch_size (int, optional) – Number of samples per batch, by default 32.
shuffle (bool, optional) – Whether to shuffle the data, by default True.
num_workers (int, optional) – Number of subprocesses to use for data loading, by default 0.
pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory, by default False.

Returns:

Data loader for the dataset.

Return type:

DataLoader

get_data_loaders(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, batch_size=32, shuffle=True, num_workers=0, pin_memory=False, seed=None)[source]#

Get data loaders for train, validation, and test sets.

Parameters:

train_ratio (float, optional) – Ratio of the training set, by default 0.8.
val_ratio (float, optional) – Ratio of the validation set, by default 0.1.
test_ratio (float, optional) – Ratio of the test set, by default 0.1.
batch_size (int, optional) – Number of samples per batch, by default 32.
shuffle (bool, optional) – Whether to shuffle the data, by default True.
num_workers (int, optional) – Number of subprocesses to use for data loading, by default 0.
pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory, by default False.
seed (int, optional) – Random seed for shuffling, by default None.

Returns:

Data loaders for train, validation, and test sets.

Return type:

tuple of DataLoader

get_embeddings(embedding_model_name, path=None, file_name=None)[source]#

Get embeddings for the dataset.

Parameters:

embedding_model_name (str) – Name of the embedding model to use.
path (str, optional) – Path to save the embeddings.
file_name (str, optional) – File name for the embeddings.

Returns:

Embeddings for the dataset.

Return type:

np.ndarray

get_info(dataset_path=None)[source]#

Load and return the dataset information.

Parameters:

dataset_path (str, optional) – Path to the dataset directory.

Returns:

Dictionary containing the dataset information.

Return type:

dict

get_labels()[source]#

Get the labels from the dataframe.

Returns:: Labels.
Return type:: list of str

get_package_dataset_path(name)[source]#

Get the path to the package dataset.

Parameters:: name (str) – Name of the dataset.
Returns:: Path to the dataset.
Return type:: str

get_package_embeddings_path(name)[source]#

Get the path to the package embeddings.

Parameters:: name (str) – Name of the dataset.
Returns:: Path to the embeddings.
Return type:: str

get_vocabulary()[source]#

Get the vocabulary from the dataframe.

Returns:: Vocabulary.
Return type:: list of str

has_embeddings(embedding_model_name, path=None, file_name=None)[source]#

Check if embeddings are available for the dataset.

Parameters:

embedding_model_name (str) – Name of the embedding model used.
path (str, optional) – Path where embeddings are expected to be saved.
file_name (str, optional) – File name for the embeddings.

Returns:

True if embeddings are available, False otherwise.

Return type:

bool

load_custom_dataset_from_folder(dataset_path)[source]#

Load a custom dataset from a folder.

Parameters:: dataset_path (str) – Path to the dataset folder.

load_dataset_from_parquet(load_path)[source]#

Load a dataset from a Parquet file.

Parameters:: load_path (str) – Path to the Parquet file.

load_model_preprocessing_steps(model_type, filepath=None)[source]#

Load the default preprocessing steps from a JSON file.

Parameters:

model_type – Model type whose default preprocessing steps are loaded.
filepath (str, optional) – The path to the JSON file containing the default preprocessing steps.

Returns:: The default preprocessing steps.
Return type:: dict

preprocess(model_type=None, custom_stopwords=None, **preprocessing_steps)[source]#

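Preprocess the dataset. Apart from custom_stopwords, the steps listed below are passed as keyword arguments through **preprocessing_steps.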

Parameters:

language (str, optional) – The language to use for preprocessing (default is “english”).
remove_stopwords (bool, optional) – Whether to remove stopwords (default is False).
lowercase (bool, optional) – Whether to convert text to lowercase (default is True).
remove_punctuation (bool, optional) – Whether to remove punctuation (default is True).
remove_numbers (bool, optional) – Whether to remove numbers (default is True).
lemmatize (bool, optional) – Whether to lemmatize words (default is False).
stem (bool, optional) – Whether to stem words (default is False).
expand_contractions (bool, optional) – Whether to expand contractions (default is False).
remove_html_tags (bool, optional) – Whether to remove HTML tags (default is False).
remove_special_chars (bool, optional) – Whether to remove special characters (default is False).
remove_accents (bool, optional) – Whether to remove accents (default is False).
custom_stopwords (list of str, optional) – List of custom stopwords to remove (default is an empty list).
detokenize (bool, optional) – Whether to detokenize the text after processing (default is True).

Returns:

This method modifies the object’s texts and dataframe attributes in place.

Return type:

None

Notes

This function applies a series of preprocessing steps to the text data stored in the object’s texts attribute. The preprocessed text is then stored back into the texts attribute and updated in the dataframe["text"] column.
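To make a few of these steps concrete, here is a minimal standalone sketch of lowercasing, punctuation removal, and number removal applied to one string. The function `simple_preprocess` is hypothetical and written only for illustration; it is not the method's implementation, and the order of steps is an assumption.

```python
import re
import string

def simple_preprocess(text, lowercase=True, remove_punctuation=True, remove_numbers=True):
    """Apply a few of the listed steps to one string (illustrative only)."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Drop every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    # Collapse whitespace left behind by the removals.
    return " ".join(text.split())

cleaned = simple_preprocess("Topic Models, since 2003, are GREAT!")
```

Each flag can be switched off independently, which mirrors how the boolean options above toggle individual steps.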

save_embeddings(embeddings, embedding_model_name, path=None, file_name=None)[source]#

Save embeddings for the dataset.

Parameters:

embeddings (np.ndarray) – Embeddings to save.
embedding_model_name (str) – Name of the embedding model used.
path (str, optional) – Path to save the embeddings.
file_name (str, optional) – File name for the embeddings.

split_dataset(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=None)[source]#

Split the dataset into train, validation, and test sets.

Parameters:

train_ratio (float, optional) – Ratio of the training set, by default 0.8.
val_ratio (float, optional) – Ratio of the validation set, by default 0.1.
test_ratio (float, optional) – Ratio of the test set, by default 0.1.
seed (int, optional) – Random seed for shuffling, by default None.

Returns:

Train, validation, and test datasets.

Return type:

tuple of Dataset
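The ratio-based split can be pictured with the following sketch, which shuffles sample indices with an optional seed and cuts them into three parts. The helper `split_indices` and the check that the ratios sum to 1 are assumptions for illustration; the method itself returns Dataset objects and may handle rounding differently.

```python
import random

def split_indices(n_samples, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=None):
    """Shuffle indices and cut them into train/val/test parts (hypothetical helper)."""
    # Assumption: the three ratios are expected to sum to 1.
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-8:
        raise ValueError("train_ratio, val_ratio and test_ratio must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_train = int(train_ratio * n_samples)
    n_val = int(val_ratio * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder absorbs any rounding
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100, seed=42)
```

Passing the same seed twice yields identical splits, which is the usual reason for exposing a `seed` parameter.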

update_preprocessing_steps(**preprocessing_steps)[source]#

Update preprocessing steps to True if they were previously False.

Parameters:: preprocessing_steps (dict) – Key-value pairs of preprocessing steps to update.
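One plausible reading of this behaviour is sketched below: a step is switched on when requested and currently off, and a step that is already on is never switched off. The helper `enable_steps` and the plain dictionary of flags are assumptions for illustration, not the method's actual code.

```python
def enable_steps(current, **updates):
    """Return a copy of `current` with requested steps turned on (hypothetical helper)."""
    merged = dict(current)
    for step, value in updates.items():
        # Only flip a step from False (or missing) to True; never disable a step.
        if value and not merged.get(step, False):
            merged[step] = True
    return merged

steps = {"lowercase": True, "lemmatize": False}
updated = enable_steps(steps, lemmatize=True, lowercase=False)
```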

stream.utils.benchmarking(models, num_topics, metrics, model_args=None, metric_args=None, dataset=None, embedding_model_name='all-MiniLM-L6-v2')[source]#

Benchmark multiple models against specified metrics. Initialization parameters for models are handled dynamically to accommodate different model requirements.

Parameters:

models – List of model instances to benchmark.
num_topics – Integer specifying the number of topics for models that require it.
metrics – List of metric functions to evaluate the models.
model_args – Optional list of dictionaries with initialization parameters for each model.
metric_args – List of dictionaries containing arguments for each metric function.
dataset – The dataset to train the models on.
embedding_model_name – Default embedding model name, used if applicable.

Returns:

A dictionary mapping model names to another dictionary of metric names and their corresponding scores.

Return type:: Dict[str, Dict[str, float]]
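The shape of the returned mapping can be sketched as follows. Everything here is a stand-in: `collect_scores`, the two toy metrics, and the precomputed topic lists are hypothetical and only show how a nested dictionary of model names, metric names, and scores might be assembled, not how `benchmarking` trains or evaluates models.

```python
from typing import Callable, Dict, List

Topics = List[List[str]]

def mean_topic_size(topics: Topics) -> float:
    """Toy metric: average number of words per topic."""
    return sum(len(topic) for topic in topics) / len(topics)

def unique_word_ratio(topics: Topics) -> float:
    """Toy metric: share of distinct words across all topics."""
    words = [word for topic in topics for word in topic]
    return len(set(words)) / len(words)

def collect_scores(model_outputs: Dict[str, Topics],
                   metrics: List[Callable[[Topics], float]]) -> Dict[str, Dict[str, float]]:
    """Build {model name: {metric name: score}} from precomputed topics."""
    results: Dict[str, Dict[str, float]] = {}
    for model_name, topics in model_outputs.items():
        results[model_name] = {metric.__name__: metric(topics) for metric in metrics}
    return results

outputs = {
    "model_a": [["cat", "dog"], ["car", "bus"]],
    "model_b": [["cat", "dog"], ["cat", "bus"]],
}
scores = collect_scores(outputs, [mean_topic_size, unique_word_ratio])
```

Looking up `scores["model_b"]["unique_word_ratio"]` then gives the score of one metric for one model, which is how a result of type Dict[str, Dict[str, float]] is typically read.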

stream.utils.get_top_tfidf_words_per_document(corpus, n=10)[source]#

Get the top TF-IDF words per document in a corpus.

Parameters:

corpus (list) – List of documents.
n (int, optional) – Number of top words to retrieve per document (default is 10).

Returns:

A list of lists containing the top TF-IDF words for each document in the corpus.

Return type:

list
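For intuition, the sketch below ranks the words of each document by a basic TF-IDF score using only the standard library. The function `top_tfidf_words`, the whitespace tokenization, and the unsmoothed weighting are assumptions for illustration; the library function may use a different tokenizer and TF-IDF variant.

```python
import math
from collections import Counter

def top_tfidf_words(corpus, n=10):
    """Return the n highest-scoring TF-IDF words per document (illustrative only)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    top_words = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top_words.append(ranked[:n])
    return top_words

corpus = [
    "topic models find topics in text",
    "neural networks learn from text",
    "topic models and neural networks",
]
result = top_tfidf_words(corpus, n=2)
```

Words that occur in fewer documents receive a larger inverse document frequency, so document-specific words such as "find" or "learn" outrank words shared across the corpus.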
