Utils#

class stream.utils.DocumentCoherence(documents, column='tfidf_top_words', stopwords=None)[source]#

A class for calculating the coherence between documents based on their top words, using Normalized Pointwise Mutual Information (NPMI).

documents#

DataFrame containing documents and their top words.

Type:

DataFrame

column#

Column name in DataFrame that contains the top words for each document.

Type:

str

stopwords#

Set of stopwords to exclude from analysis.

Type:

set

word_index#

Dictionary mapping each unique word to a unique index.

Type:

dict

doc_word_matrix#

Sparse matrix representing the occurrence of words in documents.

Type:

csr_matrix

calculate_document_coherence()[source]#

Calculate document coherence scores based on Normalized Pointwise Mutual Information (NPMI).

Returns:

A DataFrame containing coherence scores between each pair of documents.

Return type:

pd.DataFrame
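For readers unfamiliar with NPMI, the sketch below shows how the score can be computed for word pairs from document-level co-occurrence counts. It is an illustration under stated assumptions, not the implementation behind calculate_document_coherence: the function name npmi_scores, the input format (a list of token lists), and the treatment of a pair that occurs in every document are all assumptions. How the class aggregates word-level scores into document-pair scores is not described here and is not modelled.

```python
import math
from itertools import combinations


def npmi_scores(documents):
    """Return NPMI for every co-occurring word pair (hypothetical helper).

    `documents` is a list of token lists; probabilities are estimated as the
    fraction of documents that contain a word or a word pair.
    """
    n_docs = len(documents)
    word_count = {}
    pair_count = {}
    for doc in documents:
        words = set(doc)  # count each word once per document
        for w in words:
            word_count[w] = word_count.get(w, 0) + 1
        for pair in combinations(sorted(words), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    scores = {}
    for (a, b), joint in pair_count.items():
        p_ab = joint / n_docs
        p_a = word_count[a] / n_docs
        p_b = word_count[b] / n_docs
        if p_ab == 1.0:
            # Assumption: a pair present in every document gets the maximum
            # score, since the normaliser -log(p_ab) would be zero.
            scores[(a, b)] = 1.0
            continue
        pmi = math.log(p_ab / (p_a * p_b))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores
```

Scores range from -1 to 1. Pairs that never co-occur are simply absent from the returned dictionary in this sketch; by convention their NPMI is taken as -1.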

class stream.utils.TMDataset(*args: Any, **kwargs: Any)[source]#

static clean_text(text)[source]#

Clean the input text.

Parameters:

text (str) – Input text to clean.

Returns:

Cleaned text.

Return type:

str

create_load_save_dataset(data, dataset_name, save_dir, doc_column=None, label_column=None, **kwargs)[source]#

Create, load, and save a dataset.

Parameters:
  • data (pd.DataFrame or list) – The data to create the dataset from.

  • dataset_name (str) – Name of the dataset.

  • save_dir (str) – Directory to save the dataset.

  • doc_column (str, optional) – Column name for documents if data is a DataFrame.

  • label_column (str, optional) – Column name for labels if data is a DataFrame.

  • **kwargs (dict) – Additional columns and their values to include in the dataset.

Returns:

The preprocessed dataset.

Return type:

Preprocessing

fetch_dataset(name, dataset_path=None)[source]#

Fetch a dataset by name.

Parameters:
  • name (str) – Name of the dataset to fetch.

  • dataset_path (str, optional) – Path to the dataset directory.

get_corpus()[source]#

Get the corpus (tokens) from the dataframe.

Returns:

Corpus tokens.

Return type:

list of list of str

get_data_loader(batch_size=32, shuffle=True, num_workers=0, pin_memory=False)[source]#

Get a data loader for the dataset.

Parameters:
  • batch_size (int, optional) – Number of samples per batch, by default 32.

  • shuffle (bool, optional) – Whether to shuffle the data, by default True.

  • num_workers (int, optional) – Number of subprocesses to use for data loading, by default 0.

  • pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory, by default False.

Returns:

Data loader for the dataset.

Return type:

DataLoader
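The batch_size and shuffle parameters can be pictured with a small pure-Python sketch. This is not what get_data_loader does internally; the helper name iter_batches and its seed argument are hypothetical and exist only to make the batching behaviour concrete.

```python
import random


def iter_batches(items, batch_size=32, shuffle=True, seed=None):
    """Yield lists of at most `batch_size` items (hypothetical helper)."""
    order = list(range(len(items)))
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]


# Without shuffling, ten items in batches of four give sizes 4, 4 and 2.
batches = list(iter_batches(list(range(10)), batch_size=4, shuffle=False))
```

The last batch is smaller than batch_size whenever the number of items is not a multiple of it.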

get_data_loaders(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, batch_size=32, shuffle=True, num_workers=0, pin_memory=False, seed=None)[source]#

Get data loaders for train, validation, and test sets.

Parameters:
  • train_ratio (float, optional) – Ratio of the training set, by default 0.8.

  • val_ratio (float, optional) – Ratio of the validation set, by default 0.1.

  • test_ratio (float, optional) – Ratio of the test set, by default 0.1.

  • batch_size (int, optional) – Number of samples per batch, by default 32.

  • shuffle (bool, optional) – Whether to shuffle the data, by default True.

  • num_workers (int, optional) – Number of subprocesses to use for data loading, by default 0.

  • pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory, by default False.

  • seed (int, optional) – Random seed for shuffling, by default None.

Returns:

Data loaders for train, validation, and test sets.

Return type:

tuple of DataLoader

get_embeddings(embedding_model_name, path=None, file_name=None)[source]#

Get embeddings for the dataset.

Parameters:
  • embedding_model_name (str) – Name of the embedding model to use.

  • path (str, optional) – Path to save the embeddings.

  • file_name (str, optional) – File name for the embeddings.

Returns:

Embeddings for the dataset.

Return type:

np.ndarray

get_info(dataset_path=None)[source]#

Load and return the dataset information.

Parameters:

dataset_path (str, optional) – Path to the dataset directory.

Returns:

Dictionary containing the dataset information.

Return type:

dict

get_labels()[source]#

Get the labels from the dataframe.

Returns:

Labels.

Return type:

list of str

get_package_dataset_path(name)[source]#

Get the path to the package dataset.

Parameters:

name (str) – Name of the dataset.

Returns:

Path to the dataset.

Return type:

str

get_package_embeddings_path(name)[source]#

Get the path to the package embeddings.

Parameters:

name (str) – Name of the dataset.

Returns:

Path to the embeddings.

Return type:

str

get_vocabulary()[source]#

Get the vocabulary from the dataframe.

Returns:

Vocabulary.

Return type:

list of str

has_embeddings(embedding_model_name, path=None, file_name=None)[source]#

Check if embeddings are available for the dataset.

Parameters:
  • embedding_model_name (str) – Name of the embedding model used.

  • path (str, optional) – Path where embeddings are expected to be saved.

  • file_name (str, optional) – File name for the embeddings.

Returns:

True if embeddings are available, False otherwise.

Return type:

bool

load_custom_dataset_from_folder(dataset_path)[source]#

Load a custom dataset from a folder.

Parameters:

dataset_path (str) – Path to the dataset folder.

load_dataset_from_parquet(load_path)[source]#

Load a dataset from a Parquet file.

Parameters:

load_path (str) – Path to the Parquet file.

load_model_preprocessing_steps(model_type, filepath=None)[source]#

Load the default preprocessing steps from a JSON file.

Parameters:
  • model_type – Model type for which to load the default preprocessing steps.

  • filepath (str, optional) – Path to the JSON file containing the default preprocessing steps.

Returns:

The default preprocessing steps.

Return type:

dict

preprocess(model_type=None, custom_stopwords=None, **preprocessing_steps)[source]#

Preprocess the dataset.

Parameters:
  • language (str, optional) – The language to use for preprocessing (default is “english”).

  • remove_stopwords (bool, optional) – Whether to remove stopwords (default is False).

  • lowercase (bool, optional) – Whether to convert text to lowercase (default is True).

  • remove_punctuation (bool, optional) – Whether to remove punctuation (default is True).

  • remove_numbers (bool, optional) – Whether to remove numbers (default is True).

  • lemmatize (bool, optional) – Whether to lemmatize words (default is False).

  • stem (bool, optional) – Whether to stem words (default is False).

  • expand_contractions (bool, optional) – Whether to expand contractions (default is False).

  • remove_html_tags (bool, optional) – Whether to remove HTML tags (default is False).

  • remove_special_chars (bool, optional) – Whether to remove special characters (default is False).

  • remove_accents (bool, optional) – Whether to remove accents (default is False).

  • custom_stopwords (list of str, optional) – List of custom stopwords to remove (default is None).

  • detokenize (bool, optional) – Whether to detokenize the text after processing (default is True).

Returns:

This method modifies the object’s texts and dataframe attributes in place.

Return type:

None

Notes

This function applies a series of preprocessing steps to the text data stored in the object’s texts attribute. The preprocessed text is then stored back into the texts attribute and updated in the dataframe["text"] column.
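To make a few of the flags above concrete, here is a minimal sketch of what lowercase, remove_punctuation, remove_numbers and custom_stopwords could do to a single string. It uses only the standard library and is an assumption about the behaviour, not the code that preprocess runs; the function name preprocess_text is hypothetical.

```python
import re
import string


def preprocess_text(text, lowercase=True, remove_punctuation=True,
                    remove_numbers=True, custom_stopwords=None):
    """Apply a subset of the documented steps to one string (hypothetical)."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    stopwords = set(custom_stopwords or [])
    tokens = [tok for tok in text.split() if tok not in stopwords]
    # Joining the tokens again mirrors the detokenize step.
    return " ".join(tokens)


cleaned = preprocess_text("Hello, World! 42 cats.", custom_stopwords=["cats"])
```

The order of the steps matters: lowercasing before stopword removal means the stopword list only needs lowercase entries.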

save_embeddings(embeddings, embedding_model_name, path=None, file_name=None)[source]#

Save embeddings for the dataset.

Parameters:
  • embeddings (np.ndarray) – Embeddings to save.

  • embedding_model_name (str) – Name of the embedding model used.

  • path (str, optional) – Path to save the embeddings.

  • file_name (str, optional) – File name for the embeddings.

split_dataset(train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=None)[source]#

Split the dataset into train, validation, and test sets.

Parameters:
  • train_ratio (float, optional) – Ratio of the training set, by default 0.8.

  • val_ratio (float, optional) – Ratio of the validation set, by default 0.1.

  • test_ratio (float, optional) – Ratio of the test set, by default 0.1.

  • seed (int, optional) – Random seed for shuffling, by default None.

Returns:

Train, validation, and test datasets.

Return type:

tuple of Dataset
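A ratio-based split with a reproducible seed can be sketched as follows. This is an illustration only: the helper split_indices is hypothetical, it returns index lists rather than Dataset objects, and the check that the ratios sum to 1 is an assumption about sensible input rather than documented behaviour.

```python
import random


def split_indices(n, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1, seed=None):
    """Return (train, val, test) index lists for n samples (hypothetical)."""
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train_ratio, val_ratio and test_ratio must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed, same split
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(100, seed=0)
```

Giving the remainder to the test set ensures every sample lands in exactly one split even when the ratios do not divide n evenly.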

update_preprocessing_steps(**preprocessing_steps)[source]#

Update preprocessing steps to True if they were previously False.

Parameters:

preprocessing_steps (dict) – Key-value pairs of preprocessing steps to update.
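One reading of "update to True if they were previously False" is that steps can only be switched on, never off. The sketch below illustrates that reading with plain dictionaries; the helper name update_steps and the dictionary layout are assumptions, not the method's actual internals.

```python
def update_steps(current, **requested):
    """Return a copy of `current` with newly enabled steps (hypothetical)."""
    updated = dict(current)
    for step, enabled in requested.items():
        # Only flip a step from False (or missing) to True.
        if enabled and not updated.get(step, False):
            updated[step] = True
    return updated


steps = update_steps({"lowercase": True, "stem": False},
                     stem=True, lowercase=False)
```

Under this reading, passing lowercase=False leaves an already enabled step untouched.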

stream.utils.benchmarking(models, num_topics, metrics, model_args=None, metric_args=None, dataset=None, embedding_model_name='all-MiniLM-L6-v2')[source]#

Benchmark multiple models against specified metrics. Initialization parameters for models are handled dynamically to accommodate different model requirements.

Parameters:
  • models (list) – Model instances to benchmark.

  • num_topics (int) – Number of topics for models that require it.

  • metrics (list) – Metric functions to evaluate the models.

  • model_args (list of dict, optional) – Initialization parameters for each model.

  • metric_args (list of dict, optional) – Arguments for each metric function.

  • dataset – The dataset to train the models on.

  • embedding_model_name (str, optional) – Default embedding model name, used if applicable.

Returns:

A dictionary mapping model names to another dictionary of metric names and their corresponding scores.

Return type:

Dict[str, Dict[str, float]]
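The shape of the returned mapping can be illustrated with toy stand-ins. Everything in this sketch is hypothetical: ToyModel, its fit method and name attribute, the two metric functions, and run_benchmark are not part of stream, and the real model and metric interfaces may differ.

```python
class ToyModel:
    """Hypothetical stand-in for a topic model."""

    def __init__(self, name, topics):
        self.name = name
        self._all_topics = topics
        self.topics = []

    def fit(self, dataset, n_topics):
        # A real model would learn topics; this one just truncates a list.
        self.topics = self._all_topics[:n_topics]
        return self


def topic_count(topics):
    return len(topics)


def unique_word_ratio(topics):
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words) if words else 0.0


def run_benchmark(models, num_topics, metrics, dataset=None):
    """Return {model name: {metric name: score}} for the toy interfaces."""
    results = {}
    for model in models:
        model.fit(dataset, num_topics)
        results[model.name] = {
            metric.__name__: float(metric(model.topics)) for metric in metrics
        }
    return results


models = [
    ToyModel("a", [["x", "y"], ["y", "z"]]),
    ToyModel("b", [["p", "q"], ["r", "s"]]),
]
results = run_benchmark(models, 2, [topic_count, unique_word_ratio])
```

Keying the inner dictionary by metric name makes it easy to turn the result into a table with one row per model.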

stream.utils.get_top_tfidf_words_per_document(corpus, n=10)[source]#

Get the top TF-IDF words per document in a corpus.

Parameters:
  • corpus (list) – List of documents.

  • n (int, optional) – Number of top words to retrieve per document (default is 10).

Returns:

A list of lists containing the top TF-IDF words for each document in the corpus.

Return type:

list
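As a rough picture of what "top TF-IDF words per document" means, the sketch below ranks words with a plain TF-IDF formula using only the standard library. It assumes the corpus is a list of strings split on whitespace, and it uses an unsmoothed idf of log(N / df); the function name top_tfidf_words is hypothetical, and the library's own tokenization and weighting may differ.

```python
import math
from collections import Counter


def top_tfidf_words(corpus, n=10):
    """Return the n highest-scoring words for each document (hypothetical)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:n])
    return result


top = top_tfidf_words(
    ["the cat sat", "the dog barked loudly", "the cat purred"], n=1
)
```

Words that appear in every document, such as "the" here, get an idf of zero and therefore rank last.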