Models

class stream.models.BERTopicTM(embedding_model_name='paraphrase-MiniLM-L3-v2', umap_args=None, min_cluster_size=None, hdbscan_args=None, random_state=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#

A topic modeling class that uses HDBSCAN clustering on text data.

This class inherits from the AbstractModel class and utilizes sentence embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering text data into topics.

hyperparameters#

A dictionary of hyperparameters for the model.

Type:

dict

n_topics#

The number of topics to cluster the documents into.

Type:

int

embedding_model#

The sentence embedding model.

Type:

SentenceTransformer

umap_args#

Arguments for UMAP dimensionality reduction.

Type:

dict

hdbscan_args#

Arguments for the HDBSCAN clustering algorithm.

Type:

dict

optim#

Flag to enable optimization of the number of clusters.

Type:

bool

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#

Encode a list of documents into embeddings.

Parameters:
  • documents (List[str]) – List of documents to encode.

  • encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.

  • use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.

Returns:

Array of shape (n_documents, embedding_size) containing document embeddings.

Return type:

np.ndarray
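A minimal sketch of what use_average=True could amount to, assuming a long document is split into segments and the segment embeddings are averaged element-wise. The toy_encode function is a hypothetical stand-in for a sentence encoder; this is an illustration, not the library's implementation.

```python
from typing import Callable, List


def average_embedding(segments: List[str],
                      encode: Callable[[str], List[float]]) -> List[float]:
    """Encode each segment and return the element-wise mean vector."""
    vectors = [encode(segment) for segment in segments]
    if not vectors:
        raise ValueError("at least one segment is required")
    size = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(size)]


def toy_encode(text: str) -> List[float]:
    """Hypothetical 2-d encoder: (character count, word count)."""
    return [float(len(text)), float(len(text.split()))]


# Two segments of one long document collapse into a single document vector.
doc_vector = average_embedding(["topic models", "cluster text"], toy_encode)
```

Averaging keeps one fixed-size vector per document, which matches the documented (n_documents, embedding_size) output shape.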

fit(dataset)[source]#

Trains the BERTopic topic model on the provided dataset.

Applies sentence embedding, UMAP dimensionality reduction, and HDBSCAN clustering to the dataset to identify distinct topics within the text data.

Parameters:

dataset – The dataset to train the model on. It should contain the text documents.

Returns:

A dictionary containing the identified topics and the topic-word matrix.

Return type:

dict

get_beta()#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the model.

Returns:

Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, HDBSCAN arguments, and training status.

Return type:

dict

get_theta()#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.
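As an illustration of how top words can be read off a topic-word matrix, the sketch below ranks the columns of each topic row by weight. The helper name top_words, the plain-list matrix, and the separate vocabulary list are assumptions made for this example.

```python
from typing import List, Sequence


def top_words(beta: Sequence[Sequence[float]],
              vocabulary: Sequence[str],
              n_words: int = 10) -> List[List[str]]:
    """Return the n_words highest-weighted words for each topic row."""
    topics = []
    for row in beta:
        # Column indices sorted by descending weight within this topic.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:n_words]])
    return topics


vocabulary = ["cat", "dog", "stock", "bond"]
beta = [
    [0.5, 0.4, 0.1, 0.0],   # topic 0: animals
    [0.0, 0.1, 0.3, 0.6],   # topic 1: finance
]
topics = top_words(beta, vocabulary, n_words=2)
```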

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(texts)[source]#

Predict topics for new documents.

Parameters:

texts (list of str) – List of texts to predict topics for.

Returns:

List of predicted topic labels.

Return type:

list of int

Raises:

ValueError – If the model has not been trained yet.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
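The idea of keeping hyperparameters while ignoring some keys, and of reading them back from JSON, can be sketched with the standard library alone. The functions dump_hyperparameters and read_hyperparameters and the explicit path argument are hypothetical; the documented methods may store and load the values differently.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable


def dump_hyperparameters(hyperparameters: Dict[str, Any], path: str,
                         ignore: Iterable[str] = ()) -> None:
    """Write hyperparameters to a JSON file, leaving out the ignored keys."""
    skipped = set(ignore)
    kept = {key: value for key, value in hyperparameters.items()
            if key not in skipped}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(kept, handle, indent=2)


def read_hyperparameters(path: str) -> Dict[str, Any]:
    """Read hyperparameters back from a JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


params = {"n_topics": 20, "umap_args": {"n_components": 15},
          "logger": "<logger object>"}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hyperparameters.json")
    dump_hyperparameters(params, path, ignore=["logger"])
    restored = read_hyperparameters(path)
```

Ignoring keys is useful for values that are not JSON-serialisable or not worth persisting, such as logger objects.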

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

split_document(document, max_length)#

Split a long document into segments of specified maximum length.

Parameters:
  • document (str) – Document to split into segments.

  • max_length (int) – Maximum length of each segment.

Returns:

List of document segments.

Return type:

List[str]
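A minimal sketch of document splitting, assuming max_length is counted in whitespace-separated words. The reference above does not state the unit, so the library may count characters or tokens instead.

```python
from typing import List


def split_document(document: str, max_length: int) -> List[str]:
    """Split a document into segments of at most max_length words."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    words = document.split()
    return [
        " ".join(words[start:start + max_length])
        for start in range(0, len(words), max_length)
    ]


segments = split_document("one two three four five", max_length=2)
```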

class stream.models.CBC[source]#

cluster_documents()[source]#

Clusters documents based on coherence scores.

Returns:

A dictionary mapping cluster labels to lists of document indices.

Return type:

dict

combine_documents(documents, clusters)[source]#

Combines documents within each cluster.

Parameters:
  • documents (DataFrame) – Original DataFrame of documents.

  • clusters (dict) – Dictionary of document clusters.

Returns:

New DataFrame with combined documents.

Return type:

DataFrame
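The combination step can be sketched without pandas, assuming the documents are plain strings and the clusters map each label to a list of document indices (the shape returned by cluster_documents). Using a list and a dict in place of DataFrames is a simplification for illustration only.

```python
from typing import Dict, List


def combine_documents(documents: List[str],
                      clusters: Dict[int, List[int]]) -> Dict[int, str]:
    """Join the texts of every cluster into one combined document."""
    return {
        label: " ".join(documents[i] for i in indices)
        for label, indices in clusters.items()
    }


documents = ["cats purr", "dogs bark", "stocks fell", "bonds rose"]
clusters = {0: [0, 1], 1: [2, 3]}
combined = combine_documents(documents, clusters)
```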

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

fit(dataset=None, max_topics=20, max_iterations=20)[source]#

Clusters documents based on coherence scores until the number of clusters is within a specified threshold.

Parameters:
  • dataset (TMDataset, optional) – Dataset containing the documents.

  • max_topics (int, optional) – Maximum acceptable number of clusters.

  • max_iterations (int, optional) – Maximum number of iterations for clustering.

Raises:

AssertionError – If the dataset is not an instance of TMDataset.
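The stopping rule described for fit (repeat clustering until at most max_topics clusters remain, or max_iterations is reached) can be sketched as a small loop. The pair_neighbours merge step is purely hypothetical and stands in for the coherence-based clustering, which is not reproduced here.

```python
from typing import Callable, List

Cluster = List[int]  # indices of the documents in one cluster


def merge_until_small(clusters: List[Cluster],
                      merge_step: Callable[[List[Cluster]], List[Cluster]],
                      max_topics: int = 20,
                      max_iterations: int = 20) -> List[Cluster]:
    """Apply merge_step until at most max_topics clusters remain."""
    for _ in range(max_iterations):
        if len(clusters) <= max_topics:
            break
        clusters = merge_step(clusters)
    return clusters


def pair_neighbours(clusters: List[Cluster]) -> List[Cluster]:
    """Hypothetical merge step: join clusters two at a time, in order."""
    return [sum(clusters[i:i + 2], []) for i in range(0, len(clusters), 2)]


singletons = [[i] for i in range(8)]
result = merge_until_small(singletons, pair_neighbours, max_topics=3)
```

With eight single-document clusters and max_topics=3, two merge rounds are needed: eight clusters become four, then two.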

get_beta()#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the model.

Returns:

Dictionary containing model information, including the model name.

Return type:

dict

get_theta()#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(texts)[source]#

Predict topics for new documents.

Parameters:

texts (list of str) – List of texts to predict topics for.

Returns:

List of predicted topic labels.

Return type:

list of int

Raises:

ValueError – If the model has not been trained yet.

prepare_data(dataset)[source]#

Prepares the dataset for clustering.

Parameters:

dataset (TMDataset) – Dataset containing the documents.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

class stream.models.CEDC(embedding_model_name='paraphrase-MiniLM-L3-v2', umap_args=None, random_state=None, gmm_args=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#

Class for Clustering-based Embedding-driven Document Clustering (CEDC). Inherits from BaseModel and SentenceEncodingMixin.

Parameters:
  • n_topics (int or None) – Number of topics to extract.

  • embedding_model_name (str) – Name of the embedding model to use.

  • umap_args (dict) – Arguments for UMAP dimensionality reduction.

  • gmm_args (dict) – Arguments for Gaussian Mixture Model (GMM) clustering.

  • embeddings_folder_path (str) – Path to the folder containing embeddings.

  • embeddings_file_path (str) – Path to the file containing embeddings.

  • trained (bool) – Flag indicating whether the model has been trained.

  • save_embeddings (bool) – Whether to save generated embeddings.

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#

Encode a list of documents into embeddings.

Parameters:
  • documents (List[str]) – List of documents to encode.

  • encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.

  • use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.

Returns:

Array of shape (n_documents, embedding_size) containing document embeddings.

Return type:

np.ndarray

fit(dataset=None, n_topics=20, only_nouns=False, clean=False, clean_threshold=0.85, expansion_corpus='octis', n_words=20)[source]#

Trains the CEDC model on the provided dataset.

Parameters:
  • dataset (Dataset) – Dataset containing texts to cluster.

  • n_topics (int, optional) – Number of topics to extract (default is 20).

  • only_nouns (bool, optional) – Whether to consider only nouns during topic extraction (default is False).

  • clean (bool, optional) – Whether to clean topics based on similarity (default is False).

  • clean_threshold (float, optional) – Threshold for cleaning topics based on similarity (default is 0.85).

  • expansion_corpus (str, optional) – Corpus for expanding topics (default is ‘octis’).

  • n_words (int, optional) – Number of top words to include in each topic (default is 20).

Return type:

None

get_beta()[source]#

Constructs a topic-word matrix from the model’s topic dictionary, in which keys are topic indices and values are lists of (word, prevalence) tuples.

Returns:

Topic-word matrix where rows represent topics and columns represent words.

Return type:

ndarray

Notes

The topic-word matrix is constructed by assigning prevalences of words in topics. Words are sorted alphabetically across all topics.

Raises:

ValueError – If the model has not been trained yet.
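The construction described in the notes can be sketched as follows: collect all words across topics, sort them alphabetically to fix the column order, and write each word's prevalence into its topic row. The function name build_topic_word_matrix and the nested-list return type are assumptions for this example.

```python
from typing import Dict, List, Tuple


def build_topic_word_matrix(
    topic_dict: Dict[int, List[Tuple[str, float]]]
) -> Tuple[List[List[float]], List[str]]:
    """Return (matrix, vocabulary) with one row per topic."""
    # Columns: every word that occurs in any topic, in alphabetical order.
    vocabulary = sorted({word for words in topic_dict.values()
                         for word, _ in words})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for topic in sorted(topic_dict):
        row = [0.0] * len(vocabulary)
        for word, prevalence in topic_dict[topic]:
            row[column[word]] = prevalence
        matrix.append(row)
    return matrix, vocabulary


topic_dict = {
    0: [("dog", 0.7), ("cat", 0.3)],
    1: [("stock", 0.6), ("bond", 0.4)],
}
matrix, vocabulary = build_topic_word_matrix(topic_dict)
```

Words that do not belong to a topic keep a prevalence of 0.0 in that topic's row.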

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the model.

Returns:

Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, GMM arguments, and training status.

Return type:

dict

get_theta()#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(texts, proba=True)[source]#

Predict topics for new documents.

Parameters:

texts (list of str) – List of texts to predict topics for.

Returns:

List of predicted topic labels.

Return type:

list of int

Raises:

ValueError – If the model has not been trained yet.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

split_document(document, max_length)#

Split a long document into segments of specified maximum length.

Parameters:
  • document (str) – Document to split into segments.

  • max_length (int) – Maximum length of each segment.

Returns:

List of document segments.

Return type:

List[str]

class stream.models.DCTE(model='paraphrase-MiniLM-L3-v2')[source]#

A document classification and topic extraction class that utilizes the SetFitModel for document classification and TF-IDF for topic extraction.

This class inherits from the AbstractModel class and is designed for supervised document classification and unsupervised topic modeling.

n_topics#

The number of topics to identify in the dataset.

Type:

int

model#

The SetFitModel used for document classification.

Type:

SetFitModel

batch_size#

The batch size used in training.

Type:

int

num_iterations#

The number of iterations for SetFit training.

Type:

int

num_epochs#

The number of epochs for SetFit training.

Type:

int

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

get_beta()#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

abstract get_info()#

Get information about the model.

get_theta()#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

abstract predict(X)#

Predict topics for new documents X.

Parameters:

X – Input documents to predict topics for.

Returns:

Predicted topics for input documents.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

train_model(train_dataset, predict_dataset, val_split=0.2, n_top_words=10, **training_args)[source]#

Trains the DCTE model using the given training dataset and then performs prediction and topic extraction on the specified prediction dataset.

The method uses the SetFitTrainer for training and evaluates the model’s performance. It then applies the trained model for prediction and extracts topics using TF-IDF.

Parameters:
  • train_dataset – The dataset used for training the model.

  • predict_dataset – The dataset on which to perform prediction and topic extraction.

  • val_split (float, optional) – The fraction of the training data to use as validation data. Defaults to 0.2.

  • n_top_words (int, optional) – The number of top words to extract for each topic. Defaults to 10.

Returns:

A dictionary containing the extracted topics and the topic-word matrix.

Return type:

dict
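To illustrate TF-IDF-based topic extraction on classified documents, the sketch below scores words per predicted label and keeps the highest-scoring ones. The labels list stands in for classifier predictions, and the weighting (term frequency within a label times a smoothed inverse label frequency) is an assumption; the library's exact formula may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def top_words_per_label(texts: List[str], labels: List[int],
                        n_top_words: int = 10) -> Dict[int, List[str]]:
    """Rank the words of each label by a class-level TF-IDF score."""
    counts: Dict[int, Counter] = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, Counter()).update(text.lower().split())
    # Number of labels whose documents contain each word.
    label_frequency = Counter(word for c in counts.values() for word in c)
    n_labels = len(counts)
    topics = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores = {
            word: (count / total) * math.log(1 + n_labels / label_frequency[word])
            for word, count in counter.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        topics[label] = ranked[:n_top_words]
    return topics


texts = ["the cat sat", "the cat purred", "the market fell", "the market rose"]
labels = [0, 0, 1, 1]
topics = top_words_per_label(texts, labels, n_top_words=1)
```

Words shared by every label, such as "the", are down-weighted, so the label-specific words rise to the top.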

class stream.models.KmeansTM(embedding_model_name='paraphrase-MiniLM-L3-v2', umap_args=None, kmeans_args=None, random_state=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#

A topic modeling class that uses K-Means clustering on text data.

This class inherits from the BaseModel class and utilizes sentence embeddings, UMAP for dimensionality reduction, and K-Means for clustering text data into topics.

Parameters:
  • embedding_model_name (str) – Name of the sentence embedding model to use.

  • umap_args (dict) – Arguments for UMAP dimensionality reduction.

  • kmeans_args (dict) – Arguments for K-Means clustering.

  • embeddings_folder_path (str) – Path to the folder containing embeddings.

  • embeddings_file_path (str) – Path to the file containing embeddings.

  • trained (bool) – Flag indicating whether the model has been trained.

  • save_embeddings (bool) – Whether to save generated embeddings.

  • n_topics (int or None) – Number of topics to extract.

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#

Encode a list of documents into embeddings.

Parameters:
  • documents (List[str]) – List of documents to encode.

  • encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.

  • use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.

Returns:

Array of shape (n_documents, embedding_size) containing document embeddings.

Return type:

np.ndarray

fit(dataset=None, n_topics=20)[source]#

Trains the K-Means topic model on the provided dataset.

Parameters:
  • dataset (Dataset) – The dataset to train the model on.

  • n_topics (int, optional) – Number of topics to extract (default is 20).

Raises:

AssertionError – If the dataset is not an instance of TMDataset.
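For readers unfamiliar with the clustering step, here is a dependency-free sketch of K-Means (Lloyd iterations) on a handful of 2-d points that stand in for reduced embeddings. The deterministic initialisation with the first n_topics points is an assumption chosen to keep the example reproducible; it is not how the model itself is initialised.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], n_topics: int,
           n_iterations: int = 10) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd iterations, initialised with the first n_topics points."""
    centroids = [list(p) for p in points[:n_topics]]
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(n_topics),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(n_topics):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels, centroids


embeddings = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0)]
labels, centroids = kmeans(embeddings, n_topics=2)
```

Each cluster label then plays the role of a topic, and the documents assigned to it supply the topic's words.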

get_beta()#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the model.

Returns:

Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, K-Means arguments, and training status.

Return type:

dict

get_theta()#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(texts)[source]#

Predict topics for new documents.

Parameters:

texts (list of str) – List of texts to predict topics for.

Returns:

List of predicted topic labels.

Return type:

list of int

Raises:

ValueError – If the model has not been trained yet.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

split_document(document, max_length)#

Split a long document into segments of specified maximum length.

Parameters:
  • document (str) – Document to split into segments.

  • max_length (int) – Maximum length of each segment.

Returns:

List of document segments.

Return type:

List[str]

class stream.models.SOMTM(m, n, umap_args={}, embedding_model_name='paraphrase-MiniLM-L3-v2', embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, reduce_dim=True, reduced_dimension=16, dim=None, **kwargs)[source]#

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#

Encode a list of documents into embeddings.

Parameters:
  • documents (List[str]) – List of documents to encode.

  • encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.

  • use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.

Returns:

Array of shape (n_documents, embedding_size) containing document embeddings.

Return type:

np.ndarray

fit(dataset=None, n_iterations=100, batch_size=128, lr=None, sigma=None, use_softmax=True)[source]#

Fit the SOMTM model to the dataset.

Parameters:
  • dataset (TMDataset, optional) – The dataset to fit the model to.

  • n_iterations (int, optional) – Number of iterations for training (default is 100).

  • batch_size (int, optional) – Batch size for training (default is 128).

  • lr (float, optional) – Initial learning rate (default is None, which sets it to 0.3).

  • sigma (float, optional) – Initial neighborhood value (default is None, which sets it to max(m, n) / 2).

  • use_softmax (bool, optional) – Whether to use softmax for mapping (default is True).
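A single self-organizing map update can be sketched as below, using the documented fallbacks lr = 0.3 and sigma = max(m, n) / 2. The Gaussian neighbourhood function and the dictionary-based grid are assumptions taken from the textbook SOM formulation, not from the SOMTM source.

```python
import math
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]
Grid = Dict[Node, List[float]]


def default_sigma(m: int, n: int) -> float:
    """Documented fallback for sigma when None is passed."""
    return max(m, n) / 2


def best_matching_unit(weights: Grid, x: Sequence[float]) -> Node:
    """Grid node whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2
                                    for w, v in zip(weights[node], x)))


def som_update(weights: Grid, x: Sequence[float],
               lr: float = 0.3, sigma: float = 1.0) -> Node:
    """Move every node towards x, weighted by a Gaussian neighbourhood."""
    bmu = best_matching_unit(weights, x)
    for node, w in weights.items():
        grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
        influence = math.exp(-grid_dist_sq / (2 * sigma ** 2))
        weights[node] = [wi + lr * influence * (xi - wi)
                         for wi, xi in zip(w, x)]
    return bmu


m, n = 1, 2  # a tiny 1 x 2 map
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
bmu = som_update(weights, [1.0, 0.8], lr=0.3, sigma=default_sigma(m, n))
```

The best matching unit moves by the full learning rate, while its neighbour moves by a smaller amount that decays with grid distance.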

get_beta()#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the model.

Returns:

Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, K-Means arguments, and training status.

Return type:

dict

get_theta()#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topic_document_matrix()[source]#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topic_word_matrix()[source]#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)[source]#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(documents)[source]#

Predict topics for new documents.

Parameters:

documents – Input documents to predict topics for.

Returns:

Predicted topics for input documents.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

split_document(document, max_length)#

Split a long document into segments of specified maximum length.

Parameters:
  • document (str) – Document to split into segments.

  • max_length (int) – Maximum length of each segment.

Returns:

List of document segments.

Return type:

List[str]

class stream.models.WordCluTM(umap_args=None, random_state=None, gmm_args=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#

A topic modeling class that uses Word2Vec embeddings and K-Means or GMM clustering on vocabulary to form coherent word clusters.

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

get_beta(dataset)[source]#

Retrieve the topic-word distribution matrix.

Returns:

Topic-word distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the model.

Returns:

Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, K-Means arguments, and training status.

Return type:

dict

get_theta()[source]#

Retrieve the topic-document distribution matrix.

Returns:

Topic-document distribution matrix.

Return type:

numpy.ndarray

Raises:

ValueError – If the model has not been trained yet.

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(texts, proba=True)[source]#

Predict topics for new documents.

Parameters:

texts (list of str) – List of texts to predict topics for.

Returns:

List of predicted topic labels.

Return type:

list of int

Raises:

ValueError – If the model has not been trained yet.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.

train_word2vec(sentences, epochs, vector_size, window, min_count, workers)[source]#

Train a Word2Vec model on the given sentences.

Parameters:

sentences (list) – List of tokenized sentences.

class stream.models.LDA(id2word=None, id_corpus=None, random_state=None)[source]#

dim_reduction(logger)#

Reduces the dimensionality of embeddings using UMAP.

Raises:

ValueError – If an error occurs during dimensionality reduction.

fit(dataset=None, n_topics=20, **lda_params)[source]#

Fit the LDA model to the dataset.

Parameters:
  • dataset (TMDataset, optional) – The dataset to fit the model to. Must be an instance of TMDataset.

  • n_topics (int, optional) – The number of topics to extract (default is 20).

  • **lda_params (dict, optional) – Additional parameters to pass to the Gensim LdaModel.

Raises:
  • AssertionError – If the dataset is not an instance of TMDataset.

  • RuntimeError – If there is an error during training.

get_beta()[source]#

Get the word distribution for each topic.

Returns:

topic_word_matrix – List of topics, where each topic is a list of (word_id, probability) tuples.

Return type:

list of list of tuples

Raises:

RuntimeError – If the model has not been trained yet or failed.
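Because this return value holds word ids rather than words, a lookup is needed to read the topics. The sketch below assumes an id2word mapping from word id to string (here a plain dict) and a hypothetical helper describe_topics.

```python
from typing import Dict, List, Tuple


def describe_topics(topic_word_matrix: List[List[Tuple[int, float]]],
                    id2word: Dict[int, str],
                    n_words: int = 10) -> List[List[str]]:
    """Translate (word_id, probability) pairs into ranked word lists."""
    topics = []
    for topic in topic_word_matrix:
        ranked = sorted(topic, key=lambda pair: pair[1], reverse=True)
        topics.append([id2word[word_id] for word_id, _ in ranked[:n_words]])
    return topics


id2word = {0: "cat", 1: "dog", 2: "stock"}
topic_word_matrix = [
    [(0, 0.6), (1, 0.3), (2, 0.1)],
    [(2, 0.8), (1, 0.15), (0, 0.05)],
]
topics = describe_topics(topic_word_matrix, id2word, n_words=2)
```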

get_hyperparameters()#

Get the model hyperparameters.

Returns:

Dictionary containing the model hyperparameters.

Return type:

dict

get_info()[source]#

Get information about the LDA model.

Returns:

info – Dictionary containing model information.

Return type:

dict

get_theta()[source]#

Get the topic distribution for each document.

Returns:

topic_document_matrix – DataFrame where each row corresponds to a document and each column to a topic, with the values representing the topic probabilities for each document.

Return type:

pd.DataFrame

Raises:

RuntimeError – If the model has not been trained yet or failed.
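With one row per document and one column per topic, the most probable topic of each document is the position of the row maximum. The sketch uses nested lists in place of a DataFrame, and dominant_topics is a hypothetical helper.

```python
from typing import List, Sequence


def dominant_topics(theta: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each document row, the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


theta = [
    [0.7, 0.2, 0.1],   # document 0
    [0.1, 0.1, 0.8],   # document 1
]
labels = dominant_topics(theta)
```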

get_topics(n_words=10)#

Retrieve the top words for each topic.

Parameters:

n_words (int) – Number of top words to retrieve for each topic.

Returns:

List of topics with each topic represented as a list of top words.

Return type:

list of list of str

Raises:

ValueError – If the model has not been trained yet.

load_hyperparameters(path)#

Load the model hyperparameters from a JSON file.

Parameters:

path (str) – Path to the JSON file containing hyperparameters.

load_model(path)#

Load the model state and parameters from a file.

Parameters:

path (str) – Path to the saved model file.

predict(dataset)[source]#

Predict topics for new documents.

Parameters:

dataset – Dataset containing the documents to predict topics for.

Returns:

Predicted topics for input documents.

prepare_embeddings(dataset, logger)#

Prepares the dataset for clustering.

Parameters:

dataset (Dataset) – The dataset to be used for clustering.

save_hyperparameters(ignore=[])#

Save the hyperparameters while ignoring specified keys.

Parameters:

ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].

save_model(path)#

Save the model state and parameters to a file.

Parameters:

path (str) – Path to save the model file.