Models#
- class stream.models.BERTopicTM(embedding_model_name='paraphrase-MiniLM-L3-v2', umap_args=None, min_cluster_size=None, hdbscan_args=None, random_state=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#
A topic modeling class that uses HDBSCAN clustering on text data.
This class inherits from the AbstractModel class and utilizes sentence embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering text data into topics.
- hyperparameters#
A dictionary of hyperparameters for the model.
- Type:
dict
- n_topics#
The number of topics to cluster the documents into.
- Type:
int
- embedding_model#
The sentence embedding model.
- Type:
SentenceTransformer
- umap_args#
Arguments for UMAP dimensionality reduction.
- Type:
dict
- hdbscan_args#
Arguments for the HDBSCAN clustering algorithm.
- Type:
dict
- optim#
Flag to enable optimization of the number of clusters.
- Type:
bool
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#
Encode a list of documents into embeddings.
- Parameters:
documents (List[str]) – List of documents to encode.
encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.
use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.
- Returns:
Array of shape (n_documents, embedding_size) containing document embeddings.
- Return type:
np.ndarray
- fit(dataset)[source]#
Trains the BERTopic topic model on the provided dataset.
Applies sentence embedding, UMAP dimensionality reduction, and HDBSCAN clustering to the dataset to identify distinct topics within the text data.
- Parameters:
dataset – The dataset to train the model on. It should contain the text documents.
- Returns:
A dictionary containing the identified topics and the topic-word matrix.
- Return type:
dict
- get_beta()#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the model.
- Returns:
Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, HDBSCAN arguments, and training status.
- Return type:
dict
- get_theta()#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
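As a rough illustration of the kind of result get_topics describes, the sketch below ranks a vocabulary by the weights in each row of a topic-word matrix. The names top_words, beta, and vocab are hypothetical and not part of the stream API, and the library's own implementation may differ.

```python
from typing import List, Sequence


def top_words(beta: Sequence[Sequence[float]], vocab: Sequence[str], n_words: int = 10) -> List[List[str]]:
    """Return the n_words highest-weighted vocabulary entries for each topic row."""
    topics = []
    for row in beta:
        # Sort column indices by descending weight; ties keep vocabulary order.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        topics.append([vocab[j] for j in ranked[:n_words]])
    return topics


# Toy data (illustrative values only): 2 topics over a 4-word vocabulary.
vocab = ["apple", "ball", "court", "fruit"]
beta = [
    [0.5, 0.0, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]
example_topics = top_words(beta, vocab, n_words=2)
```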
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(texts)[source]#
Predict topics for new documents.
- Parameters:
texts (list of str) – List of texts to predict topics for.
- Returns:
List of predicted topic labels.
- Return type:
list of int
- Raises:
ValueError – If the model has not been trained yet.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
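The following is a minimal sketch of the idea behind save_hyperparameters and load_hyperparameters: drop the ignored keys, then round-trip the remaining values through a JSON file. The helper filter_hyperparameters and the example keys are assumptions made for illustration; how stream stores hyperparameters internally is not shown in this reference.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Optional


def filter_hyperparameters(params: Dict[str, Any], ignore: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Return a copy of params without the ignored keys (hypothetical helper)."""
    ignored = set(ignore or ())
    return {key: value for key, value in params.items() if key not in ignored}


# Example values are illustrative; "self" stands for an entry that is not worth saving.
params = {"n_topics": 20, "embedding_model_name": "paraphrase-MiniLM-L3-v2", "self": "<object>"}
kept = filter_hyperparameters(params, ignore=["self"])

# Round-trip through a JSON file, the format load_hyperparameters(path) reads.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hyperparameters.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(kept, fh)
    with open(path, encoding="utf-8") as fh:
        restored = json.load(fh)
```

The sketch uses None rather than a mutable list as the default for ignore, which is a common Python precaution against shared default state.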
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- split_document(document, max_length)#
Split a long document into segments of specified maximum length.
- Parameters:
document (str) – Document to split into segments.
max_length (int) – Maximum length of each segment.
- Returns:
List of document segments.
- Return type:
List[str]
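The interplay between split_document and the use_average option of encode_documents can be sketched as follows. This is an illustration under stated assumptions: max_length is taken to count whitespace-separated words (the reference does not specify the unit), and toy_embed is a hypothetical stand-in for a sentence encoder.

```python
from typing import Callable, List


def split_into_segments(document: str, max_length: int) -> List[str]:
    """Split a document into segments of at most max_length words (assumed unit)."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    words = document.split()
    return [" ".join(words[i:i + max_length]) for i in range(0, len(words), max_length)]


def average_embedding(segments: List[str], embed: Callable[[str], List[float]]) -> List[float]:
    """Embed each segment and return the element-wise mean vector."""
    if not segments:
        raise ValueError("at least one segment is required")
    vectors = [embed(segment) for segment in segments]
    dim = len(vectors[0])
    return [sum(vector[d] for vector in vectors) / len(vectors) for d in range(dim)]


def toy_embed(text: str) -> List[float]:
    """Stand-in for a sentence encoder: [word count, character count]."""
    return [float(len(text.split())), float(len(text))]


segments = split_into_segments("one two three four five", max_length=2)
doc_vector = average_embedding(segments, toy_embed)
```

Averaging keeps one fixed-size vector per document, so the result still has the shape (n_documents, embedding_size) that encode_documents reports.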
- class stream.models.CBC[source]#
- cluster_documents()[source]#
Clusters documents based on coherence scores.
- Returns:
A dictionary mapping cluster labels to lists of document indices.
- Return type:
dict
- combine_documents(documents, clusters)[source]#
Combines documents within each cluster.
- Parameters:
documents (DataFrame) – Original DataFrame of documents.
clusters (dict) – Dictionary of document clusters.
- Returns:
New DataFrame with combined documents.
- Return type:
DataFrame
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- fit(dataset=None, max_topics=20, max_iterations=20)[source]#
Clusters documents based on coherence scores until the number of clusters is within a specified threshold.
- Parameters:
dataset (TMDataset, optional) – Dataset containing the documents.
max_topics (int, optional) – Maximum acceptable number of clusters.
max_iterations (int, optional) – Maximum number of iterations for clustering.
- Raises:
AssertionError – If the dataset is not an instance of TMDataset.
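The stopping rule described for fit (iterate until the number of clusters is at most max_topics, or until max_iterations is reached) can be sketched as below. This is not the CBC algorithm itself: the Jaccard word overlap is only a stand-in for a coherence score, and merge_until and jaccard are hypothetical names.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity, used here only as a stand-in for a coherence score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_until(clusters: List[Set[str]], max_topics: int, max_iterations: int) -> List[Set[str]]:
    """Merge the most similar pair per iteration until few enough clusters remain."""
    clusters = [set(c) for c in clusters]
    for _ in range(max_iterations):
        # Stop once the cluster count is acceptable (or nothing is left to merge).
        if len(clusters) <= max(max_topics, 1):
            break
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: jaccard(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


docs = [{"cat", "dog"}, {"dog", "pet"}, {"stock", "bond"}, {"bond", "yield"}]
merged = merge_until(docs, max_topics=2, max_iterations=20)
```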
- get_beta()#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the model.
- Returns:
Dictionary containing model information, including the model name.
- Return type:
dict
- get_theta()#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(texts)[source]#
Predict topics for new documents.
- Parameters:
texts (list of str) – List of texts to predict topics for.
- Returns:
List of predicted topic labels.
- Return type:
list of int
- Raises:
ValueError – If the model has not been trained yet.
- prepare_data(dataset)[source]#
Prepares the dataset for clustering.
- Parameters:
dataset (TMDataset) – Dataset containing the documents.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- class stream.models.CEDC(embedding_model_name='paraphrase-MiniLM-L3-v2', umap_args=None, random_state=None, gmm_args=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#
Class for Clustering-based Embedding-driven Document Clustering (CEDC). Inherits from BaseModel and SentenceEncodingMixin.
- Parameters:
n_topics (int or None) – Number of topics to extract.
embedding_model_name (str) – Name of the embedding model to use.
umap_args (dict) – Arguments for UMAP dimensionality reduction.
gmm_args (dict) – Arguments for Gaussian Mixture Model (GMM) clustering.
embeddings_folder_path (str) – Path to the folder containing embeddings.
embeddings_file_path (str) – Path to the file containing embeddings.
trained (bool) – Flag indicating whether the model has been trained.
save_embeddings (bool) – Whether to save generated embeddings.
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#
Encode a list of documents into embeddings.
- Parameters:
documents (List[str]) – List of documents to encode.
encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.
use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.
- Returns:
Array of shape (n_documents, embedding_size) containing document embeddings.
- Return type:
np.ndarray
- fit(dataset=None, n_topics=20, only_nouns=False, clean=False, clean_threshold=0.85, expansion_corpus='octis', n_words=20)[source]#
Trains the CEDC model on the provided dataset.
- Parameters:
dataset (Dataset) – Dataset containing texts to cluster.
n_topics (int, optional) – Number of topics to extract (default is 20).
only_nouns (bool, optional) – Whether to consider only nouns during topic extraction (default is False).
clean (bool, optional) – Whether to clean topics based on similarity (default is False).
clean_threshold (float, optional) – Threshold for cleaning topics based on similarity (default is 0.85).
expansion_corpus (str, optional) – Corpus for expanding topics (default is ‘octis’).
n_words (int, optional) – Number of top words to include in each topic (default is 20).
- Return type:
None
- get_beta()[source]#
Constructs a topic-word matrix from the model’s topic dictionary, in which keys are topic indices and values are lists of (word, prevalence) tuples. The method itself takes no arguments.
- Returns:
Topic-word matrix where rows represent topics and columns represent words.
- Return type:
ndarray
Notes
The topic-word matrix is constructed by assigning prevalences of words in topics. Words are sorted alphabetically across all topics.
- Raises:
ValueError – If the model has not been trained yet.
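The construction described in the Notes above can be sketched as follows, assuming a topic dictionary that maps topic indices to lists of (word, prevalence) tuples. The function build_topic_word_matrix is a hypothetical illustration, not the library's code; words absent from a topic are given a prevalence of 0.0 here, which is an assumption.

```python
from typing import Dict, List, Tuple


def build_topic_word_matrix(
    topic_dict: Dict[int, List[Tuple[str, float]]]
) -> Tuple[List[List[float]], List[str]]:
    """Return (matrix, vocabulary): topics as rows, alphabetically sorted words as columns."""
    vocabulary = sorted({word for pairs in topic_dict.values() for word, _ in pairs})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for topic in sorted(topic_dict):
        row = [0.0] * len(vocabulary)  # words missing from a topic stay at 0.0
        for word, prevalence in topic_dict[topic]:
            row[column[word]] = prevalence
        matrix.append(row)
    return matrix, vocabulary


# Toy topic dictionary with illustrative prevalences.
topic_dict = {
    0: [("market", 0.7), ("bank", 0.3)],
    1: [("goal", 0.6), ("bank", 0.1), ("team", 0.3)],
}
matrix, vocabulary = build_topic_word_matrix(topic_dict)
```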
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the model.
- Returns:
Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, GMM arguments, and training status.
- Return type:
dict
- get_theta()#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(texts, proba=True)[source]#
Predict topics for new documents.
- Parameters:
texts (list of str) – List of texts to predict topics for.
- Returns:
List of predicted topic labels.
- Return type:
list of int
- Raises:
ValueError – If the model has not been trained yet.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- split_document(document, max_length)#
Split a long document into segments of specified maximum length.
- Parameters:
document (str) – Document to split into segments.
max_length (int) – Maximum length of each segment.
- Returns:
List of document segments.
- Return type:
List[str]
- class stream.models.DCTE(model='paraphrase-MiniLM-L3-v2')[source]#
A document classification and topic extraction class that utilizes the SetFitModel for document classification and TF-IDF for topic extraction.
This class inherits from the AbstractModel class and is designed for supervised document classification and unsupervised topic modeling.
- n_topics#
The number of topics to identify in the dataset.
- Type:
int
- model#
The SetFitModel used for document classification.
- Type:
SetFitModel
- batch_size#
The batch size used in training.
- Type:
int
- num_iterations#
The number of iterations for SetFit training.
- Type:
int
- num_epochs#
The number of epochs for SetFit training.
- Type:
int
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- get_beta()#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- abstract get_info()#
Get information about the model.
- get_theta()#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- abstract predict(X)#
Predict topics for new documents X.
- Parameters:
X – Input documents to predict topics for.
- Returns:
Predicted topics for input documents.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- train_model(train_dataset, predict_dataset, val_split=0.2, n_top_words=10, **training_args)[source]#
Trains the DCTE model using the given training dataset and then performs prediction and topic extraction on the specified prediction dataset.
The method uses the SetFitTrainer for training and evaluates the model’s performance. It then applies the trained model for prediction and extracts topics using TF-IDF.
- Parameters:
train_dataset – The dataset used for training the model.
predict_dataset – The dataset on which to perform prediction and topic extraction.
val_split (float, optional) – The fraction of the training data to use as validation data. Defaults to 0.2.
n_top_words (int, optional) – The number of top words to extract for each topic. Defaults to 10.
- Returns:
A dictionary containing the extracted topics and the topic-word matrix.
- Return type:
dict
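The TF-IDF topic extraction step can be illustrated with a small class-based sketch: all texts predicted for one class are pooled, and words are ranked by term frequency times a smoothed inverse document frequency. The function tfidf_top_words, the whitespace tokenization, and the smoothing formula are assumptions made for illustration; DCTE's actual extraction may differ.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def tfidf_top_words(texts: List[str], labels: List[int], n_top_words: int = 10) -> Dict[int, List[str]]:
    """Treat each predicted class as one pooled document and rank its words by TF-IDF."""
    class_counts: Dict[int, Counter] = defaultdict(Counter)
    for text, label in zip(texts, labels):
        class_counts[label].update(text.lower().split())

    n_classes = len(class_counts)
    # Document frequency: number of classes in which each word occurs.
    df = Counter(word for counts in class_counts.values() for word in counts)

    topics = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        scores = {
            # Smoothed IDF keeps the weight finite for words shared by all classes.
            word: (count / total) * (math.log((1 + n_classes) / (1 + df[word])) + 1.0)
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        topics[label] = ranked[:n_top_words]
    return topics


texts = ["match ended draw", "team won match", "bank raised rates", "rates fell bank"]
labels = [0, 0, 1, 1]
topics = tfidf_top_words(texts, labels, n_top_words=2)
```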
- class stream.models.KmeansTM(embedding_model_name='paraphrase-MiniLM-L3-v2', umap_args=None, kmeans_args=None, random_state=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#
A topic modeling class that uses K-Means clustering on text data.
This class inherits from the BaseModel class and utilizes sentence embeddings, UMAP for dimensionality reduction, and K-Means for clustering text data into topics.
- Parameters:
embedding_model_name (str) – Name of the sentence embedding model to use.
umap_args (dict) – Arguments for UMAP dimensionality reduction.
kmeans_args (dict) – Arguments for K-Means clustering.
embeddings_folder_path (str) – Path to the folder containing embeddings.
embeddings_file_path (str) – Path to the file containing embeddings.
trained (bool) – Flag indicating whether the model has been trained.
save_embeddings (bool) – Whether to save generated embeddings.
n_topics (int or None) – Number of topics to extract.
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#
Encode a list of documents into embeddings.
- Parameters:
documents (List[str]) – List of documents to encode.
encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.
use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.
- Returns:
Array of shape (n_documents, embedding_size) containing document embeddings.
- Return type:
np.ndarray
- fit(dataset=None, n_topics=20)[source]#
Trains the K-Means topic model on the provided dataset.
- Parameters:
dataset (Dataset) – The dataset to train the model on.
n_topics (int, optional) – Number of topics to extract (default is 20).
- Raises:
AssertionError – If the dataset is not an instance of TMDataset.
- get_beta()#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the model.
- Returns:
Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, K-Means arguments, and training status.
- Return type:
dict
- get_theta()#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(texts)[source]#
Predict topics for new documents.
- Parameters:
texts (list of str) – List of texts to predict topics for.
- Returns:
List of predicted topic labels.
- Return type:
list of int
- Raises:
ValueError – If the model has not been trained yet.
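For a K-Means model, predicting a topic label typically amounts to assigning each (embedded and reduced) document to its nearest cluster centroid. The sketch below shows only that assignment step on toy 2-D points; assign_to_nearest_centroid is a hypothetical helper, and whether KmeansTM.predict works exactly this way is an assumption.

```python
import math
from typing import List, Sequence


def assign_to_nearest_centroid(
    points: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Return, for each point, the index of the closest centroid (Euclidean distance)."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels


# Toy centroids and points standing in for reduced document embeddings.
centroids = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]]
predicted = assign_to_nearest_centroid(points, centroids)
```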
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- split_document(document, max_length)#
Split a long document into segments of specified maximum length.
- Parameters:
document (str) – Document to split into segments.
max_length (int) – Maximum length of each segment.
- Returns:
List of document segments.
- Return type:
List[str]
- class stream.models.SOMTM(m, n, umap_args={}, embedding_model_name='paraphrase-MiniLM-L3-v2', embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, reduce_dim=True, reduced_dimension=16, dim=None, **kwargs)[source]#
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- encode_documents(documents, encoder_model='paraphrase-MiniLM-L3-v2', use_average=True)#
Encode a list of documents into embeddings.
- Parameters:
documents (List[str]) – List of documents to encode.
encoder_model (str) – Name or path of the sentence encoder model. Defaults to ‘paraphrase-MiniLM-L3-v2’.
use_average (bool) – Whether to use average embeddings for long documents. Defaults to True.
- Returns:
Array of shape (n_documents, embedding_size) containing document embeddings.
- Return type:
np.ndarray
- fit(dataset=None, n_iterations=100, batch_size=128, lr=None, sigma=None, use_softmax=True)[source]#
Fit the SOMTM model to the dataset.
- Parameters:
dataset (TMDataset, optional) – The dataset to fit the model to.
n_iterations (int, optional) – Number of iterations for training (default is 100).
batch_size (int, optional) – Batch size for training (default is 128).
lr (float, optional) – Initial learning rate (default is None, which sets it to 0.3).
sigma (float, optional) – Initial neighborhood value (default is None, which sets it to max(m, n) / 2).
use_softmax (bool, optional) – Whether to use softmax for mapping (default is True).
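The default handling for lr and sigma described above, together with one plausible reading of the softmax mapping, can be sketched as follows. The function names are hypothetical, and applying the softmax to negative distances between a document and the SOM nodes is an assumption about what use_softmax does.

```python
import math
from typing import List, Optional, Sequence, Tuple


def resolve_som_defaults(
    m: int, n: int, lr: Optional[float] = None, sigma: Optional[float] = None
) -> Tuple[float, float]:
    """Apply the documented fallbacks: lr -> 0.3, sigma -> max(m, n) / 2."""
    resolved_lr = 0.3 if lr is None else lr
    resolved_sigma = max(m, n) / 2 if sigma is None else sigma
    return resolved_lr, resolved_sigma


def softmax_over_nodes(distances: Sequence[float]) -> List[float]:
    """Turn distances to SOM nodes into weights; closer nodes get larger weights."""
    scores = [-d for d in distances]
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


lr, sigma = resolve_som_defaults(m=4, n=6)
weights = softmax_over_nodes([0.5, 2.0, 1.0])
```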
- get_beta()#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the model.
- Returns:
Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, K-Means arguments, and training status.
- Return type:
dict
- get_theta()#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topic_document_matrix()[source]#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topic_word_matrix()[source]#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)[source]#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(documents)[source]#
Predict topics for new documents.
- Parameters:
documents – Input documents to predict topics for.
- Returns:
Predicted topics for input documents.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- split_document(document, max_length)#
Split a long document into segments of specified maximum length.
- Parameters:
document (str) – Document to split into segments.
max_length (int) – Maximum length of each segment.
- Returns:
List of document segments.
- Return type:
List[str]
- class stream.models.WordCluTM(umap_args=None, random_state=None, gmm_args=None, embeddings_folder_path=None, embeddings_file_path=None, save_embeddings=False, **kwargs)[source]#
A topic modeling class that uses Word2Vec embeddings and K-Means or GMM clustering on vocabulary to form coherent word clusters.
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- get_beta(dataset)[source]#
Retrieve the topic-word distribution matrix.
- Returns:
Topic-word distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the model.
- Returns:
Dictionary containing model information including model name, number of topics, embedding model name, UMAP arguments, K-Means arguments, and training status.
- Return type:
dict
- get_theta()[source]#
Retrieve the topic-document distribution matrix.
- Returns:
Topic-document distribution matrix.
- Return type:
numpy.ndarray
- Raises:
ValueError – If the model has not been trained yet.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(texts, proba=True)[source]#
Predict topics for new documents.
- Parameters:
texts (list of str) – List of texts to predict topics for.
- Returns:
List of predicted topic labels.
- Return type:
list of int
- Raises:
ValueError – If the model has not been trained yet.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.
- class stream.models.LDA(id2word=None, id_corpus=None, random_state=None)[source]#
- dim_reduction(logger)#
Reduces the dimensionality of embeddings using UMAP.
- Raises:
ValueError – If an error occurs during dimensionality reduction.
- fit(dataset=None, n_topics=20, **lda_params)[source]#
Fit the LDA model to the dataset.
- Parameters:
dataset (TMDataset, optional) – The dataset to fit the model to. Must be an instance of TMDataset.
n_topics (int, optional) – The number of topics to extract (default is 20).
**lda_params (dict, optional) – Additional parameters to pass to the Gensim LdaModel.
- Raises:
AssertionError – If the dataset is not an instance of TMDataset.
RuntimeError – If there is an error during training.
- get_beta()[source]#
Get the word distribution for each topic.
- Returns:
topic_word_matrix – List of topics, where each topic is a list of (word_id, probability) tuples.
- Return type:
list of list of tuples
- Raises:
RuntimeError – If the model has not been trained yet or failed.
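Because this get_beta returns lists of (word_id, probability) tuples rather than a dense array, turning a topic into readable words requires an id-to-word lookup. The sketch below shows one way to do that; top_words_from_id_pairs and the toy id2word mapping are hypothetical and only mirror the return shape documented above.

```python
from typing import Dict, List, Tuple


def top_words_from_id_pairs(
    topic_word_matrix: List[List[Tuple[int, float]]],
    id2word: Dict[int, str],
    n_words: int = 10,
) -> List[List[str]]:
    """Map each topic's (word_id, probability) pairs to its n_words most probable words."""
    topics = []
    for pairs in topic_word_matrix:
        # Sort by descending probability; ties keep their original order.
        ranked = sorted(pairs, key=lambda pair: -pair[1])
        topics.append([id2word[word_id] for word_id, _ in ranked[:n_words]])
    return topics


# Toy inputs with illustrative probabilities.
id2word = {0: "bank", 1: "rate", 2: "loan", 3: "goal"}
topic_word_matrix = [
    [(0, 0.5), (1, 0.2), (2, 0.3)],
    [(1, 0.4), (3, 0.6)],
]
lda_topics = top_words_from_id_pairs(topic_word_matrix, id2word, n_words=2)
```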
- get_hyperparameters()#
Get the model hyperparameters.
- Returns:
Dictionary containing the model hyperparameters.
- Return type:
dict
- get_info()[source]#
Get information about the LDA model.
- Returns:
info – Dictionary containing model information.
- Return type:
dict
- get_theta()[source]#
Get the topic distribution for each document.
- Returns:
topic_document_matrix – DataFrame where each row corresponds to a document and each column to a topic, with the values representing the topic probabilities for each document.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If the model has not been trained yet or failed.
- get_topics(n_words=10)#
Retrieve the top words for each topic.
- Parameters:
n_words (int) – Number of top words to retrieve for each topic.
- Returns:
List of topics with each topic represented as a list of top words.
- Return type:
list of list of str
- Raises:
ValueError – If the model has not been trained yet.
- load_hyperparameters(path)#
Load the model hyperparameters from a JSON file.
- Parameters:
path (str) – Path to the JSON file containing hyperparameters.
- load_model(path)#
Load the model state and parameters from a file.
- Parameters:
path (str) – Path to the saved model file.
- predict(dataset)[source]#
Predict topics for new documents.
- Parameters:
dataset – Input documents to predict topics for.
- Returns:
Predicted topics for input documents.
- prepare_embeddings(dataset, logger)#
Prepares the dataset for clustering.
- Parameters:
dataset (Dataset) – The dataset to be used for clustering.
- save_hyperparameters(ignore=[])#
Save the hyperparameters while ignoring specified keys.
- Parameters:
ignore (list, optional) – List of keys to ignore while saving hyperparameters. Defaults to [].
- save_model(path)#
Save the model state and parameters to a file.
- Parameters:
path (str) – Path to save the model file.