Preprocessor#
- class stream.preprocessor.BaseEmbedder(embedding_model)[source]#
Base Embedder Class
This class provides a base for creating document and word embeddings using different models.
- Parameters:
embedding_model – The embedding model used for generating embeddings.
- embedder#
The specific backend embedder used for generating embeddings.
- embedding_model#
The embedding model used for generating embeddings.
- _check_documents_type(documents: List[str]): Check if the provided documents are of the correct type.
- _clean_docs(documents: List[str]): Clean and preprocess a list of documents.
- create_doc_embeddings(documents: List[str], progress: bool = False): Create document embeddings.
- create_word_embeddings(word: List[str]): Create word embeddings.
- create_doc_embeddings(documents, progress=False)[source]#
Create document embeddings for a list of documents.
- Parameters:
documents (List[str]) – List of documents to create embeddings for.
progress (bool, optional) – Controls the verbosity of the process.
- Returns:
Document embeddings.
- Return type:
np.ndarray
- class stream.preprocessor.GensimBackend(embedding_model)[source]#
Gensim Embedding Model
This class provides functionality to create document embeddings using Gensim Word2Vec embeddings.
- Parameters:
embedding_model (Word2VecKeyedVectors) – A Gensim Word2Vec model for word embeddings.
- embedding_model#
The Gensim Word2Vec model used for embeddings.
- Type:
Word2VecKeyedVectors
- encode(documents: List[str], verbose: bool = False) -> np.ndarray: Embed a list of documents/words into a matrix of embeddings.
- encode(documents, verbose=False)[source]#
Embed a list of documents/words into an n-dimensional matrix of embeddings.
- Parameters:
documents (List[str]) – A list of documents or words to be embedded.
verbose (bool, optional) – Controls the verbosity of the process.
- Returns:
Document/word embeddings with shape (n, m), with n documents/words that each have an embedding size of m.
- Return type:
np.ndarray
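The (n, m) output shape described above can be illustrated with a minimal sketch. It assumes that a document embedding is the mean of its word vectors, which this page does not confirm for GensimBackend. The helper `encode_by_averaging` and the plain dictionary standing in for a Gensim keyed-vectors object are hypothetical.

```python
import numpy as np


def encode_by_averaging(documents, word_vectors, dim):
    """Hypothetical helper: average the known word vectors of each document."""
    rows = []
    for doc in documents:
        # Keep only tokens that have a vector in the (toy) vocabulary.
        vecs = [word_vectors[tok] for tok in doc.split() if tok in word_vectors]
        # Assumption: documents without any known token map to a zero vector.
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(rows)


# Toy vocabulary with m = 3 dimensions (a stand-in for a trained Word2Vec model).
toy_vectors = {
    "topic": np.array([1.0, 0.0, 0.0]),
    "model": np.array([0.0, 1.0, 0.0]),
}
docs = ["topic model", "unknown words", "model"]
embeddings = encode_by_averaging(docs, toy_vectors, dim=3)
print(embeddings.shape)
```

Each of the n = 3 inputs becomes one row of length m = 3, matching the documented return shape.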
- class stream.preprocessor.TextPreprocessor(**kwargs)[source]#
- class stream.preprocessor.TopicExtractor(dataset, topic_assignments, n_topics, embedding_model)[source]#
- stream.preprocessor.clean_topics(topics, embedding_model, similarity=0.75)[source]#
Cleans each topic based on the cosine similarity between all words in the topic. Although we extract only nouns and lemmatize them, words such as “tiger” and “tigers” can still both appear among the top words of a topic. In the extreme case, the top k words of a topic could all be inflections of the same word, which would not be very meaningful or expressive. Hence we clean the topics.
For each topic, the function iterates through every word and computes the cosine similarities between all pairs of words. Cleaning proceeds top-down, so the first word is never removed. For instance, if the cosine similarity between word1 and word2 is larger than the specified threshold, word2 is removed from the topic. If the cosine similarity between word2 and word5 is then also larger than the threshold, word5 remains in the topic, because word2 has already been removed.
We compute all pairwise combinations between the words, so a topic of k words has (k-1)*((k-1)+1)/2 = k*(k-1)/2 combinations.
The resulting topics can hence vary in their lengths.
- Parameters:
topics (_type_) – the model's topics
embedding_model (_type_) – BaseEmbedder class, see backend._base.py
similarity (float, optional) – cosine similarity threshold. Defaults to 0.75.
- Returns:
cleaned topics
- Return type:
dict
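The top-down cleaning rule described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the helper names `cosine` and `clean_topic_words` are hypothetical, and a plain dictionary of toy vectors replaces the embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clean_topic_words(
    words: List[str],
    vectors: Dict[str, Sequence[float]],
    similarity: float = 0.75,
) -> List[str]:
    """Keep a word only if it is not too similar to any word kept before it."""
    kept: List[str] = []
    for word in words:
        # Only surviving words are compared against, so an already removed
        # word can never cause the removal of a later word.
        if all(cosine(vectors[word], vectors[k]) <= similarity for k in kept):
            kept.append(word)
    return kept


# Toy 2-d vectors: "tigers" is close to "tiger"; "cat" is close to "tigers"
# but not to "tiger".
toy_vectors = {
    "tiger": [1.0, 0.0],
    "tigers": [0.8, 0.6],
    "cat": [0.28, 0.96],
}
topics = {0: ["tiger", "tigers", "cat"]}
cleaned = {tid: clean_topic_words(ws, toy_vectors) for tid, ws in topics.items()}
print(cleaned)
```

In this toy topic, "tigers" is dropped because of "tiger", while "cat" survives even though it is similar to "tigers", since "tigers" was already removed when "cat" is checked.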
- stream.preprocessor.c_tf_idf(documents, m, ngram_range=(1, 1))[source]#
Class-based TF-IDF retrieval from clusters of documents.
- Parameters:
documents (_type_) – _description_
m (_type_) – _description_
ngram_range (tuple, optional) – _description_. Defaults to (1, 1).
- Returns:
_description_
- Return type:
_type_
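Since the parameters above are undocumented, the following is only a sketch of one common class-based TF-IDF formulation, not necessarily what `c_tf_idf` computes. It assumes that each entry of `class_docs` is the concatenated text of one cluster and that `m` is the total number of original documents. The function name `class_tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def class_tf_idf(class_docs: List[str], m: int) -> List[Dict[str, float]]:
    """Hypothetical sketch: score each term per class (cluster) of documents."""
    # Term counts per class, using naive whitespace tokenization.
    counts = [Counter(doc.lower().split()) for doc in class_docs]
    vocab = sorted(set().union(*counts))
    # Frequency of each term summed over all classes.
    total_per_term = {t: sum(c[t] for c in counts) for t in vocab}

    scores = []
    for c in counts:
        n_words = sum(c.values())
        scores.append({
            # tf: share of the class's words; idf: log(m / overall term frequency).
            t: (c[t] / n_words if n_words else 0.0) * math.log(m / total_per_term[t])
            for t in vocab
        })
    return scores


# Two clusters built from an assumed total of m = 6 original documents.
class_docs = ["tiger tiger zoo", "bank money bank"]
scores = class_tf_idf(class_docs, m=6)
print(scores[0])
```

Terms that never occur in a class get a score of zero there, so each class is characterized by the terms that are frequent within it.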
- stream.preprocessor.extract_tfidf_topics(tf_idf, count, docs_per_topic, n=10)[source]#
Class-based TF-IDF retrieval from clusters of documents.
- Parameters:
tf_idf (_type_) – _description_
count (_type_) – _description_
docs_per_topic (_type_) – _description_
n (int, optional) – _description_. Defaults to 10.
- Returns:
_description_
- Return type:
_type_
- stream.preprocessor.extract_topic_sizes(df)[source]#
Extracts and computes the size of each topic from a given DataFrame.
This function groups the DataFrame by the ‘Topic’ column, which represents topic IDs, and then counts the number of documents associated with each topic. It returns a DataFrame with two columns: ‘Topic’ and ‘Size’, where ‘Size’ is the count of documents in each topic. The returned DataFrame is sorted in descending order of ‘Size’.
- Parameters:
df (pandas.DataFrame) – A DataFrame containing at least two columns, ‘Topic’ and ‘docs’, where ‘Topic’ is an ID column for topics and ‘docs’ contains documents or data points associated with each topic.
- Returns:
A DataFrame with ‘Topic’ and ‘Size’ columns, where ‘Size’ indicates the number of documents in each topic, sorted in descending order of size.
- Return type:
pandas.DataFrame
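The grouping and counting described above can be reproduced with plain pandas. This is a minimal sketch of the documented behaviour on toy data, not the function's actual source.

```python
import pandas as pd

# Toy input with the two documented columns, 'Topic' and 'docs'.
df = pd.DataFrame(
    {
        "Topic": [0, 1, 0, 2, 0, 1],
        "docs": ["a", "b", "c", "d", "e", "f"],
    }
)

# Count documents per topic, name the count column 'Size',
# and sort so that the largest topic comes first.
sizes = (
    df.groupby("Topic")["docs"]
    .count()
    .reset_index(name="Size")
    .sort_values("Size", ascending=False)
)
print(sizes)
```

Here topic 0 has three documents, topic 1 has two, and topic 2 has one, so the rows appear in that order.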