Preprocessor

class stream.preprocessor.BaseEmbedder(embedding_model)

Base Embedder Class

This class provides a base for creating document and word embeddings using different models.

Parameters:

embedding_model – The embedding model used for generating embeddings.

embedder

The specific backend embedder used for generating embeddings.

embedding_model

The embedding model used for generating embeddings.

_check_documents_type(documents: List[str]): Check if the provided documents are of the correct type.

_clean_docs(documents: List[str]): Clean and preprocess a list of documents.

create_doc_embeddings(documents: List[str], progress: bool = False): Create document embeddings.

create_word_embeddings(word: List[str]): Create word embeddings.

create_doc_embeddings(documents, progress=False)

Create document embeddings for a list of documents.

Parameters:
  • documents (List[str]) – List of documents to create embeddings for.

  • progress (bool, optional) – Controls the verbosity of the process.

Returns:

Document embeddings.

Return type:

np.ndarray

create_word_embeddings(word)

Create word embeddings for a list of words.

Parameters:

word (List[str]) – List of words to create embeddings for.

Returns:

Word embeddings.

Return type:

np.ndarray

class stream.preprocessor.GensimBackend(embedding_model)

Gensim Embedding Model

This class provides functionality to create document embeddings using Gensim Word2Vec embeddings.

Parameters:

embedding_model (Word2VecKeyedVectors) – A Gensim Word2Vec model for word embeddings.

embedding_model

The Gensim Word2Vec model used for embeddings.

Type:

Word2VecKeyedVectors

encode(documents: List[str], verbose: bool = False) -> np.ndarray: Embed a list of documents/words into a matrix of embeddings.

encode(documents, verbose=False)

Embed a list of documents/words into an n-dimensional matrix of embeddings.

Parameters:
  • documents (List[str]) – A list of documents or words to be embedded.

  • verbose (bool, optional) – Controls the verbosity of the process.

Returns:

Document/word embeddings of shape (n, m): n documents/words, each with an embedding of size m.

Return type:

np.ndarray
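The shape contract above (n inputs in, an (n, m) matrix out) can be illustrated with a small stand-alone sketch. It assumes that a document vector is the mean of its word vectors, which is a common way to build document embeddings from Word2Vec but is not confirmed here as what GensimBackend.encode does. The function name encode_by_averaging, the whitespace tokenization, and the toy vectors are all hypothetical; a plain dict stands in for the Gensim model.

```python
import numpy as np


def encode_by_averaging(documents, word_vectors, dim):
    """Return an (n, dim) matrix with one row per input document.

    Assumption: each document is embedded as the mean of the vectors of its
    known tokens. `word_vectors` is any mapping from token to vector; here a
    dict stands in for a Gensim KeyedVectors object.
    """
    rows = []
    for doc in documents:
        # Naive whitespace tokenization; unknown tokens are skipped.
        vecs = [word_vectors[tok] for tok in doc.split() if tok in word_vectors]
        # Fall back to a zero vector when no token is known.
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(rows)


# Hypothetical 2-dimensional toy vectors.
toy_vectors = {
    "topic": np.array([1.0, 0.0]),
    "model": np.array([0.0, 1.0]),
}
docs = ["topic model", "model", "unknown"]
embeddings = encode_by_averaging(docs, toy_vectors, dim=2)
print(embeddings.shape)
```

Single words can be passed the same way as documents, which matches the "documents or words" wording of the parameter description.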

class stream.preprocessor.TextPreprocessor(**kwargs)

add_custom_stopwords(stopwords)

Add custom stopwords to the preprocessor.

Parameters:

stopwords (set) – Set of custom stopwords to be added.

remove_custom_stopwords(stopwords)

Remove custom stopwords from the preprocessor.

Parameters:

stopwords (set) – Set of custom stopwords to be removed.

class stream.preprocessor.TopicExtractor(dataset, topic_assignments, n_topics, embedding_model)

stream.preprocessor.clean_topics(topics, embedding_model, similarity=0.75)

Cleans the topics based on the cosine similarity between all words in each topic. Although we only extract nouns and lemmatize them, it is still possible that, for example, “tiger” and “tigers” are both among the top words of a topic. In the extreme case, the top k words of a topic could all be inflections of the same word, which would not be very meaningful or expressive. Hence we clean the topics.

For each topic, the function iterates through every word and computes the cosine similarities between all pairs of words. Cleaning proceeds top-down, which means the first word is never removed. If, for instance, the cosine similarity between word1 and word2 is larger than the specified threshold, word2 is removed from the topic. If the cosine similarity between word2 and word5 is also larger than the threshold, word5 nevertheless remains in the topic, because word2 has already been removed.

We compare all pairwise combinations of words, so a topic of k words has (k-1)*((k-1)+1)/2 = k(k-1)/2 combinations.

The resulting topics can therefore vary in length.

Parameters:
  • topics (_type_) – the model's topics

  • embedding_model (_type_) – BaseEmbedder instance, see backend._base.py

  • similarity (float, optional) – cosine similarity threshold. Defaults to 0.75.

Returns:

cleaned topics

Return type:

dict
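The top-down cleaning described above can be sketched for a single topic as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names cosine and clean_topic_words are hypothetical, and a dict of hand-picked 2-dimensional vectors stands in for the embedding model. The toy data reproduces the case from the description in which a later word survives because the word it resembles has already been removed.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def clean_topic_words(words, vectors, similarity=0.75):
    """Greedy top-down cleaning of one topic's ranked word list.

    A word is kept only if its similarity to every word kept so far does not
    exceed the threshold. Removed words are never used for comparison, so the
    first word is always kept.
    """
    kept = []
    for word in words:
        if all(cosine(vectors[word], vectors[k]) <= similarity for k in kept):
            kept.append(word)
    return kept


# Hypothetical toy vectors: "tigers" is close to "tiger"; "cub" is close to
# "tigers" (about 0.81) but not to "tiger" (about 0.30).
vectors = {
    "tiger": np.array([1.0, 0.0]),
    "tigers": np.array([0.8, 0.6]),
    "cub": np.array([0.3, 0.95]),
}
topic = ["tiger", "tigers", "cub"]
cleaned = clean_topic_words(topic, vectors, similarity=0.75)
print(cleaned)
```

Here "tigers" is dropped because it is too similar to "tiger", while "cub" stays even though it is similar to "tigers", since "tigers" is no longer part of the topic when "cub" is examined.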

stream.preprocessor.c_tf_idf(documents, m, ngram_range=(1, 1))

Class-based TF-IDF retrieval from a cluster of documents.

Parameters:
  • documents (_type_) – _description_

  • m (_type_) – _description_

  • ngram_range (tuple, optional) – _description_. Defaults to (1, 1).

Returns:

_description_

Return type:

_type_
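Since the parameters above are undocumented, the following sketch only illustrates the general idea of class-based TF-IDF and should not be read as this function's actual behavior. It assumes that each entry of the input is the concatenated text of one cluster, that m is the total number of original documents, and that the weighting is term frequency within the cluster times log(m / total term frequency); all three are assumptions. The name class_tf_idf and the whitespace tokenizer are hypothetical, and n-grams are omitted.

```python
import numpy as np


def class_tf_idf(class_docs, m):
    """Score terms per cluster (class) rather than per document.

    class_docs: one string per cluster (assumed: all of the cluster's
                documents joined together, each non-empty).
    m:          assumed to be the total number of original documents.
    Returns a (n_classes, n_terms) score matrix and the vocabulary.
    """
    vocab = sorted({tok for doc in class_docs for tok in doc.split()})
    index = {tok: j for j, tok in enumerate(vocab)}

    # Raw term counts per class.
    t = np.zeros((len(class_docs), len(vocab)))
    for i, doc in enumerate(class_docs):
        for tok in doc.split():
            t[i, index[tok]] += 1

    tf = t / t.sum(axis=1, keepdims=True)  # term frequency within each class
    idf = np.log(m / t.sum(axis=0))        # rarer terms across classes weigh more
    return tf * idf, vocab


# Hypothetical toy input: two clusters built from four original documents.
class_docs = ["tiger tiger zoo", "stock market zoo"]
scores, vocab = class_tf_idf(class_docs, m=4)
top_word_class_0 = vocab[int(np.argmax(scores[0]))]
print(top_word_class_0)
```

Terms that are frequent in one cluster but rare overall receive the highest scores, which is what makes them candidates for that cluster's topic words.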

stream.preprocessor.extract_tfidf_topics(tf_idf, count, docs_per_topic, n=10)

Class-based TF-IDF retrieval from a cluster of documents.

Parameters:
  • tf_idf (_type_) – _description_

  • count (_type_) – _description_

  • docs_per_topic (_type_) – _description_

  • n (int, optional) – _description_. Defaults to 10.

Returns:

_description_

Return type:

_type_

stream.preprocessor.extract_topic_sizes(df)

Extracts and computes the size of each topic from a given DataFrame.

This function groups the DataFrame by the ‘Topic’ column, which represents topic IDs, and then counts the number of documents associated with each topic. It returns a DataFrame with two columns: ‘Topic’ and ‘Size’, where ‘Size’ is the count of documents in each topic. The returned DataFrame is sorted in descending order of ‘Size’.

Parameters:

df (pandas.DataFrame) – A DataFrame containing at least two columns, ‘Topic’ and ‘docs’, where ‘Topic’ is an ID column for topics and ‘docs’ contains documents or data points associated with each topic.

Returns:

A DataFrame with ‘Topic’ and ‘Size’ columns, where ‘Size’ indicates the number of documents in each topic, sorted in descending order of size.

Return type:

pandas.DataFrame
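The grouping and counting described above can be reproduced with plain pandas. This is a sketch of the described behavior rather than the function's source; the name topic_sizes and the toy data are hypothetical, while the column names ‘Topic’, ‘docs’, and ‘Size’ come from the description.

```python
import pandas as pd


def topic_sizes(df):
    """Count documents per topic and sort topics by size, largest first."""
    return (
        df.groupby("Topic")["docs"]
        .count()                              # documents per topic
        .reset_index()                        # 'Topic' back to a column
        .rename(columns={"docs": "Size"})
        .sort_values("Size", ascending=False)
        .reset_index(drop=True)
    )


# Hypothetical toy data: six documents assigned to three topics.
df = pd.DataFrame(
    {
        "Topic": [0, 1, 1, 2, 1, 0],
        "docs": ["a", "b", "c", "d", "e", "f"],
    }
)
sizes = topic_sizes(df)
print(sizes)
```

In this toy input, topic 1 has three documents, topic 0 has two, and topic 2 has one, so the rows come out in that order.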