Metrics#

class stream.metrics.NPMI(*args: Any, **kwargs: Any)[source]#

A class for calculating Normalized Pointwise Mutual Information (NPMI) for topics.

NPMI is a metric used in topic modeling to measure the coherence of topics by evaluating the co-occurrence of pairs of words across the documents. Higher NPMI scores typically indicate more coherent topics.

stopwords#

A list of stopwords to exclude from analysis.

Type:

list

ntopics#

The number of topics to evaluate.

Type:

int

dataset#

The dataset used for calculating NPMI.

files#

Processed text data from the dataset.

Type:

list

score(topic_words)[source]#

Calculates the average NPMI score for the given topics.

The method computes the NPMI score for each pair of words in every topic and then averages these scores to evaluate the overall topic coherence.

Parameters:

topic_words (list) – A list of lists containing the words in each topic.

Returns:

The average NPMI score for the topics.

Return type:

float

score_per_topic(topic_words, preprocess=5)[source]#

Calculates NPMI scores per topic for the given set of topics.

This method evaluates the coherence of each topic individually by computing NPMI scores for each pair of words within the topic.

Parameters:
  • topic_words (list) – A list of lists containing words in each topic.

  • ntopics (int) – The number of topics.

  • preprocess (int, optional) – The minimum number of documents a word must appear in. Defaults to 5.

Returns:

A dictionary with topics as keys and their corresponding NPMI scores as values.

Return type:

dict
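
The pairwise computation described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the function name `npmi_per_topic`, the document-level co-occurrence counting, and the conventions for word pairs that never or always co-occur are all assumptions, and the sketch returns a list rather than the dictionary documented above.

```python
import math
from itertools import combinations


def npmi_per_topic(topic_words, documents):
    """Average NPMI over all word pairs of each topic (hypothetical helper).

    topic_words: list of lists of words, one inner list per topic.
    documents:   list of tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for topic in topic_words:
        pair_scores = []
        for w1, w2 in combinations(topic, 2):
            p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
            if p12 == 0.0:
                # Assumed convention: words that never co-occur score -1.
                pair_scores.append(-1.0)
            elif p12 == 1.0:
                # Assumed convention: words present in every document score 1
                # (the formula itself would divide by log(1) = 0 here).
                pair_scores.append(1.0)
            else:
                pmi = math.log(p12 / (p1 * p2))
                pair_scores.append(pmi / -math.log(p12))
        # A topic with fewer than two words has no pairs to score.
        scores.append(sum(pair_scores) / len(pair_scores) if pair_scores else float("nan"))
    return scores


docs = [["cat", "dog"], ["cat", "dog"], ["fish"], ["bird"]]
per_topic = npmi_per_topic([["cat", "dog"], ["cat", "fish"]], docs)
overall = sum(per_topic) / len(per_topic)  # averaging per-topic scores, as score() is described to do
```

Scores lie in [-1, 1]: word pairs that always appear together approach 1, and pairs that never share a document sit at -1 under the convention assumed here.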

class stream.metrics.Embedding_Coherence(*args: Any, **kwargs: Any)[source]#

A metric class to calculate the coherence of topics based on word embeddings. It computes the average cosine similarity between all top words in each topic.

n_words#

The number of top words to consider for each topic.

Type:

int

corpus_dict#

A dictionary mapping each word in the corpus to its embedding.

Type:

dict

embeddings#

The embeddings for the top words of the topics.

Type:

numpy.ndarray

score(model_output)[source]#

Calculates the overall average coherence score for the given model output.

This method computes the overall coherence of the topics by averaging the coherence scores obtained from each topic.

Parameters:

model_output (dict) – The output of a topic model, containing a list of topics.

Returns:

The average coherence score for all topics.

Return type:

float

score_per_topic(model_output)[source]#

Calculates coherence scores for each topic individually based on embedding similarities.

This method computes the coherence of each topic by calculating the average pairwise cosine similarity between the embeddings of the top words in each topic.

Parameters:

model_output (dict) – The output of a topic model, containing a list of topics.

Returns:

An array of coherence scores for each topic.

Return type:

numpy.ndarray
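
As a rough illustration of the average pairwise cosine similarity described above, the following NumPy sketch scores each topic from a matrix of word embeddings. The function name `coherence_per_topic` and the input layout are hypothetical, and excluding a word's similarity with itself is an assumption; the library may handle these details differently.

```python
import numpy as np


def coherence_per_topic(topic_embeddings):
    """Mean pairwise cosine similarity of the word vectors in each topic.

    topic_embeddings: list of arrays, each of shape (n_words, dim), holding
    non-zero embeddings of one topic's top words (hypothetical input layout).
    """
    scores = []
    for emb in topic_embeddings:
        emb = np.asarray(emb, dtype=float)
        # Normalize rows so that dot products become cosine similarities.
        unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sim = unit @ unit.T
        # Keep each unordered pair of distinct words once (upper triangle).
        rows, cols = np.triu_indices(len(emb), k=1)
        scores.append(sim[rows, cols].mean())
    return np.array(scores)


topics = [
    np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [2.0, 0.0]]),
]
per_topic = coherence_per_topic(topics)
overall = float(per_topic.mean())  # overall score as the mean over topics
```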

class stream.metrics.Embedding_Topic_Diversity(*args: Any, **kwargs: Any)[source]#

A metric class to calculate the diversity of topics based on word embeddings. It computes the mean cosine similarity of the mean vectors of the top words of all topics, providing a measure of how diverse the topics are in the embedding space.

n_words#

The number of top words to consider for each topic.

Type:

int

corpus_dict#

A dictionary mapping each word in the corpus to its embedding.

Type:

dict

score(model_output)[source]#

Calculates the overall diversity score for the given model output.

This method computes the diversity of the topics by averaging the cosine similarity of the mean vectors of the top words of each topic. A lower score indicates higher diversity.

Parameters:

model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

Returns:

The overall diversity score for all topics.

Return type:

float

score_per_topic(model_output)[source]#

Calculates diversity scores for each topic individually based on embedding similarities.

This method computes the diversity of each topic by calculating the cosine similarity of its mean vector with the mean vectors of other topics.

Parameters:

model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

Returns:

An array of diversity scores for each topic.

Return type:

numpy.ndarray
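
The comparison of topic mean vectors described above can be sketched as follows. This is a minimal sketch under stated assumptions: the name `centroid_similarity_per_topic` and the input layout are hypothetical, and each topic is compared against every other topic with equal weight.

```python
import numpy as np


def centroid_similarity_per_topic(topic_embeddings):
    """Mean cosine similarity of each topic's mean vector to all other topics.

    topic_embeddings: list of arrays of shape (n_words, dim); at least two
    topics with non-zero mean vectors are assumed.
    """
    centroids = np.stack(
        [np.asarray(emb, dtype=float).mean(axis=0) for emb in topic_embeddings]
    )
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(centroids)
    # Drop the diagonal (each topic compared with itself) and average per row.
    off_diagonal = sim[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    return off_diagonal.mean(axis=1)


topics = [
    np.array([[1.0, 0.0], [3.0, 0.0]]),
    np.array([[0.0, 1.0], [0.0, 2.0]]),
    np.array([[2.0, 0.0], [4.0, 0.0]]),
]
per_topic = centroid_similarity_per_topic(topics)
overall = float(per_topic.mean())  # lower values mean more diverse topics
```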

class stream.metrics.Expressivity(*args: Any, **kwargs: Any)[source]#

A metric class to calculate the expressivity of topics by measuring the cosine similarity between the mean vector of the top words in a topic and the mean vector of the stopword embeddings. Lower similarity suggests higher expressivity, indicating that the topic’s top words are distinct from common stopwords.

stopword_list#

A list of stopwords to use for comparison.

Type:

list

n_words#

The number of top words to consider for each topic.

Type:

int

corpus_dict#

A dictionary mapping each word in the corpus to its embedding.

Type:

dict

embeddings#

The embeddings for the top words of the topics.

Type:

numpy.ndarray

stopword_emb#

The embeddings for the stopwords.

Type:

numpy.ndarray

stopword_mean#

The mean vector of the embeddings of the stopwords.

Type:

numpy.ndarray

score(model_output, new_Embeddings=True)[source]#

Calculates the overall expressivity score for the given model output.

This method computes the expressivity of the topics by averaging the cosine similarity between the mean vectors of the top words of each topic and the mean vector of the stopwords. A lower score indicates higher expressivity.

Parameters:
  • model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

The overall expressivity score for all topics.

Return type:

float

score_per_topic(model_output, new_Embeddings=True)[source]#

Calculates expressivity scores for each topic individually based on embedding similarities.

This method computes the expressivity of each topic by calculating the cosine similarity of its mean vector with the mean vector of the stopwords.

Parameters:
  • model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

An array of expressivity scores for each topic.

Return type:

numpy.ndarray
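
To make the stopword comparison concrete, here is a small NumPy sketch of the cosine similarity between each topic's mean vector and the stopword mean vector. The function name `stopword_similarity_per_topic` and its inputs are hypothetical stand-ins for the `embeddings` and `stopword_emb` attributes, not the library's code.

```python
import numpy as np


def stopword_similarity_per_topic(topic_embeddings, stopword_embeddings):
    """Cosine similarity between each topic centroid and the stopword centroid.

    topic_embeddings:    list of arrays of shape (n_words, dim).
    stopword_embeddings: array of shape (n_stopwords, dim).
    """
    stop_mean = np.asarray(stopword_embeddings, dtype=float).mean(axis=0)
    stop_unit = stop_mean / np.linalg.norm(stop_mean)
    scores = []
    for emb in topic_embeddings:
        centroid = np.asarray(emb, dtype=float).mean(axis=0)
        scores.append(centroid @ stop_unit / np.linalg.norm(centroid))
    return np.array(scores)


stopwords = np.array([[1.0, 0.0], [1.0, 0.0]])
topics = [
    np.array([[2.0, 0.0], [4.0, 0.0]]),  # points the same way as the stopwords
    np.array([[0.0, 1.0], [0.0, 3.0]]),  # orthogonal to the stopwords
]
per_topic = stopword_similarity_per_topic(topics, stopwords)
overall = float(per_topic.mean())
```

In this toy input the first topic is indistinguishable from the stopwords (similarity 1), while the second is orthogonal to them (similarity 0) and would count as the more expressive topic.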

class stream.metrics.INT(*args: Any, **kwargs: Any)[source]#

A metric class to calculate the Intruder Topic Metric (INT) for topics. This metric assesses the distinctiveness of topics by calculating the embedding intruder cosine similarity accuracy. It draws intruder words from other topics and measures how often the top words of a topic are least similar to these intruder words. Higher scores suggest better topic distinctiveness.

n_intruders#

The number of intruder words to draw for each topic.

Type:

int

n_words#

The number of top words to consider for each topic.

Type:

int

corpus_dict#

A dictionary mapping each word in the corpus to its embedding.

Type:

dict

embeddings#

The embeddings for the top words of the topics.

Type:

numpy.ndarray

score(model_output, new_Embeddings=True)[source]#

Calculates the overall INT score for all topics combined using several intruder words.

This method computes the overall INT score by averaging the INT scores obtained from each topic using multiple randomly chosen intruder words.

Parameters:
  • model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

The overall INT score for all topics with several intruder words.

Return type:

float

score_one_intr(model_output, new_Embeddings=True)[source]#

Calculates the overall INT score for all topics combined using only one intruder word.

This method computes the overall INT score by averaging the INT scores obtained from each topic using one randomly chosen intruder word.

Parameters:
  • model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

The overall INT score for all topics with one intruder word.

Return type:

float

score_one_intr_per_topic(model_output, new_Embeddings=True)[source]#

Calculates the INT score for each topic individually using only one intruder word.

This method computes the INT score for each topic by measuring the accuracy with which the top words of the topic are least similar to one randomly chosen intruder word.

Parameters:
  • model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

An array of INT scores for each topic with one intruder word.

Return type:

numpy.ndarray

score_per_topic(model_output, new_Embeddings=True)[source]#

Calculates the INT scores for each topic individually using several intruder words.

This method computes the INT score for each topic by averaging the accuracy scores obtained with multiple randomly chosen intruder words.

Parameters:
  • model_output (dict) – The output of a topic model, containing a list of topics and a topic-word matrix.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

An array of INT scores for each topic with several intruder words.

Return type:

numpy.ndarray
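
One possible reading of the intruder accuracy is sketched below: for every top word, check whether the intruder is less similar to it than any other word of the same topic, and report the fraction of top words for which that holds. The helper names, the input layout, and this exact definition of a "hit" are assumptions made for illustration; the library's procedure may differ.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def intruder_accuracy(topic_emb, intruder_vec):
    """Fraction of top words for which the intruder is the least similar word."""
    topic_emb = np.asarray(topic_emb, dtype=float)
    hits = 0
    for i, word in enumerate(topic_emb):
        others = np.delete(topic_emb, i, axis=0)
        least_similar_peer = min(cosine(word, other) for other in others)
        # Count a hit when the intruder is strictly less similar than every peer.
        hits += cosine(word, intruder_vec) < least_similar_peer
    return hits / len(topic_emb)


def int_one_intruder_per_topic(topic_embeddings, rng):
    """Score every topic against one intruder drawn at random from another topic."""
    scores = []
    for t, emb in enumerate(topic_embeddings):
        other_topics = [k for k in range(len(topic_embeddings)) if k != t]
        source = topic_embeddings[rng.choice(other_topics)]
        intruder = np.asarray(source, dtype=float)[rng.integers(len(source))]
        scores.append(intruder_accuracy(emb, intruder))
    return np.array(scores)


topics = [
    np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]),
    np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]),
]
per_topic = int_one_intruder_per_topic(topics, np.random.default_rng(0))
overall = float(per_topic.mean())
```

Using several intruders per topic would amount to repeating the draw and averaging the resulting accuracies, which is how the multi-intruder methods above are described.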

class stream.metrics.ISH(*args: Any, **kwargs: Any)[source]#

For each topic, several intruder words are drawn from other topics: a number of other topics is selected first, and one word is then drawn from each of them. The embedding intruder distance to mean is the average distance between each intruder word and the mean of the other words.

score(model_output, new_Embeddings=True)[source]#

Calculates the score for all topics combined.
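
The intruder distance to mean described for this class can be sketched as follows. The source does not state which distance is used, so cosine distance (1 minus cosine similarity) is an assumption here, as are the function name `intruder_distance_to_mean` and its inputs.

```python
import numpy as np


def intruder_distance_to_mean(topic_emb, intruder_embs):
    """Average cosine distance between each intruder and the topic's mean vector.

    topic_emb:     array of shape (n_words, dim) with the topic's top words.
    intruder_embs: array of shape (n_intruders, dim), one row per intruder.
    """
    centroid = np.asarray(topic_emb, dtype=float).mean(axis=0)
    intruders = np.asarray(intruder_embs, dtype=float)
    cos = intruders @ centroid / (
        np.linalg.norm(intruders, axis=1) * np.linalg.norm(centroid)
    )
    # Cosine distance is an assumed choice; the library may use another distance.
    return float((1.0 - cos).mean())


topic = np.array([[1.0, 0.0], [3.0, 0.0]])
intruders = np.array([[0.0, 1.0], [1.0, 0.0]])
distance = intruder_distance_to_mean(topic, intruders)
```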

class stream.metrics.ISIM(dataset, n_intruders=1, n_words=10, metric_embedder=sentence_transformers.SentenceTransformer, emb_filename=None, emb_path='Embeddings/', expansion_path='Embeddings/', expansion_filename=None, expansion_word_list=None)[source]#

A metric class to calculate the Intruder Similarity Metric (ISIM) for topics. This metric evaluates the distinctiveness of topics by measuring the average cosine similarity between the top words of a topic and randomly chosen intruder words from other topics. Lower scores suggest higher topic distinctiveness.

n_intruders#

The number of intruder words to draw for each topic.

Type:

int

n_words#

The number of top words to consider for each topic.

Type:

int

corpus_dict#

A dictionary mapping each word in the corpus to its embedding.

Type:

dict

embeddings#

The embeddings for the top words of the topics.

Type:

numpy.ndarray

get_info()[source]#

Get information about the metric.

Returns:

A dictionary containing information about the metric, including the metric name, number of top words, number of intruders, embedding model name, metric range, and metric description.

Return type:

dict

score(topics, new_Embeddings=True)[source]#

Calculates the overall ISIM score for all topics combined using several intruder words.

This method computes the overall ISIM score by averaging the ISIM scores obtained from each topic using multiple randomly chosen intruder words.

Parameters:
  • topics – The topics of a topic model to evaluate.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

The overall ISIM score for all topics with several intruder words.

Return type:

float

score_one_intr(topics, new_Embeddings=True)[source]#

Calculates the overall ISIM score for all topics combined using only one intruder word.

This method computes the overall ISIM score by averaging the ISIM scores obtained from each topic using one randomly chosen intruder word.

Parameters:
  • topics – The topics of a topic model to evaluate.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

The overall ISIM score for all topics with one intruder word.

Return type:

float

score_one_intr_per_topic(topics, new_Embeddings=True)[source]#

Calculates the ISIM score for each topic individually using only one intruder word.

This method computes the ISIM score for each topic by averaging the cosine similarity between one randomly chosen intruder word and the top words of that topic.

Parameters:
  • topics – The topics of a topic model to evaluate.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

An array of ISIM scores for each topic with one intruder word.

Return type:

numpy.ndarray

score_per_topic(topics, new_Embeddings=True)[source]#

Calculates the ISIM scores for each topic individually using several intruder words.

This method computes the ISIM score for each topic by averaging the cosine similarity between multiple randomly chosen intruder words and the top words of that topic.

Parameters:
  • topics – The topics of a topic model to evaluate.

  • new_Embeddings (bool, optional) – Whether to recalculate embeddings. Defaults to True.

Returns:

An array of ISIM scores for each topic with several intruder words.

Return type:

numpy.ndarray
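
The averaging described for ISIM can be illustrated with a short NumPy sketch that works directly on word embeddings. The name `isim_per_topic`, the input layout, and the way intruders are sampled (a random other topic, then a random word from it) are assumptions for illustration and not the library's implementation.

```python
import numpy as np


def isim_per_topic(topic_embeddings, n_intruders, rng):
    """Average cosine similarity between a topic's top words and random intruders.

    topic_embeddings: list of arrays of shape (n_words, dim), non-zero rows.
    n_intruders:      number of intruder words drawn per topic.
    rng:              a numpy.random.Generator used for the random draws.
    """
    scores = []
    for t, emb in enumerate(topic_embeddings):
        emb = np.asarray(emb, dtype=float)
        unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        other_topics = [k for k in range(len(topic_embeddings)) if k != t]
        sims = []
        for _ in range(n_intruders):
            source = np.asarray(topic_embeddings[rng.choice(other_topics)], dtype=float)
            intruder = source[rng.integers(len(source))]
            intruder = intruder / np.linalg.norm(intruder)
            # Mean similarity of this intruder to all top words of the topic.
            sims.append((unit @ intruder).mean())
        scores.append(np.mean(sims))
    return np.array(scores)


rng = np.random.default_rng(0)
distinct = [np.array([[1.0, 0.0], [2.0, 0.0]]), np.array([[0.0, 1.0], [0.0, 3.0]])]
overlapping = [np.array([[1.0, 0.0], [2.0, 0.0]]), np.array([[1.0, 0.0], [3.0, 0.0]])]
low = isim_per_topic(distinct, n_intruders=2, rng=rng)      # orthogonal topics
high = isim_per_topic(overlapping, n_intruders=2, rng=rng)  # near-identical topics
```

The orthogonal topics yield scores of 0 and the overlapping ones yield 1, matching the statement above that lower scores suggest higher topic distinctiveness.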