STREAM#

We present STREAM, a Simplified Topic Retrieval, Exploration, and Analysis Module for user-friendly topic modelling and especially subsequent interactive topic visualization and analysis. For better topic analysis, we implement multiple intruder-word based topic evaluation metrics. Additionally, we publicize multiple new datasets that can extend the so far very limited number of publicly available benchmark datasets in topic modeling. We integrate downstream interpretable analysis modules to enable users to easily analyse the created topics in downstream tasks together with additional tabular information.

Octis#

The core of the STREAM package is built on top of the OCTIS framework and allows seamless integration of all of OCTIS’ multitude of models, datasets, evaluation metrics and hyperparameter optimization techniques. See the Octis Github repository for an overview.

Speed#

Since most of STREAMs models are centered around Document embeddings, STREAM comes along with a set of pre-embedded datasets. Additionally, once a user fits a model that leverages document embeddings, the embeddings are saved and automatically loaded the next time the user wants to fit any model with the same set of embeddings.

Figure Description

Installation#

Since we are currently under review and wish to maintain anonymity, STREAM is not yet available on PyPI. To install STREAM, you can install it directly from the GitHub repository using the following command:

pip install git+https://github.com/AnFreTh/STREAM.git

Make additionally sure to download the necessary nltk ressources, e.g. via:

import nltk
nltk.download('averaged_perceptron_tagger')

Available Models#

Name

Implementation

WordCluTM

Tired of topic models?

CEDC

Topics in the Haystack

DCTE

Human in the Loop

KMeansTM

Simple Kmeans followed by c-tfidf

SomTM

Self organizing map followed by c-tfidf

CBC

Coherence based document clustering

Available (Additional) Metrics#

Name

Description

ISIM

Average cosine similarity of top words of a topic to an intruder word.

INT

For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words.

ISH

calculates the shift in the centroid of a topic when an intruder word is replaced.

Expressivity

Cosine Distance of topics to meaningless (stopword) embedding centroid

Embedding Topic Diversity

Topic diversity in the embedding space

Embedding Coherence

Cosine similarity between the centroid of the embeddings of the stopwords and the centroid of the topic.

NPMI

Classical NPMi coherence computed on the scource corpus.

Available Datasets#

Usage#

To use these models, follow the steps below:

  1. Import the necessary modules:

    from stream.models import CEDC, KmeansTM, DCTE
    from stream.data_utils import TMDataset
    
  2. Get your dataset and data directory:

    dataset = TMDataset()
    
    dataset.fetch_dataset("20NewsGroup")
    
  3. Choose the model you want to use and train it:

    model = CEDC(num_topics=20)
    output = model.train_model(dataset)
    
  4. Evaluate the model using either Octis evaluation metrics or newly defined ones such as INT or ISIM:

    from stream.metrics import ISIM, INT
    
    metric = ISIM(dataset)
    metric.score(output)
    
  5. Score per topic

    metric.score_per_topic(output)
    
  6. Visualize the results:

    from stream.visuals import visualize_topic_model, visualize_topics
    
    visualize_topic_model(
        model,
        reduce_first=True,
        port=8051,
        )
    
Figure Description

Downstream Tasks#

Figure Description

The general formulation of a Neural Additive Model (NAM) can be summarized by the equation:

$$ E(y) = h(β + ∑_{j=1}^{J} f_j(x_j)), $$

where $h(·)$ denotes the activation function in the output layer, such as a linear activation for regression tasks or softmax for classification tasks. $x ∈ R^j$ represents the input features, and $β$ is the intercept. The function $f_j : R → R$ corresponds to the Multi-Layer Perceptron (MLP) for the $j$-th feature.

Let’s consider $x$ as a combination of categorical and numerical features $x{tab}$ and document features $x{doc}$. After applying a topic model, STREAM extracts topical prevalences from documents, effectively transforming the input into $z ≡ (x{tab}, x{top})$, a probability vector over documents and topics. Here, $x{j(tab)}^{(i)}$ indicates the $j$-th tabular feature of the $i$-th observation, and $x{k(top)}^{(i)}$ represents the $i$-th document’s topical prevalence for topic $k$.

For preserving interpretability, the downstream model is defined as:

$$ h(E[y]) = β + ∑_{j=1}^{J} fj(x{j(tab)}) + ∑_{k=1}^{K} fk(x{k(top)}), $$

In this setup, visualizing the shape function k reveals the impact of a topic on the target variable y. For example, in the context of the Spotify dataset, this could illustrate how a topic influences a song’s popularity.

Fitting a downstream model with a pre-trained topic model is straightforward using the PyTorch Trainer class. Subsequently, visualizing all shape functions can be done similarly to the approach described by Agarwal et al. (2021).

Python Example#

from pytorch_lightning import Trainer
from stream.NAM import DownstreamModel

# Instantiate the DownstreamModel
downstream_model = DownstreamModel(
    trained_topic_model=topic_model,
    target_column='popularity',  # Target variable
    task='regression',  # or 'classification'
    dataset=dataset,
    batch_size=128,
    lr=0.0005
)

# Use PyTorch Lightning's Trainer to train and validate the model
trainer = Trainer(max_epochs=10)
trainer.fit(downstream_model)

# Plotting
from stream.visuals import plot_downstream_model
plot_downstream_model(downstream_model)

Contributing and Testing New Models#

We welcome contributions to enhance the functionality of our topic modeling package. To ensure your new models integrate seamlessly, please follow the guidelines and testing instructions provided below.

Steps for Contributing#

  1. Fork the Repository:

    • Fork the repository to your GitHub account.

    • Clone the forked repository to your local machine.

    git clone https://github.com/your-username/your-repository.git
    
    cd your-repository
    
  2. Create a New Branch:

    • Ensure you are on the develop branch and create a new branch for your model development.

    git checkout develop
    git checkout -b new-model-branch
    
  3. Develop Your Model:

    • Navigate to the mypackage/models/ directory.

    • Create your model class file, ensuring it follows the expected structure and naming conventions.

    • Implement the required methods (get_info, fit, predict) and attributes (topic_dict). Optionally, implement beta, theta, or corresponding methods (get_beta, get_theta).

Example Model Structure#

Here is an example of how your model class should be structured:

import numpy as np
from mypackage.models.abstract_helper_models.base import BaseModel, TrainingStatus

class ExampleModel(BaseModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._status = TrainingStatus.NOT_STARTED

    def get_info(self):
        return {"model_name": "ExampleModel", "trained": False}

    def any_other_processing_functions(self):
        pass

    def fit(self, dataset, n_topics=3):
        # do what you do during fitting the models
        self._status = TrainingStatus.INITIALIZED
        self._status = TrainingStatus.RUNNING
        self._status = TrainingStatus.SUCCEEDED

    def predict(self, texts):
        return [0] * len(texts)

    # If self.beta or self.theta are not assigned during fitting, plese include these two methods
    def get_beta(self):
        return self.beta

    def get_theta(self):
        return self.theta

Testing Your Model#

  1. Install Dependencies:

    • Ensure all dependencies are installed.

    pip install -r requirements.txt
    
  2. Validate Your Model:

    • To validate your model, use tests/validate_new_model.py to include your new model class.

    from tests.model_validation import validate_model
    
    validate_model(NewModel)
    

    If this validation fails, it will tell you

Validation Criteria#

The following checks are performed during validation:

  • Presence of required methods (get_info, fit, predict).

  • Presence of required attributes (topic_dict).

  • Either presence of optional attributes (beta, theta) or corresponding methods (get_beta, get_theta).

  • Correct shape and sum of theta.

  • Proper status transitions during model fitting.

  • get_info method returns a dictionary with model_name and trained keys.

Refer to the tests/model_validation.py script for detailed validation logic.

Submitting Your Contribution#

  1. Commit Your Changes:

    • Commit your changes to your branch.

    git add .
    git commit -m "Add new model: YourModelName"
    
  2. Push to GitHub:

    • Push your branch to your GitHub repository.

    git push origin new-model-branch
    
  3. Create a Pull Request:

    • Go to the original repository on GitHub.

    • Create a pull request from your forked repository and branch.

    • Provide a clear description of your changes and request a review.

We appreciate your contributions and strive to make the integration process as smooth as possible. If you encounter any issues or have questions, feel free to open an issue on GitHub. Happy coding!

If you want to include a new model where these guidelines are not approriate please mark this in your review request.

Citation#

If you find this project useful in your research, please consider cite:

@inproceedings{
anonymous2024stream,
title={{STREAM}: Simplified Topic Retrieval, Exploration, and Analysis Module},
author={Anonymous},
booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics},
year={2024},
url={https://openreview.net/forum?id=9CPhnMmvgZ}
}