Homework 3: Topic Modeling
In this homework, you will work with the PubMed dataset, which you can find in Moodle. It contains 2684 articles on women's health retrieved from the PubMed database and spans the years 1990 to 2023. We are mainly interested in the abstracts of the articles, but the dataset also includes additional metadata, such as PMID number, date of publication, authors, and title.
[max 1 page] Familiarize yourself with the corpus and prepare it for downstream analysis.
- Use the Distributions widget to inspect the distribution of the articles by year. What do you notice?
- Construct a preprocessing pipeline. Consider the options to remove numbers. Is the Word Cloud better or worse if the numbers are removed?
- Now use the Document frequency filter to remove all tokens that appear in fewer than 10 articles. Why is this useful? What is an alternative way of dealing with very rare words?
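As a rough illustration of what a document frequency filter does, here is a minimal Python sketch. The function name `filter_by_document_frequency`, the toy corpus, and the threshold of 2 are assumptions made for this example only; in the homework itself the filtering is done in Orange with a threshold of 10.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=10):
    """Drop tokens that occur in fewer than `min_df` documents."""
    # Count each token once per document, not once per occurrence.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    keep = {tok for tok, df in doc_freq.items() if df >= min_df}
    return [[tok for tok in tokens if tok in keep] for tokens in tokenized_docs]


# Toy corpus invented for illustration; a low threshold keeps the example small.
docs = [
    ["breast", "cancer", "screening"],
    ["cancer", "therapy", "cancer"],
    ["pregnancy", "screening", "cancer"],
]
filtered = filter_by_document_frequency(docs, min_df=2)
print(filtered)
```

Note that the count is per document: "cancer" appears twice in the second toy document but contributes only one to its document frequency.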
[max 1 page] Use Bag of Words with TF-IDF and feed the data to Topic Modelling. Use LDA to retrieve topics.
- Compare the results for 5 and 10 topics. Which results make more sense and why?
- For simplicity, use 5 topics and name them. Briefly describe how you decided on the name for each topic.
- Attach a screenshot of topic frequency. Which topic is the most important?
[max 2 pages] Export the results of Topic Modelling to .csv. Now use Python. Create a new variable Date, which contains only the year from the Publication Date. Draw a plot that shows topic frequencies over time (topic frequency per year). Attach a plot for each topic and comment on it.
Alternatively, install the Timeseries add-on in Orange (Options -- Add-ons). Use Moving Transform (Aggregate time periods: Years) to extract only the year. Select all Topic features and use non-zero count. Then use Line Chart (with Type: column) to display instances by year. You can select multiple variables to compare them with one another.
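For the Python route, the year extraction and per-year aggregation could look roughly like the sketch below. The column names ("Publication Date", "Topic 1", "Topic 2"), the date format, and the values are invented stand-ins for the exported file, and "frequency" is taken here to mean the mean topic weight per year, which is only one possible definition.

```python
import io

import pandas as pd

# Stand-in for the exported .csv; column names and values are invented.
csv_text = """Publication Date,Topic 1,Topic 2
1990-03-12,0.8,0.2
1990-11-02,0.6,0.4
1991-06-30,0.1,0.9
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Publication Date"])

# New variable Date holding only the year of publication.
df["Date"] = df["Publication Date"].dt.year

# Average weight of each topic among the articles published in a given year.
topic_cols = [c for c in df.columns if c.startswith("Topic")]
per_year = df.groupby("Date")[topic_cols].mean()
print(per_year)
```

With matplotlib installed, `per_year["Topic 1"].plot()` would draw one line for a single topic. If a count is preferred over a mean, replacing `.mean()` with a count of non-zero values is closer to the Orange alternative.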
[max 2 pages] Use Python for this part. Install BERTopic (pip install bertopic) and familiarize yourself with the library (https://maartengr.github.io/BERTopic/index.html). Import PubMed data (hint: use skiprows in pandas to skip the second and third row). Please set a random seed for reproducible results.
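from bertopic import BERTopic
from umap import UMAP
# Fixing UMAP's random state makes repeated runs produce the same topics
umap_model = UMAP(random_state=42)
topic_model = BERTopic(umap_model=umap_model)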
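The import hint can be sketched as follows. The file contents below are an invented stand-in, and the sketch assumes that the second and third rows hold extra header information rather than articles; the column names are likewise assumptions.

```python
import io

import pandas as pd

# Invented stand-in for the data file: one header row, two extra rows, then articles.
raw = """PMID,Title,Abstract
continuous,string,string
,meta,meta
101,First title,First abstract
102,Second title,Second abstract
"""

# skiprows takes 0-based line indices, so [1, 2] drops the second and third rows.
data = pd.read_csv(io.StringIO(raw), skiprows=[1, 2])

# A plain list of abstract strings, with missing abstracts dropped.
docs = data["Abstract"].dropna().tolist()
print(len(docs))
```

For the real file, the `io.StringIO(raw)` argument would be replaced by the path to the downloaded data.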
Next, extract the topics.
- How many topics did you get?
- What is the first topic about?
- Which is the most frequent topic? Does it make sense in the context of medical research on women's health?
- Use dynamic topic modeling to extract topics over time. Set the number of bins to 23 (to have them correspond to years). Visualize the top 10 topics and comment on the results. Don't forget to include the plot in the report.
[max 1 page] Compare the LDA and BERTopic results. Which are better and why?
It is advisable to attach screenshots to support your comments.
Submit the homework as a short report in PDF. The report should include the title of the homework ("Topic Modeling"), your name and email, and the following sections that directly follow the numbered homework items above. Also, please note:
All figures you use in the report should be numbered and include a caption. For example, "Fig. 1: Workflow for reading the data, visualizing it as a scatter plot, and performing hierarchical clustering and analysis on a selected cluster." All figures should be referred to in the text. For instance, "The hierarchical clustering dendrogram (Fig. 1) reveals three distinct clusters."
The figures should be legible. Do not reduce the size to the point where the text is not readable. Do not distort or skew the screenshots and the graph; keep the aspect ratio when reducing size.
Spell-check the text in your report. Also, if possible, use a grammar checker. The report should read well and should be neatly set.
Read your final version of the report before submitting it.
- 30 November 2023, 8:45 AM