Homework 2: Document Vectorization and Classification
In this homework, you will work with the mtsamples dataset, a collection of medical transcription records, each labeled with a medical specialty. The sample used here is small and contains only three classes: cardiovascular/pulmonary, neurology, and orthopedic.
[max 1 page] Familiarize yourself with the data and prepare it for downstream analysis.
- Look at the data in a Word Cloud. Which words are the most frequent? What are such words called?
- After default preprocessing, does the Word Cloud look better? If not, what would you improve? Explain the preprocessing pipeline you intend to use. It can be simple, but defend the choice of each step in a sentence or two.
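To make the effect of preprocessing concrete, here is a minimal sketch in plain Python of what a simple pipeline does to word frequencies. The stop-word list, the minimum token length, and the two example sentences are assumptions made up for illustration; they are not taken from the mtsamples data or from the defaults of any particular tool.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "was", "is", "in", "with", "on", "for"}

def preprocess(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase, keep alphabetic tokens, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

def top_words(docs, n=5, **kwargs):
    """Count token frequencies over all documents, as a word cloud would."""
    counts = Counter()
    for doc in docs:
        counts.update(preprocess(doc, **kwargs))
    return counts.most_common(n)

# Made-up sentences in the style of a medical transcription.
docs = [
    "The patient was admitted with chest pain and shortness of breath.",
    "The patient is a 45-year-old with pain in the left knee.",
]

# Without filtering, function words such as "the" dominate the counts.
raw = top_words(docs, stop_words=set(), min_len=1)
# With filtering, content words such as "patient" and "pain" come first.
clean = top_words(docs)

print("unfiltered:", raw)
print("filtered:  ", clean)
```

Each step in such a pipeline (lowercasing, tokenization, filtering) is a choice you should be able to justify in your report.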
[max 1 page] Once you are satisfied with preprocessing, pass the data to Document Embedding. This time, use Multilingual SBERT.
- What data was SBERT trained on?
- Why is SBERT more appropriate for longer texts than FastText?
[max 1 page] Train a logistic regression and naive Bayes model.
- Which of the two reports a higher accuracy and AUC?
- Could both models be used in practice?
- Can either of the two models be explained? If yes, how?
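When comparing the two models, it helps to remember that accuracy and AUC measure different things. The sketch below is a simplified, binary illustration in plain Python. The labels and scores are invented, and "model A" and "model B" are hypothetical. For three classes, tools usually average AUC over the classes, which this sketch does not do.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Binary AUC: the probability that a randomly chosen positive example
    gets a higher score than a randomly chosen negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and predicted probabilities for two hypothetical models.
y_true = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
scores_b = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

# Turn probabilities into class predictions with a 0.5 threshold.
pred_a = [int(s >= 0.5) for s in scores_a]
pred_b = [int(s >= 0.5) for s in scores_b]

print("A: accuracy", accuracy(y_true, pred_a), "AUC", auc(y_true, scores_a))
print("B: accuracy", accuracy(y_true, pred_b), "AUC", auc(y_true, scores_b))
```

In this toy example both models have the same accuracy but a different AUC, because AUC looks at how well the scores rank the examples rather than at the thresholded predictions.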
[max 2 pages] Now use bag of words on the same preprocessed data.
- In your own words, explain what the TF-IDF transformation is and use it. Provide a figure of the TF-IDF results. How does it differ from plain (count-only) vectorization?
- Construct an additional Test and Score pipeline for bag of words and once again train LR and NB. How do the accuracy and AUC compare this time?
- Explain the more accurate model. For each class, list the words that define it and in which direction (presence or absence).
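As a reference point for the comparison between count-only and TF-IDF vectorization, here is a minimal sketch in plain Python. It assumes raw term counts for TF and log(N / df) for IDF. This is only one of several common variants, and the formula used by your software may include smoothing or normalization. The three toy documents are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    TF is the raw count of the term in the document; IDF is log(N / df),
    where N is the number of documents and df is the number of documents
    that contain the term. Other variants give different numbers.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]

# Invented, already preprocessed documents.
docs = [
    ["patient", "chest", "pain", "chest"],
    ["patient", "knee", "pain"],
    ["patient", "seizure", "headache"],
]

weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(dict(Counter(doc)), "->", {t: round(v, 3) for t, v in w.items()})
```

Under this variant, a word that appears in every document ("patient") receives weight zero, while a word concentrated in one document ("chest") is weighted up, even though plain counts would treat "patient" as just as important.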
[max 2 pages] Now use t-SNE after Document Embedding.
- Color the plot with class labels. Does the plot make sense?
- Use the Annotated Corpus Map to further explain the plot. Do the clusters from Annotated Corpus Map correspond with class labels?
- Use Gaussian mixture models and set the number of clusters to three. Are the results better or worse?
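Whether clusters correspond with class labels can also be checked numerically, in addition to inspecting the plot. The sketch below, in plain Python, builds a cluster-by-class contingency table and computes cluster purity. This is one possible measure and an assumption on our part, not something the homework requires. The class names are shortened and the cluster assignments are invented.

```python
from collections import Counter, defaultdict

def contingency(class_labels, cluster_ids):
    """Count, for every cluster, how many documents of each class it contains."""
    table = defaultdict(Counter)
    for label, cluster in zip(class_labels, cluster_ids):
        table[cluster][label] += 1
    return table

def purity(class_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    table = contingency(class_labels, cluster_ids)
    majority = sum(max(counts.values()) for counts in table.values())
    return majority / len(class_labels)

# Invented example: nine documents, three classes, three clusters.
classes = ["cardio", "cardio", "cardio",
           "neuro", "neuro", "neuro",
           "ortho", "ortho", "ortho"]
clusters = [0, 0, 1, 1, 1, 1, 2, 2, 0]

for cluster, counts in sorted(contingency(classes, clusters).items()):
    print("cluster", cluster, dict(counts))
print("purity:", round(purity(classes, clusters), 3))
```

A purity of 1.0 would mean that every cluster contains documents of a single class. Lower values indicate that the clusters mix classes.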
It is advisable to attach screenshots to support your comments.
Submit the homework as a short report in PDF. The report should include the title of the homework ("Document Vectorization and Classification"), your name and email, and sections that directly follow the numbered homework items above. Also, please note:
All figures used in the report should be numbered and include a caption. For example, "Fig. 1: Workflow for reading the data, visualizing it as a scatter plot, and performing hierarchical clustering and analysis on a selected cluster." All figures should be referred to in the text. For instance, "The hierarchical clustering dendrogram (Fig. 1) reveals three distinct clusters."
The figures should be legible. Do not reduce their size to the point where the text is not readable. Do not distort or skew screenshots or graphs; keep the aspect ratio when reducing size.
Spell-check the text in your report. Also, if possible, use a grammar checker. The report should read well and should be neatly set.
Read your final version of the report before submitting it.