Topic outline

  • General

    Welcome to the home page of the Text Mining Course. The course consists of short chapters that we will cover in each lecture. Each lecture will finish with a homework assignment. Note that assignments have fixed dates.

    This course is offered in the xAIM master's program. The course is co-financed by the European Commission instrument "Connecting Europe Facility (CEF) Telecom" through the project "eXplainable Artificial Intelligence in Healthcare Management" (2020-EU-IA-0098), grant agreement number INEA/CEF/ICT/A2020/2276680.


    Copyright

    This material is offered under a Creative Commons CC BY-NC-ND 4.0 license.

    Software

    • Orange Data Mining, visit this webpage to install Orange and check out its documentation

    Additional Material and Communication Channels

    Dates and Online Sessions

    • Blaž: Oct 26, 2023, Zoom
    • Ajda: Nov 8, 2023, Zoom
    • Ajda: Nov 22, 2023, Zoom
    • Blaž: Dec 7, 2023

    Grading

    • 70% Four homework assignments, one after each online session
    • 30% Exam, about 20 multiple-choice questions, lasting 1.5 hours

    Exam

    Between Feb 16, 2024 and Mar 15, 2024.

  • Words, Embedding, Clustering & Classification

    In our 'Introduction to Data Science' course, we learned that machine learning techniques depend on feature-based characterization of objects. Our data was consistently represented as a matrix, with data instances (or objects) in rows and their feature-based representations in columns. The challenge with text is that we initially lack such a representation; we must devise one. First, we will focus solely on words, converting them into numbers using embeddings. For these embeddings, we will employ existing neural networks that have already been trained on vast text corpora. Our first lesson will be about using these embeddings for clustering and classification.

    Videos

    Lecture notes

  • Document Vectorization and Classification

    Text Preprocessing

    One of the first steps, if not the first, when working with text data is to preprocess it. Preprocessing defines the core units of the analysis. While in standard data mining we usually already work with tabular data, where instances are in rows and are described by features, with text this is often not the case. We therefore need some steps to prepare our texts for downstream analysis, and these steps are collectively called preprocessing.
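    As a minimal sketch in plain Python (the stop-word list below is invented for illustration; real pipelines, such as Orange's Text add-on, use larger curated lists), preprocessing might lowercase the text, split it into word tokens, and drop stop words:

```python
import re

# A tiny illustrative stop-word list; real pipelines use curated lists.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)   # punctuation is discarded
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The patient is recovering well, and the fever is gone."))
# → ['patient', 'recovering', 'well', 'fever', 'gone']
```

    The resulting token list is the input to the vectorization steps covered later in the course.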

    Bag of Words

    Text is a complex data form that cannot be used for machine learning in its raw form. Hence we need to convert text into something the computer can work with, such as numbers: ideally, numbers that sum up the content of the document. A simple approach to describing a document with numbers is called bag-of-words, a text vectorization approach that takes the tokens identified in preprocessing and counts their occurrences in documents. If our tokens are words, then the new numeric columns correspond to words, and their values are the number of times each word appears in the document. A more elegant approach to computing word frequencies is term frequency-inverse document frequency (TF-IDF), which down-weights words that appear frequently across all documents and up-weights those that are characteristic of a small number of documents.
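    The counting described above can be sketched in plain Python (the three example documents are invented; libraries compute the same at scale, often with a smoothed IDF variant):

```python
import math
from collections import Counter

# Three invented, already-tokenized documents.
docs = [
    ["fever", "cough", "fever"],
    ["cough", "headache"],
    ["fever", "cough", "headache"],
]

# Bag of words: one column per vocabulary word, values are raw counts.
vocab = sorted({w for doc in docs for w in doc})
bow = [[Counter(doc)[w] for w in vocab] for doc in docs]

# TF-IDF: term count scaled by log(N / document frequency).
n_docs = len(docs)
df = {w: sum(w in doc for doc in docs) for w in vocab}
tfidf = [
    [Counter(doc)[w] * math.log(n_docs / df[w]) for w in vocab]
    for doc in docs
]
```

    Note that "cough" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF weight vanishes: exactly the down-weighting of ubiquitous words described above.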

    Document Embedding

    While bag-of-words is a very simple vectorization technique, there are more advanced ones, too. They are called document embeddings and are based on pre-trained models. These models take the tokens and embed each word separately; the word vectors are then combined into a single document vector. Popular embedding models are fastText and BERT.
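    One simple way to combine per-word embeddings into a document vector is to average them. A sketch with invented two-dimensional word vectors (real models such as fastText produce vectors with hundreds of dimensions learned from large corpora):

```python
# Invented toy word vectors for illustration only.
word_vectors = {
    "fever":   [0.9, 0.1],
    "cough":   [0.8, 0.2],
    "invoice": [0.1, 0.9],
    "payment": [0.2, 0.8],
}

def embed_document(tokens):
    """Embed a document as the average of its known word vectors."""
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(embed_document(["fever", "cough"]))   # a "medical" document vector
```

    Documents about similar subjects end up with nearby vectors, which is what makes clustering and classification on top of embeddings work.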

    Document Classification

    Document classification means training a classifier to label documents, and its applications are many and varied.

    For medical data sets, applications are equally diverse. They range from predicting patient readmission from free-text notes and identifying severely injured patients from dispatch calls to correlating sentiment prediction with self-reported depression and representing proteins as text to predict their characteristics.
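    As an illustration of the idea, here is a minimal nearest-centroid classifier over bag-of-words vectors (the training documents and labels are invented; in practice one would use a library classifier on a real corpus):

```python
from collections import Counter

# Invented training set: token lists with class labels.
train = [
    (["fever", "cough", "fatigue"], "medical"),
    (["headache", "fever"], "medical"),
    (["invoice", "payment", "due"], "finance"),
    (["payment", "refund"], "finance"),
]

vocab = sorted({w for tokens, _ in train for w in tokens})

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Average the bag-of-words vectors of each class into a centroid.
centroids = {}
for label in {lbl for _, lbl in train}:
    vecs = [vectorize(toks) for toks, lbl in train if lbl == label]
    centroids[label] = [sum(col) / len(vecs) for col in zip(*vecs)]

def classify(tokens):
    """Label a new document by its closest class centroid."""
    v = vectorize(tokens)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

print(classify(["cough", "fever"]))   # → medical
```

    The same pattern, vectorize and then fit a classifier, underlies all the medical applications listed above; only the corpus, the labels, and the (usually far stronger) classifier change.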

    Lecture Notes

  • Topic Modeling and Sentiment Analysis

    Topic Modeling

    Another way to organize unlabelled documents is topic modeling, a technique that aims to discover latent topics in texts. Unlike clustering, topic modeling looks at word distributions and infers topics from them: the more frequently words appear together, the more likely they form a topic.

    One of the most popular topic modeling techniques is Latent Dirichlet Allocation, or LDA for short. LDA is a generative model that starts with randomly assigned topics, which are iteratively updated based on the probabilities of words in a topic. Nowadays, a popular topic model is BERTopic.
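    A sketch of LDA using scikit-learn (assuming scikit-learn is available; the four-document corpus is invented, and real corpora need far more text for meaningful topics):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus with two intuitive themes (health vs. finance).
docs = [
    "fever cough fever fatigue",
    "cough headache fever",
    "invoice payment due payment",
    "payment refund invoice",
]

# LDA works on raw word counts, so vectorize with a bag of words first.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic proportions

# Each row of components_ is a topic: a weight for every vocabulary word.
print(doc_topics.shape)   # (4 documents, 2 topics)
```

    Inspecting the highest-weighted words in each row of `components_` is how one reads off what each discovered topic is "about".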

    Sentiment Analysis

    Sentiment analysis (or opinion mining) is the task of extracting sentiment from text data. Medical sentiment analysis deals with medical data, such as electronic health records (EHRs), patient questionnaires, and similar sources. The approaches also consider the specifics of medical language (abbreviations, unique vocabulary, higher verb frequency).

    There are three approaches to sentiment extraction:

    • lexicon-based
    • machine learning
    • hybrid
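    A minimal sketch of the lexicon-based approach (the miniature lexicon below is invented; real systems rely on curated resources such as VADER or the Hu and Liu opinion lexicon listed under Literature):

```python
# Invented miniature sentiment lexicon mapping words to polarity scores.
LEXICON = {"good": 1, "improved": 1, "relief": 1,
           "pain": -1, "worse": -1, "nausea": -1}

def sentiment(tokens):
    """Sum the lexicon scores of the tokens; the sign gives the polarity."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment(["patient", "improved", "pain", "relief"]))   # → positive
```

    Machine-learning approaches instead train a classifier on labelled examples, and hybrid approaches combine both, for instance by using lexicon scores as features.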

    Word Enrichment

    If the corpus contains a target variable, for example, whether a patient is diagnosed with cancer or not, then we can build predictive models and observe their meaning and accuracy. However, most corpora do not contain a target variable, yet we would still like to organize the data in a sensible way, for example, so that similar documents are grouped together. This is called clustering, and it is a part of unsupervised machine learning.
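    Grouping similar documents relies on a measure of document similarity; a common choice is cosine similarity between document vectors, on top of which clustering algorithms such as k-means operate. A minimal sketch with an invented mini-corpus:

```python
import math
from collections import Counter

# Invented mini-corpus, already tokenized.
docs = {
    "note_a": ["fever", "cough", "fever"],
    "note_b": ["cough", "fever", "fatigue"],
    "note_c": ["invoice", "payment", "due"],
}

vocab = sorted({w for toks in docs.values() for w in toks})

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity: 1 for identical directions, 0 for no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = {name: vectorize(toks) for name, toks in docs.items()}
sim_ab = cosine(vectors["note_a"], vectors["note_b"])
sim_ac = cosine(vectors["note_a"], vectors["note_c"])
print(sim_ab > sim_ac)   # → True: the two medical notes group together
```

    The two notes that share vocabulary score high, while the unrelated note scores zero; a clustering algorithm exploits exactly this structure to form groups without any labels.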

    Literature

    Hu, M. and Liu, B. (2004) Mining and Summarizing Customer Reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, 22-25 August 2004, 168-177.

    Hutto, C. and Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).

    Lecture Notes

  • All Together Now

    This lesson will review the most typical and prevalent text-mining tasks. We will discuss how various tasks can be accomplished using the techniques introduced in this course and Introduction to Data Science. We will review the topics of text classification and categorization, text clustering, information retrieval, topic modeling, sentiment analysis, named entity recognition, text segmentation, language detection, trend analysis, concept extraction, text summarization, relationship extraction, keyword extraction, and anomaly detection. Don't worry about repetition: there are several topics above that we have already covered in our lecture, and we will mention them mainly for completeness and to reiterate the main ideas. The second part of the lesson will introduce the final project.

    Videos

    Homework