Topic outline

  • General

    Welcome to the home page of the Text Mining Course. The course consists of short chapters that we will cover in each lecture. Each lecture will finish with a homework assignment. Note that assignments have fixed dates.

    This course is offered in the xAIM master's program. The course is co-financed by the European Commission instrument "Connecting Europe Facility (CEF) Telecom" through the project "eXplainable Artificial Intelligence in Healthcare Management" (2020-EU-IA-0098), grant agreement number INEA/CEF/ICT/A2020/2276680.


    Copyright

    This material is offered under a Creative Commons CC BY-NC-ND 4.0 license.

    Software

    • Orange Data Mining, visit this webpage to install Orange and check out its documentation

    Additional Material and Communication Channels

    Dates and Online Sessions

    • Blaž: Oct 26, 2023, Zoom
    • Ajda: Nov 8, 2023, Zoom
    • Ajda: Nov 22, 2023, Zoom
    • Blaž: Dec 7, 2023

    Grading

    • 70% Four homework assignments, one after each online session
    • 30% Exam, about 20 multiple-choice questions, lasting 1.5 hours

    Exam

    Between Feb 16, 2024 and Mar 15, 2024.

  • Words, Embedding, Clustering & Classification

    In our 'Introduction to Data Science' course, we learned that machine learning techniques depend on feature-based characterization of objects. Our data was consistently represented as a matrix, with data instances (or objects) in rows and their feature-based representations in columns. The challenge with text is that we initially lack such a representation; we must devise one. First, we will focus solely on words, converting them into numbers using embeddings. For these embeddings, we will employ existing neural networks that have already been trained on vast text corpora. Our first lesson will be about using these embeddings for clustering and classification.

    Videos

    Lecture notes

  • Document Vectorization and Classification

    Text Preprocessing

    One of the first steps, if not the first, when working with text data is to preprocess it. Preprocessing defines the core units of the analysis. While in standard data mining we usually already work with tabular data, where instances are in rows and are described by features, with text this is often not the case. We therefore need some steps to prepare our texts for downstream analysis, and these steps are collectively called preprocessing.
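    As a minimal sketch in plain Python (the stop-word list below is invented for illustration; real pipelines, such as Orange's Text add-on, use larger curated lists), preprocessing might lowercase the text, split it into word tokens, and drop stop words:

```python
import re

# A tiny illustrative stop-word list; real pipelines use curated lists.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)   # punctuation is discarded
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The patient is recovering well, and the fever is gone."))
# → ['patient', 'recovering', 'well', 'fever', 'gone']
```

    The resulting token list is the input to the vectorization steps covered later in the course.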

    Bag of Words

    Text is a complex data form that cannot be used for machine learning in its raw form. Hence we need to convert text into something the computer can work with, such as numbers: ideally, numbers that sum up the content of the document. A simple approach to describing a document with numbers is called bag-of-words, a text vectorization approach that takes the tokens identified in preprocessing and counts their occurrences in documents. If our tokens are words, then the new numeric columns correspond to words, and their values are the number of times each word appears in the document. A more elegant approach to computing word frequencies is term frequency-inverse document frequency (TF-IDF), which down-weights words that appear frequently across all documents and up-weights those that are characteristic of a small number of documents.
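    The counting described above can be sketched in plain Python (the three example documents are invented; libraries compute the same at scale, often with a smoothed IDF variant):

```python
import math
from collections import Counter

# Three invented, already-tokenized documents.
docs = [
    ["fever", "cough", "fever"],
    ["cough", "headache"],
    ["fever", "cough", "headache"],
]

# Bag of words: one column per vocabulary word, values are raw counts.
vocab = sorted({w for doc in docs for w in doc})
bow = [[Counter(doc)[w] for w in vocab] for doc in docs]

# TF-IDF: term count scaled by log(N / document frequency).
n_docs = len(docs)
df = {w: sum(w in doc for doc in docs) for w in vocab}
tfidf = [
    [Counter(doc)[w] * math.log(n_docs / df[w]) for w in vocab]
    for doc in docs
]
```

    Note that "cough" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF weight vanishes: exactly the down-weighting of ubiquitous words described above.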

    Document Embedding

    While bag-of-words is a very simple vectorization technique, there are more advanced ones, too. They are called document embeddings and are based on pre-trained models. These models take the tokens and embed each word separately; the word vectors are then combined into a single document vector. Popular embedding models are fastText and BERT.
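    One simple way to combine per-word embeddings into a document vector is to average them. A sketch with invented two-dimensional word vectors (real models such as fastText produce vectors with hundreds of dimensions learned from large corpora):

```python
# Invented toy word vectors for illustration only.
word_vectors = {
    "fever":   [0.9, 0.1],
    "cough":   [0.8, 0.2],
    "invoice": [0.1, 0.9],
    "payment": [0.2, 0.8],
}

def embed_document(tokens):
    """Embed a document as the average of its known word vectors."""
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(embed_document(["fever", "cough"]))   # a "medical" document vector
```

    Documents about similar subjects end up with nearby vectors, which is what makes clustering and classification on top of embeddings work.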

    Document Classification

    Document classification means training a classifier to label documents, and its applications are many and varied.

    For medical data sets, applications are equally diverse. They range from predicting patient readmission from free-text notes and identifying severely injured patients from dispatch calls to correlating sentiment prediction with self-reported depression and representing proteins as text to predict their characteristics.
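    As an illustration of the idea, here is a minimal nearest-centroid classifier over bag-of-words vectors (the training documents and labels are invented; in practice one would use a library classifier on a real corpus):

```python
from collections import Counter

# Invented training set: token lists with class labels.
train = [
    (["fever", "cough", "fatigue"], "medical"),
    (["headache", "fever"], "medical"),
    (["invoice", "payment", "due"], "finance"),
    (["payment", "refund"], "finance"),
]

vocab = sorted({w for tokens, _ in train for w in tokens})

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Average the bag-of-words vectors of each class into a centroid.
centroids = {}
for label in {lbl for _, lbl in train}:
    vecs = [vectorize(toks) for toks, lbl in train if lbl == label]
    centroids[label] = [sum(col) / len(vecs) for col in zip(*vecs)]

def classify(tokens):
    """Label a new document by its closest class centroid."""
    v = vectorize(tokens)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

print(classify(["cough", "fever"]))   # → medical
```

    The same pattern, vectorize and then fit a classifier, underlies all the medical applications listed above; only the corpus, the labels, and the (usually far stronger) classifier change.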

    Lecture Notes

  • Topic Modeling and Sentiment Analysis

    Topic Modeling

    Another way to organize unlabelled documents is topic modeling, a technique that aims to discover latent topics in texts. Unlike clustering, topic modeling looks at word distributions and infers topics from them: the more frequently words appear together, the more likely they form a topic.

    One of the most popular topic modeling techniques is Latent Dirichlet Allocation, or LDA for short. LDA is a generative model that starts with randomly assigned topics, which are iteratively updated based on the probabilities of words in a topic. Nowadays, a popular topic model is BERTopic.
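    A sketch of LDA using scikit-learn (assuming scikit-learn is available; the four-document corpus is invented, and real corpora need far more text for meaningful topics):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus with two intuitive themes (health vs. finance).
docs = [
    "fever cough fever fatigue",
    "cough headache fever",
    "invoice payment due payment",
    "payment refund invoice",
]

# LDA works on raw word counts, so vectorize with a bag of words first.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic proportions

# Each row of components_ is a topic: a weight for every vocabulary word.
print(doc_topics.shape)   # (4 documents, 2 topics)
```

    Inspecting the highest-weighted words in each row of `components_` is how one reads off what each discovered topic is "about".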

    Sentiment Analysis

    Sentiment analysis (or opinion mining) is the task of extracting sentiment from text data. Medical sentiment analysis deals with medical data, such as electronic health records (EHRs), patient questionnaires, and similar sources. The approaches also consider the specifics of medical language (abbreviations, unique vocabulary, higher verb frequency).

    There are three approaches to sentiment extraction:

    • lexicon-based
    • machine learning
    • hybrid
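    A minimal sketch of the lexicon-based approach (the miniature lexicon below is invented; real systems rely on curated resources such as VADER or the Hu and Liu opinion lexicon listed under Literature):

```python
# Invented miniature sentiment lexicon mapping words to polarity scores.
LEXICON = {"good": 1, "improved": 1, "relief": 1,
           "pain": -1, "worse": -1, "nausea": -1}

def sentiment(tokens):
    """Sum the lexicon scores of the tokens; the sign gives the polarity."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment(["patient", "improved", "pain", "relief"]))   # → positive
```

    Machine-learning approaches instead train a classifier on labelled examples, and hybrid approaches combine both, for instance by using lexicon scores as features.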

    Word Enrichment

    If the corpus contains a target variable, for example, whether a patient is diagnosed with cancer or not, then we can build predictive models and observe their meaning and accuracy. However, most corpora do not contain a target variable, yet we would still like to organize the data in a sensible way, for example, so that similar documents are grouped together. This is called clustering, and it is a part of unsupervised machine learning.
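    Grouping similar documents relies on a measure of document similarity; a common choice is cosine similarity between document vectors, on top of which clustering algorithms such as k-means operate. A minimal sketch with an invented mini-corpus:

```python
import math
from collections import Counter

# Invented mini-corpus, already tokenized.
docs = {
    "note_a": ["fever", "cough", "fever"],
    "note_b": ["cough", "fever", "fatigue"],
    "note_c": ["invoice", "payment", "due"],
}

vocab = sorted({w for toks in docs.values() for w in toks})

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity: 1 for identical directions, 0 for no overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = {name: vectorize(toks) for name, toks in docs.items()}
sim_ab = cosine(vectors["note_a"], vectors["note_b"])
sim_ac = cosine(vectors["note_a"], vectors["note_c"])
print(sim_ab > sim_ac)   # → True: the two medical notes group together
```

    The two notes that share vocabulary score high, while the unrelated note scores zero; a clustering algorithm exploits exactly this structure to form groups without any labels.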

    Literature

    Hu, M. and Liu, B. (2004) Mining and Summarizing Customer Reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, 22-25 August 2004, 168-177.

    Hutto, C. and Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media (Vol. 8, No. 1, pp. 216-225).

    Lecture Notes

  • All Together Now

    This lesson will review the most typical and prevalent text-mining tasks. We will discuss how various tasks can be accomplished using the techniques introduced in this course and Introduction to Data Science. We will review the topics of text classification and categorization, text clustering, information retrieval, topic modeling, sentiment analysis, named entity recognition, text segmentation, language detection, trend analysis, concept extraction, text summarization, relationship extraction, keyword extraction, and anomaly detection. Don't worry about repetition: there are several topics above that we have already covered in our lecture, and we will mention them mainly for completeness and to reiterate the main ideas. The second part of the lesson will introduce the final project.

    Videos

    Homework