Homework 1: Words

In this homework, you will construct a data set of words and learn to use word clustering and classification. We will use fastText for word embedding, and we expect you to familiarize yourself with this method through literature research.

  1. [max two pages] To embed words into a vector space, we will use fastText embeddings. To start, do some research and answer the following questions:

    • In one paragraph, what is fastText?
    • What is the input of the fastText model? What is the output? What is the architecture of this model? Present the model's architecture graphically; make your own sketch; do not copy any figure from the web.
    • On what data was fastText trained? What does it optimize, that is, what is the training objective?
    • The training objective in fastText uses a "context". What is this context, and how is it used in training?

  2. [max 1 page] Compose a data set of at least 100 words related to medicine or health. Assign a class label to each word so that the data includes at least three labels, and each label is assigned to at least 30 words. Provide a short report (choice of labels, examples of words, basic statistics) on the chosen data set.

  3. [max 1 page] Perform hierarchical clustering on the embedded words and assess the quality of the resulting clustering. Report on the results and include a figure with the dendrogram. There is no need to assess the dendrogram quantitatively; just give your opinion on whether the clusters make sense, and explain why.

  4. [max 1 page] Choose a classification technique (say, logistic regression) and use cross-validation to assess its accuracy in predicting the class label from the word. Report on classification accuracy and AUC; in one sentence, describe what these two scores measure. Include a confusion matrix in your report and comment on it, possibly focusing on misclassifications.

  5. [max 1 page] Propose a list of up to 10 more words not included in your training data set of 100 words, at least two for each class label. Check whether the logistic regression classifier trained on the entire training data set predicts the correct class for each of them, and report on the results.
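
If you decide to write a script rather than use Orange, the clustering in item 3 could start from something like the sketch below. It is only an illustration under stated assumptions: the word list and the vectors are random stand-ins (in your solution they would be your own words and their fastText embeddings), and the cosine distance and average linkage are one reasonable choice, not a requirement.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: in the homework, `words` would be your 100+ terms
# and `X` their fastText vectors. Here we fake three groups of four words.
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(12)]
centers = rng.normal(size=(3, 50))
X = np.vstack([centers[i // 4] + 0.1 * rng.normal(size=50) for i in range(12)])

# Pairwise cosine distances between word vectors (an assumed choice of metric).
distances = pdist(X, metric="cosine")

# Agglomerative clustering with average linkage on the condensed distance matrix.
Z = linkage(distances, method="average")

# Cut the tree into at most three flat clusters for a quick look at the groups.
cluster_ids = fcluster(Z, t=3, criterion="maxclust")
for word, cid in zip(words, cluster_ids):
    print(cid, word)

# To draw the dendrogram for the report (requires matplotlib):
# from scipy.cluster.hierarchy import dendrogram
# import matplotlib.pyplot as plt
# dendrogram(Z, labels=words, orientation="left")
# plt.tight_layout()
# plt.show()
```

Listing the words per flat cluster next to the dendrogram makes it easier to argue whether the groups correspond to your class labels.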

For items 3, 4, and 5, the report should include a figure with the workflow you devised for your answer if you have used Orange, or the Python code, if you have decided to develop a script to find the solution.
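
For items 4 and 5, a script might follow the outline below. Treat it as a sketch, not a reference solution: the data are random stand-ins for fastText vectors, the label names are made-up examples, and five-fold stratified cross-validation with scikit-learn's `LogisticRegression` is just one possible setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Hypothetical stand-in data: three classes, 30 "words" each, 50-dim vectors.
rng = np.random.default_rng(0)
labels = ["anatomy", "disease", "treatment"]  # example label names only
centers = rng.normal(size=(3, 50))
y = np.repeat(np.arange(3), 30)
X = centers[y] + rng.normal(size=(90, 50))

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold class probabilities: each word is scored by a model
# that did not see that word during training.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
pred = proba.argmax(axis=1)

accuracy = accuracy_score(y, pred)
auc = roc_auc_score(y, proba, multi_class="ovr")  # one-vs-rest, macro-averaged
cm = confusion_matrix(y, pred)  # rows: true class, columns: predicted class
print(f"accuracy={accuracy:.3f}  AUC={auc:.3f}")
print(cm)

# Item 5: refit on the entire training set, then classify unseen words.
model.fit(X, y)
X_new = centers[[0, 1, 2]] + rng.normal(size=(3, 50))  # stand-ins for new words
new_pred = model.predict(X_new)
print([labels[i] for i in new_pred])
```

Computing all scores from the same out-of-fold predictions keeps the accuracy, AUC, and confusion matrix consistent with one another.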

Submit the homework as a short report in PDF. The report should include the title of the homework (“Exploratory analysis and Orange workflows”), your name and email, and one section for each of the numbered homework items above, in the same order. Also, please note:

  • All figures you use in the report should be numbered and include a caption. For example, "Fig. 1: Workflow for reading the data, visualizing it as a scatter plot, and performing hierarchical clustering and analysis on a selected cluster." All figures should be referred to in the text. For instance, "The hierarchical clustering dendrogram (Fig. 2) reveals three distinct clusters."

  • The figures should be legible. Do not reduce their size to the point where the text is not readable. Do not distort or skew screenshots or graphs; keep the aspect ratio when reducing size.

  • Spell-check the text in your report. Also, if possible, use a grammar checker. The report should read well and should be neatly set.

  • Read your final version of the report before submitting it.