Dimensionality reduction and explanations of the data maps
In this homework, you will compare various dimensionality reduction techniques to plot multi-dimensional data in two dimensions and try to explain the results.
- Start with a data set called "Telecom customer churn," as available in Orange's Datasets widget. Use PCA and scatter plot to plot the data in the space of the first two principal components. Answer the following questions:
- How many clusters are depicted in the scatterplot? [one sentence]
- What characterizes the first principal component? To answer this question, find which features have the highest weights in this component and summarize the finding in a few words (for example: "the location of the subscribers relative to the big cities" or "the number of calls on the weekends"). [an image of a workflow with the contents of the essential widgets showing how you arrived at the answer, and the answer in one sentence]
- What characterizes the second principal component? [one sentence]
- Use the same data above, and observe it in a t-SNE plot.
- Run t-SNE with default parameters for this data set, Exaggeration=1, and PCA components=19. How many clusters are there? [an image with a workflow and the contents of the t-SNE widget, one sentence with an answer]
- Propose a workflow to compare the presence of data instances in the t-SNE and PCA clusters. Comment on similarities or differences. [an image with a workflow with the contents of crucial widgets and an answer in one paragraph]
- Explain each of the clusters in the t-SNE plot. Hint: use a combination of t-SNE and Box Plot to characterize clusters. [t-SNE plot with circled clusters and labels that explain clusters]
- Consider the data set on bone marrow mononuclear cells with AML from the Datasets widget.
- This is a single-cell gene expression data set. This kind of data would normally be reported as counts per cell per gene, but the data you will work with is already normalized. Briefly describe what gene expression is and what a typical single-cell gene expression data set contains. [one paragraph]
- What are B cells? [one paragraph]
- Use t-SNE and show where the cluster of B cells is. Hint: use a handbook of marker genes to find one gene that is highly expressed in B cells. [a t-SNE plot with a clearly marked cluster of B cells]
- Use t-SNE in combination with Box Plot to find which other genes characterize this cluster. Report on at least three such genes and check the handbook to see whether they are indeed markers of B cells. [a workflow with the contents of essential widgets, and one sentence that lists the genes you find overexpressed in the cluster of B cells and states whether they are known markers]
- Use your data set from the first homework, plot it with either PCA or t-SNE, and report on possible clusters and their characterization. [the plot of the data with marked clusters and possible labels with cluster explanations, one paragraph explaining the data that includes a reference to the data source (could be the same as in the first homework), and one paragraph commenting on the results with cluster explanations]
Prepare a report with your answers and illustrations. Include the title of the homework, your name, and your email. The report should be at most three pages long (this limit is strict!); use 11 pt Arial or a similar sans-serif font. Please make sure the figures and the text on them are readable. Include only the visualizations, not screenshots that show windows, programs, or desktops. Do not skew the images by changing their aspect ratio.
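
Optional note on the PCA questions: "features with the highest weight in a component" can also be checked outside Orange. The following is a minimal sketch, not part of the required workflow. It uses only NumPy, runs on synthetic stand-in data with hypothetical feature names (not the churn data), and assumes that each feature is standardized before PCA, which may or may not match the settings of your PCA widget. The helper name top_loadings is made up for this illustration.

```python
import numpy as np

def top_loadings(X, feature_names, component=0, k=3):
    """Return the k features with the largest absolute weight in one principal component."""
    # Standardize columns so that features on large scales do not dominate (an assumption).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes: unit-length weight vectors over the features.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    weights = Vt[component]
    # Rank features by the magnitude of their weight; the sign only gives the direction.
    order = np.argsort(-np.abs(weights))[:k]
    return [(feature_names[i], float(weights[i])) for i in order]

# Synthetic stand-in data: two strongly correlated features and one independent feature.
rng = np.random.default_rng(0)
usage = rng.normal(size=200)
X = np.column_stack([
    usage + 0.1 * rng.normal(size=200),  # hypothetical "day_minutes"
    usage + 0.1 * rng.normal(size=200),  # hypothetical "day_charge"
    rng.normal(size=200),                # hypothetical "service_calls"
])
names = ["day_minutes", "day_charge", "service_calls"]

top = top_loadings(X, names, component=0, k=2)
print(top)
```

In this toy example the two correlated features share the first component, so a short summary of it would be something like "overall daytime usage". The same reading applies to the weights of the components in your own workflow: look at which features have the largest absolute weights and describe what they have in common.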