Text, Images, Embedding, Analysis, and Interpretation

The following is a list of tasks for your fourth assignment. Please collect the data, do the analysis, and prepare a report that includes the deliverables listed below. Use the same numbering as in the following list of tasks when reporting on deliverables. Deliver the report as a PDF file. Check the spelling and grammar, and make sure the report is well organized, readable, and neat.

  1. Collect a list of at least 50 named entities (e.g., sweets and cookies, tourist places you would like to visit, museums, pets, whatever) and organize their names in an Excel file. For example, if you chose tourist destinations, the rows in Excel would include Taj Mahal, Eiffel Tower, Great Barrier Reef, ... Deliverable: a short paragraph describing the data you collected, the collection method, and a few examples of names of the entities.

  2. Provide an image for each named entity. Place the images in a folder (e.g., images) and give each file an appropriate name (e.g., eiffel-tower.jpg). Add the image name as a second column in your named entity list. Deliverable: a screenshot of a workflow that loads the data, loads the images, merges the two datasets, and, for a selection of entities in the data table, shows images with entity names in the captions.

  3. For each named entity, create a paragraph (one paragraph only!) that describes it. Include this paragraph in your data. Do this for all entities. The description should be the third column in your data set. You can find these descriptions manually, automate the process using Python code, or use the ChatGPT Constructor widget. Deliverable: an example with three entities and their descriptions (you can use the Corpus Viewer). If you used Orange, include the workflow you used to create the data.

  4. Use the entity descriptions to cluster the entities. Display the dendrogram and comment on the result in terms of clustering quality and semantics. Suggest how you would quantitatively evaluate the quality of the clustering. What additional information do you need? Deliverable: the workflow, the annotated (e.g., with entity names) dendrogram, a paragraph with your qualitative evaluation of the dendrogram, and a paragraph with your idea of how to evaluate the quality quantitatively.

  5. Pick an interesting cluster from the dendrogram. What are the characteristic words that define this cluster? Using the cluster data, write a prompt that would summarize the textual descriptions for the entities in the cluster (use ChatGPT Summarize). Deliverable: the workflow, the dendrogram or a part of the dendrogram with the selected cluster, a list of characteristic words, and a summary.

  6. Cluster the entities according to their images. Comment on the quality of the clustering. Select a cluster and use ChatGPT Summarize to provide a summarized description of the images in the cluster (the summarization should of course come from the entity descriptions, and you would need to use the Merge Data widget to merge your textual data with the loaded images). Deliverable: the workflow, the dendrogram with annotation (names of the entities) showing the selected group, the images of the selected group, and their textual summary.

  7. Use t-SNE to find groups of entities based on their textual descriptions. Find a group and related images of interest and report whether the association makes sense. Characterize the group using text summarization. Deliverable: the t-SNE graph showing the selected group, the data table showing the names of the selected entities, and the appropriate characterization of the group.
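
If you prefer to prepare the entity table for tasks 1 and 2 with a script, the following is a minimal sketch, not a required method. It derives image file names in the eiffel-tower.jpg style from the entity names, writes the two columns to a file, and lists images that are still missing from the folder. The function names, the column names (name, image), the output file entities.csv, and the choice of CSV instead of Excel are all assumptions made for illustration; adapt them to your own data.

```python
import csv
import re
import unicodedata
from pathlib import Path


def image_file_name(entity_name, extension=".jpg"):
    """Turn an entity name into a lowercase, hyphen-separated file name."""
    # Drop accents so that the file name contains plain ASCII only.
    ascii_name = (
        unicodedata.normalize("NFKD", entity_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    return slug + extension


def build_rows(entity_names):
    """Create one row per entity with hypothetical 'name' and 'image' columns."""
    return [{"name": name, "image": image_file_name(name)} for name in entity_names]


def missing_images(rows, image_dir="images"):
    """Return the image file names that are not yet present in image_dir."""
    folder = Path(image_dir)
    return [row["image"] for row in rows if not (folder / row["image"]).is_file()]


def write_table(rows, path="entities.csv"):
    """Write the rows to a CSV file (assumed stand-in for the Excel file)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "image"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Example entities taken from task 1; replace with your own list of 50+.
    names = ["Taj Mahal", "Eiffel Tower", "Great Barrier Reef"]
    rows = build_rows(names)
    write_table(rows)
    print("Images still to collect:", missing_images(rows))
```

The descriptions required in task 3 can later be added as a third column in the same table.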