IDS: Classifiers and their Decision Boundaries

In this homework, you will study and experiment with different classifiers and think about their decision boundaries.

Consider the following classifiers:

classification tree with a depth of 1, a so-called "stump"; set the tree learner's parameter "Limit the maximal tree depth to" to 1
classification tree with a depth of 3
unregularized logistic regression (set the regularization type to None)
random forest with ten trees
support vector machine with a radial basis function (RBF) kernel, implemented in Orange's SVM widget; make sure the kernel is set to RBF

For each of the classifiers above, paint:

A. a data set where the classifier finds the "right" decision boundary,
B. a data set where the classifier fails to find the "right" decision boundary but where the decision boundary exists in the sense that you, a human, can find it and draw it on the scatterplot.

Do not paint the same data set for all the classifiers, that is, one data set where they all succeed or fail. Design the data sets with a particular learning algorithm in mind. That is, design a non-trivial data set that shows where the classifier succeeds, expressing the classifier's strength, and design a data set where we, humans, would be able to define the classification boundary but where the learning algorithm fails to find it.
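To illustrate what a case B data set can look like, here is a small sketch in plain Python (it is not part of the required Orange workflow). It builds an XOR-style grid of points, where a human can draw the boundary along the two axes, and checks how well the best single axis-aligned threshold, that is, a stump, can do. The names make_xor_grid, human_rule, and best_stump_accuracy are made up for this example, and the search scores thresholds by accuracy, which is an assumption and may differ from the split criterion used by Orange's tree learner.

```python
from itertools import product


def make_xor_grid(n=6):
    """Two-class XOR-style data on an n-by-n grid centered at the origin (n even)."""
    half = n // 2
    # Offsets of 0.5 keep every point off the axes, so the signs are unambiguous.
    coords = [i + 0.5 for i in range(-half, half)]
    return [((x, y), human_rule((x, y))) for x, y in product(coords, coords)]


def human_rule(point):
    """The boundary a human would draw: class 1 when x and y have the same sign."""
    x, y = point
    return int((x > 0) == (y > 0))


def best_stump_accuracy(data):
    """Accuracy of the best single axis-aligned threshold (a depth-1 tree)."""
    best = 0.0
    for axis in (0, 1):
        values = sorted({point[axis] for point, _ in data})
        # Candidate thresholds: midpoints between consecutive distinct values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            left = [label for point, label in data if point[axis] <= t]
            right = [label for point, label in data if point[axis] > t]
            # Each side of the split predicts its own majority class.
            correct = max(left.count(0), left.count(1)) + max(right.count(0), right.count(1))
            best = max(best, correct / len(data))
    return best


data = make_xor_grid()
print(best_stump_accuracy(data))  # 0.5
```

Every row and column of this grid holds equally many points of each class, so no single threshold does better than guessing, although the human rule classifies all points correctly.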

Paint the data sets with exactly two classes. Do not paint the data set with three, four, or more classes.

Show A and B using scatter plots. A minimal workflow you could use contains the Paint Data, Predictions, and Scatter Plot widgets, plus a learner (say, Classification Tree, which receives the data and passes a classifier to Predictions). In the scatter plot, you can color the dots by the predicted class and set the shape to represent the true class value. There are examples of such plots in the lecture notes.

Submit the homework as a short report in PDF. The report should include the title of the homework, your name, and your email. It should also include the scatter plots of success and failure (A and B) for each classifier. These would be best organized in a table with five rows (one for each machine learning algorithm) and two columns (one for success and one for failure). With each scatter plot, please also report the AUC score you get on your data with 10-fold cross-validation (use the Test and Score widget). For this report, it is sufficient to know that AUC scores close to 1.0 are excellent, and scores around 0.5 are poor, but you are welcome to explore more about this score. Please also include an example workflow you used to test one of the classifiers. Besides the learner/classifier, the workflow should contain the Paint Data, Predictions, Scatter Plot, and Test and Score widgets.
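If you would like some intuition for the AUC score, the sketch below computes it in plain Python as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. This is one standard way to define AUC, not necessarily how Test and Score computes it internally, and the function name auc is made up for this example.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 0 or 1; scores are the predicted scores for class 1.
    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0, every positive outranks every negative
print(auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5, the scores carry no information
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75, one of the four pairs is misordered
```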

Note that for both parts of the homework, it is sufficient to include only one learner widget: it outputs a classifier when presented with the input data, and it also outputs a learner that you can feed into the Test and Score widget.
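The difference between a learner and a classifier can be sketched in plain Python: a learner takes labeled data and returns a classifier, and cross-validation needs the learner itself because it trains a fresh classifier on every fold. The code below is an illustrative toy, not Orange's implementation; majority_learner and cross_validate are made-up names, the folds are formed by simple interleaving, and the result is reported as accuracy rather than AUC to keep the example short.

```python
def majority_learner(train):
    """A learner: takes labeled data and returns a classifier (a function of a point)."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda point: majority


def cross_validate(learner, data, k=10):
    """Accuracy from k-fold cross-validation; the learner is retrained on every fold."""
    folds = [data[i::k] for i in range(k)]  # interleaved folds, an assumption of this sketch
    correct = 0
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        classifier = learner(train)  # a fresh classifier, fitted without the test fold
        correct += sum(classifier(point) == label for point, label in test)
    return correct / len(data)


# Toy data: 20 one-dimensional points, 14 of class 1 and 6 of class 0.
data = [((i,), 1 if i % 10 < 7 else 0) for i in range(20)]

classifier = majority_learner(data)      # what Predictions would receive
print(classifier((3,)))                  # 1
print(cross_validate(majority_learner, data))  # 0.7
```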

It may happen that you won't be able to paint a data set matching A or B for some of the classifiers. If this is the case, please explain your intuition as to why.

The report should not exceed one page. The limit on page length and the limits on the number of paragraphs and sentences are strict. This time, no extra text is needed, just a report header, a table with 2x5 scatter plots, a label with each graph reporting the cross-validated AUC, and the workflow you have used. Use 11 pt Calibri, Arial, or a similar sans-serif font, and 1.2 spacing between lines. Use 6 pt separation between paragraphs.