Machine Learning 0 - Introduction¶
In this lab, we will introduce the classification problem that we will be working on for the last three labs. The objectives of the lab are:
- Examine the dataset and prepare the basic pipeline that will be used in the next lab.
- Make sure that we use a correct methodology for comparing the algorithms.
- Get used to working with the scikit-learn library.
Important note¶
For the machine learning labs, each student must write a report (one report for all labs together) which will be used during the oral exam. This report should highlight not only the different methods used during the labs, but also how you validated each method and compared their results.
Introduction to the dataset¶
The CIFAR-10 dataset has been collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It consists of 60,000 32x32 colour images, split into 10 classes: 'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'.
Reference: Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009. PDF available at https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
For the purpose of the INFO-H-501 laboratories, we will use a subset of those images by only taking 3 different classes: 'airplane', 'bird' and 'horse'.
Click here to download the modified dataset. Extract the ZIP file and put the CIFAR10 folder in the same directory as the notebook.
The images are 32x32 8-bit RGB, and from these we extracted Histogram of Oriented Gradients (HoG) vectors (16 orientations x 16 blocks = 256 values per HoG vector).
The following code pre-loads all of this data (make sure that you have the lab_tools.py file in the same directory as the notebook, if you didn't clone the repository):
from lab_tools import CIFAR10, get_hog_image
dataset = CIFAR10('../../extern_data/CIFAR10/')
from matplotlib import pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,10))
for i in range(64):
    plt.subplot(8,8,i+1)
    plt.imshow(dataset.train['images'][i].reshape((32,32,3)), interpolation='none')
    plt.title(dataset.labels[dataset.train['labels'][i]])
    plt.axis('off')
plt.show()
Pre-loading training data
Pre-loading test data
We can also have a look at some HoG images:
plt.figure(figsize=(8,8))
for i in range(16):
    plt.subplot(4,4,i+1)
    hog = dataset.train['hog'][i].reshape((4,4,16))
    plt.imshow(get_hog_image(hog, 128), interpolation='none')
    plt.title(dataset.labels[dataset.train['labels'][i]])
    plt.axis('off')
plt.show()
Note that this is an "image" representation of the HoG, but the actual data that we will be working on is a size 256 vector for each image. Our feature space therefore has 256 dimensions.
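For reference, a similar 256-value descriptor could be computed with scikit-image's hog function. This is only a sketch: the exact parameters used to build the pre-extracted vectors in dataset.train['hog'] are not given in this notebook, so the cell size and block layout below are assumptions.
from skimage.color import rgb2gray
from skimage.feature import hog as sk_hog  # aliased to avoid clobbering the 'hog' variable above

image = dataset.train['images'][0].reshape((32, 32, 3))
features = sk_hog(rgb2gray(image),         # HoG is computed on the intensity image
                  orientations=16,         # 16 gradient orientations
                  pixels_per_cell=(8, 8),  # a 32x32 image gives a 4x4 grid of cells
                  cells_per_block=(1, 1),  # one cell per block -> 16 blocks
                  feature_vector=True)
print(features.shape)                      # (256,) = 16 blocks x 16 orientations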
Exploring the dataset¶
The dataset object gives you access to different attributes:
- dataset.path contains the path to the CIFAR10 folder.
- dataset.labels contains the names of the three classes.
- dataset.train and dataset.test are dictionaries containing three numpy arrays each:
- images contains the RGB images
- hog contains the HoG vectors
- labels contains the label for each image
print(dataset.path)
print(dataset.labels)
print(dataset.train.keys())
print(dataset.train['hog'].shape)
../../extern_data/CIFAR10/
['Airplane', 'Bird', 'Horse']
dict_keys(['images', 'hog', 'labels'])
(15000, 256)
Quick questions:¶
The dataset has already been split into a training set (dataset.train) and a test set (dataset.test).
- How many images are in the training set?
- How many images are in the test set?
- What is the class distribution of the dataset?
# -- Your code here -- #
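If you want a starting point, here is one possible sketch using numpy (other approaches work just as well):
import numpy as np

# Number of images in each split
print(dataset.train['images'].shape[0], 'training images')
print(dataset.test['images'].shape[0], 'test images')

# Class distribution: np.unique counts how often each label occurs
for name, split in [('train', dataset.train), ('test', dataset.test)]:
    values, counts = np.unique(split['labels'], return_counts=True)
    print(name, dict(zip(values, counts)))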
Descriptive data analysis¶
Look at the HoG data from the training set. What are the characteristics of the dataset? Do you think that some pre-processing may be required to help the different algorithms?
# -- Your code here -- #
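As a hint, a very basic descriptive analysis could look at the range and spread of the feature values, for instance (a sketch, one possibility among many):
import numpy as np

hog_train = dataset.train['hog']
print('min:', hog_train.min(), ' max:', hog_train.max())
print('mean of per-feature means:', hog_train.mean(axis=0).mean())
print('mean of per-feature stds :', hog_train.std(axis=0).mean())

# A histogram of all feature values gives a quick view of their distribution
plt.figure()
plt.hist(hog_train.ravel(), bins=50)
plt.title('Distribution of HoG feature values (training set)')
plt.show()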
Introduction to scikit-learn¶
Scikit-learn is a very well-documented machine learning library in Python. It contains many algorithms for classification, and makes the whole process of building a machine learning pipeline relatively straightforward. There are many examples in the documentation, as well as relatively complete theoretical explanations, so I really encourage you to take the time to read it if some things are not clear.
Let's work through a very simple example. We are going to use the Ridge Classifier, which is a very basic linear model.
from sklearn.linear_model import RidgeClassifier
To use a classifier with scikit-learn, we generally follow three steps:
- Create an instance of the classifier class (here: RidgeClassifier). The constructor will generally take many arguments that can be modified, and that are explained in the documentation. There will also generally be default values for all of them, so in this simple example we will just use those:
clf = RidgeClassifier()
- Use the fit method with, as arguments, the training data (in our case, the HoG vectors) and the corresponding labels. This starts the main training algorithm, which tries to fit the parameters of the classifier to the training data:
clf.fit(dataset.train['hog'], dataset.train['labels'])
RidgeClassifier()
- Use the predict method to get the predictions of the classifier on the data given as argument. In this case, we get the predictions on the same data that was just used for training. What kind of performance will that give us?
pred = clf.predict(dataset.train['hog'])
print(pred.shape)
(15000,)
We can then evaluate those predictions. Scikit-learn provides many different metrics for evaluating the performance of a classifier. The simplest of these is the accuracy, which is simply the number of correct predictions divided by the total number of predictions:
from sklearn.metrics import accuracy_score
score = accuracy_score(dataset.train['labels'], pred)
print(score)
# Note that it's fairly easy to compute that score "by hand":
T = (pred==dataset.train['labels']).sum()
print(T, len(pred), T/len(pred))
0.7356666666666667
11035 15000 0.7356666666666667
In a multiclass problem, it's often also very useful to look at the confusion matrix, which gives us more information on which classes are often mistaken for each other:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(dataset.train['labels'], pred)
print(cm)
[[3812  740  448]
 [ 742 3236 1022]
 [ 337  676 3987]]
Note that the rows represent the true labels and the columns the predicted labels. This means that, in this case, out of the 5000 images of class 0 ("Airplane"), 3812 were correctly classified, 740 were classified as "Bird", and 448 as "Horse".
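To illustrate that convention, the per-class recall (the fraction of each true class that is correctly predicted) can be read directly from the matrix by dividing the diagonal by the row sums:
# Rows sum to the number of true examples of each class, so
# diagonal / row sums = per-class recall
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(dataset.labels, recall):
    print(f'{name}: {r:.3f}')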
Quick question¶
- Modify the code to estimate the predictive performance of the algorithm (without using the test set).
# -- Your code here -- #
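One possible approach (a sketch; cross-validation, introduced below, is a more thorough alternative) is to hold out part of the training set as a validation set:
from sklearn.model_selection import train_test_split

# Keep 20% of the training set aside; the classifier never sees it during fit,
# so the score below is a fairer estimate of predictive performance.
X_tr, X_val, y_tr, y_val = train_test_split(
    dataset.train['hog'], dataset.train['labels'],
    test_size=0.2, random_state=0)

clf_val = RidgeClassifier()
clf_val.fit(X_tr, y_tr)
print(accuracy_score(y_val, clf_val.predict(X_val)))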
How can we find the "best" algorithm?¶
What we have shown above served to illustrate how scikit-learn classifiers work, but we now have to build a valid machine learning pipeline to compare the different algorithms that we will use in the next labs.
As we have said earlier, we have already split the dataset into a "training" and a "test" set. It is clear that the final evaluation should take place on the test set.
But in addition to comparing the algorithms with each other, we also have to find the best "hyper-parameters" for each algorithm. For example, our RidgeClassifier has a regularization parameter, alpha, which by default is set to 1.0. Can we improve the performance of the algorithm by modifying this parameter?
Side-note: parameter vs hyper-parameter¶
In general, when talking about machine learning models, parameters are what the algorithm learns from the data (if we are, for instance, learning a linear regression y = ax+b, the "parameters" would be a and b), while hyper-parameters are modifiers to the model or to the pipeline (for instance, if we generalize to a polynomial regression, the degree of the polynomial would be a hyper-parameter).
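To make the distinction concrete, here is a small sketch with made-up data: the polynomial degree is a hyper-parameter that we choose beforehand, while the coefficients are parameters learned by fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-1, 1, 20).reshape(-1, 1)  # toy inputs
y = 2 * x.ravel()**2 + 0.5                 # toy targets

degree = 2                                 # hyper-parameter: chosen by us
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x, y)                            # parameters: learned from the data
print(model.named_steps['linearregression'].coef_)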
Cross-validation¶
The most common way of finding the best hyper-parameters of a classifier is to use cross-validation. In k-fold cross-validation, the training set is split into k parts ("folds"): each fold is used once as a validation set while the classifier is trained on the remaining k-1 folds, and the k validation scores are then averaged.
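As a starting point, here is a minimal sketch of how cross_val_score can be called; it handles the splitting and repeated training for you and returns one score per fold:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the default RidgeClassifier on the training set
scores = cross_val_score(RidgeClassifier(),
                         dataset.train['hog'], dataset.train['labels'], cv=5)
print(scores, scores.mean())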
- Using the cross_val_score function from scikit-learn, find the best alpha hyper-parameter for the RidgeClassifier:
# -- Your code here -- #
Comparing algorithms¶
Once you have found the best hyper-parameters for an algorithm, you can re-train the classifier on the whole training set, and finally use the test set to get a "final performance".
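Mechanically, that final step looks like this (a sketch: alpha=1.0 below is just a placeholder for whatever value your cross-validation selects):
# Re-train on the full training set with the selected hyper-parameters,
# then evaluate once on the held-out test set.
final_clf = RidgeClassifier(alpha=1.0)  # placeholder alpha
final_clf.fit(dataset.train['hog'], dataset.train['labels'])
test_pred = final_clf.predict(dataset.test['hog'])
print(accuracy_score(dataset.test['labels'], test_pred))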
- How can you then decide which of the classifiers is best?
- How can you decide if the difference between two classifiers is significant?
Try to compare the best RidgeClassifier with the original one. Is it significantly better?
# -- Your code here -- #