How to Build an Image Classifier with Very Little Data (Part 2)

In the first part of this article I presented the results of the classical machine learning classifiers and introduced the convolutional neural network architecture. In this second part, let's take a look at different strategies to improve the accuracy of the model.

Before we go into the different techniques I tried to improve the accuracy, let's take a step back and go through a few basics.

While training a convolutional neural network myself, I found it can become overwhelming to keep track of the sheer number of decisions that need to be made and the seemingly infinite possibilities to explore. This is where pipelines come in handy for experimenting with different scenarios. Let's take a quick look at the exploratory data analysis which inspired the pipeline plan.

Analysis:
• The images are in different sizes
• The image brightness is fairly random
• The images may be slightly rotated
• The images may not be facing straight
• The images may not be exactly centred
• The classes are fairly equally distributed (most classes ~15% of the overall dataset, one class ~20%), and the distribution is consistent across the training, validation and test sets

Visualization helps me to intuitively understand what I'm dealing with. Some ideas I explored here are: resizing all images to the same shape, image augmentation to compensate for the small number of training samples, data normalization, and experimentation with different color spaces.
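
To make some of these ideas concrete, here is a minimal sketch of how the resizing and augmentation could be wired up with Keras's ImageDataGenerator; the ranges, target size and directory path are illustrative assumptions rather than the exact settings used in this project.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation choices mirror the analysis above: small rotations, small shifts
# and brightness variation. The exact ranges are illustrative assumptions.
augmenter = ImageDataGenerator(
    rotation_range=10,             # images may be slightly rotated
    width_shift_range=0.1,         # images may not be exactly centred
    height_shift_range=0.1,
    brightness_range=(0.7, 1.3),   # image brightness is fairly random
    horizontal_flip=True,
)

# Resize every image to one shape while streaming from disk
# ("data/train" is a hypothetical directory with one sub-folder per class).
train_batches = augmenter.flow_from_directory(
    "data/train", target_size=(128, 128), batch_size=32, class_mode="sparse")
```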

Data Normalization

There are three main types of pixel scaling techniques supported by the Keras ImageDataGenerator class:

• Pixel Normalization: scale pixel values to the range 0-1.

• Pixel Centering: scale pixel values to have a zero mean.

• Pixel Standardization: scale pixel values to have a zero mean and unit variance.

Pixel standardization is supported at two levels: per-image (called sample-wise) or per-dataset (called feature-wise). Specifically, the mean and standard deviation statistics required to standardize pixel values can be calculated from the pixel values of each image only (sample-wise) or across the entire training dataset (feature-wise).

Feature-wise normalization implies the following:

- featurewise_center: set input mean to 0 over the dataset.

- featurewise_std_normalization: divide inputs by std of the dataset.

The results are a bit disappointing. I was expecting normalization to have a big effect, but the improvement is minimal.

Sample-wise normalization implies the following:

- samplewise_center: set each sample mean to 0.

- samplewise_std_normalization: divide each input by its std.

Sample-wise normalization shows a slightly better score (but a higher loss) than feature-wise normalization; like feature-wise normalization, however, it is not a significant improvement, and the model is still overfitting.
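
For reference, here is a minimal sketch of how both levels can be configured with Keras's ImageDataGenerator; the placeholder arrays simply stand in for the real training data.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder data standing in for the real training set (100 RGB images, 128x128).
x_train = np.random.rand(100, 128, 128, 3).astype("float32")
y_train = np.random.randint(0, 7, size=(100,))

# Feature-wise: mean/std are computed over the whole training set,
# so the generator has to be fitted on the training images first.
featurewise_gen = ImageDataGenerator(featurewise_center=True,
                                     featurewise_std_normalization=True)
featurewise_gen.fit(x_train)

# Sample-wise: each image is standardized with its own mean/std, no fitting needed.
samplewise_gen = ImageDataGenerator(samplewise_center=True,
                                    samplewise_std_normalization=True)

batches = samplewise_gen.flow(x_train, y_train, batch_size=32)
```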

Color Space Transformations

Being able to handle distortion is a nice feature, but it isn't enough, since our augmented samples are still highly correlated. So I used additional transformations, such as converting an image from one color space to another, to supplement our data while adding healthy variance.

I enjoyed this part very much, especially since it was the first time I had used color space transformations. After some research, I decided to compare BGR, HSV, YUV, YCrCb, HLS, Lab, Luv and XYZ.
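
Here is a minimal sketch of how these conversions can be applied with OpenCV, assuming the images are loaded in OpenCV's default BGR order (the file name below is hypothetical):

```python
import cv2

# Mapping from color-space name to the corresponding OpenCV conversion code.
COLOR_SPACES = {
    "HSV": cv2.COLOR_BGR2HSV,
    "YUV": cv2.COLOR_BGR2YUV,
    "YCrCb": cv2.COLOR_BGR2YCrCb,
    "HLS": cv2.COLOR_BGR2HLS,
    "Lab": cv2.COLOR_BGR2LAB,
    "Luv": cv2.COLOR_BGR2LUV,
    "XYZ": cv2.COLOR_BGR2XYZ,
}

def convert_color_space(image_bgr, space):
    """Return the image converted to the requested color space ("BGR" = no-op)."""
    if space == "BGR":
        return image_bgr
    return cv2.cvtColor(image_bgr, COLOR_SPACES[space])

# Example: build an XYZ version of a single training image (hypothetical file name).
image = cv2.imread("car_001.jpg")
image_xyz = convert_color_space(image, "XYZ")
```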

My aim here was not to create a state-of-the-art CNN architecture to get the best accuracy, but to compare the color spaces.


The results were surprising. I was expecting some color transformations to be more efficient, as many images are more about shapes and less about colors. I was also thinking that the colors of cars are more saturated than those of backgrounds (e.g., trees), and that color spaces like HSV and HLS might therefore contribute to superior performance. This was not the case.

The results are generally quite similar. The XYZ color space gives slightly better results, while HSV gives slightly poorer results. Why did this happen? Perhaps because XYZ is fairly similar to RGB, whereas HSV is a cylindrical system and the farthest from RGB in terms of similarity, which may explain the worst results.

So far I’ve explored the hypothesis that by augmenting our dataset, we allow our model to learn more robust and generalizable features and produce more accurate classifications than our previous model. But there are also ways to improve the performance of our model by fine-tuning the model itself. Tuning hyperparameters for deep neural network can be difficult as it can be slow to train a deep neural network and there are numerous parameters to configure. The most commonly used recipes are:

- Dropout is a simple technique that randomly drops nodes out of the network, forcing the remaining nodes to adapt and pick up the slack of the removed nodes. In a way, it prevents a layer from seeing the exact same pattern twice.
- Early stopping: during training, there comes a point when the model stops generalizing and starts learning the statistical noise in the training dataset. Early stopping refers to stopping the training process before the learner passes that point.
- Weight decay tries to incentivize the network to use smaller weights by adding a penalty to the loss function, as sketched below.
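
A minimal sketch of what these three recipes look like in Keras; the dropout rate, penalty strength and patience values are illustrative assumptions:

```python
from tensorflow.keras import layers, regularizers, callbacks

# Dropout: randomly zero out 50% of the previous layer's activations during training.
dropout = layers.Dropout(0.5)

# Weight decay: an L2 penalty on a layer's weights, added to the loss.
dense = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))

# Early stopping: watch the validation loss and stop once it stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# ...later passed to model.fit(..., callbacks=[early_stop])
```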

But the list goes on and on (and this is by no means an exhaustive list): the number of convolutional layers and filters, the number of fully connected (dense) layers and neurons, the number of epochs, the batch size, the learning rate, using dropout and increasing dropout, batch normalization, different activations, different optimizers and much more.

As I experimented with more and more ideas, it became harder and harder for me to remember what I had tried. What changes made the network better or worse? I was inspired by this post to use a mind map to keep track of the important things I tried. 

I probably tried more than 15 different network architectures, with different network depths, filter sizes and learning rates. I got the best results by using a very large filter size of 11x11 (probably because, since the pictures are mostly well centred, a large number of pixels is needed for the network to recognize the object) and a very low learning rate. After 1000 epochs and 25 hours, I decided to stop the training. It seems to be on the right path, but the network could still learn and would benefit from more training to reach convergence.
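
For illustration, here is a rough sketch of a baseline in the spirit of that best run (11x11 filters, batch normalization, dropout and a very low learning rate); the depth, layer sizes and exact learning rate are assumptions, not the precise architecture from this post:

```python
from tensorflow.keras import layers, models, optimizers

# Rough baseline: 11x11 filters, batch normalization, dropout (0.1 after the
# conv block, 0.5 before the classifier) and a very low learning rate.
model = models.Sequential([
    layers.Conv2D(16, (11, 11), activation="relu", input_shape=(128, 128, 3)),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.1),
    layers.Conv2D(32, (11, 11), padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),   # number of classes assumed for illustration
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```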

| Classifier | Optimization | Accuracy |
| --- | --- | --- |
| Dummy classifier (most frequent) | None | 0.207 |
| Logistic Regression | Grid search for optimal regularization strength | 0.878 |
| k-Nearest Neighbors | Grid search for optimal k (number of neighbors), weights, distance types | 0.880 |
| Linear SVM | PCA to reduce the number of features and retain 90+% of the variance; grid search for optimal regularization parameter | 0.863 |
| (Simple) Decision Trees | Grid search for optimal tree depth (minimal) | 0.840 |
| Random Forests | Randomized search for optimal number of estimators (minimal) | 0.880 |
| Convolutional Neural Networks | Baseline + data augmentation + batch normalization + dropout (0.1/0.5) + filter size (11x11) + number of filters (16+) | 0.726 |
| Convolutional Neural Networks | Baseline + color space transformation (XYZ) + dropout (0.3/0.5) + filter size (3x3) + number of filters (32+) | 0.597 |

Reflections

  • Data is very important. While researching the topic I came across many articles/papers claiming to have solutions for the problem of having very little data. On closer look, what was defined there as very little data was typically between 500 and 1000 samples per class. Having only 50 samples per class, this project was extremely challenging.
  • Transfer learning helped traditional ML models to compensate for the lack of data (even with minimal fine-tuning as was the case for Random Forests) and still outperform the convolutional neural network.
  • The mind map proved a useful tool to find information quickly and get some inspiration (putting the experiments in a map around the most successful path).
  • Time

Ideas

  • Different data augmentation (generative model)
  • Regularization techniques (L2 to force small parameters, L1 to set small parameters to 0), different activations/optimizers, filter sizes, learning rate reduction on plateau, color space transformations…
  • Retrain on wrongly predicted training images (fine-tune the model on the specific set of images it previously mispredicted).
  • Ensembles, or a gradient-boosted tree of neural networks.

References and interesting reads:

  • "The Effectiveness of Data Augmentation in Image Classification using Deep Learning", Perez et al., 2017.
  • "Why do deep convolutional networks generalize so poorly to small image transformations?", Azulay et al., 2019
  • "Evolution of Convolutional Neural Network Architecture in Image Classification Problems", Arsenov et al.      
  • Data Augmentation:  https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/
  • Keras Data Augmentation:
  • Pipeline and mindmap:
  • My Github: