In the first part of this article I presented the results of the classical machine learning classifiers and introduced the convolutional neural network architecture. In this second part, let's take a look at different strategies to improve the accuracy of the model.
Before we go into the different techniques I tried to improve the accuracy, let's take a step back and go through a few basics.
While training a convolutional neural network myself, I found it can become overwhelming to keep track of all the decisions that need to be made and the seemingly infinite possibilities to explore. This is where pipelines come in handy for experimenting with different scenarios. Let's take a quick look at the exploratory data analysis that inspired the pipeline plan.
Analysis:
• The images are in different sizes
• The image brightness is fairly random
• The images may be slightly rotated
• The images may not be facing straight
• The images may not be exactly centred
• The classes are fairly equally distributed (most classes ~15% of the overall dataset, one class 20%) and consistent across the training, validation and test datasets
Visualization helps me to intuitively understand what I’m dealing with. Some ideas I explored here are: resizing all images to the same shape, image augmentation to compensate for the small number of training samples, data normalization, and experimentation with different color spaces.
There are three main types of pixel scaling techniques supported by the Keras ImageDataGenerator class:
• Pixel Normalization: scale pixel values to the range 0-1.
• Pixel Centering: scale pixel values to have a zero mean.
• Pixel Standardization: scale pixel values to have a zero mean and unit variance.
Pixel standardization is supported at two levels: per-image (called sample-wise) or per-dataset (called feature-wise). Specifically, the mean and standard deviation statistics required to standardize pixel values can be calculated from the pixel values in each image only (sample-wise) or across the entire training dataset (feature-wise).
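As a minimal sketch, pixel normalization is the simplest of the three and can be set up directly on the ImageDataGenerator (x_train and y_train are assumed to be NumPy arrays of images and labels):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Pixel normalization: rescale raw 0-255 pixel values into the range 0-1.
datagen = ImageDataGenerator(rescale=1./255)

# x_train / y_train are assumed arrays of images and labels.
train_iterator = datagen.flow(x_train, y_train, batch_size=32)
```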
Feature-wise normalization implies the following:
- featurewise_center: set input mean to 0 over the dataset.
- featurewise_std_normalization: divide inputs by std of the dataset.
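A sketch of the feature-wise variant; note that fit() has to be called on the training data so the generator can compute the dataset-wide statistics:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Feature-wise: the mean and std are computed over the whole training dataset.
datagen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True)

# fit() computes the dataset statistics from the training images.
datagen.fit(x_train)

train_iterator = datagen.flow(x_train, y_train, batch_size=32)
```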
The results are a bit disappointing. I was expecting normalization to have a big effect, but the improvement is minimal.
Sample-wise normalization implies the following:
- samplewise_center: set each sample mean to 0.
- samplewise_std_normalization: divide each input by its std.
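And the sample-wise variant, which needs no fit() step because the statistics come from each image itself:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sample-wise: each image is centered and scaled using its own mean and std.
datagen = ImageDataGenerator(samplewise_center=True,
                             samplewise_std_normalization=True)

train_iterator = datagen.flow(x_train, y_train, batch_size=32)
```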
Sample-wise normalization shows a slightly better score (but a higher loss) than feature-wise normalization, yet, like feature-wise normalization, it is not a significant improvement. And the model is still overfitting.
One missing ingredient required in modern convolutional neural networks is data augmentation. The human vision system is excellent at adapting to image translations, rotations and other forms of distortion. Take an image and flip it any way you like, and most people can still recognize it. CNNs, however, are not very good at handling such distortions; they can fail badly even under minor translations. The key to resolving this is to randomly distort the training images using horizontal flipping, vertical flipping, rotation, zooming, shifting and other distortions. This enables CNNs to learn how to handle these distortions and therefore work well in the real world. For small datasets, data augmentation is used as a way to introduce variance and give the classifier a better sense of the world.
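A sketch of this kind of augmentation pipeline, again with ImageDataGenerator; the specific ranges below are illustrative placeholders, not the values I tuned:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly distort the training images on the fly.
datagen = ImageDataGenerator(rotation_range=15,      # random rotation in degrees
                             width_shift_range=0.1,  # horizontal shift
                             height_shift_range=0.1, # vertical shift
                             zoom_range=0.1,         # random zoom
                             horizontal_flip=True,
                             vertical_flip=True)

train_iterator = datagen.flow(x_train, y_train, batch_size=32)
```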
The performance with the augmentation is actually worse than without it. Moreover:
• the network is not robust to these changes.
• the training requires more epochs (it takes more time to train with larger data, and the network does not converge even after 500 epochs).
Still, I was not ready to abandon this path just yet.
Being able to handle distortion is a nice feature, but it isn't enough, since our augmented samples are still highly correlated. So I used additional transformations, such as converting an image from one color space to another, to supplement our data while adding healthy variance.
I enjoyed this part very much, especially since it was the first time I had used color space transformations. After some research, I decided to use BGR, HSV, YUV, CrCb, HLS, Lab, Luv and XYZ.
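A sketch of how these conversions can be done with OpenCV, assuming the images are loaded with cv2.imread() (i.e. already in BGR) and that "CrCb" above refers to OpenCV's YCrCb:

```python
import cv2

# Map a color space name to the matching OpenCV conversion code.
CONVERSIONS = {
    'HSV':   cv2.COLOR_BGR2HSV,
    'YUV':   cv2.COLOR_BGR2YUV,
    'YCrCb': cv2.COLOR_BGR2YCrCb,
    'HLS':   cv2.COLOR_BGR2HLS,
    'Lab':   cv2.COLOR_BGR2Lab,
    'Luv':   cv2.COLOR_BGR2Luv,
    'XYZ':   cv2.COLOR_BGR2XYZ,
}

def to_color_space(bgr_image, space):
    """Convert a BGR image to the requested color space (BGR returns unchanged)."""
    if space == 'BGR':
        return bgr_image
    return cv2.cvtColor(bgr_image, CONVERSIONS[space])

# Example usage: hsv_image = to_color_space(cv2.imread('car.jpg'), 'HSV')
```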
My aim here was not to create a high-tech, state-of-the-art CNN architecture to get the best accuracy, but to compare the color spaces.
The results were surprising. I was expecting some color transformations to be more effective, as many images are more about shapes than about colors. I was also thinking that the colors of cars are more saturated than those of backgrounds (e.g., trees), so color spaces like HSV and HLS might contribute to superior performance. This was not the case.
The results are generally quite similar. The XYZ color space gives slightly better results, while HSV gives slightly poorer ones. Why did this happen? Maybe because XYZ is fairly similar to RGB, whereas HSV is a cylindrical system and the farthest from RGB in terms of similarity, hence it gave the worst results.
So far I’ve explored the hypothesis that by augmenting our dataset, we allow our model to learn more robust and generalizable features and produce more accurate classifications than our previous model. But there are also ways to improve performance by fine-tuning the model itself. Tuning hyperparameters for a deep neural network can be difficult, as deep networks are slow to train and there are numerous parameters to configure. The most commonly used recipes, sketched in code after the list, are:
- Dropout is a simple technique that randomly drops nodes out of the network, forcing the remaining nodes to adapt and pick up the slack of the removed nodes. In a way, it prevents a layer from seeing the exact same pattern twice.
- During training, there will be a point when the model stops generalizing and starts learning the statistical noise in the training dataset. Early stopping refers to stopping the training process before the learner passes that point.
- Weight decay tries to incentivize the network to use smaller weights by adding a penalty to the loss function.
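A minimal sketch of how these three recipes can be wired into a small Keras model; the architecture, layer sizes and coefficients below are illustrative assumptions, not my actual network:

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.callbacks import EarlyStopping

num_classes = 7            # assumption: adjust to the actual number of classes
input_shape = (64, 64, 3)  # assumption: adjust to the resized image shape

model = models.Sequential([
    # Weight decay: an L2 penalty on the kernel weights is added to the loss.
    layers.Conv2D(32, (3, 3), activation='relu',
                  kernel_regularizer=regularizers.l2(1e-4),
                  input_shape=input_shape),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    # Dropout: randomly zero out 50% of the activations during training.
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Early stopping: halt training once the validation loss stops improving.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=500, callbacks=[early_stop])
```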
But the list goes on and on (and this is by no means an exhaustive list): the number of convolutional layers and filters, the number of fully connected (dense) layers and neurons, the number of epochs, the batch size, the learning rate, using dropout and increasing dropout, batch normalization, different activations, different optimizers and much more.
As I experimented with more and more ideas, it became harder and harder for me to remember what I had tried. What changes made the network better or worse? I was inspired by this post to use a mind map to keep track of the important things I tried.
I probably tried more than 15 different network architectures, with different network depths, filter sizes and learning rates. I got the best results by using a very large 11x11 filter size (probably because the pictures are mostly well centred, so a large number of pixels is needed for the network to recognize the object) and a very low learning rate. After 1000 epochs and 25 hours, I decided to stop the training. It seems to be the right path, but the network could still learn and would benefit from more training to reach convergence.
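As a rough illustration of what a very large filter size and a very low learning rate look like in Keras (the rest of the architecture and the exact values are assumptions for illustration, not the network I trained):

```python
from tensorflow.keras import layers, models, optimizers

num_classes = 7              # assumption
input_shape = (128, 128, 3)  # assumption

model = models.Sequential([
    # A large 11x11 filter in the first convolutional layer.
    layers.Conv2D(32, (11, 11), activation='relu', input_shape=input_shape),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, activation='softmax'),
])

# A very low learning rate; training therefore needs many epochs.
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```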