Decoding how convnets see the world

Understanding how convolutional neural networks (CNNs) see the world has been a fascinating topic of experimentation since the original Google post on Deep Dream came to life in mid-2015. While I did read that post sometime during the same year, only recently did I decide to get my hands dirty with the topic. My main motivator was not the original Google post but the follow-up article written here in early 2016 by the main Keras developer, François Chollet.

After deciding to spend some time learning a bit of Keras, I thought it would be neat to retrace Chollet's steps using the code he so kindly uploaded to GitHub. And so I did. As a professional mathematician but amateur data scientist, many questions came to mind after working through Chollet's code. One fact that caught my attention was the high-frequency patterns displayed by all filters irrespective of their depth within the convnet (see the images in Chollet's article). To date, my intuition tells me that this is simply the result of the small convolutional kernel sizes that architectures like VGG16 use. Small kernels, (3, 3) or so, cannot produce low-frequency (long-range) filter patterns unless the filters were applied in Fourier space.
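For concreteness, here is a minimal sketch of the gradient-ascent recipe behind those filter visualizations, in the spirit of Chollet's script. The layer name, filter index, image size, and step size are illustrative choices, and a channels-last backend (Keras 1.x/2.x era) is assumed:

```python
import numpy as np
from keras import backend as K
from keras.applications.vgg16 import VGG16

model = VGG16(weights='imagenet', include_top=False)

layer_name = 'block3_conv1'   # any convolutional layer
filter_index = 0              # which filter to maximize

layer_output = model.get_layer(layer_name).output
# Mean activation of the chosen filter (channels-last ordering assumed).
loss = K.mean(layer_output[:, :, :, filter_index])

# Gradient of the loss w.r.t. the *input image*, normalized for stability.
grads = K.gradients(loss, model.input)[0]
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)

iterate = K.function([model.input], [loss, grads])

# Start from a noisy gray image and climb the gradient.
img = np.random.random((1, 128, 128, 3)) * 20 + 128.
for _ in range(40):
    loss_value, grads_value = iterate([img])
    img += grads_value * 1.0   # gradient-ascent step
```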

Other results I found even more puzzling were the reconstructed input images generated by maximizing the output of a specific class. In his article Chollet provides two reconstructed inputs (I tried a few more) corresponding to what the VGG16 model confidently considers to be a sea snake and a magpie. While the high-frequency nature of the reconstructed inputs did not surprise me, as already discussed, I did find the delocalized nature of the image features a bit intriguing. See, for example, how in the magpie case feather-like features appear scattered all over the image domain, and how this behaviour holds irrespective of how the input image and/or the random state of the model are initialized.
This certainly does not support convnets as models that intend to mimic how humans interpret the world via vision. Of course! the professional ML scientist will reply. Who has claimed that convnets strictly follow (or aim to follow) the architecture of the human brain? After all, our planes do not fly like birds, nor do our helicopters fly like beetles... OK, I get it, replies this newbie explorer of the ML field. But can one find an alternative cause that could at least partially account for the lack of feature localization? One could think that part of the problem lies in the fact that the magpie images fed into the model at training time did not all contain the bird located/oriented in approximately the same spatial region (forget about more complex transformations for now). Sounds logical, right?
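For completeness, the class reconstructions discussed above follow the same gradient-ascent recipe as the filter visualizations, only with the classifier head attached and the loss taken from the output layer. A minimal variation of the earlier sketch follows; the class index is my assumed ImageNet index for "magpie", and, as Chollet notes, ascending on the pre-softmax activation usually works better than on the softmax probability itself:

```python
import numpy as np
from keras import backend as K
from keras.applications.vgg16 import VGG16

model = VGG16(weights='imagenet', include_top=True)

class_index = 18                      # assumed ImageNet index for 'magpie'
loss = model.output[0, class_index]   # score assigned to that class

grads = K.gradients(loss, model.input)[0]
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
iterate = K.function([model.input], [loss, grads])

img = np.random.random((1, 224, 224, 3)) * 20 + 128.  # noisy gray start
for _ in range(100):
    loss_value, grads_value = iterate([img])
    img += grads_value
```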

With the delocalization-based hypothesis in mind I moved to set up a simple experiment where my intuition could be put to the test. However, for the experiment to move forward I needed a module that could be inserted within a convnet and perform simple affine transformations on the input image. To my delight, researchers at Google, on their unstoppable path to send God into retirement sooner rather than later, had come up with exactly the tool I needed. They named it Spatial Transformer Networks (STNs). Furthermore, at the time I started working on this mini-project, a gentleman named Eder Santana had already written a Keras implementation of STNs with Theano as backend (there are others too), which can be found here for Keras v > 1.0 (for TensorFlow users this could be a very good starting point). And so I set off to put my ideas into action.
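To give an idea of how such a module slots into a Keras model, here is a rough sketch loosely following the example shipped with Eder Santana's Theano-backed implementation. The `SpatialTransformer` import path and constructor arguments are assumptions taken from that repository and may differ between versions; Theano dimension ordering and Keras 1.x layer syntax are assumed:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Activation
from seya.layers.attention import SpatialTransformer  # assumed import path

input_shape = (1, 28, 28)  # Theano ordering: (channels, rows, cols)

# Localization network: regresses the 6 parameters of an affine transform.
# Initializing the final bias to the identity transform is the usual trick.
b = np.zeros((2, 3), dtype='float32')
b[0, 0] = b[1, 1] = 1.0
W = np.zeros((50, 6), dtype='float32')

locnet = Sequential()
locnet.add(MaxPooling2D(pool_size=(2, 2), input_shape=input_shape))
locnet.add(Convolution2D(20, 5, 5))
locnet.add(MaxPooling2D(pool_size=(2, 2)))
locnet.add(Convolution2D(20, 5, 5))
locnet.add(Flatten())
locnet.add(Dense(50))
locnet.add(Activation('relu'))
locnet.add(Dense(6, weights=[W, b.flatten()]))

# STN at the input, followed by a small classification convnet.
model_stn = Sequential()
model_stn.add(SpatialTransformer(localization_net=locnet,
                                 downsample_factor=1,
                                 input_shape=input_shape))
model_stn.add(Convolution2D(32, 3, 3, activation='relu'))
model_stn.add(MaxPooling2D(pool_size=(2, 2)))
model_stn.add(Flatten())
model_stn.add(Dense(128, activation='relu'))
model_stn.add(Dense(10, activation='softmax'))
```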

The simple experiment I envisioned (I had no GPU access at the time) relied on modifying the original MNIST dataset such that each digit class was rotated by a distinct angle, fixed for all exemplars of a given class and increasing in small increments from one class to the next. The image below portrays an example of the outcome of such transformations.

Modified data
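A sketch of how such a per-class rotation can be produced; the 15-degree step per class is an illustrative choice, not necessarily the one used in my notebook:

```python
import numpy as np
from scipy.ndimage import rotate
from keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()

# One fixed angle per digit class, growing in small increments.
angles = {d: 15.0 * d for d in range(10)}   # class 0 -> 0 deg, class 9 -> 135 deg

x_rot = np.stack([
    rotate(img, angles[int(lbl)], reshape=False, mode='constant')
    for img, lbl in zip(x_train, y_train)
])
```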

The idea was to construct two identical convnet models, with the only difference being the presence or absence of an STN module connected to their input. The convnet lacking the STN was to be trained on the MNIST-rotated data only. The convnet incorporating the STN was to be trained on the dataset that resulted from merging the original unmodified MNIST with the MNIST-rotated one (exemplars shuffled). After training, the reconstructed inputs that maximized each individual digit class were to be compared between the two models. It is important to note that the STN module was removed from the second convnet right after training. This ensured both architectures were identical at input-reconstruction time.
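In code, the training-set merge and the post-training removal of the STN could look roughly like this. It reuses the names `x_train`, `y_train`, `x_rot`, and `model_stn` from the sketches above, and `build_barebones_cnn()` is a hypothetical helper that returns the plain CNN:

```python
import numpy as np

# Merged training set for the STN model: original MNIST plus the rotated
# copy, with exemplars shuffled together.
x_comb = np.concatenate([x_train, x_rot])
y_comb = np.concatenate([y_train, y_train])
perm = np.random.permutation(len(x_comb))
x_comb, y_comb = x_comb[perm], y_comb[perm]

# ... train model_stn on (x_comb, y_comb) and the plain CNN on x_rot only ...

# After training, rebuild the convnet WITHOUT its first (STN) layer and
# copy the learned weights over, so both models are architecturally
# identical when reconstructing inputs.
stripped = build_barebones_cnn()   # hypothetical builder for the plain CNN
for src, dst in zip(model_stn.layers[1:], stripped.layers):
    dst.set_weights(src.get_weights())
```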

The picture below shows the decoded classes as seen by the barebones CNN. I borrowed this particular architecture from the Keras distro's set of examples, as found in this link. Despite some signal noise, one can easily identify digit-like rotated patterns that closely follow the spatial arrangement of the training data. I believe the extra texture seen in the patterns comes from the network assembling what a generic/universal digit from each class looks like according to its learned weights.

Decoded data
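For reference, the barebones architecture is approximately the one in the Keras `mnist_cnn.py` example of that era (this is also the kind of model the hypothetical `build_barebones_cnn()` above would return; exact filter counts and dropout rates may differ slightly from the version I used):

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Dropout, Flatten, Dense, Activation

model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode='valid', input_shape=(1, 28, 28)))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta',
              metrics=['accuracy'])
```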

Similarly, the decoded classes as seen by the CNN+STN also display digit-like patterns (see the picture below). Compared to the barebones case, a bit more texture has evolved after input reconstruction. Yet rotation effects seem to have been filtered out in the decoded output for this case.

Decoded stn data

This may not appear obvious at first sight but becomes clearer after superimposing both images to generate the animated GIF that follows. Most of the digits show clear rotational realignment with respect to the output generated by the first model, the one lacking the STN module. The effect proved less noticeable for the nine class, but there is no reason to assume the STN module behaves differently in this specific case. Overall, I believe it is safe to conclude that the second model successfully understood physical rotations of the input digits and filtered them out, rendering reconstructed images with minimized rotational noise.

Animated gif
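One simple way to produce such an animation is to alternate the two reconstruction grids as frames, e.g. with imageio (the file names here are illustrative):

```python
import imageio

# Alternate the two reconstruction grids as frames of an animated GIF.
frames = [imageio.imread('decoded_cnn.png'),
          imageio.imread('decoded_cnn_stn.png')]
imageio.mimsave('decoded_comparison.gif', frames, duration=0.8)
```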

I leave it to the reader to perform another straightforward experiment that can be devised to further corroborate my intuition, namely, training the barebones CNN (no STN) on the dataset that contains both the rotated and unrotated MNIST data and then reconstructing the input images. The notebook I used for this project can be accessed in this public repository under my GitHub account.

As general remarks, I should state that despite expecting STN modules in CNN models to mitigate some of the delocalization issues that exist in real-life problems, care must be taken when extrapolating the results of this particular experiment to more complex models/datasets. The MNIST dataset is very idealized in many ways. It contains just a handful of classes with a narrow range of differences between them. Furthermore, the digit images are not accompanied by a changing background, and the digit objects do not suffer from the orientational/perspective variations that result from planar projections of 3-dimensional objects. Never mind the case of adversarial examples. Nevertheless, it would be interesting to see what a newly trained ImageNet VGG16-STN would render at input-reconstruction time compared to the original VGG16.
