Spatial Transformer Network

What is a Spatial Transformer Network?

A spatial transformer network is a specialized type of convoluted neural network, or CNN. Spatial transformer networks contain spatial transformer modules that attempt to make the network spatially invariant to its input data. In essence, a spatial transformer network is used when attempting to stabilize, or clarify an object within a processed image or video. This leads to more accurate object classification and identification.

How does a Spatial Transformer Network work?

As mentioned above, spatial transformer networks are a type of CNN. An important factor of CNNs is their ability to identify an object from an image or video, however sometimes that's not so easy. If a video is taken of an object from an inconsistent viewing angle, CNNs have a hard time distinguishing that object from the background. Therefore, the CNN has to be implemented in such a way as to minimize, or avoid, misidentification. Through the use of the spatial transformer module, the output of the network is invariant to the position, size and rotation of an object in the image. This means, that an object that may have previously been challenging to see, now becomes much easier to identify.

image2017-1-18 19_6_56 (2).png

Image Source

epoch_evolution.gif

Image Source

In the first image, the spatial transformer module is able to crop the image and remove potentially redundant information from the image and focusing on key elements of the image. The second image exemplifies the modules application across several different images. 

Spatial Transformer Networks and Machine Learning

Machine learning uses CNNs often when tasked with image classification and identification. In cases where the object in the image varies in size and position, spatial transformer networks provide the best solution for overcoming the visibility challenges. The tools provided by spatial transformer networks make for more intelligent machine learning algorithms that are better able to identify and classify objects from challenging visual data.