Much of the success of deep neural networks and Deep Learning lies in the careful design of the neural network architecture. Below you can see the top-1 one-crop accuracy plotted against the number of operations required for a single forward pass in several popular neural network architectures.
In 1994, LeNet5, one of the very first convolutional neural networks, was created, and it launched the field of Deep Learning. The LeNet5 architecture was fundamental, in particular the insight that image features are distributed across the entire image, and that convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters.
From 1998 to 2010, neural networks were in incubation. Few people noticed their increasing power, yet many researchers slowly made progress. More and more data became available thanks to the rise of cell-phone cameras and low-cost digital cameras. Computing power was also on the rise: CPUs became faster and GPUs became a general-purpose computing tool. These trends drove the slow but steady progress of neural networks and made the tasks they could accomplish more and more interesting.
In 2012, Alex Krizhevsky released AlexNet, a deeper and much wider version of LeNet, which won the ImageNet competition by a wide margin. It scaled the insights of LeNet into a much larger neural network that could learn much more complex objects and object hierarchies. Its innovations included the use of rectified linear units (ReLU) as non-linearities, overlapping max pooling, and the use of NVIDIA GTX 580 GPUs to reduce training time.
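Two of the ingredients named above, rectified linear units and overlapping max pooling, can be illustrated with a minimal pure-Python sketch. This is not AlexNet's implementation: the helper names `relu` and `max_pool2d` and the 5×5 input are hypothetical, and the window size of 3 with stride 2 is chosen here simply so that neighbouring windows overlap.

```python
def relu(grid):
    """Rectified linear unit applied element-wise: negative values become 0."""
    return [[max(0, v) for v in row] for row in grid]


def max_pool2d(grid, size=3, stride=2):
    """Max pooling over size x size windows; windows overlap when stride < size."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for top in range(0, rows - size + 1, stride):
        out_row = []
        for left in range(0, cols - size + 1, stride):
            window = [grid[top + i][left + j]
                      for i in range(size) for j in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out


# Hypothetical 5x5 feature map with values running from -12 to 12.
feature_map = [[5 * r + c - 12 for c in range(5)] for r in range(5)]
pooled = max_pool2d(relu(feature_map), size=3, stride=2)
print(pooled)  # [[0, 2], [10, 12]]
```

Because the stride (2) is smaller than the window (3), each pair of adjacent windows shares one row or column of inputs, which is what "overlapping" means here.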
At that time, GPUs offered a much larger number of cores than CPUs and allowed roughly 10 times faster training, which in turn made it possible to use larger datasets and larger images.
The success of AlexNet started a revolution. Convolutional neural networks became the workhorse of Deep Learning, which became known as “big neural networks which can accomplish useful things.”
Network-in-Network (NiN) had the great yet simple insight of using 1×1 convolutions to give more combinational power to the features of the convolutional layers.
The NiN architecture uses MLP layers after each convolution to better combine features before the next layer. You may think that 1×1 convolutions conflict with the original ideas of LeNet, but in fact they help combine convolutional features in a way that is not possible by simply stacking more convolutional layers. This is not the same as feeding raw pixels into the next layer. Here the 1×1 convolutions combine features across feature maps at each spatial location, so they use very few parameters, shared across all pixels.
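To make the idea concrete, here is a minimal pure-Python sketch of a 1×1 convolution: at every pixel, the same small weight matrix mixes the values of all input feature maps. The function name `conv1x1`, the channel-first list layout, and the toy numbers are assumptions made for illustration, not part of NiN itself.

```python
def conv1x1(feature_maps, weights):
    """Apply a 1x1 convolution.

    feature_maps[c][y][x] holds input channel c; weights[o][c] maps input
    channel c to output channel o. The same weights are reused at every pixel.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            [
                sum(w * feature_maps[c][y][x] for c, w in enumerate(row))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for row in weights
    ]


inputs = [
    [[1, 2], [3, 4]],        # channel 0
    [[10, 20], [30, 40]],    # channel 1
]
weights = [
    [1, 1],    # output 0 = channel 0 + channel 1
    [1, -1],   # output 1 = channel 0 - channel 1
]
outputs = conv1x1(inputs, weights)
print(outputs)  # [[[11, 22], [33, 44]], [[-9, -18], [-27, -36]]]

# Parameter count depends only on the channel counts, not on the image size.
num_parameters = len(weights) * len(weights[0])
```

The layer has only 2 × 2 = 4 weights here, and that number would stay the same for an image of any height and width.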
The power of the MLP can greatly increase the effectiveness of individual convolutional features by combining them into more complex groups. The same idea was later used by ResNet and Inception. NiN also used an average pooling layer as part of the last classifier, a practice that later became common.
GoogLeNet and Inception
Christian Szegedy from Google began a quest to reduce the computational demand of deep neural networks, and created GoogLeNet, the first Inception architecture.
By the fall of 2014, when Christian Szegedy created GoogLeNet, deep learning models had become very useful for categorizing the content of images and video frames. Even the most ardent skeptics conceded that Deep Learning and neural nets were here to stay. Since these techniques were so useful to internet giants such as Google, those companies became very interested in efficient, large-scale deployment of these architectures on their servers.
Christian Szegedy’s goal was to reduce the computational demand of deep neural nets while retaining state-of-the-art performance, or at the very least to keep the computational cost at its current level. With this goal in mind, he and his team created the Inception module:
At first glance this looks like a parallel combination of 1×1, 3×3 and 5×5 convolutional filters. On closer examination, however, 1×1 convolutional blocks are used to reduce the number of features before the expensive parallel blocks, which is what we refer to today as a “bottleneck.”
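The structure described above can be sketched as a small bookkeeping model: three parallel branches (1×1, 3×3, 5×5), where the larger kernels are optionally preceded by a 1×1 reduction, and the branch outputs are concatenated. The `Branch` class and every width below (192 input features, the per-branch output and reduction sizes) are illustrative assumptions, not figures from the text, and the sketch counts multiply-accumulates per output pixel rather than running real convolutions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    kernel: int                      # spatial size of the main convolution
    out_features: int                # features produced by the branch
    reduce_to: Optional[int] = None  # width of the 1x1 reduction, if any

    def macs_per_pixel(self, in_features: int) -> int:
        """Multiply-accumulates per output pixel for this branch."""
        if self.reduce_to is None:
            return in_features * self.out_features * self.kernel ** 2
        reduction = in_features * self.reduce_to  # 1x1 bottleneck
        main = self.reduce_to * self.out_features * self.kernel ** 2
        return reduction + main


IN_FEATURES = 192  # illustrative assumption

with_bottleneck = [
    Branch(kernel=1, out_features=64),
    Branch(kernel=3, out_features=128, reduce_to=96),
    Branch(kernel=5, out_features=32, reduce_to=16),
]
# Same branches with the 1x1 reductions removed, for comparison.
naive = [Branch(b.kernel, b.out_features) for b in with_bottleneck]

out_features = sum(b.out_features for b in with_bottleneck)  # concatenation
macs_reduced = sum(b.macs_per_pixel(IN_FEATURES) for b in with_bottleneck)
macs_naive = sum(b.macs_per_pixel(IN_FEATURES) for b in naive)
print(out_features, macs_reduced, macs_naive)  # 224 157184 387072
```

With these assumed widths, the module produces the same 224 output features either way, but the version with 1×1 reductions needs well under half the multiply-accumulates.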
The “bottleneck” layer deserves a closer look of its own, which follows below. GoogLeNet uses a stem without Inception modules as its first layers, and an average pooling plus softmax classifier similar to NiN. This classifier requires a very low number of operations compared to those of AlexNet and VGG.
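A pooling-plus-softmax classifier of the kind mentioned above can be sketched in a few lines. This is a minimal illustration in the NiN style, assuming global average pooling and one final feature map per class; the helper names and the toy 2×2 maps are hypothetical, and any further layers a real network places before the softmax are omitted.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each feature map (a 2D list) to its mean value."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(scores):
    """Convert scores to probabilities; subtract the max for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical final feature maps, one 2x2 map per class (3 classes).
class_maps = [
    [[0.0, 1.0], [1.0, 2.0]],   # mean 1.0
    [[3.0, 3.0], [3.0, 3.0]],   # mean 3.0
    [[0.0, 0.0], [0.0, 0.0]],   # mean 0.0
]
scores = global_average_pool(class_maps)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The pooling step has no learned weights at all, which is one reason a classifier built this way costs so few operations compared with large fully connected layers.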
The bottleneck layer of Inception was inspired by NiN. It reduces the number of features, and therefore operations, at each layer, which keeps the inference time low. The number of features is reduced by roughly a factor of 4 before the data is passed to the expensive convolution modules. This led to significant savings in computational cost and contributed to the success of this architecture.
Let’s look at this more closely. Imagine you have 252 features coming in and 252 coming out, and the Inception layer performs only 3×3 convolutions. That is 252×252×3×3 convolutions to perform, or almost 600,000 multiply-accumulate operations. This may be more than the computational budget we have, say, to run this layer in 0.5 milliseconds on a Google server. Instead, we can reduce the number of features that must be convolved to 63, or 252/4. First we perform a 1×1 convolution from 252 to 63 features, then a 3×3 convolution on 63 features in each branch, and finally another 1×1 convolution from 63 back to 252 features. In total, that is about 70,000 operations instead of the almost 600,000 we had before.
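The arithmetic above can be checked with a short script. It counts multiply-accumulate operations per output pixel, using 252 input and output features and a reduction to 63 as in the text; the helper name `conv_ops` is a hypothetical choice for this sketch.

```python
def conv_ops(in_features: int, out_features: int, kernel: int) -> int:
    """Multiply-accumulates per output pixel for a kernel x kernel convolution."""
    return in_features * out_features * kernel * kernel


FEATURES = 252            # features coming in and going out
REDUCED = FEATURES // 4   # 63 features inside the bottleneck

# Direct 3x3 convolution on all 252 features.
direct = conv_ops(FEATURES, FEATURES, 3)

# Bottleneck: 1x1 reduce, 3x3 on the reduced features, 1x1 expand.
reduce_step = conv_ops(FEATURES, REDUCED, 1)
conv_step = conv_ops(REDUCED, REDUCED, 3)
expand_step = conv_ops(REDUCED, FEATURES, 1)
bottleneck = reduce_step + conv_step + expand_step

print(direct)      # 571536
print(bottleneck)  # 67473
```

The three bottleneck steps cost 15,876, 35,721 and 15,876 operations respectively, so the total is roughly 8.5 times smaller than the direct 3×3 convolution.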
Even though we are doing fewer operations, we do not lose generality in this layer. The bottleneck layers perform remarkably well on the ImageNet dataset. The reason for this success is that the input features are correlated, so redundancy can be removed by combining them appropriately with the 1×1 convolutions. Then, after a convolution with a smaller number of features, they can be expanded again into a meaningful combination for the next layer.
As for the future, we believe that designing neural network architectures is critically important to the progress of the Deep Learning field. One might ask why we have to invest so much time in crafting architectures by hand instead of using data to tell us what to use and how to combine modules. Although this would be very helpful, it remains a work in progress. It is also important to note that so far we have only talked about architectures for computer vision. Neural network architectures have been developed in other areas as well, and it is interesting to study how they evolved for all the other tasks.
The future is sure to bring many more innovations in the world of neural network architectures. These innovations will also have practical applications, making all of our devices smarter and easier to use.