Deep Learning


Fig1: Overview of AI
Deep learning is a subset of machine learning: a class of algorithms that use a cascade of layers of processing units, that is, multilayer neural networks, to extract features from data.

Fig2: Deep overview of AI
Artificial intelligence contains machine learning as a subset, and machine learning in turn contains deep learning. Deep learning is a form of representation learning, which is itself a kind of machine learning. Fig2 labels a few popular examples of each subset; well-known application areas include computer vision, speech and audio processing, natural language processing (NLP) and many more. MLP stands for Multi-Layer Perceptron.
The two main concepts of deep learning for computer vision are convolutional neural networks and backpropagation. Researchers already understood both well in 1989, although backpropagation was not widely understood in those days. A few years later, in 1997, the Long Short-Term Memory (LSTM) algorithm, which is fundamental to deep learning for timeseries, was developed. Using neural networks, deep learning can pick out cats in images, even blurred ones, classify people in images as male or female, tell birds from aeroplanes, and much more. Deep learning has achieved the following breakthroughs:

•Near-human-level image classification
•Near-human-level speech recognition
•Near-human-level handwriting transcription
•Improved machine translation
•Improved text-to-speech conversion
•Digital assistants such as Google Now and Amazon Alexa
•Near-human-level autonomous driving
•Improved search results on the web
(source: Deep Learning with Python by Francois Chollet)
In deep learning, these layered representations are learned via models called neural networks. A neural network has an input layer and an output layer; the number of intermediate (hidden) layers depends on the application.

Fig4: Logical Computations with Neurons
This is a different kind of artificial neuron, called a threshold logic unit (TLU) or linear threshold unit (LTU). The TLU computes a weighted sum of its inputs, z = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n = x^T w, then applies a step function to that sum and outputs the result: h_w(x) = step(z).

Fig5: Threshold Logic Unit
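The threshold logic unit described above can be sketched in a few lines of Python. This is only an illustration: the function names, the bias term b and the AND weights are assumptions chosen for the example, not part of the original formula.

```python
def step(z):
    """Heaviside step function: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def tlu(x, w, b=0.0):
    """Threshold logic unit: step function applied to the weighted sum."""
    # z = x^T w, plus a bias term b (an assumption added for this sketch)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(z)


# Hypothetical weights and bias that make the unit act as a logical AND.
W_AND = [1.0, 1.0]
B_AND = -1.5

and_outputs = [tlu(x, W_AND, B_AND) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above zero, so the unit fires for that input alone.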
This is an artificial neural network with an input layer, an output layer and one hidden layer. It is also called a neural network with a single hidden layer.
Ex: a Multi-Layer Perceptron (MLP) with one hidden layer

Fig6: Neural Network with single Hidden layer
This is a multi-layer artificial neural network with an input layer, an output layer and many hidden layers. It is also called a deep neural network, or a neural network with many hidden layers.

Fig7: Deep Neural Network
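As a rough sketch of how such a network turns an input into an output, the code below passes one input through a stack of fully connected layers. The layer sizes, the random weights and the sigmoid activation are all assumptions made for illustration; a real network would learn its weights from data.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One fully connected layer: a weight row and a bias per output unit."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, layers):
    """Pass the input through each layer in turn and return the final output."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


rng = random.Random(0)
# Hypothetical shape: 3 inputs, two hidden layers of 4 units, 1 output.
layer_sizes = [3, 4, 4, 1]
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

output = forward([0.5, -1.2, 3.0], layers)
print(len(layers), output)
```

Adding more entries to layer_sizes makes the network deeper without changing the forward function.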
The three main reasons for the move to deep learning are
•Datasets and benchmarks
•Algorithmic advances
•Hardware
If deep learning is the vehicle of this revolution, then data is its fuel. Nowadays little can be done without data. For example, large companies work with very large image, video and natural language datasets, which they may have collected themselves in addition to the data available on the internet. These datasets are generally stored in the cloud or in company databases. User-generated image tags on Flickr, for instance, have been a rich source of data for computer vision, and Wikipedia is a key dataset for natural language processing (NLP).
Until the late 2000s, neural networks were still fairly shallow, using only one or two layers of representations, so they could not outshine more refined shallow methods such as support vector machines (SVM) and random forests.
Around 2009-2010, several algorithmic improvements allowed better gradient propagation, which made it possible to train neural networks with many more layers:
•Better activation functions for neural layers.
•Better weight-initialization schemes.
•Better optimization schemes such as RMSProp.
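To make the last item concrete, here is a minimal sketch of the RMSProp update rule applied to a one-parameter toy cost. The cost function, the hyperparameter values and the function name are assumptions chosen for this example, not settings taken from any particular library.

```python
import math


def rmsprop_minimize(grad_fn, w, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    """Minimise a one-parameter cost with the RMSProp update rule."""
    avg_sq = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        # Dividing by the root of the running average normalises the step size.
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w


# Toy cost (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = rmsprop_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```

Because each step is scaled by the recent gradient magnitude, the step size stays roughly the same whether the raw gradient is large or small.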
Key properties of deep learning include simplicity, scalability, versatility and reusability.
Between 1990 and 2010, CPUs became faster by a factor of approximately 5000. As a result, it is now possible to run small deep-learning models on an ordinary laptop. Typical deep-learning models used in computer vision need more computational power, which is provided by GPUs from vendors such as NVIDIA and AMD. Large companies train deep-learning models on clusters of hundreds of GPUs of a type developed for the needs of deep learning, such as the NVIDIA Tesla K80. (source: Deep Learning with Python by Francois Chollet).

Generally, people working in deep learning first try to understand their problem, then look for data, then estimate the network's layers, inputs and desired output. Finally, they choose an algorithm. So, if you are not sure which deep net to choose, I am here to help you.
1. If the user has unlabelled data (unsupervised learning), then:
Required                 Your Choice
Feature Extraction       RBM or Autoencoders
Unsupervised Learning    RBM or Autoencoders
Pattern Recognition      RBM or Autoencoders
(RBM: Restricted Boltzmann Machine)
2. If the user has labelled data (supervised learning), then:
Required                 Your Choice
Text Processing          RNTN, RNN
Image Recognition        DBN, CNN
Object Recognition       RNTN, CNN
Speech Recognition       RNN
(RNTN: Recursive Neural Tensor Network, DBN: Deep Belief Network, RNN: Recurrent Neural Network, CNN: Convolutional Neural Network)
3. For more examples:
•Natural language processing: RNTN or RNN
•Classification: MLP with ReLU activations
•Time Series Analysis: RNN
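The recommendations above can also be written as a small lookup helper. This is just one way to encode the lists; the dictionary and function names are hypothetical, and the fallback to the general examples is a design choice made for this sketch.

```python
# Recommendations transcribed from the lists above. The names used here are
# hypothetical and exist only for this illustration.
UNLABELLED = {
    "feature extraction": ["RBM", "Autoencoder"],
    "unsupervised learning": ["RBM", "Autoencoder"],
    "pattern recognition": ["RBM", "Autoencoder"],
}
LABELLED = {
    "text processing": ["RNTN", "RNN"],
    "image recognition": ["DBN", "CNN"],
    "object recognition": ["RNTN", "CNN"],
    "speech recognition": ["RNN"],
}
MORE_EXAMPLES = {
    "natural language processing": ["RNTN", "RNN"],
    "classification": ["MLP with ReLU"],
    "time series analysis": ["RNN"],
}


def suggest_networks(task, labelled=True):
    """Return the suggested network types for a task, or [] if it is unknown."""
    key = task.strip().lower()
    primary = LABELLED if labelled else UNLABELLED
    if key in primary:
        return primary[key]
    # Fall back to the general examples when the task is not in the main table.
    return MORE_EXAMPLES.get(key, [])


print(suggest_networks("Image Recognition"))
print(suggest_networks("Feature Extraction", labelled=False))
```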
An old problem - The Vanishing Gradient:
To train a neural network over a large set of labelled data, you must continuously compute the difference between the network’s predicted output and the actual output. This difference is called the cost, and the process for training a net is known as backpropagation, or backprop. During backprop, weights and biases are tweaked slightly until the lowest possible cost is achieved. An important aspect of this process is the gradient, which is a measure of how much the cost changes with respect to a change in a weight or bias value.
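The idea of tweaking weights and biases to lower the cost can be illustrated with a single linear neuron trained by gradient descent. This is a deliberately simplified sketch rather than full backpropagation through many layers; the toy data, the learning rate and the function names are assumptions made for the example.

```python
def predict(w, b, x):
    return w * x + b


def cost(w, b, data):
    """Mean squared difference between predicted and actual outputs."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)


def gradients(w, b, data):
    """How much the cost changes with respect to the weight and the bias."""
    n = len(data)
    dw = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
    db = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
    return dw, db


# Toy labelled data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed value, small enough for stable updates here

initial_cost = cost(w, b, data)
for _ in range(2000):
    dw, db = gradients(w, b, data)
    w -= learning_rate * dw  # move each parameter against its gradient
    b -= learning_rate * db
final_cost = cost(w, b, data)

print(initial_cost, final_cost, w, b)
```

Each update moves the weight and the bias a little in the direction that reduces the cost, which is why larger gradients lead to faster training.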
Backprop suffers from a fundamental problem known as the vanishing gradient. During training, the gradient decreases in value back through the net. Because higher gradient values lead to faster training, the layers closest to the input layer take the longest to train. Unfortunately, these initial layers are responsible for detecting the simple patterns in the data, while the later layers help to combine the simple patterns into complex patterns. Without properly detecting simple patterns, a deep net will not have the building blocks necessary to handle the complexity. This problem is the equivalent of trying to build a house without the proper foundation.
So, what causes the gradient to decay back through the net? Backprop, as the name suggests, requires the gradient to be calculated first at the output layer, then backwards across the net to the first hidden layer. Each time the gradient is calculated, the net must compute the product of all the previous gradients up to that point. When these gradients are fractions between 0 and 1 (as with the sigmoid activation, whose derivative is at most 0.25), their product is a smaller fraction, so the gradient continues to shrink.
For example, if the first two gradients are one fourth and one third, then the next gradient would be one fourth of one third, which is one twelfth. The following gradient would be one twelfth of one fourth, which is one forty-eighth, and so on. Since the layers near the input layer receive the smallest gradients, the net would take a very long time to train. As a result, the overall accuracy would suffer.
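The arithmetic in this example can be checked with exact fractions. The sketch below multiplies the gradients from the example one layer at a time; the ten-layer case at the end is an extra assumption added to show how quickly the product shrinks.

```python
from fractions import Fraction

# Gradients from the example above: one fourth, one third, then one fourth.
local_gradients = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 4)]

running = Fraction(1)
products = []
for g in local_gradients:
    running *= g  # each layer multiplies in one more fraction
    products.append(running)

print(products)  # [Fraction(1, 4), Fraction(1, 12), Fraction(1, 48)]

# Hypothetical deeper case: ten layers, each with a gradient of one fourth.
ten_layer_product = Fraction(1, 4) ** 10
print(ten_layer_product)  # 1/1048576
```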
(source: YouTube)