# Understanding Deep Learning

"It is like a voyage of discovery, seeking not for a new territory but new knowledge. It should appeal to those with a good sense of adventure," Dr. Frederick Sanger.

I hope every reader enjoys this voyage in deep learning and finds their adventure.

Dr. Chitta Ranjan, Author.


# GitHub

Repository for the book Understanding Deep Learning: Application in Rare Event Prediction

# Video Lectures

## Chapter 4 Multilayer Perceptrons

Despite all the advancements, MLPs are still actively used. They are the “hello world” of deep learning. Like linear regression in machine learning, the MLP is one of the immortal methods that remains active due to its robustness.

...

Multi-layer perceptrons are possibly among the most visually illustrated neural networks. Yet most of these illustrations lack a few fundamental explanations. Since MLPs are the foundation of deep learning, this section attempts to provide a clearer perspective.

...

A single perceptron works like a neuron in a human brain. It takes multiple inputs and, like a neuron emitting an electric pulse, emits a binary pulse that is treated as a response. The neuron-like behavior of perceptrons, and an MLP being a network of perceptrons, perhaps gave rise to the term neural networks in the early days.
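To make the idea concrete, a single perceptron can be sketched as a weighted sum of the inputs followed by a binary step activation. The weights and bias below are hypothetical, chosen so that the perceptron behaves as an AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    """A single perceptron: a weighted sum of the inputs followed by a
    step activation that emits a binary pulse (0 or 1)."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias that make the perceptron an AND gate.
w, b = np.array([1.0, 1.0]), -1.5
outputs = [perceptron(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Only the last input, (1, 1), pushes the weighted sum above the threshold, so the perceptron fires for it alone.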

...

Multi-layer perceptrons are complex nonlinear models. This chapter unfolds MLPs to simplify and explain their fundamentals. The section shows that an MLP is a collection of simple regression models, one placed on every node in each layer. How they come together with nonlinear activations to deconstruct and solve complex problems becomes clearer in this section.

...

The input layer is followed by a stack of hidden layers up to the last (output) layer. These layers perform the “complex” interconnected nonlinear operations. Although perceived as “complex,” the underlying operations are rather simple arithmetic computations.

...

The operation here is called a tensor operation. Tensor is a term used for any multi-dimensional array. Tensor operations are computationally efficient (especially on GPUs); hence, most steps in deep learning layers use them instead of iterative loops.
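As an illustrative sketch (not the book's code), the forward pass of a dense layer can be computed either with explicit loops or as a single tensor operation; both produce the same result, but the tensor form is what deep learning layers actually execute:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # a batch of 4 samples with 3 features
W = rng.standard_normal((3, 2))   # weights mapping 3 inputs to 2 nodes
b = np.zeros(2)                   # biases for the 2 nodes

# Loop version: iterate over every sample and every node.
loop_out = np.empty((4, 2))
for i in range(4):
    for j in range(2):
        loop_out[i, j] = X[i] @ W[:, j] + b[j]

# Tensor version: one matrix multiplication replaces both loops.
tensor_out = X @ W + b
```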

...

It is the nonlinear activation that dissociates the feature map of one layer from another. Without the activation, the feature map output from every layer would be just a linear transformation of the previous one. This would mean the subsequent layers provide no additional information for a better prediction.
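A small numerical sketch of this point (illustrative only): two stacked linear layers collapse into a single linear transformation, whereas inserting a relu between them breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((3, 2))

# Two layers without activation are just one linear transformation.
two_linear_layers = (X @ W1) @ W2
one_linear_layer = X @ (W1 @ W2)   # identical combined weights

# A nonlinear activation between the layers breaks the equivalence.
relu = lambda z: np.maximum(z, 0.0)
with_activation = relu(X @ W1) @ W2
```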

...

Put simply, backpropagation is an extension of the iterative stochastic gradient-descent approach to train multi-layer deep learning networks. It is explained using a single-layer perceptron, otherwise known as logistic regression. The estimation approach in backpropagation is repeated on every layer. It can be imagined as updating/learning one layer at a time in the reverse order of prediction.
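As a hedged illustration of these updates on the single-layer case, here is a simplified full-batch gradient-descent sketch of logistic regression on synthetic data (not the book's implementation):

```python
import numpy as np

# Synthetic, linearly separable data: label is 1 when x1 + x2 > 0.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid prediction
    grad_w = X.T @ (p - y) / len(y)          # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                         # move parameters toward optimum
    b -= lr * grad_b

accuracy = np.mean(((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y)
```

In a deeper network, backpropagation repeats this kind of gradient update layer by layer, in the reverse order of prediction.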

...

An end-to-end construction of a network and its evaluation is given with granular details including data preparation, viz. curve shifting for early prediction, data splitting, and feature scaling.

Dropout is a useful technique (not limited to multi-layer perceptrons) that resolves the co-adaptation issue in deep learning. How dropout addresses it and regularizes a network is given.

Activation functions are one of the most critical constructs in deep learning. Network performance is usually sensitive to activations due to their vanishing or exploding gradients. An understanding of activations is provided, along with the story of how activations led to discoveries such as the non-decaying gradient, saturation region, and self-normalization.

Besides, a few customizations in the TensorFlow implementation are shown for a new thresholded exponential linear unit (telu) activation. Lastly, deep learning networks have several configurations and numerous choices for them, e.g., the number of layers, their sizes, the activations on them, and so on. To make construction simpler, the chapter concludes with a few rules of thumb.

...

For early prediction, curve shifting moves the labels earlier in time. In doing so, the samples before the rare event get labeled as one. These prior samples are assumed to be the transitional phase that ultimately leads to the rare event.
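A minimal numpy sketch of the idea on a hypothetical label series (the book's exact procedure may differ); the k samples preceding the event are relabeled as one:

```python
import numpy as np

# Hypothetical labels: a rare event (1) occurs at time index 5.
y = np.array([0, 0, 0, 0, 0, 1, 0, 0])

# Curve shifting: also label the k samples before the event as 1,
# treating them as the transitional phase leading to the event.
k = 2
y_shifted = y.copy()
for i in range(1, k + 1):
    y_shifted[:-i] |= y[i:]
```

Here y_shifted becomes [0, 0, 0, 1, 1, 1, 0, 0]: indices 3 and 4, just before the event, are now positive samples.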

...

During batch processing, all the model parameters (weights and biases) are updated. Simultaneously, the states of a metric are updated. Upon processing all the batches in an epoch, both the estimated parameters and the computed metrics are returned. Note that all these operations are enclosed within an epoch and no values are communicated between two epochs.

...

Looking at suitably chosen metrics for a problem tremendously increases the ability to develop better models. Although a metric does not directly improve model training, it helps in better model selection. Several metrics are available outside TensorFlow, such as in sklearn. However, they cannot be used directly during model training in TensorFlow. This is because the metrics are computed while processing batches during each training epoch.

...

Fortunately, TensorFlow provides the ability for this customization. The custom-defined metrics F1Score and FalsePositiveRate are provided in the user-defined performance metrics library. Learning the programmatic context for the customization is important and, therefore, is elucidated here.
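To convey the programmatic context, below is a framework-free sketch in the style of a stateful tf.keras.metrics.Metric (the class is illustrative, not the book's implementation): states accumulate across batches via update_state() and the epoch-level value is computed by result():

```python
import numpy as np

class F1Score:
    """Sketch of a stateful F1 metric: accumulates true/false positives
    and false negatives over batches, then computes F1 at epoch end."""

    def __init__(self):
        self.reset_states()

    def reset_states(self):
        self.tp = self.fp = self.fn = 0

    def update_state(self, y_true, y_pred):
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        self.tp += int(np.sum((y_pred == 1) & (y_true == 1)))
        self.fp += int(np.sum((y_pred == 1) & (y_true == 0)))
        self.fn += int(np.sum((y_pred == 0) & (y_true == 1)))

    def result(self):
        precision = self.tp / max(self.tp + self.fp, 1)
        recall = self.tp / max(self.tp + self.fn, 1)
        return 2 * precision * recall / max(precision + recall, 1e-12)

# Two batches processed within one epoch update the same states.
metric = F1Score()
metric.update_state([1, 0, 1], [1, 0, 0])
metric.update_state([0, 1, 0], [0, 1, 1])
f1 = metric.result()
```

A batch-wise sklearn call could not accumulate these states across batches, which is why custom metrics are defined inside the framework instead.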

...

If all the weights in a deep learning network are learned together, it is usual that some of the nodes have more predictive capability than others. In such a scenario, as the network is trained iteratively, these powerful (predictive) nodes start to suppress the weaker ones. These nodes usually constitute a fraction of all the nodes. Over many iterations, only these powerful nodes are trained while the rest stop participating. This phenomenon is called co-adaptation.

...

Dropout changed the approach of learning weights. Instead of learning all the network weights together, dropout trains a subset of them in a batch training iteration.
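A sketch of this per-batch masking (illustrative; it uses the common inverted-dropout rescaling so the expected activation is unchanged):

```python
import numpy as np

rng = np.random.default_rng(3)
h = rng.standard_normal((4, 6))     # activations of one hidden layer

rate = 0.5                          # fraction of nodes to drop
mask = rng.random(h.shape) >= rate  # a fresh random mask per batch

# Only the surviving nodes participate in this batch's update; the
# division rescales them so the expected activation stays the same.
h_drop = h * mask / (1.0 - rate)
```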

...

Dropout is a regularization technique. It is close to L2 regularization. It is shown mathematically that, under linearity (activation) assumptions, the loss function with dropout has the same form as that with L2 regularization.

...

Deep learning networks are learned with backpropagation. Backpropagation methods are gradient-based. The gradient guides the parameter to its optimal value. An apt gradient is, therefore, critical for the parameter’s journey to the optimal.

...

The gradient-based learning iteratively estimates the model. In each iteration, the parameter is moved “closer” to its optimal value.

A gradient that is too small causes the vanishing gradient issue. On the other extreme, sometimes the gradient is massive. This is the exploding gradient phenomenon. Both issues make reaching the optimal parameter values rather elusive.
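A toy numerical sketch of the two extremes, under the simplifying assumption of one multiplicative gradient factor per layer:

```python
import numpy as np

# Backpropagation multiplies one gradient factor per layer. Factors
# below 1 shrink the gradient exponentially (vanishing); factors
# above 1 grow it exponentially (exploding).
layers = 50
vanishing = np.prod(np.full(layers, 0.5))   # 0.5 ** 50: nearly zero
exploding = np.prod(np.full(layers, 1.5))   # 1.5 ** 50: enormous
```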

...

The vanishing and exploding gradient issues were becoming a bottleneck in developing complex and large neural networks. They were first resolved to some extent with the rectified linear unit (relu) and leaky-relu in Maas, Hannun, and Ng 2013.

...


Virtually every problem has more than one feature, and the features can have different ranges of values. For example, a paper manufacturing process has temperature and moisture features. Their units differ, and so their values lie in different ranges. These differences may not pose theoretical issues. But, in practice, they cause difficulty in model training, typically by converging at local minima.
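A minimal sketch of standard scaling on hypothetical temperature and moisture readings (the values are made up for illustration):

```python
import numpy as np

# Hypothetical features on very different ranges: temperature
# (hundreds) and moisture (fractions).
X = np.array([[310.0, 0.12],
              [295.0, 0.08],
              [305.0, 0.10]])

# Standard scaling brings every feature to mean 0 and variance 1,
# which typically eases gradient-based training.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma
```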

...

An end-to-end construction of a network and its evaluation is then given with granular details on data preparation, viz. curve shifting for early prediction, data splitting, and feature scaling. Thereafter, every construction element, e.g., layers, activations, evaluation metrics, and optimizers, is explained.

...

## Chapter 3 Setup

TensorFlow 2 was released in 2019 and is expected to change the landscape of deep learning. It has made model building simpler and production deployment on any platform more robust, and it enables powerful experimentation for research. With these, TF 2 is likely to propel deep learning to mainstream applications in research and industry alike.

...

Using the Google Colab environment is an alternative to this installation. Google Colab is generally an easier way to work with TensorFlow. It is a notebook on Google Cloud with all the TensorFlow requisites pre-installed.

...

A paper sheet-break problem in paper manufacturing is taken from Ranjan et al. 2018 as a working example in this book. The data is a two-minute frequency multivariate time series. The system’s status (the response) with regard to normal versus break is present, with values 0 and 1.

# Articles

We have been pumped with the adage of the modern world, “follow your passion.” The questions are, what is passion? How to find one’s passion?

While the adage is clear, the answers to these aren’t. Without these answers, I have seen data scientists pursuing an apparition confused as passion.

In this article, we will get a starting point to build an initial Neural Network. We will learn the rules of thumb, e.g., the number of hidden layers, number of nodes, activation, etc., and see the implementations in TensorFlow 2.

Here we will learn the desired properties in Autoencoders derived from their similarity with PCA. From that, we will build custom constraints for Autoencoders in Part II for tuning and optimization.

In continuation of Part I, here we will define and implement custom constraints for building a well-posed Autoencoder. A well-posed Autoencoder is a regularized model that improves the test reconstruction error.

Due to its ease of use, efficiency, and cross-compatibility, TensorFlow 2 is going to change the landscape of Deep Learning. Here we will learn to install and set it up. We will also implement MNIST classification with TensorFlow 2.

Here we will break down an LSTM autoencoder network to understand it layer by layer. We will go over the input and output flow between the layers, and also compare the LSTM Autoencoder with a regular LSTM network.

Here, we learn the fundamentals behind the Kernel Trick. How does it work? How does the Kernel Trick compute the dot product (or similarity) in an infinite dimension without an increase in computation?

Here we will understand the mathematics that drives Dropout. How does it lead to regularization? Why does a dropout rate of 0.5 lead to the most regularization? What is Gaussian-Dropout?

Autoencoders are modeled to reconstruct the input by learning their latent features. In this post, we will learn how to implement an autoencoder for building a rare-event classifier.

Here we will use an SGT embedding that embeds the long- and short-term patterns in a sequence into a finite-dimensional vector. The advantage of SGT embedding is that we can easily tune the amount of long-/short-term patterns without increasing the computation.

In this post, a new nonlinear correlation estimator, nlcor, is demonstrated. This estimator is useful in data exploration and also in variable selection for nonlinear predictive models, such as SVM.

# Podcast

In this ProcessMiner University podcast episode, Tom Tulloch chats with Dr. Chitta Ranjan, Director of Science at ProcessMiner, about his early life and education, what inspired him to write his book, "Understanding Deep Learning - Application in Rare Event Prediction," and the role deep learning plays in the evolution of artificial intelligence.

Those who know Dr. Ranjan know he has a keen ability to take complex ideas and simplify them for all to understand... something he talks about during this session.

**Fun fact**

Throughout his experience writing this book, Dr. Ranjan became an unexpected expert chef, which he also delves into during this talk. Listeners will walk away feeling inspired, motivated to learn something new, and maybe a little hungry.

# Citation

```bibtex
@book{ranjan2020understanding,
  doi       = {10.13140/RG.2.2.34297.49765},
  year      = {2020},
  month     = {Dec},
  author    = {Chitta Ranjan},
  title     = {Understanding Deep Learning: Application in Rare Event Prediction},
  publisher = {Connaissance Publishing},
  note      = {URL: \url{www.understandingdeeplearning.com}}
}
```

# Testimonials

"This is the one of the best deep learning book that I have read. Dr. Ranjan has done a very good job in explaining core deep learning concepts without trying to incorporate too much. This is a fantastic book for both practitioners and researchers who want to understand and use deep learning in their respective fields. I find the illustrations and rule of thumbs are very very clear and useful. The bit that has been surprising (in a good way) is that Dr. Ranjan has found a way to incorporate some of the theoretical justifications for certain deep learning techniques. The part I enjoy the most (as a statistician) is around why pooling works for Convolutional Neural Nets. As far as I am concerned, this is the very first time I saw such clear and elegant explanation in any deep learning textbook."

- by **Zhen Zu, Stanford University.**

"This is an excellent textbook on Deep Learning with a focus on rare events. Chitta explains 'what' the primary constructs of deep learning are with clear visual illustrations. I particularly enjoy the details of the motivation and intuitions on the 'how' and 'why' we make different choices in building deep learning models. I personally learned a tremendous amount and would recommend this book to anyone who wants to dig deeper into the subject."

- by **Jason Wu, Ph.D., Partner, Ultrabright Education.**