Understanding Deep Learning

"It is like a voyage of discovery, seeking not for a new territory but new knowledge. It should appeal to those with a good sense of adventure," Dr. Frederick Sanger.

I hope every reader enjoys this voyage in deep learning and finds their adventure.

Dr. Chitta Ranjan, Author.

understanding-deep-learning-chitta-ranjan-toc.pdf

Recent

Chapter 4 - Part 1 - Multilayer Perceptrons Background

Rules-of-thumb for building a Neural Network

Convolutional Neural Networks

Resources

Github repository for the book Understanding Deep Learning: Application in Rare Event Prediction


Video Lectures

Chapter 4 Multilayer Perceptrons

Despite all the advancements, MLPs are still actively used. They are the "hello world" of deep learning. Like linear regression in machine learning, the MLP is one of the immortal methods that remains in use due to its robustness.

...

Multi-layer perceptrons are possibly one of the most visually illustrated neural networks. Yet most of these illustrations lack a few fundamental explanations. Since MLPs are the foundation of deep learning, this section attempts to provide a clearer perspective.

...

A single perceptron works like a neuron in a human brain. It takes multiple inputs and, like a neuron emits an electric pulse, a perceptron emits a binary pulse which is treated as a response. The neuron-like behavior of perceptrons, and an MLP being a network of perceptrons, perhaps led to the term "neural networks" in the early days.

...

Multi-layer perceptrons are complex nonlinear models. This chapter unfolds MLPs to simplify and explain their fundamentals. The section shows that an MLP is a collection of simple regression models placed on every node in each layer. How they come together with non-linear activations to deconstruct and solve complex problems becomes clearer in this section.

...

The input layer is followed by a stack of hidden layers till the last (output) layer. These layers perform the “complex” interconnected nonlinear operations. Although perceived as “complex,” the underlying operations are rather simple arithmetic computations.

...

The operation here is called a tensor operation. A tensor is a general term for a multi-dimensional array. Tensor operations are computationally efficient (especially on GPUs); hence, most steps in deep learning layers use them instead of iterative loops.
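To make this concrete, here is a minimal NumPy sketch (illustrative only, with made-up numbers) of the tensor operation behind a dense layer: one matrix multiplication replaces the nested loops over samples and nodes.

import numpy as np

# A batch of 2 samples with 4 features each, feeding a layer of 3 nodes.
x = np.array([[0.5, 1.0, -0.2, 0.3],
              [1.5, -0.7, 0.8, 0.0]])   # shape (2, 4)
W = np.random.randn(4, 3)               # layer weights, shape (4, 3)
b = np.zeros(3)                          # layer biases, shape (3,)

# One tensor operation computes the pre-activation output for the whole batch.
z = x @ W + b                            # shape (2, 3)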

...

It is the nonlinear activation that dissociates the feature map of one layer from another. Without the activation, the feature map output from every layer would be just a linear transformation of the previous one. This would mean the subsequent layers provide no additional information for a better prediction.
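A quick NumPy sketch (illustrative, with random matrices) shows why: stacking two layers without an activation collapses into a single linear transformation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

two_layers = (x @ W1) @ W2         # two "layers" with no activation
one_layer = x @ (W1 @ W2)          # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: no extra expressive power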

...

Put simply, backpropagation is an extension of the iterative stochastic gradient-descent based approach to train multi-layer deep learning networks. This is explained using a single-layer perceptron, otherwise known as logistic regression. The estimation approach in backpropagation is repeated on every layer. It can be imagined as updating (learning) one layer at a time in the reverse order of prediction.
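As a minimal sketch of this idea (not the book's code), the gradient-descent update for a single-layer perceptron, i.e. logistic regression, can be written in a few lines of NumPy; in an MLP, backpropagation chains the same kind of update backwards through every layer.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                            # 100 samples, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # synthetic binary labels

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # forward pass: sigmoid output
    grad_w = X.T @ (p - y) / len(y)          # gradient of the cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)                  # gradient w.r.t. b
    w -= lr * grad_w                         # gradient-descent updates
    b -= lr * grad_b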

...

An end-to-end construction of a network and its evaluation is given with granular details, including data preparation, viz. curve shifting for early prediction, data splitting, and feature scaling.

Dropout is a useful technique (not limited to multi-layer perceptrons) that resolves the co-adaptation issue in deep learning. How dropout addresses it and regularizes a network is explained.

Activation functions are one of the most critical constructs in deep learning. Network performance is usually sensitive to activations due to vanishing or exploding gradients. An understanding of activations is provided, along with the story of how activations led to discoveries such as the non-decaying gradient, the saturation region, and self-normalization.

Besides, a few customizations of the TensorFlow implementation are shown for a new thresholded exponential linear unit (telu) activation. Lastly, deep learning networks have several configurations and numerous choices for them, e.g., number of layers, their sizes, activations on them, and so on. To make a construction simpler, the chapter concludes with a few rules-of-thumb.
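The exact telu definition used in the book is not reproduced here, but the following hedged sketch illustrates the kind of TensorFlow customization involved: a hypothetical thresholded variant of elu defined as a plain function and passed to a Keras layer as its activation.

import tensorflow as tf

def telu(x, threshold=0.1):
    # Hypothetical form for illustration only; the book's telu may differ.
    # Pass elu(x) where x exceeds a small threshold, output zero otherwise.
    return tf.where(x > threshold, tf.nn.elu(x), tf.zeros_like(x))

# A custom activation can be passed to a layer as a callable.
layer = tf.keras.layers.Dense(16, activation=telu)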

...

For early prediction, curve shifting moves the labels earlier in time. In doing so, the samples preceding the rare event get labeled as one. These prior samples are assumed to be the transitional phase that ultimately leads to the rare event.

...

During batch processing, all the model parameters (weights and biases) are updated. Simultaneously, the states of a metric are updated. Upon processing all the batches in an epoch, both the estimated parameters and the computed metrics are returned. Note that all these operations are enclosed within an epoch and no metric values are communicated between two epochs.
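A minimal sketch with a built-in metric (not tied to the book's example) shows this stateful behavior: the metric accumulates its states over the batches of an epoch and is reset before the next epoch.

import tensorflow as tf

recall = tf.keras.metrics.Recall()

# States accumulate across batches within one epoch.
recall.update_state([1, 0, 1, 1], [1, 0, 0, 1])   # batch 1: y_true, y_pred
recall.update_state([0, 1], [0, 1])               # batch 2
print(float(recall.result()))                     # recall computed over both batches

# The states are cleared before the next epoch begins.
recall.reset_states()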

...

Looking at suitably chosen metrics for a problem tremendously increases the ability to develop better models. Although a metric does not directly improve model training, it helps in better model selection. Several metrics are available outside TensorFlow, such as in sklearn. However, they cannot be used directly during model training in TensorFlow. This is because the metrics are computed while processing batches during each training epoch.

...

Fortunately, TensorFlow provides the ability for this customization. The custom-defined metrics F1Score and FalsePositiveRate are provided in the user-defined performance metrics library. Learning the programmatic context for the customization is important and, therefore, is elucidated here.

...

If all the weights in a deep learning network are learned together, it is usual that some of the nodes have more predictive capability than the others. In such a scenario, as the network is trained iteratively, these powerful (predictive) nodes start to suppress the weaker ones. These powerful nodes usually constitute only a fraction of all the nodes. But over many iterations, only these powerful nodes are trained, and the rest stop participating. This phenomenon is called co-adaptation.

...

Dropout changed the approach of learning weights. Instead of learning all the network weights together, dropout trains a subset of them in a batch training iteration.

...


Dropout is a regularization technique. It is closest to L2 regularization. It can be shown mathematically that, under a linearity (activation) assumption, the loss function with dropout has the same form as L2 regularization.
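As a sketch of that argument (for a plain linear model with squared-error loss, not the book's full derivation): if each input is kept with probability p and scaled by 1/p (inverted dropout), the loss averaged over dropout masks picks up a ridge-like penalty,

E_m[ (y - (1/p) w^T(m ⊙ x))^2 ] = (y - w^T x)^2 + ((1 - p)/p) Σ_i w_i^2 x_i^2,

where m_i ~ Bernoulli(p). The second term has the form of an L2 penalty on the weights, weighted by the inputs.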

...

Deep learning networks are learned with backpropagation. Backpropagation methods are gradient-based. The gradient guides the parameter to its optimal value. An apt gradient is, therefore, critical for the parameter’s journey to the optimal.

...

The gradient-based learning iteratively estimates the model. In each iteration, the parameter is moved “closer” to its optimal value.

A gradient that is too small causes the vanishing gradient issue. On the other extreme, sometimes the gradient is massive. This is the exploding gradient phenomenon. Both issues make reaching the optimal parameter values rather elusive.
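A tiny numeric illustration (made-up local gradients, not from the book): backpropagation multiplies per-layer local gradients, so across many layers the product either shrinks towards zero or blows up.

depth = 20
small_local_grad = 0.25
large_local_grad = 4.0

print(small_local_grad ** depth)   # ~9.1e-13: the gradient vanishes
print(large_local_grad ** depth)   # ~1.1e+12: the gradient explodes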

...

The vanishing and exploding gradient issues were becoming a bottleneck in developing complex and large neural networks. They were first resolved to some extent with the rectified linear unit (relu) and leaky-relu in Maas, Hannun, and Ng 2013.

...


Virtually every problem has more than one feature. The features can have different ranges of values. For example, a paper manufacturing process has temperature and moisture features. Their units are different, due to which their values lie in different ranges. These differences may not pose theoretical issues. But, in practice, they cause difficulty in model training, typically by converging at local minima.
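A short sketch (with made-up temperature and moisture values) of the usual remedy: fit the scaler on the training data only and reuse it on new data, the same pattern that appears in the worked example later in this document.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[300.0, 0.10],
                    [350.0, 0.25],
                    [320.0, 0.15]])   # e.g., temperature and moisture (illustrative values)
x_test = np.array([[310.0, 0.20]])

scaler = MinMaxScaler().fit(x_train)       # learn min/max from training data only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)   # reuse the same scaling on new data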

...

An end-to-end construction of a network and its evaluation is then given with granular details on data preparation, viz. curve shifting for early prediction, data splitting, and feature scaling. Thereafter, every construction element, e.g., layers, activations, evaluation metrics, and optimizers, is explained.

...

Chapter 3 Setup

TensorFlow 2 was released in 2019 and is expected to change the landscape of deep learning. It has made model building simpler and production deployment on any platform more robust, and it enables powerful experimentation for research. With these, TF 2 is likely to propel deep learning to mainstream applications in research and industry alike.

...

Using the Google Colab environment is an alternative to this installation. Google Colab is generally an easier way to work with TensorFlow. It is a notebook on Google Cloud with all the TensorFlow requisites pre-installed.

...

A paper sheet-break problem in paper manufacturing is taken from Ranjan et al. 2018 as a working example in this book. The data is a multivariate time series at a two-minute frequency. The system's status (the response), normal versus break, is given as 0 and 1.

Articles

We have been pumped with the adage of the modern world, "follow your passion." The questions are: what is passion, and how does one find it?

While the adage is clear, the answers to these aren't. Without these answers, I have seen data scientists pursuing an apparition mistaken for passion.

Deep Learning provides a wide variety of models. With them, we can build extremely accurate predictive models. However, with this wide variety and the multitude of setting parameters, it may be daunting to find a starting point. In this article, we will find a starting point for building a Neural Network, more specifically a Multilayer Perceptron as an example, but most of it applies generally. The idea here is to go over the rules of thumb to build a first neural network model. Tune and optimize this first model if it performs reasonably (minimally acceptable accuracy). Otherwise, it is better to look into the problem, the data, or a different approach. In the following, we have rules of thumb for building a Neural Network and their implementation code for a binary classification in TensorFlow 2.

Neural Networks

Neural Networks have advanced tremendously with CNNs, RNNs, etc., and several subtypes within each of them developed over time. With each development we have successfully improved our prediction capabilities. But at the same time, we have successfully made it harder to find a starting point for model building. Each model has its own sky-reaching claims. With so many billboards around, it is easy to get distracted. In the following, we will sail through the distractions by using a few rules of thumb to build a first model.

Rules of Thumb

We have a variety of neural networks. Among them, a multilayer perceptron is the "hello world" of Deep Learning. It is, therefore, a good place to start when you are learning about or developing a new model in Deep Learning. Following are the rules of thumb for building an MLP. However, most of them are applicable to other Deep Learning models.

- Number of layers: Start with two hidden layers (this does not include the last layer).
- Number of nodes (size) of intermediate layers: a number from the geometric progression of 2, e.g., 4, 8, 16, 32, ... The first layer should be around half of the number of input data features. The next layer size is half of the previous.
- Number of nodes (size) of the output layer for classification: If binary classification, the size is one. For a multi-class classifier, the size is the number of classes.
- Size of the output layer for regression: If a single response, the size is one. For multi-response regression, the size is the number of responses.
- Activation for intermediate layers: Use relu activation.
- Activation for the output layer: Use sigmoid for binary classification, softmax for a multi-class classifier, and linear for regression. For Autoencoders, the last layer should be linear if the input data is continuous, otherwise sigmoid or softmax for binary or multi-level categorical input.
- Dropout layers: Add Dropout after every layer except the Input layer (if defining the Input layer separately). Set the Dropout rate to 0.5. A Dropout rate > 0.5 is counter-productive. If you believe a rate of 0.5 is regularizing too many nodes, then increase the size of the layer instead of reducing the Dropout rate below 0.5. I prefer not to set any Dropout on the Input layer. But if you feel compelled to do that, set the Dropout rate < 0.2.
- Data preprocessing: I am assuming your predictors X are numeric and you have already converted any categorical columns into one-hot encoding. Before using the data for model training, perform data scaling. Use MinMaxScaler from sklearn.preprocessing. If this does not work well, use StandardScaler from the same library. In regression, scaling is also needed for y.
- Split data into train, valid, test: Use train_test_split from sklearn.model_selection. See the example below.
- Class weights: If you have unbalanced data, then set class weights to balance the loss in model.fit. For a binary classifier, the weights should be: {0: number of 1s / data size, 1: number of 0s / data size}. For extremely unbalanced data (rare events), class weights may not work. Be cautious adding them.
- Optimizer: Use adam with its default learning rate.
- Loss in classification: For binary classification use binary_crossentropy. For multiclass, use categorical_crossentropy if the labels are one-hot encoded, otherwise use sparse_categorical_crossentropy if the labels are integers.
- Loss in regression: Use mse.
- Metrics for classification: Use accuracy, which shows the percent of correct classifications. For unbalanced data, also include tf.keras.metrics.Recall() and tf.keras.metrics.FalsePositives().
- Metric for regression: Use tf.keras.metrics.RootMeanSquaredError().
- Epochs: Start with 20 to see if the model training shows decreasing loss and any improvement in accuracy. If there is no minimal success with 20 epochs, move on. If you get some minimal success, make the epochs 100.
- Batch size: Choose the batch size from the geometric progression of 2. For unbalanced datasets, use a larger value, like 128; otherwise start with 16.

Few Extras for Advanced Practitioners

- Oscillating loss: If you encounter oscillating loss upon training, then there is a convergence issue. Try reducing the learning rate and/or changing the batch size.
- Oversampling and undersampling: If your data is unbalanced, use SMOTE from imblearn.over_sampling.
- Curve shifting: If you have to do a shifted prediction, for example an early prediction, use curve shifting. An implementation, curve_shift, is shown below.
- Custom metric: An important metric for unbalanced binary classification is the False Positive Rate. You can build this, and similarly other custom metrics, as shown in the class FalsePositiveRate() implementation below.
- Selu activation: selu activation has been deemed better than all other existing activations. I have not observed that always, but if you want to use selu activation then use kernel_initializer='lecun_normal' and AlphaDropout. In AlphaDropout use the rate as 0.1, AlphaDropout(0.1). An example implementation is shown below.

Example Multilayer Perceptron (MLP) in TensorFlow 2

I have implemented the MLP Neural Network on the paper sheet-break data set I have used in my previous articles (see Extreme Rare Event Classification using Autoencoders in Keras). In this implementation, we will see examples of the elements we mentioned in the above rules of thumb. The implementation is done in TensorFlow 2. I highly recommend migrating to TensorFlow 2, if not already. It has all the simplicity of Keras and significantly better computational efficiency. Follow Step-by-Step Guide to Install Tensorflow 2 for installation. In the following, I am not attempting to find the best model. The idea is to learn the implementations. No step is skipped in favor of brevity. Instead, the steps are verbose to help the reader apply them directly.
Libraries

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pylab import rcParams
from collections import Counter

import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model, load_model, Sequential
from tensorflow.keras.layers import Input, Dense, Dropout, AlphaDropout
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.metrics import recall_score, classification_report, auc, roc_curve
from sklearn.metrics import precision_recall_fscore_support, f1_score

from numpy.random import seed
seed(1)

SEED = 123  # used to help randomly select the data points
DATA_SPLIT_PCT = 0.2
rcParams['figure.figsize'] = 8, 6
LABELS = ["Normal", "Break"]

To test whether you are on the correct TensorFlow version, run,

tf.__version__

Reading and Preparing the Data

Download the data here.

'''
Download data here:
https://docs.google.com/forms/d/e/1FAIpQLSdyUk3lfDl7I5KYK_pw285LCApc-_RcoC0Tf9cnDnZ_TWzPAw/viewform
'''
df = pd.read_csv("data/processminer-rare-event-mts - data.csv")
df.head(n=5)  # visualize the data.

Convert categorical columns to one-hot encoding,

hotencoding1 = pd.get_dummies(df['x28'])  # Grade&Bwt
hotencoding1 = hotencoding1.add_prefix('grade_')
hotencoding2 = pd.get_dummies(df['x61'])  # EventPress
hotencoding2 = hotencoding2.add_prefix('eventpress_')

df = df.drop(['x28', 'x61'], axis=1)
df = pd.concat([df, hotencoding1, hotencoding2], axis=1)

Curve Shift

This is a time series data in which we have to predict the event (y = 1) ahead in time. In this data, consecutive rows are 2 minutes apart. We will shift the labels in column y by 2 rows to do a 4-minute ahead prediction.

sign = lambda x: (1, -1)[x < 0]

def curve_shift(df, shift_by):
    '''
    This function will shift the binary labels in a dataframe.
    The curve shift will be with respect to the 1s.
    For example, if shift is -2, the following process
    will happen: if row n is labeled as 1, then
    - Make row (n+shift_by):(n+shift_by-1) = 1.
    - Remove row n.
    i.e. the labels will be shifted up to 2 rows up.

    Inputs:
    df       A pandas dataframe with a binary labeled column.
             This labeled column should be named as 'y'.
    shift_by An integer denoting the number of rows to shift.

    Output
    df       A dataframe with the binary labels shifted by shift.
    '''
    vector = df['y'].copy()
    for s in range(abs(shift_by)):
        tmp = vector.shift(sign(shift_by))
        tmp = tmp.fillna(0)
        vector += tmp

    labelcol = 'y'
    # Add vector to the df
    df.insert(loc=0, column=labelcol+'tmp', value=vector)
    # Remove the rows with labelcol == 1.
    df = df.drop(df[df[labelcol] == 1].index)
    # Drop labelcol and rename the tmp col as labelcol
    df = df.drop(labelcol, axis=1)
    df = df.rename(columns={labelcol+'tmp': labelcol})
    # Make the labelcol binary
    df.loc[df[labelcol] > 0, labelcol] = 1

    return df

Shift up by two rows,

df = curve_shift(df, shift_by=-2)

Remove the time column now. It won't be required from here.
df = df.drop(['time'], axis=1)

Divide the data into train, valid, and test,

df_train, df_test = train_test_split(df, test_size=DATA_SPLIT_PCT, random_state=SEED)
df_train, df_valid = train_test_split(df_train, test_size=DATA_SPLIT_PCT, random_state=SEED)

And separate the X and y.

x_train = df_train.drop(['y'], axis=1)
y_train = df_train.y.values

x_valid = df_valid.drop(['y'], axis=1)
y_valid = df_valid.y.values

x_test = df_test.drop(['y'], axis=1)
y_test = df_test.y

Data Scaling

scaler = MinMaxScaler().fit(x_train)
# scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_valid_scaled = scaler.transform(x_valid)
x_test_scaled = scaler.transform(x_test)

MLP Models

Custom metric: FalsePositiveRate()

We will develop a FalsePositiveRate() metric that we will use in each model below.

class FalsePositiveRate(tf.keras.metrics.Metric):
    def __init__(self, name='false_positive_rate', **kwargs):
        super(FalsePositiveRate, self).__init__(name=name, **kwargs)
        self.negatives = self.add_weight(name='negatives', initializer='zeros')
        self.false_positives = self.add_weight(name='false_positives', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        '''
        Arguments:
        y_true  The actual y. Passed by default to Metric classes.
        y_pred  The predicted y. Passed by default to Metric classes.
        '''
        # Compute the number of negatives.
        y_true = tf.cast(y_true, tf.bool)
        negatives = tf.reduce_sum(tf.cast(tf.equal(y_true, False), self.dtype))
        self.negatives.assign_add(negatives)

        # Compute the number of false positives.
        y_pred = tf.greater_equal(y_pred, 0.5)  # Using default threshold of 0.5 to call a prediction positive.
        false_positive_values = tf.logical_and(tf.equal(y_true, False), tf.equal(y_pred, True))
        false_positive_values = tf.cast(false_positive_values, self.dtype)
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, self.dtype)
            false_positive_values = tf.multiply(false_positive_values, sample_weight)

        false_positives = tf.reduce_sum(false_positive_values)
        self.false_positives.assign_add(false_positives)

    def result(self):
        return tf.divide(self.false_positives, self.negatives)

Custom performance plotting functions

We will write two plot functions to visualize the progress in loss and the accuracy measures. We will use them for the models built below.
def plot_loss(model_history):
    train_loss = [value for key, value in model_history.items() if 'loss' in key.lower()][0]
    valid_loss = [value for key, value in model_history.items() if 'loss' in key.lower()][1]

    fig, ax1 = plt.subplots()

    color = 'tab:blue'
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss', color=color)
    ax1.plot(train_loss, '--', color=color, label='Train Loss')
    ax1.plot(valid_loss, color=color, label='Valid Loss')
    ax1.tick_params(axis='y', labelcolor=color)
    plt.legend(loc='upper left')
    plt.title('Model Loss')

    plt.show()

def plot_model_recall_fpr(model_history):
    train_recall = [value for key, value in model_history.items() if 'recall' in key.lower()][0]
    valid_recall = [value for key, value in model_history.items() if 'recall' in key.lower()][1]

    train_fpr = [value for key, value in model_history.items() if 'false_positive_rate' in key.lower()][0]
    valid_fpr = [value for key, value in model_history.items() if 'false_positive_rate' in key.lower()][1]

    fig, ax1 = plt.subplots()

    color = 'tab:red'
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Recall', color=color)
    ax1.set_ylim([-0.05, 1.05])
    ax1.plot(train_recall, '--', color=color, label='Train Recall')
    ax1.plot(valid_recall, color=color, label='Valid Recall')
    ax1.tick_params(axis='y', labelcolor=color)
    plt.legend(loc='upper left')
    plt.title('Model Recall and FPR')

    ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

    color = 'tab:blue'
    ax2.set_ylabel('False Positive Rate', color=color)  # we already handled the x-label with ax1
    ax2.plot(train_fpr, '--', color=color, label='Train FPR')
    ax2.plot(valid_fpr, color=color, label='Valid FPR')
    ax2.tick_params(axis='y', labelcolor=color)
    ax2.set_ylim([-0.05, 1.05])

    fig.tight_layout()  # otherwise the right y-label is slightly clipped
    plt.legend(loc='upper right')
    plt.show()

Model 1. Baseline.

n_features = x_train_scaled.shape[1]

mlp = Sequential()
mlp.add(Input(shape=(n_features, )))
mlp.add(Dense(32, activation='relu'))
mlp.add(Dense(16, activation='relu'))
mlp.add(Dense(1, activation='sigmoid'))

mlp.summary()

mlp.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['accuracy', tf.keras.metrics.Recall(), FalsePositiveRate()]
           )

history = mlp.fit(x=x_train_scaled,
                  y=y_train,
                  batch_size=128,
                  epochs=100,
                  validation_data=(x_valid_scaled, y_valid),
                  verbose=0).history

See the model fitting loss and accuracies (recall and FPR) progress.

plot_loss(history)

plot_model_recall_fpr(history)

Model 2. Class weights.

Define the class weights as mentioned in the rules of thumb.

class_weight = {0: sum(y_train == 1)/len(y_train), 1: sum(y_train == 0)/len(y_train)}

Now, we will train the model.
n_features = x_train_scaled.shape[1]

mlp = Sequential()
mlp.add(Input(shape=(n_features, )))
mlp.add(Dense(32, activation='relu'))
mlp.add(Dense(16, activation='relu'))
mlp.add(Dense(1, activation='sigmoid'))

mlp.summary()

mlp.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['accuracy', tf.keras.metrics.Recall(), FalsePositiveRate()]
           )

history = mlp.fit(x=x_train_scaled,
                  y=y_train,
                  batch_size=128,
                  epochs=100,
                  validation_data=(x_valid_scaled, y_valid),
                  class_weight=class_weight,
                  verbose=0).history

plot_loss(history)

plot_model_recall_fpr(history)

Model 3. Dropout Regularization.

n_features = x_train_scaled.shape[1]

mlp = Sequential()
mlp.add(Input(shape=(n_features, )))
mlp.add(Dense(32, activation='relu'))
mlp.add(Dropout(0.5))
mlp.add(Dense(16, activation='relu'))
mlp.add(Dropout(0.5))
mlp.add(Dense(1, activation='sigmoid'))

mlp.summary()

mlp.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['accuracy', tf.keras.metrics.Recall(), FalsePositiveRate()]
           )

history = mlp.fit(x=x_train_scaled,
                  y=y_train,
                  batch_size=128,
                  epochs=100,
                  validation_data=(x_valid_scaled, y_valid),
                  class_weight=class_weight,
                  verbose=0).history

plot_loss(history)

plot_model_recall_fpr(history)

Model 4. Oversampling-Undersampling Using SMOTE resampler.

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=212)
x_train_scaled_resampled, y_train_resampled = smote.fit_resample(x_train_scaled, y_train)
print('Resampled dataset shape %s' % Counter(y_train_resampled))

n_features = x_train_scaled.shape[1]

mlp = Sequential()
mlp.add(Input(shape=(n_features, )))
mlp.add(Dense(32, activation='relu'))
mlp.add(Dropout(0.5))
mlp.add(Dense(16, activation='relu'))
mlp.add(Dropout(0.5))
mlp.add(Dense(1, activation='sigmoid'))

mlp.summary()

mlp.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['accuracy', tf.keras.metrics.Recall(), FalsePositiveRate()]
           )

history = mlp.fit(x=x_train_scaled_resampled,
                  y=y_train_resampled,
                  batch_size=128,
                  epochs=100,
                  validation_data=(x_valid_scaled, y_valid),  # validate on the scaled validation data
                  class_weight=class_weight,
                  verbose=0).history

plot_loss(history)

plot_model_recall_fpr(history)

Model 5. Selu activation.

We use the selu activation, which got popular due to its self-normalizing properties. Note: we use kernel_initializer='lecun_normal' and Dropout as AlphaDropout(0.1).
n_features = x_train_scaled.shape[1]

mlp = Sequential()
mlp.add(Input(shape=(n_features, )))
mlp.add(Dense(32, kernel_initializer='lecun_normal', activation='selu'))
mlp.add(AlphaDropout(0.1))
mlp.add(Dense(16, kernel_initializer='lecun_normal', activation='selu'))
mlp.add(AlphaDropout(0.1))
mlp.add(Dense(1, activation='sigmoid'))

mlp.summary()

mlp.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['accuracy', tf.keras.metrics.Recall(), FalsePositiveRate()]
           )

history = mlp.fit(x=x_train_scaled,
                  y=y_train,
                  batch_size=128,
                  epochs=100,
                  validation_data=(x_valid_scaled, y_valid),
                  class_weight=class_weight,
                  verbose=0).history

plot_loss(history)

plot_model_recall_fpr(history)

Conclusion

With all the predictive modeling abilities that Deep Learning has offered, it can also be overwhelming to begin. The rules of thumb in this article provide a starting point to build an initial Neural Network. The model built from here should be further tuned to improve the performance. If the model built with these rules of thumb does not achieve some minimal performance, further tuning may not bring much improvement. Try another approach. The article shows the steps to implement the neural network in TensorFlow 2. If you do not have TensorFlow 2, it is recommended to migrate to it as it brings the ease of (Keras) implementation and high performance. See the instructions here: Step-by-Step Guide to Install Tensorflow 2.

In this article, we will get a starting point to build an initial Neural Network. We will learn the thumb-rules, e.g. the number of hidden layers, number of nodes, activation, etc., and see the implementations in TensorFlow 2.

The availability of Deep Learning APIs, such as Keras and TensorFlow, has made model building and experimentation extremely easy. However, a lack of clear understanding of the fundamentals may put us in a directionless race to the best model. Reaching the best model in such a race is left to chance. Here we will develop an understanding of the fundamental properties required in an Autoencoder. This will provide a well-directed approach to Autoencoder tuning and optimization. In Part I, we will focus on learning the properties and their benefits. In Part II, we will develop custom layers and constraints to incorporate the properties.

The primary concept that we will learn here, and that will enable us to construct a right Autoencoder, is: Autoencoders are directly related to Principal Component Analysis (PCA). A "right" Autoencoder mathematically means a well-posed Autoencoder. A well-posed model is easier to tune and optimize.

Autoencoder vis-à-vis PCA:
- A linearly activated Autoencoder approximates PCA. Mathematically, minimizing the reconstruction error in PCA modeling is the same as a single layer linear Autoencoder.
- An Autoencoder extends PCA to a nonlinear space. In other words, Autoencoders are a nonlinear extension of PCA.

Therefore, an Autoencoder should ideally have the properties of PCA. These properties are,
- Tied Weights: equal weights on the Encoder and the corresponding Decoder layer (clarified with Figure 1 in the next section).
- Orthogonal weights: each weight vector is independent of the others.
- Uncorrelated features: the outputs of the encoding layer are not correlated.
- Unit Norm: the weights on a layer have unit norm.

However, Autoencoders as explained in most tutorials, e.g. Building Autoencoders in Keras [1], do not have these properties, the lack of which makes them sub-optimal. Therefore, it is important to incorporate these properties for a well-posed Autoencoder. By incorporating them, we will also
- Have regularization: the Orthogonality and Unit Norm constraints act as regularization. Additionally, Tied Weights, as we will see later, reduce the number of network parameters to almost half, which is another type of regularization.
- Address exploding and vanishing gradients: the Unit Norm constraint prevents weights from becoming large, and hence resolves the exploding gradient problem. Additionally, due to the Orthogonality constraint, only important/informative weights are non-zero. Therefore, sufficient information flows through these non-zero weights during back-propagation, thus avoiding vanishing gradients.
- Have a smaller network: without the orthogonality, the encoder has redundant weights and features. To compensate for the redundancy, the encoder size is increased. On the contrary, the orthogonality ensures each encoded feature carries a piece of unique information, independent of the other features. This obviates the redundancy and we can have the same amount of information encoded with a smaller encoder (layer). With a smaller network, we bring the Autoencoder closer to Edge Computing.

This article will elucidate the above concepts by showing the architectural similarity between PCA and Autoencoders, and the suboptimality of the conventional Autoencoder. The article will be continued in Part II with detailed steps to optimize an Autoencoder. In Part II, we find that the optimizations improved the Autoencoder reconstruction error by more than 50%. This article assumes the reader has a basic understanding of PCA. If unfamiliar, please refer to Understanding PCA [2].
Architectural similarity between PCA and Autoencoder

Figure 1. Single layer Autoencoder vis-à-vis PCA.

For simplicity, we compare a linear single layer Autoencoder with PCA. There are multiple algorithms for PCA modeling. One of them is estimation by minimizing the reconstruction error (see [3]). Following this algorithm gives a clearer understanding of the similarities between PCA and an Autoencoder.

Figure 1 visualizes a single layer linear Autoencoder. As shown in the bottom of the figure, the Encoding process is similar to the PC transformation. The PC transformation projects the original data on the Principal Components to yield orthogonal features, called Principal Scores. Similarly, the Decoding process is similar to reconstructing the data from the Principal Scores. In both the Autoencoder and PCA, the model weights can be estimated by minimizing the reconstruction error.

In the following, we will further elaborate Figure 1 by showcasing the key Autoencoder components and their equivalents in PCA. Suppose we have data with p features.

Input layer: the data sample. In the Autoencoder, the data is inputted using an Input layer of size p. In PCA, the data is inputted as samples.

Encoding: the projection of data on the Principal Components. The size of the encoding layer is k. In PCA, k denotes the number of selected Principal Components (PCs). In both, we have k < p for dimension reduction. k ≥ p leads to an over-representative model, and consequently a (close to) zero reconstruction error. A colored cell in the Encoding layer in Figure 1 is a computing node with p weights. That is, each Encoding node j = 1, ..., k has a p-dimensional weight vector w_j = (w_j1, ..., w_jp). This is equivalent to an eigenvector in PCA. The Encoding layer output in an Autoencoder is g(Wx), where x is the input, W is the weight matrix, and g is an activation function. If the activation is linear, this is equivalent to the Principal Scores in PCA.

Decoding: the reconstruction of data from the Principal Scores. The size of the decoding layer in the Autoencoder and in PCA reconstruction must be the size of the input data, p. In a decoder, the data is reconstructed from the encodings as x̂ = W'g(Wx) (Eq. 4), and similarly, in PCA, it is reconstructed as x̂ = Wᵀ(Wx) (Eq. 5). Note that we have W' in Eq. 4 and W in Eq. 5. This is because the weights on the Encoder and Decoder are not the same by default. The Decoder and PCA reconstructions will be the same if the Encoder and Decoder weights are tied, i.e. W' = Wᵀ (Eq. 6). The multi-colors in the Decoder cells indicate that the weights present in different cells of the Encoder appear within the same cell of the Decoder.

This brings us to the mathematical comparison between an Autoencoder and PCA. Mathematically, a linear Autoencoder will be similar to PCA if it has,

- Tied Weights: in any general multilayer Autoencoder, the weight matrix on layer l of the Encoder equals the transpose of the weight matrix on the l-th layer from the end of the Decoder (Eq. 7a).
- Orthogonal weights: the weights on the Encoding layer are orthogonal, i.e. WWᵀ = I (Eq. 7b). The same orthogonality constraint can be enforced on intermediate Encoder layers for regularization.
- Uncorrelated features: the outputs of PCA, i.e. the Principal Scores, are uncorrelated. Therefore, the covariance of the encoder output, Cov(g(Wx)), should be diagonal (Eq. 7c).
- Unit Norm: an eigenvector in PCA is constrained to have a Unit Norm.
Without this constraint, we will not get a proper solution as the variance of the projection can become arbitrarily large as long as the norm of the vector increases. For the same reason, the weights on the Encoding layer should be unit norm, i.e. ||w_j|| = 1 for each encoding node j (Eq. 7d). This constraint should also be applied on other intermediate layers for regularization.

Suboptimality of a regular unconstrained Autoencoder

Here we will implement PCA and a typical unconstrained Autoencoder on a random dataset. We will show their outputs differ in every aspect discussed above. This results in a suboptimal Autoencoder. Post this discussion, we will show how we can constrain an Autoencoder for proper estimations (Part II). Complete code is available here.

Load libraries

from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(234)

import sklearn
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import decomposition
import scipy

import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense, Layer, InputSpec
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers, activations, initializers, constraints, Sequential
from keras import backend as K
from keras.constraints import UnitNorm, Constraint

Generate random data

We generate multivariate correlated normal data. The steps for data generation are elaborated in the GitHub repository.

n_dim = 5
cov = sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
mu = np.random.normal(0, 0.1, n_dim)
n = 1000

X = np.random.multivariate_normal(mu, cov, n)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=123)

# Data Preprocessing
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

The test dataset will be used in Part II to compare the Autoencoder reconstruction accuracy.

Autoencoder and PCA models

We fit a single layer linear Autoencoder with the encoding dimension as two. We also fit PCA with two components.

# Fit Autoencoder
nb_epoch = 100
batch_size = 16
input_dim = X_train_scaled.shape[1]  # num of predictor variables
encoding_dim = 2
learning_rate = 1e-3

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

# Fit PCA
pca = decomposition.PCA(n_components=2)
pca.fit(X_train_scaled)

Figure 2. Structure of the single-layer Autoencoder.

Visually, the Encoder-Decoder structure developed here is shown in Figure 3 below. The figure helps understand how the weight matrices are aligned.

Figure 3. A simple linear Autoencoder to encode 5-dimensional data into 2-dimensional features.

To follow the PCA properties, the Autoencoder in Figure 3 should satisfy the conditions in Eq. 7a-d. Below, we will show that this conventional Autoencoder does not meet any of them.
1. Tied Weights

As we can see below, the weights on the Encoder and Decoder are different.

w_encoder = np.round(autoencoder.layers[0].get_weights()[0], 2).T  # W in Figure 3.
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 2)    # W' in Figure 3.
print('Encoder weights \n', w_encoder)
print('Decoder weights \n', w_decoder)

2. Weight Orthogonality

As shown below, unlike the PCA weights (i.e. the eigenvectors), the weights on the Encoder and Decoder are not orthogonal.

w_pca = pca.components_
np.round(np.dot(w_pca, w_pca.T), 3)

np.round(np.dot(w_encoder, w_encoder.T), 3)

np.round(np.dot(w_decoder, w_decoder.T), 3)

3. Features Correlation

In PCA, the features are uncorrelated.

pca_features = pca.fit_transform(X_train_scaled)
np.round(np.cov(pca_features.T), 5)

But the Encoded features are correlated.

encoder_layer = Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[0].output)
encoded_features = np.array(encoder_layer.predict(X_train_scaled))
print('Encoded feature covariance\n', np.cov(encoded_features.T))

Weight non-orthogonality and feature correlations are undesirable because they bring redundancy in the information contained within the Encoded features.

4. Unit Norm

The unit norm of the PCA weights is 1. This is a constraint applied in PCA estimation to yield a proper estimate.

print('PCA weights norm, \n', np.sum(w_pca ** 2, axis=1))
print('Encoder weights norm, \n', np.sum(w_encoder ** 2, axis=1))
print('Decoder weights norm, \n', np.sum(w_decoder ** 2, axis=1))

GitHub Repository

The complete code is available here: cran2367/pca-autoencoder-relationship (github.com), a repository to understand the relationship between PCA and Autoencoders.

Conclusion

As such, an Autoencoder model is ill-posed. An ill-posed model does not have robust estimates. This adversely affects its test accuracy, i.e. the reconstruction error on new data. Several recent research advancements are building and utilizing orthogonality conditions to improve Deep Learning model performance. Refer to [4] and [5] for some research directions. In the sequel, Part II, we will implement custom constraints to incorporate the abovementioned properties derived from PCA into Autoencoders. We will see that adding the constraints improves the test reconstruction error. Go to the sequel, Part II.

References

[1] Building Autoencoders in Keras
[2] Understanding Principal Component Analysis
[3] Principal Component Analysis: Algorithm using Reconstruction Error (Page 15)
[4] Huang, Lei, et al. "Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[5] Brock, Andrew, et al. "Neural photo editing with introspective adversarial networks." arXiv preprint arXiv:1609.07093 (2016).

Here we will learn the desired properties of Autoencoders derived from their similarity with PCA. From that, we will build custom constraints for Autoencoders in Part II for tuning and optimization.

In Part I, we learned that PCA and Autoencoders share architectural similarities. But despite this, an Autoencoder by itself does not have PCA properties, e.g. orthogonality. We understood that incorporating the PCA properties will bring significant benefits to an Autoencoder, such as resolving the vanishing and exploding gradient problems, and addressing overfitting via regularization. Based on this, the properties that we would like Autoencoders to inherit are,
- Tied weights,
- Orthogonal weights,
- Uncorrelated features, and
- Unit Norm.

In this article, we will implement custom layers and constraints to incorporate these properties, demonstrate how they work, and show the improvements in reconstruction errors that they bring. These implementations will enable constructing a well-posed Autoencoder and optimizing it. In our example, the optimizations improved the reconstruction error by more than 50%.

Note: regularization techniques, such as dropout, are popularly used. But without a well-posed model, these approaches take longer to optimize.

The following section shows the implementation in detail. The reader can skip to the Key Takeaways section for a brief summary.

A well-posed Autoencoder

Figure 1. Apply constraints for a well-posed Autoencoder.

We will develop an Autoencoder for a randomly generated dataset with five features. We divide the dataset into train and test. As we add constraints, we will evaluate the performance with the test data reconstruction error. This article contains the implementation details to help a practitioner try a variety of choices. The complete code is present here.

Import libraries

from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(234)

import sklearn
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import decomposition
import scipy

import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Input, Dense, Layer, InputSpec
from keras.callbacks import ModelCheckpoint, TensorBoard
from keras import regularizers, activations, initializers, constraints, Sequential
from keras import backend as K
from keras.constraints import UnitNorm, Constraint

Generate and prepare data

n_dim = 5
cov = sklearn.datasets.make_spd_matrix(n_dim, random_state=None)
mu = np.random.normal(0, 0.1, n_dim)
n = 1000

X = np.random.multivariate_normal(mu, cov, n)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=123)

# Scale the data between 0 and 1.
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled

Estimation parameters

nb_epoch = 100
batch_size = 16
input_dim = X_train_scaled.shape[1]  # num of predictor variables
encoding_dim = 2
learning_rate = 1e-3

Baseline Model

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Figure 2.1. Baseline Model Parameters.
Baseline reconstruction error

train_predictions = autoencoder.predict(X_train_scaled)
print('Train reconstruction error\n', sklearn.metrics.mean_squared_error(X_train_scaled, train_predictions))
test_predictions = autoencoder.predict(X_test_scaled)
print('Test reconstruction error\n', sklearn.metrics.mean_squared_error(X_test_scaled, test_predictions))

Figure 2.2. Baseline Autoencoder Reconstruction Error.

Autoencoder Optimization

Keras provides a variety of layers and constraints. We have an available constraint for Unit Norm. For the others, we will build custom layers and constraints.

1. Custom Layer: Tied weights.

With this custom layer, we enforce the weights on the encoder and decoder to be equal. Mathematically, the transpose of the decoder weights equals the encoder weights (Eq. 7a in Part I).

class DenseTied(Layer):
    def __init__(self, units,
                 activation=None,
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 tied_to=None,
                 **kwargs):
        self.tied_to = tied_to
        if 'input_shape' not in kwargs and 'input_dim' in kwargs:
            kwargs['input_shape'] = (kwargs.pop('input_dim'),)
        super().__init__(**kwargs)
        self.units = units
        self.activation = activations.get(activation)
        self.use_bias = use_bias
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.bias_initializer = initializers.get(bias_initializer)
        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)
        self.kernel_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)
        self.input_spec = InputSpec(min_ndim=2)
        self.supports_masking = True

    def build(self, input_shape):
        assert len(input_shape) >= 2
        input_dim = input_shape[-1]

        if self.tied_to is not None:
            self.kernel = K.transpose(self.tied_to.kernel)
            self._non_trainable_weights.append(self.kernel)
        else:
            self.kernel = self.add_weight(shape=(input_dim, self.units),
                                          initializer=self.kernel_initializer,
                                          name='kernel',
                                          regularizer=self.kernel_regularizer,
                                          constraint=self.kernel_constraint)
        if self.use_bias:
            self.bias = self.add_weight(shape=(self.units,),
                                        initializer=self.bias_initializer,
                                        name='bias',
                                        regularizer=self.bias_regularizer,
                                        constraint=self.bias_constraint)
        else:
            self.bias = None
        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dim})
        self.built = True

    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) >= 2
        output_shape = list(input_shape)
        output_shape[-1] = self.units
        return tuple(output_shape)
    def call(self, inputs):
        output = K.dot(inputs, self.kernel)
        if self.use_bias:
            output = K.bias_add(output, self.bias, data_format='channels_last')
        if self.activation is not None:
            output = self.activation(output)
        return output

Autoencoder with Tied Decoder.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True)
decoder = DenseTied(input_dim, activation="linear", tied_to=encoder, use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observations

1a. Equal weights.

w_encoder = np.round(np.transpose(autoencoder.layers[0].get_weights()[0]), 3)
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 3)
print('Encoder weights\n', w_encoder)
print('Decoder weights\n', w_decoder)

1b. Biases are different.

b_encoder = np.round(np.transpose(autoencoder.layers[0].get_weights()[1]), 3)
b_decoder = np.round(np.transpose(autoencoder.layers[1].get_weights()[0]), 3)
print('Encoder bias\n', b_encoder)
print('Decoder bias\n', b_decoder)

2. Custom Constraint: Weights Orthogonality.

class WeightsOrthogonalityConstraint(Constraint):
    def __init__(self, encoding_dim, weightage=1.0, axis=0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage
        self.axis = axis

    def weights_orthogonality(self, w):
        if self.axis == 1:
            w = K.transpose(w)
        if self.encoding_dim > 1:
            m = K.dot(K.transpose(w), w) - K.eye(self.encoding_dim)
            return self.weightage * K.sqrt(K.sum(K.square(m)))
        else:
            m = K.sum(w ** 2) - 1.
            return m

    def __call__(self, w):
        return self.weights_orthogonality(w)

Applying Orthogonality on both Encoder and Decoder weights.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0))
decoder = Dense(input_dim, activation="linear", use_bias=True,
                kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=1))

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

2a. The weights are close to orthogonal for both the Encoder and the Decoder.

w_encoder = autoencoder.layers[0].get_weights()[0]
print('Encoder weights dot product\n', np.round(np.dot(w_encoder.T, w_encoder), 2))

w_decoder = autoencoder.layers[1].get_weights()[0]
print('Decoder weights dot product\n', np.round(np.dot(w_decoder, w_decoder.T), 2))

3. Custom Constraint: Uncorrelated Encoded features.

For uncorrelated features, we will impose a penalty on the sum of the off-diagonal elements of the encoded features' covariance.
class UncorrelatedFeaturesConstraint(Constraint):

    def __init__(self, encoding_dim, weightage=1.0):
        self.encoding_dim = encoding_dim
        self.weightage = weightage

    def get_covariance(self, x):
        x_centered_list = []

        for i in range(self.encoding_dim):
            x_centered_list.append(x[:, i] - K.mean(x[:, i]))

        x_centered = tf.stack(x_centered_list)
        covariance = K.dot(x_centered, K.transpose(x_centered)) / tf.cast(x_centered.get_shape()[0], tf.float32)

        return covariance

    # Constraint penalty
    def uncorrelated_feature(self, x):
        if self.encoding_dim <= 1:
            return 0.0
        else:
            output = K.sum(K.square(
                self.covariance - tf.math.multiply(self.covariance, K.eye(self.encoding_dim))))
            return output

    def __call__(self, x):
        self.covariance = self.get_covariance(x)
        return self.weightage * self.uncorrelated_feature(x)

Applying the constraint in the Autoencoder.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                activity_regularizer=UncorrelatedFeaturesConstraint(encoding_dim, weightage=1.))
decoder = Dense(input_dim, activation="linear", use_bias=True)

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

3a. We have less correlated encoded features. Imposing this penalty is harder. A stronger constraint function can be explored.

encoder_layer = Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[0].output)
encoded_features = np.array(encoder_layer.predict(X_train_scaled))
print('Encoded feature covariance\n', np.round(np.cov(encoded_features.T), 3))

4. Constraint: Unit Norm.

The UnitNorm constraint is prebuilt in Keras. We will apply this constraint on both the Encoder and Decoder layers. It is important to note that we keep axis=0 for the Encoder layer and axis=1 for the Decoder layer.

encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias=True,
                kernel_constraint=UnitNorm(axis=0))
decoder = Dense(input_dim, activation="linear", use_bias=True,
                kernel_constraint=UnitNorm(axis=1))

autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

autoencoder.compile(metrics=['accuracy'],
                    loss='mean_squared_error',
                    optimizer='sgd')
autoencoder.summary()

autoencoder.fit(X_train_scaled, X_train_scaled,
                epochs=nb_epoch,
                batch_size=batch_size,
                shuffle=True,
                verbose=0)

Observation.

4a. The norms of the weights on the Encoder and Decoder along the encoding axis are,

w_encoder = np.round(autoencoder.layers[0].get_weights()[0], 2).T  # W in Figure 3.
w_decoder = np.round(autoencoder.layers[1].get_weights()[0], 2)    # W' in Figure 3.
print('Encoder weights norm, \n', np.round(np.sum(w_encoder ** 2, axis=1), 3))
print('Decoder weights norm, \n', np.round(np.sum(w_decoder ** 2, axis=1), 3))

As also mentioned before, the norms are not exactly 1.0 because this is not a hard constraint.

Putting Everything Together

Here we will put together the above properties.
Depending on the problem, a certain combination of these properties will work better than others. Applying several constraints together can sometimes harm the estimation. For example, in the dataset used here, combining Tied Layer, Weight Orthogonality, and UnitNorm worked the best. encoder = Dense(encoding_dim, activation="linear", input_shape=(input_dim,), use_bias = True, kernel_regularizer=WeightsOrthogonalityConstraint(encoding_dim, weightage=1., axis=0), kernel_constraint=UnitNorm(axis=0))  decoder = DenseTied(input_dim, activation="linear", tied_to=encoder, use_bias = False) autoencoder = Sequential() autoencoder.add(encoder) autoencoder.add(decoder) autoencoder.compile(metrics=['accuracy'],                     loss='mean_squared_error',                     optimizer='sgd') autoencoder.summary() autoencoder.fit(X_train_scaled, X_train_scaled,                 epochs=nb_epoch,                 batch_size=batch_size,                 shuffle=True,                 verbose=0) train_predictions = autoencoder.predict(X_train_scaled) print('Train reconstrunction error\n', sklearn.metrics.mean_squared_error(X_train_scaled, train_predictions)) test_predictions = autoencoder.predict(X_test_scaled) print('Test reconstrunction error\n', sklearn.metrics.mean_squared_error(X_test_scaled, test_predictions)) Image for post GitHub Repository The complete mentioning the steps and more details on model tuning is here. cran2367/pca-autoencoder-relationship Understand the relationship between PCA and autoencoder - cran2367/pca-autoencoder-relationship github.com Key takeaways Improvement in Reconstruction Error Image for post Table 1. Summary of reconstruction errors. The reconstruction error on test data for baseline model is 0.027. Adding each property to the Autoencoder reduced the test error. The improvement ranges from 19% by Weight Orthogonality to 67% by Tied Weights. The improvements will vary from data to data. In our problem, combining Tied Weights, Weight Orthogonality, and UnitNorm yielded the optimal model with the best reconstruction error. Although the error in the optimal model is larger than the model with only Tied Weights, this is more stable and, hence, preferable. Key Implementation Notes Tied Weights In the Tied Weights layer, DenseTied, the biases will be different in the Encoder and Decoder. To have exactly all weights as equal, set use_bias=False. Weight Orthogonality kernel_regularizer is used for adding constraints or regularization on weights of a layer. The axis of orthogonality should be by row, axis=0, for encoder and by column, axis=1, for decoder. Uncorrelated Encoded features activity_regularizer is used to apply constraints on the output features of a layer. Therefore, it is used here to constrain off-diagonal covariance of encoded features to zero. This constraint is not strong. Meaning, it does not push the off-diagonal covariance elements extremely close to zero. Another customization of this constraint can be explored. Unit Norm UnitNorm should be on different axes for encoder and Decoder. Similar to weight orthogonality, this is applied on rows, axis=0, for encoder and on columns, axis=1, for decoder. General There are two classes, Regularizer and Constraints for building the custom functions. Practically, both are the same for our application. We used Constraints class for Weight Orthogonality and Uncorrelated features. All the three constraints — Unit Norm, Uncorrelated Encoded features, and Weight Orthogonality — are soft constraints. 
That is, they can bring the model weights and features close to the desired property, but not exactly to it. For example, the weights end up nearly, but not exactly, unit norm. In practice, the benefit of incorporating these properties will differ across problems. Explore each property individually under different settings, e.g. with and without bias, and with different weightages for the orthogonality and uncorrelated-features constraints on different layers (a minimal sketch of such an exploration is given at the end of this post). In addition, include popular regularization techniques, such as Dropout layers, in the Autoencoder.

Summary

In the prequel, Part I, we learned the important properties that Autoencoders should inherit from PCA. Here, we implemented custom layers and constraints to incorporate those properties. In Table 1, we showed that these properties significantly improve the test reconstruction error. As mentioned in the Key Takeaways, some trial-and-error is needed to find the best settings; however, these trials move in directions with interpretable meanings.

Go to the prequel, Part I.
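As an illustration of the trial-and-error suggested in the Key Takeaways, the following is a minimal sketch of exploring different weightages for the orthogonality constraint. It assumes the variables X_train_scaled, X_test_scaled, input_dim, encoding_dim, nb_epoch, batch_size, and the custom WeightsOrthogonalityConstraint class from earlier in this post are already defined; the weightage grid is illustrative only.

# A sketch of exploring the orthogonality weightage and comparing
# test reconstruction errors. Assumes the objects from earlier in
# this post (data, dimensions, WeightsOrthogonalityConstraint).
import numpy as np
import sklearn.metrics
from keras.models import Sequential
from keras.layers import Dense
from keras.constraints import UnitNorm

for weightage in [0.1, 1.0, 10.0]:
    encoder = Dense(encoding_dim, activation="linear",
                    input_shape=(input_dim,), use_bias=True,
                    kernel_regularizer=WeightsOrthogonalityConstraint(
                        encoding_dim, weightage=weightage, axis=0),
                    kernel_constraint=UnitNorm(axis=0))
    decoder = Dense(input_dim, activation="linear", use_bias=True)

    autoencoder = Sequential([encoder, decoder])
    autoencoder.compile(loss='mean_squared_error', optimizer='sgd')
    autoencoder.fit(X_train_scaled, X_train_scaled,
                    epochs=nb_epoch, batch_size=batch_size,
                    shuffle=True, verbose=0)

    test_mse = sklearn.metrics.mean_squared_error(
        X_test_scaled, autoencoder.predict(X_test_scaled))
    print('weightage', weightage, 'test reconstruction error', round(test_mse, 4))

The same loop can be repeated for the other constraints, e.g. with and without bias, or with the uncorrelated-features weightage, to see which combination works best for a given dataset.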

In continuation of Part I, here we will define and implement custom constraints for building a well-posed Autoencoder. A well-posed Autoencoder is a regularized model that improves the test reconstruction error.

TensorFlow 2 is going to change the landscape of Deep Learning. It has made, model building simpler, production deployment on any platform more robust, and enables powerful experimentation for research. With these, Deep Learning is going to become more mainstream in various areas in research and industry. TensorFlow 2 has Keras API integrated in it. Keras is an extremely popular high-level API for building and training deep learning models. Before going forward it is important to know, TensorFlow 1.x also supports Keras, but in 2.0 Keras is integrated tightly with the rest of the TensorFlow platform. 2.0 is providing a single high-level API to reduce confusion and enable advanced capabilities. The Keras commonly used now is an independent open source project found at www.keras.io (June, 2019). However, Keras is an API spec that is now also available in TensorFlow (see [1] for details). I recommend reading [1] and [2] to know more details on the benefits of TensorFlow 2.0. In summary, TF 2.0 has brought the ease-of-implementation along with immense computational efficiency, and compatibility with any platform, such as, Android, iOS and embedded systems like a Raspberry Pi and Edge TPUs. Achieving these were difficult before and required investing time on finding alternate ways. As TensorFlow 2 has brought all of them, it is imperative to migrate to it sooner than later. To that end, here we will learn installing and setting up TensorFlow 2.0. Prerequisites Option 1: Python 3.4+ through Anaconda Anaconda with Jupyter provides a simpler approach for installing Python and working on it. Installing Anaconda is relatively straightforward. Follow this link with the latest Python 3.4+: https://jupyter.org/install Similar to pip, with Anaconda we have conda for creating virtual environments and installing packages. Option 2: Python (without Anaconda) a. Install Python 3.4+ Check your current versions. $ python --version or, $ python3 --version I have different Python on my Mac (Python 3.6 on Anaconda) and Ubuntu (Python 3.7). The output I see on them are, Python 3.6.8 :: Anaconda custom (x86_64)# Mac Python 3.7.1# Ubuntu Either Python within Anaconda or otherwise will work. If your version is not 3.4+, install it as follows. $ brew update $ brew install python # Installs Python 3 $ sudo apt install python3-dev python3-pip b. Install virtualenv virtualenv is required to create a virtual environment. Its requirement is explained in the next section. Mac OS $ sudo pip3 install -U virtualenv# system-wide install Ubuntu $ sudo pip3 install -U virtualenv# system-wide install Note: pip (instead of pip3) is also used sometimes. If unsure between the two, use pip3. You will not go wrong with pip3. If you want to know whether you could use pip , run the following $ pip3 --version pip 19.1.1 from /Users/inferno/anaconda/lib/python3.6/site-packages/pip (python 3.6) $ pip --version pip 19.1.1 from /Users/inferno/anaconda/lib/python3.6/site-packages/pip (python 3.6) In my system, the versions are the same for both pip and pip3. Therefore, I can use either of them. In the following, we will look at the installations steps with both. Step 1. Create a virtual environment in Python. Why we want a virtual environment? A virtual environment is an isolated environment for Python projects. Inside a virtual environment we can have a completely independent set of packages (dependencies) and settings that will not conflict with anything in other virtual environment or with the default local Python environment. 
This means we can keep different versions of the same package, e.g. we can use scikit-learn 0.1 for one project, and scikit-learn 0.22 for another project on the same system but in different virtual environments. Instantiate a virtual environment Ubuntu/Mac (Python without Anaconda) $ virtualenv --system-site-packages -p python3 tf_2 The above command will create a virtual environment tf_2. Understanding the command, virtualenv will create a virtual environment. --system-site-packages allows the projects within the virtual environment tf_2 access the global site-packages. The default setting does not allow this access (--no-site-packages was used before for this default setting but now deprecated.) -p python3 is used to set the Python interpreter for tf_2. This argument can be skipped if the virtualenv was installed with Python3. By default, that is the python interpreter for the virtual environment. Another option for setting Python3.x as interpreter is $ virtualenv --system-site-packages --python=python3.7 tf_2. This gives more control. tf_2 is the name of the virtual environment we created. This creates a physical directory at the location of the virtual environments. This /tf_2 directory contains a copy of the Python compiler and all the packages we will install later. Conda on Ubuntu/Mac (Python from Anaconda) If you are using Conda, you can create the virtual environment as, $ conda create -n tf_2 The above command will also create a virtual environment tf_2. Unlike before, we do not require to install a different package for creating a virtual environment. The in-built conda command provides this. Understanding the command, conda can be used to create virtual environments, install packages, list the installed packages in the environment, and so on. In short, conda performs operations that pip and virtualenv does. However, conda does not replace pip as some packages are available on pip but not on conda. create is used to create a virtual environment. -n is an argument specific to create. -n is used to name the virtual environment. The value of n, i.e. the environment name, here is tf_2. Additional useful arguments: similar to--system-site-packages in virtualenv, --use-local can be used. Step 2. Activate the virtual environment. Activate the virtual environment. Ubuntu/Mac (Python without Anaconda) $ source tf_2/bin/activate Conda on Ubuntu/Mac (Python from Anaconda) $ conda activate tf_2 After the activation, the terminal will change to this (tf_2) $ . Step 3. Install TensorFlow 2.0. The following instructions are the same for the both Python options. Before starting the TensorFlow installation, we will update pip. (tf_2) $ pip install --upgrade pip Now, install TensorFlow. (tf_2) $ pip install --upgrade tensorflow==2.0.0-beta1 The tensorflow argument above installs a 2.0.0-beta1 CPU-only version. Choose the appropriate TensorFlow version from https://www.tensorflow.org/install/pip . At the time of writing this article, we have tensorflow 2.0.0-beta1. This is recommended. We can change the argument to one of the following based on our requirement. tensorflow==2.0.0-beta1 -Preview TF 2.0 Beta build for CPU-only (recommended). tensorflow-gpu==2.0.0-beta1 -Preview TF 2.0 Beta build with GPU support. tensorflow -Latest stable release for CPU-only. tensorflow-gpu -Latest stable release with GPU support. tf-nightly -Preview nightly build for CPU-only. tf-nightly-gpu -Preview nightly build with GPU support. Note: we will use pip install for conda as well. 
TensorFlow is not available with conda. Step 4. Test the installation. To quickly test the installation through the terminal, use (tf_2) $ python -c "import tensorflow as tf; x = [[2.]]; print('tensorflow version', tf.__version__); print('hello, {}'.format(tf.matmul(x, x)))" The output will be (ignoring the system messages), tensorflow version 2.0.0-beta1 hello, [[4.]] Pay attention to the TensorFlow version output. If it is not the version you installed (2.0.0-beta1, in this case), then something went wrong. Most likely, there is a prior installed TensorFlow and/or the current installation failed. TensorFlow 2.0 Example We will test and learn the TensorFlow 2.0 with MNIST ( fashion_mnist) image classification example. import matplotlib.pyplot as plt  import tensorflow as tf layers = tf.keras.layers  import numpy as np  print(tf.__version__) Make sure the tf.__version__ outputs 2.x. If the version is older, check the installation or the virtual environment. Download the fashion_mnist data from the tf open datasets and pre-process it. mnist = tf.keras.datasets.fashion_mnist (x_train, y_train), (x_test, y_test) = mnist.load_data() x_train, x_test = x_train / 255.0, x_test / 255.0 To get familiarized with the data, we will plot a few examples from it. class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'] plt.figure(figsize=(10,10)) for i in range(25):  plt.subplot(5,5,i+1)  plt.xticks([])  plt.yticks([])  plt.grid(False)  plt.imshow(x_train[i], cmap=plt.cm.binary)  plt.xlabel(class_names[y_train[i]]) plt.show() Image for post Now, we will build the model layer-by-layer. model = tf.keras.Sequential() model.add(layers.Flatten()) model.add(layers.Dense(64, activation='relu')) model.add(layers.Dense(64, activation='relu')) model.add(layers.Dense(10, activation='softmax')) model.compile(optimizer='adam',  loss='sparse_categorical_crossentropy',  metrics=['accuracy']) model.fit(x_train, y_train, epochs=5) Image for post Note that this model is only for demonstration and, therefore, trained on just five epochs. We will now test the model accuracy on the test data. model.evaluate(x_test, y_test) Image for post We will visualize one of the predictions. We will use some UDFs from [ 3]. def plot_image(i, predictions_array, true_label, img):  predictions_array, true_label, img = predictions_array[i], true_label[i], img[i]  plt.grid(False)  plt.xticks([])  plt.yticks([])    plt.imshow(img, cmap=plt.cm.binary)  predicted_label = np.argmax(predictions_array)    if predicted_label == true_label:  color = 'blue'  else:  color = 'red'    plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],  100*np.max(predictions_array),  class_names[true_label]),  color=color) def plot_value_array(i, predictions_array, true_label):  predictions_array, true_label = predictions_array[i], true_label[i]  plt.grid(False)  plt.xticks([])  plt.yticks([])  thisplot = plt.bar(range(10), predictions_array, color="#777777")  plt.ylim([0, 1])   predicted_label = np.argmax(predictions_array)    thisplot[predicted_label].set_color('red')  thisplot[true_label].set_color('blue') We will find the prediction, i.e. the probability of each image belonging to each of the 10 classes, for the test images. 
predictions = model.predict(x_test) i = 0 plt.figure(figsize=(6,3)) plt.subplot(1,2,1) plot_image(i, predictions, y_test, x_test) plt.subplot(1,2,2) plot_value_array(i, predictions, y_test) plt.show() Image for post As we can see in the plot above, the prediction probability of ‘Ankle boot’ is the highest. To further confirm, we output the predicted label as, predicted_label = class_names[np.argmax(predictions[0])] print('Actual label:', class_names[y_test[0]])  print('Predicted label:', predicted_label) Image for post Step 5. Deactivate the virtual environment Before closing, we will deactivate the virtual environment. For virtualenv use, (tf_2) $ deactivate For conda use, (tf_2) $ conda deactivate The GitHub repository with the MNIST example on TensorFlow 2.0 is here. Conclusion TensorFlow 2.0 has brought the easy-to-use capabilities of keras API, e.g. layer-by-layer modeling. We learned installing TensorFlow 2.0. We went through a real MNIST data classification example with TF 2.0. References Standardizing on Keras: Guidance on High-level APIs in TensorFlow 2.0 What’s coming in TensorFlow 2.0 Train your first neural network: basic classification
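Since one of the motivations for TensorFlow 2.0 mentioned above is deployment to platforms such as Android, iOS, and edge devices, the following is a minimal sketch of saving the model trained above and converting it with the TensorFlow Lite converter. The file names are hypothetical, and the converter API is as available in TF 2.x.

# Save the trained Keras model, reload it, and convert it for
# on-device deployment with TensorFlow Lite (a sketch).
model.save('fashion_mnist.h5')                        # save the full model
restored = tf.keras.models.load_model('fashion_mnist.h5')
print(restored.evaluate(x_test, y_test, verbose=0))   # same loss/accuracy as before

converter = tf.lite.TFLiteConverter.from_keras_model(restored)
tflite_model = converter.convert()
with open('fashion_mnist.tflite', 'wb') as f:
    f.write(tflite_model)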

Due to its ease-of-use, efficiency, and cross-compatibility, TensorFlow 2 is going to change the landscape of Deep Learning. Here we will learn to install and set it up. We will also implement MNIST classification with TensorFlow 2.

 In my previous post, LSTM Autoencoder for Extreme Rare Event Classification [1], we learned how to build an LSTM autoencoder for a multivariate time-series data. However, LSTMs in Deep Learning is a bit more involved. Understanding the LSTM intermediate layers and its settings is not straightforward. For example, usage of return_sequences argument, and RepeatVector and TimeDistributed layers can be confusing. LSTM tutorials have well explained the structure and input/output of LSTM cells, e.g. [2, 3]. But despite its peculiarities, little is found that explains the mechanism of LSTM layers working together in a network. Here we will break down an LSTM autoencoder network to understand them layer-by-layer. Additionally, the popularly used seq2seq networks are similar to LSTM Autoencoders. Hence, most of these explanations are applicable for seq2seq as well. In this article, we will use a simple toy example to learn, Meaning of return_sequences=True, RepeatVector(), and TimeDistributed(). Understanding the input and output of each LSTM Network layer. Differences between a regular LSTM network and an LSTM Autoencoder. Understanding Model Architecture Importing our necessities first. # lstm autoencoder to recreate a timeseries import numpy as np from keras.models import Sequential from keras.layers import LSTM from keras.layers import Dense from keras.layers import RepeatVector from keras.layers import TimeDistributed ''' A UDF to convert input data into 3-D array as required for LSTM network. '''  def temporalize(X, y, lookback):     output_X = []     output_y = []     for i in range(len(X)-lookback-1):         t = []         for j in range(1,lookback+1):             # Gather past records upto the lookback period             t.append(X[[(i+j+1)], :])         output_X.append(t)         output_y.append(y[i+lookback+1])     return output_X, output_y Creating an example data We will create a toy example of a multivariate time-series data. # define input timeseries timeseries = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],                        [0.1**3, 0.2**3, 0.3**3, 0.4**3, 0.5**3, 0.6**3, 0.7**3, 0.8**3, 0.9**3]]).transpose()  timesteps = timeseries.shape[0] n_features = timeseries.shape[1] timeseries Image for post Figure 1.1. Raw dataset. As required for LSTM networks, we require to reshape an input data into n_samples x timesteps x n_features. In this example, the n_features is 2. We will make timesteps = 3. With this, the resultant n_samples is 5 (as the input data has 9 rows). timesteps = 3 X, y = temporalize(X = timeseries, y = np.zeros(len(timeseries)), lookback = timesteps)  n_features = 2 X = np.array(X) X = X.reshape(X.shape[0], timesteps, n_features)  X Image for post Figure 1.2. Data transformed to a 3D array for an LSTM network. Understanding an LSTM Autoencoder Structure In this section, we will build an LSTM Autoencoder network, and visualize its architecture and data flow. We will also look at a regular LSTM Network to compare and contrast its differences with an Autoencoder. Defining an LSTM Autoencoder. 
# define model model = Sequential() model.add(LSTM(128, activation='relu', input_shape=(timesteps,n_features), return_sequences=True)) model.add(LSTM(64, activation='relu', return_sequences=False)) model.add(RepeatVector(timesteps)) model.add(LSTM(64, activation='relu', return_sequences=True)) model.add(LSTM(128, activation='relu', return_sequences=True)) model.add(TimeDistributed(Dense(n_features))) model.compile(optimizer='adam', loss='mse') model.summary() Image for post Figure 2.1. Model Summary of LSTM Autoencoder. # fit model model.fit(X, X, epochs=300, batch_size=5, verbose=0) # demonstrate reconstruction yhat = model.predict(X, verbose=0) print('---Predicted---') print(np.round(yhat,3)) print('---Actual---') print(np.round(X, 3)) Image for post Figure 2.2. Input Reconstruction of LSTM Autoencoder. The model.summary() provides a summary of the model architecture. For a better understanding, let’s visualize it in Figure 2.3 below. Image for post Figure 2.3. LSTM Autoencoder Flow Diagram. The diagram illustrates the flow of data through the layers of an LSTM Autoencoder network for one sample of data. A sample of data is one instance from a dataset. In our example, one sample is a sub-array of size 3x2 in Figure 1.2. From this diagram, we learn The LSTM network takes a 2D array as input. One layer of LSTM has as many cells as the timesteps. Setting the return_sequences=True makes each cell per timestep emit a signal. This becomes clearer in Figure 2.4 which shows the difference between return_sequences as True (Fig. 2.4a) vs False (Fig. 2.4b). Image for post Figure 2.4. Difference between return_sequences as True and False. In Fig. 2.4a, signal from a timestep cell in one layer is received by the cell of the same timestep in the subsequent layer. In the encoder and decoder modules in an LSTM autoencoder, it is important to have direct connections between respective timestep cells in consecutive LSTM layers as in Fig 2.4a. In Fig. 2.4b, only the last timestep cell emits signals. The output is, therefore, a vector. As shown in Fig. 2.4b, if the subsequent layer is LSTM, we duplicate this vector using RepeatVector(timesteps) to get a 2D array for the next layer. No transformation is required if the subsequent layer is Dense (because a Dense layer expects a vector as input). Coming back to the LSTM Autoencoder in Fig 2.3. The input data has 3 timesteps and 2 features. Layer 1, LSTM(128), reads the input data and outputs 128 features with 3 timesteps for each because return_sequences=True. Layer 2, LSTM(64), takes the 3x128 input from Layer 1 and reduces the feature size to 64. Since return_sequences=False, it outputs a feature vector of size 1x64. The output of this layer is the encoded feature vector of the input data. This encoded feature vector can be extracted and used as a data compression, or features for any other supervised or unsupervised learning (in the next post we will see how to extract this). Layer 3, RepeatVector(3), replicates the feature vector 3 times. The RepeatVector layer acts as a bridge between the encoder and decoder modules. It prepares the 2D array input for the first LSTM layer in Decoder. The Decoder layer is designed to unfold the encoding. Therefore, the Decoder layers are stacked in the reverse order of the Encoder. Layer 4, LSTM (64), and Layer 5, LSTM (128), are the mirror images of Layer 2 and Layer 1, respectively. Layer 6, TimeDistributed(Dense(2)), is added in the end to get the output, where “2” is the number of features in the input data. 
The TimeDistributed layer creates a vector of length equal to the number of features outputted from the previous layer. In this network, Layer 5 outputs 128 features. Therefore, the TimeDistributed layer creates a 128 long vector and duplicates it 2 (= n_features) times. The output of Layer 5 is a 3x128 array that we denote as U and that of TimeDistributed in Layer 6 is 128x2 array denoted as V. A matrix multiplication between U and V yields a 3x2 output. The objective of fitting the network is to make this output close to the input. Note that this network itself ensured that the input and output dimensions match. Comparing LSTM Autoencoder with a regular LSTM Network The above understanding gets clearer when we compare it with a regular LSTM network built for reconstructing the inputs. # define model model = Sequential() model.add(LSTM(128, activation='relu', input_shape=(timesteps,n_features), return_sequences=True)) model.add(LSTM(64, activation='relu', return_sequences=True)) model.add(LSTM(64, activation='relu', return_sequences=True)) model.add(LSTM(128, activation='relu', return_sequences=True)) model.add(TimeDistributed(Dense(n_features))) model.compile(optimizer='adam', loss='mse') model.summary() Image for post Figure 3.1. Model Summary of LSTM Autoencoder. # fit model model.fit(X, X, epochs=300, batch_size=5, verbose=0) # demonstrate reconstruction yhat = model.predict(X, verbose=0) print('---Predicted---') print(np.round(yhat,3)) print('---Actual---') print(np.round(X, 3)) Image for post Figure 3.2. Input Reconstruction of regular LSTM network. Image for post Figure 3.3. Regular LSTM Network flow diagram. Differences between Regular LSTM network and LSTM Autoencoder We are using return_sequences=True in all the LSTM layers. That means, each layer is outputting a 2D array containing each timesteps. Thus, there is no one-dimensional encoded feature vector as output of any intermediate layer. Therefore, encoding a sample into a feature vector is not happening. Absence of this encoding vector differentiates the regular LSTM network for reconstruction from an LSTM Autoencoder. However, note that the number of parameters is the same in both, the Autoencoder (Fig. 2.1) and the Regular network (Fig. 3.1). This is because, the extra RepeatVector layer in the Autoencoder does not have any additional parameter. Most importantly, the reconstruction accuracies of both Networks are similar. Food for thought The rare-event classification using anomaly detection approach discussed in LSTM Autoencoder for rare-event classification [1] is training an LSTM Autoencoder to detect the rare events. The objective of the Autoencoder network in [1] is to reconstruct the input and classify the poorly reconstructed samples as a rare event. Since, we can also build a regular LSTM network to reconstruct a time-series data as shown in Figure 3.3, will that improve the results? The hypothesis behind this is, due to the absence of an encoding layer the accuracy of reconstruction can be better in some cases (because the dimension time-dimension is not reduced). Unless the encoded vector is required for any other analysis, trying a regular LSTM network is worth a try for a rare-event classification. Github Repository The complete code can be found here. cran2367/understanding-lstm-autoencoder Understanding an LSTM Autoencoder. 
Conclusion

In this article, we
worked with a toy example to understand an LSTM network layer-by-layer,
understood the input and output flow from and between each layer,
understood the meaning of return_sequences, RepeatVector(), and TimeDistributed(), and
compared and contrasted an LSTM Autoencoder with a regular LSTM network.

In the next article, we will learn about optimizing a network: how do we decide on adding a new layer and choosing its size?

References
LSTM Autoencoder for Extreme Rare Event Classification in Keras
A Gentle Introduction to LSTM Autoencoders
Understanding LSTM and its diagrams
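As noted above, the output of Layer 2 is the encoded feature vector, which can be extracted and reused for other supervised or unsupervised tasks. The following is a minimal sketch of that extraction; it assumes `model` refers to the LSTM Autoencoder defined in Figure 2.1 (not the regular LSTM network), and it follows the same sub-model pattern used for the dense Autoencoder earlier in this document.

# Extract the 64-dimensional encoding produced by the LSTM(64) layer
# (layer index 1, the layer with return_sequences=False).
from keras.models import Model

encoder = Model(inputs=model.inputs, outputs=model.layers[1].output)
encoded_features = encoder.predict(X)
print(encoded_features.shape)   # expected: (5, 64), one vector per 3x2 input sample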

Here we will break down an LSTM Autoencoder network to understand it layer-by-layer. We will go over the input and output flow between the layers, and also compare the LSTM Autoencoder with a regular LSTM network.

What is a Kernel Trick? In spite of its profound impact on the Machine Learning world, little is found that explains the fundamentals behind the Kernel Trick. Here we will take a look at it. By the end of this post, we will realize how simple the underlying concept is. And perhaps, this simplicity makes the Kernel Trick profound. If you’re reading this, you may already know as a fact that if there’s a dot product in a function we can use the Kernel trick. We typically come across this fact when learning about SVM. An SVM’s objective function is, Image for post In this objective function, we have the dot product 𝐱𝑖ᵀ⋅𝐱𝑗. Due to this dot product, SVM becomes extremely powerful because now we can use the Kernel trick. What’s this Kernel trick and how does it make SVM powerful? In the following, we will look at this concept and get to the smallest details to help us understand. This post should clear most of the why Kernel trick works questions, including what does it mean to work in infinite dimension? We’ll start with the most common example and then expand to the general case. Image for post Figure 1: Example of a labeled data inseparable in 2-Dimension is separable in 3-Dimension. Source: [2] In the above example, the original data is in 2-dimension. Suppose we denote it as, 𝐱={𝑥₁, 𝑥₂}. We can see in Fig.1 (left) that 𝐱 is inseparable in its space. But they are separable in a transformed space (see Fig.1, right) given by, Image for post where, Φ is a transform function from 2-D to 3-D applied on 𝐱. These points can also be separated with Φ(𝐱)→x₁², x₂² transformation, but the one in Eq. 2 above will help explain the use of a higher dimensional space. The √2 is not necessary but will make our further explanations mathematically convenient. With the Φ in Eq. 2, now we can have a decision boundary in a 3-D space that will look like, Image for post If we were doing a logistic regression, our model would be like Eq. 3. In SVM, a similar decision boundary (a classifier) can be found using the Kernel Trick. For that we need to find the dot products of ⟨Φ(𝐱𝑖),Φ(𝐱𝑗)⟩ (see Eq. 13 in [4]) Let’s do that. I’ll do it like this, My way: Image for post Instead, my friend Sam, who is smarter, did the following, Image for post What’s different? Computation operations: My way: To get to Eq. 4a, I perform 3×2 computations to transform each of 𝐱i and 𝐱𝑗 into the 3-D space of Φ. After that we perform a dot product between Φ(𝐱𝑖) and Φ(𝐱𝑗) which has 3 additional operations (in Eq. 4b). Total: 9 operations. Sam’s way: Until Eq. 5b, Sam did 2 computing operations. Finally, in Eq. 5c, he did one more computation. Total: 3 operations. Note that in Eq. 5b we are squaring a scalar, hence just one operation. Computation space: My way: I applied the mapping transform function Φ on my data 𝐱. And then performed my operations in the Φ space (a 3-D space). Sam’s way: Sam did not apply the transform function. He stayed in the original 2-D space and arrived at the same result as I had from computations in a 3-D space. Sam is definitely smarter than me. What he did turns out to be the Kernel trick. But I won’t leave just here. Let’s stew more on it with more examples. Suppose I wanted a bigger expression for my decision boundary than the one in Eq. 3 (because we expect it to work better), which has both first- and second-order terms as, Image for post Let’s see my way and Sam’s way again, My way: Image for post Sam’s way: Image for post Comparing my way and Sam’s way again: I had to explicitly define Φ. 
While one could argue that Sam had to explicitly know as well to add a 1 to the dot product of 𝐱𝑖, 𝐱𝑗, but as we will see soon that Sam’s approach can be easily generalized. My way took 16 operations, Sam’s way still took only 3 operations. (Again, note that in Eq. 8a, we are squaring a scalar, hence just one operation.) Sam again did not leave the original 2-D space to find the same similarity measure (dot product is also a similarity measure) I found in the 5-D space. Similarly, we can keep going to higher dimensions. If it was to Sam, he can easily find the similarity measure that has third-order terms in a 9-D space by, Sam’s way: Image for post At this point, I will not even bother to go my way. I hope you got the point why Sam’s way is clearly better than mine. He’s just computing the dot product in the original space and raising the result (a scalar) to a power. And this is exactly same as the dot product in a higher dimensional space. This is precisely the Kernel trick. Let’s summarize the Kernel trick from Sam’s method. Kernel: In the above examples, the Kernel functions used by Sam are, Image for post Mapping function: Sam did not need to know a 3-, 5-, or 9- dimensional mapping function, Φ, to get the similarity measure (the dot product) in these high-dimensions. Magic recap: All Sam needs to do is realize there is some higher dimensional space which can separate the data. Choose a corresponding Kernel and voila! He is now working in the high-dimension while doing computation in the original low dimensional space. And he is yet separating the earlier inseparable data. The Kernel examples Sam used so far are special cases of a Linear Kernel, Image for post How does the Kernel trick work in Infinite dimension? We know that Kernels can find the similarities in infinite dimensional spaces, and, without doing computation in the infinite space. If you’re still with me so far, now get ready for this crazy one. Here is the trick behind this magic. I will show the trick with a Gaussian Kernel (also called Radial Basis Function, RBF), and the same logic can be extended to other infinite-dimensional Kernels, such as, Exponential, Laplace, etc. A Gaussian Kernel is defined as, Image for post For simplicity, suppose 𝜎=1. The Gaussian Kernel in Eq. 10 can then the expanded as, Image for post From Sam’s way of calculations now we know that ⟨𝐱𝑖,𝐱𝑗⟩ⁿ will yield 𝑛-order terms. Since the Gaussian has an infinite series expansion, we get terms of all orders till infinity. And, therefore, a Gaussian Kernel enables us to find similarity in infinite dimension. In this instance too, all the computation we have to do is find the squared Euclidean distance between 𝐱𝑖 and 𝐱𝑗, and find its exponential (the computations happen in the original space). So Sam has nothing left to worry about while using Kernels? He does. At the end, he does need to choose which of the Kernel function to use. [1] has an exhaustive list of Kernels to choose from. And, tune the hyperparameters of a Kernel. For example, in Gaussian Kernel we need to tune 𝜎. [3] has a list of papers which talk about this. Do we need to know the appropriate Kernel function first? Yes, we do need to determine which Kernel function will be appropriate. However, we do not need to know it first. First, we have to realize that a linear decision boundary is not going to work. This is realized when we see a poor model accuracy, and some data visualization can be used (e.g. Figure 1), if possible. 
Upon realizing that a linear boundary is not going to work, we go for the Kernel trick. Most Kernels will lead to a nonlinear, and possibly better, decision boundary. [1] gives an exhaustive list of choices, but there is no direct way to know which Kernel function will be the best choice. Conventional model selection methods, such as cross-validation, can be used to find the Kernel function that performs best. However, since the Kernel trick adds no extra computation for separating data points in a high- or infinite-dimensional space, practitioners often go straight to the infinite-dimensional case with the Gaussian (RBF) Kernel, which is the most commonly used Kernel. In short, as a rule of thumb: once you realize a linear boundary is not going to work, try a nonlinear boundary with an RBF Kernel.

Conclusion

In this post, we went through the elementary details of the Kernel Trick. We also answered how the Kernel Trick finds the dot product (similarity) in an infinite-dimensional space without any increase in computation. Please leave a comment if any part was unclear.

Disclaimer: This is an edited version of my previous post on Quora for the question "what is the kernel trick," and also a Quora blog here.

References
[1] Kernel Functions for Machine Learning Applications
[2] Berkeley CS281B Lecture: The Kernel Trick
[3] Support Vector Machines: Parameters
[4] Support Vector Machines: Lecture (Stanford CS 229)
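As a closing sanity check of the identity behind Sam's shortcut, the following short sketch (using NumPy; not part of the original post) verifies numerically that the squared dot product computed in the original 2-D space equals the dot product computed explicitly in the 3-D space given by Φ(x) = [x1², √2·x1·x2, x2²].

# Verify <phi(xi), phi(xj)> == (<xi, xj>)^2 on a small example.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 4.0])

my_way  = np.dot(phi(xi), phi(xj))   # transform to 3-D, then dot product
sam_way = np.dot(xi, xj) ** 2        # stay in 2-D and square the dot product
print(my_way, sam_way)               # both print 121.0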

Here, we learn the fundamentals behind the Kernel Trick: how it works, and how it computes the dot product (or similarity) in an infinite-dimensional space without any increase in computation.

In spite of the groundbreaking results reported, little is known about Dropout from a theoretical standpoint. Likewise, the importance of Dropout rate as 0.5 and how it should be changed with layers are not evidently clear. Also, can we generalize Dropout to other approaches? The following will provide some explanations. Deep Learning architectures are now becoming deeper and wider. With these bigger networks, we are able to achieve good accuracies. However, this was not the case about a decade ago. Deep Learning was, in fact, infamous due to overfitting issue. Image for post Figure 1. A dense neural network. Then, around 2012, the idea of Dropout emerged. The concept revolutionized Deep Learning. Much of the success that we have with Deep Learning is attributed to Dropout. Quick recap: What is Dropout? Dropout changed the concept of learning all the weights together to learning a fraction of the weights in the network in each training iteration. Image for post Figure 2. Illustration of learning a part of the network in each iteration. This issue resolved the overfitting issue in large networks. And suddenly bigger and more accurate Deep Learning architectures became possible. In this post, our objective is to understand the Math behind Dropout. However, before we get to the Math, let’s take a step back and understand what changed with Dropout. This will be a motivation to touch the Math. Before Dropout, a major research area was regularization. Introduction of regularization methods in neural networks, such as L1 and L2 weight penalties, started from the early 2000s [1]. However, these regularizations did not completely solve the overfitting issue. The reason was Co-adaptation. Co-adaptation in Neural Network Image for post Figure 3. Co-adaption of node connections in a Neural Network. One major issue in learning large networks is co-adaptation. In such a network, if all the weights are learned together it is common that some of the connections will have more predictive capability than the others. In such a scenario, as the network is trained iteratively these powerful connections are learned more while the weaker ones are ignored. Over many iterations, only a fraction of the node connections is trained. And the rest stop participating. This phenomenon is called co-adaptation. This could not be prevented with the traditional regularization, like the L1 and L2. The reason is they also regularize based on the predictive capability of the connections. Due to this, they become close to deterministic in choosing and rejecting weights. And, thus again, the strong gets stronger and the weak gets weaker. A major fallout of this was: expanding the neural network size would not help. Consequently, neural networks’ size and, thus, accuracy became limited. Then came Dropout. A new regularization approach. It resolved the co-adaptation. Now, we could build deeper and wider networks. And use the prediction power of all of it. With this background, let’s dive into the Mathematics of Dropout. You may skip directly to Dropout equivalent to regularized Network section for the inferences. Math behind Dropout Consider a single layer linear unit in a network as shown in Figure 4 below. Refer [2] for details. Image for post Figure 4. A single layer linear unit out of network. This is called linear because of the linear activation, f(x) = x. As we can see in Figure 4, the output of the layer is a linear weighted sum of the inputs. We are considering this simplified case for a mathematical explanation. 
The results (empirically) hold for the usual non-linear networks. For model estimation, we minimize a loss function. For this linear layer, we will look at the ordinary least square loss, Image for post Eq. 1 shows loss for a regular network and Eq. 2 for a dropout network. In Eq. 2, the dropout rate is 𝛿, where 𝛿 ~ Bernoulli(p). This means 𝛿 is equal to 1 with probability p and 0 otherwise. The backpropagation for network training uses a gradient descent approach. We will, therefore, first look at the gradient of the dropout network in Eq. 2, and then come to the regular network in Eq. 1. Image for post Now, we will try to find a relationship between this gradient and the gradient of the regular network. To that end, suppose we make w’ = p*w in Eq. 1. Therefore, Image for post Taking the derivative of Eq. 4, we find, Image for post Now, we have the interesting part. If we find the expectation of the gradient of the Dropout network, we get, Image for post If we look at Eq. 6, the expectation of the gradient with Dropout, is equal to the gradient of Regularized regular network Eɴ if w’ = p*w. Dropout equivalent to regularized Network This means minimizing the Dropout loss (in Eq. 2) is equivalent to minimizing a regularized network, shown in Eq. 7 below. Image for post That is, if you differentiate a regularized network in Eq. 7, you will get to the (expectation of) gradient of a Dropout network as in Eq. 6. This is a profound relationship. From here, we can answer: Why dropout rate, p = 0.5, yields the maximum regularization? This is because the regularization parameter, p(1-p) in Eq. 7, is maximum at p = 0.5. What values of p should be chosen for different layers? In Keras, the dropout rate argument is (1-p). For intermediate layers, choosing (1-p) = 0.5 for large networks is ideal. For the input layer, (1-p) should be kept about 0.2 or lower. This is because dropping the input data can adversely affect the training. A (1-p) > 0.5 is not advised, as it culls more connections without boosting the regularization. Why we scale the weights w by p during the test or inferencing? Because the expected value of a Dropout network is equivalent to a regular network with its weights scaled with the Dropout rate p. The scaling makes the inferences from a Dropout network comparable to the full network. There are computational benefits as well, which is explained with an Ensemble modeling perspective in [1]. Before we go, I want to touch upon Gaussian-Dropout. What is Gaussian-Dropout? As we saw before, in Dropout we are dropping a connection with probability (1-p). Put mathematically, in Eq. 2 we have the connection weights multiplied with a random variable, 𝛿, where 𝛿 ~ Bernoulli(p). This Dropout procedure can be looked at as putting a Bernoulli gate on each connection. Image for post Figure 5. Dropout seen as a Bernoulli gate on connections. We can replace the Bernoulli gate with another gate. For example, a Gaussian Gate. And this gives us a Gaussian-Dropout. Image for post Figure 6. Dropout generalized to a Gaussian gate (instead of Bernoulli). The Gaussian-Dropout has been found to work as good as the regular Dropout and sometimes better. With a Gaussian-Dropout, the expected value of the activation remains unchanged (see Eq. 8). Therefore, unlike the regular Dropout, no weight scaling is required during inferencing. Image for post This property gives the Gaussian-Dropout a computational advantage as well. We will explore the performance of Gaussian-Dropout in an upcoming post. 
Until then, a word of caution: although the idea of a Dropout gate can be generalized to distributions other than Bernoulli, it is advisable to understand how the new distribution affects the expectation of the activations, and to scale the activations accordingly.

Conclusion

In this post, we went through the mathematics behind Dropout. We worked out the math under some simplified conditions; however, the results extend to the general cases in Deep Learning. In summary, we understood:
the relationship between Dropout and regularization,
why a Dropout rate of 0.5 leads to the maximum regularization, and
the generalization of Dropout to Gaussian-Dropout.

References
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Baldi, P., & Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information Processing Systems (pp. 2814–2822).
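The weight-scaling rule discussed above (scale the weights by p at inference because the expected value of the Dropout network matches a regular network with weights p·w) can be checked with a small simulation. The sketch below is illustrative and not from the original post; it averages the output of a linear unit over many Bernoulli masks.

# Simulate a linear unit with Bernoulli(p) dropout on its inputs and
# compare the average output with the full unit using weights scaled by p.
import numpy as np

rng = np.random.RandomState(0)
p = 0.5                                  # keep probability
w = rng.randn(10)                        # weights of the linear unit
x = rng.randn(10)                        # one input sample

full_scaled = np.dot(p * w, x)           # regular network with weights scaled by p

# average the dropout unit's output over many random Bernoulli masks
outputs = [np.dot(w * rng.binomial(1, p, size=10), x) for _ in range(100000)]
print(round(np.mean(outputs), 3), round(full_scaled, 3))  # approximately equal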

Here we will understand the mathematics that drives Dropout: how it leads to regularization, why a Dropout rate of 0.5 leads to the most regularization, and what Gaussian-Dropout is.

Background What is an extreme rare event? In a rare-event problem, we have an unbalanced dataset. Meaning, we have fewer positively labeled samples than negative. In a typical rare-event problem, the positively labeled data are around 5–10% of the total. In an extreme rare event problem, we have less than 1% positively labeled data. For example, in the dataset used here it is around 0.6%. Such extreme rare event problems are quite common in the real-world, for example, sheet-breaks and machine failure in manufacturing, clicks or purchase in an online industry. Classifying these rare events is quite challenging. Recently, Deep Learning has been quite extensively used for classification. However, the small number of positively labeled samples prohibits Deep Learning application. No matter how large the data, the use of Deep Learning gets limited by the amount of positively labeled samples. Why should we still bother to use Deep Learning? This is a legitimate question. Why should we not think of using some another Machine Learning approach? The answer is subjective. We can always go with a Machine Learning approach. To make it work, we can undersample from negatively labeled data to have a close to a balanced dataset. Since we have about 0.6% positively labeled data, the undersampling will result in rougly a dataset that is about 1% of the size of the original data. A Machine Learning approach, e.g. SVM or Random Forest, will still work on a dataset of this size. However, it will have limitations in its accuracy. And we will not utilize the information present in the remaining ~99% of the data. If the data is sufficient, Deep Learning methods are potentially more capable. It also allows flexibility for model improvement by using different architectures. We will, therefore, attempt to use Deep Learning methods. In this post, we will learn how we can use a simple dense layers autoencoder to build a rare event classifier. The purpose of this post is to demonstrate the implementation of an Autoencoder for extreme rare-event classification. We will leave the exploration of different architecture and configuration of the Autoencoder on the user. Please share in the comments if you find anything interesting. Autoencoder for Classification The autoencoder approach for classification is similar to anomaly detection. In anomaly detection, we learn the pattern of a normal process. Anything that does not follow this pattern is classified as an anomaly. For a binary classification of rare events, we can use a similar approach using autoencoders (derived from here [2]). Quick revision: What is an autoencoder? An autoencoder is made of two modules: encoder and decoder. The encoder learns the underlying features of a process. These features are typically in a reduced dimension. The decoder can recreate the original data from these underlying features. Image for post Figure 1. Illustration of an autoencoder. [Source: Autoencoder by Prof. Seungchul Lee iSystems Design Lab] How to use an Autoencoder rare-event classification? We will divide the data into two parts: positively labeled and negatively labeled. The negatively labeled data is treated as normal state of the process. A normal state is when the process is eventless. We will ignore the positively labeled data, and train an Autoencoder on only negatively labeled data. This Autoencoder has now learned the features of the normal process. 
A well-trained Autoencoder will predict any new data that is coming from the normal state of the process (as it will have the same pattern or distribution). Therefore, the reconstruction error will be small. However, if we try to reconstruct a data from a rare-event, the Autoencoder will struggle. This will make the reconstruction error high during the rare-event. We can catch such high reconstruction errors and label them as a rare-event prediction. This procedure is similar to anomaly detection methods. Implementation Data and problem This is a binary labeled data from a pulp-and-paper mill for sheet breaks. Sheet breaks is severe problem in paper manufacturing. A single sheet break causes loss of several thousand dollars, and the mills see at least one or more break every day. This causes millions of dollors of yearly losses and work hazards. Detecting a break event is challenging due to the nature of the process. As mentioned in [1], even a 5% reduction in the breaks will bring significant benefit to the mills. The data we have contains about 18k rows collected over 15 days. The column y contains the binary labels, with 1 denoting a sheet break. The rest columns are predictors. There are about 124 positive labeled sample (~0.6%). Download data here. Code Import the desired libraries. %matplotlib inline import matplotlib.pyplot as plt import seaborn as sns import pandas as pd import numpy as np from pylab import rcParams import tensorflow as tf from keras.models import Model, load_model from keras.layers import Input, Dense from keras.callbacks import ModelCheckpoint, TensorBoard from keras import regularizers from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.metrics import confusion_matrix, precision_recall_curve from sklearn.metrics import recall_score, classification_report, auc, roc_curve from sklearn.metrics import precision_recall_fscore_support, f1_score from numpy.random import seed seed(1) from tensorflow import set_random_seed set_random_seed(2) SEED = 123 #used to help randomly select the data points DATA_SPLIT_PCT = 0.2 rcParams['figure.figsize'] = 8, 6 LABELS = ["Normal","Break"] Note that we are setting the random seeds for reproducibility of the result. Data preprocessing Now, we read and prepare the data. df = pd.read_csv("data/processminer-rare-event-mts - data.csv") The objective of this rare-event problem is to predict a sheet-break before it occurs. We will try to predict the break 4 minutes in advance. To build this model, we will shift the labels 2 rows up (which corresponds to 4 minutes). We can do this as df.y=df.y.shift(-2). However, in this problem we would want to do the shifting as: if row n is positively labeled, Make row (n-2) and (n-1) equal to 1. This will help the classifier learn up to 4 minute ahead prediction. Delete row n. Because we do not want the classifier to learn predicting a break when it has happened. We will develop the following UDF for this curve shifting. sign = lambda x: (1, -1)[x < 0]  def curve_shift(df, shift_by):     '''     This function will shift the binary labels in a dataframe.     The curve shift will be with respect to the 1s.      For example, if shift is -2, the following process     will happen: if row n is labeled as 1, then     - Make row (n+shift_by):(n+shift_by-1) = 1.     - Remove row n.     i.e. the labels will be shifted up to 2 rows up.          Inputs:     df       A pandas dataframe with a binary labeled column.               
This labeled column should be named as 'y'.     shift_by An integer denoting the number of rows to shift.          Output     df       A dataframe with the binary labels shifted by shift.     '''      vector = df['y'].copy()     for s in range(abs(shift_by)):         tmp = vector.shift(sign(shift_by))         tmp = tmp.fillna(0)         vector += tmp     labelcol = 'y'     # Add vector to the df     df.insert(loc=0, column=labelcol+'tmp', value=vector)     # Remove the rows with labelcol == 1.     df = df.drop(df[df[labelcol] == 1].index)     # Drop labelcol and rename the tmp col as labelcol     df = df.drop(labelcol, axis=1)     df = df.rename(columns={labelcol+'tmp': labelcol})     # Make the labelcol binary     df.loc[df[labelcol] > 0, labelcol] = 1      return df Before moving forward, we will drop the time, and also the categorical columns for simplicity. # Remove time column, and the categorical columns df = df.drop(['time', 'x28', 'x61'], axis=1) Now, we divide the data into train, valid, and test sets. Then we will take the subset of data with only 0s to train the autoencoder. df_train, df_test = train_test_split(df, test_size=DATA_SPLIT_PCT, random_state=SEED) df_train, df_valid = train_test_split(df_train, test_size=DATA_SPLIT_PCT, random_state=SEED) df_train_0 = df_train.loc[df['y'] == 0] df_train_1 = df_train.loc[df['y'] == 1] df_train_0_x = df_train_0.drop(['y'], axis=1) df_train_1_x = df_train_1.drop(['y'], axis=1) df_valid_0 = df_valid.loc[df['y'] == 0] df_valid_1 = df_valid.loc[df['y'] == 1] df_valid_0_x = df_valid_0.drop(['y'], axis=1) df_valid_1_x = df_valid_1.drop(['y'], axis=1) df_test_0 = df_test.loc[df['y'] == 0] df_test_1 = df_test.loc[df['y'] == 1] df_test_0_x = df_test_0.drop(['y'], axis=1) df_test_1_x = df_test_1.drop(['y'], axis=1) Standardization It is usually better to use a standardized data (transformed to Gaussian, mean 0 and variance 1) for autoencoders. scaler = StandardScaler().fit(df_train_0_x) df_train_0_x_rescaled = scaler.transform(df_train_0_x) df_valid_0_x_rescaled = scaler.transform(df_valid_0_x) df_valid_x_rescaled = scaler.transform(df_valid.drop(['y'], axis = 1)) df_test_0_x_rescaled = scaler.transform(df_test_0_x) df_test_x_rescaled = scaler.transform(df_test.drop(['y'], axis = 1)) Autoencoder Classifier Initialization First, we will initialize the Autoencoder architecture. We are building a simple autoencoder. More complex architectures and other configurations should be explored. nb_epoch = 200 batch_size = 128 input_dim = df_train_0_x_rescaled.shape[1] #num of predictor variables,  encoding_dim = 32 hidden_dim = int(encoding_dim / 2) learning_rate = 1e-3  input_layer = Input(shape=(input_dim, )) encoder = Dense(encoding_dim, activation="relu", activity_regularizer=regularizers.l1(learning_rate))(input_layer) encoder = Dense(hidden_dim, activation="relu")(encoder) decoder = Dense(hidden_dim, activation="relu")(encoder) decoder = Dense(encoding_dim, activation="relu")(decoder) decoder = Dense(input_dim, activation="linear")(decoder) autoencoder = Model(inputs=input_layer, outputs=decoder) autoencoder.summary() Image for post Training We will train the model and save it in a file. Saving a trained model is a good practice for saving time for future analysis. 
autoencoder.compile(metrics=['accuracy'],                     loss='mean_squared_error',                     optimizer='adam') cp = ModelCheckpoint(filepath="autoencoder_classifier.h5",                                save_best_only=True,                                verbose=0) tb = TensorBoard(log_dir='./logs',                 histogram_freq=0,                 write_graph=True,                 write_images=True) history = autoencoder.fit(df_train_0_x_rescaled, df_train_0_x_rescaled,                     epochs=nb_epoch,                     batch_size=batch_size,                     shuffle=True,                     validation_data=(df_valid_0_x_rescaled, df_valid_0_x_rescaled),                     verbose=1,                     callbacks=[cp, tb]).history Image for post Figure 2. Loss for Autoencoder Training. Classification In the following, we show how we can use an Autoencoder reconstruction error for the rare-event classification. As mentioned before, if the reconstruction error is high, we will classify it as a sheet-break. We will need to determine the threshold for this. We will use the validation set to identify the threshold. valid_x_predictions = autoencoder.predict(df_valid_x_rescaled) mse = np.mean(np.power(df_valid_x_rescaled - valid_x_predictions, 2), axis=1) error_df = pd.DataFrame({'Reconstruction_error': mse,                         'True_class': df_valid['y']}) precision_rt, recall_rt, threshold_rt = precision_recall_curve(error_df.True_class, error_df.Reconstruction_error) plt.plot(threshold_rt, precision_rt[1:], label="Precision",linewidth=5) plt.plot(threshold_rt, recall_rt[1:], label="Recall",linewidth=5) plt.title('Precision and recall for different threshold values') plt.xlabel('Threshold') plt.ylabel('Precision/Recall') plt.legend() plt.show() Image for post Figure 3. A threshold of 0.4 should provide a reasonable trade-off between precision and recall. Now, we will perform classification on the test data. We should not estimate the classification threshold from the test data. It will result in overfitting. test_x_predictions = autoencoder.predict(df_test_x_rescaled) mse = np.mean(np.power(df_test_x_rescaled - test_x_predictions, 2), axis=1) error_df_test = pd.DataFrame({'Reconstruction_error': mse,                         'True_class': df_test['y']}) error_df_test = error_df_test.reset_index() threshold_fixed = 0.4 groups = error_df_test.groupby('True_class') fig, ax = plt.subplots() for name, group in groups:     ax.plot(group.index, group.Reconstruction_error, marker='o', ms=3.5, linestyle='',             label= "Break" if name == 1 else "Normal") ax.hlines(threshold_fixed, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold') ax.legend() plt.title("Reconstruction error for different classes") plt.ylabel("Reconstruction error") plt.xlabel("Data point index") plt.show(); Image for post Figure 4. Using threshold = 0.4 for classification. The orange and blue dots above the threshold line represents the True Positive and False Positive, respectively. In Figure 4, the orange and blue dot above the threshold line represents the True Positive and False Positive, respectively. As we can see, we have good number of false positives. To have a better look, we can see a confusion matrix. 
For a closer look, we can plot a confusion matrix on the test predictions.

pred_y = [1 if e > threshold_fixed else 0 for e in error_df_test.Reconstruction_error.values]
conf_matrix = confusion_matrix(error_df_test.True_class, pred_y)

plt.figure(figsize=(12, 12))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

Figure 5. Confusion Matrix on the test predictions.

We could predict 8 out of 41 break instances. Note that these instances include 2- or 4-minute-ahead predictions. This is around 20%, which is a good recall rate for the paper industry. The False Positive Rate is around 6%. This is not ideal, but not terrible for a mill. Still, this model can be further improved to increase the recall rate with a smaller False Positive Rate. We will look at the AUC below and then talk about the next approach for improvement.

ROC curve and AUC

false_pos_rate, true_pos_rate, thresholds = roc_curve(error_df_test.True_class, error_df_test.Reconstruction_error)
roc_auc = auc(false_pos_rate, true_pos_rate)

plt.plot(false_pos_rate, true_pos_rate, linewidth=5, label='AUC = %0.3f' % roc_auc)
plt.plot([0, 1], [0, 1], linewidth=5)
plt.xlim([-0.01, 1])
plt.ylim([0, 1.01])
plt.legend(loc='lower right')
plt.title('Receiver operating characteristic curve (ROC)')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The AUC is 0.69.

Github repository

The entire code with comments is present in the repository cran2367/autoencoder_classifier on github.com.

What can be done better here?

Autoencoder Optimization

Autoencoders are a nonlinear extension of PCA. However, the conventional Autoencoder developed above does not follow the principles of PCA. In Build the right Autoencoder — Tune and Optimize using PCA principles, Part I and Part II, the PCA principles that should be incorporated in an Autoencoder for optimization are explained and implemented.

LSTM Autoencoder

The problem discussed here is a (multivariate) time series. However, the Autoencoder model above does not take the temporal information/patterns into account. In the next post, we explore whether this is possible with an RNN. We will try an LSTM autoencoder.

Conclusion

We worked on extreme rare-event binary labeled data from a paper mill to build an Autoencoder Classifier. We achieved reasonable accuracy. The purpose here was to demonstrate the use of a basic Autoencoder for rare-event classification. We will further work on developing other methods, including an LSTM Autoencoder that can extract temporal features for better accuracy. The next post on the LSTM Autoencoder is here: LSTM Autoencoder for rare event classification.

Recommended Follow-up Reads

Build the right Autoencoder — Tune and Optimize using PCA principles. Part I.
Build the right Autoencoder — Tune and Optimize using PCA principles. Part II.
LSTM Autoencoder for Extreme Rare Event Classification in Keras.

References

Ranjan, C., Mustonen, M., Paynabar, K., & Pourak, K. (2018). Dataset: Rare Event Classification in Multivariate Time Series. arXiv preprint arXiv:1809.10717.
https://www.datascience.com/blog/fraud-detection-with-tensorflow
Github repo: https://github.com/cran2367/autoencoder_classifier

Autoencoders are trained to reconstruct the input by learning its latent features. In this post, we learn how to implement an autoencoder for building a rare-event classifier.

Here we will learn an approach to get vector embeddings for string sequences. These embeddings can be used for Clustering and Classification.

Sequence modeling has long been a challenge because of the inherently unstructured nature of sequence data. Like text in Natural Language Processing (NLP), sequences are arbitrary strings. To a computer, these strings have no meaning, which makes building a data mining model difficult. For text, we have embeddings, such as word2vec, that convert a word into an n-dimensional vector, essentially placing it in a Euclidean space. In this post, we will learn to do the same for sequences.

Here we will go over an approach to create embeddings for sequences that brings a sequence into a Euclidean space. With these embeddings, we can perform conventional Machine Learning and Deep Learning, e.g., k-means, PCA, and Multi-Layer Perceptrons, on sequence datasets. We provide and work on two datasets — protein sequences and weblogs.

Sequence datasets are commonly found around us, for example, clickstreams, music listening history, and weblogs in tech industries. In BioInformatics, we have large databases of protein sequences. A protein sequence is made of some combination of 20 amino acids. A typical protein sequence is shown below, where each letter corresponds to an amino acid.

Fig. 1. Example of a protein sequence.

A protein sequence does not necessarily contain all 20 amino acids but some subset of them.

For clarity, we will define some keywords used in this post.

alphabet: the discrete elements that make up a sequence, e.g., an amino acid.
alphabet-set: the set of all alphabets that make up the sequences in a corpus, e.g., all protein sequences in a corpus are made of a set of 20 amino acids.
sequence: an ordered series of discrete alphabets. A sequence in a corpus contains a subset of the alphabet-set.

A sequence corpus typically contains thousands to millions of sequences. Clustering and Classification are often required, given labeled or unlabeled data. However, doing this is not straightforward due to the unstructured nature of sequences — arbitrary strings of arbitrary length. To overcome this, sequence embeddings can be used.

Here we will use the SGT embedding, which embeds the long- and short-term patterns in a sequence into a finite-dimensional vector. The advantage of the SGT embedding is that we can easily tune the amount of long-/short-term patterns without increasing the computation.

The source code and data used in the following are here. Before moving forward, we will need to install the sgt package.

$ pip install sgt

Clustering

Protein Sequence Clustering

The data used here is taken from www.uniprot.org. This is a public database for proteins. The data contains the protein sequences and their functions. In this section, we will cluster the protein sequences, and in the next section we will use their functions as labels to build a classifier.
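To get an initial feel for what the embedding produces, here is a small toy sketch. The two short sequences are made up for illustration; the calls mirror the Sgt usage shown later in this post. The embedding length is the square of the alphabet-set size, which is why the 20 amino acids below yield 400-dimensional protein embeddings.

import numpy as np
from sgt import Sgt

# A made-up corpus over a 3-letter alphabet-set {A, B, C}.
toy_corpus = [['A', 'B', 'A', 'C'],
              ['B', 'C', 'B', 'A', 'A']]

sgt = Sgt(kappa=10, lengthsensitive=False)
toy_embedding = sgt.fit_transform(corpus=toy_corpus)

# One vector per sequence; 3 alphabets -> 3 x 3 = 9 dimensions (assuming the
# package infers the alphabet-set from the corpus).
print(np.shape(toy_embedding))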
We first read the sequence data and convert it into a list of lists. As shown below, each sequence is a list of alphabets.

>>> protein_data = pd.DataFrame.from_csv('../data/protein_classification.csv')
>>> X = protein_data['Sequence']
>>> def split(word):
>>>     return [char for char in word]
>>> sequences = [split(x) for x in X]
>>> print(sequences[0])
['M', 'E', 'I', 'E', 'K', 'T', 'N', 'R', 'M', 'N', 'A', 'L', 'F', 'E', 'F', 'Y', 'A', 'A', 'L', 'L', 'T', 'D', 'K', 'Q', 'M', 'N', 'Y', 'I', 'E', 'L', 'Y', 'Y', 'A', 'D', 'D', 'Y', 'S', 'L', 'A', 'E', 'I', 'A', 'E', 'E', 'F', 'G', 'V', 'S', 'R', 'Q', 'A', 'V', 'Y', 'D', 'N', 'I', 'K', 'R', 'T', 'E', 'K', 'I', 'L', 'E', 'D', 'Y', 'E', 'M', 'K', 'L', 'H', 'M', 'Y', 'S', 'D', 'Y', 'I', 'V', 'R', 'S', 'Q', 'I', 'F', 'D', 'Q', 'I', 'L', 'E', 'R', 'Y', 'P', 'K', 'D', 'D', 'F', 'L', 'Q', 'E', 'Q', 'I', 'E', 'I', 'L', 'T', 'S', 'I', 'D', 'N', 'R', 'E']

Next, we generate the sequence embeddings.

>>> from sgt import Sgt
>>> sgt = Sgt(kappa = 10, lengthsensitive = False)
>>> embedding = sgt.fit_transform(corpus=sequences)

The embedding is in a 400-dimensional space. Let's first do PCA on it and reduce the dimension to two. This will also help visualize the clusters.

>>> pca = PCA(n_components=2)
>>> pca.fit(embedding)
>>> X = pca.transform(embedding)
>>> print(np.sum(pca.explained_variance_ratio_))
0.6019403543806409

The top two PCs explain about 60% of the variance. We will cluster the data into 3 clusters.

>>> df = pd.DataFrame(data=X, columns=['x1', 'x2'])  # the two PCs as a dataframe for clustering and plotting
>>> kmeans = KMeans(n_clusters=3, max_iter=300)
>>> kmeans.fit(df)
>>> labels = kmeans.predict(df)
>>> centroids = kmeans.cluster_centers_
>>> fig = plt.figure(figsize=(5, 5))
>>> colmap = {1: 'r', 2: 'g', 3: 'b'}
>>> colors = list(map(lambda x: colmap[x+1], labels))
>>> plt.scatter(df['x1'], df['x2'], color=colors, alpha=0.5, edgecolor=colors)

Moving on to building classifiers.

Classification

Protein Sequence Classification

We will start by building a classifier on the same protein dataset we used earlier. The proteins in the dataset have two functions. Therefore, we will build a binary classifier.

We will first convert the Function [CC] column in the data into labels that can be ingested by an MLP model built in keras.

>>> y = protein_data['Function [CC]']
>>> encoder = LabelEncoder()
>>> encoder.fit(y)
>>> encoded_y = encoder.transform(y)

In the following, we build the MLP classifier and run a 10-fold cross-validation.
>>> kfold = 10
>>> X = pd.DataFrame(embedding)
>>> y = encoded_y
>>> random_state = 1
>>> test_F1 = np.zeros(kfold)
>>> skf = KFold(n_splits=kfold, shuffle=True, random_state=random_state)
>>> k = 0
>>> epochs = 50
>>> batch_size = 128
>>> for train_index, test_index in skf.split(X, y):
>>>     X_train, X_test = X.iloc[train_index], X.iloc[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>     X_train = X_train.as_matrix(columns=None)
>>>     X_test = X_test.as_matrix(columns=None)
>>>
>>>     model = Sequential()
>>>     model.add(Dense(64, input_shape=(X_train.shape[1],), init='uniform'))
>>>     model.add(Activation('relu'))
>>>     model.add(Dropout(0.5))
>>>     model.add(Dense(32, init='uniform'))
>>>     model.add(Activation('relu'))
>>>     model.add(Dropout(0.5))
>>>     model.add(Dense(1, init='uniform'))
>>>     model.add(Activation('sigmoid'))
>>>     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>>
>>>     model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0)
>>>
>>>     y_pred = model.predict_proba(X_test).round().astype(int)
>>>     y_train_pred = model.predict_proba(X_train).round().astype(int)
>>>     test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
>>>     k += 1
>>> print('Average test f1-score', np.mean(test_F1))
Average test f1-score 1.0

This data turns out to be too easy for the classifier. We have another dataset that is more challenging. Let's go over it.

Network Log Data Classification

This data sample is taken from here. This is network intrusion data containing audit logs, with any attack as a positive label. Since network intrusion is a rare event, the data is unbalanced. Additionally, we have a small dataset of only 111 records. Here we will build a sequence classification model to predict a network intrusion.

Each sequence in the data is a series of activities, for example, {login, password, ...}. The alphabets in the input data sequences are already encoded into integers. The original sequence data file is present here.

Similar to before, we will first prepare the data for a classifier.

>>> darpa_data = pd.DataFrame.from_csv('../data/darpa_data.csv')
>>> X = darpa_data['seq']
>>> sequences = [x.split('~') for x in X]
>>> y = darpa_data['class']
>>> encoder = LabelEncoder()
>>> encoder.fit(y)
>>> y = encoder.transform(y)

In this data, the sequence embeddings should be length-sensitive. The lengths matter here because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions: {login, pswd, login, pswd, ...} and {login, pswd, ...(repeated several times)..., login, pswd}. While the first session can be a regular user mistyping the password once, the other session is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

>>> sgt_darpa = Sgt(kappa = 5, lengthsensitive = True)
>>> embedding = sgt_darpa.fit_transform(corpus=sequences)

The embedding we find here is sparse. Therefore, we will perform dimension reduction using PCA before we train a classifier.

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=35)
>>> pca.fit(embedding)
>>> X = pca.transform(embedding)
>>> print(np.sum(pca.explained_variance_ratio_))
0.9862350164327149

The top 35 PCs explain more than 98% of the variance.
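As a side note, the number of components was fixed at 35 above; it could also be derived from the cumulative explained variance. A minimal sketch, assuming a full PCA fit on this small embedding matrix is cheap:

import numpy as np
from sklearn.decomposition import PCA

# Fit a full PCA once, then take the smallest number of components whose
# cumulative explained variance reaches 98%.
pca_full = PCA().fit(embedding)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.98) + 1)
print('Components needed for 98% variance:', n_components)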
We will now go ahead and build a Multi-Layer Perceptron using keras. Since the data size, and especially the number of positively labeled points, is small, we will perform a 3-fold cross-validation.

>>> kfold = 3
>>> random_state = 11
>>> test_F1 = np.zeros(kfold)
>>> time_k = np.zeros(kfold)
>>> skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
>>> k = 0
>>> epochs = 300
>>> batch_size = 15
>>> class_weight = {0: 0.12, 1: 0.88}  # The weights can be made inversely proportional to the class sizes to improve the accuracy.
>>> for train_index, test_index in skf.split(X, y):
>>>     X_train, X_test = X[train_index], X[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>
>>>     model = Sequential()
>>>     model.add(Dense(128, input_shape=(X_train.shape[1],), init='uniform'))
>>>     model.add(Activation('relu'))
>>>     model.add(Dropout(0.5))
>>>     model.add(Dense(1, init='uniform'))
>>>     model.add(Activation('sigmoid'))
>>>     model.summary()
>>>     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>>
>>>     start_time = time.time()
>>>     model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, class_weight=class_weight)
>>>     end_time = time.time()
>>>     time_k[k] = end_time - start_time
>>>     y_pred = model.predict_proba(X_test).round().astype(int)
>>>     y_train_pred = model.predict_proba(X_train).round().astype(int)
>>>     test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
>>>     k += 1

>>> print('Average Test f1-score', np.mean(test_F1))
Average Test f1-score 0.5236467236467236

>>> print('Average Run time', np.mean(time_k))
Average Run time 9.076935768127441

This was a difficult dataset to classify. To have a loose benchmark, let's build a fancier LSTM classifier on the same data.

>>> X = darpa_data['seq']
>>> encoded_X = np.ndarray(shape=(len(X),), dtype=list)
>>> for i in range(0, len(X)):
>>>     encoded_X[i] = X.iloc[i].split("~")
>>> max_seq_length = np.max(darpa_data['seqlen'])
>>> encoded_X = sequence.pad_sequences(encoded_X, maxlen=max_seq_length)

>>> kfold = 3
>>> random_state = 11
>>> test_F1 = np.zeros(kfold)
>>> time_k = np.zeros(kfold)
>>> epochs = 50
>>> batch_size = 15
>>> skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
>>> k = 0
>>> for train_index, test_index in skf.split(encoded_X, y):
>>>     X_train, X_test = encoded_X[train_index], encoded_X[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>
>>>     embedding_vector_length = 32
>>>     top_words = 50
>>>     model = Sequential()
>>>     model.add(Embedding(top_words, embedding_vector_length, input_length=max_seq_length))
>>>     model.add(LSTM(32))
>>>     model.add(Dense(1, init='uniform'))
>>>     model.add(Activation('sigmoid'))
>>>     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
>>>     start_time = time.time()
>>>     model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
>>>     end_time = time.time()
>>>     time_k[k] = end_time - start_time
>>>     y_pred = model.predict_proba(X_test).round().astype(int)
>>>     y_train_pred = model.predict_proba(X_train).round().astype(int)
>>>     test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
>>>     k += 1

>>> print('Average Test f1-score', np.mean(test_F1))
Average Test f1-score 0.0

>>> print('Average Run time', np.mean(time_k))
Average Run time 425.68603706359863

We find that the LSTM classifier gives an F1 score of 0. This may be improved by changing the model.
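One quick diagnostic, not part of the original analysis, is to check whether the zero F1 simply means the LSTM predicts the majority (non-intrusion) class for every test point of the last fold:

import numpy as np

# If the classifier never predicts the positive class, F1 is 0 by definition.
values, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))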
However, we find that the SGT embedding works with small and unbalanced data without the need for a complicated classifier. Besides, in terms of runtime, the SGT-embedding-based network was significantly faster: it took on average 9.1 seconds, while the LSTM model took 425.6 seconds.

Concluding Remarks

We learned about using a sequence embedding for sequence clustering and classification. This embedding is an implementation of this paper. Not covered in this post, but refer to the paper to see the accuracy comparison of the SGT embedding with other approaches. Due to the SGT embedding's ability to capture long- and short-term patterns, it works better than most other sequence modeling approaches. It is recommended that you play with the tuning parameter, kappa, to see its effect on the accuracy.

Credits:

Samaneh Ebrahimi, PhD, who is a co-author of the SGT paper and has significantly contributed to the above code.
Yassine Khelifi, a Data Science expert who wrote the first Python version for SGT.

Here we will use the SGT embedding, which embeds the long- and short-term patterns in a sequence into a finite-dimensional vector. The advantage of the SGT embedding is that we can easily tune the amount of long-/short-term patterns without increasing the computation.

In this post, we will learn about using a nonlinear correlation estimation function in R. We will also look at a few examples.

Background

Correlation estimations are commonly used in various data mining applications. In my experience, nonlinear correlations are quite common in various processes. Due to this, nonlinear models, such as SVM, are employed for regression, classification, etc. However, there are not many approaches to estimate nonlinear correlations between two variables. Typically, linear correlations are estimated. However, the data may have a nonlinear correlation but little to no linear correlation. In such cases, nonlinearly correlated variables are sometimes overlooked during data exploration or during variable selection in high-dimensional data.

We have developed a new nonlinear correlation estimator, nlcor. This estimator is useful in data exploration and in variable selection for nonlinear predictive models, such as SVM.

Installing nlcor

To install nlcor in R, follow these steps:

1. Install the devtools package. You can do this from CRAN, directly in the R console:
> install.packages("devtools")
2. Load the devtools package.
> library(devtools)
3. Install nlcor from its GitHub repository by typing this in the R console.
> install_github("ProcessMiner/nlcor")

Nonlinear Correlation Estimator: nlcor

In this package, we provide an implementation of a nonlinear correlation estimation method using an adaptive local linear correlation computation in nlcor. The function nlcor returns the nonlinear correlation estimate, the corresponding adjusted p-value, and an optional plot visualizing the nonlinear relationships.

The correlation estimate will be between 0 and 1; the higher the value, the stronger the nonlinear correlation. Unlike linear correlations, a negative value is not valid here. Due to the multiple local correlation computations, the net p-value of the correlation estimate is adjusted (to avoid false positives). The plot visualizes the local linear correlations.

In the following, we show its usage with a few examples. In the given examples, the linear correlation between x and y is small; however, there is a visible nonlinear correlation between them. The nlcor package contains a few sample x and y vectors, used in the examples below, that can also be used for testing the package.

First, we will load the package.

> library(nlcor)

Example 1. Data with a cyclic nonlinear correlation.

> plot(x1, y1)

The linear correlation of the data is,

> cor(x1, y1)
[1] 0.008001837

As expected, the correlation is close to zero. We estimate the nonlinear correlation using nlcor.

> c <- nlcor(x1, y1, plt = T)
> c$cor.estimate
[1] 0.8688784
> c$adjusted.p.value
[1] 0
> print(c$cor.plot)

The plot shows the piecewise linear correlations present in the data.

Example 2. Data with non-uniform piecewise linear correlations.

> plot(x2, y2)

The linear correlation of the data is,

> cor(x2, y2)
[1] 0.828596

The linear correlation is quite high in this data. However, there is a significant and higher nonlinear correlation present in the data. This data emulates the scenario where the correlation changes direction after a point. Sometimes that change point is in the middle, causing the linear correlation to be close to zero. Here we show an example where the change point is off-center, to show that the implementation works in non-uniform cases.
We estimate the nonlinear correlation using nlcor.

> c <- nlcor(x2, y2, plt = T)
> c$cor.estimate
[1] 0.897205
> c$adjusted.p.value
[1] 0
> print(c$cor.plot)

It is visible from the plot that nlcor could estimate the piecewise correlations in a non-uniform scenario. Also, the nonlinear correlation comes out to be higher than the linear correlation.

Example 3. Data with higher and multiple frequency variations.

> plot(x3, y3)

The linear correlation of the data is,

> cor(x3, y3)
[1] -0.1337304

The linear correlation is expectedly small, albeit not close to zero due to some linearity. Here we show how we can refine the granularity of the correlation computation. Under the default settings, the output of nlcor is,

> c <- nlcor(x3, y3, plt = T)
> c$cor.estimate
[1] 0.7090148
> c$adjusted.p.value
[1] 0
> print(c$cor.plot)

As can be seen in the figure, nlcor overlooked some of the local relationships. We can refine the correlation estimation by changing the refine parameter. The default value of refine is 0.5. It can be set to any value between 0 and 1; a higher value enforces higher refinement. However, higher refinement adversely affects the p-value, meaning the resultant correlation estimate may be statistically insignificant (similar to overfitting). Therefore, it is recommended to avoid over-refinement. For this data, we rerun the correlation estimation with refine = 0.9.

> c <- nlcor(x3, y3, refine = 0.9, plt = T)
> c$cor.estimate
[1] 0.8534956
> c$adjusted.p.value
[1] 2.531456e-06
> print(c$cor.plot)
Warning: Removed 148 rows containing missing values (geom_path).

As can be seen in the figure, nlcor could identify the granular piecewise correlations. In this data, the p-value still remains extremely small, so the correlation is statistically significant.

Summary

This package provides an implementation of an efficient heuristic to compute nonlinear correlations between numeric vectors. The heuristic works by adaptively identifying multiple local regions of linear correlation to estimate the overall nonlinear correlation. Its usage is demonstrated here with a few examples.

Citation

Package 'nlcor': Compute Nonlinear Correlations

@article{ranjan2020packagenlcor,
  title={Package 'nlcor': Compute Nonlinear Correlations},
  author={Ranjan, Chitta and Najari, Vahab},
  journal={Research Gate},
  year={2020},
  doi={10.13140/RG.2.2.33716.68480}
}

Chitta Ranjan and Vahab Najari. "Package 'nlcor': Compute Nonlinear Correlations". In: Research Gate (2020). doi: 10.13140/RG.2.2.33716.68480.

nlcor: Nonlinear Correlation

@article{ranjan2019nlcor,
  title={nlcor: Nonlinear Correlation},
  author={Ranjan, Chitta and Najari, Vahab},
  journal={Research Gate},
  year={2019},
  doi={10.13140/RG.2.2.10123.72488}
}

Chitta Ranjan and Vahab Najari. "nlcor: Nonlinear Correlation". In: Research Gate (2019). doi: 10.13140/RG.2.2.10123.72488.

In this post, a new nonlinear correlation estimator, nlcor, is demonstrated. This estimator is useful in data exploration and in variable selection for nonlinear predictive models, such as SVM.

Podcast

In this ProcessMiner University podcast episode, Tom Tulloch chats with Dr. Chitta Ranjan, Director of Science at ProcessMiner, about his early life and education, what inspired him to write his book, "Understanding Deep Learning - Application in Rare Event Prediction," and the role deep learning plays in the evolution of artificial intelligence.

Those who know Dr. Ranjan know he has a keen ability to take complex ideas and simplify them for all to understand... something he talks about during this session.

Fun fact

While writing this book, Dr. Ranjan unexpectedly became an expert chef, which he also delves into during this talk. Listeners will walk away feeling inspired, motivated to learn something new, and maybe a little hungry.

Citation

@book{ranjan2020understanding,

doi = {10.13140/RG.2.2.34297.49765},

year = 2020,

month = {Dec},

author = {Chitta Ranjan},

title = {Understanding Deep Learning: Application in Rare Event Prediction},

publisher = {Connaissance Publishing},

note={URL: \url{www.understandingdeeplearning.com}}

}

Testimonials

"This is the one of the best deep learning book that I have read. Dr. Ranjan has done a very good job in explaining core deep learning concepts without trying to incorporate too much. This is a fantastic book for both practitioners and researchers who want to understand and use deep learning in their respective fields. I find the illustrations and rule of thumbs are very very clear and useful. The bit that has been surprising (in a good way) is that Dr. Ranjan has found a way to incorporate some of the theoretical justifications for certain deep learning techniques. The part I enjoy the most (as a statistician) is around why pooling works for Convolutional Neural Nets. As far as I am concerned, this is the very first time I saw such clear and elegant explanation in any deep learning textbook."

- by Zhen Zu, Stanford University.

"This is an excellent textbook on Deep Learning with a focus on rare events. Chitta explains 'what' the primary constructs of deep learning are with clear visual illustrations. I particularly enjoy the details of the motivation and intuitions on the 'how' and 'why' we make different choices in building deep learning models. I personally learned a tremendous amount and would recommend this book to anyone who wants to dig deeper into the subject."

- by Jason Wu, Ph.D., Partner, Ultrabright Education.

"Among the several Deep Learning courses and bootcamps I have had, Dr. Ranjan’s lectures and his book are one of the best Deep Learning resources. I have a background in engineering/statistics and I am often overwhelmed by the fast-evolving techniques in the Deep Learning field, inundated by “this model works very well” not knowing the why. Dr. Ranjan did an amazing job at pulling out the intuitions behind “why it works” and at connecting the dots of deep learning with concepts in traditional statistics. In addition to concepts, Dr. Ranjan also walks modeling considerations of real business problems as well as practical implementation tips for practitioners. I recommend the book to beginners who want to build up a solid foundational understanding of deep learning and to those more experienced who are seeking for additional insights."

- by Lingchao Mao, Stanford University.