.. _DLmath:
Deep Learning
------------------------------
Deep Learning relies on interconnected nodes and weighted
information paths, which are adapted via backpropagation to
minimize prediction error, producing non-linear models of complex
relationships.
Defining a Deep Learning Model
""""""""""""""""""""""""""""""""
**Response**
The dependent or target variable of interest.
**Ignored Columns**
This field auto-populates with a list of the columns in the data
set in use. The user-selected columns are the features that will
be omitted from the model. Additionally, users can specify that
the model omit constant columns by opening the expert settings
and checking the box labeled **Ignore const cols**.
**Classification**
A checkbox indicating whether the dependent variable is to be
treated as a factor (classification) or as a continuous variable
(regression).
**Validation**
A unique data set with the same shape and features as the
training data, used in model validation (i.e., producing
error rates on data not used in model building).
**Checkpoint**
A model key associated with a previously run deep learning
model. This option allows users to build a new model as a
continuation of a previously generated model.
**Expert mode**
When selected, **Expert mode** allows users to specify expert
settings, explained in more detail below.
**Activation**
The activation function to be used at each of the nodes in the
hidden layers.
*Tanh*: Hyperbolic tangent function.
*Rectifier*: Takes the maximum of (0, x), where x is the value input to the node.
*Maxout*: Takes the maximum coordinate of the input vector.
*With Dropout*: A percentage of the values presented to each hidden
layer is randomly omitted during training in order to improve
generalization.
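For reference, the three functions can be written in a few lines of
NumPy; this is an illustrative sketch, not H2O's implementation::

    import numpy as np

    def tanh(x):
        # hyperbolic tangent: squashes inputs into (-1, 1)
        return np.tanh(x)

    def rectifier(x):
        # rectified linear unit: max(0, x), applied element-wise
        return np.maximum(0.0, x)

    def maxout(x):
        # the maximum coordinate of the input vector
        return np.max(x)

    x = np.array([-1.5, 0.2, 3.0])
    print(tanh(x))       # approx. [-0.905  0.197  0.995]
    print(rectifier(x))  # [0.  0.2  3. ]
    print(maxout(x))     # 3.0
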
**Hidden**
The number and size of the hidden layers in the model. Multiple
models can be specified and generated simultaneously. For example,
if a user specifies (300, 300, 100), a single model with three
hidden layers of 300, 300, and 100 nodes respectively will be
produced. To specify several different models with different
dimensions, enter information in the format (300, 300, 100),
(200, 200), (200, 20).
**Epochs**
The number of training passes to be carried out. In model training,
data are fed into an input layer and passed along weighted
information paths, through each of the hidden layers, and a
prediction is returned at the output layer. Deviations between
the predicted values and the actual values are then calculated
and used to adjust the path weights to reduce the error between
the predicted and true values. One full pass of the training data
through the network is one epoch.
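A schematic of this loop, reduced to a single linear unit trained by
full-batch gradient descent on toy data (an illustration of the
mechanics, not H2O's implementation)::

    import numpy as np

    X = np.random.randn(100, 3)          # toy features
    y = X @ np.array([1.0, -2.0, 0.5])   # toy targets
    w = np.zeros(3)                      # path weights
    lr = 0.1                             # constant learning rate

    for epoch in range(20):              # each loop body is one epoch
        pred = X @ w                     # forward pass through the network
        delta = pred - y                 # deviation from the target
        grad = X.T @ delta / len(y)      # aggregate gradient over all rows
        w -= lr * grad                   # adjust path weights to reduce error
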
**Mini Batch**
Batch learning is a method in which the aggregated gradient
contributions for all observations in the training set are
obtained before weights are updated. Alternatively, users can
specify a mini-batch size to update weights more frequently. If
users specify mini-batch = 2000, the training data will be split
into chunks of 2000 observations, and the model weights will be
updated after each chunk is passed through the network.
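Continuing the toy sketch above, mini-batch learning simply moves the
weight update inside a loop over chunks (the batch size is scaled down
from the documented 2000 to fit the 100-row toy data)::

    batch_size = 20
    order = np.arange(len(y))
    for epoch in range(20):
        for chunk in np.array_split(order, len(y) // batch_size):
            pred = X[chunk] @ w                          # forward pass on one chunk
            grad = X[chunk].T @ (pred - y[chunk]) / len(chunk)
            w -= lr * grad                               # update after each chunk
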
**Seed**
Because of the random nature of the algorithm, models with the
same specification can sometimes produce slightly different
results. To control this behavior, users can specify a seed,
which will produce the same values for random components across
independent runs.
**Adaptive Rate**
In the event that a model is specified over a topology with
local minima or long plateaus, a constant learning rate can
produce sub-optimal results. When the gradient is being estimated
in a well, a large learning rate can cause the gradient to
oscillate and move in the wrong direction; when the gradient is
being taken on a relatively flat surface, a small learning rate
can make the model converge far more slowly than necessary. An
adaptive learning rate adjusts itself to avoid local minima and
slow convergence.
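H2O's adaptive rate follows the ADADELTA scheme, which scales each
update by the recent history of gradients and updates; a hedged sketch
with illustrative constants::

    rho, eps = 0.99, 1e-8                # ADADELTA's two tuning constants
    avg_sq_grad = 0.0                    # running average of squared gradients
    avg_sq_step = 0.0                    # running average of squared updates

    def adadelta_step(w, grad):
        global avg_sq_grad, avg_sq_step
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad**2
        step = -np.sqrt(avg_sq_step + eps) / np.sqrt(avg_sq_grad + eps) * grad
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step**2
        return w + step                  # step size adapts to gradient history
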
**Momentum**
The magnitude of the weight updates is determined by the
user-specified learning rate and is a function of the difference
between the predicted value and the target value. That difference,
generally called delta, is only available at the output layer; to
correct the output at each hidden layer, back propagation is
used. Momentum modifies back propagation by allowing prior
iterations to influence the current update, which can aid in
avoiding local minima and the associated instability.
*Momentum start*: The momentum applied at the beginning of
training.
*Momentum ramp*: The number of training samples over which
momentum is adjusted from its starting value to its stable value.
*Momentum stable*: The momentum applied once the ramp is
complete.
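A minimal sketch of a momentum update with the coefficient ramped
linearly from its start value to its stable value (all names and
constants here are illustrative)::

    mu_start, mu_stable, ramp = 0.5, 0.99, 1000000
    velocity = 0.0
    samples_seen = 0

    def momentum_step(w, grad, lr=0.005):
        global velocity, samples_seen
        frac = min(samples_seen / ramp, 1.0)
        mu = mu_start + frac * (mu_stable - mu_start)  # ramp start -> stable
        velocity = mu * velocity - lr * grad           # prior updates carry over
        samples_seen += 1
        return w + velocity
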
**Nesterov Accelerated**
The Nesterov Accelerated Gradient method is a modification of
traditional gradient descent for convex functions. Rather than
evaluating the gradient at the current weights alone, the method
uses gradient information at a look-ahead point (the current
weights plus the accumulated momentum), which allows the descent
to reach a minimum in fewer iterations.
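The practical difference from plain momentum is where the gradient is
evaluated; a sketch, where grad_fn is an assumed function returning
the gradient at a given point::

    def nesterov_step(w, grad_fn, lr=0.005, mu=0.9):
        global velocity
        lookahead = w + mu * velocity        # peek ahead along the momentum
        velocity = mu * velocity - lr * grad_fn(lookahead)
        return w + velocity
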
**Input dropout ratio**
The fraction of input features to be randomly omitted from
training in order to improve generalization.
**L1 regularization**
A regularization method that constrains the sum of the absolute
values of the coefficients, and has the net effect of driving
some coefficients to zero, reducing model complexity and avoiding
overfitting.
**L2 regularization**
A regularization method that constrains the sum of the squared
coefficients. This method introduces bias into parameter
estimates, but frequently produces substantial gains in
modeling as estimate variance is reduced.
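Both methods simply add a penalty term to the loss being minimized;
schematically, with data_loss standing in for the model's prediction
error computed elsewhere::

    def penalized_loss(data_loss, w, l1=1e-5, l2=1e-5):
        # l1, l2: user-specified penalty strengths (illustrative values)
        penalty = l1 * np.sum(np.abs(w)) + l2 * np.sum(w**2)
        return data_loss + penalty           # larger weights cost more
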
**Max W2**
A maximum on the sum of the squared weights of information
paths input into any one unit. This tuning parameter functions
in a manner similar to L2 Regularization on the hidden layers
of the network.
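A sketch of the constraint: if the squared incoming weights of any one
unit exceed the cap, they are rescaled (w_in is an assumed vector of a
single unit's incoming weights)::

    def apply_max_w2(w_in, max_w2=10.0):
        sq = np.sum(w_in**2)
        if sq > max_w2:
            w_in = w_in * np.sqrt(max_w2 / sq)   # rescale onto the constraint
        return w_in
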
**Initial weight distribution**
The distribution from which initial path weights are to be
drawn. When the Normal option is selected, weights are drawn
from the standard normal distribution, with a mean of 0 and a
standard deviation of 1.
**Loss function**
The loss function to be optimized by the model.
*Cross Entropy*: Used when the model outputs consist of
independent hypotheses, and the outputs can be interpreted as
the probability that each hypothesis is true. Cross entropy is
the recommended loss function when the target values are
classifications, and especially when the data are unbalanced.
*Mean Square*: Used when the model outputs are continuous real
values.
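The two losses for a single observation, schematically::

    def mean_square(y, p):
        # y, p: true and predicted continuous values
        return np.mean((y - p)**2)

    def cross_entropy(y, p):
        # y: one-hot class indicator, p: predicted class probabilities
        return -np.sum(y * np.log(p))
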
**Score Interval**
The number of seconds to elapse between model scoring.
**Score Training Samples**
The number of training set observations to be used in scoring.
**Score Validation Samples**
The number of validation set observations to be used in
scoring.
**Classification Stop**
The stopping criterion in terms of classification error. When
the error is at or below this threshold, the algorithm stops.
**Regression Stop**
The stopping criterion in terms of regression error. When the
error is at or below this threshold, the algorithm stops.
**Max Confusion Matrix**
The maximum number of classes to be shown in the returned
confusion matrix for classification models.
**Max Hit Ratio K**
The maximum number (K) of top predicted class labels considered
when computing the hit ratio (see the Hit Ratio Table below).
**Balance Classes**
When data are unbalanced, selecting this option oversamples the
minority class during training.
**Variable Importance**
Report variable importance in the model output.
**Force Load Balance**
Increases training speed on small data sets by utilizing all
available cores.
**Shuffle Training Data**
When data include classes with unbalanced distributions, or
when data are ordered, it is possible to run the algorithm
on chunks of data that do not accurately reflect the shape
of the data as a whole, which can produce poor
models. Shuffling training data ensures that all prediction
classes are present in all chunks of data.
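Many of the options above map directly onto the h2o Python package; a
hedged sketch, assuming a current h2o-py install (file paths and the
response column name are placeholders, and parameter availability may
vary by version)::

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()
    train = h2o.import_file("train.csv")        # placeholder paths
    valid = h2o.import_file("valid.csv")

    model = H2ODeepLearningEstimator(
        activation="RectifierWithDropout",      # rectifier units with dropout
        hidden=[300, 300, 100],                 # three hidden layers
        epochs=10,
        input_dropout_ratio=0.1,
        l1=1e-5, l2=1e-5,
        max_w2=10.0,
        adaptive_rate=True,
        loss="CrossEntropy",
        balance_classes=True,
        variable_importances=True,
        seed=1234,
    )
    model.train(y="response", training_frame=train, validation_frame=valid)
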
Interpreting the Model
""""""""""""""""""""""""
**Progress Table**
The Progress table displays information about each of the
hidden layers in the deep learning model.
*Units*: The number of units, or nodes, in the layer.
*Type*: The type of layer or activation function. Each model
will have one input layer and one softmax output layer; hidden
layers are identified by the activation function specified.
*Dropout*: The percentage of training data dropped from
training at that layer.
*L1, L2*: The L1 and L2 regularization penalties applied to the
layer.
**Classification Error**
The percentage of times that a class was incorrectly
predicted by the model.
**Epochs**
The final number of full epochs carried out.
**Mini Batch Size**
The number of observations in each mini-batch used to update
path weights.
**Confusion Matrix**
A table showing the number of actual observations in each class
relative to the number of observations predicted in each
class. This is omitted when the specified model is a regression.
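A sketch of how such a table is tallied, where actual and predicted
are assumed arrays of integer class labels::

    def confusion_matrix(actual, predicted, n_classes):
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for a, p in zip(actual, predicted):
            cm[a, p] += 1            # rows: actual class, columns: predicted
        return cm
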
**Hit Ratio Table**
A table displaying the percentage of instances where the
actual class label assigned to an observation is among the top
K classes predicted by the network. For instance, consider a
four-class classifier on values A, B, C, and D, and an
observation predicted to be A with probability .6, B with
probability .2, C with probability .1, and D with probability
.1. If the true class is B, the observation will be counted in
the hit rate for K=2, but not in the hit rate for K=1.
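The worked example as code, with probs holding the predicted
probabilities in class order A, B, C, D::

    probs = np.array([0.6, 0.2, 0.1, 0.1])
    true_class = 1                           # the true class is B
    ranked = np.argsort(probs)[::-1]         # classes from most to least likely
    hit_at = {k: true_class in ranked[:k] for k in range(1, 5)}
    # {1: False, 2: True, 3: True, 4: True}
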
**Variable Importance**
A table listing the variables in the model, ordered from
greatest importance to least.