Deep Learning

Deep Learning relies on interconnected nodes and weighted information paths, which are adapted to minimize prediction error via back propagation, to produce non-linear models of complex relationships.

Defining a Deep Learning Model

The dependent or target variable of interest.

Ignored Columns:

This field will auto populate a list of the columns from the data set in use. The user selected set of columns are the features that will be omitted from the model. Additionally - users can specify whether the model should omit constant columns by selecting expert settings and checking the tic box indicating Ignore const cols.


Checkbox indicating whether the dependent variable is to be treated as a factor or a continuous variable.


A unique data set with the same shape and features as the training data to be used in model validation (i.e. production of error rates on data not used in model building).


A model key associated with a previously run deep learning model. This option allows users to build a new model as a continuation of a previously generated model.

Expert mode

When selected Expert mode allows users to specify expert settings, explained in more detail below. max w2


The activation function to be used at each of the nodes in the hidden layers.

Tanh :Hyperbolic tangent function

Rectifier: Chooses the maximum of (0, x) where x is the input value for a feature.

Maxout: Choose the maximum coordinate of the input vector.

With Dropout A percentage of the data will be omitted from training as data are presented to each hidden layer in order to improve generalization.


The number of hidden layers in the model. Multiple models can be specified and generated simultaneously. For example if a user specifies (300,300,100) a model with 3 layers off 100 hidden nodes each will be produced. To specify several different models with different dimensions enter information in the format (300, 300, 100), (200, 200), (200, 20).


The number of iterations to be carried out. In model training data is fed into an input layer and passed down weighted information paths, through each of the hidden layers, and a prediction is returned at the output layer. Deviations between the predicted values and the actual values are then calculated, and used to adjust the path weights to reduce the error between the predicted and true value. One full backward pass over the weighted paths is one epoch.

Mini Batch

Batch learning is a method in which the aggregated gradient contributions for all observations in the training set are obtained before weights are updated. Alternatively, users can specify mini-batch to update weights more frequently. If users specify mini-batch = 2000, the training data will be split into chunks of 2000 observations, and the model weights will be updated after each chunk is passed through the network.


Because of the random nature of the algorithm, models with the same specification can sometimes produce slightly different results. To control this behavior, users can specify a seed, which will produce the same values for random components on independent tries.

Adaptive Rate:

In the even that a model is specified over a topology with local minima or long plateaus, it is possible for a constant learning rate to produce sub-optimal results. When the gradient is being estimated in a well, a large learning rate can cause the gradient to oscillate and move in the wrong direction. When the gradient is being taken on a relatively flat surface, the model can converge far slower than necessary for small learning rates. Adaptive learning rates self adjust to avoid local minima or slow convergence.


The magnitude of the weight updates are determined by the user specified learning rate, and are a function of the difference between the predicted value and the target value. That difference, generally called delta, is only available at the output layer. To correct the output at each hidden layer, back propagation is used. Momentum modifies back propagation by allowing prior iterations to influence the current update. Using the momentum parameter can aid in avoiding local minima and the associated instability.

Momentum start The weight assigned to the results of the first sample passed through the model.

Momentum ramp The number of data samples for which results will be weighted.

Momentum stable The minimum weight to be attributed to the last weighted output.

Nestrov Accelerated

The Nestrov Accelerated Gradient Descent method is a modification to traditional gradient descent for convex functions. The method relies on gradient information at various points to build a polynomial approximation that minimizes the residuals in fewer iterations of the descent.

Input dropout ratio

A percentage of the data to be omitted from training in order to improve generalization.

L1 regularization

A regularization method that constrains the size of individual coefficients and has the net effect of dropping some coefficients from a model to reduce complexity and avoid overfitting.

L2 regularization

A regularization method that constrains the sum of the squared coefficients. This method introduces bias into parameter estimates, but frequently produces substantial gains in modeling as estimate variance is reduced.

Max W2

A maximum on the sum of the squared weights of information paths input into any one unit. This tuning parameter functions in a manner similar to L2 Regularization on the hidden layers of the network.

Initial weight distribution

The distribution from which initial path weights are to be drawn. When the norma option is selected weights are drawn from the standard normal with a mean of 0 and a standard deviation of 1.

Loss function

The loss function to be optimized by the model.

Cross Entropy Used when the model output consists of independent hypothesis, and the outputs can be interpreted as the probability that each hypothesis is true. Cross entropy is the recommended loss function when the target values are classifications, and especially when data are unbalanced.

Mean Square Used when the model output are continuous real values.

Score Interval

The number of seconds to elapse between model scoring.

Score Training Samples

The number of training set observations to be used in scoring.

Score Validation Samples

The number of validation set observations to be used in scoring.

Classification Stop

The stopping criteria in terms of classification error. When error is at or below this threshold, the algorithm stops.

Regression Stop

The stopping criteria in terms of error. When error is at or below this threshold, the algorithm stops.

Max Confusion Matrix

The maximum number of classes to be shown in the returned confusion matrix for classification models.

Max Hit Ratio K

The maximum frequency of actual class label to be among the top-K predicted class labels).

Balance Classes

When data are unbalanced selecting this option will oversample the minority class to train on.

Variable Importance

Report variable importance in the model output.

Force Load Balance

Increase training speed on small data sets to utilize all cores.

Shuffle Training Data

When data include classes with unbalanced distributions, or when data are ordered, it is possible to run the algorithm on chunks of data that do not accurately reflect the shape of the data as a whole, which can produce poor models. Shuffling training data ensures that all prediction classes are present in all chunks of data.

Interpreting the Model

Progress Table

The Progress table displays information about each of the hidden layers in the deep learning model.

Units The number of units or nodes in the layer

Type The type of layer or activation function. Each model will have one input and one softmax layer, where the softmax layer is the output of the model. Hidden layers are identified by the activation function specified.

Dropout The percentage of training data dropped from training at that layer.

L1, L2 The L1 and L2 regularization penalty applied to the layer.

Classification Error

The percentage of times that a class was incorrectly predicted by the model.


The final number of full epochs carried out.

Mini Batch Size

The numebr of observations in each mini-batch used to update path weights.

Confusion Matrix

A table showing the number of actual observations in a particular class relative to the number of predicted observations in a class. This is omitted when the model specified is regression.

Hit Ratio Table

A table displaying the percentage of instances where the actual class label assigned to an observation is in the top K classes predicted by the network. For instance, in a four class classifier on values A, B, C, D, a particular observation is labeled class A, with a probability of .6 of being A, .2 probability of being B, a .1 probability of being C, and a .1 probability of being D. If the true class is B, the observation will be counted in the hit rate for K=2, but not in the hit rate of K=1.

Variable Importance

A table listing the importance of variables listed from greatest importance, to least importance.