Random Forest (RF)
------------------
RF is a powerful classification tool. When given a set of data, RF
generates a forest of classification trees rather than a single
classification tree. Each of these trees generates a classification
for a given set of attributes. The classification from each H\ :sub:`2`\ O tree
can be thought of as a vote; the class with the most votes becomes the
predicted classification.
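
The voting step can be illustrated with a minimal sketch. The function name
``majority_vote`` and the list of votes are hypothetical and only show the
idea of counting one vote per tree; they are not H\ :sub:`2`\ O's implementation.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class that receives the most votes from the trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list: [(label, vote_count)]
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from a five-tree forest for a single observation
votes = ["A", "B", "A", "A", "B"]
print(majority_vote(votes))
```

In this sketch a tie is resolved in favor of the class that was seen first;
a real implementation may break ties differently.
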
When to use RF
""""""""""""""
RF is a good choice when your objective is classification.
For example:
"Given a large set of observations and attributes, the goal is to
classify observations by spending habits."
Defining a Model
""""""""""""""""""
**Response Variable:**
The variable on which you would like to classify
**N Tree:**
The number of trees the user would like to generate for
classification
**Features:**
A user-defined tuning parameter for controlling model complexity
(by number of nodes): the number of features on which the trees are
to split. In practice, this value is bounded between 1 and the total
number of features in the data. In different fields, features may
also be called attributes or traits.
**Depth:**
A user defined tuning parameter for controlling model complexity
(by number of edges); depth is the longest path from root to the
furthest leaf.
**Stat type:**
A choice of criteria that determines the optimum split at each
node.
*Entropy:*
The entropy criterion is also known as information gain. Entropy is
a measure of uncertainty in a classification scheme. For example, if
a two-class population is 90% class A and 10% class B, then there is a
.90 probability that a randomly selected member of the population
is A. This scheme has lower entropy than a population where 50% is
class A and 50% is class B. The objective of using the entropy
impurity measure is to minimize this uncertainty.
*Gini:*
An impurity measure based on the disparities in attribute
correlation between the most and least dominant classes in a
node. The objective of using this impurity measure is to choose
the feature split that best isolates the dominant class.
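
The two impurity measures can be compared on the 90/10 and 50/50 populations
from the entropy example. This is a sketch that assumes the common textbook
definitions (Shannon entropy in bits, and Gini impurity as one minus the sum
of squared class proportions); the exact formulas used internally by
H\ :sub:`2`\ O are not stated here and may differ in detail.

```python
import math


def entropy(proportions):
    """Shannon entropy in bits for a list of class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


# The two populations from the entropy example above
for mix in ([0.9, 0.1], [0.5, 0.5]):
    print(mix, round(entropy(mix), 3), round(gini(mix), 3))
```

Under both measures the 90/10 population scores lower (it is purer) than the
50/50 population, which is why a split that produces purer nodes is preferred.
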
**Ignore:**
Is the set of columns other than the response variable that
should be omitted from building the tree.
**Sampling Strategy:**
This allows the user to define whether or not the model needs to
correct for unbalanced data by changing the mechanism through
which training samples are generated.
This serves a similar purpose to class weights: with unbalanced
data, it ensures that every class is represented at least once in
the split between training and testing samples (used to calculate
the out-of-bag error). As a result, every class is
included in the model, rather than being omitted by chance.
*Random Sampling:*
Samples subsets on which trees are built such that every
observation has an equal chance of being drawn.
*Stratified Sampling:*
Partitions the data set by classification before
sampling, and then samples from each subset. This ensures
that each class is represented in every split, even if that
class would be unlikely to appear in a random sample. It
guarantees that when data are unbalanced, no class is omitted from
the model by chance.
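
The difference between the two strategies can be sketched as follows. The
function names, the ``label_of`` helper, and the rule of keeping at least one
row per class are assumptions made for illustration, not a description of
H\ :sub:`2`\ O's internal sampler.

```python
import random
from collections import defaultdict


def random_sample(rows, fraction, rng):
    """Draw rows so that every observation is equally likely to be picked."""
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def stratified_sample(rows, fraction, rng, label_of=lambda row: row[-1]):
    """Partition rows by class, then sample the same fraction from each class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        # Keep at least one row per class so a rare class is never dropped
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical unbalanced data: 95 rows of class "B" and 5 rows of class "A"
rows = [(i, "B") for i in range(95)] + [(i, "A") for i in range(95, 100)]
rng = random.Random(42)
strat = stratified_sample(rows, 0.1, rng)
print(sorted({label for _, label in strat}))
```

With random sampling, a 10% sample of these rows can miss class A entirely;
the stratified version always contains both classes.
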
**Sample:**
User defined percentage of the observations in the data set to
sample for the building of each tree.
**Out of Bag Error Estimate:**
For every tree, RF internally constructs a test/train split. The Kth
tree is built by drawing a bootstrap sample from the data set
and using the result to build a tree. Observations not used to
build the tree are then run down the tree to see which
classification they are assigned. The OOB error rate is
computed by calculating the error rate for each class and then
averaging over all classes.
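
A minimal sketch of the two pieces described above: the in-bag/out-of-bag
split for one tree, and the per-class error averaged over classes. The
function names and the small label lists are hypothetical, and the sketch
assumes the bootstrap draws as many rows as the data set contains, which may
not match the value chosen for Sample.

```python
import random


def bootstrap_split(n_rows, rng):
    """Return (in_bag, out_of_bag) row indices for one tree."""
    # Sample row indices with replacement (the bootstrap)
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    # Rows never drawn are "out of bag" and act as test data for this tree
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


def mean_per_class_error(actual, predicted):
    """Error rate for each class in `actual`, averaged over those classes."""
    errors = {}
    for label in set(actual):
        idx = [i for i, a in enumerate(actual) if a == label]
        wrong = sum(1 for i in idx if predicted[i] != label)
        errors[label] = wrong / len(idx)
    return sum(errors.values()) / len(errors)


in_bag, oob = bootstrap_split(10, random.Random(1))

# Hypothetical OOB labels and the classes the tree assigned to them
actual = ["A", "A", "A", "A", "B", "B"]
predicted = ["A", "A", "A", "B", "B", "A"]
print(mean_per_class_error(actual, predicted))
```

Here class A has one error in four rows and class B has one error in two
rows, so the averaged rate weights both classes equally even though A is
more frequent.
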
**Bin Limit:**
A user-defined tuning parameter for controlling model complexity.
Bin limit caps the maximum number of groups into which the
original data are to be categorized.
**Seed:**
A large number that allows the analyst to recreate an analysis by
specifying a starting point for black box processes that would
otherwise occur at a randomly chosen place within the data.
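
The effect of a seed can be sketched with Python's standard random number
generator. The function ``draw_sample`` and the seed value are hypothetical;
the sketch only shows that the same seed reproduces the same random draws.

```python
import random


def draw_sample(n_rows, k, seed):
    """Draw k row indices using a generator started from the given seed."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)


first = draw_sample(1000, 5, seed=123456789)
second = draw_sample(1000, 5, seed=123456789)
print(first == second)
```

Two runs that share a seed select the same rows, so an analysis built on
those draws can be recreated.
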
**Class Weight:**
When observed classifications in training data are uneven, users
may wish to correct this by weighting. Weights should be assigned
so that, if an observation is chosen at random, each
classification has an equal chance of being drawn. For example, if
there are two classifications A and B in a data set, such that As occur
about 10% of the time and Bs occur the rest of the time, A should be
given a weight of 5, and B a weight of .56.
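
The weights in this example can be reproduced with a short calculation. The
rule used here, one divided by the number of classes times the class
frequency, is an assumption chosen because it yields the weights quoted
above; the function name is hypothetical.

```python
def balancing_weights(class_frequencies):
    """Weight each class by 1 / (number of classes * class frequency)."""
    n_classes = len(class_frequencies)
    return {
        label: 1.0 / (n_classes * freq)
        for label, freq in class_frequencies.items()
    }


# Class A occurs about 10% of the time, class B the remaining 90%
frequencies = {"A": 0.10, "B": 0.90}
weights = balancing_weights(frequencies)
for label, weight in weights.items():
    # weight * frequency is the weighted share of each class
    print(label, round(weight, 2), round(weight * frequencies[label], 2))
```

After weighting, each class has a weighted share of one half, so a randomly
chosen observation is equally likely to come from either class.
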
Interpreting Results
""""""""""""""""""""
RF results consist of a model key and a confusion matrix. The
model key specifies the full forest of trees to be used for
predicting classifications.
An example of a confusion matrix is given below:
The highlighted fields along the diagonal indicate the
number of true members of each class who were correctly predicted.
In this case, of 111 total members of class F, 44 were correctly
identified as class F, while a total of 80 observations were
incorrectly classified as M or I, yielding an error rate of 0.654.
In the column for class F, 11 members of I and 56 members of M were
incorrectly classified as F, and a total of 111 observations in the
set were identified as F.
The overall error rate is shown in the bottom right field. It reflects
the total number of incorrect predictions divided by the total number
of rows.
.. image:: RFtable.png
   :width: 90%
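
The per-class and overall error rates can be computed from any confusion
matrix as sketched below. The matrix values are hypothetical and are not the
numbers from the table above; rows are assumed to hold actual classes and
columns predicted classes.

```python
def error_rates(matrix, labels):
    """Return (per-class error rates, overall error rate).

    matrix[i][j] counts rows whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    per_class = {}
    total = wrong = 0
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        row_wrong = row_total - matrix[i][i]  # everything off the diagonal
        per_class[label] = row_wrong / row_total
        total += row_total
        wrong += row_wrong
    return per_class, wrong / total


labels = ["F", "I", "M"]
# Hypothetical counts, 50 actual rows per class
matrix = [
    [30, 5, 15],
    [4, 40, 6],
    [10, 10, 30],
]
per_class, overall = error_rates(matrix, labels)
print(per_class, round(overall, 3))
```

The overall rate is the total number of off-diagonal counts divided by the
total number of rows, matching the description of the bottom right field.
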
RF Error Rates
""""""""""""""
H\ :sub:`2`\ O's Random Forest algorithm produces a dynamic confusion matrix. As each
tree is built and the out-of-bag error estimate (OOBE) is recalculated,
the error rate is expected to increase before it decreases.
This is a natural outcome of Random Forest's learning process. When
there are only a few trees, built on random subsets, the error rate is
expected to be relatively high. As more trees are added, and thus
more trees are "voting" for the correct classification of the OOB
data, the error rate should decrease.
Random Forest Data Science
--------------------------
.. raw:: html