public final class DHistogram
extends water.Iced
A DHistogram bins every value added to it and computes the
vec min and max (for use in the next split), along with the response mean
and variance for each bin. DHistograms are initialized with a min, a max, and
the number of elements to be added (all of which are generally available from
a Vec). Bins run from min to max in uniform sizes. If the DHistogram can
determine that fewer bins are needed (e.g. boolean columns run from 0 to 1,
but only ever take on 2 values, so only 2 bins are needed), then fewer bins
are used.
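The uniform binning described above can be sketched as follows. This is a minimal illustration, not the H2O implementation: the class `UniformBinSketch`, its `boolean isInt` flag, and the clamping behavior are assumptions made for the example. Only the ideas of a `min`, an exclusive `maxEx`, a uniform `step`, and the reduced bin count for small integer ranges come from the description.

```java
// Hypothetical sketch of uniform binning; names and details are assumptions.
final class UniformBinSketch {
    final float min;       // inclusive lower bound of the binned range
    final float maxEx;     // exclusive upper bound of the binned range
    final float step;      // width of every bin
    final double[] counts; // one slot per bin, kept as a primitive array

    UniformBinSketch(int nbins, boolean isInt, float min, float maxEx) {
        int n = nbins;
        if (isInt) {
            // An integer column needs at most one bin per distinct integer value.
            int distinct = (int) (maxEx - min);
            if (distinct >= 1 && distinct < nbins) n = distinct;
        }
        this.min = min;
        this.maxEx = maxEx;
        this.step = (maxEx - min) / n;
        this.counts = new double[n];
    }

    // Map a value to its bin index, clamping values outside [min, maxEx).
    int bin(float v) {
        int idx = (int) ((v - min) / step);
        return Math.max(0, Math.min(counts.length - 1, idx));
    }

    void add(float v) {
        counts[bin(v)] += 1;
    }

    public static void main(String[] args) {
        // A boolean column asks for 20 bins but ends up with only 2.
        UniformBinSketch bool = new UniformBinSketch(20, true, 0f, 2f);
        bool.add(0f);
        bool.add(1f);
        bool.add(1f);
        System.out.println(java.util.Arrays.toString(bool.counts));
    }
}
```

In this sketch a boolean column is passed with `maxEx` of 2 so that the values 0 and 1 each get a bin of width 1.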
DHistograms are shared per node and atomically updated. There's
an add call to help cross-node reductions. The data is stored in
primitive arrays, so it can be sent over the wire.
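As a rough illustration of the two ideas above (atomic per-node updates and an element-wise merge for cross-node reduction), here is a minimal sketch. It is an assumption-laden stand-in, not the DHistogram code: the class `SharedHistSketch` is hypothetical, and it keeps `long` counts in a `java.util.concurrent.atomic.AtomicLongArray`, whereas the fields listed on this page show the real class storing `double[]` bins.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: a histogram that many threads on one node can update,
// and that can be merged with a histogram built on another node.
final class SharedHistSketch {
    private final AtomicLongArray counts;

    SharedHistSketch(int nbins) {
        counts = new AtomicLongArray(nbins);
    }

    // Atomic per-bin update, safe to call from several threads at once.
    void increment(int bin) {
        counts.incrementAndGet(bin);
    }

    // Element-wise merge of another histogram with the same bin layout.
    void add(SharedHistSketch other) {
        for (int i = 0; i < counts.length(); i++) {
            counts.addAndGet(i, other.counts.get(i));
        }
    }

    // Snapshot as a primitive array, the form that is cheap to serialize.
    long[] toArray() {
        long[] out = new long[counts.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = counts.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        SharedHistSketch nodeA = new SharedHistSketch(3);
        SharedHistSketch nodeB = new SharedHistSketch(3);
        nodeA.increment(0);
        nodeB.increment(2);
        nodeA.add(nodeB); // reduce node B's counts into node A's
        System.out.println(java.util.Arrays.toString(nodeA.toArray()));
    }
}
```

The merge is only meaningful when both histograms share the same min, max, and bin count, which is why the sketch assumes identical layouts.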
If we are successively splitting rows (e.g. in a decision tree), then a
fresh DHistogram for each split will dynamically re-bin the data.
Each successive split will logarithmically divide the data. At the first
split, outliers will end up in their own bins, but some central bins may be
very full. At the next split(s), the full bins will get split, and so on
until (after a logarithmic number of splits) each bin holds roughly the
same amount of data. This dynamic binning resolves a lot of problems with
picking the proper bin count or limits: generally, a few more tree levels
will match any fancy but fixed-size binning strategy.
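The effect of re-binning after a split can be seen in a small sketch. This is only an illustration under stated assumptions: `RebinSketch`, `histogram`, and `rowsIn` are hypothetical helpers, the data and bin bounds are made up, and the row filter stands in for whatever a tree split would actually select.

```java
import java.util.Arrays;

// Hypothetical sketch of dynamic re-binning across two tree levels.
final class RebinSketch {
    // Count how many values fall into each of nbins uniform bins over [min, maxEx).
    static int[] histogram(double[] vals, double min, double maxEx, int nbins) {
        int[] counts = new int[nbins];
        double step = (maxEx - min) / nbins;
        for (double v : vals) {
            int idx = (int) ((v - min) / step);
            counts[Math.min(nbins - 1, Math.max(0, idx))]++;
        }
        return counts;
    }

    // Keep only the rows in [lo, hi), standing in for one side of a split.
    static double[] rowsIn(double[] vals, double lo, double hi) {
        return Arrays.stream(vals).filter(v -> v >= lo && v < hi).toArray();
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 1000};

        // Level 1: the outlier gets its own bin, the first bin is overfull.
        int[] first = histogram(data, 0, 1024, 4);
        System.out.println(Arrays.toString(first));  // prints [7, 0, 0, 1]

        // Level 2: a fresh histogram over only the rows in the full bin.
        double[] left = rowsIn(data, 0, 256);
        int[] second = histogram(left, 0, 8, 4);
        System.out.println(Arrays.toString(second)); // prints [1, 2, 2, 2]
    }
}
```

With the same four bins, the second level spreads the seven clustered values almost evenly, which is the behavior the paragraph above describes.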
| Modifier and Type | Field and Description |
|---|---|
| double[] | _bins |
| byte | _isInt |
| float | _maxEx |
| protected float | _maxIn |
| float | _min |
| protected float | _min2 |
| java.lang.String | _name |
| char | _nbin |
| float | _step |
| Constructor and Description |
|---|
| DHistogram(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx) |
| Modifier and Type | Method and Description |
|---|---|
| double | bins(int b) |
| static DHistogram[] | initialHist(water.fvec.Frame fr, int ncols, int nbins, int nbins_cats, DHistogram[] hs) |
| static DHistogram | make(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx) |
| int | nbins() |
| DTree.Split | scoreMSE(int col, double min_rows) |
| java.lang.String | toString() |
public final transient java.lang.String _name
public final byte _isInt
public final char _nbin
public final float _step
public final float _min
public final float _maxEx
public double[] _bins
protected float _min2
protected float _maxIn
public DHistogram(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx)
public int nbins()
public double bins(int b)
public static DHistogram[] initialHist(water.fvec.Frame fr, int ncols, int nbins, int nbins_cats, DHistogram[] hs)
public static DHistogram make(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx)
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public DTree.Split scoreMSE(int col, double min_rows)