public abstract class DHistogram<TDH extends DHistogram> extends water.Iced
A DHistogram bins every value added to it, and computes the vec min and
max (for use in the next split), and the response mean and variance for
each bin. DHistograms are initialized with a min, a max, and the number of
elements to be added (all of which are generally available from a Vec).
Bins run from min to max in uniform sizes. If the DHistogram can determine
that fewer bins are needed (e.g. boolean columns run from 0 to 1, but only
ever take on 2 values, so only 2 bins are needed), then fewer bins are used.
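The uniform binning described above can be sketched as follows. This is a minimal illustration, not the H2O implementation: the class UniformBinSketch and its methods are hypothetical names, and it assumes that the exclusive max of an integer column is its largest value plus one, so a boolean column spans [0, 2).

```java
// Hypothetical sketch of uniform binning; not the actual DHistogram code.
final class UniformBinSketch {
    final float min;    // inclusive lower bound of the binned range
    final float maxEx;  // exclusive upper bound of the binned range
    final int nbins;    // number of equal-width bins
    final float step;   // bins per unit of value: nbins / (maxEx - min)

    UniformBinSketch(float min, float maxEx, int nbins) {
        if (!(maxEx > min) || nbins < 1)
            throw new IllegalArgumentException("need maxEx > min and nbins >= 1");
        this.min = min;
        this.maxEx = maxEx;
        this.nbins = nbins;
        this.step = nbins / (maxEx - min);
    }

    // Map a value to its bin index, clamping out-of-range values to the edge bins.
    int bin(float value) {
        int idx = (int) ((value - min) * step);
        return Math.max(0, Math.min(nbins - 1, idx));
    }

    // Assumption: an integer column over [min, maxEx) has at most (maxEx - min)
    // distinct values, so it never needs more bins than that.
    static int effectiveBins(float min, float maxEx, int nbins, boolean isInt) {
        if (!isInt) return nbins;
        long distinct = (long) (maxEx - min);
        return (int) Math.max(1, Math.min(nbins, distinct));
    }

    public static void main(String[] args) {
        UniformBinSketch h = new UniformBinSketch(0f, 10f, 5);
        System.out.println(h.bin(0f));    // 0
        System.out.println(h.bin(3.9f));  // 1
        System.out.println(h.bin(10f));   // 4 (clamped into the last bin)
        // A boolean column asked for 20 bins only needs 2.
        System.out.println(effectiveBins(0f, 2f, 20, true)); // 2
    }
}
```

Storing the reciprocal bin width (step) turns the per-row lookup into one subtraction and one multiplication, which matters when every row of every column is binned.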
DHistograms are shared per-node, and atomically updated. There's an add
call to help cross-node reductions. The data is stored in primitive arrays,
so it can be sent over the wire.
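The sharing pattern described above might look roughly like the sketch below. It is an assumption-laden stand-in, not the DHistogram code: SharedCountsSketch, increment, add, snapshot and fillConcurrently are hypothetical names, only per-bin counts are tracked, and java.util.concurrent.atomic.AtomicLongArray is used for the atomic updates in place of whatever mechanism the real class uses on its primitive arrays.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: one histogram per node, updated atomically by many
// threads, then merged across nodes by element-wise addition.
final class SharedCountsSketch {
    private final AtomicLongArray counts;

    SharedCountsSketch(int nbins) { counts = new AtomicLongArray(nbins); }

    // Called concurrently by worker threads on the same node.
    void increment(int bin) { counts.incrementAndGet(bin); }

    // Cross-node reduction step: fold another node's counts into this one.
    void add(long[] other) {
        if (other.length != counts.length())
            throw new IllegalArgumentException("bin counts differ");
        for (int i = 0; i < other.length; i++) counts.addAndGet(i, other[i]);
    }

    // Copy into a plain primitive array, the form that is cheap to serialize.
    long[] snapshot() {
        long[] out = new long[counts.length()];
        for (int i = 0; i < out.length; i++) out[i] = counts.get(i);
        return out;
    }

    // Demo helper: 'threads' workers each add 'perThread' rows to bin 0.
    static void fillConcurrently(SharedCountsSketch h, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) h.increment(0);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        SharedCountsSketch local = new SharedCountsSketch(3);
        fillConcurrently(local, 4, 1000);   // no lost updates: bin 0 == 4000
        local.add(new long[] {1, 2, 3});    // counts received from another node
        System.out.println(Arrays.toString(local.snapshot())); // [4001, 2, 3]
    }
}
```

Because addition is associative and commutative, the merge order across nodes does not affect the final counts.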
If we are successively splitting rows (e.g. in a decision tree), then a
fresh DHistogram for each split will dynamically re-bin the data. Each
successive split will logarithmically divide the data. At the first split,
outliers will end up in their own bins, but some central bins may be very
full. At the next split(s), the full bins will get split, and so on until
(after a logarithmic number of splits) each bin holds roughly the same
amount of data. This dynamic binning resolves a lot of problems with picking
the proper bin count or limits: generally, a few more tree levels will match
any fancy but fixed-size binning strategy.
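The effect of re-binning after a split can be seen on a small, made-up data set with one outlier. The sketch below is illustrative only: RebinSketch, histogram and rowsBelow are hypothetical helpers, the bins span the inclusive min and max of whatever rows are passed in, and the split threshold is picked by hand rather than scored.

```java
import java.util.Arrays;

// Hypothetical sketch: a fresh histogram per split re-bins over the
// surviving rows' own range, so crowded bins spread out.
final class RebinSketch {
    // Count rows into nbins equal-width bins spanning the rows' own min..max.
    static int[] histogram(float[] rows, int nbins) {
        int[] counts = new int[nbins];
        if (rows.length == 0) return counts;
        float min = rows[0], max = rows[0];
        for (float v : rows) { min = Math.min(min, v); max = Math.max(max, v); }
        for (float v : rows) {
            int idx = (max > min) ? (int) ((v - min) * nbins / (max - min)) : 0;
            counts[Math.min(nbins - 1, idx)]++; // the max value lands in the last bin
        }
        return counts;
    }

    // Rows that fall on the low side of a split threshold.
    static float[] rowsBelow(float[] rows, float threshold) {
        int n = 0;
        for (float v : rows) if (v < threshold) n++;
        float[] out = new float[n];
        int i = 0;
        for (float v : rows) if (v < threshold) out[i++] = v;
        return out;
    }

    public static void main(String[] args) {
        float[] rows = {1, 2, 3, 4, 5, 6, 7, 8, 1000};
        // First level: the outlier gets its own bin, everything else piles up.
        System.out.println(Arrays.toString(histogram(rows, 4)));  // [8, 0, 0, 1]
        // After splitting off the outlier, a fresh histogram evens out the bins.
        float[] left = rowsBelow(rows, 500f);
        System.out.println(Arrays.toString(histogram(left, 4)));  // [2, 2, 2, 2]
    }
}
```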
Modifier and Type | Field and Description |
---|---|
double[] | _bins |
byte | _isInt |
float | _maxEx |
protected float | _maxIn |
float | _min |
protected float | _min2 |
java.lang.String | _name |
char | _nbin |
float | _step |
Constructor and Description |
---|
DHistogram(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx) |
Modifier and Type | Method and Description |
---|---|
double | bins(int b) |
long | byteSize() |
abstract long | byteSize0() |
float | find_maxEx() |
static float | find_maxEx(float maxIn, int isInt) |
float | find_maxIn() |
float | find_min() |
static DHistogram[] | initialHist(water.fvec.Frame fr, int ncols, int nbins, int nbins_cats, DHistogram[] hs) |
boolean | isConstantResponse() |
static DHistogram | make(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx) |
abstract double | mean(int b) |
int | nbins() |
abstract DTree.Split | scoreMSE(int col, double min_rows) |
void | setMax(float max) |
void | setMin(float min) |
java.lang.String | toString() |
abstract double | var(int b) |
public final transient java.lang.String _name
public final byte _isInt
public final char _nbin
public final float _step
public final float _min
public final float _maxEx
public double[] _bins
protected float _min2
protected float _maxIn
public DHistogram(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx)
public void setMin(float min)
public void setMax(float max)
public int nbins()
public double bins(int b)
public abstract double mean(int b)
public abstract double var(int b)
public float find_min()
public float find_maxIn()
public float find_maxEx()
public static float find_maxEx(float maxIn, int isInt)
public abstract DTree.Split scoreMSE(int col, double min_rows)
public static DHistogram[] initialHist(water.fvec.Frame fr, int ncols, int nbins, int nbins_cats, DHistogram[] hs)
public static DHistogram make(java.lang.String name, int nbins, int nbins_cats, byte isInt, float min, float maxEx)
public boolean isConstantResponse()
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public abstract long byteSize0()
public long byteSize()