public class MRUtils
extends java.lang.Object
Modifier and Type | Class and Description |
---|---|
static class |
MRUtils.ClassDist
Compute the class distribution from a class label vector
(not counting missing values)
Usage 1: Label vector is categorical
------------------------------------
Vec label = ...;
assert(label.isEnum());
long[] dist = new ClassDist(label).doAll(label).dist();
Usage 2: Label vector is numerical
----------------------------------
Vec label = ...;
int num_classes = ...;
assert(label.isInt());
long[] dist = new ClassDist(num_classes).doAll(label).dist();
|
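The counting that ClassDist describes can be illustrated without any H2O types. The following is a minimal plain-Java sketch, not the library's implementation: the class name `ClassDistSketch`, the `MISSING` marker, and the use of an `int[]` in place of a `Vec` are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Plain-Java illustration of a class-distribution count; not H2O code. */
final class ClassDistSketch {
    /** Hypothetical marker for a missing label (an assumption of this sketch). */
    static final int MISSING = -1;

    /** Counts how often each class index in [0, numClasses) occurs, skipping MISSING. */
    static long[] dist(int[] labels, int numClasses) {
        long[] counts = new long[numClasses];
        for (int label : labels) {
            if (label == MISSING) {
                continue; // missing values are not counted
            }
            counts[label]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] labels = {0, 1, 1, MISSING, 2, 1};
        System.out.println(Arrays.toString(dist(labels, 3))); // prints [1, 3, 1]
    }
}
```

As in Usage 2 above, the number of classes has to be supplied by the caller when the labels are plain integers.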
Constructor and Description |
---|
MRUtils() |
Modifier and Type | Method and Description |
---|---|
static Frame |
sampleFrame(Frame fr,
long rows,
long seed)
Sample rows from a frame.
|
static Frame |
sampleFrameStratified(Frame fr,
Vec label,
float[] sampling_ratios,
long seed,
boolean debug)
Stratified sampling
|
static Frame |
sampleFrameStratified(Frame fr,
Vec label,
float[] sampling_ratios,
long maxrows,
long seed,
boolean allowOversampling,
boolean verbose)
Stratified sampling for classifiers
|
static Frame |
shuffleAndBalance(Frame fr,
int splits,
long seed,
boolean local,
boolean shuffle)
Global redistribution of a Frame (balancing of chunks), carried out by the calling process (all-to-one + one-to-all)
|
static Frame |
shuffleFramePerChunk(Frame fr,
long seed)
Row-wise shuffle of a frame (only shuffles rows inside of each chunk)
|
static Frame |
shuffleFramePerChunk(Key outputFrameKey,
Frame fr,
long seed) |
public static Frame sampleFrame(Frame fr, long rows, long seed)
fr - Input frame
rows - Approximate number of rows to sample (across all chunks)
seed - Seed for RNG
public static Frame shuffleFramePerChunk(Frame fr, long seed)
fr - Input frame
public static Frame shuffleFramePerChunk(Key outputFrameKey, Frame fr, long seed)
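To make "only shuffles rows inside of each chunk" concrete, here is a minimal plain-Java sketch that does not use H2O. The class `PerChunkShuffleSketch`, the fixed `chunkSize`, and the single shared `Random` are assumptions for illustration; how H2O actually sizes chunks and seeds each one is not stated in this page.

```java
import java.util.Arrays;
import java.util.Random;

/** Plain-Java illustration of a per-chunk row shuffle; not H2O code. */
final class PerChunkShuffleSketch {
    /**
     * Returns a copy of rows in which each consecutive block of chunkSize
     * entries is shuffled independently. Rows never move between blocks.
     */
    static int[] shufflePerChunk(int[] rows, int chunkSize, long seed) {
        int[] out = rows.clone();
        Random rng = new Random(seed); // same seed gives the same permutation
        for (int start = 0; start < out.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, out.length);
            // Fisher-Yates shuffle restricted to the half-open range [start, end)
            for (int i = end - 1; i > start; i--) {
                int j = start + rng.nextInt(i - start + 1);
                int tmp = out[i];
                out[i] = out[j];
                out[j] = tmp;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] rows = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(Arrays.toString(shufflePerChunk(rows, 4, 42L)));
    }
}
```

Because rows stay within their block, this kind of shuffle needs no data movement between chunks, but it is not a uniform shuffle of the whole frame.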
public static Frame shuffleAndBalance(Frame fr, int splits, long seed, boolean local, boolean shuffle)
fr - Input frame
seed - RNG seed
shuffle - whether to shuffle the data globally
public static Frame sampleFrameStratified(Frame fr, Vec label, float[] sampling_ratios, long maxrows, long seed, boolean allowOversampling, boolean verbose)
fr - Input frame
label - Label vector (must be enum)
sampling_ratios - Optional: array containing the requested sampling ratios per class (in order of domains), will be overwritten if it contains all 0s
maxrows - Maximum number of rows in the returned frame
seed - RNG seed for sampling
allowOversampling - Allow oversampling of minority classes
verbose - Whether to print verbose info
public static Frame sampleFrameStratified(Frame fr, Vec label, float[] sampling_ratios, long seed, boolean debug)
fr - Input frame
label - Label vector (from the input frame)
sampling_ratios - Given sampling ratios for each class, in order of domains
seed - RNG seed
debug - Whether to print debug info
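One way to read "sampling ratios for each class" is sketched below in plain Java, without H2O. This is an assumption-laden illustration, not the library's algorithm: the class `StratifiedSampleSketch` is hypothetical, labels are plain class indices, and a ratio above 1 is interpreted here as oversampling (the whole part gives guaranteed copies of a row, the fractional part gives one extra copy with that probability).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain-Java illustration of ratio-based stratified sampling; not H2O code. */
final class StratifiedSampleSketch {
    /**
     * Returns the indices of the sampled rows. ratios[k] is the sampling
     * ratio for class k, in order of the class indices.
     */
    static List<Integer> sample(int[] labels, float[] ratios, long seed) {
        Random rng = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (int row = 0; row < labels.length; row++) {
            float ratio = ratios[labels[row]];
            int copies = (int) ratio; // whole part: guaranteed copies
            if (rng.nextFloat() < ratio - copies) {
                copies++; // fractional part: one extra copy with that probability
            }
            for (int c = 0; c < copies; c++) {
                picked.add(row);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        int[] labels = {0, 1, 2, 1};
        float[] ratios = {0f, 1f, 2f}; // drop class 0, keep class 1, double class 2
        System.out.println(sample(labels, ratios, 1234L)); // prints [1, 2, 2, 3]
    }
}
```

With whole-number ratios the result is deterministic regardless of the seed; fractional ratios make the per-class row counts approximate, which is why a cap such as maxrows can be useful in practice.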