public class Frame extends Lockable<Frame>
A collection of named Vecs, essentially an R-like Distributed Data Frame.
Frames represent a large distributed 2-D table with named columns
(Vecs) and numbered rows. A reasonable column limit is 100K columns,
but there's no hard-coded limit. There's no real row limit except
memory; Frames (and Vecs) with many billions of rows are used routinely.
A Frame is a collection of named Vecs; a Vec is a collection of numbered
Chunks. A Frame is small and cheaply and easily manipulated, so it is
commonly passed by value. It exists on one node and may be stored in the
DKV. Vecs, on the other hand, must be stored in the DKV, as they represent
the shared common management state for a collection of distributed Chunks.
Multiple Frames can reference the same Vecs, although this sharing can
make Vec lifetime management complex. Temporary Frames are commonly used
to work with a subset of some other Frame (often during algorithm
execution, when some columns are dropped from the modeling process). The
temporary Frame can simply be ignored, allowing the normal GC process to
reclaim it. Such temp Frames usually have a null key.
All the Vecs in a Frame belong to the same Vec.VectorGroup, which then
enforces Chunk row alignment across Vecs (or at least enforces a low-cost
access model). Parallel and distributed execution touching all the data
in a Frame relies on this alignment to get good performance.
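To make the alignment idea concrete, here is a minimal sketch in plain Java. It is not H2O code: the class `ToyLayout` and its `chunkStarts` array are hypothetical stand-ins for a chunk layout shared by every column in a group. Assuming strictly increasing chunk start rows, a single lookup yields the chunk index and in-chunk offset that would apply to all aligned columns.

```java
import java.util.Arrays;

// Hypothetical stand-in for a chunk layout shared by aligned columns; not part of H2O.
final class ToyLayout {
    // chunkStarts[i] is the first global row of chunk i; the final entry is the total row count.
    // Assumption: entries are strictly increasing (no empty chunks).
    private final long[] chunkStarts;

    ToyLayout(long... chunkStarts) {
        this.chunkStarts = chunkStarts.clone();
    }

    long numRows() {
        return chunkStarts[chunkStarts.length - 1];
    }

    // Index of the chunk that holds the given global row.
    int chunkIndex(long row) {
        if (row < 0 || row >= numRows())
            throw new IndexOutOfBoundsException("row " + row);
        int pos = Arrays.binarySearch(chunkStarts, row);
        // Exact hit: row is the first row of chunk pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the chunk is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Offset of the row inside its chunk.
    int offsetInChunk(long row) {
        return (int) (row - chunkStarts[chunkIndex(row)]);
    }
}
```

Because every column in this toy model would share one `ToyLayout`, a task that visits chunk 2 of one column can read the same rows from chunk 2 of any other column without further lookups.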
Example: Make a Frame from a CSV file:
File file = ...
NFSFileVec nfs = NFSFileVec.make(file); // NFS-backed Vec, lazily read on demand
Frame fr = water.parser.ParseDataset.parse(Key.make("myKey"),nfs._key);
Example: Find and remove the Vec called "unique_id" from the Frame, since modeling with a unique_id can lead to overfitting:
Vec uid = fr.remove("unique_id");
Example: Move the response column to the last position:
fr.add("response",fr.remove("response"));
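The remove-then-add idiom in the last two examples can be tried without a running cluster using the following sketch. `ToyFrame` is a hypothetical stand-in built on plain Java lists, not H2O's Frame; returning null from `remove` when the name is missing is an assumption of this sketch, not documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: parallel lists of column names and column data. Not H2O's Frame.
final class ToyFrame {
    private final List<String> names = new ArrayList<>();
    private final List<double[]> cols = new ArrayList<>();

    // Append a named column and return it.
    double[] add(String name, double[] col) {
        names.add(name);
        cols.add(col);
        return col;
    }

    // Column index with a matching name, or -1 if missing.
    int find(String name) {
        return names.indexOf(name);
    }

    // Remove the named column and return it (null if missing: an assumption of this sketch).
    double[] remove(String name) {
        int i = find(name);
        if (i < 0) return null;
        names.remove(i);       // removes by index
        return cols.remove(i); // returns the removed column
    }

    // Defensive copy of the current column order.
    List<String> names() {
        return new ArrayList<>(names);
    }
}
```

With columns added in the order response, unique_id, x1, calling `fr.remove("unique_id")` and then `fr.add("response", fr.remove("response"))` leaves the order x1, response.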
Modifier and Type | Class and Description |
---|---|
class | Frame.CSVStream |
static class | Frame.DeepSelect: Last column is a bit vec indicating whether or not to take the row. |
static class | Frame.VecSpecifier: Pair of (column name, Frame key). |
Modifier and Type | Field and Description |
---|---|
java.lang.String[] | _names: Vec names |
Constructor and Description |
---|
Frame(Frame fr): Deep copy of Vecs and Keys and Names (but not data!) to a new random Key. |
Frame(Key key): Creates an empty frame with given key. |
Frame(Key key, java.lang.String[] names, Vec[] vecs): Creates a frame with given key, names and vectors. |
Frame(Key key, Vec[] vecs, boolean noChecks): Special constructor for data with unnamed columns (e.g. |
Frame(java.lang.String[] names, Vec[] vecs): Creates an internal frame composed of the given Vecs and names. |
Frame(Vec... vecs): Creates an internal frame composed of the given Vecs and default names. |
Modifier and Type | Method and Description |
---|---|
Frame | add(Frame fr): Append a Frame onto this Frame. |
void | add(java.lang.String[] names, Vec[] vecs) |
void | add(java.lang.String[] names, Vec[] vecs, int cols) |
Vec | add(java.lang.String name, Vec vec): Append a named Vec to the Frame. |
Vec | anyVec(): Returns the first readable vector. |
Vec[] | bulkRollups() |
long | byteSize(): The Vec.byteSize of all Vecs |
int[] | cardinality(): Number of categorical levels for categorical columns; -1 for non-categorical columns. |
protected long | checksum_impl(): 64-bit checksum of the checksums of the vecs. |
Frame | deepCopy(java.lang.String keyName): Create a copy of the input Frame and return that copied Frame. |
Frame | deepSlice(java.lang.Object orows, java.lang.Object ocols): In support of R, a generic Deep Copy and Slice. |
static java.lang.String | defaultColName(int col): Default column name maker |
java.lang.String[][] | domains(): All the domains for categorical columns; null for non-categorical columns. |
static Job | export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite) |
Frame | extractFrame(int startIdx, int endIdx): Split this Frame; return a subframe created from the given column interval, and remove those columns from this Frame. |
int | find(Key key): Finds the matching column index, or -1 if missing |
int | find(java.lang.String name): Finds the column index with a matching name, or -1 if missing |
int[] | find(java.lang.String[] names): Bulk find(String) api |
int | find(Vec vec): Finds the matching column index, or -1 if missing |
boolean | hasNAs() |
void | insertVec(int i, java.lang.String name, Vec vec) |
boolean | isCompatible(Frame fr): Quick compatibility check between Frames. |
Key[] | keys(): The array of keys. |
Vec | lastVec(): Convenience accessor for the last Vec |
java.lang.String | lastVecName(): Convenience accessor for the last Vec name |
Vec[] | makeCompatible(Frame f) |
Vec[] | makeCompatible(Frame f, boolean force): Return array of Vectors if 'f' is compatible with 'this'; otherwise return a new array of Vectors compatible with 'this' and a copy of 'f's data. |
java.lang.Class<KeyV3.FrameKeyV3> | makeSchema() |
double[] | means(): All the column means. |
int[] | modes(): Majority class for categorical columns; -1 for non-categorical columns. |
void | moveFirst(int[] cols): Move the provided columns to be first, in-place. |
double[] | mults(): One over the standard deviation of each column. |
java.lang.String | name(int i): A single column name. |
java.lang.String[] | names(): The array of column names. |
int | numCols(): Number of columns |
long | numRows(): Number of rows |
Futures | postWrite(Futures fs): Allow rollups for all written-into vecs; used by MRTask once writing is complete. |
Frame | prepend(java.lang.String name, Vec vec): Insert a named column as the first column |
protected Keyed | readAll_impl(AutoBuffer ab, Futures fs) |
Vec[] | reloadVecs(): Force a cache-flush and reload, assuming vec mappings were altered remotely, or that the _vecs array was shared and now needs to be a defensive copy. |
protected Futures | remove_impl(Futures fs): Actually remove/delete all Vecs from memory, not just from the Frame. |
Vec | remove(int idx): Removes a numbered column. |
Vec[] | remove(int[] idxs): Removes a list of columns by index; the index list must be sorted |
Vec | remove(java.lang.String name): Removes the column with a matching name. |
Frame | remove(java.lang.String[] names) |
Vec | replace(int col, Vec nv): Replace one column with another. |
void | restructure(java.lang.String[] names, Vec[] vecs): Restructure a Frame completely |
void | restructure(java.lang.String[] names, Vec[] vecs, int cols): Restructure a Frame completely, but only for a specified number of columns (counting up) |
void | setNames(java.lang.String[] columns) |
Frame | subframe(int startIdx, int endIdx): Create a subframe from given interval of columns. |
Frame | subframe(java.lang.String[] names): Returns a subframe of this frame containing only vectors with desired names. |
void | swap(int lo, int hi): Swap two Vecs in-place; useful for sorting columns by some criteria |
java.io.InputStream | toCSV(boolean headers, boolean hex_string): Convert this Frame to a CSV (in an InputStream), that optionally is compatible with R 3.1's recent change to read.csv()'s behavior. |
java.lang.String | toString() |
java.lang.String | toString(long off, int len) |
java.lang.String | toString(long off, int len, boolean rollups) |
TwoDimTable | toTwoDimTable(long off, int len) |
TwoDimTable | toTwoDimTable(long off, int len, boolean rollups) |
byte[] | types(): Type for every Vec |
java.lang.String[] | typesStr(): String name for each Vec type |
java.lang.String | uniquify(java.lang.String name) |
Vec | vec(int idx): Returns the Vec by given index, implemented by code: vecs()[idx]. |
Vec | vec(java.lang.String name): Return a Vec by name, or null if missing |
Vec[] | vecs(): The internal array of Vecs. |
Vec[] | vecs(int[] idxs) |
Vec[] | vecs(java.lang.String[] names) |
protected AutoBuffer | writeAll_impl(AutoBuffer ab): Write out K/V pairs, in this case Vecs. |
Methods inherited from class Lockable: delete_and_lock, delete_and_lock, delete_and_lock, delete, delete, delete, read_lock, read_lock, read_lock, unlock_all, unlock, unlock, unlock, unlock, update, update, update, write_lock, write_lock, write_lock
Methods inherited from class Keyed: checksum, readAll, remove, remove, remove, remove, writeAll
Methods inherited from class Iced: asBytes, clone, copyOver, frozenType, read, readExternal, readJSON, reloadFromBytes, toJsonString, write, writeExternal, writeJSON
public Frame(Vec... vecs)
public Frame(java.lang.String[] names, Vec[] vecs)
public Frame(Key key)
public Frame(Key key, Vec[] vecs, boolean noChecks)
Parameters: key, vecs, noChecks
public Frame(Key key, java.lang.String[] names, Vec[] vecs)
public Frame(Frame fr)
public boolean hasNAs()
public void setNames(java.lang.String[] columns)
public static java.lang.String defaultColName(int col)
public java.lang.String uniquify(java.lang.String name)
public boolean isCompatible(Frame fr)
public int numCols()
public long numRows()
public final Vec anyVec()
public java.lang.String[] names()
public java.lang.String name(int i)
public Key[] keys()
public final Vec[] vecs()
The internal array of Vecs.
public final Vec[] vecs(int[] idxs)
public Vec[] vecs(java.lang.String[] names)
public Vec lastVec()
public java.lang.String lastVecName()
public final Vec[] reloadVecs()
public final Vec vec(int idx)
Returns the Vec by given index, implemented by code: vecs()[idx].
idx - index of the column
public Vec vec(java.lang.String name)
public int find(java.lang.String name)
public int find(Vec vec)
public int find(Key key)
public int[] find(java.lang.String[] names)
Bulk find(String) api for the names array.
public void insertVec(int i, java.lang.String name, Vec vec)
public byte[] types()
public java.lang.String[] typesStr()
public java.lang.String[][] domains()
public int[] cardinality()
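The relationship between domains() and cardinality() described in the method summary can be sketched in a few lines of plain Java. `ToyCardinality` is a hypothetical helper, not H2O code; it only assumes what the summary states, namely that a non-categorical column has a null domain and a cardinality of -1.

```java
// Hypothetical helper, not H2O code.
final class ToyCardinality {
    // domains[i] holds the level names of a categorical column, or null for a non-categorical one.
    static int[] cardinality(String[][] domains) {
        int[] card = new int[domains.length];
        for (int i = 0; i < domains.length; i++) {
            // Number of levels for categorical columns; -1 otherwise.
            card[i] = domains[i] == null ? -1 : domains[i].length;
        }
        return card;
    }
}
```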
public Vec[] bulkRollups()
public int[] modes()
public double[] means()
public double[] mults()
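A common use for a mean together with one over the standard deviation is standardizing a column as (x - mean) * mult. The sketch below shows that arithmetic in plain Java on a raw array. `ToyStandardize` is a hypothetical helper, not H2O code, and the sample standard deviation (n - 1 denominator) is an assumption, since this page does not say which variant is used.

```java
// Hypothetical helper, not H2O code.
final class ToyStandardize {
    // Arithmetic mean of the values.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // One over the sample standard deviation (n - 1 denominator: an assumption of this sketch).
    // Needs at least two values; a constant column yields Infinity.
    static double mult(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return 1.0 / Math.sqrt(ss / (xs.length - 1));
    }

    // Standardize each value as (x - mean) * mult.
    static double[] standardize(double[] xs) {
        double m = mean(xs);
        double k = mult(xs);
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++) out[i] = (xs[i] - m) * k;
        return out;
    }
}
```

Multiplying by a precomputed reciprocal avoids a division per element, which is one plausible reason to expose the value in that form.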
public long byteSize()
The Vec.byteSize of all Vecs
protected long checksum_impl()
checksum_impl in class Keyed<Frame>
public void add(java.lang.String[] names, Vec[] vecs)
public void add(java.lang.String[] names, Vec[] vecs, int cols)
public Vec add(java.lang.String name, Vec vec)
public Frame add(Frame fr)
public Frame prepend(java.lang.String name, Vec vec)
public void swap(int lo, int hi)
public void moveFirst(int[] cols)
public Frame subframe(java.lang.String[] names)
names - list of vector names
java.lang.IllegalArgumentException - if there is no vector with desired name in this frame.
public Futures postWrite(Futures fs)
Allow rollups for all written-into vecs; used by MRTask once writing is complete.
protected Futures remove_impl(Futures fs)
remove_impl in class Keyed<Frame>
protected AutoBuffer writeAll_impl(AutoBuffer ab)
writeAll_impl in class Keyed<Frame>
protected Keyed readAll_impl(AutoBuffer ab, Futures fs)
readAll_impl in class Keyed<Frame>
public Vec replace(int col, Vec nv)
public Frame subframe(int startIdx, int endIdx)
startIdx - index of first column (inclusive)
endIdx - index of the last column (exclusive)
public Frame extractFrame(int startIdx, int endIdx)
startIdx - index of first column (inclusive)
endIdx - index of the last column (exclusive)
public Vec remove(java.lang.String name)
public Frame remove(java.lang.String[] names)
public Vec[] remove(int[] idxs)
public final Vec remove(int idx)
public void restructure(java.lang.String[] names, Vec[] vecs)
public void restructure(java.lang.String[] names, Vec[] vecs, int cols)
public Frame deepSlice(java.lang.Object orows, java.lang.Object ocols)
Semantics are a little odd, to match R's. Each dimension spec can be:
The numbering is 1-based; zeros are not allowed in the lists, nor are out-of-range values.
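The 1-based rule can be illustrated with a small conversion routine. This is a sketch in plain Java, not the deepSlice implementation: `ToySelector.toZeroBased` is a hypothetical helper that handles only lists of positive selectors, and it rejects negative values as well because the accepted spec forms are not listed here.

```java
// Hypothetical helper, not H2O code.
final class ToySelector {
    // Convert 1-based selectors to 0-based indices for a dimension holding `size` entries.
    static long[] toZeroBased(long[] oneBased, long size) {
        long[] out = new long[oneBased.length];
        for (int i = 0; i < oneBased.length; i++) {
            long v = oneBased[i];
            if (v == 0)
                throw new IllegalArgumentException("zero is not allowed: numbering is 1-based");
            // Negative selectors are outside the scope of this sketch and are rejected too.
            if (v < 0 || v > size)
                throw new IllegalArgumentException("out of range: " + v);
            out[i] = v - 1;
        }
        return out;
    }
}
```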
public java.lang.String toString()
toString in class java.lang.Object
public java.lang.String toString(long off, int len)
public java.lang.String toString(long off, int len, boolean rollups)
public TwoDimTable toTwoDimTable(long off, int len)
public TwoDimTable toTwoDimTable(long off, int len, boolean rollups)
public Frame deepCopy(java.lang.String keyName)
keyName - Key for resulting frame. If null, no key will be given.
public Vec[] makeCompatible(Frame f, boolean force)
Return array of Vectors if 'f' is compatible with 'this'; otherwise return a new array of Vectors compatible with 'this' and a copy of 'f's data.
public static Job export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite)
public java.io.InputStream toCSV(boolean headers, boolean hex_string)
Convert this Frame to a CSV (in an InputStream), that optionally is compatible with R 3.1's recent change to read.csv()'s behavior.
public java.lang.Class<KeyV3.FrameKeyV3> makeSchema()
makeSchema in class Keyed<Frame>
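The difference between subframe(int, int) and extractFrame(int, int), which take the same half-open column interval (startIdx inclusive, endIdx exclusive), can be sketched with a list of column names. `ToyColumns` is a hypothetical stand-in in plain Java, not H2O code; it tracks names only and ignores the column data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that tracks column names only; not H2O's Frame.
final class ToyColumns {
    private final List<String> names;

    ToyColumns(List<String> names) {
        this.names = new ArrayList<>(names); // defensive copy
    }

    // Columns in [startIdx, endIdx); this object is left unchanged.
    ToyColumns subframe(int startIdx, int endIdx) {
        return new ToyColumns(names.subList(startIdx, endIdx));
    }

    // Same interval, but the selected columns are also removed from this object.
    ToyColumns extractFrame(int startIdx, int endIdx) {
        ToyColumns out = subframe(startIdx, endIdx);
        names.subList(startIdx, endIdx).clear();
        return out;
    }

    // Defensive copy of the current column order.
    List<String> names() {
        return new ArrayList<>(names);
    }
}
```

Starting from columns a, b, c, d, `subframe(1, 3)` yields b, c and leaves all four columns in place, while `extractFrame(1, 3)` yields b, c and leaves a, d behind.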