public class Vec extends Keyed<Vec>
A distributed vector has a count of elements, an element-to-chunk mapping, a Java-like type (mostly determines rounding on store and display), and functions to directly load elements without further indirections. The data is compressed, or backed by disk or both.
A Vec is a collection of Chunks, each of which holds between 1,000 and 1,000,000 elements. Operations on a Chunk are intended to be single-threaded; operations on a Vec are intended to be parallel and distributed on Chunk granularities, with each Chunk being manipulated by a separate CPU. The standard Map/Reduce (MRTask) paradigm handles parallel and distributed Chunk access well.
Individual elements can be directly accessed like a (very large and distributed) array; however, this is not the fastest way to access the data. Direct access from Chunks is faster, avoiding several layers of indirection. In particular, accessing a random row from the Vec will force the containing Chunk data to be cached locally (with network traffic to bring it local); accessing all rows from a single machine will force all the Big Data to be pulled local, typically resulting in swapping and very poor performance. The main API is provided for ease of small-data manipulations and is fairly slow for writing; when touching ALL the data you are much better off using e.g. MRTask.
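The Chunk-granular processing model described above can be illustrated without H2O. The following is a minimal sketch in plain Java, not H2O code: the `ChunkedSum` class and its `double[][]` layout are hypothetical stand-ins for a Vec and its Chunks. Each "chunk" is summed by one thread, and the per-chunk results are then combined, which is the same map/reduce shape that MRTask applies across a cluster.

```java
import java.util.Arrays;

// Hypothetical stand-in for a Vec: an array of "chunks", each a plain double[].
class ChunkedSum {
    // "map": a single-threaded pass over one chunk.
    static double sumChunk(double[] chunk) {
        double sum = 0;
        for (double d : chunk) sum += d;
        return sum;
    }

    // "reduce": per-chunk results are combined; chunks may be handled by separate threads.
    static double sumAll(double[][] chunks) {
        return Arrays.stream(chunks)
                     .parallel()
                     .mapToDouble(ChunkedSum::sumChunk)
                     .sum();
    }

    public static void main(String[] args) {
        double[][] chunks = { {1, 2, 3}, {4, 5}, {6} };
        System.out.println(sumAll(chunks));
    }
}
```

The point of the design is that no thread ever touches an element outside its own chunk, so no per-element coordination is needed.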
The main API is at(long), set(long, long), and isNA(long):
Returns | Call | Missing? | Notes |
---|---|---|---|
double | at(long) | NaN | |
long | at8(long) | throws | |
long | at16l(long) | throws | Low half of 128-bit UUID |
long | at16h(long) | throws | High half of 128-bit UUID |
BufferedString | atStr(water.parser.BufferedString, long) | null | Updates BufferedString in-place and returns it for flow-coding |
boolean | isNA(long) | true | |
| | set(long,double) | NaN | |
| | set(long,float) | NaN | Limited precision takes less memory |
| | set(long,long) | Cannot set | |
| | set(long,String) | null | Convenience wrapper for String |
| | setNA(long) | | |
Example manipulating some individual elements:
double r1 = vec.at(0x123456789L);  // Access element 0x123456789 as a double
double r2 = vec.at(-1);            // Throws AIOOBE
long   r3 = vec.at8(1);            // Element #1, as a long
vec.set(2,r1+r3);                  // Set element #2, as a double
Vecs have a loosely enforced type: one of numeric, UUID or String. Numeric types are further broken down into integral (long) and real (double) types. The categorical type is an integral type, with a String mapping side-array. Most of the math algorithms will treat categoricals as small dense integers, and most categorical printouts will use the String mapping. Time is another special integral type: it is represented as milliseconds since the unix epoch, and is mostly treated as an integral type when doing math but it has special time-based printout formatting. All types support the notion of a missing element; for real types this is always NaN. It is an error to attempt to fetch a missing integral type, and isNA(long) must be called first. Integral types are losslessly compressed. Real types may lose 1 or 2 ULPs due to compression.
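The categorical representation described above (small dense integers plus a String mapping side-array) can be sketched in plain Java. This is only an illustration under assumed names: `CategoricalSketch`, `DOMAIN`, `counts` and `label` are hypothetical and are not part of the Vec API.

```java
import java.util.Arrays;

// Hypothetical stand-in for a categorical column: dense int codes plus a String side-array.
class CategoricalSketch {
    static final String[] DOMAIN = { "blue", "green", "red" };  // code -> label

    // Math-oriented code works on the integer codes directly.
    static int[] counts(int[] codes) {
        int[] cnt = new int[DOMAIN.length];
        for (int c : codes) cnt[c]++;
        return cnt;
    }

    // Printouts go through the String mapping instead.
    static String label(int code) {
        return DOMAIN[code];
    }

    public static void main(String[] args) {
        int[] codes = { 2, 0, 2, 1 };
        for (int c : codes) System.out.print(label(c) + " ");
        System.out.println(Arrays.toString(counts(codes)));
    }
}
```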
Reading elements as doubles, or checking whether an element is missing, is always safe. Reading a missing integral type throws an exception, since there is no NaN equivalent in the integer domain. Writing to elements may throw if the backing data is read-only (file backed), and otherwise is fully supported.
Note this dangerous scenario: loading a missing value as a double, and setting it as a long:
set(row,(long)at(row)); // Danger!
The cast from a Double.NaN to a long produces a zero! This code will silently replace a missing value with a zero.
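The hazard is a property of the Java language rather than of Vec, so it can be reproduced in isolation. The sketch below uses hypothetical helper names (`unsafeCopy`, `safeCopy`) and a boxed `null` as a stand-in for "missing"; with a real Vec the check would be isNA(long) before reading the element.

```java
class NaNCastDemo {
    // Mimics set(row,(long)at(row)) for an element that was read as Double.NaN.
    static long unsafeCopy(double v) {
        return (long) v;   // Java's narrowing conversion maps NaN to 0
    }

    // Safer variant: test for missing first; null stands in for "missing" here.
    static Long safeCopy(double v) {
        return Double.isNaN(v) ? null : Long.valueOf((long) v);
    }

    public static void main(String[] args) {
        System.out.println(unsafeCopy(Double.NaN)); // the missing value has become 0
        System.out.println(safeCopy(Double.NaN));   // missingness is preserved
    }
}
```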
Vecs have a lazily computed RollupStats object and Key. The RollupStats give fast access to the common metrics: min(), max(), mean(), sigma(), the count of missing elements (naCnt()) and non-zeros (nzCnt()), amongst other stats. They are cleared if the Vec is modified and lazily recomputed after the modified Vec is closed. Clearing the RollupStats cache is fairly expensive for individual set(long, long) calls but is easy to amortize over a large count of writes; i.e., batch writing is efficient. This is normally handled by the MRTask framework; the Vec.Writer framework allows single-threaded efficient batch writing for smaller Vecs.
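The clear-on-write, recompute-on-read behaviour described above can be sketched with a small cache in plain Java. This is a simplified illustration, not the RollupStats implementation: the `LazyStats` class and its `recomputes` counter are hypothetical, and the real stats are distributed and keyed rather than held in a local field.

```java
// Hypothetical sketch of lazily computed, write-invalidated summary statistics.
class LazyStats {
    private final double[] data;
    private double[] cached;   // {min, max, mean}; null means "needs recompute"
    int recomputes = 0;        // counts full passes over the data, for illustration

    LazyStats(double[] data) {
        this.data = data.clone();
    }

    // Any write clears the cached stats; nothing is recomputed yet.
    void set(int i, double v) {
        data[i] = v;
        cached = null;
    }

    // First read after a write pays for one full pass; later reads are free.
    private double[] stats() {
        if (cached == null) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
            for (double d : data) {
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
            }
            cached = new double[] { min, max, sum / data.length };
            recomputes++;
        }
        return cached;
    }

    double min()  { return stats()[0]; }
    double max()  { return stats()[1]; }
    double mean() { return stats()[2]; }
}
```

A batch of writes followed by a single read costs one pass over the data, which is why amortizing invalidation over many writes is cheap.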
Example usage of common stats:
double mean = vec.mean();  // Vec's mean; first touch computes and caches rollups
double min  = vec.min();   // Smallest element; already computed
double max  = vec.max();   // Largest element; already computed
double sigma= vec.sigma(); // Standard deviation; already computed
Example: Impute (replace) missing values with the mean. Note that the use of vec.mean() in the constructor uses (and computes) the general RollupStats before the MRTask starts. Setting a value in the Chunk clears the RollupStats (since setting any value but the mean will change the mean); they will be recomputed at the next use after the MRTask.
new MRTask() {
  final double _mean = vec.mean();
  public void map( Chunk chk ) {
    for( int row=0; row < chk._len; row++ )
      if( chk.isNA(row) ) chk.set(row,_mean);
  }
}.doAll(vec);
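The ordering in the imputation example matters: the mean is fixed before any element is written, so every chunk fills in the same value. The following plain-Java sketch shows the same two-step pattern under stated assumptions: `ImputeSketch` is hypothetical, each `double[]` plays the role of one Chunk, and NaN marks a missing value.

```java
import java.util.Arrays;

// Hypothetical stand-in: each double[] plays the role of one Chunk, NaN marks a missing value.
class ImputeSketch {
    // Mean over the non-missing elements only.
    static double meanIgnoringNaN(double[][] chunks) {
        double sum = 0;
        long n = 0;
        for (double[] chk : chunks)
            for (double d : chk)
                if (!Double.isNaN(d)) { sum += d; n++; }
        return n == 0 ? Double.NaN : sum / n;
    }

    static void impute(double[][] chunks) {
        final double mean = meanIgnoringNaN(chunks);   // fixed before any element is written
        Arrays.stream(chunks).parallel().forEach(chk -> {
            for (int row = 0; row < chk.length; row++)
                if (Double.isNaN(chk[row])) chk[row] = mean;
        });
    }

    public static void main(String[] args) {
        double[][] chunks = { {1.0, Double.NaN}, {3.0, Double.NaN, 5.0} };
        impute(chunks);
        System.out.println(Arrays.deepToString(chunks));
    }
}
```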
Vecs have a Vec.VectorGroup. Vecs in the same VectorGroup have the same Chunk and row alignment - that is, Chunks with the same index are homed to the same Node and have the same count of rows-per-Chunk. Frames are only composed of Vecs of the same VectorGroup (or very small Vecs) guaranteeing that all elements of each row are homed to the same Node and set of Chunks - such that a simple for loop over a set of Chunks all operates locally. See the example in the Chunk class.
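Row alignment can be pictured as several vectors sharing one table of chunk start rows. The sketch below is an assumption-laden illustration, not the H2O implementation: it assumes the element-start-per-chunk array (what this page calls espc) holds the first row of each chunk followed by the total row count, with strictly increasing entries, and it reuses the name `elem2ChunkIdx` only for readability.

```java
import java.util.Arrays;

class RowLayoutSketch {
    // Assumed layout: espc[c] is the first row of chunk c; the last entry is the total row count.
    static int elem2ChunkIdx(long[] espc, long row) {
        if (row < 0 || row >= espc[espc.length - 1])
            throw new ArrayIndexOutOfBoundsException("row " + row);
        int idx = Arrays.binarySearch(espc, row);
        // Exact hit: row starts a chunk. Miss: the chunk is the one before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        long[] sharedEspc = { 0, 1000, 2500, 4000 };   // three chunks, 4000 rows
        // Two vectors in the same group would both consult the same table,
        // so a given row lands in the same chunk index for each of them.
        long row = 1234;
        System.out.println("vecA chunk: " + elem2ChunkIdx(sharedEspc, row));
        System.out.println("vecB chunk: " + elem2ChunkIdx(sharedEspc, row));
    }
}
```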
It is common and cheap to make new Vecs in the same VectorGroup as an existing Vec and initialized to e.g. zero. Such Vecs are often used as temps, and usually immediately set to interesting values in a later MRTask pass.
Example creation of temp Vecs:
Vec tmp0 = vec.makeZero();           // New Vec with same VectorGroup and layout as vec, filled with zero
Vec tmp1 = vec.makeCon(mean);        // Filled with 'mean'
assert tmp1.at(0x123456789L)==mean;  // All elements filled with 'mean'
for( int i=0; i<100; i++ )           // A little math on the first 100 elements
  tmp0.set(i,tmp1.at(i)+17);         // ...set into the tmp0 vec
Vec Keys have a special layout (enforced by the various Vec constructors) so there is a direct Key mapping from a Vec to a numbered Chunk and vice-versa. This mapping is crucially used in all sorts of places, basically representing a global naming scheme across a Vec and the Chunks that make it up. The basic layout created by newKey():
Key | byte 0 | byte 1 | bytes 2-5 | bytes 6-9 | bytes 10+ |
---|---|---|---|---|---|
Vec Key layout | Key.VEC | -1 | vec#grp | -1 | normal Key bytes; often e.g. a function of original file name |
Chunk Key layout | Key.CHK | -1 | vec#grp | chunk# | normal Key bytes; often e.g. a function of original file name |
RollupStats Key | Key.CHK | -1 | vec#grp | -2 | normal Key bytes; often e.g. a function of original file name |
Group Key layout | Key.GRP | -1 | -1 | -1 | normal Key bytes; often e.g. a function of original file name |
ESPC Key layout | Key.GRP | -1 | -1 | -2 | normal Key bytes; often e.g. a function of original file name |
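To make the Vec-to-Chunk key mapping concrete, here is a minimal sketch of how a fixed 10-byte prefix lets one key be derived from the other by rewriting a few bytes. Several details are assumptions and not taken from H2O: the tag values for `VEC` and `CHK` are placeholders, the byte order is simply ByteBuffer's default, and `KeyLayoutSketch` with its method names is hypothetical. The prefix length of 10 is read off the layout above.

```java
import java.nio.ByteBuffer;

class KeyLayoutSketch {
    // Placeholder tag values; the real Key.VEC / Key.CHK constants are not given in this text.
    static final byte VEC = 1, CHK = 2;
    static final int PREFIX_LEN = 10;   // bytes 0..9, per the layout shown above

    // Vec key: tag, -1, vec#grp field, -1 in the chunk# slot, then the ordinary key bytes.
    static byte[] vecKey(int vecNum, byte[] rest) {
        ByteBuffer bb = ByteBuffer.allocate(PREFIX_LEN + rest.length);
        bb.put(VEC).put((byte) -1).putInt(vecNum).putInt(-1).put(rest);
        return bb.array();
    }

    // Chunk key: same bytes as the Vec key, with the tag and the chunk# slot rewritten.
    static byte[] chunkKey(byte[] vecKey, int cidx) {
        byte[] k = vecKey.clone();
        k[0] = CHK;
        ByteBuffer.wrap(k).putInt(6, cidx);
        return k;
    }

    // Reverse mapping: recover the Vec key from a chunk key without any lookup.
    static byte[] vecKeyOf(byte[] chunkKey) {
        byte[] k = chunkKey.clone();
        k[0] = VEC;
        ByteBuffer.wrap(k).putInt(6, -1);
        return k;
    }

    static int chunkIdxOf(byte[] chunkKey) {
        return ByteBuffer.wrap(chunkKey).getInt(6);
    }
}
```

Because both directions are pure byte rewrites, any node can name a Chunk from its Vec (or the reverse) without consulting shared state.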
Modifier and Type | Class and Description |
---|---|
static class | Vec.ESPC |
class | Vec.Reader: A more efficient way to read randomly to a Vec - still single-threaded, but much faster than Vec.at(i). |
static class | Vec.VectorGroup: Class representing the group of vectors. |
class | Vec.Writer: A more efficient way to write randomly to a Vec - still single-threaded, still slow, but much faster than Vec.set(). |
Modifier and Type | Field and Description |
---|---|
int | _rowLayout: Element-start per chunk, i.e. |
static boolean | DO_HISTOGRAMS |
static int | KEY_PREFIX_LEN: Internally used to help build Vec and Chunk Keys; public to help PersistNFS build file mappings. |
static double[] | PERCENTILES: Default percentiles for approximate (single-pass) quantile computation (histogram-based). |
static byte | T_BAD |
static byte | T_CAT |
static byte | T_NUM |
static byte | T_STR |
static byte | T_TIME |
static byte | T_UUID |
static java.lang.String[] | TYPE_STR |
Constructor and Description |
---|
Vec(Key<Vec> key, int rowLayout): Build a numeric-type Vec; the caller understands Chunk layout (via the espc array). |
Vec(Key<Vec> key, int rowLayout, java.lang.String[] domain, byte type): Main default constructor; the caller understands Chunk layout (via the espc array), plus the categorical/factor domain (or null for non-categoricals), and the Vec type. |
Modifier and Type | Method and Description |
---|---|
CategoricalWrappedVec | adaptTo(java.lang.String[] domain): Make a Vec adapting this categorical vector to the 'to' categorical Vec. |
Vec | align(Vec vec): Always makes a copy of the given vector which shares the same group as this Vec. |
double | at(long i): Fetch element the slow way, as a double, or Double.NaN if missing. |
long | at16h(long i): Fetch element the slow way, as the high half of a UUID. |
long | at16l(long i): Fetch element the slow way, as the low half of a UUID. |
long | at8(long i): Fetch element the slow way, as a long. |
BufferedString | atStr(BufferedString bStr, long i): Fetch element the slow way, as a BufferedString or null if missing. |
double | base(): The base for a simple and cheap histogram of the Vec, useful for getting a broad overview of the data. |
long[] | bins(): A simple and cheap histogram of the Vec, useful for getting a broad overview of the data. |
long | byteSize(): Size of compressed vector data. |
int | cardinality(): Returns cardinality for categorical domain or -1 for other types. |
protected long | checksum_impl(): A high-quality 64-bit checksum of the Vec's content, useful for establishing dataset identity. |
Chunk | chunkForChunkIdx(int cidx): The Chunk for a chunk#. |
Chunk | chunkForRow(long i): The Chunk for a row#. |
Value | chunkIdx(int cidx): Get a Chunk's Value by index. |
Key | chunkKey(int cidx): Get a Chunk Key from a chunk-index. |
static Key | chunkKey(Key veckey, int cidx): Get a Chunk Key from a chunk-index and a Vec Key, without needing the actual Vec object. |
void | copyMeta(Vec src, Futures fs) |
Vec | doCopy() |
java.lang.String[] | domain(): Returns the categorical toString mapping array, or null if not a categorical column. |
int | elem2ChunkIdx(long i): Convert a row# to a chunk#. |
boolean | equals(java.lang.Object o): True if two Vecs are equal. |
long[] | espc() |
java.lang.String | factor(long i): Returns the i-th factor for this categorical column. |
java.lang.String | get_type_str() |
byte | get_type(): Get the column type. |
static Key | getVecKey(Key chk_key): Get a Vec Key from Chunk Key, without loading the Chunk. |
Vec.VectorGroup | group(): Get the group this vector belongs to. |
int | hashCode(): Vec's hashcode, which is just the Vec Key hashcode. |
boolean | isBad(): True if the column contains only NAs |
boolean | isBinary() |
boolean | isCategorical(): True if this is a categorical column. |
boolean | isConst(): True if the column contains only a constant value and it is not full of NAs |
boolean | isInt(): isInt is a property of numeric Vecs and not a type; this property can be changed by assigning non-integer values into the Vec (or restored by overwriting non-integer values with integers). |
boolean | isNA(long row): Fetch the missing-status the slow way. |
boolean | isNumeric(): True if this is a numeric column, excluding categorical and time types. |
boolean | isString(): True if this is a String column. |
boolean | isTime(): True if this is a time column. |
boolean | isUUID(): True if this is a UUID column. |
long[] | lazy_bins(): Optimistically return the histogram bins, or null if not computed |
long | length(): Number of elements in the vector; returned as a long instead of an int because Vecs support more than 2^32 elements. |
Vec | makeCon(double d): Make a new vector with the same size and data layout as the current one, and initialized to the given constant value. |
static Vec | makeCon(double x, long len): Make a new constant vector with the given row count, and redistribute the data evenly around the cluster. |
static Vec | makeCon(double x, long len, boolean redistribute): Make a new constant vector with the given row count. |
static Vec | makeCon(double x, long len, int log_rows_per_chunk): Make a new constant vector with the given row count, and redistribute the data evenly around the cluster. |
static Vec | makeCon(double x, long len, int log_rows_per_chunk, boolean redistribute): Make a new constant vector with the given row count. |
static Vec | makeCon(Key<Vec> k, double... rows): A Vec from an array of doubles |
static Vec | makeCon(long totSize, long len): Make a new constant vector with minimal number of chunks. |
static Vec | makeCon(long l, java.lang.String[] domain, Vec.VectorGroup group, int rowLayout) |
static Vec[] | makeCons(double x, long len, int n) |
Vec[] | makeCons(int n, long l, java.lang.String[][] domains, byte[] types) |
Vec | makeCopy(): A new vector which is a copy of this one. |
Vec | makeCopy(java.lang.String[] domain): A new vector which is a copy of this one. |
Vec | makeCopy(java.lang.String[] domain, byte type) |
Vec[] | makeDoubles(int n, double[] values) |
Vec | makeRand(long seed): Make a new vector initialized to random numbers with the given seed |
static Vec | makeRepSeq(long len, long repeat): Make a new vector initialized to increasing integers mod repeat. |
static Vec | makeSeq(long len, boolean redistribute): Make a new vector initialized to increasing integers, starting with 1. |
static Vec | makeSeq(long min, long len): Make a new vector initialized to increasing integers, starting with `min`. |
static Vec | makeSeq(long min, long len, boolean redistribute): Make a new vector initialized to increasing integers, starting with `min`. |
static Vec | makeVec(double[] vals, Key<Vec> vecKey) |
static Vec | makeVec(double[] vals, java.lang.String[] domain, Key<Vec> vecKey) |
static Vec | makeVec(long[] vals, java.lang.String[] domain, Key<Vec> vecKey) |
Vec | makeZero(): Make a new vector with the same size and data layout as the current one, and initialized to zero. |
static Vec | makeZero(long len): Make a new zero-filled vector with the given row count. |
static Vec | makeZero(long len, boolean redistribute): Make a new zero-filled vec |
Vec | makeZero(java.lang.String[] domain): A new vector with the same size and data layout as the current one, and initialized to zero, with the given categorical domain. |
Vec[] | makeZeros(int n) |
Vec[] | makeZeros(int n, java.lang.String[][] domain, byte[] types) |
double | max(): Vec's maximum value |
double[] | maxs(): Vec's 5 largest values |
double | mean(): Vec's mean |
double | min(): Vec's minimum value |
double[] | mins(): Vec's 5 smallest values |
int | mode(): Vec's mode |
long | naCnt(): Count of missing elements |
int | nChunks(): Number of chunks, returned as an int - Chunk count is limited by the max size of a Java long[]. |
static Key<Vec> | newKey(): Make a new random Key that fits the requirements for a Vec key. |
long | ninfs(): Count of negative infinities |
long | nzCnt(): Count of non-zero elements |
Vec.Writer | open(): Create a writer for bulk serial writes into this Vec. |
double[] | pctiles(): Simple and cheap percentiles of the Vec, useful for getting a broad overview of the data. |
long | pinfs(): Count of positive infinities |
Futures | postWrite(Futures fs): Stop writing into this Vec. |
void | preWriting(): Begin writing into this Vec. |
protected Keyed | readAll_impl(AutoBuffer ab, Futures fs) |
Futures | remove_impl(Futures fs): Remove associated Keys when this Vec is removed. |
Key | rollupStatsKey() |
void | set(long i, double d): Write element the slow way, as a double. |
void | set(long i, float f): Write element the slow way, as a float. |
void | set(long i, long l): Write element the slow way, as a long. |
void | set(long i, java.lang.String str): Write element the slow way, as a String. |
void | setBad() |
void | setDomain(java.lang.String[] domain): Set the categorical/factor names. |
double | sigma(): Vec's standard deviation |
double | sparseRatio() |
void | startRollupStats(Futures fs) |
void | startRollupStats(Futures fs, boolean doHisto): Check if we have local cached copy of basic Vec stats (including histogram if requested) and if not start task to compute and fetch them; useful when launching a bunch of them in parallel to avoid single threaded execution later (e.g. |
double | stride(): The stride for a simple and cheap histogram of the Vec, useful for getting a broad overview of the data. |
Vec | toCategoricalVec(): Convenience method for converting to a categorical vector. |
Vec | toNumericVec(): Convenience method for converting to a numeric vector. |
java.lang.String | toString(): Pretty print the Vec: [#elems, min/mean/max]{chunks,...} |
Vec | toStringVec(): Convenience method for converting to a string vector. |
protected AutoBuffer | writeAll_impl(AutoBuffer ab): Write out K/V pairs |
Methods inherited from class Keyed: checksum, makeSchema, readAll, remove, remove, remove, remove, writeAll
Methods inherited from class Iced: asBytes, clone, copyOver, frozenType, read, readExternal, readJSON, reloadFromBytes, toJsonString, write, writeExternal, writeJSON
public int _rowLayout
public static final byte T_BAD
public static final byte T_UUID
public static final byte T_STR
public static final byte T_NUM
public static final byte T_CAT
public static final byte T_TIME
public static final java.lang.String[] TYPE_STR
public static final boolean DO_HISTOGRAMS
public static final double[] PERCENTILES
public static final int KEY_PREFIX_LEN
public Vec(Key<Vec> key, int rowLayout)
Build a numeric-type Vec; the caller understands Chunk layout (via the espc array).
public java.lang.String[] domain()
public final java.lang.String factor(long i)
Returns the i-th factor for this categorical column.
Returns: the i-th factor
public final void setDomain(java.lang.String[] domain)
public final int cardinality()
public final boolean isCategorical()
True if this is a categorical column. All categorical columns are also isInt(), but not vice-versa.
public final double sparseRatio()
public final boolean isUUID()
public final boolean isString()
public final boolean isNumeric()
public final boolean isTime()
True if this is a time column. All time columns are also isInt(), but not vice-versa.
public long[] espc()
public long length()
Number of elements in the vector; returned as a long instead of an int because Vecs support more than 2^32 elements. Overridden by subclasses that compute length in an alternative way, such as file-backed Vecs.
public int nChunks()
Number of chunks, returned as an int - Chunk count is limited by the max size of a Java long[]. Overridden by subclasses that compute chunks in an alternative way, such as file-backed Vecs.
public void setBad()
public byte get_type()
public java.lang.String get_type_str()
public boolean isBinary()
public static Vec makeZero(long len, boolean redistribute)
public static Vec makeZero(long len)
public static Vec makeCon(double x, long len)
Parameters:
x - The value with which to fill the Vec.
len - Number of rows.
public static Vec makeCon(double x, long len, boolean redistribute)
public static Vec makeCon(double x, long len, int log_rows_per_chunk)
public static Vec makeCon(long totSize, long len)
public static Vec makeCon(double x, long len, int log_rows_per_chunk, boolean redistribute)
public Vec[] makeDoubles(int n, double[] values)
public Vec makeZero()
public Vec makeZero(java.lang.String[] domain)
public Vec makeCopy()
A new vector which is a copy of this one.
public Vec makeCopy(java.lang.String[] domain)
A new vector which is a copy of this one.
public Vec makeCopy(java.lang.String[] domain, byte type)
public Vec doCopy()
public static Vec makeCon(long l, java.lang.String[] domain, Vec.VectorGroup group, int rowLayout)
public static Vec[] makeCons(double x, long len, int n)
public Vec makeCon(double d)
public Vec[] makeZeros(int n)
public Vec[] makeZeros(int n, java.lang.String[][] domain, byte[] types)
public Vec[] makeCons(int n, long l, java.lang.String[][] domains, byte[] types)
public static Vec makeCon(Key<Vec> k, double... rows)
Parameters:
rows - Data
public static Vec makeSeq(long len, boolean redistribute)
public static Vec makeSeq(long min, long len)
public static Vec makeSeq(long min, long len, boolean redistribute)
public static Vec makeRepSeq(long len, long repeat)
Make a new vector initialized to increasing integers mod repeat.
public Vec makeRand(long seed)
public double min()
public double[] mins()
public double max()
public double[] maxs()
public final boolean isConst()
public final boolean isBad()
public double mean()
public double sigma()
public int mode()
public long naCnt()
public long nzCnt()
public long pinfs()
public long ninfs()
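public boolean isInt()
isInt is a property of numeric Vecs and not a type; this property can be changed by assigning non-integer values into the Vec (or restored by overwriting non-integer values with integers). This property always holds for isCategorical() and isTime() Vecs.
public long byteSize()
public long[] bins()
A simple and cheap histogram of the Vec, useful for getting a broad overview of the data. The bin ranges are determined by base() and stride(). The histogram is computed on first use and cached thereafter.
public long[] lazy_bins()
public double base()
The base for a simple and cheap histogram of the Vec, useful for getting a broad overview of the data. This returns the base of bins()[0].
Returns: the base of bins()[0]
public double stride()
The stride for a simple and cheap histogram of the Vec, useful for getting a broad overview of the data. This returns the stride between any two bins.
public double[] pctiles()
Simple and cheap percentiles of the Vec, useful for getting a broad overview of the data. The percentiles computed are those in PERCENTILES.
public void startRollupStats(Futures fs)
public void startRollupStats(Futures fs, boolean doHisto)
Parameters:
fs - Futures that allow waiting for this task to finish.
doHisto - Also compute the histogram; this requires a second pass over the data and is not computed by default.
protected long checksum_impl()
Overrides: checksum_impl in class Keyed<Vec>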
public void preWriting()
public Futures postWrite(Futures fs)
public int elem2ChunkIdx(long i)
public static Key getVecKey(Key chk_key)
public Key chunkKey(int cidx)
public static Key chunkKey(Key veckey, int cidx)
public Key rollupStatsKey()
public Value chunkIdx(int cidx)
Get a Chunk's Value by index, using DKV.get(). Warning: this pulls the data locally; using this call on every Chunk index on the same node will probably trigger an OOM!
public static Key<Vec> newKey()
public final Vec.VectorGroup group()
public Chunk chunkForChunkIdx(int cidx)
public final Chunk chunkForRow(long i)
public final long at8(long i)
Returns: the i-th element as a long, or throws if missing
public final double at(long i)
Returns: the i-th element as a double, or Double.NaN if missing
public final boolean isNA(long row)
public final long at16l(long i)
Returns: the i-th element as a UUID low half, or throws if missing
public final long at16h(long i)
Returns: the i-th element as a UUID high half, or throws if missing
public final BufferedString atStr(BufferedString bStr, long i)
Fetch element the slow way, as a BufferedString or null if missing. Throws if the value is not a String. BufferedStrings are String-like objects that can be reused in-place, which is much more efficient than constructing Strings.
Returns: the i-th element as BufferedString or null if missing, or throws if not a String
public final void set(long i, long l)
public final void set(long i, double d)
public final void set(long i, float f)
public final void set(long i, java.lang.String str)
Write element the slow way, as a String. null will be treated as a set of a missing element.
public final Vec.Writer open()
public java.lang.String toString()
Pretty print the Vec: [#elems, min/mean/max]{chunks,...}
Overrides: toString in class java.lang.Object
public Vec toCategoricalVec()
public Vec toStringVec()
public Vec toNumericVec()
public boolean equals(java.lang.Object o)
Overrides: equals in class java.lang.Object
public int hashCode()
Overrides: hashCode in class java.lang.Object
public Futures remove_impl(Futures fs)
Overrides: remove_impl in class Keyed<Vec>
protected AutoBuffer writeAll_impl(AutoBuffer ab)
Overrides: writeAll_impl in class Keyed<Vec>
protected Keyed readAll_impl(AutoBuffer ab, Futures fs)
Overrides: readAll_impl in class Keyed<Vec>
public Vec align(Vec vec)
Parameters:
vec - vector which is intended to be copied
Returns: a copy of vec which shares the same Vec.VectorGroup with this vector
public CategoricalWrappedVec adaptTo(java.lang.String[] domain)