public abstract class Chunk extends Iced<Chunk>
Each Chunk holds a subset of a Vec. The actual vector
header info is in the Vec, which contains the information needed to find
all the bytes of the distributed vector. Subclasses of this abstract class
implement (possibly empty) compression schemes.
Chunks are collections of elements, and support an array-like API.
Chunks are subsets of a Vec; while the elements in a Vec are numbered
starting at 0, any given Chunk has some (probably non-zero) starting row,
and a length which is smaller than the whole Vec. Chunks are limited to a
single Java byte array in a single JVM heap, and only an int's worth of
elements. Chunks support both a global row number and a chunk-local
numbering. The global row-number calls are the variants of at and set that
take absolute row numbers, such as set_abs. If the row is outside the
current Chunk's range, the data will be loaded by fetching from the correct
Chunk. This probably involves some network traffic, and if all rows are
loaded then the entire dataset will be pulled local (possibly triggering an
OutOfMemoryError).
The chunk-local numbering supports the common for loop iterator pattern,
using the chunk-relative at and set calls such as atd(row), and is faster
than the global row numbering for tight loops (because it avoids some range
checks):
for( int row=0; row < chunk._len; row++ ) ...chunk.atd(row)...
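The relationship between the two numbering schemes can be sketched without H2O at all. The ToyChunk class below is hypothetical and is not the H2O Chunk; it only models a starting row, a length, and the translation plus range check that a global-row call has to perform before it can index the local data. Throwing on an out-of-range row is a simplification of this sketch; the text above says a real Chunk fetches the data from the correct Chunk instead.

```java
// Hypothetical stand-in for a Chunk: NOT the H2O class, just the indexing idea.
final class ToyChunk {
    final long start;      // global row number of local row 0
    final double[] vals;   // uncompressed local data

    ToyChunk(long start, double[] vals) { this.start = start; this.vals = vals; }

    int len() { return vals.length; }

    // Chunk-local access: no translation, just an array load.
    double atd(int row) { return vals[row]; }

    // Global access: translate to a local index and range-check first.
    // This sketch throws where a real Chunk would fetch the right Chunk.
    double atAbs(long globalRow) {
        long local = globalRow - start;
        if (local < 0 || local >= vals.length)
            throw new IndexOutOfBoundsException("row " + globalRow + " is not in this chunk");
        return vals[(int) local];
    }

    // The tight loop over chunk-local row numbers.
    static double sum(ToyChunk c) {
        double s = 0;
        for (int row = 0; row < c.len(); row++) s += c.atd(row);
        return s;
    }

    public static void main(String[] args) {
        ToyChunk c = new ToyChunk(1000, new double[] {1.5, 2.5, 4.0});
        System.out.println(c.atd(1) + " " + c.atAbs(1001) + " " + sum(c));
    }
}
```

Local row 1 and global row 1001 name the same element here because the chunk starts at global row 1000.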
The array-like API allows loading and storing elements in and out of
Chunks. When loading, values are decompressed. When storing, an attempt
to compress back into the actual underlying Chunk subclass is made; if this
fails the Chunk is "inflated" into a NewChunk, and the store completed
there. Later the NewChunk will be compressed (probably into a different
underlying Chunk subclass) and put back in the K/V store under the same
Key, effectively replacing the original Chunk; this is done when
close(int, water.Futures) is called, and is taken care of by the standard
MRTask calls.
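The store path described above (write in place if the value fits, otherwise inflate, then recompress on close) can be illustrated with a toy column. Everything in this sketch is hypothetical: InflatingColumn, its one-byte-per-element "compression", and its boolean-returning close() are stand-ins chosen for brevity, not H2O's Chunk, NewChunk, or close(int, water.Futures).

```java
// Hypothetical sketch of "inflate on store"; none of these names are H2O's.
final class InflatingColumn {
    private byte[] packed;     // compressed form: one byte per element
    private long[] inflated;   // non-null once a store did not fit in a byte

    InflatingColumn(byte[] packed) { this.packed = packed; }

    long at8(int i) { return inflated != null ? inflated[i] : packed[i]; }

    boolean isInflated() { return inflated != null; }

    void set(int i, long v) {
        if (inflated == null && v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            packed[i] = (byte) v;              // fits: store in place
            return;
        }
        if (inflated == null) {                // does not fit: inflate first
            inflated = new long[packed.length];
            for (int k = 0; k < packed.length; k++) inflated[k] = packed[k];
            packed = null;
        }
        inflated[i] = v;                       // complete the store in the wide form
    }

    // Analogue of close(): try to compress back to bytes.
    // Returns true if the data is byte-packed afterwards.
    boolean close() {
        if (inflated == null) return true;
        for (long v : inflated)
            if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE) return false; // stays wide
        packed = new byte[inflated.length];
        for (int k = 0; k < inflated.length; k++) packed[k] = (byte) inflated[k];
        inflated = null;
        return true;
    }

    public static void main(String[] args) {
        InflatingColumn col = new InflatingColumn(new byte[] {1, 2, 3});
        col.set(1, 1000);                      // forces inflation
        System.out.println(col.isInflated() + " " + col.at8(1));
    }
}
```

The point of the sketch is the cost profile: a store that fits is a single array write, while a store that does not fit copies the whole column.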
Chunk updates are not multi-thread safe; the caller must do correct
synchronization. This is already handled by the Map/Reduce (MRTask)
framework. Chunk updates are not visible cross-cluster until the
close(int, water.Futures) call is made; again this is handled by MRTask
directly.
In addition to normal load and store operations, Chunks support the
notion of a missing element via the isNA_abs() calls, and a "next non-zero"
notion for rapidly iterating over sparse data.
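As a rough illustration of the "next non-zero" idea, the sketch below stores only the non-zero rows of a column and walks them directly. SparseColumn is hypothetical, and the contract assumed for nextNZ here (the smallest stored row strictly greater than the argument, or the length when none is left) is an assumption of this sketch, not something the text above specifies.

```java
import java.util.Arrays;

// Hypothetical sparse column: only non-zero rows are stored.
final class SparseColumn {
    final int len;        // logical number of rows
    final int[] ids;      // sorted row numbers of the non-zero entries
    final double[] vals;  // value for each entry in ids

    SparseColumn(int len, int[] ids, double[] vals) {
        this.len = len; this.ids = ids; this.vals = vals;
    }

    // Assumed contract: smallest stored row strictly greater than rid, or len if none.
    int nextNZ(int rid) {
        int pos = Arrays.binarySearch(ids, rid + 1);
        if (pos < 0) pos = -pos - 1;           // convert "not found" to insertion point
        return pos < ids.length ? ids[pos] : len;
    }

    // Rows that are not stored read as zero.
    double atd(int row) {
        int pos = Arrays.binarySearch(ids, row);
        return pos >= 0 ? vals[pos] : 0.0;
    }

    // Visit only the non-zero rows instead of all len rows.
    static double sum(SparseColumn c) {
        double s = 0;
        for (int row = c.nextNZ(-1); row < c.len; row = c.nextNZ(row)) s += c.atd(row);
        return s;
    }

    public static void main(String[] args) {
        SparseColumn c = new SparseColumn(10, new int[] {2, 7}, new double[] {1.5, 3.0});
        System.out.println(sum(c));
    }
}
```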
Data Types
Chunks hold Java primitive values, timestamps, UUIDs, or Strings. All the Chunks in a Vec hold the same type. Most of the types are compressed. Integer types (boolean, byte, short, int, long) are always lossless. Float and Double types might lose 1 or 2 ulps in the compression. Time data is held as milliseconds since the Unix Epoch. UUIDs are held as 128-bit integers (a pair of Java longs). Strings are compressed in various obvious ways. Sparse data is held... sparsely; e.g. loading data in SVMLight format will not "blow up" the in-memory representation. Categoricals/factors are held as small integers, with a shared String lookup table on the side.
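The encodings named above map onto standard Java types. The sketch below uses only JDK classes (java.util.UUID and java.time.Instant) to show a UUID as a pair of longs, a timestamp as milliseconds since the Unix Epoch, and a categorical as a small integer code with a String lookup table. The TypeEncodings class and its method names are illustrative; how H2O lays these values out inside a Chunk is not shown.

```java
import java.time.Instant;
import java.util.UUID;

// Illustrative helpers only; these names are not part of the H2O API.
final class TypeEncodings {
    // A UUID as a (high, low) pair of longs.
    static long[] split(UUID u) {
        return new long[] { u.getMostSignificantBits(), u.getLeastSignificantBits() };
    }

    // Rebuild the UUID from its two halves.
    static UUID join(long hi, long lo) { return new UUID(hi, lo); }

    // A timestamp as milliseconds since the Unix Epoch.
    static long millis(String iso) { return Instant.parse(iso).toEpochMilli(); }

    // A categorical as a small integer code plus a shared String table.
    static String level(String[] domain, int code) { return domain[code]; }

    public static void main(String[] args) {
        UUID u = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        long[] halves = split(u);
        System.out.println(join(halves[0], halves[1]).equals(u));
        System.out.println(millis("1970-01-02T00:00:00Z"));
        System.out.println(level(new String[] {"red", "green"}, 1));
    }
}
```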
Chunks support the notion of missing data. Missing float and double data is always treated as a NaN, whether it is read or written. There is no equivalent of NaN for integer data; reading a missing integer value is a coding error and will be flagged. If you are working with integer data with missing elements, you must first check for a missing value before loading it:
if( !chk.isNA(row) ) ...chk.at8(row)....
The same holds true for the other non-real types (timestamps, UUIDs, Strings, or categoricals); they must be checked for missing before being used.
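The check-before-load rule can be shown with a toy integer column. LongColumn below is hypothetical: it tracks missing rows in a boolean array and throws IllegalStateException from at8 on a missing row. Both the storage and the exception type are assumptions of this sketch; the text above only says that such a read is flagged.

```java
// Hypothetical integer column with missing values; not the H2O Chunk.
final class LongColumn {
    private final long[] vals;
    private final boolean[] missing;   // true where the row has no value

    LongColumn(long[] vals, boolean[] missing) { this.vals = vals; this.missing = missing; }

    int len() { return vals.length; }

    boolean isNA(int row) { return missing[row]; }

    // There is no integer NaN, so reading a missing row is treated as a bug.
    long at8(int row) {
        if (missing[row]) throw new IllegalStateException("at8 on a missing value, row " + row);
        return vals[row];
    }

    // Correct usage: test isNA before every at8.
    static long sumPresent(LongColumn c) {
        long s = 0;
        for (int row = 0; row < c.len(); row++)
            if (!c.isNA(row)) s += c.at8(row);
        return s;
    }

    public static void main(String[] args) {
        LongColumn c = new LongColumn(new long[] {10, 0, 32}, new boolean[] {false, true, false});
        System.out.println(sumPresent(c));
    }
}
```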
Performance Concerns
The standard for loop mentioned above is the fastest way to access data;
definitely faster (and less error prone) than iterating over global row
numbers. Iterating over a single Chunk is nearly always memory-bandwidth
bound. Often code will iterate over a number of Chunks aligned together
(the common use-case of looking at whole rows of a dataset). Again,
typically such a code pattern is memory-bandwidth bound, although x86
hardware prefetching stops working well beyond 100 or 200 Chunks.
Note that Chunk alignment is guaranteed within all the Vecs of a Frame: Same numbered Chunks of different Vecs will have the same global row numbering and the same length, enabling a particularly simple and efficient way to iterate over all rows.
For each row, this example computes the squared Euclidean distance between the row (all columns except the last) and a given point, and stores it back in the last column. Note that due to "NaN poisoning", if any row element is missing, the entire distance calculated will be NaN.
final double[] _point;                              // The given point
public void map( Chunk[] chks ) {                   // Map over a set of same-numbered Chunks
  for( int row=0; row < chks[0]._len; row++ ) {     // For all rows
    double dist=0;                                  // Squared distance
    for( int col=0; col < chks.length-1; col++ ) {  // For all cols, except the last output col
      double d = chks[col].atd(row) - _point[col];  // Distance along this dimension
      dist += d*d;                                  // Sum-squared-distance
    }
    chks[chks.length-1].set( row, dist );           // Store back the distance in the last col
  }
}
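The same loop can be run outside a cluster by letting plain arrays stand in for the aligned Chunks. In this sketch, which assumes nothing about H2O beyond the loop shape above, cols[c][row] plays the role of chks[c].atd(row), and the RowDistance class name is made up for illustration. It also makes the NaN poisoning visible: one missing input makes that row's result NaN.

```java
// Plain-array stand-in for the map() above; RowDistance is a hypothetical name.
final class RowDistance {
    // cols[c][row] holds column c; the last column receives the output.
    static void map(double[][] cols, double[] point) {
        int nrows = cols[0].length;
        int out = cols.length - 1;                   // index of the output column
        for (int row = 0; row < nrows; row++) {
            double dist = 0;                         // squared distance for this row
            for (int col = 0; col < out; col++) {
                double d = cols[col][row] - point[col];
                dist += d * d;                       // NaN in any input propagates here
            }
            cols[out][row] = dist;
        }
    }

    public static void main(String[] args) {
        double[][] cols = {
            {3.0, Double.NaN},   // column 0
            {4.0, 1.0},          // column 1
            {0.0, 0.0}           // output column
        };
        map(cols, new double[] {0.0, 0.0});
        System.out.println(cols[2][0] + " " + cols[2][1]);
    }
}
```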
Modifier and Type | Field and Description |
---|---|
int | _len: Number of rows in this Chunk; publicly a read-only field. |
Constructor and Description |
---|
Chunk() |
Modifier and Type | Method and Description |
---|---|
byte[] | asBytes(): Return serialized version of self as a byte array. |
int | asSparseDoubles(double[] vals, int[] ids): Sparse bulk interface; streams through the compressed values and extracts them into a dense double array. |
int | asSparseDoubles(double[] vals, int[] ids, double NA) |
long | at16h(int i): High half of a 128-bit UUID, or throws if the value is missing. |
long | at16l(int i): Low half of a 128-bit UUID, or throws if the value is missing. |
long | at8(int i): Load a long value using chunk-relative row numbers. |
double | atd(int i): Load a double value using chunk-relative row numbers. |
BufferedString | atStr(BufferedString bStr, int i): String value using chunk-relative row numbers, or null if missing. |
long | byteSize(): In-memory size in bytes of the compressed Chunk plus embedded array. |
Chunk | chk2(): Exposed for internal testing only. |
int | cidx() |
Futures | close(int cidx, Futures fs): After writing we must call close() to register the bulk changes. |
void | crushBytes(): Used by a ParseExceptionTest to break the Chunk invariants and trigger an NPE. |
Chunk | deepCopy() |
byte[] | getBytes(): Short-cut to the embedded big-data memory. |
double[] | getDoubles(double[] vals, int[] ids): Dense bulk interface, fetch values from the given ids. |
double[] | getDoubles(double[] vals, int from, int to): Dense bulk interface, fetch values from the given range. |
double[] | getDoubles(double[] vals, int from, int to, double NA) |
int[] | getIntegers(int[] vals, int from, int to, int NA) |
boolean | hasFloat() |
boolean | hasNA() |
abstract NewChunk | inflate_impl(NewChunk nc): Chunk-specific bulk inflater back to NewChunk. |
NewChunk | inflate() |
protected abstract void | initFromBytes() |
boolean | isNA(int i): Missing value status using chunk-relative row numbers. |
boolean | isSparseNA(): Sparse Chunks have a significant number of NAs, and support for skipping over large runs of NAs in a row. |
boolean | isSparseZero(): Sparse Chunks have a significant number of zeros, and support for skipping over large runs of zeros in a row. |
int | len(): Read-only length of chunk (number of rows). |
Chunk | nextChunk(): Return the next Chunk, or null if at end. |
int | nextNNA(int rid) |
int | nextNZ(int rid) |
int | nonnas(int[] res): Get chunk-relative indices of values (nonnas for nasparse, all for dense) stored in this chunk. |
int | nonzeros(int[] res): Get indices of non-zero values stored in this chunk. |
byte | precision(): Fixed-width format printing support. |
Chunk | read_impl(AutoBuffer ab) |
Chunk | reloadFromBytes(byte[] ary): Replace yourself with deserialized version from the given bytes. |
void | replaceAll(Chunk replacement): Replace all rows with this new chunk. |
void | reportBrokenCategorical(int i, int j, long l, int[] cmap, int levels): Used by the parser to help report various internal bugs. |
void | set_abs(long i, java.lang.String str): Set a String, using absolute row numbers. |
double[] | set(double[] d) |
double | set(int idx, double d): Write a double with chunk-relative indexing. |
float | set(int idx, float f): Write a float with chunk-relative indexing. |
long | set(int idx, long l): Write a long with chunk-relative indexing. |
java.lang.String | set(int idx, java.lang.String str): Write a String with chunk-relative indexing. |
void | setBytes(byte[] mem) |
boolean | setNA(int idx): Set a value as missing. |
void | setStart(long start): Set the start. |
void | setVec(Vec vec): Set the owning Vec. |
int | sparseLenNA(): Sparse Chunks have a significant number of NAs, and support for skipping over large runs of NAs in a row. |
int | sparseLenZero(): Sparse Chunks have a significant number of zeros, and support for skipping over large runs of zeros in a row. |
long | start(): Global starting row for this local Chunk. |
java.lang.String | toString() |
Vec | vec(): Owning Vec. |
AutoBuffer | write_impl(AutoBuffer bb): Custom serializers implemented by Chunk subclasses: the _mem field contains ALL the fields already. |
Methods inherited from class Iced: clone, copyOver, frozenType, read, readExternal, readJSON, toJsonString, write, writeExternal, writeJSON
public transient int _len
NO-ACCESSOR: This is a high-performance field, and must have a known zero-cost cost-model; accessors hide that cost model, and make it non-obvious whether a loop will be properly optimized.
not-final: set in various deserializers.
Proper usage: read the field, probably in a hot loop.
for( int row=0; row < chunk._len; row++ ) ...chunk.atd(row)...
public int asSparseDoubles(double[] vals, int[] ids)
Sparse bulk interface; streams through the compressed values and extracts them into a dense double array.
vals - holds extracted values, length must be >= this.sparseLen()
ids - holds extracted chunk-relative row ids, length must be >= this.sparseLen()
public int asSparseDoubles(double[] vals, int[] ids, double NA)
public double[] getDoubles(double[] vals, int from, int to)
Dense bulk interface, fetch values from the given range.
Parameters: vals, from, to
public double[] getDoubles(double[] vals, int from, int to, double NA)
public int[] getIntegers(int[] vals, int from, int to, int NA)
public double[] getDoubles(double[] vals, int[] ids)
Dense bulk interface, fetch values from the given ids.
Parameters: vals, ids
public final long start()
public int len()
public Chunk chk2()
public Vec vec()
public void setVec(Vec vec)
public void setStart(long start)
public byte[] getBytes()
public void setBytes(byte[] mem)
public final void crushBytes()
public final double atd(int i)
Load a double value using chunk-relative row numbers. Returns Double.NaN if value is missing.
public final long at8(int i)
Load a long value using chunk-relative row numbers. Floating point values are silently rounded to a long. Throws if the value is missing.
public final boolean isNA(int i)
public final long at16l(int i)
public final long at16h(int i)
public final BufferedString atStr(BufferedString bStr, int i)
public final void set_abs(long i, java.lang.String str)
Set a String, using absolute row numbers.
As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. That is, there is a nontrivial cost if the Chunk compression type needs to change.
This version uses absolute element numbers, but must convert them to chunk-relative indices, which requires a load from an aliasing local var and leads to lower-quality JIT'd code (a similar issue to using iterator objects).
public boolean hasFloat()
public boolean hasNA()
public void replaceAll(Chunk replacement)
public Chunk deepCopy()
public final long set(int idx, long l)
Write a long with chunk-relative indexing. There is no way to write a missing value with this call. Under rare circumstances this can throw: if the long does not fit in a double (value is larger magnitude than 2^52), AND float values are stored in the Vec. In this case, there is no common compatible data representation.
As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. That is, there is a nontrivial cost if the Chunk compression type needs to change.
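The "does not fit in a double" condition for set(int, long) can be made concrete with a round-trip test in plain Java. This is only a sketch of the general idea: LongFit and fitsInDouble are hypothetical names, and whether H2O performs this particular check, or uses the magnitude threshold quoted above instead, is not something this page states.

```java
// Hypothetical helper; shows when a long cannot be stored exactly as a double.
final class LongFit {
    // True if l survives a round trip through double unchanged.
    static boolean fitsInDouble(long l) {
        double d = (double) l;
        // Guard the saturating cast: (long) 2^63 clamps to Long.MAX_VALUE,
        // which would otherwise make Long.MAX_VALUE look like an exact fit.
        return d < 0x1p63 && (long) d == l;
    }

    public static void main(String[] args) {
        System.out.println(fitsInDouble(1L << 60));        // a power of two is exact
        System.out.println(fitsInDouble((1L << 60) + 1));  // the low bit is lost
    }
}
```

A column that already holds fractional values has to stay in a floating-point representation, so a long that fails this kind of test has nowhere exact to go, which is the situation the paragraph above describes.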
public final double[] set(double[] d)
public final double set(int idx, double d)
Write a double with chunk-relative indexing. NaN will be treated as a missing value.
As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. That is, there is a nontrivial cost if the Chunk compression type needs to change.
public final float set(int idx, float f)
Write a float with chunk-relative indexing. NaN will be treated as a missing value.
As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. That is, there is a nontrivial cost if the Chunk compression type needs to change.
public final boolean setNA(int idx)
Set a value as missing.
As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. That is, there is a nontrivial cost if the Chunk compression type needs to change.
public final java.lang.String set(int idx, java.lang.String str)
Write a String with chunk-relative indexing. null will be treated as a missing value.
As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. That is, there is a nontrivial cost if the Chunk compression type needs to change.
public Futures close(int cidx, Futures fs)
After writing we must call close() to register the bulk changes. Only after the DKV.put completes will all readers of this Chunk witness the changes.
Returns: the Futures, for flow-coding.
public int cidx()
public boolean isSparseZero()
public int sparseLenZero()
_len
public int nextNZ(int rid)
public int nonzeros(int[] res)
public boolean isSparseNA()
public int sparseLenNA()
_len
public int nextNNA(int rid)
public int nonnas(int[] res)
public NewChunk inflate()
public abstract NewChunk inflate_impl(NewChunk nc)
public Chunk nextChunk()
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public long byteSize()
public final AutoBuffer write_impl(AutoBuffer bb)
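public final byte[] asBytes()
Description copied from interface Freezable: Return serialized version of self as a byte array.
public final Chunk reloadFromBytes(byte[] ary)
Description copied from interface Freezable: Replace yourself with deserialized version from the given bytes.
Specified by: reloadFromBytes in interface Freezable<Chunk>
Overrides: reloadFromBytes in class Iced<Chunk>
ary - byte array containing exactly (i.e. nothing else) the serialized version of the Freezable
protected abstract void initFromBytes()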
public final Chunk read_impl(AutoBuffer ab)
public byte precision()
public final void reportBrokenCategorical(int i, int j, long l, int[] cmap, int levels)