Chunk (h2o-core version 3.10.0.3 API)

java.lang.Object
- water.Iced<Chunk>
- - water.fvec.Chunk

All Implemented Interfaces:

java.io.Externalizable, java.io.Serializable, java.lang.Cloneable, Freezable<Chunk>

Direct Known Subclasses:

CategoricalWrappedVec.CategoricalWrappedChunk, InteractionWrappedVec.InteractionWrappedChunk, NewChunk, SubsetChunk, TransformWrappedVec.TransformWrappedChunk
```
public abstract class Chunk
extends Iced<Chunk>
```
A compression scheme, over a chunk of data - a single array of bytes. Chunks are mapped many-to-1 to a Vec. The actual vector header info is in the Vec - which contains info to find all the bytes of the distributed vector. Subclasses of this abstract class implement (possibly empty) compression schemes.
Chunks are collections of elements, and support an array-like API. Chunks are subsets of a Vec; while the elements in a Vec are numbered starting at 0, any given Chunk has some (probably non-zero) starting row, and a length which is smaller than the whole Vec. Chunks are limited to a single Java byte array in a single JVM heap, and only an int's worth of elements. Chunks support both the notions of a global row-number and a chunk-local numbering. The global row-number calls are variants of at and set. If the row is outside the current Chunk's range, the data will be loaded by fetching from the correct Chunk. This probably involves some network traffic, and if all rows are loaded then the entire dataset will be pulled local (possibly triggering an OutOfMemory).
The chunk-local numbering supports the common for loop iterator pattern, using the chunk-relative at and set calls (such as atd(int) and set(int, double)), and is faster than the global row-numbering for tight loops (because it avoids some range checks):
```
  for( int row=0; row < chunk._len; row++ ) {
    double d = chunk.atd(row);   // chunk-relative load, NaN if missing
    // ... use d ...
  }
```
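The relationship between global and chunk-local row numbers can be sketched in plain Java without any H2O classes. The `RowLocator` class and its `starts` array (the global starting row of each chunk, followed by the total row count) are hypothetical stand-ins for illustration only; they are not part of the H2O API, and the sketch assumes `starts[0]` is 0.

```java
// Hypothetical stand-in, not an H2O class: maps a global row number to
// a (chunk index, chunk-local row) pair.
class RowLocator {
  // starts[i] is the global starting row of chunk i; the last entry is
  // the total number of rows, so there are starts.length-1 chunks.
  private final long[] starts;

  RowLocator(long[] starts) { this.starts = starts; }

  // Index of the chunk that holds the given global row.
  int chunkIndex(long globalRow) {
    if (globalRow < 0 || globalRow >= starts[starts.length - 1])
      throw new ArrayIndexOutOfBoundsException("row " + globalRow);
    int lo = 0, hi = starts.length - 1;  // invariant: starts[lo] <= row < starts[hi]
    while (lo + 1 < hi) {
      int mid = (lo + hi) >>> 1;
      if (globalRow >= starts[mid]) lo = mid; else hi = mid;
    }
    return lo;
  }

  // Chunk-local row: global row minus the chunk's starting row.
  int localRow(long globalRow) {
    return (int) (globalRow - starts[chunkIndex(globalRow)]);
  }
}
```

With three chunks starting at rows 0, 100 and 250 and 400 rows in total, global row 399 lands in chunk 2 at local row 149. The lookup on every access is the kind of per-element overhead that the chunk-local loop avoids.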
The array-like API allows loading and storing elements in and out of Chunks. When loading, values are decompressed. When storing, an attempt to compress back into the actual underlying Chunk subclass is made; if this fails the Chunk is "inflated" into a NewChunk, and the store completed there. Later the NewChunk will be compressed (probably into a different underlying Chunk subclass) and put back in the K/V store under the same Key - effectively replacing the original Chunk; this is done when close(int, water.Futures) is called, and is taken care of by the standard MRTask calls.
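The store path described above (write in place when the value fits, otherwise inflate and recompress later) can be illustrated with a toy model. `ToyChunk` is a hypothetical class written for this sketch, not H2O code: its "compression scheme" is simply one byte per row, its inflated form is a `long[]` standing in for a NewChunk, and its `close()` takes no arguments and does not touch any K/V store.

```java
// Toy model of the store path described above; not H2O code.
class ToyChunk {
  private byte[] packed;    // compressed form: one byte per row
  private long[] inflated;  // uncompressed form, null until a store fails to fit

  ToyChunk(byte[] packed) { this.packed = packed; }

  long at8(int row) { return inflated != null ? inflated[row] : packed[row]; }

  boolean isInflated() { return inflated != null; }

  // Store in place when the value fits in a byte, otherwise inflate first.
  long set(int row, long v) {
    if (inflated == null && v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
      packed[row] = (byte) v;
      return v;
    }
    if (inflated == null) {             // one-time inflation
      inflated = new long[packed.length];
      for (int i = 0; i < packed.length; i++) inflated[i] = packed[i];
      packed = null;
    }
    inflated[row] = v;
    return v;
  }

  // Recompress if every value fits in a byte again; returns true when the
  // chunk ends up in the packed form.
  boolean close() {
    if (inflated == null) return true;
    for (long v : inflated)
      if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE) return false;
    packed = new byte[inflated.length];
    for (int i = 0; i < inflated.length; i++) packed[i] = (byte) inflated[i];
    inflated = null;
    return true;
  }
}
```

The point of the sketch is the cost model: a store that fits is a single array write, while a store that does not fit copies every row once and defers the recompression work to the close step.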
Chunk updates are not multi-thread safe; the caller must do correct synchronization. This is already handled by the Map/Reduce (MRTask) framework. Chunk updates are not visible cross-cluster until close(int, water.Futures) is called; again this is handled by MRTask directly.
In addition to normal load and store operations, Chunks support the notion of a missing element via the isNA_abs() calls, and a "next non-zero" notion for rapidly iterating over sparse data.
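The "next non-zero" idea can be sketched with a toy sparse column that stores only its non-zero rows. `ToySparseColumn` is hypothetical and not an H2O class; in particular, starting the iteration at -1 and using the column length as the end sentinel are choices made for this sketch, not documented behavior of `nextNZ`.

```java
// Toy sparse column, not an H2O class: only non-zero rows are stored.
class ToySparseColumn {
  private final int len;        // logical number of rows
  private final int[] ids;      // sorted chunk-local rows holding non-zeros
  private final double[] vals;  // vals[k] is the value at row ids[k]

  ToySparseColumn(int len, int[] ids, double[] vals) {
    this.len = len; this.ids = ids; this.vals = vals;
  }

  // First stored row strictly after 'row', or len when none remains.
  // Pass -1 to find the first stored row.
  int nextNZ(int row) {
    int lo = 0, hi = ids.length;   // find the first k with ids[k] > row
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (ids[mid] > row) hi = mid; else lo = mid + 1;
    }
    return lo < ids.length ? ids[lo] : len;
  }

  // Value at a row: stored value if present, otherwise zero.
  double atd(int row) {
    int k = java.util.Arrays.binarySearch(ids, row);
    return k >= 0 ? vals[k] : 0.0;
  }

  // Sum all values while visiting only the stored rows.
  double sum() {
    double s = 0;
    for (int row = nextNZ(-1); row < len; row = nextNZ(row)) s += atd(row);
    return s;
  }
}
```

For a column of 10 rows with non-zeros at rows 2, 5 and 9, the loop in `sum()` visits three rows instead of ten.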
Data Types
Chunks hold Java primitive values, timestamps, UUIDs, or Strings. All the Chunks in a Vec hold the same type. Most of the types are compressed. Integer types (boolean, byte, short, int, long) are always lossless. Float and Double types might lose 1 or 2 ulps in the compression. Time data is held as milliseconds since the Unix Epoch. UUIDs are held as 128-bit integers (a pair of Java longs). Strings are compressed in various obvious ways. Sparse data is held... sparsely; e.g. loading data in SVMLight format will not "blow up" the in-memory representation. Categoricals/factors are held as small integers, with a shared String lookup table on the side.
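Three of the encodings described above map directly onto standard JDK calls, which the following sketch uses. `ToyEncodings` is a hypothetical helper written for illustration; it shows the value representations only and says nothing about how H2O lays them out inside a Chunk.

```java
// Plain-JDK illustration of the value encodings described above; not H2O code.
class ToyEncodings {
  // Time: milliseconds since the Unix Epoch.
  static long timeToMillis(java.time.Instant t) { return t.toEpochMilli(); }

  // UUID: a pair of longs {high half, low half}.
  static long[] uuidToLongs(java.util.UUID u) {
    return new long[]{ u.getMostSignificantBits(), u.getLeastSignificantBits() };
  }

  // Rebuild the UUID from its two halves.
  static java.util.UUID longsToUuid(long hi, long lo) { return new java.util.UUID(hi, lo); }

  // Categorical: each value becomes a small integer index into a shared
  // String table (the domain); -1 marks a value that is not in the table.
  static int[] encode(String[] values, String[] domain) {
    java.util.List<String> table = java.util.Arrays.asList(domain);
    int[] codes = new int[values.length];
    for (int i = 0; i < values.length; i++) codes[i] = table.indexOf(values[i]);
    return codes;
  }
}
```

Holding categoricals as integer codes means the strings themselves are stored once in the lookup table, however many rows refer to them.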
Chunks support the notion of missing data. Missing float and double data is always treated as a NaN, both if read or written. There is no equivalent of NaN for integer data; reading a missing integer value is a coding error and will be flagged. If you are working with integer data with missing elements, you must first check for a missing value before loading it:
```
  if( !chk.isNA(row) ) {
    long v = chk.at8(row);   // safe: the value is known to be present
    // ... use v ...
  }
```
The same holds true for the other non-real types (timestamps, UUIDs, Strings, or categoricals); they must be checked for missing before being used.
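The check-before-load rule can be exercised on a toy integer column. `ToyLongColumn` is a hypothetical stand-in, not an H2O class; the separate `missing` array and the `IllegalArgumentException` thrown on a missing read are assumptions of this sketch, not a description of how Chunk flags the error.

```java
// Toy integer column with missing values, not an H2O class.
class ToyLongColumn {
  private final long[] vals;
  private final boolean[] missing;  // missing[row] is true when the row has no value

  ToyLongColumn(long[] vals, boolean[] missing) { this.vals = vals; this.missing = missing; }

  int len() { return vals.length; }

  boolean isNA(int row) { return missing[row]; }

  // There is no integer NaN, so reading a missing row is an error.
  long at8(int row) {
    if (missing[row]) throw new IllegalArgumentException("row " + row + " is missing");
    return vals[row];
  }

  // Sum of the present values: check isNA before every at8.
  long sumPresent() {
    long s = 0;
    for (int row = 0; row < len(); row++)
      if (!isNA(row)) s += at8(row);
    return s;
  }
}
```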
Performance Concerns
The standard for loop mentioned above is the fastest way to access data; definitely faster (and less error prone) than iterating over global row numbers. Iterating over a single Chunk is nearly always memory-bandwidth bound. Often code will iterate over a number of Chunks aligned together (the common use-case of looking at whole rows of a dataset). Again, typically such a code pattern is memory-bandwidth bound although the X86 will stop being able to prefetch well beyond 100 or 200 Chunks.
Note that Chunk alignment is guaranteed within all the Vecs of a Frame: Same numbered Chunks of different Vecs will have the same global row numbering and the same length, enabling a particularly simple and efficient way to iterate over all rows.
This example computes the squared Euclidean distance between each row (taken across all columns except the last) and a given point, and stores it back in the last column. Note that due to "NaN poisoning", if any row element is missing, the entire distance calculated will be NaN.
```
final double[] _point;                             // The given point
public void map( Chunk[] chks ) {                  // Map over a set of same-numbered Chunks
  for( int row=0; row < chks[0]._len; row++ ) {    // For all rows
    double dist=0;                                 // Squared distance
    for( int col=0; col < chks.length-1; col++ ) { // For all cols, except the last output col
      double d = chks[col].atd(row) - _point[col]; // Distance along this dimension
      dist += d*d;                                 // Sum-squared-distance
    }
    chks[chks.length-1].set( row, dist );          // Store back the distance in the last col
  }
}
```
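The NaN poisoning mentioned above is ordinary IEEE 754 arithmetic and can be checked in plain Java. `NanPoisoning` is a hypothetical class for illustration that works on `double[]` rows rather than Chunks.

```java
// Plain-Java illustration of NaN poisoning; not H2O code.
class NanPoisoning {
  // Squared Euclidean distance between a row and a point of equal length.
  static double squaredDistance(double[] row, double[] point) {
    double dist = 0;
    for (int col = 0; col < point.length; col++) {
      double d = row[col] - point[col];  // NaN if row[col] is NaN
      dist += d * d;                     // a NaN term makes the whole sum NaN
    }
    return dist;
  }
}
```

A row of (3, 4) against the origin gives 25, while a row of (3, NaN) gives NaN: a single missing element is enough to mark the whole result as missing, with no explicit check in the loop.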
See Also:
Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`int`	`_len` Number of rows in this Chunk; publicly a read-only field.

Constructor Summary

Constructors
Constructor and Description
`Chunk()`

Method Summary

Methods
Modifier and Type	Method and Description
`byte[]`	`asBytes()` Return serialized version of self as a byte array.
`int`	`asSparseDoubles(double[] vals, int[] ids)` Sparse bulk interface: stream through the compressed values and extract them into a dense double array.
`int`	`asSparseDoubles(double[] vals, int[] ids, double NA)`
`long`	`at16h(int i)` High half of a 128-bit UUID, or throws if the value is missing.
`long`	`at16l(int i)` Low half of a 128-bit UUID, or throws if the value is missing.
`long`	`at8(int i)` Load a `long` value using chunk-relative row numbers.
`double`	`atd(int i)` Load a `double` value using chunk-relative row numbers.
`BufferedString`	`atStr(BufferedString bStr, int i)` String value using chunk-relative row numbers, or null if missing.
`long`	`byteSize()` In memory size in bytes of the compressed Chunk plus embedded array.
`Chunk`	`chk2()` Exposed for internal testing only.
`int`	`cidx()`
`Futures`	`close(int cidx, Futures fs)` After writing we must call close() to register the bulk changes.
`void`	`crushBytes()` Used by a ParseExceptionTest to break the Chunk invariants and trigger an NPE.
`Chunk`	`deepCopy()`
`byte[]`	`getBytes()` Short-cut to the embedded big-data memory.
`double[]`	`getDoubles(double[] vals, int[] ids)` Dense bulk interface, fetch values from the given ids
`double[]`	`getDoubles(double[] vals, int from, int to)` Dense bulk interface, fetch values from the given range
`double[]`	`getDoubles(double[] vals, int from, int to, double NA)`
`int[]`	`getIntegers(int[] vals, int from, int to, int NA)`
`boolean`	`hasFloat()`
`boolean`	`hasNA()`
`abstract NewChunk`	`inflate_impl(NewChunk nc)` Chunk-specific bulk inflater back to NewChunk.
`NewChunk`	`inflate()`
`protected abstract void`	`initFromBytes()`
`boolean`	`isNA(int i)` Missing value status using chunk-relative row numbers.
`boolean`	`isSparseNA()` Sparse Chunks have a significant number of NAs, and support for skipping over large runs of NAs in a row.
`boolean`	`isSparseZero()` Sparse Chunks have a significant number of zeros, and support for skipping over large runs of zeros in a row.
`int`	`len()` Read-only length of chunk (number of rows).
`Chunk`	`nextChunk()` Return the next Chunk, or null if at end.
`int`	`nextNNA(int rid)`
`int`	`nextNZ(int rid)`
`int`	`nonnas(int[] res)` Get chunk-relative indices of values (nonnas for nasparse, all for dense) stored in this chunk.
`int`	`nonzeros(int[] res)` Get indices of non-zero values stored in this chunk.
`byte`	`precision()` Fixed-width format printing support.
`Chunk`	`read_impl(AutoBuffer ab)`
`Chunk`	`reloadFromBytes(byte[] ary)` Replace yourself with deserialized version from the given bytes.
`void`	`replaceAll(Chunk replacement)` Replace all rows with this new chunk
`void`	`reportBrokenCategorical(int i, int j, long l, int[] cmap, int levels)` Used by the parser to help report various internal bugs.
`void`	`set_abs(long i, java.lang.String str)` Set a `String`, using absolute row numbers.
`double[]`	`set(double[] d)`
`double`	`set(int idx, double d)` Write a `double` with chunk-relative indexing.
`float`	`set(int idx, float f)` Write a `float` with chunk-relative indexing.
`long`	`set(int idx, long l)` Write a `long` with chunk-relative indexing.
`java.lang.String`	`set(int idx, java.lang.String str)` Write a `String` with chunk-relative indexing.
`void`	`setBytes(byte[] mem)`
`boolean`	`setNA(int idx)` Set a value as missing.
`void`	`setStart(long start)` Set the start
`void`	`setVec(Vec vec)` Set the owning Vec
`int`	`sparseLenNA()` Sparse Chunks have a significant number of NAs, and support for skipping over large runs of NAs in a row.
`int`	`sparseLenZero()` Sparse Chunks have a significant number of zeros, and support for skipping over large runs of zeros in a row.
`long`	`start()` Global starting row for this local Chunk
`java.lang.String`	`toString()`
`Vec`	`vec()` Owning Vec
`AutoBuffer`	`write_impl(AutoBuffer bb)` Custom serializers implemented by Chunk subclasses: the _mem field contains ALL the fields already.

Methods inherited from class water.Iced
clone, copyOver, frozenType, read, readExternal, readJSON, toJsonString, write, writeExternal, writeJSON

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - _len
```
public transient int _len
```
    Number of rows in this Chunk; publicly a read-only field. Odd API design choice: public, not-final, read-only, NO-ACCESSOR.
    NO-ACCESSOR: This is a high-performance field, and must have a known zero-cost cost-model; accessors hide that cost model, and make it not-obvious that a loop will be properly optimized or not.
    not-final: set in various deserializers.
    Proper usage: read the field, probably in a hot loop.
```
  for( int row=0; row < chunk._len; row++ ) {
    double d = chunk.atd(row);   // chunk-relative load, NaN if missing
    // ... use d ...
  }
```
- Constructor Detail
  - Chunk
```
public Chunk()
```
- Method Detail
  - asSparseDoubles
```
public int asSparseDoubles(double[] vals,
                  int[] ids)
```
    Sparse bulk interface: stream through the compressed values and extract them into a dense double array.
    
    Parameters:
    vals - holds extracted values, length must be >= this.sparseLen()
    ids - holds extracted chunk-relative row ids, length must be >= this.sparseLen()
    
    Returns:
    number of extracted (non-zero) elements, equal to sparseLen()
  - asSparseDoubles
```
public int asSparseDoubles(double[] vals,
                  int[] ids,
                  double NA)
```
  - getDoubles
```
public double[] getDoubles(double[] vals,
                  int from,
                  int to)
```
    Dense bulk interface, fetch values from the given range
    
    Parameters:
    vals -
    from -
    to -
  - getDoubles
```
public double[] getDoubles(double[] vals,
                  int from,
                  int to,
                  double NA)
```
  - getIntegers
```
public int[] getIntegers(int[] vals,
                int from,
                int to,
                int NA)
```
  - getDoubles
```
public double[] getDoubles(double[] vals,
                  int[] ids)
```
    Dense bulk interface, fetch values from the given ids
    
    Parameters:
    vals -
    ids -
  - start
```
public final long start()
```
    Global starting row for this local Chunk
  - len
```
public int len()
```
    Read-only length of chunk (number of rows).
  - chk2
```
public Chunk chk2()
```
    Exposed for internal testing only. Not a publicly visible API.
  - vec
```
public Vec vec()
```
    Owning Vec
  - setVec
```
public void setVec(Vec vec)
```
    Set the owning Vec
  - setStart
```
public void setStart(long start)
```
    Set the start
  - getBytes
```
public byte[] getBytes()
```
    Short-cut to the embedded big-data memory. Generally not useful for public consumption, since the data remains compressed and holding on to a pointer to this array defeats the user-mode spill-to-disk.
  - setBytes
```
public void setBytes(byte[] mem)
```
  - crushBytes
```
public final void crushBytes()
```
    Used by a ParseExceptionTest to break the Chunk invariants and trigger an NPE. Not intended for public use.
  - atd
```
public final double atd(int i)
```
    Load a double value using chunk-relative row numbers. Returns Double.NaN if value is missing.
    
    Returns:
    double value at the given row, or NaN if the value is missing
  - at8
```
public final long at8(int i)
```
    Load a long value using chunk-relative row numbers. Floating point values are silently rounded to a long. Throws if the value is missing.
    
    Returns:
    long value at the given row, or throw if the value is missing
  - isNA
```
public final boolean isNA(int i)
```
    Missing value status using chunk-relative row numbers.
    
    Returns:
    true if the value is missing
  - at16l
```
public final long at16l(int i)
```
    Low half of a 128-bit UUID, or throws if the value is missing.
    
    Returns:
    Low half of a 128-bit UUID, or throws if the value is missing.
  - at16h
```
public final long at16h(int i)
```
    High half of a 128-bit UUID, or throws if the value is missing.
    
    Returns:
    High half of a 128-bit UUID, or throws if the value is missing.
  - atStr
```
public final BufferedString atStr(BufferedString bStr,
                   int i)
```
    String value using chunk-relative row numbers, or null if missing.
    
    Returns:
    String value or null if missing.
  - set_abs
```
public final void set_abs(long i,
           java.lang.String str)
```
    Set a String, using absolute row numbers.
    As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. i.e., there is some interesting cost if Chunk compression-types need to change.
    This version uses absolute element numbers, but must convert them to chunk-relative indices - requiring a load from an aliasing local var, leading to lower quality JIT'd code (similar issue to using iterator objects).
  - hasFloat
```
public boolean hasFloat()
```
  - hasNA
```
public boolean hasNA()
```
  - replaceAll
```
public void replaceAll(Chunk replacement)
```
    Replace all rows with this new chunk
  - deepCopy
```
public Chunk deepCopy()
```
  - set
```
public final long set(int idx,
       long l)
```
    Write a long with chunk-relative indexing. There is no way to write a missing value with this call. Under rare circumstances this can throw: if the long does not fit in a double (value is larger magnitude than 2^52), AND float values are stored in the Vec. In this case, there is no common compatible data representation.
    As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. i.e., there is some interesting cost if Chunk compression-types need to change.
    
    Returns:
    the set value
  - set
```
public final double[] set(double[] d)
```
  - set
```
public final double set(int idx,
         double d)
```
    Write a double with chunk-relative indexing. NaN will be treated as a missing value.
    As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. i.e., there is some interesting cost if Chunk compression-types need to change.
    
    Returns:
    the set value
  - set
```
public final float set(int idx,
        float f)
```
    Write a float with chunk-relative indexing. NaN will be treated as a missing value.
    As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. i.e., there is some interesting cost if Chunk compression-types need to change.
    
    Returns:
    the set value
  - setNA
```
public final boolean setNA(int idx)
```
    Set a value as missing.
    As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. i.e., there is some interesting cost if Chunk compression-types need to change.
    
    Returns:
    the set value
  - set
```
public final java.lang.String set(int idx,
                   java.lang.String str)
```
    Write a String with chunk-relative indexing. null will be treated as a missing value.
    As with all the set calls, if the value written does not fit in the current compression scheme, the Chunk will be inflated into a NewChunk and the value written there. Later, the NewChunk will be compressed (after a close(int, water.Futures) call) and written back to the DKV. i.e., there is some interesting cost if Chunk compression-types need to change.
    
    Returns:
    the set value
  - close
```
public Futures close(int cidx,
            Futures fs)
```
    After writing we must call close() to register the bulk changes. If a NewChunk was needed, it will be compressed into some other kind of Chunk. The resulting Chunk (either a modified self, or a compressed NewChunk) will be written to the DKV. Only after that DKV.put completes will all readers of this Chunk witness the changes.
    
    Returns:
    the passed-in Futures, for flow-coding.
  - cidx
```
public int cidx()
```
    Returns:
    Chunk index
  - isSparseZero
```
public boolean isSparseZero()
```
    Sparse Chunks have a significant number of zeros, and support for skipping over large runs of zeros in a row.
    
    Returns:
    true if this Chunk is sparse.
  - sparseLenZero
```
public int sparseLenZero()
```
    Sparse Chunks have a significant number of zeros, and support for skipping over large runs of zeros in a row.
    
    Returns:
    At least as large as the count of non-zeros, but may be significantly smaller than the _len
  - nextNZ
```
public int nextNZ(int rid)
```
  - nonzeros
```
public int nonzeros(int[] res)
```
    Get indices of non-zero values stored in this chunk.
    
    Returns:
    array of chunk-relative indices of values stored in this chunk.
  - isSparseNA
```
public boolean isSparseNA()
```
    Sparse Chunks have a significant number of NAs, and support for skipping over large runs of NAs in a row.
    
    Returns:
    true if this Chunk is sparseNA.
  - sparseLenNA
```
public int sparseLenNA()
```
    Sparse Chunks have a significant number of NAs, and support for skipping over large runs of NAs in a row.
    
    Returns:
    At least as large as the count of non-NAs, but may be significantly smaller than the _len
  - nextNNA
```
public int nextNNA(int rid)
```
  - nonnas
```
public int nonnas(int[] res)
```
    Get chunk-relative indices of values (nonnas for nasparse, all for dense) stored in this chunk. For dense chunks, this will contain indices of all the rows in this chunk.
    
    Returns:
    array of chunk-relative indices of values stored in this chunk.
  - inflate
```
public NewChunk inflate()
```
  - inflate_impl
```
public abstract NewChunk inflate_impl(NewChunk nc)
```
    Chunk-specific bulk inflater back to NewChunk. Used when writing into a chunk and written value is out-of-range for an update-in-place operation. Bulk copy from the compressed form into the nc._ls8 array.
  - nextChunk
```
public Chunk nextChunk()
```
    Return the next Chunk, or null if at end. Mostly useful for parsers or optimized stencil calculations that want to "roll off the end" of a Chunk, but in a highly optimized way.
  - toString
```
public java.lang.String toString()
```
    Overrides:
    
    toString in class java.lang.Object
    
    Returns:
    String version of a Chunk, currently just the class name
  - byteSize
```
public long byteSize()
```
    In memory size in bytes of the compressed Chunk plus embedded array.
  - write_impl
```
public final AutoBuffer write_impl(AutoBuffer bb)
```
    Custom serializers implemented by Chunk subclasses: the _mem field contains ALL the fields already.
  - asBytes
```
public final byte[] asBytes()
```
    Description copied from interface: Freezable
    
    Return serialized version of self as a byte array. Useful for Freezables directly supported by byte array (@see Chunk). In most cases, just use the AutoBuffer version.
    
    Specified by:
    
    asBytes in interface Freezable<Chunk>
    
    Overrides:
    
    asBytes in class Iced<Chunk>
    
    Returns:
    serialized bytes
  - reloadFromBytes
```
public final Chunk reloadFromBytes(byte[] ary)
```
    Description copied from interface: Freezable
    
    Replace yourself with deserialized version from the given bytes. Useful for Freezables directly supported by byte array (@see Chunk). In most cases, just use the AutoBuffer version.
    
    Specified by:
    
    reloadFromBytes in interface Freezable<Chunk>
    
    Overrides:
    
    reloadFromBytes in class Iced<Chunk>
    
    Parameters:
    ary - byte array containing exactly (i.e. nothing else) the serialized version of the Freezable
    
    Returns:
    this freshly reloaded from the given bytes.
  - initFromBytes
```
protected abstract void initFromBytes()
```
  - read_impl
```
public final Chunk read_impl(AutoBuffer ab)
```
  - precision
```
public byte precision()
```
    Fixed-width format printing support. Filled in by the subclasses.
  - reportBrokenCategorical
```
public final void reportBrokenCategorical(int i,
                           int j,
                           long l,
                           int[] cmap,
                           int levels)
```
    Used by the parser to help report various internal bugs. Not intended for public use.
