public abstract class FileVec extends ByteVec
Nested classes/interfaces inherited from class Vec:
Vec.ESPC, Vec.Reader, Vec.VectorGroup, Vec.Writer
Modifier and Type | Field and Description |
---|---|
int | _chunkSize |
static int | DFLT_CHUNK_SIZE Default Chunk size in bytes, useful when breaking up large arrays into "bite-sized" chunks. |
static int | DFLT_LOG2_CHUNK_SIZE Log-2 of Chunk size. |
Fields inherited from class Vec:
_rowLayout, DO_HISTOGRAMS, KEY_PREFIX_LEN, PERCENTILES, T_BAD, T_CAT, T_NUM, T_STR, T_TIME, T_UUID, TYPE_STR
Modifier | Constructor and Description |
---|---|
protected | FileVec(Key key, long len, byte be) |
Modifier and Type | Method and Description |
---|---|
long | byteSize() Size of vector data. |
static int | calcOptimalChunkSize(long totalSize, int numCols, long maxLineLength, int cores, int cloudsize, boolean oldHeuristic, boolean verbose) Calculates safe and hopefully optimal chunk sizes. |
Value | chunkIdx(int cidx) Get a Chunk's Value by index. |
static long | chunkOffset(Key ckey) Convert a chunk-key to a file offset. |
int | elem2ChunkIdx(long i) Convert a row# to a chunk#. |
long | length() Number of elements in the vector; returned as a long instead of an int because Vecs support more than 2^32 elements. |
int | nChunks() Number of chunks, returned as an int - Chunk count is limited by the max size of a Java long[]. |
int | setChunkSize(Frame fr, int chunkSize) |
int | setChunkSize(int chunkSize) Chunk size must be positive, 1G or less, and a power of two. |
boolean | writable() Default read/write behavior for Vecs. |
Methods inherited from class ByteVec:
chunkForChunkIdx, getFirstBytes, getPreviewChunkBytes, isInt, naCnt, openStream
Methods inherited from class Vec:
adaptTo, align, at, at16h, at16l, at8, atStr, base, bins, cardinality, checksum_impl, chunkForRow, chunkKey, chunkKey, copyMeta, doCopy, domain, equals, espc, factor, get_type_str, get_type, getVecKey, group, hashCode, isBad, isBinary, isCategorical, isConst, isNA, isNumeric, isString, isTime, isUUID, lazy_bins, makeCon, makeCon, makeCon, makeCon, makeCon, makeCon, makeCon, makeCon, makeCons, makeCons, makeCopy, makeCopy, makeCopy, makeDoubles, makeRand, makeRepSeq, makeSeq, makeSeq, makeSeq, makeVec, makeVec, makeVec, makeZero, makeZero, makeZero, makeZero, makeZeros, makeZeros, max, maxs, mean, min, mins, mode, newKey, ninfs, nzCnt, open, pctiles, pinfs, postWrite, preWriting, readAll_impl, remove_impl, rollupStatsKey, set, set, set, set, setBad, setDomain, sigma, sparseRatio, startRollupStats, startRollupStats, stride, toCategoricalVec, toNumericVec, toString, toStringVec, writeAll_impl
Methods inherited from class Keyed:
checksum, makeSchema, readAll, remove, remove, remove, remove, writeAll
Methods inherited from class Iced:
asBytes, clone, copyOver, frozenType, read, readExternal, readJSON, reloadFromBytes, toJsonString, write, writeExternal, writeJSON
public static final int DFLT_LOG2_CHUNK_SIZE
public static final int DFLT_CHUNK_SIZE
public int _chunkSize
protected FileVec(Key key, long len, byte be)
public int setChunkSize(int chunkSize)
Because the optimal chunk size is not known when a FileVec is instantiated, a setter is required both to set it and to keep it in sync with _log2ChkSize.
chunkSize - requested chunk size to be used when parsing
public int setChunkSize(Frame fr, int chunkSize)
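The constraints stated for setChunkSize(int chunkSize) (positive, 1G or less, a power of two) and the need to keep the size in sync with its log-2 value can be illustrated with a minimal standalone sketch. The class ChunkSizeSketch, its fields, and the isValid helper are hypothetical names invented for this example; this is not the FileVec implementation, only one way such a setter could behave.

```java
// Hypothetical stand-in for illustration; not the H2O FileVec implementation.
final class ChunkSizeSketch {
    static final int MAX_CHUNK_SIZE = 1 << 30; // "1G or less"

    int chunkSize;      // plays the role of _chunkSize
    int log2ChunkSize;  // plays the role of _log2ChkSize

    /** True if size is positive, at most 1G, and a power of two. */
    static boolean isValid(int size) {
        return size > 0 && size <= MAX_CHUNK_SIZE && (size & (size - 1)) == 0;
    }

    /** Sets both fields together so they cannot drift apart. */
    int setChunkSize(int size) {
        if (!isValid(size)) {
            throw new IllegalArgumentException(
                "Chunk size must be positive, 1G or less, and a power of two: " + size);
        }
        // For a power of two, the count of trailing zero bits is its log-2.
        log2ChunkSize = Integer.numberOfTrailingZeros(size);
        chunkSize = size;
        return chunkSize;
    }

    public static void main(String[] args) {
        ChunkSizeSketch s = new ChunkSizeSketch();
        s.setChunkSize(4194304);
        System.out.println(s.chunkSize + " = 2^" + s.log2ChunkSize); // prints "4194304 = 2^22"
    }
}
```

Deriving the log-2 value inside the setter, rather than storing it separately, is what keeps the two fields consistent in this sketch.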
public long length()
Description copied from class: Vec
Number of elements in the vector; returned as a long instead of an int because Vecs support more than 2^32 elements. Overridden by subclasses that compute length in an alternative way, such as file-backed Vecs.
public int nChunks()
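Description copied from class: Vec
Number of chunks, returned as an int - Chunk count is limited by the max size of a Java long[]. Overridden by subclasses that compute chunks in an alternative way, such as file-backed Vecs.
public boolean writable()
Description copied from class: Vec
Default read/write behavior for Vecs.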
public int elem2ChunkIdx(long i)
Description copied from class: Vec
Convert a row# to a chunk#.
Overrides: elem2ChunkIdx in class Vec
public static long chunkOffset(Key ckey)
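The row-to-chunk and chunk-to-offset conversions named by elem2ChunkIdx and chunkOffset can be pictured with plain integer arithmetic. The sketch below assumes every chunk has the same fixed size, with a possibly shorter final chunk; the real class may treat the tail differently, and the real chunkOffset takes a Key rather than a chunk index. ChunkMathSketch and its method signatures are hypothetical and exist only for this illustration.

```java
// Hypothetical illustration of fixed-size chunk arithmetic; not the H2O code.
final class ChunkMathSketch {
    /** Chunks needed to cover len elements (ceiling division), never fewer than one. */
    static int nChunks(long len, int chunkSize) {
        return (int) Math.max(1, (len + chunkSize - 1) / chunkSize);
    }

    /** Row (element) index to the index of the chunk that holds it. */
    static int elem2ChunkIdx(long i, int chunkSize) {
        return (int) (i / chunkSize);
    }

    /** Chunk index to the file offset at which that chunk starts. */
    static long chunkOffset(int cidx, int chunkSize) {
        return (long) cidx * chunkSize; // widen first to avoid int overflow
    }

    public static void main(String[] args) {
        int chunkSize = 4194304;
        long len = 10_000_000L;
        System.out.println(nChunks(len, chunkSize));              // prints 3
        System.out.println(elem2ChunkIdx(5_000_000L, chunkSize)); // prints 1
        System.out.println(chunkOffset(2, chunkSize));            // prints 8388608
    }
}
```

The cast to long before multiplying matters because a chunk index times a chunk size can exceed the int range for large files.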
public Value chunkIdx(int cidx)
Description copied from class: Vec
Get a Chunk's Value by index, using DKV.get(). Warning: this pulls the data locally; using this call on every Chunk index on the same node will probably trigger an OOM!
public static int calcOptimalChunkSize(long totalSize, int numCols, long maxLineLength, int cores, int cloudsize, boolean oldHeuristic, boolean verbose)
Calculates safe and hopefully optimal chunk sizes, using four cases:
very small data (< 64K per core) - uses the default chunk size, and all data will be in one chunk
small data - data is partitioned so that there are at least 4 chunks per core, to help keep all cores loaded
default - chunks are 4194304 bytes
large data - if the data would create more than 2M keys per node, chunk sizes larger than DFLT_CHUNK_SIZE are issued
Too many keys can create enough overhead to blow out memory when parsing large data. # keys = (parseSize / chunkSize) * numCols. The key limit of 2M is a guessed "reasonable" number.
totalSize - parse size in bytes (across all files to be parsed)
numCols - number of columns expected in dataset
cores - number of processing cores per node
cloudsize - number of compute nodes
verbose - print the parse heuristics
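The key-count estimate used for the large-data case, # keys = (parseSize / chunkSize) * numCols, can be worked through in a short sketch. KeyCountSketch and growChunkSize are hypothetical names. Dividing the total size evenly across nodes and doubling the chunk size until the estimate fits are assumptions made for this illustration, not the documented heuristic of calcOptimalChunkSize.

```java
// Hypothetical illustration of the key-count estimate; not the H2O heuristic itself.
final class KeyCountSketch {
    static final long KEY_LIMIT = 2_000_000L;      // the guessed "reasonable" 2M key limit
    static final int DEFAULT_CHUNK_SIZE = 4194304; // default figure quoted in the text
    static final int MAX_CHUNK_SIZE = 1 << 30;     // assumed upper bound of 1G

    /** # keys = (parseSize / chunkSize) * numCols */
    static long estimateKeys(long parseSize, int chunkSize, int numCols) {
        return (parseSize / chunkSize) * numCols;
    }

    /** Doubles the chunk size from the default until the per-node estimate is within the limit. */
    static int growChunkSize(long totalSize, int numCols, int cloudsize) {
        long perNode = totalSize / cloudsize; // assumption: data is spread evenly over nodes
        int chunkSize = DEFAULT_CHUNK_SIZE;
        while (estimateKeys(perNode, chunkSize, numCols) > KEY_LIMIT
                && chunkSize < MAX_CHUNK_SIZE) {
            chunkSize <<= 1; // assumption: grow by powers of two
        }
        return chunkSize;
    }

    public static void main(String[] args) {
        // 1 GiB, 10 columns: 256 chunks per column, 2560 keys in total.
        System.out.println(estimateKeys(1L << 30, DEFAULT_CHUNK_SIZE, 10)); // prints 2560
        // 1 TiB, 1000 columns, 4 nodes: the default size would exceed the key limit.
        System.out.println(growChunkSize(1L << 40, 1000, 4));               // prints 268435456
    }
}
```

In the second call the default chunk size would yield about 65 million keys per node, so the sketch keeps doubling until the estimate drops to 1,024,000 at a 256 MB chunk size.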