Data¶
How should I format my SVMLight data before importing?¶
The data must be formatted as a sorted list of unique integers, the column indices must be >= 1, and the columns must be in ascending order.
Does H2O provide native sparse support?¶
Sparse data is supported natively by loading a sparse matrix from an SVMLight file. In addition, H2O includes a direct conversion of a sparse matrix to an H2O Frame in Python via the h2o.H2OFrame()
method and in R via the as.h2o()
function. For sparse data, H2O writes a sparse matrix to SVMLight format and then loads it back in using h2o.import_file
in Python or h2o.importFile
with parse_type=SVMLight
in R.
In R, a sparse matrix is specified using Matrix::sparseMatrix
along with the boolean flag, sparse=TRUE
. For example:
data <- rep(0, 100) data[(1:10)^2] <- 1:10 * pi m <- matrix(data, ncol = 20, byrow = TRUE) m <- Matrix::Matrix(m, sparse = TRUE) h2o.matrix <- as.h2o(m, "sparse_matrix")
In Python, a sparse matrix is specified using scipy.parse
. For example:
import scipy.sparse as sp A = sp.csr_matrix([[1, 2, 0, 5.5], [0, 0, 3, 6.7], [4, 0, 5, 0]]) fr = h2o.H2OFrame(A) A = sp.lil_matrix((1000, 1000)) A.setdiag(10) for i in range(999): A[i, i + 1] = -3 A[i + 1, i] = -2 fr = h2o.H2OFrame(A)
What date and time formats does H2O support?¶
H2O is set to auto-detect two major date/time formats. Because many date time formats are ambiguous (e.g. 01/02/03), general date time detection is not used.
The first format is for dates formatted as yyyy-MM-dd. Year is a four-digit number, the month is a two-digit number ranging from 1 to 12, and the day is a two-digit value ranging from 1 to 31. This format can also be followed by a space and then a time (specified below).
The second date format is for dates formatted as dd-MMM-yy. Here the day must be one or two digits with a value ranging from 1 to 31. The month must be either a three-letter abbreviation or the full month name but is not case sensitive. The year must be either two or four digits. In agreement with POSIX standards, two-digit dates >= 69 are assumed to be in the 20th century (e.g. 1969) and the rest are part of the 21st century. This date format can be followed by either a space or colon character and then a time. The ‘-‘ between the values is optional.
Times are specified as HH:mm:ss. HH is a two-digit hour and must be a value between 0-23 (for 24-hour time) or 1-12 (for a twelve-hour clock). mm is a two-digit minute value and must be a value between 0-59. ss is a two-digit second value and must be a value between 0-59. This format can be followed with either milliseconds, nanoseconds, and/or the cycle (i.e. AM/PM). If milliseconds are included, the format is HH:mm:ss:SSS. If nanoseconds are included, the format is HH:mm:ss:SSSnnnnnn. H2O only stores fractions of a second up to the millisecond, so accuracy may be lost. Nanosecond parsing is only included for convenience. Finally, a valid time can end with a space character and then either “AM” or “PM”. For this format, the hours must range from 1 to 12. Within the time, the ‘:’ character can be replaced with a ‘.’ character.
How does H2O handle name collisions/conflicts in the dataset?¶
If there is a name conflict (for example, column 48 isn’t named, but C48 already exists), then the column name in concatenated to itself until a unique name is created. So for the previously cited example, H2O will try renaming the column to C48C48, then C48C48C48, and so on until an unused name is generated.
What types of data columns does H2O support?¶
Currently, H2O supports:
- float (any IEEE double)
- integer (up to 64bit, but compressed according to actual range)
- factor (same as integer, but with a String mapping, often handled differently in the algorithms)
- time (same as 64bit integer, but with a time-since-Unix-epoch interpretation)
- UUID (128bit integer, no math allowed)
- String
Do binary variables count as numeric?¶
When H2O parses a file, it uses rules to determine whether a column is numeric, enum, etc. If you need a column to be read as a specific type, especially the columns that can be interpreted as enum or numeric, you can use the col_types
(Python)/col.types
(R) when you import a file.
If you decide to make a column binary, that column will be treated as binary. It may be represented as a numeric, but it will not have numerical meanings such as less than, greater than or equal to, etc.
I am trying to parse a Gzip data file containing multiple files, but it does not parse as quickly as the uncompressed files. Why is this?¶
Parsing Gzip files is not done in parallel, so it is sequential and uses only one core. Other parallel parse compression schemes are on the roadmap.