Spark - H2O Frame Mapping

Type Mapping Between H2O H2OFrame Types and Spark DataFrame Types

For every primitive Scala type or Spark SQL type (see org.apache.spark.sql.types) that can be part of a Spark RDD or DataFrame, we provide a mapping to an H2O vector type (numeric, categorical, string, time, or UUID; see water.fvec.Vec):

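==================  =============  ======================
Scala type          SQL type       H2O type
==================  =============  ======================
NA                  BinaryType     Numeric
Byte                ByteType       Numeric
Short               ShortType      Numeric
Integer             IntegerType    Numeric
Long                LongType       Numeric
Float               FloatType      Numeric
Double              DoubleType     Numeric
String              StringType     String/Categorical [1]
Boolean             BooleanType    Categorical [2]
java.sql.Timestamp  TimestampType  Time
==================  =============  ======================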
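The mapping in the table can also be written down as a small lookup, which is sometimes handy for checking a schema before conversion. The following is only a sketch in plain Scala: the object name TypeMappingSketch and the function h2oTypeFor are hypothetical, the keys are the SQL type names as strings rather than the actual Spark classes, and nothing here calls Spark or Sparkling Water.

```scala
// Illustrative lookup only; hypothetical names, no Spark or Sparkling Water calls.
object TypeMappingSketch {
  // Keys: type names from org.apache.spark.sql.types, as listed in the table.
  // Values: the H2O vector type the table assigns to each of them.
  val sqlToH2O: Map[String, String] = Map(
    "BinaryType"    -> "Numeric",
    "ByteType"      -> "Numeric",
    "ShortType"     -> "Numeric",
    "IntegerType"   -> "Numeric",
    "LongType"      -> "Numeric",
    "FloatType"     -> "Numeric",
    "DoubleType"    -> "Numeric",
    "StringType"    -> "String/Categorical", // which of the two: see footnote 1
    "BooleanType"   -> "Categorical",        // level names: see footnote 2
    "TimestampType" -> "Time"
  )

  // Returns None for SQL types that the table does not cover.
  def h2oTypeFor(sqlTypeName: String): Option[String] =
    sqlToH2O.get(sqlTypeName)
}
```

For example, TypeMappingSketch.h2oTypeFor("LongType") yields Some("Numeric"), while a type name that is absent from the table yields None.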


Type Mapping Between H2O H2OFrame Types and RDD[T] Types

The following types are supported as type T:

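==============================================
T
==============================================
NA
Byte
Short
Integer
Long
Float
Double
String
Boolean
java.sql.Timestamp
Any Scala class extending scala.Product
org.apache.spark.mllib.regression.LabeledPoint
==============================================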

As the table specifies, Sparkling Water supports transforming any Scala class that extends Product, which includes, for example, all case classes.
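To see why a class extending Product is a natural fit for a row-to-column transformation, it helps to look at what the Product interface exposes. The sketch below uses only the Scala standard library; the case class Measurement and the helper fieldValues are hypothetical, and the code does not show how Sparkling Water itself performs the conversion.

```scala
object ProductSketch {
  // Hypothetical record type, used only for illustration.
  // Every case class extends scala.Product automatically.
  case class Measurement(id: Long, label: String, value: Double)

  // Flattens any Product into its field values, in declaration order.
  def fieldValues(p: Product): List[Any] = p.productIterator.toList

  def main(args: Array[String]): Unit = {
    val m = Measurement(1L, "a", 2.5)
    println(m.productArity)  // prints 3: one entry per constructor field
    println(fieldValues(m))  // prints List(1, a, 2.5)
  }
}
```

Tuples extend Product as well, so fieldValues((1, "x")) works the same way. Each field of such an object can plausibly be treated as one column, with productArity giving the column count.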

Footnotes

[1] The H2O type is String if the cardinality is greater than 10,000,000 or if the ratio of unique values to all values is 95% or higher; otherwise it is Categorical.

[2] The H2O categorical values are “True” and “False” for true and false, respectively.
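The two footnote rules can be expressed as a short decision function. This is a minimal sketch, not Sparkling Water's implementation: the names FootnoteRulesSketch, stringColumnType, and booleanLevel are hypothetical, and the cardinality limit in footnote 1 is read here as 10,000,000, which is an assumption and is therefore left as an overridable parameter.

```scala
object FootnoteRulesSketch {
  // Assumption: the cardinality limit in footnote 1 is taken to be 10,000,000.
  val DefaultMaxCardinality: Long = 10000000L
  // Footnote 1: a unique-to-total ratio of 95% or higher also selects String.
  val UniqueRatioLimit: Double = 0.95

  // Decides the H2O type of a string column from its value counts.
  def stringColumnType(uniqueCount: Long,
                       totalCount: Long,
                       maxCardinality: Long = DefaultMaxCardinality): String = {
    require(totalCount > 0, "totalCount must be positive")
    require(uniqueCount >= 0 && uniqueCount <= totalCount,
      "uniqueCount must lie between 0 and totalCount")
    val uniqueRatio = uniqueCount.toDouble / totalCount
    if (uniqueCount > maxCardinality || uniqueRatio >= UniqueRatioLimit) "String"
    else "Categorical"
  }

  // Footnote 2: booleans become the categorical levels "True" and "False".
  def booleanLevel(b: Boolean): String = if (b) "True" else "False"
}
```

Under these assumptions, a column with 3 distinct values in 1,000 rows comes out as Categorical, while a column in which 990 of 1,000 values are distinct comes out as String.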