H2O Frame as Spark’s Data Source¶
The way that an H2O Frame can be used as Spark’s Data Source differs between Python and Scala.
Usage in Python - PySparkling¶
Reading from H2O Frame¶
Let’s suppose we have an H2OFrame named frame.
There are two ways to load a dataframe from an H2OFrame in PySparkling:
df = spark.read.format("h2o").option("key", frame.frame_id).load()
or
df = spark.read.format("h2o").load(frame.frame_id)
Saving to H2O Frame¶
Let’s suppose we have a DataFrame named df.
There are two ways to save the dataframe as an H2OFrame in PySparkling:
df.write.format("h2o").option("key", "new_key").save()
or
df.write.format("h2o").save("new_key")
Both variants save the dataframe as an H2OFrame with the key new_key. They will not succeed if an H2OFrame with the same key already exists.
Loading & Saving Options¶
If the key is specified both with the key option and in the load/save method, the key option takes precedence:
df = spark.read.format("h2o").option("key", "key_one").load("key_two")
or
df.write.format("h2o").option("key", "key_one").save("key_two")
In both examples, key_one is used.
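The precedence rule can be pictured with a small pure-Python sketch. The function resolve_frame_key is hypothetical and only models the documented behavior; it is not how Sparkling Water resolves the key internally.

```python
from typing import Optional


def resolve_frame_key(option_key: Optional[str], path_key: Optional[str]) -> str:
    """Illustrative model only: pick the H2OFrame key the way the text describes.

    option_key stands for the value passed via .option("key", ...),
    path_key for the argument passed to load(...) or save(...).
    """
    if option_key is not None:
        # The "key" option wins whenever it is present.
        return option_key
    if path_key is not None:
        # Otherwise fall back to the load/save argument.
        return path_key
    raise ValueError("no H2OFrame key was given")
```

With this model, resolve_frame_key("key_one", "key_two") yields key_one, matching both examples above.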
Usage in Scala¶
Reading from H2O Frame¶
Let’s suppose we have an H2OFrame named frame.
The shortest way to load a dataframe from the H2OFrame with default settings is:
val df = spark.read.h2o(frame.key)
There are two more ways in which the dataframe can be loaded from H2OFrame. These calls allow us to specify additional options:
val df = spark.read.format("h2o").option("key", frame.key.toString).load()
or
val df = spark.read.format("h2o").load(frame.key.toString)
Saving to H2O Frame¶
Let’s suppose we have a DataFrame named df.
The shortest way in which a dataframe can be saved as an H2O Frame with default settings is:
df.write.h2o("new_key")
There are two additional methods for saving a dataframe as an H2OFrame. These calls allow us to specify additional options:
df.write.format("h2o").option("key", "new_key").save()
or
df.write.format("h2o").save("new_key")
All three variants save the dataframe as an H2OFrame with the key new_key. They will not succeed if an H2OFrame with the same key already exists.
Loading & Saving Options¶
If the key is specified both with the key option and in the load/save method, the key option takes precedence:
val df = spark.read.format("h2o").option("key", "key_one").load("key_two")
or
df.write.format("h2o").option("key", "key_one").save("key_two")
In both examples, key_one is used.
Specifying Saving Mode¶
There are four save modes available when saving data using the Data Source API: append, overwrite, error and ignore. The full description is available in the Spark documentation for Spark Save Modes.
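A save mode is passed to the writer through Spark's standard mode(...) call before save. The helper below is a minimal PySparkling-style sketch of that chain, not part of Sparkling Water: the name save_as_h2o_frame and the mode check are assumptions made for illustration, and df is assumed to be a Spark DataFrame in a session where an H2OContext is already running.

```python
# Hypothetical helper (not part of Sparkling Water): wraps the standard
# Spark DataFrameWriter chain format(...).mode(...).save(...).
SAVE_MODES = ("append", "overwrite", "error", "ignore")


def save_as_h2o_frame(df, key, mode="error"):
    """Save a Spark DataFrame as an H2OFrame under `key` using `mode`."""
    if mode not in SAVE_MODES:
        raise ValueError("unknown save mode: " + repr(mode))
    # Assumes an H2OContext is already running in this Spark session.
    df.write.format("h2o").mode(mode).save(key)
```

For example, save_as_h2o_frame(df, "new_key", mode="overwrite") would request the overwrite mode for the frame stored under new_key.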
If append is used, an existing H2OFrame with the same key is deleted, and a new one containing the union of all rows from the original H2OFrame and from the appended DataFrame is created with the same key.
If overwrite is used, an existing H2OFrame with the same key is deleted, and a new one with the new rows is created with the same key.
If error is used and an H2OFrame with the specified key already exists, then an exception is thrown.
If ignore is used and an H2OFrame with the specified key already exists, then no data is changed.
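The four behaviors can be summarized with a toy in-memory model. This is a sketch under stated assumptions, not Sparkling Water's implementation: a plain dict stands in for the H2O key-value store, lists of tuples stand in for frames, and the function name save_rows is hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple  # a row is modeled as a plain tuple


def save_rows(store: Dict[str, List[Row]], key: str,
              rows: List[Row], mode: str = "error") -> None:
    """Toy model of the save modes; `store` maps frame keys to rows."""
    exists = key in store
    if mode == "append":
        # New frame under the same key holding old rows plus new rows.
        store[key] = store.get(key, []) + list(rows)
    elif mode == "overwrite":
        # Old frame is dropped; only the new rows remain.
        store[key] = list(rows)
    elif mode == "error":
        if exists:
            raise ValueError("frame already exists: " + key)
        store[key] = list(rows)
    elif mode == "ignore":
        # An existing frame is left untouched.
        if not exists:
            store[key] = list(rows)
    else:
        raise ValueError("unknown save mode: " + repr(mode))
```

Starting from an empty store, saving [(1,)] under a key and then appending [(2,)] leaves [(1,), (2,)], a following ignore call changes nothing, and a following overwrite with [(4,)] leaves only [(4,)].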