H2O Frame as Spark’s Data Source
The way that an H2O Frame can be used as Spark’s Data Source differs between Python and Scala.
Usage in Python - PySparkling
Reading from H2O Frame
Let’s suppose we have an H2OFrame named frame.
There are two ways to load a DataFrame from an H2OFrame in PySparkling:
df = spark.read.format("h2o").option("key", frame.frame_id).load()
or
df = spark.read.format("h2o").load(frame.frame_id)
Saving to H2O Frame
Let’s suppose we have a DataFrame named df.
There are two ways to save the DataFrame as an H2OFrame in PySparkling:
df.write.format("h2o").option("key", "new_key").save()
or
df.write.format("h2o").save("new_key")
Both variants save the DataFrame as an H2OFrame with the key new_key. They fail if an H2OFrame with the same key already exists.
Loading & Saving Options
If the key is specified both with the key option and as the argument of the load/save method, the key option takes precedence:
df = spark.read.format("h2o").option("key", "key_one").load("key_two")
or
df.write.format("h2o").option("key", "key_one").save("key_two")
In both examples, key_one is used.
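The precedence rule can be written down as a small plain-Python model. This is only a sketch of the rule stated above, not Sparkling Water code: the function name effective_key is hypothetical, and raising an error when no key is given at all is an assumption of the model, since that case is not covered here.

```python
def effective_key(option_key=None, path_key=None):
    """Return the key that wins under the rule 'the key option is preferred'.

    option_key models .option("key", ...); path_key models the argument
    passed to load(...) or save(...). Both names are illustrative only.
    """
    if option_key is not None:
        # The explicit option takes precedence over the method argument.
        return option_key
    if path_key is not None:
        return path_key
    # Assumption of this model: with no key at all there is nothing to resolve.
    raise ValueError("no key given")


chosen = effective_key(option_key="key_one", path_key="key_two")
```

With both keys supplied, chosen is key_one, matching the two examples above.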
Usage in Scala
Reading from H2O Frame
Let’s suppose we have an H2OFrame named frame.
The shortest way to load a DataFrame from the H2OFrame with default settings is:
val df = spark.read.h2o(frame.key)
There are two more ways in which the dataframe can be loaded from H2OFrame. These calls allow us to specify additional options:
val df = spark.read.format("h2o").option("key", frame.key.toString).load()
or
val df = spark.read.format("h2o").load(frame.key.toString)
Saving to H2O Frame
Let’s suppose we have a DataFrame named df.
The shortest way to save a DataFrame as an H2OFrame with default settings is:
df.write.h2o("new_key")
There are two additional methods for saving a dataframe as an H2OFrame. These calls allow us to specify additional options:
df.write.format("h2o").option("key", "new_key").save()
or
df.write.format("h2o").save("new_key")
All three variants save the DataFrame as an H2OFrame with the key new_key. They fail if an H2OFrame with the same key already exists.
Loading & Saving Options
If the key is specified both with the key option and as the argument of the load/save method, the key option takes precedence:
val df = spark.read.format("h2o").option("key", "key_one").load("key_two")
or
df.write.format("h2o").option("key", "key_one").save("key_two")
In both examples, key_one is used.
Specifying Saving Mode
There are four save modes available when saving data using the Data Source API: append, overwrite, error and ignore. The full description is available in the Spark documentation for Spark Save Modes.
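As a minimal sketch of how a mode might be selected, the helper below passes the mode name to the writer's mode method, which is standard Spark DataFrameWriter API in both Python and Scala. The sketch uses PySparkling for brevity. The helper name save_as_h2o_frame and the VALID_MODES set are hypothetical, and it assumes that the h2o data source honours the mode set on the writer.

```python
# Hypothetical helper for illustration; not part of Sparkling Water.
VALID_MODES = {"append", "overwrite", "error", "ignore"}


def save_as_h2o_frame(df, key, mode="error"):
    """Save a Spark DataFrame as an H2OFrame under `key` with the given save mode.

    `df` is expected to be a pyspark.sql.DataFrame. The mode string is handed
    to DataFrameWriter.mode, which is part of Spark's Data Source API.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported save mode: {mode!r}")
    # Assumption: the "h2o" data source respects the mode set on the writer.
    df.write.format("h2o").mode(mode).option("key", key).save()
```

For example, save_as_h2o_frame(df, "new_key", mode="overwrite") would request that an existing frame stored under new_key be replaced.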
- If append is used, an existing H2OFrame with the same key is deleted, and a new one containing the union of all rows from the original H2OFrame and from the appended DataFrame is created with the same key.
- If overwrite is used, an existing H2OFrame with the same key is deleted, and a new one with the new rows is created with the same key.
- If error is used and an H2OFrame with the specified key already exists, then an exception is thrown.
- If ignore is used and an H2OFrame with the specified key already exists, then no data is changed.
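The four behaviours listed above can be mimicked with a plain-Python model in which a dictionary stands in for the H2O key-value store and a list stands in for the rows of a frame. This is a sketch of the described semantics only: it does not call Spark or H2O, and the names apply_save_mode and FrameAlreadyExists are hypothetical.

```python
class FrameAlreadyExists(Exception):
    """Hypothetical stand-in for the exception thrown in 'error' mode."""


def apply_save_mode(frames, key, new_rows, mode="error"):
    """Apply one save mode to `frames`, a dict mapping key -> list of rows."""
    exists = key in frames
    if mode == "append":
        # Old frame is replaced by one holding the old rows plus the new rows.
        frames[key] = frames.get(key, []) + list(new_rows)
    elif mode == "overwrite":
        # Old frame, if any, is replaced by one holding only the new rows.
        frames[key] = list(new_rows)
    elif mode == "error":
        if exists:
            raise FrameAlreadyExists(key)
        frames[key] = list(new_rows)
    elif mode == "ignore":
        # An existing frame is left untouched; otherwise the rows are stored.
        if not exists:
            frames[key] = list(new_rows)
    else:
        raise ValueError(f"unsupported save mode: {mode!r}")
    return frames


frames = {"new_key": [1, 2]}
apply_save_mode(frames, "new_key", [3], mode="append")   # new_key now holds [1, 2, 3]
apply_save_mode(frames, "new_key", [9], mode="ignore")   # new_key is unchanged
```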