H2O Frame as Spark’s Data Source

H2O Frame can be used directly as a Spark data source.

Reading Spark Data Frame from H2O Frame

Suppose we have an H2OFrame with the key testFrame.

A Spark Data Frame can be loaded from this H2OFrame in either of the following ways:

Scala:

// Pass the key as an option
val df = spark.read.format("h2o").option("key", "testFrame").load()
// Or pass the key directly to load()
val df = spark.read.format("h2o").load("testFrame")
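The page lists Python alongside Scala, so a minimal PySpark sketch of the same two read variants may be useful. It assumes a SparkSession with Sparkling Water (PySparkling) attached so that the "h2o" format is registered. The helper names are hypothetical and only wrap the reader calls shown above.

```python
def load_h2o_frame_by_option(spark, key):
    """Load the H2OFrame stored under `key`, passing the key as an option."""
    return spark.read.format("h2o").option("key", key).load()


def load_h2o_frame_by_path(spark, key):
    """Load the H2OFrame stored under `key`, passing the key to load()."""
    return spark.read.format("h2o").load(key)
```

With a running session, load_h2o_frame_by_option(spark, "testFrame") should return the same Data Frame as the first Scala line.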

Saving Spark Data Frame as H2O Frame

Suppose we have a Spark Data Frame df.

The Spark Data Frame can be saved as an H2OFrame in either of the following ways:

Scala:

// Pass the key as an option
df.write.format("h2o").option("key", "new_key").save()
// Or pass the key directly to save()
df.write.format("h2o").save("new_key")

Both variants save the Data Frame as an H2OFrame with the key new_key. The save fails if an H2OFrame with the same key already exists.

Loading & Saving Options

If the key is specified both with the key option and as an argument to the load/save method, the key option takes precedence.

Scala:

// Read: loads the H2OFrame stored under key_one
val df = spark.read.format("h2o").option("key", "key_one").load("key_two")
// Write: saves the Data Frame under key_one
df.write.format("h2o").option("key", "key_one").save("key_two")

In both examples, key_one is used and key_two is ignored.

Specifying Saving Mode

There are four save modes available when saving data using the Data Source API: append, overwrite, error and ignore. The full description is available in the Spark documentation for Spark Save Modes.

  • If append is used, an existing H2OFrame with the same key is deleted, and a new one containing the union of all rows from the original H2OFrame and from the appended Data Frame is created with the same key.

  • If overwrite is used, an existing H2OFrame with the same key is deleted, and a new one with the new rows is created with the same key.

  • If error is used and an H2OFrame with the specified key already exists, then an exception is thrown.

  • If ignore is used and an H2OFrame with the specified key already exists, then no data is changed.
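As an illustration of selecting one of these modes, here is a minimal PySpark sketch. It assumes a Spark DataFrame df and a session with Sparkling Water (PySparkling) attached. The helper name save_as_h2o_frame and the VALID_MODES tuple are hypothetical. The default of "error" mirrors Spark's own default save mode.

```python
# The four save modes named above (hypothetical constant for validation).
VALID_MODES = ("append", "overwrite", "error", "ignore")


def save_as_h2o_frame(df, key, mode="error"):
    """Save Spark DataFrame `df` as an H2OFrame under `key` using `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported save mode: {mode!r}")
    # DataFrameWriter.mode() sets the save mode before the write is triggered.
    df.write.format("h2o").mode(mode).save(key)
```

For example, save_as_h2o_frame(df, "new_key", mode="overwrite") replaces an existing H2OFrame stored under new_key, while mode="ignore" leaves it untouched.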