Import & Export H2O Frames from/to S3

Sparkling Water can read and write H2O frames from and to S3. Several configuration steps are required.

Specify the AWS Dependencies

In order to enable support for S3A/S3N, we need to start Sparkling Water with the following extra dependencies:

  • org.apache.hadoop:hadoop-aws:2.7.3

  • com.amazonaws:aws-java-sdk:1.7.4

For production environments, we advise downloading these jars and adding them to the Spark classpath manually by copying them to the $SPARK_HOME/jars directory.

Alternatively, we can use the --packages option when starting Sparkling Water:

./bin/sparkling-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

The same holds for PySparkling. We can also add the following line to the spark-defaults.conf file:

spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

This line ensures that we don’t need to specify the --packages option all the time.
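The value given to --packages or spark.jars.packages is a comma-separated list of Maven coordinates in groupId:artifactId:version form. As a minimal sketch, the following Python snippet checks the two coordinates listed above and builds that value. The helper name packages_value is hypothetical and not part of Sparkling Water, and rejecting whitespace inside a coordinate is a deliberately strict assumption of this sketch.

```python
import re

# One Maven coordinate: group:artifact:version, without whitespace or commas.
_COORDINATE = re.compile(r"^[^\s:,]+:[^\s:,]+:[^\s:,]+$")


def packages_value(coordinates):
    """Join Maven coordinates into a single comma-separated string."""
    for coordinate in coordinates:
        if not _COORDINATE.match(coordinate):
            raise ValueError(
                f"not a group:artifact:version coordinate: {coordinate!r}"
            )
    return ",".join(coordinates)


# The two dependencies named in this section.
AWS_DEPENDENCIES = [
    "com.amazonaws:aws-java-sdk:1.7.4",
    "org.apache.hadoop:hadoop-aws:2.7.3",
]

if __name__ == "__main__":
    # Prints the line to place in spark-defaults.conf.
    print("spark.jars.packages " + packages_value(AWS_DEPENDENCIES))
```

A stray token in front of a coordinate, for example a property name pasted into the --packages argument, makes the helper raise ValueError instead of producing a malformed list.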

Configuring S3A

In order to read and write to S3A, please add the following lines to the spark-defaults.conf file in the $SPARK_HOME/conf directory:

spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key          {{AWS_ACCESS_KEY}}
spark.hadoop.fs.s3a.secret.key          {{AWS_SECRET_KEY}}

where {{AWS_ACCESS_KEY}} should be replaced with your AWS access key and {{AWS_SECRET_KEY}} with your AWS secret key.

Configuring S3N

In order to read and write to S3N, please add the following lines to the spark-defaults.conf file in the $SPARK_HOME/conf directory:

spark.hadoop.fs.s3n.impl                org.apache.hadoop.fs.s3native.NativeS3FileSystem
spark.hadoop.fs.s3n.awsAccessKeyId      {{AWS_ACCESS_KEY}}
spark.hadoop.fs.s3n.awsSecretAccessKey  {{AWS_SECRET_KEY}}

where {{AWS_ACCESS_KEY}} should be replaced with your AWS access key and {{AWS_SECRET_KEY}} with your AWS secret key.
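The S3A and S3N blocks differ only in their property names. As a sketch, the following Python helper renders the three lines for either scheme so that the keys do not have to be typed into the file by hand. The helper name s3_conf_lines is hypothetical, and reading the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is an assumption of this example, not something Sparkling Water requires.

```python
import os

# Property names copied from the S3A and S3N configuration blocks.
S3_PROPERTIES = {
    "s3a": {
        "impl": ("spark.hadoop.fs.s3a.impl",
                 "org.apache.hadoop.fs.s3a.S3AFileSystem"),
        "access": "spark.hadoop.fs.s3a.access.key",
        "secret": "spark.hadoop.fs.s3a.secret.key",
    },
    "s3n": {
        "impl": ("spark.hadoop.fs.s3n.impl",
                 "org.apache.hadoop.fs.s3native.NativeS3FileSystem"),
        "access": "spark.hadoop.fs.s3n.awsAccessKeyId",
        "secret": "spark.hadoop.fs.s3n.awsSecretAccessKey",
    },
}


def s3_conf_lines(scheme, access_key, secret_key):
    """Return the spark-defaults.conf lines for the given scheme."""
    props = S3_PROPERTIES[scheme]  # KeyError for anything but s3a / s3n
    impl_name, impl_class = props["impl"]
    return [
        f"{impl_name} {impl_class}",
        f"{props['access']} {access_key}",
        f"{props['secret']} {secret_key}",
    ]


if __name__ == "__main__":
    # Assumed environment variable names; adjust to your setup.
    access_key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        print("\n".join(s3_conf_lines("s3a", access_key, secret_key)))
    else:
        print("AWS credentials are not set in the environment")
```

The printed lines can be appended to the spark-defaults.conf file in the $SPARK_HOME/conf directory.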

Sparkling Water Example Code

Once Sparkling Water is configured as explained above, start Sparkling Shell:

./bin/sparkling-shell

Next, let’s start H2OContext. This context brings H2O support into the Spark environment:

import org.apache.spark.h2o._
val hc = H2OContext.getOrCreate(spark)

Finally, read the data:

import java.net.URI
val fr = new H2OFrame(new URI("s3n://"))

PySparkling Example Code

Once PySparkling is configured as explained above, start PySparkling:

./bin/pysparkling

Next, let’s start H2OContext. This context brings H2O support into the Spark environment:

from pysparkling import *
hc = H2OContext.getOrCreate(spark)

Finally, read the data:

import h2o
fr = h2o.import_file("s3n://")

In PySparkling, you can also export the frame to S3 with h2o.export_file.
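A minimal sketch of such an export, assuming a running H2OContext and the frame fr from the previous step. It relies on h2o.export_file from the H2O Python API. The helper names s3_uri and export_frame, as well as the bucket and key in the usage comment, are hypothetical placeholders.

```python
def s3_uri(scheme, bucket, key):
    """Build a target URI such as s3a://bucket/path/to/file.csv."""
    if scheme not in ("s3a", "s3n"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def export_frame(frame, scheme, bucket, key):
    """Export an H2OFrame to S3 and return the URI it was written to."""
    import h2o  # imported lazily; requires the h2o Python package

    target = s3_uri(scheme, bucket, key)
    h2o.export_file(frame, target)
    return target


# Usage inside a PySparkling session (hypothetical bucket and key):
# export_frame(fr, "s3a", "my-bucket", "exports/frame.csv")
```

The scheme passed to export_frame should match the one configured in spark-defaults.conf, since the credentials are set per scheme.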
