Import & Export H2O Frames from/to S3
Sparkling Water can read and write H2O frames from and to S3. Several configuration steps are required.
Specify the AWS Dependencies
In order to enable support for S3A/S3N, we need to start Sparkling Water with the following extra dependencies:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.4
For production environments, we advise downloading these jars and adding them to the Spark classpath manually by copying them into the $SPARK_HOME/jars directory.
Alternatively, we can pass the --packages option when starting Sparkling Water:
./bin/sparkling-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
The same holds for PySparkling. We can also add the following line to the spark-defaults.conf file:
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
This line ensures that we don’t need to specify the --packages option all the time.
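When the Spark session is created from a Python script rather than from the shell, the same coordinates can in principle be supplied through the spark.jars.packages property on the session builder. The following is a minimal sketch, assuming PySpark is installed and that no Spark JVM is running yet when the session is built. The helper names packages_value and build_session are hypothetical and not part of Sparkling Water.

```python
# Maven coordinates taken from the section above.
AWS_PACKAGES = [
    "com.amazonaws:aws-java-sdk:1.7.4",
    "org.apache.hadoop:hadoop-aws:2.7.3",
]


def packages_value(coordinates):
    """Join Maven coordinates into the comma-separated form Spark expects."""
    for coord in coordinates:
        parts = coord.split(":")
        # Each coordinate must look like groupId:artifactId:version.
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"not a groupId:artifactId:version coordinate: {coord!r}"
            )
    return ",".join(coordinates)


def build_session(app_name="s3-example"):
    """Hypothetical helper: create a SparkSession that pulls the AWS jars."""
    # Imported lazily so the helper above can be used without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", packages_value(AWS_PACKAGES))
        .getOrCreate()
    )
```

The validation step rejects values such as a bare property name, which would otherwise be passed to Spark as if it were a package coordinate.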
Configuring S3A
To read from and write to S3A, add the following lines to the spark-defaults.conf file in the $SPARK_HOME/conf directory:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key {{AWS_ACCESS_KEY}}
spark.hadoop.fs.s3a.secret.key {{AWS_SECRET_KEY}}
where {{AWS_ACCESS_KEY}} should be replaced with your AWS access key and {{AWS_SECRET_KEY}} with your AWS secret key.
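To avoid typing credentials into the file by hand, the three S3A lines can be generated from environment variables. The sketch below assumes the keys are available as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (the names commonly used by AWS tooling, an assumption not made by this guide). The function render_s3a_conf is a hypothetical helper.

```python
import os

# Template mirroring the three S3A properties listed above.
S3A_TEMPLATE = (
    "spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem\n"
    "spark.hadoop.fs.s3a.access.key {access_key}\n"
    "spark.hadoop.fs.s3a.secret.key {secret_key}\n"
)


def render_s3a_conf(env=os.environ):
    """Return the S3A block for spark-defaults.conf.

    Raises KeyError if either variable is missing, so a half-filled
    configuration is never produced.
    """
    access_key = env["AWS_ACCESS_KEY_ID"]      # assumed variable name
    secret_key = env["AWS_SECRET_ACCESS_KEY"]  # assumed variable name
    return S3A_TEMPLATE.format(access_key=access_key, secret_key=secret_key)
```

The returned text can then be appended to the spark-defaults.conf file in the $SPARK_HOME/conf directory.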
Configuring S3N
To read from and write to S3N, add the following lines to the spark-defaults.conf file in the $SPARK_HOME/conf directory:
spark.hadoop.fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem
spark.hadoop.fs.s3n.awsAccessKeyId {{AWS_ACCESS_KEY}}
spark.hadoop.fs.s3n.awsSecretAccessKey {{AWS_SECRET_KEY}}
where {{AWS_ACCESS_KEY}} should be replaced with your AWS access key and {{AWS_SECRET_KEY}} with your AWS secret key.
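Because S3A and S3N use different property names, it is easy to configure one scheme and then read through the other. The sketch below checks the text of a spark-defaults.conf file for the properties a given scheme needs. It handles only the whitespace-separated "key value" form used in this guide, and the names parse_spark_defaults and missing_properties are hypothetical.

```python
# Property names copied from the two configuration sections above.
REQUIRED_PROPERTIES = {
    "s3a": [
        "spark.hadoop.fs.s3a.impl",
        "spark.hadoop.fs.s3a.access.key",
        "spark.hadoop.fs.s3a.secret.key",
    ],
    "s3n": [
        "spark.hadoop.fs.s3n.impl",
        "spark.hadoop.fs.s3n.awsAccessKeyId",
        "spark.hadoop.fs.s3n.awsSecretAccessKey",
    ],
}


def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def missing_properties(text, scheme):
    """List required properties that are absent, empty, or still a
    {{PLACEHOLDER}} for the given scheme ('s3a' or 's3n')."""
    props = parse_spark_defaults(text)
    missing = []
    for key in REQUIRED_PROPERTIES[scheme]:
        value = props.get(key, "")
        if not value or value.startswith("{{"):
            missing.append(key)
    return missing
```

An empty list means every property for that scheme is present with a real value; any returned names indicate what still has to be added or filled in.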
Sparkling Water Example Code
Once you have configured Sparkling Water as explained above, start Sparkling Shell:
./bin/sparkling-shell
Next, let’s start H2OContext. This context brings H2O support into the Spark environment:
import org.apache.spark.h2o._
val hc = H2OContext.getOrCreate(spark)
Finally, read the data:
import java.net.URI
val fr = new H2OFrame(new URI("s3n://data.h2o.ai/h2o-open-tour/2016-nyc/weather.csv"))
PySparkling Example Code
Once you have configured PySparkling as explained above, start PySparkling:
./bin/pysparkling
Next, let’s start H2OContext. This context brings H2O support into the Spark environment:
from pysparkling import *
hc = H2OContext.getOrCreate(spark)
Finally, read the data:
import h2o
fr = h2o.import_file("s3n://data.h2o.ai/h2o-open-tour/2016-nyc/weather.csv")
In PySparkling, you can also export a frame to S3 by passing the frame and the target location:
h2o.export_file(fr, "s3n://path/to/target/location")
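Since an export only succeeds for a scheme that has been configured, it can help to validate the target location before handing it to H2O. The following sketch checks that a URI uses s3a or s3n and names a bucket, then delegates to h2o.export_file. The helper names check_s3_uri and export_frame are hypothetical, and the restriction to these two schemes is an assumption based on this guide.

```python
from urllib.parse import urlparse

# Schemes covered by the configuration sections of this guide.
SUPPORTED_SCHEMES = {"s3a", "s3n"}


def check_s3_uri(uri):
    """Split an S3 URI into (scheme, bucket, key), or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme in {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket in {uri!r}")
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


def export_frame(frame, uri):
    """Hypothetical wrapper: validate the target, then export the frame."""
    check_s3_uri(uri)
    # Imported lazily; requires the h2o package and a running H2OContext.
    import h2o

    h2o.export_file(frame, uri)
```

For example, check_s3_uri("s3n://data.h2o.ai/h2o-open-tour/2016-nyc/weather.csv") yields the scheme s3n, the bucket data.h2o.ai, and the key h2o-open-tour/2016-nyc/weather.csv.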