Import & Export H2O Frames from/to S3
-------------------------------------

Sparkling Water can read and write H2O frames from and to S3. Several configuration steps are required.

Specify the AWS Dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To enable support for S3A/S3N, start Sparkling Water with the following extra dependencies:

- ``org.apache.hadoop:hadoop-aws:2.7.3``
- ``com.amazonaws:aws-java-sdk:1.7.4``

For production environments, we advise downloading these jars and adding them to the Spark classpath manually by copying them to the ``$SPARK_HOME/jars`` directory.

Alternatively, pass the ``--packages`` option when starting Sparkling Water:

.. code:: bash

    ./bin/sparkling-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

The same holds for PySparkling.

We can also add the following line to the ``spark-defaults.conf`` file:

.. code:: bash

    spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

With this line in place, the ``--packages`` option no longer needs to be specified on every start.

Configuring S3A
~~~~~~~~~~~~~~~

To read from and write to S3A, add the following lines to the ``spark-defaults.conf`` file in the ``$SPARK_HOME/conf`` directory:

.. code:: bash

    spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.access.key {{AWS_ACCESS_KEY}}
    spark.hadoop.fs.s3a.secret.key {{AWS_SECRET_KEY}}

where ``{{AWS_ACCESS_KEY}}`` should be replaced with your AWS access key and ``{{AWS_SECRET_KEY}}`` with your AWS secret key.

Configuring S3N
~~~~~~~~~~~~~~~

To read from and write to S3N, add the following lines to the ``spark-defaults.conf`` file in the ``$SPARK_HOME/conf`` directory:

.. code:: bash

    spark.hadoop.fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem
    spark.hadoop.fs.s3n.awsAccessKeyId {{AWS_ACCESS_KEY}}
    spark.hadoop.fs.s3n.awsSecretAccessKey {{AWS_SECRET_KEY}}

where ``{{AWS_ACCESS_KEY}}`` should be replaced with your AWS access key and ``{{AWS_SECRET_KEY}}`` with your AWS secret key.

Sparkling Water Example Code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have configured Sparkling Water as explained above, start Sparkling Shell:

.. code:: bash

    ./bin/sparkling-shell

Next, start ``H2OContext``. This context brings H2O support into the Spark environment:

.. code:: scala

    import org.apache.spark.h2o._
    val hc = H2OContext.getOrCreate(spark)

Finally, read the data:

.. code:: scala

    import java.net.URI
    val fr = new H2OFrame(new URI("s3n://data.h2o.ai/h2o-open-tour/2016-nyc/weather.csv"))

PySparkling Example Code
~~~~~~~~~~~~~~~~~~~~~~~~

Once you have configured PySparkling as explained above, start PySparkling:

.. code:: bash

    ./bin/pysparkling

Next, start ``H2OContext``. This context brings H2O support into the Spark environment:

.. code:: python

    from pysparkling import *
    import h2o
    hc = H2OContext.getOrCreate(spark)

Finally, read the data:

.. code:: python

    fr = h2o.import_file("s3n://data.h2o.ai/h2o-open-tour/2016-nyc/weather.csv")

In PySparkling, you can also export the frame to S3:

.. code:: python

    h2o.export_file(fr, "s3n://path/to/target/location")
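
The S3A and S3N settings described above differ only in their property names, so the ``spark-defaults.conf`` lines can be generated instead of typed by hand. The following is a minimal sketch of that idea in plain Python. The property names and filesystem classes are taken from the configuration sections above; the helper ``render_s3_conf`` and the ``S3_PROPERTIES`` table are hypothetical names introduced here for illustration and are not part of Sparkling Water or PySparkling.

```python
# Hypothetical helper (not part of Sparkling Water): renders the
# spark-defaults.conf lines for S3A or S3N as described in this document.

# Property names and implementation classes, copied from the sections above.
S3_PROPERTIES = {
    "s3a": {
        "impl_key": "spark.hadoop.fs.s3a.impl",
        "impl_class": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "access_key": "spark.hadoop.fs.s3a.access.key",
        "secret_key": "spark.hadoop.fs.s3a.secret.key",
    },
    "s3n": {
        "impl_key": "spark.hadoop.fs.s3n.impl",
        "impl_class": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "access_key": "spark.hadoop.fs.s3n.awsAccessKeyId",
        "secret_key": "spark.hadoop.fs.s3n.awsSecretAccessKey",
    },
}


def render_s3_conf(scheme, access_key, secret_key):
    """Return the three spark-defaults.conf lines for the given scheme."""
    if scheme not in S3_PROPERTIES:
        raise ValueError("scheme must be 's3a' or 's3n', got %r" % (scheme,))
    props = S3_PROPERTIES[scheme]
    lines = [
        "%s %s" % (props["impl_key"], props["impl_class"]),
        "%s %s" % (props["access_key"], access_key),
        "%s %s" % (props["secret_key"], secret_key),
    ]
    return "\n".join(lines)
```

Calling ``render_s3_conf("s3a", access, secret)`` returns a string that can be appended to ``$SPARK_HOME/conf/spark-defaults.conf``. Because the result contains credentials, it should be handled like any other secret, for example by restricting the file's permissions.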