Get started with Sparkling Water in a few easy steps
1. Download Spark (if not already installed) from the Spark Downloads Page
Choose Spark release: 3.4.*
Choose a package type: Pre-built for Hadoop 2.7 and later
2. Point SPARK_HOME to the existing installation of Spark and export the MASTER variable.
export SPARK_HOME="/path/to/spark/installation"
# To launch a local Spark cluster.
export MASTER="local[*]"
3. From your terminal, run:
cd ~/Downloads
unzip sparkling-water-3.44.0.2-1-3.4.zip
cd sparkling-water-3.44.0.2-1-3.4
bin/sparkling-shell --conf "spark.executor.memory=1g"
4. Create an H2O cloud inside the Spark cluster:
import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._
5. Follow this demo, which imports airlines and weather data and runs predictions on delays.
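Before launching the shell, it can help to confirm that the two variables from step 2 are set. The following is a minimal Python sketch and is not part of the Sparkling Water distribution: the function name check_spark_env is hypothetical, and it only checks that SPARK_HOME names an existing directory and that MASTER is non-empty.

```python
import os


def check_spark_env(env=None):
    """Return a list of human-readable problems with SPARK_HOME and MASTER.

    `env` is a mapping of environment variables; it defaults to os.environ.
    An empty list means both variables look usable.
    """
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append("SPARK_HOME does not point to a directory: " + spark_home)

    if not env.get("MASTER", ""):
        problems.append("MASTER is not set")

    return problems


if __name__ == "__main__":
    # Report every problem found in the current shell environment.
    for problem in check_spark_env():
        print(problem)
```

The check does not verify that the directory actually contains a Spark installation, only that the path exists.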
Launch Sparkling Water on Hadoop using YARN
1. Download Spark (if not already installed) from the Spark Downloads Page.
Choose Spark release: 3.4.*
Choose a package type: Pre-built for Hadoop 2.7 and later
2. Point SPARK_HOME to an existing installation of Spark:
export SPARK_HOME='/path/to/spark/installation'
3. Set the HADOOP_CONF_DIR and Spark MASTER environment variables:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export MASTER="yarn"
4. Download Sparkling Water and use sparkling-shell to launch Sparkling Shell on YARN:
wget /sparkling-water-3.44.0.2-1-3.4.zip
unzip sparkling-water-3.44.0.2-1-3.4.zip
cd sparkling-water-3.44.0.2-1-3.4/
bin/sparkling-shell --num-executors 3 --executor-memory 2g --master yarn --deploy-mode client
5. Create an H2O cloud inside the Spark cluster:
import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._
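When the YARN launch is driven from a script rather than typed by hand, the command from step 4 can be assembled as an argument list. This is a small Python sketch under that assumption: the helper name sparkling_shell_yarn_args is hypothetical, and the flags and default values are the ones shown in step 4.

```python
import shlex


def sparkling_shell_yarn_args(num_executors=3, executor_memory="2g"):
    """Build the argv list for launching sparkling-shell on YARN in client mode."""
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return [
        "bin/sparkling-shell",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--master", "yarn",
        "--deploy-mode", "client",
    ]


if __name__ == "__main__":
    # Print the command in a form that can be pasted into a terminal.
    print(shlex.join(sparkling_shell_yarn_args()))
```

The list can be passed to subprocess.run with cwd set to the unpacked Sparkling Water directory, since the script path is relative to it.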
Launch H2O on a Standalone Spark Cluster
1. Download Spark (if not already installed) from the Spark Downloads Page.
Choose Spark release: 3.4.*
Choose a package type: Pre-built for Hadoop 2.7 and later
2. Point SPARK_HOME to an existing installation of Spark:
export SPARK_HOME='/path/to/spark/installation'
3. From your terminal, run:
cd ~/Downloads
unzip sparkling-water-3.44.0.2-1-3.4.zip
cd sparkling-water-3.44.0.2-1-3.4
bin/launch-spark-cloud.sh
export MASTER="spark://localhost:7077"
bin/sparkling-shell
4. Create an H2O cloud inside the Spark cluster:
import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._
Kluster
Kluster mode of Sparkling Water supports connection to external H2O clusters (standalone or Hadoop).
The H2O cluster must be started with the corresponding H2O driver, which can be downloaded as described below.
1. Download and unpack the Sparkling Water distribution.
2. Download the H2O driver that corresponds to your Hadoop distribution (e.g., hdp2.2, cdh5.4), or the standalone one:
export H2O_DRIVER_JAR=$(/path/to/sparkling-water-3.44.0.2-1-3.4/bin/get-h2o-driver.sh hdp2.2)
3. Set the path to sparkling-water-assembly-extensions-3.44.0.2-1-3.4-all.jar, which is bundled in the Sparkling Water archive:
SW_EXTENSIONS_ASSEMBLY=/path/to/sparkling-water-3.44.0.2-1-3.4/jars/sparkling-water-assembly-extensions-3.44.0.2-1-3.4-all.jar
4. Start H2O cluster on Hadoop:
hadoop jar $H2O_DRIVER_JAR -libjars $SW_EXTENSIONS_ASSEMBLY -sw_ext_backend -jobname test -nodes 3 -mapperXmx 6g
5. In your Sparkling Water application, create an H2OContext:
Scala
import ai.h2o.sparkling._
val conf = new H2OConf().setExternalClusterMode().useManualClusterStart().setCloudName("test")
val hc = H2OContext.getOrCreate(conf)
Python
from pysparkling import *
conf = H2OConf().setExternalClusterMode().useManualClusterStart().setCloudName("test")
hc = H2OContext.getOrCreate(conf)
List of supported Hadoop distributions: cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15 cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 cdp7.1 cdp7.2 hdp2.2 hdp2.3 hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0 mapr6.1 mapr6.2 iop4.2 emr6.10
For more information, see the Kluster documentation.
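A mistyped distribution name is easy to miss when calling get-h2o-driver.sh, so a script can validate it against the list above first. The sketch below is illustrative only: the helper name get_h2o_driver_args is hypothetical, and accepting the literal word "standalone" is an assumption based on step 2 mentioning a standalone driver.

```python
# Distribution names copied from the list of supported Hadoop distributions.
SUPPORTED_DISTRIBUTIONS = frozenset("""
cdh5.4 cdh5.5 cdh5.6 cdh5.7 cdh5.8 cdh5.9 cdh5.10 cdh5.13 cdh5.14 cdh5.15
cdh5.16 cdh6.0 cdh6.1 cdh6.2 cdh6.3 cdp7.0 cdp7.1 cdp7.2 hdp2.2 hdp2.3
hdp2.4 hdp2.5 hdp2.6 hdp3.0 hdp3.1 mapr4.0 mapr5.0 mapr5.1 mapr5.2 mapr6.0
mapr6.1 mapr6.2 iop4.2 emr6.10
""".split())


def get_h2o_driver_args(distribution, sw_home):
    """Build the argv list for bin/get-h2o-driver.sh after validating the name.

    `sw_home` is the directory of the unpacked Sparkling Water distribution.
    "standalone" is accepted in addition to the listed names (an assumption).
    """
    if distribution != "standalone" and distribution not in SUPPORTED_DISTRIBUTIONS:
        raise ValueError("unsupported Hadoop distribution: " + distribution)
    script = sw_home.rstrip("/") + "/bin/get-h2o-driver.sh"
    return [script, distribution]


if __name__ == "__main__":
    print(get_h2o_driver_args("hdp2.2", "/path/to/sparkling-water-3.44.0.2-1-3.4"))
```

Running the returned command with subprocess.run and capturing its standard output would give the same value that step 2 stores in H2O_DRIVER_JAR.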
Gradle-style specification for Maven artifacts
See the h2o-droplets github repository for a working example.
repositories {
mavenCentral()
}
dependencies {
implementation "ai.h2o:sparkling-water-package_2.12:3.44.0.2-1-3.4"
}
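The version string in the artifact coordinate appears to combine three parts: the H2O version (3.44.0.2), a Sparkling Water patch number (1), and the Spark line (3.4). That reading is an assumption inferred from the file names and URLs on this page, not a documented contract. Under that assumption, a short Python sketch (all names hypothetical) can split the string and rebuild the coordinate:

```python
from typing import NamedTuple


class SparklingWaterVersion(NamedTuple):
    h2o_version: str    # e.g. "3.44.0.2"
    patch: str          # e.g. "1"
    spark_version: str  # e.g. "3.4"


def parse_sw_version(version):
    """Split a version such as "3.44.0.2-1-3.4" into its three assumed parts."""
    parts = version.split("-")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <h2o>-<patch>-<spark>, got: " + version)
    return SparklingWaterVersion(*parts)


def maven_coordinate(version, scala_version="2.12"):
    """Return the group:artifact:version coordinate used in the build script."""
    parse_sw_version(version)  # validation only; raises ValueError if malformed
    return f"ai.h2o:sparkling-water-package_{scala_version}:{version}"


if __name__ == "__main__":
    print(parse_sw_version("3.44.0.2-1-3.4"))
    print(maven_coordinate("3.44.0.2-1-3.4"))
```

Keeping the Spark part of the version in step with the Spark release installed on the cluster is the main reason to inspect the string programmatically.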
RSparkling
RSparkling is an R client for Sparkling Water applications. To use it:
1. Download and unpack the Sparkling Water distribution:
cd ~/Downloads
unzip sparkling-water-3.44.0.2-1-3.4.zip
cd sparkling-water-3.44.0.2-1-3.4
Now, continue inside R or RStudio and prepare the environment.
2. Install the RSparkling dependency, sparklyr:
install.packages("sparklyr")
3. Install Spark:
library(sparklyr)
spark_install(version = "3.4.1")
4. Install the matching version of H2O:
install.packages("h2o", type = "source", repos = "https://h2o-release.s3.amazonaws.com/h2o/rel-3.44.0/2/R")
5. Install RSparkling for Sparkling Water 3.44.0.2-1-3.4:
From S3 repository:
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.4/3.44.0.2-1-3.4/R")
From downloaded distribution:
# rsparkling_3.44.0.2-1-3.4.tar.gz is included in the downloaded distribution.
install.packages(repos=NULL, type="source", "rsparkling_3.44.0.2-1-3.4.tar.gz")
6. Initialize RSparkling
library(rsparkling)
7. Connect to Spark
sc <- spark_connect(master = "local", version = "3.4.1")
8. Now, H2OContext is available and we can use any H2O features available in R.
hc <- H2OContext.getOrCreate()
For more detailed information, follow the installation and usage instructions on the RSparkling page.
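The two repository URLs in steps 4 and 5 both look derivable from the Sparkling Water version string. The Python sketch below shows that derivation; the helper name r_repo_urls is hypothetical, and the URL pattern is an assumption generalized from the two examples above, so other releases may not follow it.

```python
def r_repo_urls(sw_version):
    """Derive the R repository URLs for H2O and RSparkling from a version string.

    Assumes the form <h2o>-<patch>-<spark>, for example "3.44.0.2-1-3.4".
    """
    parts = sw_version.split("-")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <h2o>-<patch>-<spark>, got: " + sw_version)
    h2o_version, _patch, spark_version = parts

    # "3.44.0.2" becomes release "3.44.0" and build "2".
    release, build = h2o_version.rsplit(".", 1)

    return {
        "h2o": f"https://h2o-release.s3.amazonaws.com/h2o/rel-{release}/{build}/R",
        "rsparkling": (
            "http://h2o-release.s3.amazonaws.com/sparkling-water/"
            f"spark-{spark_version}/{sw_version}/R"
        ),
    }


if __name__ == "__main__":
    for name, url in r_repo_urls("3.44.0.2-1-3.4").items():
        print(name, url)
```

Each value is what would be passed as the repos argument of install.packages in steps 4 and 5.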
H2O R Client
Once you have an H2OContext available in RSparkling, any commands available in the R client can be used. For more information, please visit the H2O-R page.
PySparkling
PySparkling is a Python client for Sparkling Water applications. To use it:
1. Ensure that Spark is installed and that the MASTER and SPARK_HOME environment variables are properly set.
2. Download and unpack the Sparkling Water distribution.
3. Run PySparkling shell.
./bin/pysparkling
4. In your PySparkling application, create H2OContext.
from pysparkling import *
hc = H2OContext.getOrCreate()
PySparkling installed from the PyPI repository
1. Install PySparkling using pip:
pip install h2o_pysparkling_3.4
2. In your Python application, first create a SparkSession. This step requires the PySpark package to be installed.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparklingWaterApp").getOrCreate()
3. Start H2OContext.
from pysparkling import *
hc = H2OContext.getOrCreate()
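The pip package name in step 1 ends with the Spark line it targets (3.4 here). A small Python sketch can pick the name from a full Spark version; the helper name pysparkling_package is hypothetical, and the naming pattern is an assumption generalized from the single package name shown in step 1.

```python
def pysparkling_package(spark_version):
    """Map a Spark version such as "3.4.1" to a PySparkling pip package name."""
    parts = spark_version.split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError("expected a version like 3.4 or 3.4.1, got: " + spark_version)
    # Only the major and minor components are part of the package name.
    return f"h2o_pysparkling_{parts[0]}.{parts[1]}"


if __name__ == "__main__":
    print(pysparkling_package("3.4.1"))
```

With PySpark installed, pyspark.__version__ can be passed as the argument so that the chosen package matches the local Spark version.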
H2O Python Client
Once you have an H2OContext available, any API calls available in the H2O Python client can be used. For more information about the Python client, please visit the H2O-Python page.
Sparkling Water as Spark Package
To start Spark with Sparkling Water enabled via a Spark package:
1. Ensure that Spark is installed and that the MASTER and SPARK_HOME environment variables are properly set.
2. Start Spark and point it to the Maven coordinates of Sparkling Water:
$SPARK_HOME/bin/spark-shell --packages ai.h2o:sparkling-water-package_2.12:3.44.0.2-1-3.4
3. Create an H2O cloud inside the Spark cluster:
import ai.h2o.sparkling._
val h2oContext = H2OContext.getOrCreate()
import h2oContext._