Hive Support in Sparkling Water

Spark supports reading data natively from Hive, and H2O supports it in a Hadoop environment as well. In Sparkling Water you can decide which tool to use for the task. This tutorial explains what is needed to use H2O to read data from Hive in a Sparkling Water environment.

Import Data from Hive via Hive Metastore

  • Make sure $SPARK_HOME/conf contains the hive-site.xml with your Hive configuration (a minimal sketch follows this list).

  • In YARN client mode or any local mode, copy the connector jars required by your Metastore to $SPARK_HOME/jars. You can find these jars in the $HIVE_HOME/lib directory. For example, if you are using MySQL as the Metastore for Hive, copy the MySQL Metastore JDBC connector jar. This step is not required in YARN cluster mode.
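
A minimal hive-site.xml sketch, assuming a remote Metastore exposed over Thrift (the hostname and port below are placeholders, and your deployment may require additional properties):

<configuration>
  <!-- Tells Spark where to find the Hive Metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-hostname:9083</value>
  </property>
</configuration>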

That is all the preparation needed. The following code shows how to import the table.

Scala

To read data from Hive in Sparkling Water, you can use the method:

val airlinesTable = h2oContext.importHiveTable("default", "airlines")

Python

To read data from Hive in PySparkling, you can use the method:

airlines_frame = h2o.import_hive_table("default", "airlines")

R

To read data from Hive in RSparkling, you can use the method:

airlines_frame <- h2o.import_hive_table("default", "airlines")

This call reads the airlines table from the default database.
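
The imported table is returned as an H2OFrame, which you can inspect or hand back to Spark as a DataFrame. A minimal PySparkling sketch (assuming a running H2OContext; method names follow the Sparkling Water 3.30 API and are indicative):

import h2o
from pysparkling import H2OContext

hc = H2OContext.getOrCreate()  # reuses the already running H2OContext
airlines_frame = h2o.import_hive_table("default", "airlines")
print(airlines_frame.dim)  # [number of rows, number of columns]
spark_df = hc.asSparkFrame(airlines_frame)  # convert back to a Spark DataFrame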

Import Data from Hive in a Kerberized Hadoop Cluster

This feature reads data from Hive via a standard JDBC connection. Before the connection to Hive is made, the user has to authenticate with the Hive instance, obtain a delegation token, and pass that token to Sparkling Water. Sparkling Water ensures that the delegation token is refreshed automatically, so the token never expires in long-running Sparkling Water applications.

Obtain the Hive JDBC Client Jar

To be able to connect to Hive, Sparkling Water needs the Hive JDBC client jar on its class path. The jar ships with Hive distributions as hive-jdbc-<version>-standalone.jar and is also available from Maven Central.

Generate Initial Token

The initial delegation token can be generated with the following steps.

Authenticate your user against Kerberos.

kinit <your_user_name>

Put the Hive JDBC client jar on the Hadoop class path.

export HADOOP_CLASSPATH=/path/to/hive-jdbc-<version>-standalone.jar

Set the path to the Sparkling Water assembly jar, which is bundled in the Sparkling Water archive.

SW_ASSEMBLY=/path/to/sparkling-water-3.30.0.3-1-2.3/jars/sparkling-water-assembly_2.11-3.30.0.3-1-2.3-all.jar

Generate the delegation token with the following arguments:
  • hiveHost - The full address of HiveServer2, for example hostname:10000

  • hivePrincipal - Hiveserver2 Kerberos principal, for example hive/hostname@DOMAIN.COM

  • tokenFile - The output file to which the delegation token will be written

hadoop jar $SW_ASSEMBLY water.hive.GenerateHiveToken -hiveHost <your_hive_host> -hivePrincipal <your_hive_principal> -tokenFile hive.token
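
The command writes the delegation token into hive.token in the working directory; the examples below read this file as text and pass its contents to setHiveToken. As a quick sanity check, a minimal Python sketch (the assertion message is illustrative):

with open("hive.token") as token_file:
    token = token_file.read().strip()
assert token, "hive.token is empty; token generation likely failed"
print("token length:", len(token))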

Run Sparkling Water with Hive Support for Kerberized Hadoop Cluster

To run Sparkling Water with Hive support on a Kerberized Hadoop cluster, you must configure the Sparkling Water options shown below:

Python

First, start the PySparkling shell with the Hive JDBC client jar on the class path

./bin/pysparkling --jars /path/to/hive-jdbc-<version>-standalone.jar

Create H2OContext with properties ensuring connectivity to Hive

from pysparkling import *
conf = H2OConf()
conf.setHiveSupportEnabled()
conf.setHiveHost("hostname:10000") # The full address of HiveServer2
conf.setHivePrincipal("hive/hostname@DOMAIN.COM") # Hiveserver2 Kerberos principal
conf.setHiveJdbcUrlPattern("jdbc:hive2://{{host}}/;{{auth}}") # Doesn't have to be specified if host is set
with open('hive.token', 'r') as tokenFile:
    token = tokenFile.read()
    conf.setHiveToken(token)
H2OContext.getOrCreate(conf)

Import the data table from Hive

import h2o
airlines_frame = h2o.import_hive_table("jdbc:hive2://hostname:10000/default;auth=delegationToken", "airlines")

R

Run your R environment, install the required libraries according to the RSparkling tutorial, and then create a Spark context with the Hive JDBC client jar on the class path.

library(sparklyr)
library(rsparkling)
conf <- spark_config()
conf$sparklyr.jars.default <- "/path/to/hive-jdbc-<version>-standalone.jar"
sc <- spark_connect(master = "yarn-client", config = conf)

Create H2OContext with properties ensuring connectivity to Hive

h2oConf <- H2OConf()
h2oConf$setHiveSupportEnabled()
h2oConf$setHiveHost("hostname:10000")
h2oConf$setHivePrincipal("hive/hostname@DOMAIN.COM")
tokenFile <- 'hive.token'
token <- readChar(tokenFile, file.info(tokenFile)$size)
h2oConf$setHiveToken(token)
H2OContext.getOrCreate(h2oConf)

Import the data table from Hive

library(h2o)
frame <- h2o.import_hive_table("jdbc:hive2://hostname:10000/default;auth=delegationToken", "airlines")