Hive Support in Sparkling Water ------------------------------------- Spark support reading data natively from Hive and H2O supports that in a Hadoop environment as well. In Sparkling Water you can decide which tool you want to use for this task. This tutorial explains what is needed to use H2O to read data from Hive in Sparkling Water environment. Import Data from Hive via Hive Metastore ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Make sure ``$SPARK_HOME/conf`` contains the hive-site.xml with your Hive configuration. - In YARN client mode or any local mode, please copy the required connector jars for your Metastore to ``$SPARK_HOME/jars``. You can find these jars in ``$HIVE_HOME/lib directory``. For example, if you are using MySQL as a Metastore for Hive, copy MySQL metastore jdbc connector. This is not required in YARN cluster mode. This is all preparation we need to do. The following code shows how to import the table. .. content-tabs:: .. tab-container:: Scala :title: Scala To read data from Hive in Sparkling Water, you can use the method: .. code:: Scala val airlinesTable = h2oContext.importHiveTable("default", "airlines") .. tab-container:: Python :title: Python To read data from Hive in PySparkling, you can use the method: .. code:: python airlines_frame = h2o.import_hive_table("default", "airlines") .. tab-container:: R :title: R To read data from Hive in RSparkling, you can use the method: .. code:: R airlines_frame <- h2o.import_hive_table("default", "airlines") This call reads airlines table from default database. Import Data from Hive in a Kerberized Hadoop Cluster ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This feature reads data from Hive via a standard JDBC connection. Before a given connection to Hive is made, a user has to be authenticated with the Hive instance via a delegation token and pass the delegation token to Sparkling Water. Sparkling Water ensures that the delegation token is being automatically refreshed, thus delegation token never expires in long-running Sparkling Water applications. Obtain the Hive JDBC Client Jar ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To be able to connect to Hive, Sparkling Water will need Hive JDBC Client Jar on the class path. The jar can be obtained from in the following ways. - For Hortonworks, Hive JDBC client jars can be found at ``/usr/hdp/current/hive-client/lib/hive-jdbc--standalone.jar`` on one of the edge nodes after you have installed HDP. For more information, see Hortenworks documentation https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_data-access/content/hive-jdbc-odbc-drivers.html. - For Cloudera, install the JDBC package for your operating system according to instructions stated on Cloudera documentation https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_hive_jdbc_install.html. The path to the jar then will be ``/usr/lib/hive/lib/hive-jdbc--standalone.jar`` - You can also retrieve this from Maven for the desired version using ``mvn dependency:get -Dartifact=groupId:artifactId:version``. Generate Initial Token ^^^^^^^^^^^^^^^^^^^^^^ The initial delegation token can be generated with the following steps. Authenticate your user against Kerberos. .. code:: bash kinit Put Hive JDBC client jar on the hadoop class path. .. code:: bash export HADOOP_CLASSPATH=/path/to/hive-jdbc--standalone.jar Set path to sparkling-water-assembly-SUBST_SW_VERSION-all.jar which is bundled in Sparkling Water archive. .. code:: bash SW_ASSEMBLY=/path/to/sparkling-water-SUBST_SW_VERSION/jars/sparkling-water-assembly_SUBST_SCALA_BASE_VERSION-SUBST_SW_VERSION-all.jar Get the delegation token generated with arguments: - ``hiveHost`` - The full address of HiveServer2, for example ``hostname:10000`` - ``hivePrincipal`` - Hiveserver2 Kerberos principal, for example ``hive/hostname@DOMAIN.COM`` - ``tokenFile`` - The output file which the delegation token will be generated to .. code:: bash hadoop jar $SW_ASSEMBLY water.hive.GenerateHiveToken -hiveHost -hivePrincipal -tokenFile hive.token Run Sparkling Water with Hive Support for Kerberized Hadoop Cluster ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To run Sparkling Water with Hive support for kerberized hadoop cluster, you must configure the following sparkling water options: .. content-tabs:: .. tab-container:: Python :title: Python First, start PySparkling Shell with the Hive JDBC client jar on the class path .. code:: bash ./bin/pysparkling --jars /path/to/hive-jdbc--standalone.jar Create ``H2OContext`` with properties ensuring connectivity to Hive .. code:: python from pysparkling import * conf = H2OConf() conf.setHiveSupportEnabled() conf.setHiveHost("hostname:10000") # The full address of HiveServer2 conf.setHivePrincipal("hive/hostname@DOMAIN.COM") # Hiveserver2 Kerberos principal conf.setHiveJdbcUrlPattern("jdbc:hive2://{{host}}/;{{auth}}") # Doesn't have to be specified if host is set with open('hive.token', 'r') as tokenFile: token = tokenFile.read() conf.setHiveToken(token) H2OContext.getOrCreate(conf) Import data table from Hive .. code:: R import h2o h2o.import_hive_table("jdbc:hive2://hostname:10000/default;auth=delegationToken", "airlines") .. tab-container:: R :title: R Run your R environment and install required libraries according to :ref:`rsparkling` tutorial and then create Spark context with the Hive JDBC client jar on the class path. .. code:: R library(sparklyr) library(rsparkling) conf <- spark_config() conf$sparklyr.jars.default <- "/path/to/hive-jdbc--standalone.jar" sc <- spark_connect(master = "yarn-client", config = conf) Create ``H2OContext`` with properties ensuring connectivity to Hive .. code:: R h2oConf <- H2OConf() h2oConf$setHiveSupportEnabled() h2oConf$setHiveHost("hostname:10000") h2oConf$setHivePrincipal("hive/hostname@DOMAIN.COM") tokenFile <- 'hive.token' token <- readChar(tokenFile, file.info(tokenFile)$size) h2oConf$setHiveToken(token) H2OContext.getOrCreate(h2oConf) Import data table from Hive .. code:: R library(h2o) frame <- h2o.import_hive_table("jdbc:hive2://hostname:10000/default;auth=delegationToken", "airlines")