Sparkling Water Configuration Properties

The following configuration properties can be passed to Spark to configure Sparking Water.

Configuration properties independent of selected backend

Property name Default value Description
Generic parameters    
spark.ext.h2o.backend.cluster.mode internal This option can be set either to internal or external. When set to external, H2O Context is created by connecting to existing H2O cluster, otherwise H2O cluster located inside Spark is created. That means that each Spark executor will have one H2O instance running in it. The internal mode is not recommended for big clusters and clusters where Spark executors are not stable.
spark.ext.h2o.cloud.name Generated unique name Name of H2O cluster.
spark.ext.h2o.nthreads -1 Limit for number of threads used by H2O, default -1 means: Use value of spark.executor.cores in case this property is set. Otherwise use H2O’s default value Runtime.getRuntime().availableProcessors().
spark.ext.h2o.repl.enabled true Decides whether H2O REPL is initiated or not.
spark.ext.scala.int.default.num 1 Number of parallel REPL sessions started at the start of Sparkling Water
spark.ext.h2o.topology.change.listener.enabled true Decides whether listener which kills H2O cluster on the change of the underlying cluster’s topology is enabled or not. This configuration has effect only in non-local mode.
spark.ext.h2o.spark.version.check.enabled true Enables check if run-time Spark version matches build time Spark version.
spark.ext.h2o.fail.on.unsupported.spark.param true If unsupported Spark parameter is detected, then application is forced to shutdown.
spark.ext.h2o.jks None Path to Java KeyStore file.
spark.ext.h2o.jks.pass None Password for Java KeyStore file.
spark.ext.h2o.hash.login false Enable hash login.
spark.ext.h2o.ldap.login false Enable LDAP login.
spark.ext.h2o.kerberos.login false Enable Kerberos login.
spark.ext.h2o.login.conf None Login configuration file.
spark.ext.h2o.user.name None Override user name for cluster.
spark.ext.h2o.internal_security_conf None Path to a file containing H2O or Sparkling Water internal security configuration.
spark.ext.h2o.auto.flow.ssl false Automatically generate the required key store and password to secure H2O flow by SSL.
spark.ext.h2o.node.log.level INFO H2O internal log level used for H2O nodes except the client.
spark.ext.h2o.node.log.dir {user.dir}/h 2ologs/{SparkA ppId} or YARN container dir Location of H2O logs on H2O nodes except on the client.
spark.ext.h2o.ui.update.interval 10000ms Interval for updates of the Spark UI and History server in milliseconds
spark.ext.h2o.cloud.timeout 60*1000 Timeout (in msec) for cluster formation.
spark.ext.h2o.node.enable.web false Enable or disable web on H2O worker nodes. It is disabled by default for security reasons.
spark.ext.h2o.node.network.mask None Subnet selector for H2O running inside Spark executors. This disables using IP reported by Spark but tries to find IP based on the specified mask.
spark.ext.h2o.stacktrace.collector.interval -1 Interval specifying how often stack traces are taken on each H2O node. -1 means that no stack traces will be taken.
spark.ext.h2o.context.path None Context path to expose H2O web server.
spark.ext.h2o.flow.scala.cell.async false Decide whether the Scala cells in H2O Flow will run synchronously or Asynchronously. Default is synchronously.
spark.ext.h2o.flow.scala.cell.max.parallel -1 Number of max parallel Scala cell jobs The value -1 means not limited.
spark.ext.h2o.internal.port.offset 1 Offset between the API(=web) port and the internal communication port on the client node; api_port + port_offset = h2o_port
H2O client parameters    
spark.ext.h2o.client.flow.dir None Directory where flows from H2O Flow are saved.
spark.ext.h2o.client.ip None IP of H2O client node.
spark.ext.h2o.client.iced.dir None Location of iced directory for the driver instance.
spark.ext.h2o.client.log.level INFO H2O internal log level used for H2O client running inside Spark driver.
spark.ext.h2o.client.log.dir {user.dir}/h 2ologs/{SparkA ppId} Location of H2O logs on the driver machine.
spark.ext.h2o.client.port.base 54321 Port on which H2O client publishes its API. If already occupied, the next odd port is tried on so on.
spark.ext.h2o.client.web.port -1 Exact client port to access web UI. The value -1 means automatic search for a free port starting at spark.ext.h2o.port.base.
spark.ext.h2o.client.verbose false The client outputs verbose log output directly into console. Enabling the flag increases the client log level to INFO.
spark.ext.h2o.client.network.mask None Subnet selector for H2O client, this disables using IP reported by Spark but tries to find IP based on the specified mask.
spark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS false Ignore SPARK_PUBLIC_DNS setting on the H2O client. The option still applies to the Spark application.
spark.ext.h2o.client.enable.web true Enable or disable web on h2o client node. It is enabled by default. Disabling the web just on the client node just restricts everybody from accessing flow, the internal ports between client and rest of the cluster remain open.
spark.ext.h2o.client.flow.baseurl.override None Allows to override the base URL address of Flow UI, including the scheme, which is showed to the user.

Internal backend configuration properties

Property name Default value Description
Generic parameters    
spark.ext.h2o.flatfile true Use flatfile instead of multicast approach for creating H2O cluster.
spark.ext.h2o.ip.based.flatfile false Translate hostnames of the worker nodes discovered by Spark to the IP address. The generated flatfile will contain these IP address. Can be useful in certain restricted DNS environments.
spark.ext.h2o.cluster.size None Expected number of workers of H2O cluster. Value None means automatic detection of cluster size. This number must be equal to number of Spark executors.
spark.ext.h2o.dummy.rdd.mul.factor 10 Multiplication factor for dummy RDD generation. Size of dummy RDD is spark.ext.h2o.cluster.size * spark.ext.h2o.dummy.rdd.mul.factor .
spark.ext.h2o.spreadrdd.retries 10 Number of retries for creation of an RDD spread across all existing Spark executors.
spark.ext.h2o.default.cluster.size 20 Starting size of cluster in case that size is not explicitly configured.
spark.ext.h2o.subseq.tries 5 Subsequent successful tries to figure out size of Spark cluster, which are producing the same number of nodes.
spark.ext.h2o.internal_secure_connections false Enables secure communications among H2O nodes. The security is based on automatically generated keystore and truststore. This is equivalent for -internal_secure_conections option in H2O Hadoop deployments.
H2O nodes parameters    
spark.ext.h2o.node.port.base 54321 Base port used for individual H2O nodes.
spark.ext.h2o.node.iced.dir None Location of iced directory for H2O nodes on the Spark executors.

External backend configuration properties

Property name Default value Description
spark.ext.h2o.cloud.representative None ip:port of arbitrary H2O node to identify external H2O cluster.
spark.ext.h2o.external.cluster.num.h2o.nodes None Number of H2O nodes to start in auto mode and wait for in manual mode when starting Sparkling Water in external H2O cluster mode.
spark.ext.h2o.cluster.client.retry.timeout 60000ms Timeout in milliseconds specifying how often the check for availability of connected watchdog client is done.
spark.ext.h2o.cluster.client.connect.timeout 180000ms Timeout in milliseconds for watchdog client connection. If the client is not connected to the external cluster in the given time ,the cluster is killed.
spark.ext.h2o.external.read.confirmation.timeout 60s Timeout for confirmation of read operation (H2O frame => Spark frame) on external cluster.
spark.ext.h2o.external.write.confirmation.timeout 60s Timeout for confirmation of write operation (Spark frame => H2O frame) on external cluster.
spark.ext.h2o.cluster.start.timeout 120s Timeout in seconds for starting H2O external cluster.
spark.ext.h2o.cluster.info.name None Full path to a file which is used sd the notification file for the startup of external H2O cluster.
spark.ext.h2o.hadoop.memory 6G Amount of memory assigned to each H2O node on YARN/Hadoop.
spark.ext.h2o.external.hdfs.dir None Path to the directory on HDFS used for storing temporary files.
spark.ext.h2o.external.start.mode manual If this option is set to auto then H2O external cluster is automatically started using the provided H2O driver JAR on YARN, otherwise it is expected that the cluster is started by the user manually.
spark.ext.h2o.external.h2o.driver None Path to H2O driver used during auto start mode.
spark.ext.h2o.external.yarn.queue None Yarn queue on which external H2O cluster is started.
spark.ext.h2o.external.driver.if None IP address of H2O driver in case of external cluster in automatic mode.
spark.ext.h2o.external.health.check.interval HeartBeatThr ead.TIMEOUT Health check interval for external H2O nodes.
spark.ext.h2o.external.kill.on.unhealthy true If true, the client will try to kill the cluster and then itself in case some nodes in the cluster report unhealthy status.
spark.ext.h2o.external.kill.on.unhealthy.interval HeartBeatThr ead.TIMEOUT * 3 How often check the healthy status for the decision whether to kill the cloud or not.
spark.ext.h2o.external.kerberos.principal None Kerberos Principal.
spark.ext.h2o.external.kerberos.keytab None Kerberos Keytab.
spark.ext.h2o.external.run.as.user None Impersonated Hadoop user.