Sparkling Water Configuration Properties¶

The following configuration properties can be passed to Spark to configure Sparking Water.

Configuration properties independent of selected backend¶

Property name	Default value	H2OConf setter (* getter)	Description
Generic parameters
`spark.ext.h2o.backend.cluster.mode`	`internal`	`setInternalClusterMode()` `setExternalClusterMode()`	This option can be set either to `internal` or `external`. When set to `external`, `H2O Context` is created by connecting to existing H2O cluster, otherwise H2O cluster located inside Spark is created. That means that each Spark executor will have one H2O instance running in it. The `internal` mode is not recommended for big clusters and clusters where Spark executors are not stable.
`spark.ext.h2o.cloud.name`	Generated unique name	`setCloudName(String)`	Name of H2O cluster.
`spark.ext.h2o.nthreads`	`-1`	`setNthreads(Integer)`	Limit for number of threads used by H2O, default `-1` means: Use value of `spark.executor.cores` in case this property is set. Otherwise use H2O’s default value `Runtime.getRuntime().availableProcessors()`.
`spark.ext.h2o.repl.enabled`	`true`	`setReplEnabled()` `setReplDisabled()`	Decides whether H2O REPL is initiated or not.
`spark.ext.scala.int.default.num`	`1`	`setDefaultNumReplSessions(Integer)`	Number of parallel REPL sessions started at the start of Sparkling Water
`spark.ext.h2o.topology.change.listener.enabled`	`true`	`setClusterTopologyListenerEnabled()` `setClusterTopologyListenerDisabled()`	Decides whether listener which kills H2O cluster on the change of the underlying cluster’s topology is enabled or not. This configuration has effect only in non-local mode.
`spark.ext.h2o.spark.version.check.enabled`	`true`	`setSparkVersionCheckEnabled()` `setSparkVersionCheckDisabled()`	Enables check if run-time Spark version matches build time Spark version.
`spark.ext.h2o.fail.on.unsupported.spark.param`	`true`	`setFailOnUnsupportedSparkParamEnabled()` `setFailOnUnsupportedSparkParamDisabled()`	If unsupported Spark parameter is detected, then application is forced to shutdown.
`spark.ext.h2o.jks`	`None`	`setJks(String)`	Path to Java KeyStore file.
`spark.ext.h2o.jks.pass`	`None`	`setJksPass(String)`	Password for Java KeyStore file.
`spark.ext.h2o.jks.alias`	`None`	`setJksAlias(String)`	Alias to certificate in keystore to secure H2O Flow.
`spark.ext.h2o.hash.login`	`false`	`setHashLoginEnabled()` `setHashLoginDisabled()`	Enable hash login.
`spark.ext.h2o.ldap.login`	`false`	`setLdapLoginEnabled()` `setLdapLoginDisabled()`	Enable LDAP login.
`spark.ext.h2o.kerberos.login`	`false`	`setKerberosLoginEnabled()` `setKerberosLoginDisabled()`	Enable Kerberos login.
`spark.ext.h2o.login.conf`	`None`	`setLoginConf(String)`	Login configuration file.
`spark.ext.h2o.user.name`	`None`	`setUserName(String)`	Username used for the backend H2O cluster and to authenticate the client against the backend.
`spark.ext.h2o.password`	`None`	`setPassword(String)`	Password used to authenticate the client against the backend.
`spark.ext.h2o.internal_security_conf`	`None`	`setSslConf(String)`	Path to a file containing H2O or Sparkling Water internal security configuration.
`spark.ext.h2o.auto.flow.ssl`	`false`	`setAutoFlowSslEnabled()` `setAutoFlowSslDisabled()`	Automatically generate the required key store and password to secure H2O flow by SSL.
`spark.ext.h2o.node.log.level`	`INFO`	`setH2ONodeLogLevel(String)`	H2O internal log level used for H2O nodes except the client.
`spark.ext.h2o.node.log.dir`	`{user.dir}/h2ologs/{SparkAppId}` or YARN container dir	`setH2ONodeLogDir(String)`	Location of H2O logs on H2O nodes except on the client.
`spark.ext.h2o.backend.heartbeat.interval`	`10000ms`	`setBackendHeartbeatInterval(Integer)`	Interval for getting heartbeat from the H2O backend.
`spark.ext.h2o.cloud.timeout`	`60*1000`	`setCloudTimeout(Integer)`	Timeout (in msec) for cluster formation.
`spark.ext.h2o.node.network.mask`	`None`	`setNodeNetworkMask(String)`	Subnet selector for H2O running inside Spark executors. This disables using IP reported by Spark but tries to find IP based on the specified mask.
`spark.ext.h2o.stacktrace.collector.interval`	`-1`	`setStacktraceCollectorInterval(Integer)`	Interval specifying how often stack traces are taken on each H2O node. -1 means that no stack traces will be taken.
`spark.ext.h2o.context.path`	`None`	`setContextPath(String)`	Context path to expose H2O web server.
`spark.ext.h2o.flow.scala.cell.async`	`false`	`setFlowScalaCellAsyncEnabled()` `setFlowScalaCellAsyncDisabled()`	Decide whether the Scala cells in H2O Flow will run synchronously or Asynchronously. Default is synchronously.
`spark.ext.h2o.flow.scala.cell.max.parallel`	`-1`	`setMaxParallelScalaCellJobs(Integer)`	Number of max parallel Scala cell jobs The value -1 means not limited.
`spark.ext.h2o.internal.port.offset`	`1`	`setInternalPortOffset(Integer)`	Offset between the API(=web) port and the internal communication port on the client node; `api_port + port_offset = h2o_port`
`spark.ext.h2o.node.port.base`	`54321`	`setNodeBasePort(Integer)`	Base port used for individual H2O nodes.
`spark.ext.h2o.mojo.destroy.timeout`	`600000`	`setMojoDestroyTimeout(Integer)`	If a scoring MOJO instance is not used within a Spark executor JVM for a given timeout in milliseconds, it’s evicted from executor’s cache. Default timeout value is 10 minutes.
`spark.ext.h2o.node.extra`	`None`	`setNodeExtraProperties(String)`	A string containing extra parameters passed to H2O nodes during startup. This parameter should be configured only if H2O parameters do not have any corresponding parameters in Sparkling Water.
`spark.ext.h2o.flow.extra.http.headers`	`None`	`setFlowExtraHttpHeaders(Map[String,String])` `setFlowExtraHttpHeaders(String)`	Extra HTTP headers that will be used in communication between the front-end and back-end part of Flow UI. The headers should be delimited by a new line. Don’t forget to escape special characters when passing the parameter from a command line.
`spark.ext.h2o.internal_secure_connections`	`false`	`setInternalSecureConnectionsEnabled()` `setInternalSecureConnectionsDisabled()`	Enables secure communications among H2O nodes. The security is based on automatically generated keystore and truststore. This is equivalent for `-internal_secure_conections` option in H2O Hadoop deployments.
`spark.ext.h2o.allow_insecure_xgboost`	`false`	`setInsecureXGBoostAllowed()` `setInsecureXGBoostDenied()`	If the property set to true, insecure communication among H2O nodes is allowed for the XGBoost algorithm even if the property `spark.ext.h2o.internal_secure_connections` is set to `true`.
`spark.ext.h2o.kerberized.hive.enabled`	`false`	`setKerberizedHiveEnabled()` `setKerberizedHiveDisabled()`	If enabled, H2O instances will create JDBC connections to a Kerberized Hive so that all clients can read data from HiveServer2. Don’t forget to put a jar with Hive driver on Spark classpath if the internal backend is used.
`spark.ext.h2o.hive.host`	`None`	`setHiveHost(String)`	The full address of HiveServer2, for example hostname:10000
`spark.ext.h2o.hive.principal`	`None`	`setHivePrincipal(String)`	Hiveserver2 Kerberos principal, for example hive/hostname@DOMAIN.COM
`spark.ext.h2o.hive.jdbc_url_pattern`	`None`	`setHiveJdbcUrlPattern(String)`	A pattern of JDBC URL used for connecting to Hiveserver2. Example: `jdbc:hive2://{{host}}/;{{auth}}`
`spark.ext.h2o.hive.token`	`None`	`setHiveToken(String)`	An authorization token to Hive
H2O client parameters
`spark.ext.h2o.client.flow.dir`	`None`	`setFlowDir(String)`	Directory where flows from H2O Flow are saved.
`spark.ext.h2o.client.ip`	`None`	`setClientIp(String)`	IP of H2O client node.
`spark.ext.h2o.client.iced.dir`	`None`	`setClientIcedDir(String)`	Location of iced directory for the driver instance.
`spark.ext.h2o.client.log.level`	`INFO`	`setH2OClientLogLevel(String)`	H2O internal log level used for H2O client running inside Spark driver.
`spark.ext.h2o.client.log.dir`	`{user.dir}/h2ologs/{SparkAppId}`	`setH2OClientLogDir(String)`	Location of H2O logs on the driver machine.
`spark.ext.h2o.client.port.base`	`54321`	`setClientBasePort(Integer)`	Port on which H2O client publishes its API. If already occupied, the next odd port is tried on so on.
`spark.ext.h2o.client.web.port`	`-1`	`setClientWebPort(Integer)`	Exact client port to access web UI. The value `-1` means automatic search for a free port starting at `spark.ext.h2o.port.base`.
`spark.ext.h2o.client.verbose`	`false`	`setClientVerboseEnabled()` `setClientVerboseDisabled()`	The client outputs verbose log output directly into console. Enabling the flag increases the client log level to `INFO`.
`spark.ext.h2o.client.network.mask`	`None`	`setClientNetworkMask(String)`	Subnet selector for H2O client, this disables using IP reported by Spark but tries to find IP based on the specified mask.
`spark.ext.h2o.client.flow.baseurl.override`	`None`	`setClientFlowBaseurlOverride(String)`	Allows to override the base URL address of Flow UI, including the scheme, which is showed to the user.
`spark.ext.h2o.cluster.client.retry.timeout`	`60000`	`setClientCheckRetryTimeout(Integer)`	Timeout in milliseconds specifying how often we check whether the the client is still connected.
`spark.ext.h2o.client.extra`	`None`	`setClientExtraProperties(String)`	A string containing extra parameters passed to H2O client during startup. This parameter should be configured only if H2O parameters do not have any corresponding parameters in Sparkling Water.
`spark.ext.h2o.verify_ssl_certificates`	`True`	`setVerifySslCertificates(Boolean)`	Whether certificates should be verified before using in H2O or not.
`spark.ext.h2o.rest.api.timeout`	`3601000`	`setSessionTimeout(Boolean)`	Timeout in milliseconds for Rest API requests.

Internal backend configuration properties¶

Property name	Default value	H2OConf setter (* getter)	Description
Generic parameters
`spark.ext.h2o.cluster.size`	`None`	`setNumH2OWorkers(Integer)`	Expected number of workers of H2O cluster. Value None means automatic detection of cluster size. This number must be equal to number of Spark executors.
`spark.ext.h2o.dummy.rdd.mul.factor`	`10`	`setDrddMulFactor(Integer)`	Multiplication factor for dummy RDD generation. Size of dummy RDD is `spark.ext.h2o.cluster.size` * `spark.ext.h2o.dummy.rdd.mul.factor` .
`spark.ext.h2o.spreadrdd.retries`	`10`	`setNumRddRetries(Integer)`	Number of retries for creation of an RDD spread across all existing Spark executors.
`spark.ext.h2o.default.cluster.size`	`20`	`setDefaultCloudSize(Integer)`	Starting size of cluster in case that size is not explicitly configured.
`spark.ext.h2o.subseq.tries`	`5`	`setSubseqTries(Integer)`	Subsequent successful tries to figure out size of Spark cluster, which are producing the same number of nodes.
`spark.ext.h2o.hdfs_conf`	`sc.hadoopConfig`	`setHdfsConf(String)`	Either a string with the Path to a file with Hadoop HDFS configuration or the org.apache.hadoop.conf.Configuration object. Useful for HDFS credentials settings and other HDFS-related configurations.
`spark.ext.h2o.spreadrdd.retries.timeout`	`0`	`setSpreadRddRetriesTimeout(Int)`	Specifies how long the discovering of Spark executors should last. This option has precedence over other options influencing the discovery mechanism.That means that as long as the timeout hasn’t expired, we keep trying to discover new executors. This option might be useful in environments where Spark executors might join the cloud with some delays.
H2O nodes parameters
`spark.ext.h2o.node.iced.dir`	`None`	`setNodeIcedDir(String)`	Location of iced directory for H2O nodes on the Spark executors.

External backend configuration properties¶

Property name	Default value	H2OConf setter (* getter)	Description
`spark.ext.h2o.cloud.representative`	`None`	`setH2OCluster(String)`	ip:port of arbitrary H2O node to identify external H2O cluster.
`spark.ext.h2o.external.cluster.size`	`None`	`setClusterSize(Integer)`	Number of H2O nodes to start when `auto` mode of the external backend is set.
`spark.ext.h2o.cluster.start.timeout`	`120s`	`setClusterStartTimeout(Integer)`	Timeout in seconds for starting H2O external cluster.
`spark.ext.h2o.cluster.info.name`	`None`	`setClusterInfoFile(Integer)`	Full path to a file which is used sd the notification file for the startup of external H2O cluster.
`spark.ext.h2o.hadoop.memory`	`6G`	`setMapperXmx(String)`	Amount of memory assigned to each H2O node on YARN/Hadoop.
`spark.ext.h2o.external.hdfs.dir`	`None`	`setHDFSOutputDir(String)`	Path to the directory on HDFS used for storing temporary files.
`spark.ext.h2o.external.start.mode`	`manual`	`useAutoClusterStart()` `useManualClusterStart()`	If this option is set to `auto` then H2O external cluster is automatically started using the provided H2O driver JAR on YARN, otherwise it is expected that the cluster is started by the user manually.
`spark.ext.h2o.external.h2o.driver`	`None`	`setH2ODriverPath(String)`	Path to H2O driver used during `auto` start mode.
`spark.ext.h2o.external.yarn.queue`	`None`	`setYARNQueue(String)`	Yarn queue on which external H2O cluster is started.
`spark.ext.h2o.external.kill.on.unhealthy`	`true`	`setKillOnUnhealthyClusterEnabled()` `setKillOnUnhealthyClusterDisabled()`	If true, the client will try to kill the cluster and then itself in case some nodes in the cluster report unhealthy status.
`spark.ext.h2o.external.kerberos.principal`	`None`	`setKerberosPrincipal(String)`	Kerberos Principal.
`spark.ext.h2o.external.kerberos.keytab`	`None`	`setKerberosKeytab(String)`	Kerberos Keytab.
`spark.ext.h2o.external.run.as.user`	`None`	`setRunAsUser(String)`	Impersonated Hadoop user.
`spark.ext.h2o.external.driver.if`	`None`	`setExternalH2ODriverIf(String)`	Ip address or network of mapper->driver callback interface. Default value means automatic detection.
`spark.ext.h2o.external.driver.port`	`None`	`setExternalH2ODriverPort(Integer)`	Port of mapper->driver callback interface. Default value means automatic detection.
`spark.ext.h2o.external.driver.port.range`	`None`	`setExternalH2ODriverPortRange(String)`	Range portX-portY of mapper->driver callback interface; eg: 50000-55000.
`spark.ext.h2o.external.extra.memory.percent`	`10`	`setExternalExtraMemoryPercent(Integer)`	This option is a percentage of `spark.ext.h2o.hadoop.memory` and specifies memory for internal JVM use outside of Java heap.
`spark.ext.h2o.external.backend.stop.timeout`	`10000ms`	`setExternalBackendStopTimeout(Integer)`	Timeout for confirmation from worker nodes when stopping the external backend. It is also possible to pass `-1` to ensure the indefinite timeout. The unit is milliseconds.
`spark.ext.h2o.external.hadoop.executable`	`hadoop`	`setExternalHadoopExecutable(String)`	Name or path to path to a hadoop executable binary which is used to start external H2O backend on YARN.
`spark.ext.h2o.external.extra.jars`	`None`	`setExternalExtraJars(String)` `setExternalExtraJars(String[])`	Comma-separated paths to jars that will be placed onto classpath of each H2O node.
`spark.ext.h2o.external.communication.compression`	`SNAPPY`	`setExternalCommunicationCompression(String)`	The type of compression used for data transfer between Spark and H2O node. Possible values are `NONE`, `DEFLATE`, `GZIP`, `SNAPPY`.

H2OConf getter can be derived from the corresponding setter. All getters are parameter-less. If the type of the property is Boolean, the getter is prefixed with is (E.g. setReplEnabled() -> isReplEnabled()). Property getters of other types do not have any prefix and start with lowercase (E.g. setUserName(String) -> userName for Scala, userName() for Python).