Sparkling Water Configuration Properties

The following configuration properties can be passed to Spark to configure Sparkling Water.
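As a minimal sketch of how such properties might be handed to Spark, the snippet below renders a dictionary of properties as `--conf key=value` arguments for `spark-submit`. The helper name `to_conf_args` is hypothetical and not part of Sparkling Water; the property names and values are taken from the tables in this document.

```python
import shlex


def to_conf_args(props):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(props):
        value = props[key]
        # Spark expects lowercase true/false, not Python's True/False.
        if isinstance(value, bool):
            value = str(value).lower()
        args.extend(["--conf", f"{key}={value}"])
    return args


props = {
    "spark.ext.h2o.backend.cluster.mode": "external",
    "spark.ext.h2o.repl.enabled": False,
    "spark.ext.h2o.cloud.timeout": 60000,
}
args = to_conf_args(props)
# Quote each argument so the command line survives shell parsing.
print("spark-submit " + " ".join(shlex.quote(a) for a in args))
```

The same key/value pairs could equally be set programmatically through the H2OConf setters listed in the tables.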

Configuration properties independent of selected backend

Property name

Default value

H2OConf setter (* getter)

Description

spark.ext.h2o.backend.cluster.mode

internal

setInternalClusterMode()

setExternalClusterMode()

This option can be set to either internal or external. When set to external, the H2O Context is created by connecting to an existing H2O cluster; otherwise, an H2O cluster is created inside the Spark cluster, which means that each Spark executor runs one H2O instance. The internal mode is not recommended for big clusters or for clusters where Spark executors are not stable.

spark.ext.h2o.cloud.name

None

setCloudName(String)

Name of the H2O cluster. If this option is not set, the name is generated automatically.

spark.ext.h2o.nthreads

-1

setNthreads(Integer)

Limit on the number of threads used by H2O. With the internal backend, the default -1 means: use the value of spark.executor.cores if that property is set, otherwise use H2O’s default value, Runtime.getRuntime().availableProcessors(). With an automatically started external backend on Hadoop, the default -1 means: use H2O’s default value, Runtime.getRuntime().availableProcessors(). With an automatically started external backend on Kubernetes, the default -1 means: use just one CPU.

spark.ext.h2o.progressbar.enabled

true

setProgressBarEnabled()

setProgressBarDisabled()

Decides whether to display progress bar related to H2O jobs on stdout or not.

spark.ext.h2o.model.print.after.training.enabled

true

setModelPrintAfterTrainingEnabled()

setModelPrintAfterTrainingDisabled()

Decides whether to display model info on stdout after training or not.

spark.ext.h2o.repl.enabled

true

setReplEnabled()

setReplDisabled()

Decides whether H2O REPL is initiated or not.

spark.ext.scala.int.default.num

1

setDefaultNumReplSessions(Integer)

Number of parallel REPL sessions started at the start of Sparkling Water.

spark.ext.h2o.topology.change.listener.enabled

true

setClusterTopologyListenerEnabled()

setClusterTopologyListenerDisabled()

Decides whether the listener that kills the H2O cluster when the underlying cluster’s topology changes is enabled. This configuration has an effect only in non-local mode.

spark.ext.h2o.spark.version.check.enabled

true

setSparkVersionCheckEnabled()

setSparkVersionCheckDisabled()

Enables a check that the run-time Spark version matches the build-time Spark version.

spark.ext.h2o.fail.on.unsupported.spark.param

true

setFailOnUnsupportedSparkParamEnabled()

setFailOnUnsupportedSparkParamDisabled()

If an unsupported Spark parameter is detected, the application is forced to shut down.

spark.ext.h2o.jks

None

setJks(String)

Path to a Java keystore file with certificates securing H2O Flow UI and internal REST connections between Spark instances (driver + executors) and H2O nodes. When configuring this property, keep in mind that a Spark executor can communicate with any of the H2O nodes and verifies each H2O node against the hostname specified in the keystore certificate. Consider using a wildcard certificate, or disable hostname verification completely with the spark.ext.h2o.internal.rest.verify_ssl_hostnames property.

spark.ext.h2o.jks.pass

None

setJksPass(String)

Password for the Java keystore file.

spark.ext.h2o.jks.alias

None

setJksAlias(String)

Alias of the certificate in the Java keystore file used to secure H2O Flow UI and internal REST connections between Spark instances (driver + executors) and H2O nodes.

spark.ext.h2o.ssl.ca.cert

None

setSslCACert(String)

A path to a CA bundle file or a directory with certificates of trusted CAs. This path is used by RSparkling or PySparkling for connecting to a Sparkling Water backend.

spark.ext.h2o.hash.login

false

setHashLoginEnabled()

setHashLoginDisabled()

Enable hash login.

spark.ext.h2o.ldap.login

false

setLdapLoginEnabled()

setLdapLoginDisabled()

Enable LDAP login.

spark.ext.h2o.kerberos.login

false

setKerberosLoginEnabled()

setKerberosLoginDisabled()

Enable Kerberos login.

spark.ext.h2o.login.conf

None

setLoginConf(String)

Login configuration file.

spark.ext.h2o.user.name

None

setUserName(String)

Username used for the backend H2O cluster and to authenticate the client against the backend.

spark.ext.h2o.password

None

setPassword(String)

Password used to authenticate the client against the backend.

spark.ext.h2o.internal_security_conf

None

setSslConf(String)

Path to a file containing H2O or Sparkling Water internal security configuration.

spark.ext.h2o.auto.flow.ssl

false

setAutoFlowSslEnabled()

setAutoFlowSslDisabled()

Automatically generate the required keystore and password to secure H2O Flow UI and internal REST connections between Spark executors and H2O nodes. Hostname verification is disabled when creating SSL connections to H2O nodes.

spark.ext.h2o.log.level

INFO

setLogLevel(String)

H2O log level.

spark.ext.h2o.log.dir

None

setLogDir(String)

Location of H2O logs. When not specified, it uses {user.dir}/h2ologs/{AppId} or the YARN container directory.

spark.ext.h2o.backend.heartbeat.interval

10000

setBackendHeartbeatInterval(Integer)

Interval (in msec) for getting heartbeat from the H2O backend.

spark.ext.h2o.cloud.timeout

60000

setCloudTimeout(Integer)

Timeout (in msec) for cluster formation.

spark.ext.h2o.node.network.mask

None

setNodeNetworkMask(String)

Subnet selector for H2O running inside Spark executors. This disables using the IP reported by Spark and instead tries to find the IP based on the specified mask.

spark.ext.h2o.stacktrace.collector.interval

-1

setStacktraceCollectorInterval(Integer)

Interval specifying how often stack traces are taken on each H2O node. The value -1 means that no stack traces will be taken.

spark.ext.h2o.context.path

None

setContextPath(String)

Context path to expose H2O web server.

spark.ext.h2o.flow.scala.cell.async

false

setFlowScalaCellAsyncEnabled()

setFlowScalaCellAsyncDisabled()

Decides whether the Scala cells in H2O Flow run synchronously or asynchronously. The default is synchronously.

spark.ext.h2o.flow.scala.cell.max.parallel

-1

setMaxParallelScalaCellJobs(Integer)

Number of max parallel Scala cell jobs. The value -1 means not limited.

spark.ext.h2o.internal.port.offset

1

setInternalPortOffset(Integer)

Offset between the API(=web) port and the internal communication port on the client node; api_port + port_offset = h2o_port

spark.ext.h2o.base.port

54321

setBasePort(Integer)

Base port used for individual H2O nodes

spark.ext.h2o.mojo.destroy.timeout

600000

setMojoDestroyTimeout(Integer)

If a scoring MOJO instance is not used within a Spark executor JVM for a given timeout in milliseconds, it’s evicted from executor’s cache. Default timeout value is 10 minutes.

spark.ext.h2o.extra.properties

None

setExtraProperties(String)

A string containing extra parameters passed to H2O nodes during startup. This parameter should be configured only if H2O parameters do not have any corresponding parameters in Sparkling Water.

spark.ext.h2o.flow.dir

None

setFlowDir(String)

Directory where flows from H2O Flow are saved.

spark.ext.h2o.flow.extra.http.headers

None

setFlowExtraHttpHeaders(Map[String,String])

setFlowExtraHttpHeaders(String)

Extra HTTP headers that will be used in communication between the front-end and back-end part of Flow UI. The headers should be delimited by a new line. Don’t forget to escape special characters when passing the parameter from a command line. Example: "spark.ext.h2o.flow.extra.http.headers=Strict-Transport-Security:max-age=31536000"

spark.ext.h2o.flow.proxy.request.maxSize

32768

setFlowProxyRequestMaxSize(Integer)

The maximum size of a request coming to the Flow UI proxy running on the Spark driver. The request is forwarded to Flow UI on the H2O leader node.

spark.ext.h2o.flow.proxy.response.maxSize

32768

setFlowProxyResponseMaxSize(Integer)

The maximum size of a response coming from the Flow UI proxy running on the Spark driver. The content of the response comes from Flow UI on the H2O leader node.

spark.ext.h2o.internal_secure_connections

false

setInternalSecureConnectionsEnabled()

setInternalSecureConnectionsDisabled()

Enables secure communications among H2O nodes. The security is based on an automatically generated keystore and truststore. This is equivalent to the -internal_secure_connections option in H2O Hadoop. More information is available in the H2O documentation.

spark.ext.h2o.allow_insecure_xgboost

false

setInsecureXGBoostAllowed()

setInsecureXGBoostDenied()

If the property is set to true, insecure communication among H2O nodes is allowed for the XGBoost algorithm even if the other security options are enabled.

spark.ext.h2o.client.ip

None

setClientIp(String)

IP of H2O client node.

spark.ext.h2o.client.web.port

-1

setClientWebPort(Integer)

Exact client port to access web UI. The value -1 means automatic search for a free port starting at spark.ext.h2o.base.port.

spark.ext.h2o.client.verbose

false

setClientVerboseEnabled()

setClientVerboseDisabled()

The client outputs verbose log output directly into console. Enabling the flag increases the client log level to INFO.

spark.ext.h2o.client.network.mask

None

setClientNetworkMask(String)

Subnet selector for the H2O client. This disables using the IP reported by Spark and instead tries to find the IP based on the specified mask.

spark.ext.h2o.client.flow.baseurl.override

None

setClientFlowBaseurlOverride(String)

Overrides the base URL address of Flow UI, including the scheme, which is shown to the user.

spark.ext.h2o.cluster.client.retry.timeout

60000

setClientCheckRetryTimeout(Integer)

Timeout in milliseconds specifying how often we check whether the client is still connected.

spark.ext.h2o.verify_ssl_certificates

true

setVerifySslCertificates(Boolean)

If the property is enabled, the PySparkling or RSparkling client will verify certificates when connecting to the Sparkling Water Flow UI.

spark.ext.h2o.internal.rest.verify_ssl_certificates

true

setSslCertificateVerificationInInternalRestConnectionsEnabled()

setSslCertificateVerificationInInternalRestConnectionsDisabled()

If the property is enabled, Sparkling Water will verify SSL certificates when establishing secured HTTP connections to one of the H2O nodes. Such connections are used for delegation of Flow UI calls to the H2O leader node or for data exchange between Spark executors and H2O nodes. If the property is disabled, hostname verification is disabled as well.

spark.ext.h2o.internal.rest.verify_ssl_hostnames

true

setSslHostnameVerificationInInternalRestConnectionsEnabled()

setSslHostnameVerificationInInternalRestConnectionsDisabled()

If the property is enabled, Sparkling Water will verify the hostname when establishing secured HTTP connections to one of the H2O nodes. Such connections are used for delegation of Flow UI calls to the H2O leader node or for data exchange between Spark executors and H2O nodes.

spark.ext.h2o.kerberized.hive.enabled

false

setKerberizedHiveEnabled()

setKerberizedHiveDisabled()

If enabled, H2O instances will create JDBC connections to a Kerberized Hive so that all clients can read data from HiveServer2. Don’t forget to put a jar with Hive driver on Spark classpath if the internal backend is used.

spark.ext.h2o.hive.host

None

setHiveHost(String)

The full address of HiveServer2, for example hostname:10000.

spark.ext.h2o.hive.principal

None

setHivePrincipal(String)

HiveServer2 Kerberos principal, for example hive/hostname@DOMAIN.COM.

spark.ext.h2o.hive.jdbc_url_pattern

None

setHiveJdbcUrlPattern(String)

A pattern of the JDBC URL used for connecting to HiveServer2. Example: jdbc:hive2://{{host}}/;{{auth}}

spark.ext.h2o.hive.token

None

setHiveToken(String)

An authorization token to Hive.

spark.ext.h2o.iced.dir

None

setIcedDir(String)

Location of iced directory for H2O nodes.

spark.ext.h2o.rest.api.timeout

300000

setSessionTimeout(Integer)

Timeout in milliseconds for REST API requests.
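The resolution of the default -1 for spark.ext.h2o.nthreads described in the table above can be sketched as follows. This is only an illustration of the documented rules, not Sparkling Water's implementation; the function name `resolve_nthreads` and the backend labels `"internal"`, `"external-hadoop"` and `"external-kubernetes"` are hypothetical, and `os.cpu_count()` stands in for Java's Runtime.getRuntime().availableProcessors().

```python
import os


def resolve_nthreads(nthreads, backend, executor_cores=None, available_processors=None):
    """Return the effective H2O thread limit for a given backend (illustrative only)."""
    # An explicit value always wins over the -1 default.
    if nthreads != -1:
        return nthreads
    if available_processors is None:
        # Stand-in for Runtime.getRuntime().availableProcessors().
        available_processors = os.cpu_count() or 1
    if backend == "internal":
        # Use spark.executor.cores when it is set, otherwise H2O's default.
        return executor_cores if executor_cores is not None else available_processors
    if backend == "external-hadoop":
        return available_processors
    if backend == "external-kubernetes":
        return 1
    raise ValueError(f"unknown backend label: {backend}")


print(resolve_nthreads(-1, "internal", executor_cores=4, available_processors=16))
print(resolve_nthreads(-1, "external-kubernetes", available_processors=16))
```

Passing `available_processors` explicitly keeps the sketch deterministic; in a real deployment the value would come from the JVM running the H2O node.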


Internal backend configuration properties

Property name

Default value

H2OConf setter (* getter)

Description

spark.ext.h2o.cluster.size

None

setNumH2OWorkers(Integer)

Expected number of workers of the H2O cluster. The value None means automatic detection of the cluster size. This number must be equal to the number of Spark executors. If the Spark property spark.executor.instances is specified, this Sparkling Water property is set to its value.

spark.ext.h2o.extra.cluster.nodes

false

setExtraClusterNodesEnabled()

setExtraClusterNodesDisabled()

If the property is set to true and the Sparkling Water internal backend identifies more executors than specified in the Spark property spark.executor.instances or in the Sparkling Water property spark.ext.h2o.cluster.size, Sparkling Water deploys H2O nodes to all discovered Spark executors. Otherwise, Sparkling Water deploys H2O nodes only to the number of executors specified in spark.ext.h2o.cluster.size (or spark.executor.instances).

spark.ext.h2o.dummy.rdd.mul.factor

10

setDrddMulFactor(Integer)

Multiplication factor for dummy RDD generation. Size of dummy RDD is spark.ext.h2o.cluster.size multiplied by this option.

spark.ext.h2o.spreadrdd.retries

10

setNumRddRetries(Integer)

Number of retries for creation of an RDD spread across all existing Spark executors

spark.ext.h2o.default.cluster.size

20

setDefaultCloudSize(Integer)

Starting size of cluster in case that size is not explicitly configured.

spark.ext.h2o.subseq.tries

5

setSubseqTries(Integer)

Number of consecutive successful attempts to determine the size of the Spark cluster that produce the same number of nodes.

spark.ext.h2o.hdfs_conf

None

setHdfsConf(String)

Either a string with the path to a file with Hadoop HDFS configuration or an org.apache.hadoop.conf.Configuration object. Useful for HDFS credentials settings and other HDFS-related configurations. The default value None means that sc.hadoopConfig is used.

spark.ext.h2o.spreadrdd.retries.timeout

0

setSpreadRddRetriesTimeout(Int)

Specifies how long the discovery of Spark executors should last. This option takes precedence over other options influencing the discovery mechanism: as long as the timeout has not expired, Sparkling Water keeps trying to discover new executors. This option might be useful in environments where Spark executors join the cloud with some delay.

spark.ext.h2o.direct.configuration.ip

true

setDirectIpConfigurationEnabled()

setDirectIpConfigurationDisabled()

If the property is disabled, the Spark executor doesn’t assign its IP address to the H2O node directly. Instead, the IP address is suggested to the H2O node, whose bootstrap logic performs additional network interface availability checks before the IP is assigned to the node.
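The relationship between spark.ext.h2o.cluster.size, spark.executor.instances and spark.ext.h2o.dummy.rdd.mul.factor can be illustrated with a small sketch. The function names are hypothetical, and the lookup order (the Sparkling Water property first, then the Spark property) is an assumption made for this example; the table only states that the Sparkling Water property takes the value of spark.executor.instances when that property is specified.

```python
def expected_workers(props):
    """Return the configured number of H2O workers, or None for automatic detection."""
    # Assumed lookup order: Sparkling Water property first, Spark property second.
    for key in ("spark.ext.h2o.cluster.size", "spark.executor.instances"):
        if key in props:
            return int(props[key])
    return None


def dummy_rdd_size(cluster_size, mul_factor=10):
    """Size of the dummy RDD: cluster size multiplied by the multiplication factor."""
    return cluster_size * mul_factor


props = {"spark.executor.instances": "4"}
workers = expected_workers(props)
print(workers, dummy_rdd_size(workers))
```

With four executors and the default factor of 10, the dummy RDD in this sketch has 40 elements.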


External backend configuration properties

Property name

Default value

H2OConf setter (* getter)

Description

spark.ext.h2o.external.driver.if

None

setExternalH2ODriverIf(String)

IP address or network of the mapper->driver callback interface. The default value means automatic detection.

spark.ext.h2o.external.driver.port

None

setExternalH2ODriverPort(Integer)

Port of mapper->driver callback interface. Default value means automatic detection.

spark.ext.h2o.external.driver.port.range

None

setExternalH2ODriverPortRange(String)

Range portX-portY of the mapper->driver callback interface, e.g. 50000-55000.

spark.ext.h2o.external.extra.memory.percent

10

setExternalExtraMemoryPercent(Integer)

This option is a percentage of the spark.ext.h2o.external.memory value and specifies the memory for internal JVM use outside of the Java heap.

spark.ext.h2o.cloud.representative

None

setH2OCluster(String)

ip:port of an H2O cluster leader node, used to identify the external H2O cluster.

spark.ext.h2o.external.cluster.size

None

setClusterSize(Integer)

Number of H2O nodes to start when auto mode of the external backend is set.

spark.ext.h2o.cluster.start.timeout

120

setClusterStartTimeout(Integer)

Timeout in seconds for starting H2O external cluster

spark.ext.h2o.cluster.info.name

None

setClusterInfoFile(String)

Full path to a file which is used as the notification file for the startup of external H2O cluster.

spark.ext.h2o.external.memory

6G

setExternalMemory(String)

Amount of memory assigned to each external H2O node

spark.ext.h2o.external.hdfs.dir

None

setHDFSOutputDir(String)

Path to the directory on HDFS used for storing temporary files.

spark.ext.h2o.external.start.mode

manual

useAutoClusterStart()

useManualClusterStart()

If this option is set to auto, the external H2O cluster is started automatically on YARN using the provided H2O driver JAR; otherwise, the user is expected to start the cluster manually.

spark.ext.h2o.external.h2o.driver

None

setH2ODriverPath(String)

Path to H2O driver used during auto start mode.

spark.ext.h2o.external.yarn.queue

None

setYARNQueue(String)

YARN queue on which the external H2O cluster is started.

spark.ext.h2o.external.kill.on.unhealthy

true

setKillOnUnhealthyClusterEnabled()

setKillOnUnhealthyClusterDisabled()

If true, the client will try to kill the cluster and then itself in case some nodes in the cluster report unhealthy status.

spark.ext.h2o.external.kerberos.principal

None

setKerberosPrincipal(String)

Kerberos Principal

spark.ext.h2o.external.kerberos.keytab

None

setKerberosKeytab(String)

Kerberos Keytab

spark.ext.h2o.external.run.as.user

None

setRunAsUser(String)

Impersonated Hadoop user

spark.ext.h2o.external.backend.stop.timeout

10000

setExternalBackendStopTimeout(Integer)

Timeout for confirmation from worker nodes when stopping the external backend. It is also possible to pass -1 to ensure the indefinite timeout. The unit is milliseconds.

spark.ext.h2o.external.hadoop.executable

hadoop

setExternalHadoopExecutable(String)

Name of or path to a hadoop executable binary which is used to start the external H2O backend on YARN.

spark.ext.h2o.external.extra.jars

None

setExternalExtraJars(String)

setExternalExtraJars(String[])

Comma-separated paths to jars that will be placed onto classpath of each H2O node.

spark.ext.h2o.external.communication.compression

SNAPPY

setExternalCommunicationCompression(String)

The type of compression used for data transfer between Spark and H2O node. Possible values are NONE, DEFLATE, GZIP, SNAPPY.

spark.ext.h2o.external.auto.start.backend

yarn

setExternalAutoStartBackend(String)

The backend on which the external H2O backend will be started in auto start mode. Possible values are YARN and KUBERNETES.

spark.ext.h2o.external.k8s.h2o.service.name

h2o-service

setExternalK8sH2OServiceName(String)

Name of H2O service required to start H2O on K8s.

spark.ext.h2o.external.k8s.h2o.statefulset.name

h2o-statefulset

setExternalK8sH2OStatefulsetName(String)

Name of H2O stateful set required to start H2O on K8s.

spark.ext.h2o.external.k8s.h2o.label

app=h2o

setExternalK8sH2OLabel(String)

Label used to select node for H2O cluster formation.

spark.ext.h2o.external.k8s.h2o.api.port

8081

setExternalK8sH2OApiPort(String)

Kubernetes API port.

spark.ext.h2o.external.k8s.namespace

default

setExternalK8sNamespace(String)

Kubernetes namespace where external H2O is started.

spark.ext.h2o.external.k8s.docker.image

See doc

setExternalK8sDockerImage(String)

Docker image containing Sparkling Water external H2O backend. Default value is h2oai/sparkling-water-external-backend:3.36.1.3-1-3.1

spark.ext.h2o.external.k8s.domain

cluster.local

setExternalK8sDomain(String)

Domain of the Kubernetes cluster.

spark.ext.h2o.external.k8s.svc.timeout

300

setExternalK8sServiceTimeout(Int)

[Deprecated] Timeout in seconds used as a limit for K8s service creation.
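Several external backend properties accept only a fixed set of values or a specific format. The sketch below checks a property dictionary against the values listed in the table above before it is passed to Spark. The names `ALLOWED_VALUES`, `parse_port_range` and `validate_external_conf` are hypothetical, and the case-insensitive comparison is an assumption made because the table shows both lowercase defaults and uppercase possible values.

```python
# Allowed values as listed in the table above (compared case-insensitively here).
ALLOWED_VALUES = {
    "spark.ext.h2o.external.start.mode": {"AUTO", "MANUAL"},
    "spark.ext.h2o.external.communication.compression": {"NONE", "DEFLATE", "GZIP", "SNAPPY"},
    "spark.ext.h2o.external.auto.start.backend": {"YARN", "KUBERNETES"},
}


def parse_port_range(text):
    """Parse a 'portX-portY' string into a (low, high) tuple of integers."""
    low_text, separator, high_text = text.partition("-")
    if not separator:
        raise ValueError(f"expected portX-portY, got {text!r}")
    low, high = int(low_text), int(high_text)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range {text!r}")
    return low, high


def validate_external_conf(props):
    """Return a list of human-readable problems found in the given properties."""
    errors = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in props and str(props[key]).upper() not in allowed:
            errors.append(f"{key}: {props[key]!r} is not one of {sorted(allowed)}")
    range_key = "spark.ext.h2o.external.driver.port.range"
    if range_key in props:
        try:
            parse_port_range(props[range_key])
        except ValueError as error:
            errors.append(f"{range_key}: {error}")
    return errors


conf = {
    "spark.ext.h2o.external.start.mode": "auto",
    "spark.ext.h2o.external.communication.compression": "SNAPPY",
    "spark.ext.h2o.external.driver.port.range": "50000-55000",
}
print(validate_external_conf(conf))
```

Returning a list of messages rather than raising on the first problem lets a caller report every misconfigured property at once.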


The H2OConf getter can be derived from the corresponding setter. All getters are parameterless. If the type of the property is Boolean, the getter is prefixed with is (e.g. setReplEnabled() -> isReplEnabled()). Getters of properties of other types do not have any prefix and start with a lowercase letter (e.g. setUserName(String) -> userName for Scala, userName() for Python).
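The naming rule above can be expressed as a small string transformation. This is a sketch of the rule only; `derive_getter` is a hypothetical helper and does not query the real H2OConf class. The caller states whether the property is Boolean, and for Boolean properties the sketch assumes the `...Enabled()` setter is the one passed in, as in the example from the text.

```python
import re


def derive_getter(setter, boolean=False, python=False):
    """Derive the getter name from a setter signature such as 'setUserName(String)'."""
    match = re.fullmatch(r"set(\w+)\((.*)\)", setter)
    if match is None:
        raise ValueError(f"not a setter signature: {setter!r}")
    name = match.group(1)
    if boolean:
        # Boolean properties: replace the 'set' prefix with 'is'.
        return f"is{name}()"
    # Other types: drop the prefix and lowercase the first letter.
    getter = name[0].lower() + name[1:]
    # Scala allows calling the getter without parentheses; Python needs them.
    return getter + "()" if python else getter


print(derive_getter("setReplEnabled()", boolean=True))
print(derive_getter("setUserName(String)"))
print(derive_getter("setUserName(String)", python=True))
```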