Migration Guide¶
Migration guide between Sparkling Water versions.
From 3.34 to 3.36¶
The methods
getWithDetailedPredictionCol
andsetWithDetailedPredictionCol
on all SW Algorithms and MOJO models were removed without replacement.The parameter
variableImportances
ofH2ODeepLearning
has been replaced withcalculateFeatureImportances
as well as the methodsgetVariableImportances
andsetVariableImportances
onH2ODeepLearning
have been replaced withgetCalculateFeatureImportances
andsetCalculateFeatureImportances
.The method
getVariableImportances
ofH2ODeepLearningMOJOModel
has been replaced withgetCalculateFeatureImportances
.The parameter
autoencoder
and methodsgetAutoencoder
,setAutoencoder
onH2ODeepLearning
have been removed without replacement.The method
getAutoencoder
ofH2ODeepLearningMOJOModel
has been removed without replacement.
From 3.32.1 to 3.34¶
On
H2OConf
, the setterssetClientIcedDir
andsetNodeIcedDir
are replaced bysetIcedDir
and gettersclientIcedDir
andnodeIcedDir
are replaced byicedDir
. Also the spark optionsspark.ext.h2o.client.iced.dir
andspark.ext.h2o.node.iced.dir
are replaced byspark.ext.h2o.iced.dir
.On
H2OConf
, the setterssetH2OClientLogLevel
andsetH2ONodeLogLevel
are replaced bysetLogLevel
and gettersh2oClientLogLevel
andh2oNodeLogLevel
are replaced bylogLevel
. Also the spark optionsspark.ext.h2o.client.log.level
andspark.ext.h2o.node.log.level
are replaced byspark.ext.h2o.log.level
.Spark option
spark.ext.h2o.client.flow.dir
is replaced byspark.ext.h2o.flow.dir
.On
H2OConf
, the setterssetClientBasePort
andsetNodeBasePort
are replaced bysetBasePort
and gettersclientBasePort
andnodeBasePort
are replaced bybasePort
. Also the spark optionsspark.ext.h2o.client.port.base
andspark.ext.h2o.node.port.base
are replaced byspark.ext.h2o.base.port
.On
H2OConf
, the setterssetH2OClientLogDir
andsetH2ONodeLogDir
are replaced bysetLogDir
and gettersh2oClientLogDir
andh2oNodeLogDir
are replaced bylogDir
. Also the spark optionsspark.ext.h2o.client.log.dir
andspark.ext.h2o.node.log.dir
are replaced byspark.ext.h2o.log.dir
.On
H2OConf
, the setterssetClientExtraProperties
andsetNodeExtraProperties
are replaced bysetExtraProperties
and gettersclientExtraProperties
andnodeExtraProperties
are replaced byextraProperties
. Also the spark optionsspark.ext.h2o.client.extra
andspark.ext.h2o.node.extra
are replaced byspark.ext.h2o.extra.properties
.On
H2OConf
, the settersetMapperXmx
is replaced bysetExternalMemory
and the gettermapperXmx
is replaced byexternalMemory
. Also the Spark optionspark.ext.h2o.hadoop.memory
is replaced byspark.ext.h2o.external.memory
.The
withDetailedPredictionCol
field onH2OMOJOSettings
was removed without a replacement.The
weightCol
parameter onH2OKmeans
was removed without a replacement.The
distribution
parameter onH2OGLM
was removed without a replacement.The support for Apache Spark 2.1.x has been removed.
Binary models could be downloaded only if the algorithm parameter
keepBinaryModels
was set totrue
.
From 3.32 to 3.32.1¶
The data type of H2OTargetEncoder output columns has been changed from
DoubleType
toml.linalg.VectorUDT
.The sub-columns of the
prediction
column produced byH2OMOJOPipelineModel
could be of the typefloat
instead ofdouble
ifMOJOModelSettings.namedMojoOutputColumns
is set totrue
.
From 3.30.1 to 3.32¶
We have created two new classes -
ai.h2o.sparkling.H2OContext
andai.h2o.sparkling.H2OConf
. The behaviour of the context and configuration is the same as in the originalorg.apache.spark.h2o
package except that in this case we no longer use H2O client on Spark driver. This means that H2O is running only on worker nodes and not on Spark driver. This change affects only Scala API as PySparkling and RSparkling are already using the new API internally. We have also changed all documentation materials to point to the new classes.We are keeping the original context and conf available in the code in case the user needs to use some of the H2O client features directly on the Spark driver, but please note that once the API in the new package
ai.h2o.sparkling
is complete, the context and conf classes will get removed. We therefore encourage users to migrate to new context class and report missing features to us via Github issues.For example, if the user is training XGboost model using H2O Java api but is not using
H2OXGBoost
from Sparkling Water, this feature requires the H2O client to be available.In PySparkling, the
H2OConf
no longer accepts any arguments. To create newH2OConf
, please just callconf = H2OConf()
. Also theH2OContext.getOrCreate
method no longer accepts the spark argument. You can start H2OContext asH2OContext.getOrCreate()
orH2OContext.getOrCreate(conf)
In RSparkling, the
H2OConf
no longer accepts any arguments. To create newH2OConf
, please just callconf <- H2OConf()
. Also theH2OContext.getOrCreate
method no longer accepts the spark argument. You can start H2OContext asH2OContext.getOrCreate()
orH2OContext.getOrCreate(conf)
In Scala,
H2OConf
can be created asnew H2OConf()
ornew H2OConf(sparkConf)
. Other constructor variants have been removed. Also,H2OContext
can be created asH2OContext.getOrCreate()
orH2OContext.getOrCreate(conf)
. The other variants of this method have been removed.The
setH2OCluster(ip, port)
method onH2OConf
in all APIs doesn’t implicitly set the external backend anymore. The methodsetExternalClusterMode()
must be called explicitly.The method
classify
in thehex.ModelUtils
object is removed. Please use Sparkling Water algorithm API to train and score H2O models. This removal affects only Scala API as other APIs don’t have such functionality.The method
DLModel
inwater.support.DeepLearningSupport
is removed. Please useH2ODeepLearning
instead. The same holds for methodGBMModel
inwater.support.GBMSupport
. Please useH2OGBM
instead. The classes wrapping these methods are removed as well. This removal affects only Scala API as other APIs don’t have such functionality.The method
splitFrame
andsplit
inwater.support.H2OFrameSupport
is removed. Please useai.h2o.sparkling.H2OFrame(frameKeyString).split(ratios)
instead.The method
withLockAndUpdate
inwater.support.H2OFrameSupport
is removed. Please useai.h2o.sparkling.backend.utils.H2OClientUtils.withLockAndUpdate
instead.The methods
columnsToCategorical
with both indices and column names argument inwater.support.H2OFrameSupport
are removed. Please useai.h2o.sparkling.H2OFrame(frameKeyString).convertColumnsToCategorical
instead.Method
modelMetrics
inwater.support.ModelMetricsSupport
is removed. Please use methodsgetTrainingMetrics
,getValidationMetrics
orgetCrossValidationMetrics
on theH2OMOJOModel
. You can also use methodgetCurrentMetrics
, which returns cross validation metrics if nfolds was specified and higher than 0, validation metrics if validation frame has been specified ( splitRatio was set and lower than 1 ) and nfolds was 0 and training metrics otherwise ( splitRatio is 1 and nfolds is 0).The whole trait
ModelSerializationSupport
in Scala is removed. The MOJO is a first class citizen in Sparkling Water and most code works with our Spark MOJO wrapper. Please use the following approaches to migrate from previous methods in the model serialization support:To create Spark MOJO wrapper in Sparkling Water, you can load it from H2O-3 as:
val mojoModel = H2OMOJOModel.createFromMojo(path)
or train model using Sparkling Water API, such as
val gbm = H2OGBM().setLabelCol("label") val mojoModel = gbm.fit(data)
In this case the
mojoModel
is Spark wrapper around the H2O’s mojo providing Spark friendly API. This also means that the such model can be embedded into Spark pipelines without any additional work.To export it as, please call:
mojoModel.write.save("path")
The advantage is that this variant is H2O-version independent and when such model is loaded, H2O run-time is not required.
You can load the exported model from Sparkling Water as:
val mojoModel = H2OMOJOModel.read.load("path")
For additional information about how to load MOJO into Sparkling Water, please see Loading MOJOs into Sparkling Water.
The methods
join
,innerJoin
,outerJoin
,leftJoin
andrightJoin
inwater.support.JoinSupport
are removed together with their encapsulating class. The enumwater.support.munging.JoinMethod
is also removed. In order to perform joins, please use the following methods:Inner join:
ai.h2o.sparkling.H2OFrame(idOfLeftFrame).innerJoin(rightFrame)
Outer join:
ai.h2o.sparkling.H2OFrame(idOfLeftFrame).outerJoin(rightFrame)
Left join:
ai.h2o.sparkling.H2OFrame(idOfLeftFrame).leftJoin(rightFrame)
Right join:
ai.h2o.sparkling.H2OFrame(idOfLeftFrame).rightJoin(rightFrame)
The
JoinMethod
enum is removed as it is no longer required.Since the method
asH2OFrame
ofH2OContext
converts strings to categorical columns automatically according to the heuristic from H2O parsers, the methodsgetAllStringColumnsToCategorical
andsetAllStringColumnsToCategorical
have been removed from all SW API algorithms in Python and Scala API.Methods
setH2ONodeLogLevel
andsetH2OClientLogLevel
are removed onH2OContext
. Please usesetH2OLogLevel
instead.Methods
asDataFrame
on ScalaH2OContext
has been replaced by methodsasSparkFrame
with same arguments. This was done to ensure full consistency between Scala, Python and R APIs.JavaH2OContext is removed. Please use
org.apache.spark.h2o.H2OContext
instead.When using H2O as Spark data source, the approach
val df = spark.read.h2o(key)
has been removed. Please useval df = spark.read.format("h2o").load(key)
instead. The same holds forspark.write.h2o(key)
. Please usedf.write.format("h2o").save("new_key")
instead.Starting from the version 3.32,
H2OGridSearch
hyper-parameters now correspond to parameter names in Sparkling Water. Previously, the hyper-parameters were specified using internal H2O names such as_ntrees
or_max_depth
. At this version, the parameter names follow the naming convention of getters and setters of the corresponding parameter, such asntrees
ormaxDepth
.Also the output of
getGridModelsParams
now contains column names which correspond to Sparkling Water parameter names instead of H2O internal ones. When updating to version 3.32, please make sure to update your hyper parameter names.On
H2OConf
, the methodssetHiveSupportEnabled
,setHiveSupportDisabled
andisHiveSupportEnabled
are replaced bysetKerberizedHiveEnabled
,setKerberizedHiveDisabled
andisKerberizedHiveEnabled
to reflect their actual meaning. Also the optionspark.ext.h2o.hive.enabled
is replaced byspark.ext.h2o.kerberized.hive.enabled
.The below list of Grid Search parameters with their getters and setters were replaced by the same parameters on the algorithm the grid search is applied to.
Parameter Name |
Getter |
Setter |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Schema of detailed predictions produced by
H2OMOJOModel
and thus by all Sparkling Water algorithms has been changed a bit. TheMapType
sub-columnsprobabilities
,calibratedProbabilities
andcontributions
have been changed toStructType
columns.On H2OXGBoost, the options
minSumHessianInLeaf
andminDataInLeaf
have been removed as well as the corresponding getters and setters. The methods are removed without replacement as these parameters weren’t valid XGBoost parameters.
From 3.30 to 3.30.1¶
The detailed prediction columns is always enabled for all types of MOJO predictions.
From 3.28.1 to 3.30¶
It is now required to explicitly create
H2OContext
before you run any of our exposed algorithms. Previously, the algorithm would create the H2OContext on demand.It is no longer possible to disable web (REST API endpoints) on the worker nodes in the internal client as we require the endpoints to be available. In particular, the methods
setH2ONodeWebEnabled
,setH2ONodeWebDisabled
andh2oNodeWebEnabled
are removed without replacement. Also the optionspark.ext.h2o.node.enable.web
does not have any effect anymore.It is no longer possible to disable web (REST API endpoints) on the client node as we require the Rest API to be available. In particular, the methods
setClientWebEnabled
,setClientWebDisabled
andclientWebEnabled
are removed without replacement. Also the optionspark.ext.h2o.client.enable.web
does not have any effect anymore.The property
spark.ext.h2o.node.iced.dir
and the setter methodsetNodeIcedDir
onH2OConf
has no effect in all 3.30.x.y-z versions. If users need to set a custom iced directory for executors, they can set the propertyspark.ext.h2o.node.extra
to-ice_root dir
, wheredir
is a user-specified directory.
Removal of Deprecated Methods and Classes¶
On PySparkling, passing authentication on
H2OContext
viaauth
param is removed in favor of methodssetUserName
andsetPassword
ond theH2OConf
or via the Spark optionsspark.ext.h2o.user.name
andspark.ext.h2o.password
directly.On Pysparkling, passing
verify_ssl_certificates
parameter as H2OContext argument is removed in favor of methodsetVerifySslCertificates
onH2OConf
or via the spark optionspark.ext.h2o.verify_ssl_certificates
.On RSparkling, the method
h2o_context
is removed. To create H2OContext, please callhc <- H2OContext.getOrCreate()
. Also the methodsh2o_flow
,as_h2o_frame
andas_spark_dataframe
are removed. Please use the methods available on theH2OContext
instance created viahc <- H2OContext.getOrCreate()
. Instead ofh2o_flow
, usehc$openFlow
, instead ofas_h2o_frame
, useasH2OFrame
and instead ofas_spark_dataframe
useasSparkFrame
.Also the
H2OContext.getOrCreate()
does not haveusername
andpassword
arguments anymore. The correct way how to pass authentication details toH2OContext
is viaH2OConf
class, such as:conf <- H2OConf() conf$setUserName(username) conf$setPassword(password) hc <- H2OContext.getOrCreate(conf)
The Spark options
spark.ext.h2o.user.name
andspark.ext.h2o.password
correspond to these setters and can be also used directly.In
H2OContext
Python API, the methodas_spark_frame
is replaced by the methodasSparkFrame
and the methodas_h2o_frame
is replaced byasH2OFrame
.In
H2OXGBoost
Scala And Python API, the methodsgetNEstimators
andsetNEstimators
are removed. Please usegetNtrees
andsetNtrees
instead.In Scala and Python API for tree-based algorithms, the method
getR2Stopping
is removed in favor ofgetStoppingRounds
,getStoppingMetric
,getStoppingTolerance
methods and the methodsetR2Stopping
is removed in favor ofsetStoppingRounds
,setStoppingMetric
,setStoppingTolerance
methods.Method
download_h2o_logs
on PySparklingH2OContext
is removed in favor of thedownloadH2OLogs
method.Method
get_conf
on PySparklingH2OContext
is removed in favor of thegetConf
method.On Python and Scala
H2OGLM
API, the methodssetExactLambdas
andgetExactLambdas
are removed without replacement.On H2OConf Python API, the following methods have been renamed to be consistent with the Scala counterparts:
h2o_cluster
->h2oCluster
h2o_cluster_host
->h2oClusterHost
h2o_cluster_port
->h2oClusterPort
cluster_size
->clusterSize
cluster_start_timeout
->clusterStartTimeout
cluster_config_file
->clusterInfoFile
mapper_xmx
->mapperXmx
hdfs_output_dir
->HDFSOutputDir
cluster_start_mode
->clusterStartMode
is_auto_cluster_start_used
->isAutoClusterStartUsed
is_manual_cluster_start_used
->isManualClusterStartUsed
h2o_driver_path
->h2oDriverPath
yarn_queue
->YARNQueue
is_kill_on_unhealthy_cluster_enabled
->isKillOnUnhealthyClusterEnabled
kerberos_principal
->kerberosPrincipal
kerberos_keytab
->kerberosKeytab
run_as_user
->runAsUser
set_h2o_cluster
->setH2OCluster
set_cluster_size
->setClusterSize
set_cluster_start_timeout
->setClusterStartTimeout
set_cluster_config_file
->setClusterInfoFile
set_mapper_xmx
->setMapperXmx
set_hdfs_output_dir
->setHDFSOutputDir
use_auto_cluster_start
->useAutoClusterStart
use_manual_cluster_start
->useManualClusterStart
set_h2o_driver_path
->setH2ODriverPath
set_yarn_queue
->setYARNQueue
set_kill_on_unhealthy_cluster_enabled
->setKillOnUnhealthyClusterEnabled
set_kill_on_unhealthy_cluster_disabled
->setKillOnUnhealthyClusterDisabled
set_kerberos_principal
->setKerberosPrincipal
set_kerberos_keytab
->setKerberosKeytab
set_run_as_user
->setRunAsUser
num_h2o_workers
->numH2OWorkers
drdd_mul_factor
->drddMulFactor
num_rdd_retries
->numRddRetries
default_cloud_size
->defaultCloudSize
subseq_tries
->subseqTries
h2o_node_web_enabled
->h2oNodeWebEnabled
node_iced_dir
->nodeIcedDir
set_num_h2o_workers
->setNumH2OWorkers
set_drdd_mul_factor
->setDrddMulFactor
set_num_rdd_retries
->setNumRddRetries
set_default_cloud_size
->setDefaultCloudSize
set_subseq_tries
->setSubseqTries
set_h2o_node_web_enabled
->setH2ONodeWebEnabled
set_h2o_node_web_disabled
->setH2ONodeWebDisabled
set_node_iced_dir
->setNodeIcedDir
backend_cluster_mode
->backendClusterMode
cloud_name
->cloudName
is_h2o_repl_enabled
->isH2OReplEnabled
scala_int_default_num
->scalaIntDefaultNum
is_cluster_topology_listener_enabled
->isClusterTopologyListenerEnabled
is_spark_version_check_enabled
->isSparkVersionCheckEnabled
is_fail_on_unsupported_spark_param_enabled
->isFailOnUnsupportedSparkParamEnabled
jks_pass
->jksPass
jks_alias
->jksAlias
hash_login
->hashLogin
ldap_login
->ldapLogin
kerberos_login
->kerberosLogin
login_conf
->loginConf
ssl_conf
->sslConf
auto_flow_ssl
->autoFlowSsl
h2o_node_log_level
->h2oNodeLogLevel
h2o_node_log_dir
->h2oNodeLogDir
cloud_timeout
->cloudTimeout
node_network_mask
->nodeNetworkMask
stacktrace_collector_interval
->stacktraceCollectorInterval
context_path
->contextPath
flow_scala_cell_async
->flowScalaCellAsync
max_parallel_scala_cell_jobs
->maxParallelScalaCellJobs
internal_port_offset
->internalPortOffset
mojo_destroy_timeout
->mojoDestroyTimeout
node_base_port
->nodeBasePort
node_extra_properties
->nodeExtraProperties
flow_extra_http_headers
->flowExtraHttpHeaders
is_internal_secure_connections_enabled
->isInternalSecureConnectionsEnabled
flow_dir
->flowDir
client_ip
->clientIp
client_iced_dir
->clientIcedDir
h2o_client_log_level
->h2oClientLogLevel
h2o_client_log_dir
->h2oClientLogDir
client_base_port
->clientBasePort
client_web_port
->clientWebPort
client_verbose_output
->clientVerboseOutput
client_network_mask
->clientNetworkMask
ignore_spark_public_dns
->ignoreSparkPublicDNS
client_web_enabled
->clientWebEnabled
client_flow_baseurl_override
->clientFlowBaseurlOverride
client_extra_properties
->clientExtraProperties
runs_in_external_cluster_mode
->runsInExternalClusterMode
runs_in_internal_cluster_mode
->runsInInternalClusterMode
client_check_retry_timeout
->clientCheckRetryTimeout
set_internal_cluster_mode
->setInternalClusterMode
set_external_cluster_mode
->setExternalClusterMode
set_cloud_name
->setCloudName
set_nthreads
->setNthreads
set_repl_enabled
->setReplEnabled
set_repl_disabled
->setReplDisabled
set_default_num_repl_sessions
->setDefaultNumReplSessions
set_cluster_topology_listener_enabled
->setClusterTopologyListenerEnabled
set_cluster_topology_listener_disabled
->setClusterTopologyListenerDisabled
set_spark_version_check_disabled
->setSparkVersionCheckDisabled
set_fail_on_unsupported_spark_param_enabled
->setFailOnUnsupportedSparkParamEnabled
set_fail_on_unsupported_spark_param_disabled
->setFailOnUnsupportedSparkParamDisabled
set_jks
->setJks
set_jks_pass
->setJksPass
set_jks_alias
->setJksAlias
set_hash_login_enabled
->setHashLoginEnabled
set_hash_login_disabled
->setHashLoginDisabled
set_ldap_login_enabled
->setLdapLoginEnabled
set_ldap_login_disabled
->setLdapLoginDisabled
set_kerberos_login_enabled
->setKerberosLoginEnabled
set_kerberos_login_disabled
->setKerberosLoginDisabled
set_login_conf
->setLoginConf
set_ssl_conf
->setSslConf
set_auto_flow_ssl_enabled
->setAutoFlowSslEnabled
set_auto_flow_ssl_disabled
->setAutoFlowSslDisabled
set_h2o_node_log_level
->setH2ONodeLogLevel
set_h2o_node_log_dir
->setH2ONodeLogDir
set_cloud_timeout
->setCloudTimeout
set_node_network_mask
->setNodeNetworkMask
set_stacktrace_collector_interval
->setStacktraceCollectorInterval
set_context_path
->setContextPath
set_flow_scala_cell_async_enabled
->setFlowScalaCellAsyncEnabled
set_flow_scala_cell_async_disabled
->setFlowScalaCellAsyncDisabled
set_max_parallel_scala_cell_jobs
->setMaxParallelScalaCellJobs
set_internal_port_offset
->setInternalPortOffset
set_node_base_port
->setNodeBasePort
set_mojo_destroy_timeout
->setMojoDestroyTimeout
set_node_extra_properties
->setNodeExtraProperties
set_flow_extra_http_headers
->setFlowExtraHttpHeaders
set_internal_secure_connections_enabled
->setInternalSecureConnectionsEnabled
set_internal_secure_connections_disabled
->setInternalSecureConnectionsDisabled
set_flow_dir
->setFlowDir
set_client_ip
->setClientIp
set_client_iced_dir
->setClientIcedDir
set_h2o_client_log_level
->setH2OClientLogLevel
set_h2o_client_log_dir
->setH2OClientLogDir
set_client_port_base
->setClientBasePort
set_client_web_port
->setClientWebPort
set_client_verbose_enabled
->setClientVerboseEnabled
set_client_verbose_disabled
->setClientVerboseDisabled
set_client_network_mask
->setClientNetworkMask
set_ignore_spark_public_dns_enabled
->setIgnoreSparkPublicDNSEnabled
set_ignore_spark_public_dns_disabled
->setIgnoreSparkPublicDNSDisabled
set_client_web_enabled
->setClientWebEnabled
set_client_web_disabled
->setClientWebDisabled
set_client_flow_baseurl_override
->setClientFlowBaseurlOverride
set_client_check_retry_timeout
->setClientCheckRetryTimeout
set_client_extra_properties
->setClientExtraProperties
In
H2OAutoML
Python and Scala API, the memberleaderboard()
/leaderboard
is replaced by the methodgetLeaderboard()
.The method
setClusterConfigFile
was removed fromH2OConf
in Scala API. The replacement method issetClusterInfoFile
onH2OConf
.The method
setClientPortBase
was removed fromH2OConf
in Scala API. The replacement method issetClientBasePort
onH2OConf
.In
H2OGridSearch
Python API, the methods:get_grid_models
,get_grid_models_params
and `` get_grid_models_metrics`` are removed and replaced bygetGridModels
,getGridModelsParams
and `` getGridModelsMetrics``.On
H2OXGboost
Scala and Python API, the methodsgetInitialScoreIntervals
,setInitialScoreIntervals
,getScoreInterval
andsetScoreInterval
are removed without replacement. They correspond to an internal H2O argument which should not be exposed.On
H2OXGboost
Scala and Python API, the methodsgetLearnRateAnnealing
andsetLearnRateAnnealing
are removed without replacement as this parameter is currently not exposed in H2O.The methods
ignoreSparkPublicDNS
,setIgnoreSparkPublicDNSEnabled
andsetIgnoreSparkPublicDNSDisabled
are removed without replacement as they are no longer required. Also the optionspark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS
does not have any effect anymore.
From 3.28.0 to 3.28.1¶
On
H2OConf
Python API, the methodsexternal_write_confirmation_timeout
andset_external_write_confirmation_timeout
are removed without replacement. OnH2OConf
Scala API, the methodsexternalWriteConfirmationTimeout
andsetExternalWriteConfirmationTimeout
are removed without replacement. Also the optionspark.ext.h2o.external.write.confirmation.timeout
does not have any effect anymore.The environment variable
H2O_EXTENDED_JAR
specifying path to an extended driver jar was entirely replaced withH2O_DRIVER_JAR
. TheH2O_DRIVER_JAR
should contain a path to a plain H2O driver jar without any extensions. For more details, see External Backend.The location of Sparkling Water assembly JAR has changed inside the Sparkling Water distribution archive which you can download from our download page. It has been moved from
assembly/build/libs
to justjars
.H2OSVM
has been removed from the Scala API. We have removed this API as it was just wrapping Spark SVM and complicated the future development. If you still need to useSVM
, please use Spark SVM directly. All the parameters remain the same. We are planning to expose proper H2O’s SVM implementation in Sparkling Water in the following major releases.In case of binomial predictions on H2O MOJOs, the fields
p0
andp1
in the detailed prediction column are replaced by a single fieldprobabilities
which is a map from label to predicted probability. The same is done for the fieldsp0_calibrated
andp1_calibrated
. These fields are replaced by a single fieldcalibratedProbabilities
which is a map from label to predicted calibrated probability.In case of multinomial predictions on H2O MOJOs, the type of field
probabilities
in the detailed prediction column is changed from array of probabilities to a map from label to predicted probability.In case of ordinal predictions on H2O MOJOs, the type of field
probabilities
in the detailed prediction column is changed from array of probabilities to a map from label to predicted probability.On
H2OConf
in all clients, the methodsexternalCommunicationBlockSizeAsBytes
,externalCommunicationBlockSize
andsetExternalCommunicationBlockSize
have been removed as they are no longer needed.Method
Security.enableSSL
in Scala API has been removed. Please usesetInternalSecureConnectionsEnabled
on H2OConf to secure your cluster. This setter is available on Scala, Python and R clients.For the users of the manual backend we have simplified the configuration and there is no need to specify a cluster size anymore in advance. Sparkling Water automatically discovers the cluster size. In particular
spark.ext.h2o.external.cluster.size
does not have any effect anymore.
From 3.26 To 3.28.0¶
Passing Authentication in Scala¶
The users of Scala who set up any form of authentication on the backend side are now required to specify credentials on the
H2OConf
object via setUserName
and setPassword
. It is also possible to specify these directly
as Spark options spark.ext.h2o.user.name
and spark.ext.h2o.password
. Note: Actually only users of external
backend need to specify these options at this moment as the external backend is using communication via REST api
but all our documentation is using these options already as the internal backend will start using the REST api
soon as well.
String instead of enums in Sparkling Water Algo API¶
In scala, setters of the pipeline wrappers for H2O algorithms now accepts strings in places where they accepted enum values before. Before, we called, for example:
import hex.genmodel.utils.DistributionFamily
val gbm = H2OGBM()
gbm.setDistribution(DistributionFamily.multinomial)
Now, the correct code is:
val gbm = H2OGBM()
gbm.setDistribution("multinomial")
which makes the Python and Scala APIs consistent. Both upper case and lower case values are valid and if a wrong input is entered, warning is printed out with correct possible values.
Switch to Java 1.8 on Spark 2.1¶
Sparkling Water for Spark 2.1 now requires Java 1.8 and higher.
DRF exposed into Sparkling Water Algorithm API¶
DRF is now exposed in the Sparkling Water. Please see our documentation to learn how to use it Train DRF Model in Sparkling Water.
Also we can run our Grid Search API on DRF.
Change Default Name of Prediction Column¶
The default name of the prediction column has been changed from prediction_output
to prediction
.
Single value in prediction column¶
The prediction column contains directly the predicted value. For example, before this change, the prediction column contained
another struct field called value
(in case of regression issue), which contained the value. From now on, the predicted value
is always stored directly in the prediction column. In case of regression issue, the predicted numeric value
and in case of classification, the predicted label. If you are interested in more details created during the prediction,
please make sure to set withDetailedPredictionCol
to true
via the setters on both PySparkling and Sparkling Water.
When enabled, additional column named detailed_prediction
is created which contains additional prediction details, such as
probabilities, contributions and so on.
In manual mode of external backend always require a specification of cluster location¶
In previous versions, H2O client was able to discover nodes using the multicast search. That is now removed and IP:Port of any node of external cluster to which we need to connect is required. This also means that in the users of multicast cloud up in case of external H2O backend in manual standalone (no Hadoop) mode now need to pass the flatfile argument external H2O. For more information, please see Manual Mode of External Backend without Hadoop (standalone).
Removal of Deprecated Methods and Classes¶
getColsampleBytree
andsetColsampleBytree
methods are removed from the XGBoost API. Please use the newgetColSampleByTree
andsetColSampleByTree
.Removal of deprecated option
spark.ext.h2o.external.cluster.num.h2o.nodes
and corresponding setters. Please usespark.ext.h2o.external.cluster.size
or the corresponding settersetClusterSize
.Removal of deprecated algorithm classes in package
org.apache.spark.h2o.ml.algos
. Please use the classes from the packageai.h2o.sparkling.ml.algos
. Their API remains the same as before. This is the beginning of moving Sparkling Water classes to our distinct packageai.h2o.sparkling
Removal of deprecated option
spark.ext.h2o.external.read.confirmation.timeout
and related setters. This option is removed without a replacement as it is no longer needed.Removal of deprecated parameter
SelectBestModelDecreasing
on the Grid Search API. Related getters and setters have been also removed. This method is removed without replacement as we now internally sort the models with the ordering meaningful to the specified sort metric.TargetEncoder transformer now accepts the
outputCols
parameter which can be used to override the default output column names.On PySparkling
H2OGLM
API, we removed deprecated parameteralpha
in favor ofalphaValue
andlambda_
in favor oflambdaValue
. On Both PySparkling and Sparkling WaterH2OGLM
API, we removed methodsgetAlpha
in favor ofgetAlphaValue
,getLambda
in favor ofgetLambdaValue
,setAlpha
in favor ofsetAlphaValue
andsetLambda
in favor ofsetLambdaValue
. These changes ensure the consistency across Python and Scala APIs.In Sparkling Water
H2OConf
API, we removed methodh2oDriverIf
in favor ofexternalH2ODriverIf
andsetH2ODriverIf
in favor ofsetExternalH2ODriverIf
. In PySparklingH2OConf
API, we removed methodh2o_driver_if
in favor ofexternalH2ODriverIf
andset_h2o_driver_if
in favor ofsetExternalH2ODriverIf
.On PySparkling
H2OConf
API, the methoduser_name
has been removed in favor of theuserName
method and methodset_user_name
had been removed in favor of thesetUserName
method.The configurations
spark.ext.h2o.external.kill.on.unhealthy.interval
,spark.ext.h2o.external.health.check.interval
andspark.ext.h2o.ui.update.interval
have been removed and were replaced by a single optionspark.ext.h2o.backend.heartbeat.interval
. OnH2OConf
Scala API, the methodsbackendHeartbeatInterval
andsetBackendHeartbeatInterval
were added and the following methods were removed:uiUpdateInterval
,setUiUpdateInterval
,killOnUnhealthyClusterInterval
,setKillOnUnhealthyClusterInterval
,healthCheckInterval
andsetHealthCheckInterval
. OnH2OConf
Python API, the methodsbackendHeartbeatInterval
andsetBackendHeartbeatInterval
were added and the following methods were removed:ui_update_interval
,set_ui_update_interval
,kill_on_unhealthy_cluster_interval
,set_kill_on_unhealthy_cluster_interval
,get_health_check_interval
andset_health_check_interval
. The added methods are used to configure single interval which was previously specified by these 3 different methods.The configuration
spark.ext.h2o.cluster.client.connect.timeout
is removed without replacement as it is no longer needed. onH2OConf
Scala API, the methodsclientConnectionTimeout
andsetClientConnectionTimeout
were removed and onH2OConf
Python API, the methodsset_client_connection_timeout
andset_client_connection_timeout
were removed.
Change of Versioning Scheme¶
Version of Sparkling Water is changed to the following pattern: H2OVersion-SWPatchVersion-SparkVersion
, where:
H2OVersion
is full H2O Version which is integrated to Sparkling Water. SWPatchVersion
is used to specify
a patch version and SparkVersion
is a Spark version. This change of scheme allows us to do releases of Sparkling Water
without the need of releasing H2O if there is only change on the Sparkling Water side. In that case, we just increment the
SWPatchVersion
. The new version therefore looks, for example, like 3.26.0.9-2-2.4
. This version tells us this
Sparkling Water is integrating H2O 3.26.0.9
, it is the second release with 3.26.0.9
version and is for Spark 2.4
.
Renamed Property for Passing Extra HTTP Headers for Flow UI¶
The configuration property spark.ext.h2o.client.flow.extra.http.headers
was renamed to
to spark.ext.h2o.flow.extra.http.headers
since Flow UI can also run on H2O nodes and the value of the property is
also propagated to H2O nodes since the major version 3.28.0.1-1
.
External Backend now keeps H2O Flow accessible on worker nodes¶
The option spark.ext.h2o.node.enable.web
does not have any effect anymore for automatic mode of external
backend as we required H2O Flow to be accessible on the worker nodes. The associated getters and setters do also
not have any effect in this case.
It is also required that the users of manual mode of external backend
keep REST api available on all worker nodes. In particular, the H2O option -disable_web
can’t be specified
when starting H2O.
Default Values of Some AutoML Parameters Have Changed¶
The default values of the following AutoML parameters have changed across all APIs.
Parameter Name |
Old Value |
New Value |
---|---|---|
|
|
|
|
|
|
|
|
|
From any previous version to 3.26.11¶
Users of Sparkling Water external cluster in manual mode on Hadoop need to update the command the external cluster is launched with. A new parameter
-sw_ext_backend
needs to be added to the h2odriver invocation.