Migration Guide
Migration guide between Sparkling Water versions.
From 3.38 to 3.40

- The parameter namedMojoOutputColumns and the methods getNamedMojoOutputColumns and setNamedMojoOutputColumns on H2OAlgorithmCommonParams have been removed without replacement. The behaviour stays the same as it was for the value true, which was the default value in the past.
From 3.36 to 3.38

- org.apache.spark.h2o.H2OConf has been replaced by ai.h2o.sparkling.H2OConf.
- org.apache.spark.h2o.H2OContext has been replaced by ai.h2o.sparkling.H2OContext.
- The support for Apache Spark 2.2.x has been removed.
- The parameter variableImportances of H2ODeepLearning has been replaced with calculateFeatureImportances, and the methods getVariableImportances and setVariableImportances on H2ODeepLearning have been replaced with getCalculateFeatureImportances and setCalculateFeatureImportances.
- The method getVariableImportances of H2ODeepLearningMOJOModel has been replaced with getCalculateFeatureImportances.
- The parameter autoencoder and the methods getAutoencoder and setAutoencoder on H2ODeepLearning have been removed without replacement.
- The method getAutoencoder of H2ODeepLearningMOJOModel has been removed without replacement.
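For a Scala application, moving to the new package is typically just an import change; a minimal sketch, assuming an application that previously used the org.apache.spark.h2o classes:

import ai.h2o.sparkling.{H2OConf, H2OContext}  // previously org.apache.spark.h2o.{H2OConf, H2OContext}

val conf = new H2OConf()
val hc = H2OContext.getOrCreate(conf)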
From 3.34 to 3.36

- The methods getWithDetailedPredictionCol and setWithDetailedPredictionCol on all SW algorithms and MOJO models were removed without replacement.
- The withDetailedPredictionCol field on H2OMOJOSettings was removed without a replacement.
- The Boolean type mapping from Spark's DataFrame to H2OFrame was changed from the numerical values 0 and 1 to the categorical values "False" and "True".
From 3.32.1 to 3.34

- On H2OConf, the setters setClientIcedDir and setNodeIcedDir are replaced by setIcedDir, and the getters clientIcedDir and nodeIcedDir are replaced by icedDir. Also, the Spark options spark.ext.h2o.client.iced.dir and spark.ext.h2o.node.iced.dir are replaced by spark.ext.h2o.iced.dir.
- On H2OConf, the setters setH2OClientLogLevel and setH2ONodeLogLevel are replaced by setLogLevel, and the getters h2oClientLogLevel and h2oNodeLogLevel are replaced by logLevel. Also, the Spark options spark.ext.h2o.client.log.level and spark.ext.h2o.node.log.level are replaced by spark.ext.h2o.log.level.
- The Spark option spark.ext.h2o.client.flow.dir is replaced by spark.ext.h2o.flow.dir.
- On H2OConf, the setters setClientBasePort and setNodeBasePort are replaced by setBasePort, and the getters clientBasePort and nodeBasePort are replaced by basePort. Also, the Spark options spark.ext.h2o.client.port.base and spark.ext.h2o.node.port.base are replaced by spark.ext.h2o.base.port.
- On H2OConf, the setters setH2OClientLogDir and setH2ONodeLogDir are replaced by setLogDir, and the getters h2oClientLogDir and h2oNodeLogDir are replaced by logDir. Also, the Spark options spark.ext.h2o.client.log.dir and spark.ext.h2o.node.log.dir are replaced by spark.ext.h2o.log.dir.
- On H2OConf, the setters setClientExtraProperties and setNodeExtraProperties are replaced by setExtraProperties, and the getters clientExtraProperties and nodeExtraProperties are replaced by extraProperties. Also, the Spark options spark.ext.h2o.client.extra and spark.ext.h2o.node.extra are replaced by spark.ext.h2o.extra.properties.
- On H2OConf, the setter setMapperXmx is replaced by setExternalMemory, and the getter mapperXmx is replaced by externalMemory. Also, the Spark option spark.ext.h2o.hadoop.memory is replaced by spark.ext.h2o.external.memory.
- The weightCol parameter on H2OKmeans was removed without a replacement.
- The distribution parameter on H2OGLM was removed without a replacement.
- The support for Apache Spark 2.1.x has been removed.
- Binary models can be downloaded only if the algorithm parameter keepBinaryModels is set to true.
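As an illustration of the unified setters, a minimal Scala sketch; the directory, log level and port values are placeholders, and the string/integer argument types are assumptions:

val conf = new H2OConf()
conf.setIcedDir("/tmp/h2o-iced")  // replaces setClientIcedDir and setNodeIcedDir
conf.setLogLevel("INFO")          // replaces setH2OClientLogLevel and setH2ONodeLogLevel
conf.setBasePort(54321)           // replaces setClientBasePort and setNodeBasePort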
From 3.32 to 3.32.1

- The data type of H2OTargetEncoder output columns has been changed from DoubleType to ml.linalg.VectorUDT.
- The sub-columns of the prediction column produced by H2OMOJOPipelineModel could be of the type float instead of double if MOJOModelSettings.namedMojoOutputColumns is set to true.
From 3.30.1 to 3.32

- We have created two new classes, ai.h2o.sparkling.H2OContext and ai.h2o.sparkling.H2OConf. The behaviour of the context and configuration is the same as in the original org.apache.spark.h2o package, except that we no longer use the H2O client on the Spark driver. This means that H2O runs only on the worker nodes and not on the Spark driver. This change affects only the Scala API, as PySparkling and RSparkling already use the new API internally. We have also changed all documentation materials to point to the new classes.
- We are keeping the original context and conf available in the code in case the user needs to use some of the H2O client features directly on the Spark driver (for example, if the user trains an XGBoost model using the H2O Java API without using H2OXGBoost from Sparkling Water, this requires the H2O client to be available). Please note that once the API in the new package ai.h2o.sparkling is complete, the original context and conf classes will be removed. We therefore encourage users to migrate to the new context class and report missing features to us via GitHub issues.
- In PySparkling, H2OConf no longer accepts any arguments. To create a new H2OConf, just call conf = H2OConf(). Also, the H2OContext.getOrCreate method no longer accepts the spark argument. You can start H2OContext as H2OContext.getOrCreate() or H2OContext.getOrCreate(conf).
- In RSparkling, H2OConf no longer accepts any arguments. To create a new H2OConf, just call conf <- H2OConf(). Also, the H2OContext.getOrCreate method no longer accepts the spark argument. You can start H2OContext as H2OContext.getOrCreate() or H2OContext.getOrCreate(conf).
- In Scala, H2OConf can be created as new H2OConf() or new H2OConf(sparkConf). Other constructor variants have been removed. Also, H2OContext can be created as H2OContext.getOrCreate() or H2OContext.getOrCreate(conf). The other variants of this method have been removed.
- The setH2OCluster(ip, port) method on H2OConf in all APIs no longer implicitly sets the external backend. The method setExternalClusterMode() must be called explicitly.
- The method classify in the hex.ModelUtils object is removed. Please use the Sparkling Water algorithm API to train and score H2O models. This removal affects only the Scala API, as other APIs don't have such functionality.
- The method DLModel in water.support.DeepLearningSupport is removed. Please use H2ODeepLearning instead. The same holds for the method GBMModel in water.support.GBMSupport; please use H2OGBM instead. The classes wrapping these methods are removed as well. This removal affects only the Scala API, as other APIs don't have such functionality.
- The methods splitFrame and split in water.support.H2OFrameSupport are removed. Please use ai.h2o.sparkling.H2OFrame(frameKeyString).split(ratios) instead.
- The method withLockAndUpdate in water.support.H2OFrameSupport is removed. Please use ai.h2o.sparkling.backend.utils.H2OClientUtils.withLockAndUpdate instead.
- The methods columnsToCategorical, with both the indices and the column names arguments, in water.support.H2OFrameSupport are removed. Please use ai.h2o.sparkling.H2OFrame(frameKeyString).convertColumnsToCategorical instead.
- The method modelMetrics in water.support.ModelMetricsSupport is removed. Please use the methods getTrainingMetrics, getValidationMetrics or getCrossValidationMetrics on H2OMOJOModel. You can also use the method getCurrentMetrics, which returns cross-validation metrics if nfolds was specified and higher than 0; validation metrics if a validation frame was specified (splitRatio was set and lower than 1) and nfolds was 0; and training metrics otherwise (splitRatio is 1 and nfolds is 0).
- The whole trait ModelSerializationSupport in Scala is removed. The MOJO is a first-class citizen in Sparkling Water, and most code works with our Spark MOJO wrapper. Please use the following approaches to migrate from the previous methods of the model serialization support. To create a Spark MOJO wrapper in Sparkling Water, you can load it from H2O-3 as:
val mojoModel = H2OMOJOModel.createFromMojo(path)
or train a model using the Sparkling Water API, such as:

val gbm = H2OGBM().setLabelCol("label")
val mojoModel = gbm.fit(data)
In this case, mojoModel is a Spark wrapper around the H2O MOJO providing a Spark-friendly API. This also means that such a model can be embedded into Spark pipelines without any additional work. To export it, please call:
mojoModel.write.save("path")
The advantage is that this variant is H2O-version independent, and when such a model is loaded, the H2O runtime is not required.
You can load the exported model from Sparkling Water as:
val mojoModel = H2OMOJOModel.read.load("path")
For additional information about how to load MOJO into Sparkling Water, please see Loading MOJOs into Sparkling Water.
- The methods join, innerJoin, outerJoin, leftJoin and rightJoin in water.support.JoinSupport are removed together with their encapsulating class. The enum water.support.munging.JoinMethod is also removed, as it is no longer required. In order to perform joins, please use the following methods (see the sketch after this list):
  - Inner join: ai.h2o.sparkling.H2OFrame(idOfLeftFrame).innerJoin(rightFrame)
  - Outer join: ai.h2o.sparkling.H2OFrame(idOfLeftFrame).outerJoin(rightFrame)
  - Left join: ai.h2o.sparkling.H2OFrame(idOfLeftFrame).leftJoin(rightFrame)
  - Right join: ai.h2o.sparkling.H2OFrame(idOfLeftFrame).rightJoin(rightFrame)
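For instance, a minimal Scala sketch of the new join API; "leftKey" and "rightKey" are placeholder keys of frames that already exist in the H2O cluster:

import ai.h2o.sparkling.H2OFrame

val left = H2OFrame("leftKey")
val right = H2OFrame("rightKey")
val joined = left.innerJoin(right)  // outerJoin, leftJoin and rightJoin work the same way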
- Since the method asH2OFrame of H2OContext converts strings to categorical columns automatically, according to the heuristic from the H2O parsers, the methods getAllStringColumnsToCategorical and setAllStringColumnsToCategorical have been removed from all SW algorithms in the Python and Scala APIs.
- The methods setH2ONodeLogLevel and setH2OClientLogLevel are removed on H2OContext. Please use setH2OLogLevel instead.
- The method asDataFrame on the Scala H2OContext has been replaced by the method asSparkFrame with the same arguments. This was done to ensure full consistency between the Scala, Python and R APIs.
- JavaH2OContext is removed. Please use org.apache.spark.h2o.H2OContext instead.
- When using H2O as a Spark data source, the approach val df = spark.read.h2o(key) has been removed. Please use val df = spark.read.format("h2o").load(key) instead. The same holds for spark.write.h2o(key); please use df.write.format("h2o").save("new_key") instead.
- Starting from version 3.32, H2OGridSearch hyper-parameters correspond to parameter names in Sparkling Water. Previously, the hyper-parameters were specified using internal H2O names such as _ntrees or _max_depth. From this version on, the parameter names follow the naming convention of the getters and setters of the corresponding parameter, such as ntrees or maxDepth. Also, the output of getGridModelsParams now contains column names which correspond to Sparkling Water parameter names instead of the internal H2O ones. When updating to version 3.32, please make sure to update your hyper-parameter names.
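For illustration, a minimal Scala sketch of the new naming, assuming H2OGridSearch exposes the Map-based setHyperParameters and setAlgo setters and that a training DataFrame named trainingData exists:

import ai.h2o.sparkling.ml.algos.{H2OGBM, H2OGridSearch}

val grid = new H2OGridSearch()
  .setAlgo(new H2OGBM().setLabelCol("label"))
  .setHyperParameters(Map(
    "ntrees" -> Array(50, 100).map(_.asInstanceOf[AnyRef]),  // previously "_ntrees"
    "maxDepth" -> Array(3, 5).map(_.asInstanceOf[AnyRef])))  // previously "_max_depth"
val model = grid.fit(trainingData)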
- On H2OConf, the methods setHiveSupportEnabled, setHiveSupportDisabled and isHiveSupportEnabled are replaced by setKerberizedHiveEnabled, setKerberizedHiveDisabled and isKerberizedHiveEnabled to reflect their actual meaning. Also, the option spark.ext.h2o.hive.enabled is replaced by spark.ext.h2o.kerberized.hive.enabled.
- A number of Grid Search parameters, together with their getters and setters, were replaced by the same parameters on the algorithm the grid search is applied to.
- The schema of the detailed predictions produced by H2OMOJOModel, and thus by all Sparkling Water algorithms, has changed slightly: the MapType sub-columns probabilities, calibratedProbabilities and contributions have been changed to StructType columns.
- On H2OXGBoost, the options minSumHessianInLeaf and minDataInLeaf have been removed, as well as the corresponding getters and setters. The methods are removed without replacement, as these parameters weren't valid XGBoost parameters.
From 3.30 to 3.30.1

The detailed prediction column is always enabled for all types of MOJO predictions.
From 3.28.1 to 3.30

- It is now required to explicitly create H2OContext before you run any of our exposed algorithms. Previously, the algorithm would create the H2OContext on demand.
- It is no longer possible to disable the web (REST API endpoints) on the worker nodes in the internal backend, as we require the endpoints to be available. In particular, the methods setH2ONodeWebEnabled, setH2ONodeWebDisabled and h2oNodeWebEnabled are removed without replacement. Also, the option spark.ext.h2o.node.enable.web does not have any effect anymore.
- It is no longer possible to disable the web (REST API endpoints) on the client node, as we require the REST API to be available. In particular, the methods setClientWebEnabled, setClientWebDisabled and clientWebEnabled are removed without replacement. Also, the option spark.ext.h2o.client.enable.web does not have any effect anymore.
- The property spark.ext.h2o.node.iced.dir and the setter method setNodeIcedDir on H2OConf have no effect in all 3.30.x.y-z versions. If users need to set a custom iced directory for executors, they can set the property spark.ext.h2o.node.extra to -ice_root dir, where dir is a user-specified directory, as in the sketch below.
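A minimal Scala sketch of that workaround, assuming an existing H2OConf named conf; the directory is a placeholder, and the String signature of setNodeExtraProperties is an assumption:

conf.setNodeExtraProperties("-ice_root /mnt/h2o-iced")  // extra arguments passed to the executor H2O nodes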
Removal of Deprecated Methods and Classes
- On PySparkling, passing authentication on H2OContext via the auth parameter is removed in favor of the methods setUserName and setPassword on H2OConf, or via the Spark options spark.ext.h2o.user.name and spark.ext.h2o.password directly.
- On PySparkling, passing the verify_ssl_certificates parameter as an H2OContext argument is removed in favor of the method setVerifySslCertificates on H2OConf, or via the Spark option spark.ext.h2o.verify_ssl_certificates.
- On RSparkling, the method h2o_context is removed. To create H2OContext, please call hc <- H2OContext.getOrCreate(). Also, the methods h2o_flow, as_h2o_frame and as_spark_dataframe are removed. Please use the methods available on the H2OContext instance created via hc <- H2OContext.getOrCreate(). Instead of h2o_flow, use hc$openFlow; instead of as_h2o_frame, use asH2OFrame; and instead of as_spark_dataframe, use asSparkFrame. Also, H2OContext.getOrCreate() no longer has the username and password arguments. The correct way to pass authentication details to H2OContext is via the H2OConf class, such as:

conf <- H2OConf()
conf$setUserName(username)
conf$setPassword(password)
hc <- H2OContext.getOrCreate(conf)

The Spark options spark.ext.h2o.user.name and spark.ext.h2o.password correspond to these setters and can also be used directly.
- In the H2OContext Python API, the method as_spark_frame is replaced by the method asSparkFrame, and the method as_h2o_frame is replaced by asH2OFrame.
- In the H2OXGBoost Scala and Python API, the methods getNEstimators and setNEstimators are removed. Please use getNtrees and setNtrees instead.
- In the Scala and Python APIs for tree-based algorithms, the method getR2Stopping is removed in favor of the getStoppingRounds, getStoppingMetric and getStoppingTolerance methods, and the method setR2Stopping is removed in favor of the setStoppingRounds, setStoppingMetric and setStoppingTolerance methods.
- The method download_h2o_logs on the PySparkling H2OContext is removed in favor of the downloadH2OLogs method.
- The method get_conf on the PySparkling H2OContext is removed in favor of the getConf method.
- On the Python and Scala H2OGLM API, the methods setExactLambdas and getExactLambdas are removed without replacement.
- On the H2OConf Python API, the following methods have been renamed to be consistent with their Scala counterparts:
h2o_cluster -> h2oCluster
h2o_cluster_host -> h2oClusterHost
h2o_cluster_port -> h2oClusterPort
cluster_size -> clusterSize
cluster_start_timeout -> clusterStartTimeout
cluster_config_file -> clusterInfoFile
mapper_xmx -> mapperXmx
hdfs_output_dir -> HDFSOutputDir
cluster_start_mode -> clusterStartMode
is_auto_cluster_start_used -> isAutoClusterStartUsed
is_manual_cluster_start_used -> isManualClusterStartUsed
h2o_driver_path -> h2oDriverPath
yarn_queue -> YARNQueue
is_kill_on_unhealthy_cluster_enabled -> isKillOnUnhealthyClusterEnabled
kerberos_principal -> kerberosPrincipal
kerberos_keytab -> kerberosKeytab
run_as_user -> runAsUser
set_h2o_cluster -> setH2OCluster
set_cluster_size -> setClusterSize
set_cluster_start_timeout -> setClusterStartTimeout
set_cluster_config_file -> setClusterInfoFile
set_mapper_xmx -> setMapperXmx
set_hdfs_output_dir -> setHDFSOutputDir
use_auto_cluster_start -> useAutoClusterStart
use_manual_cluster_start -> useManualClusterStart
set_h2o_driver_path -> setH2ODriverPath
set_yarn_queue -> setYARNQueue
set_kill_on_unhealthy_cluster_enabled -> setKillOnUnhealthyClusterEnabled
set_kill_on_unhealthy_cluster_disabled -> setKillOnUnhealthyClusterDisabled
set_kerberos_principal -> setKerberosPrincipal
set_kerberos_keytab -> setKerberosKeytab
set_run_as_user -> setRunAsUser
num_h2o_workers -> numH2OWorkers
drdd_mul_factor -> drddMulFactor
num_rdd_retries -> numRddRetries
default_cloud_size -> defaultCloudSize
subseq_tries -> subseqTries
h2o_node_web_enabled -> h2oNodeWebEnabled
node_iced_dir -> nodeIcedDir
set_num_h2o_workers -> setNumH2OWorkers
set_drdd_mul_factor -> setDrddMulFactor
set_num_rdd_retries -> setNumRddRetries
set_default_cloud_size -> setDefaultCloudSize
set_subseq_tries -> setSubseqTries
set_h2o_node_web_enabled -> setH2ONodeWebEnabled
set_h2o_node_web_disabled -> setH2ONodeWebDisabled
set_node_iced_dir -> setNodeIcedDir
backend_cluster_mode -> backendClusterMode
cloud_name -> cloudName
is_h2o_repl_enabled -> isH2OReplEnabled
scala_int_default_num -> scalaIntDefaultNum
is_cluster_topology_listener_enabled -> isClusterTopologyListenerEnabled
is_spark_version_check_enabled -> isSparkVersionCheckEnabled
is_fail_on_unsupported_spark_param_enabled -> isFailOnUnsupportedSparkParamEnabled
jks_pass -> jksPass
jks_alias -> jksAlias
hash_login -> hashLogin
ldap_login -> ldapLogin
kerberos_login -> kerberosLogin
login_conf -> loginConf
ssl_conf -> sslConf
auto_flow_ssl -> autoFlowSsl
h2o_node_log_level -> h2oNodeLogLevel
h2o_node_log_dir -> h2oNodeLogDir
cloud_timeout -> cloudTimeout
node_network_mask -> nodeNetworkMask
stacktrace_collector_interval -> stacktraceCollectorInterval
context_path -> contextPath
flow_scala_cell_async -> flowScalaCellAsync
max_parallel_scala_cell_jobs -> maxParallelScalaCellJobs
internal_port_offset -> internalPortOffset
mojo_destroy_timeout -> mojoDestroyTimeout
node_base_port -> nodeBasePort
node_extra_properties -> nodeExtraProperties
flow_extra_http_headers -> flowExtraHttpHeaders
is_internal_secure_connections_enabled -> isInternalSecureConnectionsEnabled
flow_dir -> flowDir
client_ip -> clientIp
client_iced_dir -> clientIcedDir
h2o_client_log_level -> h2oClientLogLevel
h2o_client_log_dir -> h2oClientLogDir
client_base_port -> clientBasePort
client_web_port -> clientWebPort
client_verbose_output -> clientVerboseOutput
client_network_mask -> clientNetworkMask
ignore_spark_public_dns -> ignoreSparkPublicDNS
client_web_enabled -> clientWebEnabled
client_flow_baseurl_override -> clientFlowBaseurlOverride
client_extra_properties -> clientExtraProperties
runs_in_external_cluster_mode -> runsInExternalClusterMode
runs_in_internal_cluster_mode -> runsInInternalClusterMode
client_check_retry_timeout -> clientCheckRetryTimeout
set_internal_cluster_mode -> setInternalClusterMode
set_external_cluster_mode -> setExternalClusterMode
set_cloud_name -> setCloudName
set_nthreads -> setNthreads
set_repl_enabled -> setReplEnabled
set_repl_disabled -> setReplDisabled
set_default_num_repl_sessions -> setDefaultNumReplSessions
set_cluster_topology_listener_enabled -> setClusterTopologyListenerEnabled
set_cluster_topology_listener_disabled -> setClusterTopologyListenerDisabled
set_spark_version_check_disabled -> setSparkVersionCheckDisabled
set_fail_on_unsupported_spark_param_enabled -> setFailOnUnsupportedSparkParamEnabled
set_fail_on_unsupported_spark_param_disabled -> setFailOnUnsupportedSparkParamDisabled
set_jks -> setJks
set_jks_pass -> setJksPass
set_jks_alias -> setJksAlias
set_hash_login_enabled -> setHashLoginEnabled
set_hash_login_disabled -> setHashLoginDisabled
set_ldap_login_enabled -> setLdapLoginEnabled
set_ldap_login_disabled -> setLdapLoginDisabled
set_kerberos_login_enabled -> setKerberosLoginEnabled
set_kerberos_login_disabled -> setKerberosLoginDisabled
set_login_conf -> setLoginConf
set_ssl_conf -> setSslConf
set_auto_flow_ssl_enabled -> setAutoFlowSslEnabled
set_auto_flow_ssl_disabled -> setAutoFlowSslDisabled
set_h2o_node_log_level -> setH2ONodeLogLevel
set_h2o_node_log_dir -> setH2ONodeLogDir
set_cloud_timeout -> setCloudTimeout
set_node_network_mask -> setNodeNetworkMask
set_stacktrace_collector_interval -> setStacktraceCollectorInterval
set_context_path -> setContextPath
set_flow_scala_cell_async_enabled -> setFlowScalaCellAsyncEnabled
set_flow_scala_cell_async_disabled -> setFlowScalaCellAsyncDisabled
set_max_parallel_scala_cell_jobs -> setMaxParallelScalaCellJobs
set_internal_port_offset -> setInternalPortOffset
set_node_base_port -> setNodeBasePort
set_mojo_destroy_timeout -> setMojoDestroyTimeout
set_node_extra_properties -> setNodeExtraProperties
set_flow_extra_http_headers -> setFlowExtraHttpHeaders
set_internal_secure_connections_enabled -> setInternalSecureConnectionsEnabled
set_internal_secure_connections_disabled -> setInternalSecureConnectionsDisabled
set_flow_dir -> setFlowDir
set_client_ip -> setClientIp
set_client_iced_dir -> setClientIcedDir
set_h2o_client_log_level -> setH2OClientLogLevel
set_h2o_client_log_dir -> setH2OClientLogDir
set_client_port_base -> setClientBasePort
set_client_web_port -> setClientWebPort
set_client_verbose_enabled -> setClientVerboseEnabled
set_client_verbose_disabled -> setClientVerboseDisabled
set_client_network_mask -> setClientNetworkMask
set_ignore_spark_public_dns_enabled -> setIgnoreSparkPublicDNSEnabled
set_ignore_spark_public_dns_disabled -> setIgnoreSparkPublicDNSDisabled
set_client_web_enabled -> setClientWebEnabled
set_client_web_disabled -> setClientWebDisabled
set_client_flow_baseurl_override -> setClientFlowBaseurlOverride
set_client_check_retry_timeout -> setClientCheckRetryTimeout
set_client_extra_properties -> setClientExtraProperties
- In the H2OAutoML Python and Scala API, the member leaderboard()/leaderboard is replaced by the method getLeaderboard().
- The method setClusterConfigFile was removed from H2OConf in the Scala API. The replacement method is setClusterInfoFile on H2OConf.
- The method setClientPortBase was removed from H2OConf in the Scala API. The replacement method is setClientBasePort on H2OConf.
- In the H2OGridSearch Python API, the methods get_grid_models, get_grid_models_params and get_grid_models_metrics are removed and replaced by getGridModels, getGridModelsParams and getGridModelsMetrics.
- On the H2OXGBoost Scala and Python API, the methods getInitialScoreIntervals, setInitialScoreIntervals, getScoreInterval and setScoreInterval are removed without replacement. They correspond to an internal H2O argument which should not be exposed.
- On the H2OXGBoost Scala and Python API, the methods getLearnRateAnnealing and setLearnRateAnnealing are removed without replacement, as this parameter is currently not exposed in H2O.
- The methods ignoreSparkPublicDNS, setIgnoreSparkPublicDNSEnabled and setIgnoreSparkPublicDNSDisabled are removed without replacement, as they are no longer required. Also, the option spark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS does not have any effect anymore.
From 3.28.0 to 3.28.1

- On the H2OConf Python API, the methods external_write_confirmation_timeout and set_external_write_confirmation_timeout are removed without replacement. On the H2OConf Scala API, the methods externalWriteConfirmationTimeout and setExternalWriteConfirmationTimeout are removed without replacement. Also, the option spark.ext.h2o.external.write.confirmation.timeout does not have any effect anymore.
- The environment variable H2O_EXTENDED_JAR, specifying the path to an extended driver jar, was entirely replaced by H2O_DRIVER_JAR. H2O_DRIVER_JAR should contain a path to a plain H2O driver jar without any extensions. For more details, see External Backend.
- The location of the Sparkling Water assembly JAR has changed inside the Sparkling Water distribution archive, which you can download from our download page. It has been moved from assembly/build/libs to just jars.
- H2OSVM has been removed from the Scala API. We have removed this API as it was just wrapping Spark SVM and complicated future development. If you still need to use SVM, please use Spark SVM directly. All the parameters remain the same. We are planning to expose a proper H2O SVM implementation in Sparkling Water in the following major releases.
- In the case of binomial predictions on H2O MOJOs, the fields p0 and p1 in the detailed prediction column are replaced by a single field probabilities, which is a map from label to predicted probability. The same is done for the fields p0_calibrated and p1_calibrated: these are replaced by a single field calibratedProbabilities, which is a map from label to predicted calibrated probability.
- In the case of multinomial predictions on H2O MOJOs, the type of the field probabilities in the detailed prediction column is changed from an array of probabilities to a map from label to predicted probability.
- In the case of ordinal predictions on H2O MOJOs, the type of the field probabilities in the detailed prediction column is changed from an array of probabilities to a map from label to predicted probability (see the sketch after this list).
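A minimal Scala sketch of reading the new map-typed field, assuming a scored DataFrame named predictions, the detailed prediction column named detailed_prediction, and a class label "1":

import org.apache.spark.sql.functions.col

// Before 3.28.1, probabilities was an array indexed by class position;
// now it is a map keyed by the label value.
predictions.select(col("detailed_prediction.probabilities").getItem("1")).show()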
- On H2OConf in all clients, the methods externalCommunicationBlockSizeAsBytes, externalCommunicationBlockSize and setExternalCommunicationBlockSize have been removed, as they are no longer needed.
- The method Security.enableSSL in the Scala API has been removed. Please use setInternalSecureConnectionsEnabled on H2OConf to secure your cluster. This setter is available in the Scala, Python and R clients.
- For users of the manual backend, we have simplified the configuration: there is no longer a need to specify the cluster size in advance, as Sparkling Water discovers the cluster size automatically. In particular, spark.ext.h2o.external.cluster.size does not have any effect anymore.
From 3.26 to 3.28.0

Passing Authentication in Scala

Scala users who set up any form of authentication on the backend side are now required to specify credentials on the H2OConf object via setUserName and setPassword. It is also possible to specify these directly as the Spark options spark.ext.h2o.user.name and spark.ext.h2o.password. Note: at this moment, only users of the external backend actually need to specify these options, as the external backend communicates via the REST API; however, all our documentation already uses these options, as the internal backend will start using the REST API soon as well.
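For example, a minimal Scala sketch; the credential values are placeholders, and the pre-3.30 constructor and getOrCreate variants taking an existing SparkSession named spark are assumed here:

val conf = new H2OConf(spark)
conf.setUserName("username")
conf.setPassword("password")
val hc = H2OContext.getOrCreate(spark, conf)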
String instead of enums in Sparkling Water Algo API

In Scala, setters of the pipeline wrappers for H2O algorithms now accept strings in places where they previously accepted enum values. Before, we would call, for example:
import hex.genmodel.utils.DistributionFamily
val gbm = H2OGBM()
gbm.setDistribution(DistributionFamily.multinomial)
Now, the correct code is:
val gbm = H2OGBM()
gbm.setDistribution("multinomial")
which makes the Python and Scala APIs consistent. Both upper-case and lower-case values are valid, and if a wrong input is entered, a warning is printed with the correct possible values.
Switch to Java 1.8 on Spark 2.1

Sparkling Water for Spark 2.1 now requires Java 1.8 or higher.
DRF exposed in Sparkling Water Algorithm API

DRF is now exposed in Sparkling Water. Please see our documentation to learn how to use it: Train DRF Model in Sparkling Water. The Grid Search API can also be run on DRF.
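A minimal Scala sketch of training a DRF model; the training DataFrame named trainingData and its "label" column are placeholders, and the import reflects the ai.h2o.sparkling.ml.algos package introduced in this release:

import ai.h2o.sparkling.ml.algos.H2ODRF

val drf = new H2ODRF().setLabelCol("label")
val model = drf.fit(trainingData)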
Change Default Name of Prediction Column
The default name of the prediction column has been changed from prediction_output to prediction.
Single value in prediction column

The prediction column now contains the predicted value directly. Before this change, the prediction column contained another struct field called value (in the case of regression), which held the value. From now on, the predicted value is always stored directly in the prediction column: in the case of regression, the predicted numeric value, and in the case of classification, the predicted label. If you are interested in more details produced during the prediction, please make sure to set withDetailedPredictionCol to true via the setters in both PySparkling and Sparkling Water. When enabled, an additional column named detailed_prediction is created, which contains additional prediction details, such as probabilities, contributions and so on.
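A minimal Scala sketch of enabling the detailed prediction column; the DataFrames trainingData and testData are placeholders, and the setter name follows the prose above:

val model = new H2OGBM()
  .setLabelCol("label")
  .setWithDetailedPredictionCol(true)
  .fit(trainingData)
model.transform(testData).select("prediction", "detailed_prediction").show()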
Manual mode of external backend now always requires a specification of the cluster location

In previous versions, the H2O client was able to discover nodes using multicast search. That has been removed, and the IP:Port of any node of the external cluster to which we need to connect is now required. This also means that users who relied on multicast cloud-up with the external H2O backend in manual standalone (no Hadoop) mode now need to pass the flatfile argument to external H2O. For more information, please see Manual Mode of External Backend without Hadoop (standalone).
Removal of Deprecated Methods and Classes

- The getColsampleBytree and setColsampleBytree methods are removed from the XGBoost API. Please use the new getColSampleByTree and setColSampleByTree.
- The deprecated option spark.ext.h2o.external.cluster.num.h2o.nodes and the corresponding setters are removed. Please use spark.ext.h2o.external.cluster.size or the corresponding setter setClusterSize.
- The deprecated algorithm classes in the package org.apache.spark.h2o.ml.algos are removed. Please use the classes from the package ai.h2o.sparkling.ml.algos. Their API remains the same as before. This is the beginning of moving Sparkling Water classes to our distinct package ai.h2o.sparkling.
- The deprecated option spark.ext.h2o.external.read.confirmation.timeout and the related setters are removed. This option is removed without a replacement, as it is no longer needed.
- The deprecated parameter SelectBestModelDecreasing on the Grid Search API is removed, together with the related getters and setters. This parameter is removed without replacement, as we now internally sort the models with the ordering meaningful to the specified sort metric.
- The TargetEncoder transformer now accepts the outputCols parameter, which can be used to override the default output column names.
- On the PySparkling H2OGLM API, we removed the deprecated parameter alpha in favor of alphaValue, and lambda_ in favor of lambdaValue. On both the PySparkling and Sparkling Water H2OGLM APIs, we removed the methods getAlpha in favor of getAlphaValue, getLambda in favor of getLambdaValue, setAlpha in favor of setAlphaValue, and setLambda in favor of setLambdaValue. These changes ensure consistency across the Python and Scala APIs.
- In the Sparkling Water H2OConf API, we removed the method h2oDriverIf in favor of externalH2ODriverIf, and setH2ODriverIf in favor of setExternalH2ODriverIf. In the PySparkling H2OConf API, we removed the method h2o_driver_if in favor of externalH2ODriverIf, and set_h2o_driver_if in favor of setExternalH2ODriverIf.
- On the PySparkling H2OConf API, the method user_name has been removed in favor of the userName method, and the method set_user_name has been removed in favor of the setUserName method.
- The configurations spark.ext.h2o.external.kill.on.unhealthy.interval, spark.ext.h2o.external.health.check.interval and spark.ext.h2o.ui.update.interval have been removed and replaced by a single option spark.ext.h2o.backend.heartbeat.interval. On the H2OConf Scala API, the methods backendHeartbeatInterval and setBackendHeartbeatInterval were added, and the following methods were removed: uiUpdateInterval, setUiUpdateInterval, killOnUnhealthyClusterInterval, setKillOnUnhealthyClusterInterval, healthCheckInterval and setHealthCheckInterval. On the H2OConf Python API, the methods backendHeartbeatInterval and setBackendHeartbeatInterval were added, and the following methods were removed: ui_update_interval, set_ui_update_interval, kill_on_unhealthy_cluster_interval, set_kill_on_unhealthy_cluster_interval, get_health_check_interval and set_health_check_interval. The added methods configure a single interval which was previously specified by three different methods.
- The configuration spark.ext.h2o.cluster.client.connect.timeout is removed without replacement, as it is no longer needed. On the H2OConf Scala API, the methods clientConnectionTimeout and setClientConnectionTimeout were removed, and on the H2OConf Python API, the methods client_connection_timeout and set_client_connection_timeout were removed.
Change of Versioning Scheme

The version of Sparkling Water now follows the pattern H2OVersion-SWPatchVersion-SparkVersion, where H2OVersion is the full version of the H2O integrated into Sparkling Water, SWPatchVersion specifies a patch version, and SparkVersion is the Spark version. This scheme allows us to release Sparkling Water without releasing H2O when the change is only on the Sparkling Water side; in that case, we just increment SWPatchVersion. A new version therefore looks, for example, like 3.26.0.9-2-2.4. This version tells us that this Sparkling Water integrates H2O 3.26.0.9, that it is the second release with the 3.26.0.9 version, and that it is for Spark 2.4.
Renamed Property for Passing Extra HTTP Headers for Flow UI

The configuration property spark.ext.h2o.client.flow.extra.http.headers was renamed to spark.ext.h2o.flow.extra.http.headers, since Flow UI can also run on H2O nodes and the value of the property has been propagated to H2O nodes since the major version 3.28.0.1-1.
External Backend now keeps H2O Flow accessible on worker nodes

The option spark.ext.h2o.node.enable.web no longer has any effect for the automatic mode of the external backend, as we require H2O Flow to be accessible on the worker nodes. The associated getters and setters also have no effect in this case.

Users of the manual mode of the external backend are also required to keep the REST API available on all worker nodes. In particular, the H2O option -disable_web can't be specified when starting H2O.
Default Values of Some AutoML Parameters Have Changed

The default values of several AutoML parameters have changed across all APIs.
From any previous version to 3.26.11

Users of the Sparkling Water external cluster in manual mode on Hadoop need to update the command with which the external cluster is launched: a new parameter, -sw_ext_backend, needs to be added to the h2odriver invocation, as in the sketch below.
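For illustration, a sketch of such an h2odriver launch; the jar name and the cluster-sizing flags are placeholders taken from a typical H2O-on-Hadoop invocation, not a prescribed command:

hadoop jar h2odriver.jar -sw_ext_backend -nodes 3 -mapperXmx 6g -output hdfsOutputDir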