Performs k-means clustering on an H2O dataset
h2o.kmeans( training_frame, x, model_id = NULL, validation_frame = NULL, nfolds = 0, keep_cross_validation_models = TRUE, keep_cross_validation_predictions = FALSE, keep_cross_validation_fold_assignment = FALSE, fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"), fold_column = NULL, ignore_const_cols = TRUE, score_each_iteration = FALSE, k = 1, estimate_k = FALSE, user_points = NULL, max_iterations = 10, standardize = TRUE, seed = -1, init = c("Random", "PlusPlus", "Furthest", "User"), max_runtime_secs = 0, categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"), export_checkpoints_dir = NULL, cluster_size_constraints = NULL )
training_frame | Id of the training data frame. |
---|---|
x | A vector containing the |
model_id | Destination id for this model; auto-generated if not specified. |
validation_frame | Id of the validation data frame. |
nfolds | Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0. |
keep_cross_validation_models |
|
keep_cross_validation_predictions |
|
keep_cross_validation_fold_assignment |
|
fold_assignment | Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO. |
fold_column | Column with cross-validation fold index assignment per observation. |
ignore_const_cols |
|
score_each_iteration |
|
k | The max. number of clusters. If estimate_k is disabled, the model will find k centroids, otherwise it will find up to k centroids. Defaults to 1. |
estimate_k |
|
user_points | This option allows you to specify a dataframe, where each row represents an initial cluster center. The user- specified points must have the same number of columns as the training observations. The number of rows must equal the number of clusters |
max_iterations | Maximum training iterations (if estimate_k is enabled, then this is for each inner Lloyds iteration) Defaults to 10. |
standardize |
|
seed | Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number). |
init | Initialization mode Must be one of: "Random", "PlusPlus", "Furthest", "User". Defaults to Furthest. |
max_runtime_secs | Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0. |
categorical_encoding | Encoding scheme for categorical features Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO. |
export_checkpoints_dir | Automatically export generated models to this directory. |
cluster_size_constraints | An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters. |
an object of class H2OClusteringModel.
h2o.cluster_sizes
, h2o.totss
, h2o.num_iterations
, h2o.betweenss
, h2o.tot_withinss
, h2o.withinss
, h2o.centersSTD
, h2o.centers
# NOT RUN { library(h2o) h2o.init() prostate_path <- system.file("extdata", "prostate.csv", package = "h2o") prostate <- h2o.uploadFile(path = prostate_path) h2o.kmeans(training_frame = prostate, k = 10, x = c("AGE", "RACE", "VOL", "GLEASON")) # }