impute_missing

  • Available in: PCA
  • Hyperparameter: no

Description

When the training data contains NA/missing values, PCA removes the affected rows by default, so the model may be trained on fewer rows than the dataset contains. If this is not the desired behavior, enable the impute_missing option to replace the missing entries in each column with that column's mean value.

This option defaults to False (rows with missing values are removed).
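
The difference between the two behaviors can be illustrated with a small, library-free sketch. This is not H2O's implementation; the helper names drop_rows_with_missing and impute_column_means and the toy data are hypothetical, chosen only to show how row removal shrinks the dataset while column-mean imputation keeps every row. The sketch assumes each column has at least one observed value.

```python
import math

def drop_rows_with_missing(rows):
    """Keep only rows in which every entry is present (no NaN)."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]

def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column.

    Assumes every column has at least one observed (non-NaN) value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in rows
    ]

nan = float("nan")

# Hypothetical toy data: two of the four rows have a missing entry.
data = [
    [1.0, 10.0],
    [2.0, nan],
    [nan, 30.0],
    [3.0, 20.0],
]

dropped = drop_rows_with_missing(data)   # analogous to impute_missing = False
imputed = impute_column_means(data)      # analogous to impute_missing = True

print(len(dropped))  # 2 rows remain
print(len(imputed))  # all 4 rows remain
print(imputed[1])    # [2.0, 20.0]
print(imputed[2])    # [2.0, 30.0]
```

With row removal, only the two complete rows are left for training. With imputation, all four rows are kept and each gap is filled with its column mean (2.0 for the first column, 20.0 for the second).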

Example

library(h2o)
h2o.init()

# Load the Birds dataset
birds.hex <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/birds.csv")

# Train with impute_missing enabled
birds.pca <- h2o.prcomp(training_frame = birds.hex, transform = "STANDARDIZE",
                        k = 3, pca_method = "Power", use_all_factor_levels = TRUE,
                        impute_missing = TRUE)

# View the importance of components
birds.pca@model$importance
Importance of components:
                            pc1      pc2      pc3
Standard deviation     1.496991 1.351000 1.014182
Proportion of Variance 0.289987 0.236184 0.133098
Cumulative Proportion  0.289987 0.526171 0.659269

# View the eigenvectors
birds.pca@model$eigenvectors
Rotation:
                  pc1      pc2       pc3
patch.Ref1a  0.007207 0.007449  0.001161
patch.Ref1b -0.003090 0.011257 -0.001066
patch.Ref1c  0.002962 0.008850 -0.000264
patch.Ref1d -0.001295 0.011003  0.000501
patch.Ref1e  0.006559 0.006904 -0.001206

---
                pc1       pc2       pc3
S          0.463591 -0.053410  0.184799
year      -0.055934  0.009691 -0.968635
area       0.533375 -0.289381 -0.130338
log.area.  0.583966 -0.262287 -0.089582
ENN       -0.270615 -0.573900  0.038835
log.ENN.  -0.231368 -0.640231  0.026325

# Train again without imputing missing values
birds2.pca <- h2o.prcomp(training_frame = birds.hex, transform = "STANDARDIZE",
                         k = 3, pca_method = "Power", use_all_factor_levels = TRUE,
                         impute_missing = FALSE)

Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  _train: Dataset used may contain fewer number of rows due to removal of rows
  with NA/missing values. If this is not desirable, set impute_missing argument
  in pca call to TRUE/True/true/... depending on the client language.

# View the importance of components
birds2.pca@model$importance
Importance of components:
                            pc1      pc2      pc3
Standard deviation     1.546397 1.348276 1.055239
Proportion of Variance 0.300269 0.228258 0.139820
Cumulative Proportion  0.300269 0.528527 0.668347

# View the eigenvectors
birds2.pca@model$eigenvectors
Rotation:
                  pc1       pc2       pc3
patch.Ref1a  0.009848 -0.005947 -0.001061
patch.Ref1b -0.001628 -0.014739 -0.001007
patch.Ref1c  0.004994 -0.009486 -0.000523
patch.Ref1d  0.000117 -0.004400 -0.004917
patch.Ref1e  0.003627 -0.001467 -0.004268

---
                pc1       pc2       pc3
S          0.515048  0.226915 -0.123136
year      -0.066269 -0.069526  0.971250
area       0.414050  0.344332  0.149339
log.area.  0.497313  0.363609  0.131261
ENN       -0.390235  0.545631 -0.007944
log.ENN.  -0.345665  0.562834 -0.002092