Imputing Data
-------------

The impute function allows you to perform in-place imputation by filling missing values with aggregates computed on the "na.rmâ€™d" vector. Additionally, you can also perform imputation based on groupings of columns from within the dataset. These columns can be passed by index or by column name to the ``by`` parameter. Note that if a factor column is supplied, then the method must be ``mode``.

The ``impute`` function accepts the following arguments:

- ``dataset``: The dataset containing the column to impute
- ``column``: A specific column to impute. The default of ``0`` specifies to impute the entire frame.
- ``method``: The type of imputation to perform. ``mean`` replaces NAs with the column mean; ``median`` replaces NAs with the column median; ``mode`` replaces with the most common factor (for factor columns only).
- ``combine_method``: If method is ``median``, then choose how to combine quantiles on even sample sizes. This parameter is ignored in all other cases. Available options for ``combine_method`` include ``interpolate``, ``average``, ``low``, and ``high``. 
- ``by``: Group by columns
- ``groupByFrame`` or ``group_by_frame``: Impute the column with this pre-computed grouped frame.
- ``values``:  A vector of impute values (one per column). NaN indicates to skip the column.

.. example-code::
   .. code-block:: r

	library(h2o)
	h2o.init()

   	# Upload the Airlines dataset
   	filePath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
   	air <- h2o.importFile(filePath, "air")
   	print(dim(air))
   	43978    31

   	# Show the number of rows with NA.
   	print(numNAs <- sum(is.na(air$DepTime)))
   	[1] 1086

   	DepTime_mean <- mean(air$DepTime, na.rm = TRUE)
   	print(DepTime_mean)
   	[1] 1345.847

   	# Mean impute the DepTime column
   	h2o.impute(air, "DepTime", method = "mean")
   	 [1]     NaN      NaN      NaN      NaN 1345.847      NaN      NaN      NaN
	 [9]     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
	[17]     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
	[25]     NaN      NaN      NaN      NaN      NaN      NaN      NaN

	# Revert the imputations
	air <- h2o.importFile(filePath, "air")

	# Impute the column using a grouping based on the Origin and Distance
	# If the Origin and Distance produce groupings of NAs, then no imputation will be done (NAs will result).
	h2o.impute(air, "DepTime", method = "mean", by = c("Dest"))
	  Dest mean_DepTime
	1  ABE     1671.795
	2  ABQ     1308.074
	3  ACY     1651.095
	4  ALB     1405.412
	5  AMA     1404.333
	6  ANC     2022.000

	[134 rows x 2 columns]

	# Revert the imputations
	air <- h2o.importFile(filePath, "air")

	# Impute a factor column by the most common factor in that column
	h2o.impute(air, "TailNum", method = "mode")
	 [1]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 3499  NaN  NaN  NaN  NaN
	[16]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
	[31]  NaN

	# Revert imputations
	air <- h2o.importFile(filePath, "air")

	# Impute a factor column using a grouping based on the Month
	h2o.impute(air, "TailNum", method = "mode", by=c("Month"))
	  Month mode_TailNum
	1     1         3499
	2    10         3499

   .. code-block:: python

    import h2o
    h2o.init()

    # Import the airlines dataset
    air_path = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    air = h2o.import_file(path=air_path)
    air.dim
    [43978, 31]

    # Mean impute the DepTime column based on the Origin and Distance columns
    DeptTime_impute = air.impute("DepTime", method = "mean", by = ["Origin", "Distance"])
    DeptTime_impute
    Origin      Distance    mean_DepTime
    --------  ----------  --------------
    ABE              253         1149.7
    ABE              481          812
    ABQ              223         1229.33
    ABQ              277         1565
    ABQ              289         1529
    ABQ              321         1267.06
    ABQ              328         1301.85
    ABQ              332         1655
    ABQ              349          813.28
    ABQ              487         1536.14

    [1497 rows x 3 columns]

    # Revert imputations
    air = h2o.import_file(path=air_path)

    # Mode impute the TailNum column
    mode_impute = air.impute("TailNum", method = "mode")
    mode_impute
    [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 3499.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

    # Revert imputations
    air = h2o.import_file(path=air_path)

    # Mode impute the TailNum column based on the Month and Year columns
    mode_impute = air.impute("TailNum", method = "mode", by=["Month", "Year"])
    mode_impute
    Year    Month    mode_TailNum
    ------  -------  --------------
      1987       10            3499
      1988        1            3499
      1989        1            3499
      1990        1            3499
      1991        1            3499
      1992        1            3499
      1993        1            3499
      1994        1            3499
      1995        1            3500
      1996        1             672

    [22 rows x 3 columns]