Imputing Data
-------------

The impute function allows you to perform in-place imputation by filling missing values with aggregates computed on the "na.rm’d" vector. Additionally, you can also perform imputation based on groupings of columns from within the dataset. These columns can be passed by index or by column name to the ``by`` parameter. Note that if a factor column is supplied, then the method must be ``mode``.

The ``impute`` function accepts the following arguments:

- ``dataset``: The dataset containing the column to impute
- ``column``: A specific column to impute. The default of ``0`` specifies to impute the entire frame.
- ``method``: The type of imputation to perform. ``mean`` replaces NAs with the column mean; ``median`` replaces NAs with the column median; ``mode`` replaces with the most common factor (for factor columns only).
- ``combine_method``: If method is ``median``, then choose how to combine quantiles on even sample sizes. This parameter is ignored in all other cases. Available options for ``combine_method`` include ``interpolate``, ``average``, ``low``, and ``high``. 
- ``by``: Group by columns
- ``groupByFrame`` or ``group_by_frame``: Impute the column with this pre-computed grouped frame.
- ``values``:  A vector of impute values (one per column). NaN indicates to skip the column.

.. example-code::
   .. code-block:: r

	library(h2o)
	h2o.init()

   	# Upload the Airlines dataset
   	filePath <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
   	air <- h2o.importFile(filePath, "air")
   	print(dim(air))
   	43978    31

   	# Show the number of rows with NA.
   	print(numNAs <- sum(is.na(air$DepTime)))
   	[1] 1086

   	DepTime_mean <- mean(air$DepTime, na.rm = TRUE)
   	print(DepTime_mean)
   	[1] 1345.847

   	# Mean impute the DepTime column
   	h2o.impute(air, "DepTime", method = "mean")
   	 [1]     NaN      NaN      NaN      NaN 1345.847      NaN      NaN      NaN
	 [9]     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
	[17]     NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
	[25]     NaN      NaN      NaN      NaN      NaN      NaN      NaN

	# Revert the imputations
	air <- h2o.importFile(filePath, "air")

	# Impute the column using a grouping based on the Origin and Distance
	# If the Origin and Distance produce groupings of NAs, then no imputation will be done (NAs will result).
	h2o.impute(air, "DepTime", method = "mean", by = c("Dest"))
	  Dest mean_DepTime
	1  ABE     1671.795
	2  ABQ     1308.074
	3  ACY     1651.095
	4  ALB     1405.412
	5  AMA     1404.333
	6  ANC     2022.000

	[134 rows x 2 columns]

	# Revert the imputations
	air <- h2o.importFile(filePath, "air")

	# Impute a factor column by the most common factor in that column
	h2o.impute(air, "TailNum", method = "mode")
	 [1]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 3499  NaN  NaN  NaN  NaN
	[16]  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
	[31]  NaN

	# Revert imputations
	air <- h2o.importFile(filePath, "air")

	# Impute a factor column using a grouping based on the Month
	h2o.impute(air, "TailNum", method = "mode", by=c("Month"))
	  Month mode_TailNum
	1     1         3499
	2    10         3499

   .. code-block:: python

    import h2o
    h2o.init()

    # Import the airlines dataset
    air_path = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
    air = h2o.import_file(path=air_path)
    air.dim
    [43978, 31]

    # Mean impute the DepTime column based on the Origin and Distance columns
    DeptTime_impute = air.impute("DepTime", method = "mean", by = ["Origin", "Distance"])
    DeptTime_impute
    Origin      Distance    mean_DepTime
    --------  ----------  --------------
    ABE              253         1149.7
    ABE              481          812
    ABQ              223         1229.33
    ABQ              277         1565
    ABQ              289         1529
    ABQ              321         1267.06
    ABQ              328         1301.85
    ABQ              332         1655
    ABQ              349          813.28
    ABQ              487         1536.14

    [1497 rows x 3 columns]

    # Revert imputations
    air = h2o.import_file(path=air_path)

    # Mode impute the TailNum column
    mode_impute = air.impute("TailNum", method = "mode")
    mode_impute
    [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 3499.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

    # Revert imputations
    air = h2o.import_file(path=air_path)

    # Mode impute the TailNum column based on the Month and Year columns
    mode_impute = air.impute("TailNum", method = "mode", by=["Month", "Year"])
    mode_impute
    Year    Month    mode_TailNum
    ------  -------  --------------
      1987       10            3499
      1988        1            3499
      1989        1            3499
      1990        1            3499
      1991        1            3499
      1992        1            3499
      1993        1            3499
      1994        1            3499
      1995        1            3500
      1996        1             672

    [22 rows x 3 columns]