Merging Two Datasets

Use the merge function to combine two datasets that share at least one column name. By default, every column the two datasets have in common is used as the merge key; columns that appear in only one dataset are not part of the key and are carried into the result unchanged. To merge on only a subset of the common columns, rename the other shared columns so that their names are unique in the merged result.
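
The default key selection can be illustrated without a running cluster. The following is a minimal pure-Python sketch, not H2O code: the helper names `merge_key` and `rename_to_exclude`, the `_right` suffix, and the extra `id` column are all hypothetical and exist only for illustration.

```python
def merge_key(left_cols, right_cols):
    """Return the columns present in both frames, in left-frame order."""
    right = set(right_cols)
    return [col for col in left_cols if col in right]


def rename_to_exclude(cols, exclude, suffix="_right"):
    """Rename shared columns that should NOT be part of the merge key."""
    return [col + suffix if col in exclude else col for col in cols]


# Hypothetical column lists: both frames share "fruit" and "id".
left_cols = ["fruit", "color", "id"]
right_cols = ["fruit", "citrus", "id"]

# By default, every shared column would become part of the key.
print(merge_key(left_cols, right_cols))     # ['fruit', 'id']

# Renaming "id" on one side leaves "fruit" as the only shared column.
right_renamed = rename_to_exclude(right_cols, {"id"})
print(merge_key(left_cols, right_renamed))  # ['fruit']
```

Renaming one side's column before the merge keeps both copies of that column in the result under distinct names.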

Note that for a merge to work on a multinode cluster, one of the datasets must be small enough to fit on every node.

# Currently, this function only supports `all.x = TRUE`. All other permutations will fail.
library(h2o)
h2o.init()

# Create two simple, two-column R data frames by inputting values, ensuring that both have a common column (in this case, "fruit").
left <- data.frame(fruit = c('apple', 'orange', 'banana', 'lemon', 'strawberry', 'blueberry'),
                   color = c('red', 'orange', 'yellow', 'yellow', 'red', 'blue'))
right <- data.frame(fruit = c('apple', 'orange', 'banana', 'lemon', 'strawberry', 'watermelon'),
                    citrus = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE))

# Convert the R data frames to H2O frames.
left_frame <- as.h2o(left)
print(left_frame)
        fruit  color
 1      apple    red
 2     orange orange
 3     banana yellow
 4      lemon yellow
 5 strawberry    red
 6  blueberry   blue

[6 rows x 2 columns]

right_frame <- as.h2o(right)
print(right_frame)
        fruit citrus
 1      apple  FALSE
 2     orange   TRUE
 3     banana  FALSE
 4      lemon   TRUE
 5 strawberry  FALSE
 6 watermelon  FALSE

[6 rows x 2 columns]

# Merge the data frames. The result is a single dataset with three columns.
new_frame <- h2o.merge(left_frame, right_frame, all.x = TRUE)
print(new_frame)
       fruit  color citrus
1  blueberry   blue   <NA>
2      apple    red  FALSE
3     banana yellow  FALSE
4      lemon yellow   TRUE
5     orange orange   TRUE
6 strawberry    red  FALSE

[6 rows x 3 columns]
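
To make the `all.x = TRUE` behavior concrete, here is a minimal pure-Python sketch of a left outer join on the shared `fruit` column, using a subset of the rows above. It only illustrates the semantics and is not how H2O implements the merge; the helper name `left_outer_join` is hypothetical, and the row order may differ from what H2O returns.

```python
def left_outer_join(left_rows, right_rows, key):
    """Keep every left row; fill right-hand columns with None when unmatched.

    Assumes key values are unique in right_rows (true for the fruit data).
    """
    right_by_key = {row[key]: row for row in right_rows}

    # Collect the right-hand columns other than the key, in first-seen order.
    extra_cols = []
    for row in right_rows:
        for col in row:
            if col != key and col not in extra_cols:
                extra_cols.append(col)

    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        out = dict(row)
        for col in extra_cols:
            out[col] = match.get(col)  # None plays the role of <NA>
        merged.append(out)
    return merged


left = [
    {"fruit": "apple", "color": "red"},
    {"fruit": "lemon", "color": "yellow"},
    {"fruit": "blueberry", "color": "blue"},
]
right = [
    {"fruit": "apple", "citrus": False},
    {"fruit": "lemon", "citrus": True},
    {"fruit": "watermelon", "citrus": False},
]

result = left_outer_join(left, right, "fruit")
for row in result:
    print(row)
```

As in the merged frame above, `blueberry` is kept with a missing `citrus` value, while `watermelon`, which appears only in the right-hand data, is dropped.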

import h2o
import numpy as np

h2o.init()

# Create a dataset by inputting raw data.
df1 = h2o.H2OFrame.from_python({'A':['Hello', 'World', 'Welcome', 'To', 'H2O', 'World'],
                                'n': [0,1,2,3,4,5]})
df1.show()
A          n
-------  ---
Hello      0
World      1
Welcome    2
To         3
H2O        4
World      5

[6 rows x 2 columns]

# Generate a random dataset in Python.
df2 = h2o.H2OFrame.from_python([[x] for x in np.random.randint(0, 10, size=20).tolist()], column_names=['n'])
df2.show()
  n
---
nan
  0
  8
  6
  1
  7
  8
  5
  1
  3

[21 rows x 1 column]

# Merge the first dataset into the second dataset on the common column n.
# Only rows whose key value appears in both datasets are kept (i.e., rows of
# df2 with n greater than 5 have no match in df1 and are dropped).
df3 = df2.merge(df1)
df3.show()
  n  A
---  -------
nan  Hello
  3  To
  3  To
  0  Hello
  5  World
  3  To
  0  Hello
  5  World
  1  World
  2  Welcome

[14 rows x 2 columns]
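
The default behavior (without `all_x`) can be sketched in plain Python as an inner join. This is only an illustration of the semantics, not H2O's implementation; the helper name `inner_join` is hypothetical, and the `n` values are the ones displayed for `df2` above rather than a fresh random draw.

```python
def inner_join(left_rows, right_rows, key):
    """Keep only the left rows whose key value also appears in right_rows."""
    # Index the right-hand rows by key; a key may map to several rows.
    lookup = {}
    for row in right_rows:
        lookup.setdefault(row[key], []).append(row)

    merged = []
    for row in left_rows:
        for match in lookup.get(row[key], []):
            merged.append({**match, **row})
    return merged


# Plain-Python stand-ins for df1 and the displayed rows of df2.
words = ["Hello", "World", "Welcome", "To", "H2O", "World"]
df1_rows = [{"A": word, "n": n} for n, word in enumerate(words)]
df2_rows = [{"n": n} for n in [0, 8, 6, 1, 7, 8, 5, 1, 3]]

df3_rows = inner_join(df2_rows, df1_rows, "n")
print([(row["n"], row["A"]) for row in df3_rows])
# [(0, 'Hello'), (1, 'World'), (5, 'World'), (1, 'World'), (3, 'To')]
```

Rows with `n` equal to 6, 7, or 8 have no partner in the first dataset, so they do not appear in the result, while repeated values such as 1 each produce their own output row.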

# Merge again, this time keeping every row of df2 (all_x=True). Rows of df2
# whose n has no match in df1 (values greater than 5) get a missing value in
# column A.
df4 = df2.merge(df1, all_x=True)
df4.show()
  n  A
---  -----
nan  Hello
  0  Hello
  8
  6
  1  World
  7
  8
  5  World
  1  World
  3  To

[21 rows x 2 columns]