Hadoop

Why did I get an error in R when I tried to save my model to my home directory in Hadoop?

To save the model in HDFS, prefix the save directory with hdfs://:

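# build model (replace <model params> with your own arguments)
model <- h2o.glm(<model params>)

# save model
hdfs_name_node <- "mr-0x6"
hdfs_tmp_dir <- "/tmp/runit"
model_path <- sprintf("hdfs://%s%s", hdfs_name_node, hdfs_tmp_dir)
# recent H2O-3 releases take `path` and `filename`; older releases used `dir` and `name`
h2o.saveModel(model, path = model_path, filename = "mymodel")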

What amount of resources are used and reported back to YARN?

The following table provides a summary of the YARN resource usage.

Size	Nodes	Memory per Node	AM Resource	Total Vcores	Total Memory	Total Memory (AM default = 5gb)
Tiny	1	4g	<Default value>	1 + 1 = 2	1.1 * (1 * 4g) + <Default value> = 4.4gb+	9.4gb
Small	2	4g	<Default value>	2 + 1 = 3	1.1 * (2 * 4g) + <Default value> = 8.8gb+	13.8gb
Medium	3	8g	<Default value>	3 + 1 = 4	1.1 * (3 * 8g) + <Default value> = 26.4gb+	31.4gb
Large	5	8g	<Default value>	5 + 1 = 6	1.1 * (5 * 8g) + <Default value> = 44gb+	49gb

Notes:

Each time you launch an H2O cluster, you need one extra container, and with it one extra vcore, for the AM (Application Master). So a tiny cluster actually takes two vcores, not one.
H2O will pad the JVM with 10% (default) overhead in the container. So when you request 4gb JVMs, H2O will actually request 4.4gb containers from YARN.
The amount of memory requested for the AM container is set by default in the YARN configuration, in the parameter yarn.app.mapreduce.am.resource.mb. Assuming a value of 5gb for yarn.app.mapreduce.am.resource.mb, launching one medium and two tiny clusters uses 8 vcores (4 + 2 + 2) and 50.2gb of memory (31.4gb + 9.4gb + 9.4gb).
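
The arithmetic in the table and notes can be reproduced with a short shell sketch. The function name h2o_yarn_usage and its argument order are illustrative and not part of H2O or YARN. The sketch assumes one vcore per H2O node plus one for the AM, and it takes the AM memory (the value of yarn.app.mapreduce.am.resource.mb, expressed in gb) as an input because that value depends on your cluster configuration.

```shell
# Hypothetical helper (not part of H2O): estimate what one H2O cluster asks of YARN.
# Usage: h2o_yarn_usage NODES NODE_MEM_GB AM_MEM_GB [EXTRA_MEM_PERCENT]
# Prints "<vcores> <total memory in gb>".
h2o_yarn_usage() {
  awk -v nodes="$1" -v mem="$2" -v am="$3" -v pct="${4:-10}" 'BEGIN {
    vcores = nodes + 1                           # one vcore per node, plus one for the AM
    total  = nodes * mem * (1 + pct / 100) + am  # padded node containers, plus the AM container
    printf "%d %.1f\n", vcores, total
  }'
}

# One medium cluster (3 x 8gb) and two tiny clusters (1 x 4gb each), with a 5gb AM
{
  h2o_yarn_usage 3 8 5
  h2o_yarn_usage 1 4 5
  h2o_yarn_usage 1 4 5
} | awk '{ v += $1; m += $2 } END { printf "%d vcores, %.1fgb\n", v, m }'
# prints: 8 vcores, 50.2gb
```

The optional fourth argument mirrors the 10% default padding described above. For a two-node cluster with 4g per node and a 1gb AM, `h2o_yarn_usage 2 4 1 10` prints `3 9.8`, that is, 3 vcores and 9.8gb.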

You can also review the YARN resource manager for more information. Refer to the image below for an example. Note that this example assumes that the user has run the following:

hadoop jar h2odriver.jar -Dyarn.app.mapreduce.am.resource.mb=1024 -nodes 2 -mapperXmx 4g -extramempercent 10 -output outputdir

How do I specify which nodes should run H2O in a Hadoop cluster?

After creating and applying the desired node labels and associating them with specific queues as described in the Hadoop documentation, launch H2O using the following command:

hadoop jar h2odriver.jar -Dmapreduce.job.queuename=<my-h2o-queue> -nodes <num-nodes> -mapperXmx 6g -output hdfsOutputDirName

-Dmapreduce.job.queuename=<my-h2o-queue> represents the queue name
-nodes <num-nodes> represents the number of nodes
-mapperXmx 6g launches each H2O node with 6g of memory
-output hdfsOutputDirName specifies the HDFS output directory as hdfsOutputDirName

How do I import data from HDFS in R and in Flow?

To import a folder from HDFS in R:

h2o.importFolder(path, pattern = "", destination_frame = "", parse = TRUE, header = NA, sep = "", col.names = NULL, na.strings = NULL)

Here is an example that imports a single file with h2o.importFile:

pathToAirlines <- "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
airlines.hex <- h2o.importFile(path = pathToAirlines, destination_frame = "airlines.hex")

In Flow, the easiest way is to let the auto-suggestion feature in the Search: field complete the path for you. Just start typing the path to the file, starting with the top-level directory, and H2O provides a list of matching files.

Click the file to add it to the Search: field.

Why do I receive the following error when I try to save my notebook in Flow?

Error saving notebook: Error calling POST /3/NodePersistentStorage/notebook/Test%201 with opts

When you are running H2O on Hadoop, H2O tries to determine the home HDFS directory so it can use that as the download location. If the default home HDFS directory is not found, manually set the download location from the command line using the -flow_dir parameter (for example, hadoop jar h2odriver.jar <...> -flow_dir hdfs:///user/yourname/yourflowdir). You can view the default download directory in the logs by clicking Admin > View logs… and looking for the line that begins Flow dir:.

How do I access data in HDFS without launching H2O on YARN?

Each h2odriver.jar file is built for a specific Hadoop distribution. To get a working HDFS connection, download the h2odriver.jar file that matches your Hadoop distribution from the H2O download page.

Then launch the H2O application from the driver jar by specifying the classpath:

unzip h2o-3.42.0.2.zip
cd h2o-3.42.0.2
java -cp h2odriver.jar water.H2OApp

Can I configure HDFS multiple times?

Yes, you can specify multiple Hadoop configuration files at once and each provided resource will be processed.

For example, if you need to specify both core-site.xml and hdfs-site.xml you can configure both at once:

java -jar h2o.jar -hdfs_config core-site.xml -hdfs_config hdfs-site.xml
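
When there are more than a couple of configuration files, the repeated -hdfs_config flags can be assembled in a loop. The following is a minimal sketch: h2o_launch_cmd is a hypothetical helper, not part of H2O, that only prints the command line so you can inspect it before running it, and it assumes file paths without spaces.

```shell
# Hypothetical helper (not part of H2O): print a launch command with one
# -hdfs_config flag per configuration file passed as an argument.
h2o_launch_cmd() {
  cmd="java -jar h2o.jar"
  for conf in "$@"; do
    cmd="$cmd -hdfs_config $conf"
  done
  printf '%s\n' "$cmd"
}

h2o_launch_cmd core-site.xml hdfs-site.xml
# prints: java -jar h2o.jar -hdfs_config core-site.xml -hdfs_config hdfs-site.xml
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without H2O installed.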