Setting up an H2O Hadoop cluster on a Mac

Note

The following instructions should work on any reasonably modern release of OS X (10.6 and later), but they were tested only on OS X 10.9.

Installing H2O on a Mac

Prerequisites

  1. Install Xcode from the Mac App Store.
  2. Download and install Java 1.7 from Oracle’s website. NOTE: DO NOT USE JAVA 1.8.
  3. Download and install R from CRAN.

Warning

Homebrew might conflict with MacPorts if MacPorts is already installed. Proceed with the next step at your own risk, or use MacPorts instead of Homebrew.

  4. Install the Homebrew package manager by issuing the following commands in a terminal:
$ ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go/install)"
$ brew doctor
$ brew update
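
Because the Java version matters here, it can help to confirm which JVM is on the PATH before going further. The following is a minimal sketch, not part of the original instructions: the helper name is_java_17 is hypothetical, and it assumes the usual banner format printed by java -version (for example, java version "1.7.0_51").

```shell
# Return success if the given `java -version` banner reports a 1.7.x JVM.
# (is_java_17 is a hypothetical helper name used only for this sketch.)
is_java_17() {
    case "$1" in
        *'"1.7.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# `java -version` writes its banner to stderr, so redirect it before capturing.
if command -v java >/dev/null 2>&1; then
    banner=$(java -version 2>&1 | head -n 1)
    if is_java_17 "$banner"; then
        echo "Java 1.7 found: $banner"
    else
        echo "Unexpected Java version: $banner"
    fi
else
    echo "java not found on PATH"
fi
```

If the check reports an unexpected version, install Java 1.7 or adjust the PATH before continuing.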

Optional Dependencies

  1. Optional: Install Sphinx (to build the documentation)
$ sudo easy_install sphinx
$ sudo easy_install sphinxcontrib-fulltoc
  2. Optional: Download and install LaTeX for Mac
  3. Optional: Install PDFUnite, which ships with Poppler (to build some PDFs)
$ brew install poppler

Building H2O From GitHub

  1. Get H2O from GitHub

Installing Hadoop on a Mac

  1. Install Hadoop via Homebrew
$ brew install hadoop
  2. Optional: Make /usr/local/{include,lib,etc} writable, or use sudo to launch Hadoop. Note that a+w grants write access to every user on the machine, not only to you.
$ sudo chmod -R a+w /usr/local/{include,lib,etc}
  3. Configure Hadoop (adjust the file paths and version numbers to match your installation):

Note: In Hadoop 1.x these files are found in, e.g., /usr/local/Cellar/hadoop/1.2.1/libexec/conf/. In Hadoop 2.x these files are found in, e.g., /usr/local/Cellar/hadoop/2.2.0/libexec/etc/hadoop/.
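
Since the configuration directory differs between Hadoop 1.x and 2.x, a small helper can locate it instead of hard-coding the version. This is only a sketch: the function name hadoop_conf_dir is hypothetical, and it assumes the two layouts described in the note above.

```shell
# Print the first existing Hadoop configuration directory under the given
# install prefix: libexec/etc/hadoop (2.x layout) or libexec/conf (1.x layout).
# (hadoop_conf_dir is a hypothetical helper name used only for this sketch.)
hadoop_conf_dir() {
    prefix=$1
    for candidate in "$prefix/libexec/etc/hadoop" "$prefix/libexec/conf"; do
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

With a Homebrew install, it could be called as hadoop_conf_dir "$(brew --prefix hadoop)", or with an explicit path such as /usr/local/Cellar/hadoop/1.2.1.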

Modify core-site.xml to contain the following:

<configuration>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://localhost:8020</value>
        </property>
</configuration>

Modify mapred-site.xml to contain the following (NOTE: you may need to create the file from mapred-site.xml.template):

<configuration>
        <property>
                <name>mapred.job.tracker</name>
                <value>localhost:9001</value>
        </property>
        <property>
                <name>mapred.tasktracker.map.tasks.maximum</name>
                <value>5</value>
        </property>
</configuration>
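
If mapred-site.xml does not exist yet, it can be created from the template before editing. The sketch below assumes the template sits in the same configuration directory; the function name ensure_mapred_site is hypothetical.

```shell
# Copy mapred-site.xml.template to mapred-site.xml inside the given directory,
# but only when the target does not exist yet, so existing settings are kept.
# Succeeds if mapred-site.xml exists afterwards.
# (ensure_mapred_site is a hypothetical helper name used only for this sketch.)
ensure_mapred_site() {
    conf_dir=$1
    if [ ! -f "$conf_dir/mapred-site.xml" ] && [ -f "$conf_dir/mapred-site.xml.template" ]; then
        cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
    fi
    [ -f "$conf_dir/mapred-site.xml" ]
}
```

Call it with the configuration directory of your installation, then edit the resulting file as shown above.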

Modify hdfs-site.xml to contain the following:

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>
  4. Optional: Enable password-less SSH from localhost to localhost for convenience.

First enable Remote Login in the Sharing pane of System Preferences, and then:

$ brew install ssh-copy-id
$ ssh-keygen
$ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
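
To confirm that the key was installed, one option is to attempt a non-interactive login. This is a sketch rather than part of the original procedure: the function name passwordless_ssh_ok is hypothetical, and it relies on the OpenSSH BatchMode option, which makes ssh fail instead of prompting for a password.

```shell
# Succeed if a non-interactive SSH login to localhost works.
# BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
# (passwordless_ssh_ok is a hypothetical helper name used only for this sketch.)
passwordless_ssh_ok() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true
}
```

Running passwordless_ssh_ok && echo "password-less SSH works" should print the message once the setup above is complete.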
  5. Start the Hadoop services. For Hadoop 1.x, e.g.:
$ /usr/local/Cellar/hadoop/1.2.1/bin/start-all.sh

or, for Hadoop 2.x, e.g.:

$ /usr/local/Cellar/hadoop/2.2.0/sbin/start-dfs.sh
$ /usr/local/Cellar/hadoop/2.2.0/sbin/start-yarn.sh
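  6. Verify that Hadoop is up and running by checking the output of jps. With Hadoop 1.x, look for NameNode, DataNode, JobTracker, and TaskTracker; with Hadoop 2.x, ResourceManager and NodeManager take the place of JobTracker and TaskTracker. Sample output from Hadoop 1.x:
$ jps
81829 JobTracker
81556 NameNode
81756 SecondaryNameNode
9382 Jps
81655 DataNode
81928 TaskTracker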
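
The jps check can also be scripted. The sketch below is an illustration under the assumption that jps prints one "<pid> <name>" pair per line, as in the sample output; the function name all_daemons_running is hypothetical.

```shell
# Succeed only if every daemon name given after the first argument appears
# as a process name in the jps output passed as the first argument.
# (all_daemons_running is a hypothetical helper name used only for this sketch.)
all_daemons_running() {
    jps_output=$1
    shift
    for daemon in "$@"; do
        # Compare the whole second field, so that SecondaryNameNode
        # is not mistaken for NameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
            return 1
        fi
    done
    return 0
}
```

For Hadoop 1.x it could be called as all_daemons_running "$(jps)" NameNode DataNode JobTracker TaskTracker; for Hadoop 2.x, pass ResourceManager and NodeManager in place of the last two names.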
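  7. Format HDFS and leave safe mode. Formatting erases any data already stored in HDFS. If NameNode is missing from the jps output on a fresh install, HDFS has probably not been formatted yet; in that case stop the services, run the format command, and start them again.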
$ hadoop namenode -format
$ hadoop dfsadmin -safemode leave

Launching H2O on Hadoop

  1. Launch a 5-node H2O Hadoop cluster (from the h2o directory), assuming you have enough free memory (more than 5 GB, since each of the 5 nodes gets a 1 GB mapper heap):
$ hadoop jar target/hadoop/h2odriver_cdh4.jar water.hadoop.h2odriver \
                                 -libjars target/h2o.jar -mapperXmx 1g -nodes 5 -output out
  2. Point your web browser to http://localhost:54321 to use the H2O web interface.
  3. Optional: Delete the output directory after shutting down H2O (on Hadoop 2.x, -rmr is deprecated in favor of -rm -r)
$ hadoop fs -rmr out
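
The memory requirement mentioned for the launch step follows from the -nodes and -mapperXmx values. As a rough sketch (the function name required_heap_gb is hypothetical, and JVM and Hadoop overhead are not counted, which is why more than the computed amount should be free):

```shell
# Lower bound on the memory needed for the launch, in GB:
# number of H2O nodes times the per-mapper heap size.
# (required_heap_gb is a hypothetical helper name used only for this sketch.)
required_heap_gb() {
    echo $(( $1 * $2 ))
}

required_heap_gb 5 1   # prints 5
```

When changing -nodes or -mapperXmx, recompute this figure and compare it with the free memory on the machine.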