Python
------

**I tried to install H2O in Python but ``pip install scikit-learn`` failed - what should I do?**

Use the following commands (prepending with ``sudo`` if necessary):

::

    easy_install pip
    pip install numpy
    brew install gcc
    pip install scipy
    pip install scikit-learn

If you still encounter errors on OS X, you may be using the system default version of Python. We recommend installing the Homebrew version of Python instead:

::

    brew install python

If you encounter errors related to missing Python packages when using H2O, refer to the following list of all required Python packages, including dependencies:

-  ``grip``
-  ``tabulate``
-  ``wheel``
-  ``jsonlite``
-  ``ipython``
-  ``numpy``
-  ``scipy``
-  ``pandas``
-  ``-U gensim``
-  ``jupyter``
-  ``-U PIL``
-  ``nltk``
-  ``beautifulsoup4``

--------------

**How do I specify a value as an enum in Python? Is there a Python equivalent of ``as.factor()`` in R?**

Use ``.asfactor()`` to specify a value as an enum.

--------------

**I received the following error when I tried to install H2O using the Python instructions on the downloads page - what should I do to resolve it?**

::

    Downloading/unpacking http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/12/Python/h2o-3.0.0.12-py2.py3-none-any.whl
      Downloading h2o-3.0.0.12-py2.py3-none-any.whl (43.1Mb): 43.1Mb downloaded
      Running setup.py egg_info for package from http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/12/Python/h2o-3.0.0.12-py2.py3-none-any.whl
        Traceback (most recent call last):
          File "<string>", line 14, in <module>
        IOError: [Errno 2] No such file or directory: '/tmp/pip-nTu3HK-build/setup.py'
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):

      File "<string>", line 14, in <module>

    IOError: [Errno 2] No such file or directory: '/tmp/pip-nTu3HK-build/setup.py'

    ---
    Command python setup.py egg_info failed with error code 1 in /tmp/pip-nTu3HK-build

With Python, there is no automatic update of installed packages, so you must upgrade
manually. Additionally, the package distribution method recently changed from ``distutils`` to ``wheel``. The following procedure should be tried first if you are having trouble installing the H2O package, particularly if error messages related to ``bdist_wheel`` or ``eggs`` display.

::

    # this gets the latest setuptools
    # see https://pip.pypa.io/en/latest/installing.html
    wget https://bootstrap.pypa.io/ez_setup.py -O - | sudo python

    # platform dependent ways of installing pip are at
    # https://pip.pypa.io/en/latest/installing.html
    # but the above should work on most linux platforms?

    # on ubuntu
    # if you already have some version of pip, you can skip this.
    sudo apt-get install python-pip

    # the package manager doesn't install the latest. upgrade to latest
    # we're not using easy_install any more, so don't care about checking that
    pip install pip --upgrade

    # I've seen pip not install to the final version ..i.e. it goes to an almost
    # final version first, then another upgrade gets it to the final version.
    # We'll cover that, and also double check the install.

    # after upgrading pip, the path name may change from /usr/bin to /usr/local/bin
    # start a new shell, just to make sure you see any path changes
    bash

    # Also: I like double checking that the install is bulletproof by reinstalling.
    # Sometimes it seems like things say they are installed, but have errors during the install.
    # Check for no errors or stack traces.
    pip install pip --upgrade --force-reinstall

    # distribute should be at the most recent now. Just in case
    # don't do --force-reinstall here, it causes an issue.
    pip install distribute --upgrade

    # Now check the versions
    pip list | egrep '(distribute|pip|setuptools)'
    distribute (0.7.3)
    pip (7.0.3)
    setuptools (17.0)

    # Re-install wheel
    pip install wheel --upgrade --force-reinstall

After completing this procedure, go to Python and use ``h2o.init()`` to start H2O in Python.

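The version check near the end of the procedure can also be run from Python itself. The following is a minimal sketch and not part of the original procedure: it assumes Python 3.8 or later (for ``importlib.metadata``), and the helper name ``installed_versions`` is hypothetical.

```python
# Sketch: report installed versions of the packaging tools checked above.
# Assumes Python 3.8+; installed_versions is a hypothetical helper name.
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("pip", "setuptools", "wheel")):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package has no installed distribution metadata.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, ver if ver is not None else "not installed")
```

A value of ``None`` for any of these packages suggests that the corresponding step of the procedure did not complete in the Python environment you are running.
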
**Notes**:

-  If you use gradlew to build the jar yourself, you have to start the jar yourself before you call ``h2o.init()``.
-  If you download the jar and the H2O package, ``h2o.init()`` works as it does in R, and you don't have to start the jar yourself.

--------------

**How should I specify the datatype during import in Python?**

Refer to the following example:

::

    import h2o
    h2o.init()

    #Let's say you want to change the second column "CAPSULE" of prostate.csv
    #to categorical. You have 3 options.

    #Option 1. Use a dictionary of column names to types.
    fr = h2o.import_file("smalldata/logreg/prostate.csv", col_types = {"CAPSULE":"Enum"})
    fr.describe()

    #Option 2. Use a list of column types.
    c_types = [None]*9
    c_types[1] = "Enum"
    fr = h2o.import_file("smalldata/logreg/prostate.csv", col_types = c_types)
    fr.describe()

    #Option 3. Use parse_setup().
    fraw = h2o.import_file("smalldata/logreg/prostate.csv", parse = False)
    fsetup = h2o.parse_setup(fraw)
    fsetup["column_types"][1] = '"Enum"'
    fr = h2o.parse_raw(fsetup)
    fr.describe()

--------------

**How do I view a list of variable importances in Python?**

Use ``model.varimp(return_list=True)`` as shown in the following example:

::

    model = h2o.gbm(y = "IsDepDelayed", x = ["Month"], training_frame = df)

    vi = model.varimp(return_list=True)

    Out[26]: [(u'Month', 69.27436828613281, 1.0, 1.0)]

--------------

**How can I get the H2O Python Client to work with third-party plotting libraries for plotting metrics outside of Flow?**

In Flow, plots are created by the H2O UI using specific RESTful commands that are issued from the UI. You can obtain the same plotting-specific data in Python and render it with a third-party library such as Pandas or Matplotlib. Every metric that H2O displays in Flow is calculated on the backend and stored for each model, so you can retrieve any metric from H2O and then use a plotting library in Python to create the graphs.

The example below shows how to plot the training and validation logloss, using Pandas both to store the data and to generate the plot. Pandas has a simplified but limited plotting API that is itself based on Matplotlib.

::

    # import pandas and matplotlib
    import pandas as pd
    import matplotlib.pyplot as plt
    # IPython/Jupyter magic: render plots inline in the notebook
    %matplotlib inline

    # get the scoring history for the model
    scoring_history = pd.DataFrame(model.scoring_history())

    # plot the validation and training logloss
    scoring_history.plot(x='number_of_trees', y = ['validation_logloss', 'training_logloss'])

--------------

**What is PySparkling? How can I use it for grid search or early stopping?**

PySparkling calls H2O Python functions for all operations on H2O data frames, so any operation available in H2O Python version 3.6.0.3 or later can be performed from PySparkling.

For help on a function within IPython Notebook, run ``H2OGridSearch?``

Here is an example of grid search in PySparkling:

::

    import h2o
    from h2o.grid.grid_search import H2OGridSearch
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    iris = h2o.import_file("/Users/nidhimehta/h2o-dev/smalldata/iris/iris.csv")

    # candidate values for each hyperparameter
    ntrees_opt = [5, 10, 15]
    max_depth_opt = [2, 3, 4]
    learn_rate_opt = [0.1, 0.2]
    hyper_parameters = {"ntrees": ntrees_opt, "max_depth":max_depth_opt, "learn_rate":learn_rate_opt}

    gs = H2OGridSearch(H2OGradientBoostingEstimator(distribution='multinomial'), hyper_parameters)
    # all columns except the last are predictors; the last column is the response
    gs.train(x=list(range(0,iris.ncol-1)), y=iris.ncol-1, training_frame=iris, nfolds=10)

    #gs.show
    print(gs.sort_by('logloss', increasing=True))

Here is an example of early stopping in PySparkling:

::

    from h2o.grid.grid_search import H2OGridSearch
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    hidden_opt = [[32,32],[32,16,8],[100]]
    l1_opt = [1e-4,1e-3]
    hyper_parameters = {"hidden":hidden_opt, "l1":l1_opt}

    model_grid = H2OGridSearch(H2ODeepLearningEstimator, hyper_params=hyper_parameters)

    model_grid.train(x=x, y=y, distribution="multinomial", epochs=1000,
                     training_frame=train, validation_frame=test,
                     score_interval=2, stopping_rounds=3,
                     stopping_tolerance=0.05,
                     stopping_metric="misclassification")

--------------

**Do you have a tutorial for grid search in Python?**

Yes, a notebook is available `here `__ that demonstrates the use of grid search in Python.
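
An exhaustive (Cartesian) grid search trains one model for every combination of the listed values, so the number of models grows multiplicatively with each added hyperparameter. As a rough illustration, the sketch below enumerates the combinations for the value lists used in the GBM grid search example above. It is plain Python and does not call H2O; ``expand_grid`` is a hypothetical helper name, not an H2O function.

```python
# Sketch: enumerate the combinations in a Cartesian hyper-parameter grid.
# Plain Python only; expand_grid is a hypothetical helper, not an H2O function.
from itertools import product


def expand_grid(hyper_parameters):
    """Return one dict per combination of the listed values."""
    names = sorted(hyper_parameters)
    value_lists = [hyper_parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same value lists as the GBM grid search example above.
hyper_parameters = {
    "ntrees": [5, 10, 15],
    "max_depth": [2, 3, 4],
    "learn_rate": [0.1, 0.2],
}

combos = expand_grid(hyper_parameters)
print(len(combos))  # 3 * 3 * 2 = 18 combinations
```

Counting the combinations before launching a grid is a quick way to judge whether the search is affordable, particularly when cross-validation multiplies the training cost of each combination.
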