.. _rsparkling_on_windows:

Use RSparkling in Windows Environments
--------------------------------------

Prepare Spark Environment
~~~~~~~~~~~~~~~~~~~~~~~~~

First, please follow the :ref:`run_on_windows` tutorial. Those configurations apply to RSparkling as well.

Prepare R Environment
~~~~~~~~~~~~~~~~~~~~~

Please follow the :ref:`rsparkling` documentation in order to properly set up the R packages and environment.

Test the Functionality
~~~~~~~~~~~~~~~~~~~~~~

Use the script below to test whether your RSparkling setup has any issues. The script checks that you can:

1. Connect to Spark.
2. Start H2O.
3. Copy an R data frame to a Spark DataFrame.

.. code:: r

    library(sparklyr)
    options(rsparkling.sparklingwater.version = "SUBST_SW_VERSION") # Using Sparkling Water SUBST_SW_VERSION
    library(rsparkling)

    # Set Spark connection
    sc <- spark_connect(master = "local", version = "SUBST_SPARK_VERSION")

    # Create H2O Context
    h2o_context(sc)

    # Copy R dataset to Spark
    library(dplyr)
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
    mtcars_tbl

Troubleshooting
~~~~~~~~~~~~~~~

- Error from running ``spark_connect``

  .. code:: java

      Error in force(code) : Failed while connecting to sparklyr to port (8880) for sessionid (4388): Gateway in port (8880) did not respond.
      ....
      :: problems summary ::
      :::: WARNINGS
          [NOT FOUND ] commons-io#commons-io;2.4!commons-io.jar (16ms)
      ==== local-m2-cache: tried

  This may be caused by the Windows Firewall. The sparklyr package downloads jars from the Maven Central repository when running ``spark_connect``, and the firewall can actively prevent ``java.exe`` from reaching out to the internet to download them. To fix this error, download the missing files manually from the Maven repository.

- Error from running ``h2o_context``

  .. code:: java

      Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2.0 (TID 13, localhost): java.lang.NullPointerException
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
          at org.apache.hadoop.util.Shell.runCommand(Shell.java:483)
          at org.apache.hadoop.util.Shell.run(Shell.java:456)
          at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
          at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
          at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
          at org.apache.spark.util.Utils$.fetchFile(Utils.scala:471)

  This happens because the ``HADOOP_HOME`` environment variable is not explicitly set. Set ``HADOOP_HOME`` to ``%SPARK_HOME%/tmp/hadoop`` or to the location where ``bin\winutils.exe`` resides. Download the ``winutils.exe`` binary from the https://github.com/steveloughran/winutils repository, as shown in the sketch below.

  **NOTE**: You need to select the correct Hadoop version, i.e. one that is compatible with your Spark distribution. The Hadoop version is often encoded in the Spark download name, for example, ``spark-SUBST_SPARK_VERSION-bin-hadoop2.7.tgz``.
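  The download and the environment variable can also be handled from inside R. Below is a minimal sketch, assuming a hypothetical target directory and the ``hadoop-2.7.1`` folder inside the winutils repository; adjust both to match your Spark distribution.

  .. code:: r

      # Hypothetical target directory -- adjust to your environment.
      hadoop_home <- "C:/spark/tmp/hadoop"
      dir.create(file.path(hadoop_home, "bin"), recursive = TRUE, showWarnings = FALSE)

      # Fetch winutils.exe; the hadoop-2.7.1 path is an assumption -- browse the
      # repository and pick the version that matches your Spark build.
      download.file(
        "https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe",
        destfile = file.path(hadoop_home, "bin", "winutils.exe"),
        mode = "wb")  # binary mode matters on Windows

      # Make HADOOP_HOME visible to the current R session before spark_connect().
      Sys.setenv(HADOOP_HOME = hadoop_home)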
- Error from running ``copy_to``

  .. code:: java

      Error: java.lang.reflect.InvocationTargetException
          at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
          at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
          at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
          at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
          at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
          at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
          at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)

  This happens because there are no permissions on the ``file:///tmp/hive`` folder. Change the permissions of the ``/tmp/hive`` directory so that all three user classes (Owner, Group, and Public) can Read, Write, and Execute. To do so, run the following in the command prompt (an R-scripted variant is sketched at the end of this page):

  ``\path\to\winutils\winutils.exe chmod 777 \tmp\hive``

  Alternatively, create a file ``hive-site.xml`` in ``%HADOOP_HOME%\conf`` and change the location of the default Hive scratch dir (which is ``/tmp/hive``):

  .. code:: xml

      <configuration>
        <property>
          <name>hive.exec.scratchdir</name>
          <value>/Users/michal/hive/</value>
          <description>Scratch space for Hive jobs</description>
        </property>
      </configuration>

  In this case, do not forget to set the ``HADOOP_CONF_DIR`` variable:

  .. code:: bash

      SET HADOOP_CONF_DIR=%HADOOP_HOME%\conf

  If none of the above works, you can delete the ``metastore_db`` folder in your R working directory.

References
~~~~~~~~~~

- :ref:`rsparkling`
- `H2O.ai website <https://www.h2o.ai>`__
- `Running Spark Applications on Windows `__
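As mentioned in the ``copy_to`` troubleshooting item above, the ``/tmp/hive`` permission fix can also be applied without leaving R. This is a minimal sketch, not an official recipe: it assumes ``winutils.exe`` already sits in ``%HADOOP_HOME%\bin`` (for example, placed there by the download sketch earlier on this page).

.. code:: r

    # Locate winutils.exe under HADOOP_HOME -- an assumption; adjust if needed.
    winutils <- file.path(Sys.getenv("HADOOP_HOME"), "bin", "winutils.exe")

    # Grant Read/Write/Execute to Owner, Group, and Public on \tmp\hive.
    system2(winutils, args = c("chmod", "777", "\\tmp\\hive"))

    # Inspect the result; the listing should show drwxrwxrwx.
    system2(winutils, args = c("ls", "\\tmp\\hive"))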