Use RSparkling in Windows Environments

Prepare Spark Environment

First, please follow the tutorial Use Sparkling Water in Windows Environments. The same configuration applies to RSparkling as well.

Prepare R Environment

Please follow the RSparkling Documentation to properly set up the required R packages and environment.
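
A minimal sketch of that setup, assuming sparklyr is installed from CRAN and rsparkling from the H2O release repository, is shown below. The repository URL is an illustrative assumption; take the exact one for your Sparkling Water version from the RSparkling documentation.

# Install sparklyr from CRAN
install.packages("sparklyr")

# Install rsparkling from the H2O release repository. The URL below is an
# illustrative assumption; copy the exact one from the RSparkling documentation.
install.packages("rsparkling", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.4/11/R")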

Test the Functionality

Use the script below to check whether your RSparkling setup has any issues.

The script will check that you can:

  1. Connect to Spark
  2. Start H2O
  3. Copy an R data frame to a Spark DataFrame.

library(sparklyr)
options(rsparkling.sparklingwater.version = "2.4.11-SNAPSHOT-91") # Using Sparkling Water 2.4.11-SNAPSHOT-91
library(rsparkling)

# Establish the Spark connection
sc <- spark_connect(master = "local", version = "2.4.0")

# Create H2O Context
h2o_context(sc)

# Copy R dataset to Spark
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl
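
As an optional extra check, you can also convert the Spark DataFrame into an H2OFrame and then disconnect cleanly. This sketch uses rsparkling's as_h2o_frame and sparklyr's spark_disconnect:

# Convert the Spark DataFrame to an H2OFrame to confirm H2O can read the data
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)
head(mtcars_hf)

# Disconnect from Spark once the checks pass
spark_disconnect(sc)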

Troubleshooting

  • Error from running spark_connect

    Error in force(code) :
      Failed while connecting to sparklyr to port (8880) for sessionid (4388): Gateway in port (8880) did not respond.
      ....
      :: problems summary ::
    :::: WARNINGS
                   [NOT FOUND  ] commons-io#commons-io;2.4!commons-io.jar (16ms)
    
            ==== local-m2-cache: tried
    

    This may be caused by the Windows Firewall. The sparklyr package downloads jars from the Maven Central repository when running the spark_connect function, and the Windows Firewall can actively prevent java.exe from reaching out to the internet and downloading them.

    To fix this error, either allow java.exe through the firewall or download the missing files manually from the Maven Central repository, as sketched below.
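
    The warning above reports commons-io 2.4 as missing, so this sketch fetches exactly that artifact into the local Maven cache; adjust the artifact name, version, and path for whichever jars your own log reports as [NOT FOUND]:

    # Fetch a missing artifact into the local Maven cache (~/.m2/repository).
    # The artifact and version are taken from the [NOT FOUND] warning above.
    jar_dir <- file.path(Sys.getenv("USERPROFILE"), ".m2", "repository",
                         "commons-io", "commons-io", "2.4")
    dir.create(jar_dir, recursive = TRUE, showWarnings = FALSE)
    download.file(
      "https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.jar",
      destfile = file.path(jar_dir, "commons-io-2.4.jar"),
      mode = "wb")  # binary mode matters on Windows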

  • Error from running h2o_context

    Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2.0 (TID 13, localhost): java.lang.NullPointerException
           at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
           at org.apache.hadoop.util.Shell.runCommand(Shell.java:483)
           at org.apache.hadoop.util.Shell.run(Shell.java:456)
           at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
           at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
           at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
           at org.apache.spark.util.Utils$.fetchFile(Utils.scala:471)
    

    This happens because the HADOOP_HOME environment variable is not explicitly set. Set HADOOP_HOME to %SPARK_HOME%\tmp\hadoop or to the directory where bin\winutils.exe is located.

    Download the winutils.exe binary from the https://github.com/steveloughran/winutils repository.

    NOTE: You need to select the Hadoop version that is compatible with your Spark distribution. The Hadoop version is often encoded in the Spark download name, for example, spark-2.4.0-bin-hadoop2.7.tgz.
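
    If you prefer to set the variable from R rather than in the system settings, a minimal sketch is shown below. The C:/hadoop path is an assumption; use the directory that actually contains bin\winutils.exe.

    # Set HADOOP_HOME for the current R session before connecting to Spark.
    # "C:/hadoop" is an assumed location -- substitute your own.
    Sys.setenv(HADOOP_HOME = "C:/hadoop")
    sc <- spark_connect(master = "local", version = "2.4.0")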

  • Error from running copy_to

    Error: java.lang.reflect.InvocationTargetException
            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
            at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
            at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
    

    This happens because there are no permissions on the folder file:///tmp/hive. You can fix it by running a command in the command prompt that changes the permissions of the /tmp/hive directory so that all three user classes (Owner, Group, and Public) can Read, Write, and Execute.

    In order to change the permissions, go to the command prompt and run:

    \path\to\winutils\winutils.exe chmod 777 \tmp\hive
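
    The same command can be issued from inside R with system2; the winutils.exe path below is an assumption:

    # Run the permission fix from R ("C:/hadoop/bin/winutils.exe" is assumed)
    system2("C:/hadoop/bin/winutils.exe",
            args = c("chmod", "777", "\\tmp\\hive"))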

    You can also create a file hive-site.xml in %HADOOP_HOME%\conf and modify the location of the default Hive scratch directory (which is /tmp/hive):

    <configuration>
      <property>
        <name>hive.exec.scratchdir</name>
        <value>/Users/michal/hive/</value>
        <description>Scratch space for Hive jobs</description>
      </property>
    </configuration>
    

    In this case, do not forget to set the variable HADOOP_CONF_DIR:

    SET HADOOP_CONF_DIR=%HADOOP_HOME%\conf
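
    The R-session equivalent of the SET command above (assuming HADOOP_HOME is already set) is:

    # Point HADOOP_CONF_DIR at %HADOOP_HOME%\conf for the current R session
    Sys.setenv(HADOOP_CONF_DIR = file.path(Sys.getenv("HADOOP_HOME"), "conf"))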
    

    If the previous steps do not work, you can delete the metastore_db folder in your R working directory, as sketched below.
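
    Deleting the folder removes the local Derby metastore; Spark recreates it on the next connection:

    # Delete the local Hive metastore in the R working directory
    unlink("metastore_db", recursive = TRUE)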