Use RSparkling in Windows Environments

Prepare Spark Environment

First, please follow the Use Sparkling Water in Windows Environments tutorial. The configurations described there apply to RSparkling as well.
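
Typically, this includes pointing SPARK_HOME at your Spark distribution. A minimal sketch from the Windows command prompt (hedged: the install path below is an assumption; substitute your actual location):

REM Point SPARK_HOME at your extracted Spark distribution.
REM The path below is an assumption -- replace it with your install location.
SET SPARK_HOME=C:\spark\spark-2.3.2-bin-hadoop2.7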

Prepare R Environment

Please follow the RSparkling Documentation to properly set up the R packages and environment.
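
A minimal sketch of the package installation (hedged: the CRAN packages below are assumed to be compatible with your Spark and Sparkling Water versions; the RSparkling Documentation is the authoritative reference):

# Hedged sketch: install the R packages used by the test script below.
# Follow the RSparkling Documentation for the exact supported versions.
install.packages("sparklyr")
install.packages("h2o")
install.packages("rsparkling")
install.packages("dplyr")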

Test the Functionality

Use the script below to test whether your RSparkling setup has any issues.

The script will check that you can:

  1. Connect to Spark
  2. Start H2O
  3. Copy an R data frame to a Spark DataFrame

library(sparklyr)
options(rsparkling.sparklingwater.version = "2.3.20") # Using Sparkling Water 2.3.20
library(rsparkling)

# Create the Spark connection
sc <- spark_connect(master = "local", version = "2.3.2")

# Create H2O Context
h2o_context(sc)

# Copy R dataset to Spark
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl
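
If all three steps succeed, you can optionally convert the Spark DataFrame into an H2O Frame as an extra check. A hedged sketch using rsparkling's as_h2o_frame (the exact signature may vary between rsparkling versions):

# Optional extra check: convert the Spark DataFrame to an H2O Frame
# (as_h2o_frame comes from rsparkling; signature may vary by version)
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)
mtcars_hf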

Troubleshooting

  • Error from running spark_connect

    Error in force(code) :
      Failed while connecting to sparklyr to port (8880) for sessionid (4388): Gateway in port (8880) did not respond.
      ....
      :: problems summary ::
    :::: WARNINGS
                   [NOT FOUND  ] commons-io#commons-io;2.4!commons-io.jar (16ms)
    
            ==== local-m2-cache: tried
    

    This may be caused by the Windows Firewall, which actively prevents java.exe from reaching the internet. The sparklyr package needs to download JAR files from the Maven Central repository in order to run the spark_connect function.

    To fix this error, download the missing JAR files manually from the Maven Central repository.
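
    For example, for the commons-io artifact shown in the warning above, a hedged sketch (assuming sparklyr resolves artifacts from the local Maven cache under %USERPROFILE%\.m2, as the local-m2-cache line suggests):

    REM Hedged sketch: place the missing JAR into the local Maven cache.
    REM The artifact coordinates are taken from the warning above.
    mkdir %USERPROFILE%\.m2\repository\commons-io\commons-io\2.4
    curl -L -o %USERPROFILE%\.m2\repository\commons-io\commons-io\2.4\commons-io-2.4.jar https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.jar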

  • Error from running h2o_context

    Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2.0 (TID 13, localhost): java.lang.NullPointerException
           at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
           at org.apache.hadoop.util.Shell.runCommand(Shell.java:483)
           at org.apache.hadoop.util.Shell.run(Shell.java:456)
           at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
           at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
           at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
           at org.apache.spark.util.Utils$.fetchFile(Utils.scala:471)
    

    This is caused by the HADOOP_HOME environment variable not being explicitly set. Set HADOOP_HOME to %SPARK_HOME%\tmp\hadoop or to the directory where bin\winutils.exe is located (a combined example follows the note below).

    Download the winutils.exe binary from the https://github.com/steveloughran/winutils repository.

    NOTE: You need to select the correct Hadoop version, which must be compatible with your Spark distribution. The Hadoop version is often encoded in the Spark download name, for example, spark-2.3.2-bin-hadoop2.7.tgz.
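
    Putting it together, the fix might look like this in the command prompt (a hedged sketch: the hadoop-2.7.1 directory in the winutils repository is an assumption and must match your Spark distribution):

    REM Hedged sketch: fetch winutils.exe and point HADOOP_HOME at it.
    REM hadoop-2.7.1 is an assumed version -- pick the one matching your Spark build.
    mkdir %SPARK_HOME%\tmp\hadoop\bin
    curl -L -o %SPARK_HOME%\tmp\hadoop\bin\winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe
    SET HADOOP_HOME=%SPARK_HOME%\tmp\hadoop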

  • Error from running copy_to

    Error: java.lang.reflect.InvocationTargetException
            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
            at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
            at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
    

    This is caused by missing permissions on the folder file:///tmp/hive. To fix it, run a command in the command prompt that changes the permissions of the /tmp/hive directory so that all three classes of users (Owner, Group, and Public) can Read, Write, and Execute.

    To change the permissions, open the command prompt and run:

    \path\to\winutils\winutils.exe chmod 777 \tmp\hive

    You can also create a hive-site.xml file in %HADOOP_HOME%\conf and modify the location of the default Hive scratch directory (which is /tmp/hive):

    <configuration>
      <property>
        <name>hive.exec.scratchdir</name>
        <value>/Users/michal/hive/</value>
        <description>Scratch space for Hive jobs</description>
      </property>
    </configuration>
    

    In this case, do not forget to set the HADOOP_CONF_DIR variable:

    SET HADOOP_CONF_DIR=%HADOOP_HOME%\conf
    

    If the previous steps do not work, you can delete the metastore_db folder in your R working directory.
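
    For example, from within R (a hedged sketch, assuming metastore_db was created in your current working directory):

    # Remove the Derby metastore folder from the R working directory
    unlink("metastore_db", recursive = TRUE)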