Use RSparkling in Windows Environments¶
Prepare Spark Environment¶
First, please follow the Use Sparkling Water in Windows Environments tutorial. The configuration applies to RSparkling as well.
Prepare R Environment¶
Please follow the RSparkling Documentation to properly set up the R packages and environment.
Test the Functionality¶
Use the script below to verify that RSparkling works correctly.
The script will check that you can:
- Connect to Spark
- Start H2O
- Copy an R data frame to a Spark DataFrame
library(sparklyr)
options(rsparkling.sparklingwater.version = "2.4.11-SNAPSHOT-91") # Using Sparkling Water 2.4.11-SNAPSHOT-91
library(rsparkling)
# Set spark connection
sc <- spark_connect(master = "local", version = "2.4.0")
# Create H2O Context
h2o_context(sc)
# Copy R dataset to Spark
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl
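As an optional final check, you can push the copied Spark DataFrame into H2O. The following is a minimal sketch, assuming the rsparkling helper as_h2o_frame() is available in your rsparkling version; sc and mtcars_tbl come from the script above.
# Convert the Spark DataFrame into an H2O frame
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)
# Print the dimensions to confirm the data arrived in H2O
dim(mtcars_hf)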
Troubleshooting¶
Error from running spark_connect
Error in force(code) : Failed while connecting to sparklyr to port (8880) for sessionid (4388): Gateway in port (8880) did not respond.
....
:: problems summary ::
:::: WARNINGS
        [NOT FOUND ] commons-io#commons-io;2.4!commons-io.jar (16ms)
==== local-m2-cache: tried
This may be caused by the Windows Firewall. The sparklyr package downloads JARs from the Maven Central repository when running the spark_connect function, and the Windows Firewall can actively prevent java.exe from reaching the internet and downloading them. To fix this error, download the files manually from the Maven repository.
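If you want to work around the firewall without leaving R, the missing JAR can be fetched with download.file(). This is a minimal sketch, assuming the missing artifact is commons-io 2.4 (as in the warning above) and that your local Maven cache is at the default %USERPROFILE%\.m2\repository location; adjust the coordinates for any other missing artifact.
# Hypothetical example: place commons-io 2.4 into the local Maven cache manually
m2_dir <- file.path(Sys.getenv("USERPROFILE"), ".m2", "repository",
                    "commons-io", "commons-io", "2.4")
dir.create(m2_dir, recursive = TRUE, showWarnings = FALSE)
download.file(
  "https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.jar",
  destfile = file.path(m2_dir, "commons-io-2.4.jar"),
  mode = "wb")  # binary mode is required on Windows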
Error from running h2o_context
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 1 times, most recent failure: Lost task 3.0 in stage 2.0 (TID 13, localhost): java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:483)
        at org.apache.hadoop.util.Shell.run(Shell.java:456)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:471)
This is caused because the HADOOP_HOME environment variable is not explicitly set. Set HADOOP_HOME to %SPARK_HOME%/tmp/hadoop or to the location where bin\winutils.exe is located. Download the winutils.exe binary from the https://github.com/steveloughran/winutils repository.
NOTE: You need to select the Hadoop version that is compatible with your Spark distribution. The Hadoop version is often encoded in the Spark download name, for example, spark-2.4.0-bin-hadoop2.7.tgz.
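As a sketch, you can also set the variable from your R session before calling spark_connect; the path below is a placeholder and must point at the directory that contains bin\winutils.exe.
# Placeholder path -- replace with your own Hadoop/winutils location
Sys.setenv(HADOOP_HOME = "C:/path/to/hadoop")
# Quick sanity check that winutils.exe is where Spark expects it
file.exists(file.path(Sys.getenv("HADOOP_HOME"), "bin", "winutils.exe"))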
Error from running copy_to
Error: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
        at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
        at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
This is caused by missing permissions on the folder file:///tmp/hive. You can fix it by running a command in the command prompt that changes the permissions of the /tmp/hive directory so that all three users (Owner, Group, and Public) can Read, Write, and Execute. To change the permissions, open the command prompt and run:
\path\to\winutils\Winutils.exe chmod 777 \tmp\hive
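If you prefer to stay in R, the same command can be issued with system2(); the winutils path below is a placeholder.
# Run winutils from R -- replace the path with your winutils.exe location
system2("C:\\path\\to\\winutils\\winutils.exe", args = c("chmod", "777", "\\tmp\\hive"))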
You can also create a file hive-site.xml in %HADOOP_HOME%\conf and modify the location of the default Hive scratch dir (which is /tmp/hive):
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/Users/michal/hive/</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
In this case, do not forget to set the HADOOP_CONF_DIR variable:
SET HADOOP_CONF_DIR=%HADOOP_HOME%\conf
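If you drive everything from R, the same variable can be set in the session before connecting; a minimal sketch, assuming HADOOP_HOME points at the directory that contains the conf folder.
# Point Hadoop at the conf directory that contains hive-site.xml
Sys.setenv(HADOOP_CONF_DIR = file.path(Sys.getenv("HADOOP_HOME"), "conf"))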
If the previous steps do not work, you can delete the metastore_db folder in your R working directory.