Data Sharing
------------

Sparkling Water enables transformation between different types of Spark's ``RDD`` and H2O's ``H2OFrame``, and vice versa.

Conversion Design
~~~~~~~~~~~~~~~~~

When converting from ``H2OFrame`` to ``RDD``, a wrapper is created around the ``H2OFrame`` to provide an RDD-like API. In this case, no data is duplicated; instead, the data is served directly from the underlying ``H2OFrame``.

Conversion in the opposite direction (i.e, from Spark ``RDD``/``DataFrame`` to ``H2OFrame``) requires evaluation of the data stored in the Spark ``RDD`` and then transferring that from RDD storage into ``H2OFrame``. However, data stored in ``H2OFrame`` is heavily compressed.

Exchanging the Data
~~~~~~~~~~~~~~~~~~~

The way that data is transferred between Spark and H2O differs based on the used Sparkling Water backend. (Refer to :ref:`backend` for more information about the Internal and External backends.)

In the Internal Sparkling Water Backend, Spark and H2O share the same JVM, as is depicted in the following figure.

|Data Sharing|

In the External Sparkling Water Backend, Spark and H2O are separated clusters, and data has to be sent between those clusters over the network.

.. |Data Sharing| image:: ../images/internal_backend_data_sharing.png

Memory Consideration When Converting Between Data Frames Types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When Using Sparkling Water External Backend:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have allocated the recommended memory amount to your H2O cluster (4 x the size of your dataset),
you don't need to worry about memory constraints when converting between a Spark DataFrame
and an H2OFrame; there is no collision with Spark storage.

Note: the 4 x the size of your dataset assumes your dataset is represented as a CSV.
If your dataset is represented as JSON, XML or parquet, the requirements may differ significantly.

When Using Sparkling Water Internal Backend:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the internal backend mode, H2O-3 shares the JVM with Spark executors. In this case, you will want to
allocate enough memory to run Spark transformations on your DataFrame (which means allocating a
minimum memory of your dataset and memory for those transformations), plus allocate an additional 4 x the
size of your dataset.

Note: there is data duplication when you convert between a Spark DataFrame and an
H2Oframe (though H2O uses compression tricks to help reduce the memory requirements for this
conversion); there is no data duplication when you convert between an H2OFrame and
a Spark DataFrame because Sparkling Water uses a wrapper around the H2OFrame, which
uses the RDD/DataFrame API.