Start Sparkling Water on Amazon EMR using our Terraform Template
This Terraform template supports only Sparkling Water builds for Apache Spark 2.4 and higher.
Sparkling Water comes with pre-defined Terraform templates that can be used to deploy Sparkling Water to Amazon EMR.
Before you start, you need to have Terraform installed on your machine. If you are not familiar with Terraform, we suggest reading the Terraform documentation.
The Terraform scripts for EMR are available in the release distribution in the templates/build/terraform/aws directory, or online in S3 as part of each Sparkling Water release.
Sparkling Water provides 3 templates/modules:
network module (/templates/build/terraform/aws/modules/network). This module sets up network infrastructure in your AWS account: a VPC, internet gateway, subnet, routing tables, and default DHCP settings. This module accepts the following arguments:
aws_access_key (mandatory) - access key to access AWS
aws_secret_key (mandatory) - secret key to access AWS
aws_region (optional) - AWS region. Defaults to us-east-1.
aws_availability_zone (optional) - AWS availability zone. Defaults to us-east-1e.
aws_vpc_cidr_block (optional) - VPC CIDR block. Defaults to 10.0.0.0/16.
aws_subnet_cidr_block (optional) - VPC subnet CIDR block. Defaults to 10.0.0.0/24.
emr module (/templates/build/terraform/aws/modules/emr). This module is used to start the EMR cluster on an already existing AWS network infrastructure. It sets up the correct roles, instance profiles, and security groups, and it starts EMR with the correct dependencies to run Sparkling Water. This module accepts the following arguments:
aws_access_key (mandatory) - access key to access AWS
aws_secret_key (mandatory) - secret key to access AWS
aws_ssh_public_key (optional) - public key (to be able to access EC2 instances via ssh later)
aws_vpc_id (mandatory) - ID of existing VPC
aws_subnet_id (mandatory) - ID of existing VPC subnet
aws_region (optional) - AWS region. Defaults to us-east-1.
aws_emr_version (optional) - EMR version. Defaults to emr-5.19.0.
aws_core_instance_count (optional) - Number of worker nodes. Defaults to 2.
aws_instance_type (optional) - type of EC2 instances. Defaults to m5.xlarge.
sw_version (optional) - Sparkling Water version. Defaults to 3.42.0.3-1-2.3.
jupyter_name (optional) - User name for Jupyter Notebook. Defaults to admin.
default module (/templates/build/terraform/aws). This module is a combination of the two previous modules. It starts the network infrastructure and starts EMR with Sparkling Water on top of it. This module accepts the following arguments:
aws_access_key (mandatory) - access key to access AWS
aws_secret_key (mandatory) - secret key to access AWS
aws_ssh_public_key (optional) - public key (to be able to access EC2 instances via ssh later)
aws_region (optional) - AWS region. Defaults to us-east-1.
aws_emr_version (optional) - EMR version. Defaults to emr-5.19.0.
aws_core_instance_count (optional) - Number of worker nodes. Defaults to 2.
aws_instance_type (optional) - type of EC2 instances. Defaults to m5.xlarge.
sw_version (optional) - Sparkling Water version. Defaults to 3.42.0.3-1-2.3.
jupyter_name (optional) - User name for Jupyter Notebook. Defaults to admin.
We provide these 3 templates because some users already have network infrastructure available in AWS and do not want to use a template that creates it again.
One of the optional arguments is a public key (aws_ssh_public_key). This public key is associated with the EC2 machines so that you can log in via ssh. You can create a key pair using the ssh-keygen -t rsa command.
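As a minimal sketch of the key-pair step, the commands below create a dedicated RSA key pair in the current directory. The file name sw_emr_key is a placeholder chosen for this example, and the empty passphrase is used only so that the command runs without prompting. Whether aws_ssh_public_key expects the key text or a path to the key file is not stated here, so check the variable definition in the module you use.

```shell
# Sketch: create a dedicated RSA key pair for the EMR cluster.
# "sw_emr_key" is a placeholder file name, not something the templates require.
# -N "" sets an empty passphrase so the command does not prompt;
# omit -N to be asked for a passphrase instead.
ssh-keygen -t rsa -b 4096 -f ./sw_emr_key -N "" -q

# Print the public half of the pair (the private half stays in ./sw_emr_key).
cat ./sw_emr_key.pub
```

The private key file ./sw_emr_key is the one to pass to ssh -i later, and the contents of ./sw_emr_key.pub are the public key.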
To start a template, run the following in the corresponding module directory:
terraform init
terraform apply
You will be asked to provide the mandatory variables. Please see the Terraform documentation for more information on how to set variables.
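One way to avoid the interactive prompts is to store the values in a terraform.tfvars file, which Terraform loads automatically from the working directory. The sketch below assumes the default module and uses placeholder credentials; only the variable names come from the argument lists above.

```shell
# Sketch: write variable values to terraform.tfvars in the module directory.
# All values are placeholders; replace them with your own.
cat > terraform.tfvars <<'EOF'
aws_access_key = "YOUR_AWS_ACCESS_KEY"
aws_secret_key = "YOUR_AWS_SECRET_KEY"
aws_region     = "us-east-1"
EOF

# The file holds credentials, so keep it readable by the current user only.
chmod 600 terraform.tfvars
```

With this file in place, terraform init and terraform apply run in the same directory should no longer ask for these variables. For the emr module, aws_vpc_id and aws_subnet_id would be added to the file in the same way.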
To access the Jupyter Notebook, go to the URL returned by the jupyter_notebook_url Terraform output.
You need to approve the security exception (self-signed certificate) in your browser. The only way to access the notebook is by using the link above (with the token). Password login is not enabled.
If you would like to connect to the master machine via SSH, make sure you have set the aws_ssh_public_key argument, and run:
ssh -i path/to/private.key hadoop@public_master_dns
where private.key is the private key matching the public key specified as input and public_master_dns is the public DNS name of the master node. This DNS name is printed as output after terraform apply finishes.