Set up and manage an Apache Spark cluster in EC2
The CGCloud Spark project lets you set up a functioning Apache Spark cluster in EC2 in just minutes, regardless of the number of nodes. It is a plugin to CGCloud. While Apache Spark already comes with a script called spark-ec2 that lets you build a cluster in EC2, CGCloud Spark differs from spark-ec2 in the following ways (bad news first):
Tachyon and Yarn are not yet supported.
Setup time does not scale linearly with the number of nodes. Setting up a 100-node cluster takes just as long as setting up a 10-node cluster (2-3 min, as opposed to 45 min with spark-ec2). This is made possible by baking all required software into a single AMI. All slave nodes boot up concurrently and autonomously in just a few minutes.
Unlike with spark-ec2, the cluster can be stopped and started via the EC2 API or the EC2 console, without involvement of cgcloud.
The Spark services (master and worker) run as an unprivileged user, not root as with spark-ec2. Ditto for the HDFS services (namenode, datanode and secondarynamenode).
The Spark and Hadoop services are started automatically as the instance boots up, via a regular init script.
Nodes can be added easily, simply by booting up new instances from the AMI. They will join the cluster automatically. HDFS may have to be rebalanced after that.
You can customize the AMI that cluster nodes boot from by subclassing the SparkMaster and SparkSlave classes.
CGCloud Spark uses the CGCloud Agent, which takes care of maintaining a list of authorized keypairs on each node.
CGCloud Spark is based on the official Ubuntu Trusty 14.04 LTS image, not the Amazon Linux AMI.
Prerequisites
The cgcloud-spark package requires that the cgcloud-core package and its prerequisites are present. It also depends on lxml, which requires gcc, libxml2 and libxslt.
Installation
Read the entire section before pasting any commands and ensure that all prerequisites are installed. It is recommended to install cgcloud into a virtualenv. Create a virtualenv and use pip to install cgcloud-spark:
virtualenv cgcloud
source cgcloud/bin/activate
pip install cgcloud-spark
Inside an activated virtualenv, pip does not need sudo. If you install outside of a virtualenv instead, whether to run pip with sudo depends on your Python. On OS X systems with a Python that was installed via Homebrew, you should omit sudo. You can find out if that applies to your system by running which python. If it prints /usr/local/bin/python, you are most likely using a Homebrew Python and should therefore omit sudo. If it prints /usr/bin/python, you need to run pip with sudo.
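The which python check described above can be scripted. The following is a minimal sketch, not part of cgcloud; the function name pip_prefix is hypothetical, and treating any other path (for example a virtualenv) as not needing sudo is an assumption:

```shell
# Hypothetical helper: print "sudo" if the given python path points at the
# system Python, and print nothing if it looks like a Homebrew Python.
pip_prefix() {
    case "$1" in
        /usr/local/bin/python*) ;;        # most likely Homebrew: no sudo
        /usr/bin/python*) echo sudo ;;    # system Python: run pip with sudo
        *) ;;                             # assumption: anything else needs no sudo
    esac
}

# Usage (not run here):
#   $(pip_prefix "$(which python)") pip install cgcloud-spark
```

The helper only inspects the path it is given, so it can be tried out without touching the system.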
If you get an error message during the installation of the lxml dependency, you might have to install the libxml2 and libxslt headers. On Ubuntu, for example, run:
sudo apt-get install gcc libxml2-dev libxslt-dev
On Redhat-based distros, run the following:
sudo yum install gcc libxml2-devel libxslt-devel
Be sure to configure cgcloud-core before proceeding.
Configuration
Modify your .profile or .bash_profile by adding the following line:
export CGCLOUD_PLUGINS=cgcloud.spark
Log out and back in (or, on OS X, start a new Terminal tab/window).
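If you prefer to script the profile change, the following sketch appends the line only when it is not already present. The function name enable_spark_plugin is hypothetical. The sketch assumes CGCLOUD_PLUGINS is not already set to something else in that file; if it is, edit the file by hand instead.

```shell
# Hypothetical helper: add the export line to the given profile file unless
# an identical line already exists, so repeated runs add no duplicates.
enable_spark_plugin() {
    profile="$1"
    line='export CGCLOUD_PLUGINS=cgcloud.spark'
    touch "$profile"
    # -x matches whole lines only, -F treats the line as a fixed string
    grep -qxF "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

# Usage (not run here):
#   enable_spark_plugin ~/.profile
```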
Verify the installation by running:
cgcloud list-roles
The output should include the spark-box role.
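The check can also be automated, for example in a provisioning script. This is a sketch under the assumption that list-roles prints role names separated by whitespace or newlines; the exact output format is not confirmed here, and the function name has_spark_role is hypothetical:

```shell
# Hypothetical helper: read role names on stdin and succeed only if
# spark-box appears as a whole, whitespace-delimited name.
has_spark_role() {
    grep -Eq '(^|[[:space:]])spark-box([[:space:]]|$)'
}

# Usage (not run here):
#   cgcloud list-roles | has_spark_role && echo "spark plugin loaded"
```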
Usage
Create a single t2.micro box to serve as the template for the cluster nodes:
cgcloud create spark-box -I -T
The -I switch stops the box once it is fully set up and takes an AMI of it. The -T switch terminates it after that.
Create a cluster by booting a master and the slaves from that AMI:
cgcloud create-spark-cluster -s 2 -t m3.large
This will launch a master and two slaves using the m3.large instance type.
SSH into the master:
cgcloud ssh spark-master
… or the first slave:
cgcloud ssh spark-slave -o 0
… or the second slave:
cgcloud ssh spark-slave -o 1
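For larger clusters the per-slave commands follow the same pattern. The sketch below only prints the commands rather than running them. The function name slave_ssh_commands is hypothetical, and it assumes the -o ordinals run from 0 to N-1, as the two examples above suggest:

```shell
# Hypothetical helper: print one "cgcloud ssh" command per slave.
# Ordinals are assumed to start at 0, so N slaves use 0 .. N-1.
slave_ssh_commands() {
    n="$1"
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "cgcloud ssh spark-slave -o $i"
        i=$((i + 1))
    done
}

# Usage (not run here): list the commands for the two-slave cluster above
#   slave_ssh_commands 2
```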
Interactions with Spark and HDFS should be done as the sparkbox user:
cgcloud ssh spark-master -l sparkbox
hdfs dfs -ls /
spark-shell
Otherwise you are likely to run into permission problems.