
Set up and manage an Apache Spark cluster in EC2

Project description

The CGCloud Spark project lets you set up a fully configured Apache Spark cluster in EC2 in just minutes, regardless of the number of nodes. It is a plugin for CGCloud. Apache Spark already comes with a script called spark-ec2 that builds a cluster in EC2, but CGCloud Spark differs from spark-ec2 in the following ways (bad news first):

  • Neither Tachyon nor YARN is included.

  • Setup time does not grow with the number of nodes. Setting up a 100-node cluster takes just as long as setting up a 10-node cluster (2-3 min, as opposed to 45 min with spark-ec2). This is made possible by baking all required software into a single AMI. All slave nodes boot up concurrently and autonomously in just a few minutes.

  • Unlike with spark-ec2, the cluster can be stopped and started via the EC2 API or the EC2 console, without involvement of cgcloud.

  • The Spark services (master and worker) run as an unprivileged user, not root as with spark-ec2. Ditto for the HDFS services (namenode, datanode and secondarynamenode).

  • The Spark and Hadoop services are started automatically as the instance boots up, via a regular init script.

  • Nodes can be added easily, simply by booting up new instances from the AMI. They will join the cluster automatically. HDFS may have to be rebalanced after that.

  • You can customize the AMI that cluster nodes boot from by subclassing the SparkMaster and SparkSlave classes.

  • CGCloud Spark uses the CGCloud Agent, which maintains a list of authorized keypairs on each node.

  • CGCloud Spark is based on the official Ubuntu Trusty 14.04 LTS image, not the Amazon Linux AMI.
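
The point above about stopping and starting the cluster without cgcloud can be illustrated with the AWS CLI. This is only a sketch: the helper name cluster_power is hypothetical and not part of cgcloud, it assumes the aws command is installed and configured, and you have to look up the instance IDs of your master and slaves yourself.

```shell
# Hypothetical helper (not part of cgcloud): stop or start cluster
# instances through the AWS CLI instead of cgcloud.
# $1: "stop" or "start"; remaining arguments: EC2 instance IDs.
cluster_power() {
    action="$1"
    shift
    case "$action" in
        stop|start)
            # Runs "aws ec2 stop-instances" or "aws ec2 start-instances".
            aws ec2 "${action}-instances" --instance-ids "$@"
            ;;
        *)
            echo "usage: cluster_power stop|start INSTANCE_ID..." >&2
            return 2
            ;;
    esac
}

# Example with placeholder instance IDs:
#   cluster_power stop i-aaaaaaaa i-bbbbbbbb
#   cluster_power start i-aaaaaaaa i-bbbbbbbb
```

Because the Spark and Hadoop services start from an init script at boot, no further cgcloud step should be needed after the instances come back up.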

Prerequisites

The cgcloud-spark package requires that the cgcloud-core package and its prerequisites are present.

Installation

Read the entire section before pasting any commands and ensure that all prerequisites are installed. It is recommended to install cgcloud into a virtualenv. Create a virtualenv and use pip to install cgcloud-spark:

cd
virtualenv cgcloud
source cgcloud/bin/activate
pip install cgcloud-spark
export CGCLOUD_PLUGINS="cgcloud.spark:$CGCLOUD_PLUGINS"

If you get DistributionNotFound: No distributions matching the version for cgcloud-spark, try running pip install --pre cgcloud-spark.

Be sure to configure cgcloud-core before proceeding.

Configuration

Modify your .profile or .bash_profile by adding the following line:

export CGCLOUD_PLUGINS=cgcloud.spark
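
If CGCLOUD_PLUGINS may already list other plugins, a variant that only adds cgcloud.spark when it is missing could look like the sketch below. The function name add_cgcloud_plugin is hypothetical, and the sketch assumes the variable is a colon-separated list, as the export line in the Installation section suggests.

```shell
# Hypothetical helper (not part of cgcloud): prepend a plugin module to
# CGCLOUD_PLUGINS unless it is already listed.
# Assumption: CGCLOUD_PLUGINS is a colon-separated list of module names.
add_cgcloud_plugin() {
    plugin="$1"
    case ":${CGCLOUD_PLUGINS:-}:" in
        *":$plugin:"*)
            ;;  # already present, nothing to do
        *)
            # Only add the separating colon if the variable is non-empty.
            CGCLOUD_PLUGINS="$plugin${CGCLOUD_PLUGINS:+:$CGCLOUD_PLUGINS}"
            ;;
    esac
    export CGCLOUD_PLUGINS
}

add_cgcloud_plugin cgcloud.spark
```

Running it more than once leaves the variable unchanged, so it is safe to keep in a profile that gets sourced repeatedly.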

Log out and back in (or, on OS X, open a new Terminal tab or window) so that the change takes effect.

Verify the installation by running:

cgcloud list-roles

The output should include the spark-box role.
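
The check can also be scripted. The sketch below assumes that list-roles prints role names separated by whitespace or newlines; the function name has_spark_box_role is hypothetical.

```shell
# Hypothetical helper: reads role names on stdin and succeeds if
# spark-box is one of them.
# Assumption: role names are separated by whitespace or newlines.
has_spark_box_role() {
    grep -Eq '(^|[[:space:]])spark-box([[:space:]]|$)'
}

# Usage on a machine with cgcloud installed:
#   cgcloud list-roles | has_spark_box_role && echo "spark plugin loaded"
```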

Usage

Create a single t2.micro box to serve as the template for the cluster nodes:

cgcloud create spark-box -I -T

The -I switch stops the box once it is fully set up and takes an AMI of it. The -T switch terminates it after that.

Create a cluster by booting a master and the slaves from that AMI:

cgcloud create-spark-cluster -s 2 -t m3.large

This will launch a master and two slaves using the m3.large instance type.

SSH into the master:

cgcloud ssh spark-master

… or the first slave:

cgcloud ssh -o 0 spark-slave

… or the second slave:

cgcloud ssh -o 1 spark-slave
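
For larger clusters, the per-slave commands can be generated rather than typed. This sketch assumes that slave ordinals run from 0 to one less than the number of slaves, which is what the two examples above suggest; the function name slave_ssh_commands is hypothetical, and it only prints the commands without running them.

```shell
# Hypothetical helper: print the cgcloud command for reaching each slave.
# $1: number of slaves (the value passed to -s when creating the cluster).
# Assumption: ordinals are contiguous and start at 0.
slave_ssh_commands() {
    count="$1"
    ordinal=0
    while [ "$ordinal" -lt "$count" ]; do
        echo "cgcloud ssh -o $ordinal spark-slave"
        ordinal=$((ordinal + 1))
    done
}

# Prints one line per slave of the two-slave cluster created above.
slave_ssh_commands 2
```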

Project details

This version: 1.2

