
# Spark-EMR

[![Build Status](https://api.travis-ci.org/delijati/spark-emr.svg?branch=master)](https://travis-ci.org/delijati/spark-emr)

Run a Python package on AWS EMR.

## Install

Development install:

    $ pip install -e .

Testing:

    $ pip install tox
    $ tox

## Setup

The easiest way to get EMR up and running is to go through the web interface:
create an SSH key and start a cluster by hand. This creates the SSH key,
subnet, and default EMR roles that the config below refers to.

## Config YAML file

Create a `config.yaml` per project, or place a default at
`~/.config/spark-emr.yaml`:

    bootstrap_uri: s3://foo/bar
    master:
      instance_type: m4.large
      size_in_gb: 100
    core:
      instance_type: m4.large
      instance_count: 2
      size_in_gb: 100
    ssh_key: XXXXX
    subnet_id: subnet-XXXXXX
    python_version: python36
    emr_version: emr-5.20.0
    consistent: false
    optimization: false
    region: eu-central-1
    job_flow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole

## CLI interface

### Start

To run Python code on EMR you need to build a proper Python package, i.e. one
with a `setup.py` that defines `console_scripts` entry points. The script
name must end in `.py`, or YARN won't be able to execute it.
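
For illustration, a minimal sketch of such a `setup.py`; the `etl` package
name and `etl.main:main` entry point are hypothetical, chosen only to match
the `etl.py` command used in the examples below:

    # Hypothetical minimal setup.py; the package and module names are
    # illustrative, not taken from this repository.
    from setuptools import find_packages, setup

    setup(
        name="etl",
        version="0.1.0",
        packages=find_packages(),
        entry_points={
            # The generated script name MUST end in ".py",
            # otherwise YARN cannot execute it.
            "console_scripts": ["etl.py = etl.main:main"],
        },
    )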

Bootstrap a cluster, install the Python package, execute the given command
line, poll the cluster until it finishes, and stop the cluster:

    $ spark-emr start \
        [--config config.yaml] \
        --name "Spark-ETL" \
        --bid-master 0.04 \
        --bid-core 0.04 \
        --cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
        --tags foo 2 bar 4 \
        --poll \
        --yarn-log \
        --package "../"

Running with a released package version from pip:

    $ spark-emr start \
        ... \
        --package pip+etl_pypackage

### Status

Returns the status of a cluster (including terminated ones):

    $ spark-emr status --cluster-id j-XXXXX

### List

List all clusters, optionally filtered by tag:

    $ spark-emr list [--config config.yaml] [--filter somekey somevalue]

### Stop

Stop a running cluster:

    $ spark-emr stop --cluster-id j-XXXXX

### Spot price check

Returns the current spot price for the configured instance types in all
regions:

    $ spark-emr spot
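
For reference, spot prices come from the EC2 `DescribeSpotPriceHistory` API.
A minimal sketch of such a lookup with boto3 (this shows the underlying AWS
API, not necessarily how spark-emr implements the command; region and
instance type are example values from the config above):

    # Sketch of a spot price lookup via boto3 (the underlying EC2 API);
    # not necessarily spark-emr's actual implementation.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-central-1")
    response = ec2.describe_spot_price_history(
        InstanceTypes=["m4.large"],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=1,  # most recent price point only
    )
    for item in response["SpotPriceHistory"]:
        print(item["AvailabilityZone"], item["InstanceType"], item["SpotPrice"])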

## Appendix

### Running commands on EMR

The generated command can also be run directly on the master node:

    $ /usr/bin/spark-submit \
        --deploy-mode cluster \
        --master yarn \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
        --conf spark.executorEnv.PYSPARK_PYTHON=python35 \
        /usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv

### Running commands in Docker

To test whether our Spark job runs as expected, we can run it locally in
Docker:

    $ git clone https://github.com/delijati/spark-docker
    $ cd spark-docker
    $ docker build . --pull -t spark

Now we can run our Spark job locally:

    $ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark \
        bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"
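
What such a test entry point contains is up to the package. A hypothetical
minimal sketch of a script like `spark_emr_dummy.py` (the real script in
`spark-docker` may differ; here the integer argument is assumed to set the
number of partitions):

    # Hypothetical sketch of a minimal PySpark entry point such as
    # spark_emr_dummy.py; the actual test script may differ.
    import sys

    from pyspark.sql import SparkSession


    def main():
        # Assumption: the single CLI argument sets the partition count.
        partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
        spark = SparkSession.builder.appName("spark-emr-dummy").getOrCreate()
        # Trivial job: sum the squares of 0..99 across the partitions.
        rdd = spark.sparkContext.parallelize(range(100), partitions)
        print("sum of squares:", rdd.map(lambda x: x * x).sum())
        spark.stop()


    if __name__ == "__main__":
        main()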


# CHANGES

0.1.2 (2019-03-10)
------------------

- Add spot price check CLI.
- Add spot BidPrice.
- Show estimated cost.
- Filter by tag for list CLI.


0.1.1 (2019-02-21)
------------------

- Fixed url in setup.py.


0.1.0 (2019-02-21)
------------------

- Initial release.

