spark-emr

Run python packages on AWS EMR

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

# Spark-EMR

[![Build Status](https://api.travis-ci.org/delijati/spark-emr.svg?branch=master)](https://travis-ci.org/delijati/spark-emr)

Run a Python package on AWS EMR.

## Install

Develop install:

    $ pip install -e .

Testing:

    $ pip install tox
    $ tox

## Setup

The easiest way to get EMR up and running is to go through the AWS web
interface: create an SSH key and start a cluster by hand. This creates the
needed subnet and EMR roles.

## Config yaml file

Create a `config.yaml` per project, or put a default configuration in
`~/.config/spark-emr.yaml`:

    bootstrap_uri: s3://foo/bar
    master:
      instance_type: m4.large
      size_in_gb: 100
    core:
      instance_type: m4.large
      instance_count: 2
      size_in_gb: 100
    ssh_key: XXXXX
    subnet_id: subnet-XXXXXX
    python_version: python36
    emr_version: emr-5.20.0
    consistent: false
    optimization: false
    region: eu-central-1
    job_flow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
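
The lookup between a per-project file and the default file can be sketched as follows. This is only an illustration: the function name `resolve_config_path` and the lookup order (explicit path, then `config.yaml` in the project directory, then the default file) are assumptions, not spark-emr's actual implementation.

```python
import os
from pathlib import Path

# Default location named in this README.
DEFAULT_CONFIG = Path(os.path.expanduser("~/.config/spark-emr.yaml"))


def resolve_config_path(explicit=None, project_dir=".", default=DEFAULT_CONFIG):
    """Return the config file to use (hypothetical lookup order).

    An explicitly given path wins; otherwise the project's config.yaml is
    tried first and the default file second.
    """
    if explicit is not None:
        candidates = [Path(explicit)]
    else:
        candidates = [Path(project_dir) / "config.yaml", Path(default)]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError("no config file found, tried: " + tried)
```

Calling `resolve_config_path()` in a project directory would then return the project's `config.yaml` if it exists and fall back to the default file otherwise.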

## CLI-Interface

### Start

To run Python code on EMR you need to build a proper Python package, i.e. one
with a `setup.py` that defines `console_scripts`. The script name needs to end
in `.py`, or YARN won't be able to execute it |-(
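
A minimal sketch of such a package definition, assuming a hypothetical package `etl_pypackage` whose module `etl_pypackage/cli.py` provides a `main()` function. These names and the version number are illustrative only; the point is that the console script itself is named `etl.py`.

```python
# Contents of a hypothetical setup.py, kept as a plain dict so it can be
# inspected without running setuptools.
SETUP_KWARGS = dict(
    name="etl_pypackage",
    version="0.1.0",
    packages=["etl_pypackage"],
    entry_points={
        "console_scripts": [
            # The installed script is called "etl.py" (note the suffix),
            # and it calls main() in etl_pypackage/cli.py.
            "etl.py = etl_pypackage.cli:main",
        ]
    },
)


def console_script_names(kwargs):
    """Return the names of the scripts that would be installed."""
    specs = kwargs["entry_points"]["console_scripts"]
    return [spec.split("=", 1)[0].strip() for spec in specs]


# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```

With this layout, `console_script_names(SETUP_KWARGS)` lists the script names, which makes it easy to check that each one ends in `.py`.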

Bootstrap a cluster, install the Python package, execute the task given in
`--cmdline`, poll the cluster until it has finished, then stop the cluster:

    $ spark-emr start \
        [--config config.yaml] \
        --name "Spark-ETL" \
        --bid-master 0.04 \
        --bid-core 0.04 \
        --cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
        --tags foo 2 bar 4 \
        --poll \
        --yarn-log \
        --package "../"
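
When the job is launched from a script rather than by hand, the same call can be assembled in Python. The sketch below uses only the flags shown above; the helper name `build_start_command` is hypothetical, and treating the bid prices, `--poll` and `--yarn-log` as optional is an assumption.

```python
def build_start_command(name, cmdline, package, tags=None, config=None,
                        bid_master=None, bid_core=None,
                        poll=True, yarn_log=True):
    """Assemble the argument list for `spark-emr start`."""
    argv = ["spark-emr", "start"]
    if config is not None:
        argv += ["--config", config]
    argv += ["--name", name]
    if bid_master is not None:
        argv += ["--bid-master", str(bid_master)]
    if bid_core is not None:
        argv += ["--bid-core", str(bid_core)]
    # The whole job command line is passed as one argument.
    argv += ["--cmdline", cmdline]
    if tags:
        # Tags are passed as alternating key/value words: foo 2 bar 4
        argv.append("--tags")
        for key, value in tags.items():
            argv += [str(key), str(value)]
    if poll:
        argv.append("--poll")
    if yarn_log:
        argv.append("--yarn-log")
    argv += ["--package", package]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting problems with the `--cmdline` value.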

Running with a released pypackage version (pip):

    $ spark-emr start \
        ... \
        --package pip+etl_pypackage

### Status

Returns the status of a cluster (also terminated ones):

    $ spark-emr status --cluster-id j-XXXXX

### List

List all clusters, optionally filtered by tag:

    $ spark-emr list [--config config.yaml] [--filter somekey somevalue]
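
What filtering by tag amounts to can be sketched in a few lines. This is not spark-emr's own code: the helper name is hypothetical, and it assumes each cluster is a dict with a `Tags` list of `{"Key": ..., "Value": ...}` entries, the shape boto3 uses in its EMR `describe_cluster` response.

```python
def filter_clusters_by_tag(clusters, key, value):
    """Keep only the clusters that carry the tag key=value."""

    def has_tag(cluster):
        return any(
            tag.get("Key") == key and tag.get("Value") == value
            for tag in cluster.get("Tags", [])
        )

    return [cluster for cluster in clusters if has_tag(cluster)]
```

Tag values are strings, so a cluster started with `--tags foo 2` would match `filter_clusters_by_tag(clusters, "foo", "2")`.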

### Stop

Stop a running cluster:

    $ spark-emr stop --cluster-id j-XXXXX

### Spot price check

Returns the spot price of the configured instance types for all regions:

    $ spark-emr spot

# Appendix

### Running commands on EMR

The generated `spark-submit` command can also be run directly on the master node:

    $ /usr/bin/spark-submit \
        --deploy-mode cluster \
        --master yarn \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
        --conf spark.executorEnv.PYSPARK_PYTHON=python35 \
        /usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv
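
How a `--cmdline` value relates to the command above can be sketched as follows. The function `build_spark_submit` and its parameters are hypothetical and only reproduce the shape of the command shown here; spark-emr may build it differently.

```python
import shlex


def build_spark_submit(cmdline, python_version="python35",
                       script_dir="/usr/local/bin"):
    """Turn a job command line into a spark-submit argument list."""
    # Split "etl.py --input ..." into the script name and its arguments.
    script, *args = shlex.split(cmdline)
    return [
        "/usr/bin/spark-submit",
        "--deploy-mode", "cluster",
        "--master", "yarn",
        # Use the same Python interpreter on the application master
        # and on the executors.
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_version}",
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python_version}",
        # The console script is assumed to be installed in script_dir.
        f"{script_dir}/{script}",
        *args,
    ]
```

`shlex.join(build_spark_submit(...))` (Python 3.8+) gives a single string that can be pasted into a shell on the master node.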

### Running commands on docker

To test whether our Spark job runs as expected, we can run it locally in Docker:

    $ git clone https://github.com/delijati/spark-docker
    $ cd spark-docker
    $ docker build . --pull -t spark

Now we can run our Spark job locally:

    $ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark \
        bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"

# CHANGES

0.1.2 (2019-03-10)
------------------

- Add spot price check cli.
- Add spot BidPrice.
- Show estimated cost.
- Filter by tag for list cli.

0.1.1 (2019-02-21)
------------------

- Fixed url in setup.py.

0.1.0 (2019-02-21)
------------------

- Initial release.

Download files


Source Distribution

spark_emr-0.1.2.tar.gz (13.4 kB)

Uploaded Mar 10, 2019 Source

Built Distribution

spark_emr-0.1.2-py2.py3-none-any.whl (18.7 kB)

Uploaded Mar 10, 2019 Python 2 Python 3

File details

Details for the file spark_emr-0.1.2.tar.gz.

File metadata

Download URL: spark_emr-0.1.2.tar.gz
Upload date: Mar 10, 2019
Size: 13.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for spark_emr-0.1.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `9dc61d3f3b25b311d3e86e294062fd5145a6e4274f8668ef8c3c891de798a208` |
| MD5 | `43904396dfaf247d054903b53f134805` |
| BLAKE2b-256 | `08a8795758983d206c610cb69de875a8a56f42f1a069b4a4ec5043f5275c3634` |


File details

Details for the file spark_emr-0.1.2-py2.py3-none-any.whl.

File metadata

Download URL: spark_emr-0.1.2-py2.py3-none-any.whl
Upload date: Mar 10, 2019
Size: 18.7 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.31.1 CPython/3.6.6

File hashes

Hashes for spark_emr-0.1.2-py2.py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `b3c3f127ed2df309b89115f8aaef9c2c66b89dc311f8cf00d3323d403d7b8f5d` |
| MD5 | `fcece6b7d068dd960f353e3a27bdb146` |
| BLAKE2b-256 | `99ff672e1bf5d3da274ab38bf816f10dd47635bf6b56a6a037efb1424d1b5dae` |