PySparkler
PySparkler is a tool that upgrades your PySpark scripts to the latest Spark version. It is a command-line tool that takes a PySpark script as input and outputs a script compatible with the latest Spark version. It is written in Python and uses the LibCST library to parse the input script and generate the output script.
Basic Usage
Install from PyPI:
pip install pysparkler
Provide the path to the script you want to upgrade:
pysparkler upgrade --input-file /path/to/script.py
PySpark Upgrades Supported
This tool follows the Apache Spark Migration Guide for PySpark to upgrade your PySpark scripts. The latest stable version supports the following upgrades from the migration guide:
Migration | Supported | Details
---|---|---
Upgrading from PySpark 3.3 to 3.4 | ❌ | Link
Upgrading from PySpark 3.2 to 3.3 | ✅ | Link
Upgrading from PySpark 3.1 to 3.2 | ✅ | Link
Upgrading from PySpark 2.4 to 3.0 | ✅ | Link
Upgrading from PySpark 2.3 to 2.4 | ✅ | Link
Upgrading from PySpark 2.3.0 to 2.3.1 and above | ✅ | Link
Upgrading from PySpark 2.2 to 2.3 | ❌ | Link
Upgrading from PySpark 1.4 to 1.5 | ❌ | Link
Upgrading from PySpark 1.0-1.2 to 1.3 | ❌ | Link
Features Supported
The tool supports the following features:
Feature | Supported
---|---
Upgrade PySpark Python script | ✅
Upgrade PySpark Jupyter Notebook | ✅
Dry-run Mode | ✅
Verbose Mode | ✅
Upgrade PySpark Python script
The tool can upgrade a PySpark Python script. It takes the path to the script as input and upgrades it in place:
pysparkler upgrade --input-file /path/to/script.py
If you want to write the upgraded script to a different location, you can use the --output-file flag:
pysparkler upgrade --input-file /path/to/script.py --output-file /path/to/output.py
Upgrade PySpark Jupyter Notebook
The tool can upgrade a PySpark Jupyter Notebook to the latest Spark version. It takes the path to the notebook as input and upgrades it in place:
pysparkler upgrade --input-file /path/to/notebook.ipynb
As with Python scripts, if you want to write the upgraded notebook to a different location, you can use the --output-file flag:
pysparkler upgrade --input-file /path/to/notebook.ipynb --output-file /path/to/output.ipynb
To change the kernel name in the output Jupyter Notebook, you can use the --output-kernel flag:
pysparkler upgrade --input-file /path/to/notebook.ipynb --output-kernel spark33-python3
Dry-Run Mode
For both of the above upgrade options, you can use the --dry-run flag to run in dry-run mode. This will not write the upgraded script, but will print a unified diff of the input and output scripts so you can inspect the changes:
pysparkler upgrade --input-file /path/to/script.py --dry-run
Verbose Mode
For both of the above upgrade options, you can use the --verbose flag to run in verbose mode. This will print the tool's input variables, the input file content, the output content, and a unified diff of the input and output content:
pysparkler --verbose upgrade --input-file /path/to/script.py
Contributing
For development, Poetry is used for packaging and dependency management. You can install it using:
pip install poetry
If you have older versions of pip and virtualenv, you need to update them:
pip install --upgrade virtualenv pip
Installation
To get started, you can run make install, which installs Poetry and all the dependencies of the PySparkler library, including the development dependencies.
make install
If you don't want the development dependencies, install with poetry install --only main instead.
If you want to install the library on the host, you can simply run pip3 install -e . If you wish to use a virtual environment, you can run poetry shell, and Poetry will open a virtual environment with all the dependencies set up.
IDE Setup
To set up IDEA with Poetry:
- Open up the Python project in IntelliJ
- Make sure that you're on the latest master (which includes Poetry)
- Go to File -> Project Structure (⌘;)
- Go to Platform Settings -> SDKs
- Click the + sign -> Add Python SDK
- Select Poetry Environment from the left hand side bar and hit OK
- It can take some time to download all the dependencies, depending on your internet connection
- Go to Project Settings -> Project
- Select the Poetry SDK from the SDK dropdown, and click OK
For IDEA ≤2021, you need to install the Poetry integration as a plugin.
Now you're set up with Poetry: all the tests will run through Poetry, and you'll have syntax highlighting in pyproject.toml to indicate stale dependencies.
Linting
pre-commit is used for autoformatting and linting:
make lint
Pre-commit will automatically fix violations such as import order and formatting; Pylint errors you need to fix yourself.
Contrary to what the name suggests, it doesn't run the checks on commit. If that is something you'd like, you can set it up by running pre-commit install.
You can bump the integrations to the latest version using pre-commit autoupdate. This will check if there is a newer version of {black,mypy,isort,...} and update the yaml.
Testing
For Python, pytest is used as the testing framework, in combination with coverage to enforce 90%+ code coverage.
make test
To pass additional arguments to pytest, you can use PYTEST_ARGS. For example, to run pytest in verbose mode:
make test PYTEST_ARGS="-v"
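For illustration, a test in this style might parse a snippet, run a transformation over it, and assert on the emitted code. The sketch below is hypothetical and not from the PySparkler test suite:

```python
# test_example.py -- a hypothetical pytest case, not from the PySparkler repo
import libcst as cst


def upgrade(source: str) -> str:
    """Stand-in for a PySparkler transformation; here it is the identity."""
    return cst.parse_module(source).code


def test_upgrade_preserves_formatting():
    source = "x = 1  # keep this comment\n"
    assert upgrade(source) == source
```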
Architecture
Why LibCST?
LibCST is a Python library that provides a concrete syntax tree (CST) for Python code. A CST preserves even the whitespace of the source code, which is very important since we only want to modify the code and not the formatting.
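A minimal sketch of that round-trip property (the snippet and variable names are illustrative, not PySparkler code):

```python
import libcst as cst

source = "df = spark.read.csv('data.csv')  # load the raw data\n"
module = cst.parse_module(source)

# Unlike an AST, the CST retains comments and whitespace, so re-emitting
# the module reproduces the source byte-for-byte.
assert module.code == source
```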
How does it work?
Using the codemod module of LibCST can simplify the process of writing a PySpark migration script, as it allows us to write small, reusable transformers and chain them together to perform a sequence of transformations.
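As a simplified sketch (not one of PySparkler's actual transformers), a single codemod rule might rename a deprecated function, and several such rules can be applied in sequence:

```python
import libcst as cst
from libcst.codemod import CodemodContext, ContextAwareTransformer


class RenameFunction(ContextAwareTransformer):
    """Rewrites references to one function name into another."""

    def __init__(self, context: CodemodContext, old: str, new: str) -> None:
        super().__init__(context)
        self.old, self.new = old, new

    def leave_Name(self, original_node: cst.Name, updated_node: cst.Name) -> cst.Name:
        if updated_node.value == self.old:
            return updated_node.with_changes(value=self.new)
        return updated_node


context = CodemodContext()
module = cst.parse_module("result = approxCountDistinct(col)\n")

# Chain transformers by feeding each one's output into the next.
for transformer in [RenameFunction(context, "approxCountDistinct", "approx_count_distinct")]:
    module = transformer.transform_module(module)

print(module.code)  # result = approx_count_distinct(col)
```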
Why Transformer Codemod? Why not Visitor?
The main advantage of using a Transformer is that it allows for more fine-grained control over the transformation process. Transformer classes can be defined to apply specific transformations to specific parts of the codebase, and multiple Transformer classes can be combined to form a chain of transformations. This can be useful when dealing with complex codebases where different parts of the code require different transformations.
More on this can be found here.
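To make the contrast concrete, here is an illustrative comparison (again, not PySparkler code): a visitor can only observe the tree, while a transformer returns replacement nodes.

```python
import libcst as cst


class ReadCounter(cst.CSTVisitor):
    """A visitor observes the tree but cannot change it."""

    def __init__(self) -> None:
        self.reads = 0

    def visit_Attribute(self, node: cst.Attribute) -> None:
        if isinstance(node.attr, cst.Name) and node.attr.value == "read":
            self.reads += 1


class ParquetToOrc(cst.CSTTransformer):
    """A transformer returns (possibly new) nodes, so each rule stays small and composable."""

    def leave_Name(self, original_node: cst.Name, updated_node: cst.Name) -> cst.Name:
        if updated_node.value == "parquet":
            return updated_node.with_changes(value="orc")  # purely illustrative rewrite
        return updated_node


module = cst.parse_module("df = spark.read.parquet('t')\n")
counter = ReadCounter()
module.visit(counter)                     # read-only traversal
rewritten = module.visit(ParquetToOrc())  # produces a new, rewritten tree
print(counter.reads)   # 1
print(rewritten.code)  # df = spark.read.orc('t')
```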