A tool that upgrades your PySpark scripts to the latest Spark version, following the Spark migration guide

PySparkler

PySparkler is a tool that upgrades your PySpark scripts to the latest Spark version. It is a command-line tool that takes a PySpark script as input and outputs a script compatible with the latest Spark version. It is written in Python and uses the LibCST module to parse the input script and generate the output script.

Installation

We recommend installing PySparkler from PyPI using pipx, which installs and runs Python applications in isolated environments. To install pipx on your system, follow the pipx installation instructions. Once pipx is installed, you can install PySparkler using:

pipx install pysparkler

That's it! You are now ready to use PySparkler. To list the available commands and options, run:

pysparkler --help

Getting Started

Provide the path to the script you want to upgrade:

pysparkler upgrade --input-file /path/to/script.py

PySparkler parses the code and can take one of two kinds of action:

Code Transformations - These are modifications made to the code to make it compatible with the latest Spark version. For example, if you are upgrading from Spark 2.4 to 3.0, PySparkler alphabetically sorts the keyword arguments in the Row constructor to preserve backwards-compatible behavior. This action also adds a comment to the end of the modified statement line to indicate that the code was modified by PySparkler and to explain why.
Code Hints - Python is a dynamically typed language, so PySparkler cannot always determine with certainty whether a piece of code is eligible for a transformation. In such situations, it adds code hints to guide the end user toward an appropriate change, if one is needed. Code hints are comments added to the end of a statement line that suggest changes that may be needed to make the code compatible with the latest Spark version. For example, if you are upgrading from Spark 2.4 to 3.0 and PySparkler detects that spark.sql.execution.arrow.enabled is set to True in your code, it adds a code hint to the end of the line suggesting that you set spark.sql.execution.pandas.convertToArrowArraySafely to True if you want integer overflow or floating-point truncation to raise errors instead of being silently allowed. As this example shows, the suggestion depends on context and may not apply in every case; where it does not apply, the end user can ignore the code hint.

NOTE: PySparkler tries to keep the code formatting intact as much as possible. However, the statement lines it modifies may fail linting checks after the change. In such situations, the end user has to fix the linting errors manually.
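
To make the Row example concrete, here is a minimal sketch of that kind of transformation. It is not PySparkler's implementation: it uses the standard-library ast module instead of LibCST, so the original formatting is not preserved, and the names SortRowKwargs and upgrade_row_statement as well as the comment text are hypothetical.

```python
import ast


class SortRowKwargs(ast.NodeTransformer):
    """Sort the keyword arguments of Row(...) calls alphabetically."""

    def __init__(self):
        self.changed = False

    def visit_Call(self, node):
        self.generic_visit(node)  # handle nested calls first
        is_row = isinstance(node.func, ast.Name) and node.func.id == "Row"
        names = [kw.arg for kw in node.keywords]
        # Skip calls that use **kwargs (arg is None): their order is unknown.
        if is_row and names and None not in names and names != sorted(names):
            node.keywords = sorted(node.keywords, key=lambda kw: kw.arg)
            self.changed = True
        return node


def upgrade_row_statement(statement: str) -> str:
    """Rewrite a single statement and flag it with a trailing comment."""
    transformer = SortRowKwargs()
    tree = transformer.visit(ast.parse(statement))
    if not transformer.changed:
        return statement
    # Hypothetical comment text; the real tool's wording may differ.
    return ast.unparse(tree) + "  # Row kwargs sorted to keep Spark 2.4 field order"


print(upgrade_row_statement("person = Row(name='Ada', age=36)"))
# person = Row(age=36, name='Ada')  # Row kwargs sorted to keep Spark 2.4 field order
```

A statement whose keyword arguments are already sorted, or that is not a Row call, is returned unchanged.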

PySpark Upgrades Supported

This tool follows the Apache Spark Migration guide for PySpark to upgrade your PySpark scripts. In the latest stable version it supports the following upgrades from the migration guide:

Migration	Supported	Details
Upgrading from PySpark 3.3 to 3.4	❌	Link
Upgrading from PySpark 3.2 to 3.3	✅	Link
Upgrading from PySpark 3.1 to 3.2	✅	Link
Upgrading from PySpark 2.4 to 3.0	✅	Link
Upgrading from PySpark 2.3 to 2.4	✅	Link
Upgrading from PySpark 2.3.0 to 2.3.1 and above	✅	Link
Upgrading from PySpark 2.2 to 2.3	✅	Link
Upgrading from PySpark 2.1 to 2.2	✅	NA
Upgrading from PySpark 1.4 to 1.5	❌	Link
Upgrading from PySpark 1.0-1.2 to 1.3	❌	Link

Features Supported

The tool supports the following features:

Feature	Supported
Upgrade PySpark Python script	✅
Upgrade PySpark Jupyter Notebook	✅
Upgrade SQL	✅
Dry-run Mode	✅
Verbose Mode	✅
Customize code transformers using YAML config	✅

Upgrade PySpark Python script

The tool can upgrade a PySpark Python script. It takes the path to the script as input and upgrades it in place:

pysparkler upgrade --input-file /path/to/script.py

If you want to write the upgraded script to a different file instead, you can use the --output-file flag:

pysparkler upgrade --input-file /path/to/script.py --output-file /path/to/output.py

Upgrade PySpark Jupyter Notebook

The tool can upgrade a PySpark Jupyter Notebook to the latest Spark version. It takes the path to the notebook as input and upgrades it in place:

pysparkler upgrade --input-file /path/to/notebook.ipynb

As with Python scripts, if you want to write the upgraded notebook to a different file instead, you can use the --output-file flag:

pysparkler upgrade --input-file /path/to/notebook.ipynb --output-file /path/to/output.ipynb

To change the output kernel name in the output Jupyter notebook, you can use the --output-kernel flag:

pysparkler upgrade --input-file /path/to/notebook.ipynb --output-kernel spark33-python3

Upgrade SQL

When PySparkler encounters SQL statements in the input script, it attempts to upgrade them. However, it is not always possible to upgrade formatted-string SQL statements that contain complex expressions. In such cases, the tool leaves code hints to let users know that they need to upgrade the SQL themselves.

To facilitate this, it exposes an upgrade-sql command that users can run themselves. The steps are:

De-template the SQL.
Upgrade the de-templated SQL using pysparkler upgrade-sql. See below for details.
Re-template the upgraded SQL.
Replace the old SQL with the upgraded SQL in the input script.

To perform step 2, you can either echo the SQL statement and pipe it to the tool:

echo "SELECT * FROM table" | pysparkler upgrade-sql

or use the cat command to pipe the contents of a SQL file to the tool:

cat /path/to/sql.sql | pysparkler upgrade-sql
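
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of PySparkler: the helper names detemplate, upgrade_sql and retemplate and the __param_N__ stand-in tokens are hypothetical, the regular expression only handles simple {placeholder} fields, and the sketch assumes that pysparkler upgrade-sql writes just the upgraded SQL to standard output and leaves the stand-in tokens untouched.

```python
import re
import subprocess

# Matches simple {placeholders}; nested braces are deliberately not handled.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def detemplate(sql: str) -> tuple[str, dict[str, str]]:
    """Step 1: swap each {placeholder} for a plain stand-in token."""
    mapping: dict[str, str] = {}

    def swap(match: re.Match) -> str:
        token = f"__param_{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return PLACEHOLDER.sub(swap, sql), mapping


def upgrade_sql(sql: str) -> str:
    """Step 2: pipe the SQL to `pysparkler upgrade-sql` on stdin."""
    result = subprocess.run(
        ["pysparkler", "upgrade-sql"],
        input=sql, capture_output=True, text=True, check=True,
    )
    return result.stdout  # assumption: stdout holds only the upgraded SQL


def retemplate(sql: str, mapping: dict[str, str]) -> str:
    """Step 3: put the original placeholders back."""
    for token, original in mapping.items():
        sql = sql.replace(token, original)
    return sql


template = "SELECT * FROM {table} WHERE dt = '{ds}'"
plain, mapping = detemplate(template)
print(plain)  # SELECT * FROM __param_0__ WHERE dt = '__param_1__'
print(retemplate(plain, mapping) == template)  # True
```

With PySparkler installed, the middle step would be retemplate(upgrade_sql(plain), mapping); the result then replaces the old SQL in the script (step 4).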

Dry-Run Mode

For both of the upgrade options above, you can use the --dry-run flag to run in dry-run mode. This does not write the upgraded script; instead, it prints a unified diff of the input and output scripts so that you can inspect the changes:

pysparkler upgrade --input-file /path/to/script.py --dry-run

Verbose Mode

For both of the upgrade options above, you can use the --verbose flag to run in verbose mode. This prints the tool's input variables, the input file content, the output content, and a unified diff of the input and output content:

pysparkler --verbose upgrade --input-file /path/to/script.py
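
When several scripts need the same treatment, the command lines shown above can be assembled programmatically. The sketch below only builds and prints the commands; the helper name build_upgrade_command and the jobs folder are hypothetical, and the flag placement simply mirrors the examples in this README (--verbose before the upgrade subcommand, the other flags after it).

```python
import shlex
from pathlib import Path
from typing import Optional


def build_upgrade_command(
    input_file: str,
    output_file: Optional[str] = None,
    dry_run: bool = False,
    verbose: bool = False,
) -> list[str]:
    """Assemble a `pysparkler upgrade` command line as an argument list."""
    command = ["pysparkler"]
    if verbose:
        command.append("--verbose")  # placed before the subcommand, as above
    command += ["upgrade", "--input-file", input_file]
    if output_file:
        command += ["--output-file", output_file]
    if dry_run:
        command.append("--dry-run")
    return command


# Print a dry-run command for every script in a (hypothetical) "jobs" folder.
for script in sorted(Path("jobs").glob("*.py")):
    print(shlex.join(build_upgrade_command(str(script), dry_run=True)))
```

Each argument list can be passed to subprocess.run once the printed commands look right.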

Customize code transformers using YAML config

The tool uses a YAML config file to customize the code transformers. The config file can be passed using the --config-yaml flag:

pysparkler --config-yaml /path/to/config.yaml upgrade --input-file /path/to/script.py

The config file is a YAML file with the following structure:

pysparkler:
  dry_run: false # Whether to run in dry-run mode
  PY24-30-001: # The code transformer ID
    comment: A new comment # The overridden code hint comment to be used by the code transformer
  PY24-30-002:
    enabled: false # Disable the code transformer
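
One way to read this structure is as per-transformer overrides layered on top of defaults. The sketch below mirrors the config above as a Python dictionary and resolves the effective settings for a transformer ID. It is an interpretation, not PySparkler's code: the function name transformer_settings is hypothetical, and it assumes that a transformer is enabled and keeps its default comment unless the config says otherwise.

```python
# The YAML config above, written out as the equivalent Python dictionary.
config = {
    "pysparkler": {
        "dry_run": False,
        "PY24-30-001": {"comment": "A new comment"},
        "PY24-30-002": {"enabled": False},
    }
}


def transformer_settings(config: dict, transformer_id: str, default_comment: str) -> dict:
    """Merge the overrides for one transformer ID with assumed defaults."""
    overrides = config.get("pysparkler", {}).get(transformer_id, {})
    return {
        "enabled": overrides.get("enabled", True),  # assumption: on by default
        "comment": overrides.get("comment", default_comment),
    }


print(transformer_settings(config, "PY24-30-001", "default hint"))
# {'enabled': True, 'comment': 'A new comment'}
print(transformer_settings(config, "PY24-30-002", "default hint"))
# {'enabled': False, 'comment': 'default hint'}
```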

Contributing

For development, Poetry is used for packaging and dependency management. You can install it using:

pip install poetry

If you have an older version of pip and virtualenv, you need to update them:

pip install --upgrade virtualenv pip

Installation

To get started, you can run make install, which installs Poetry and all the dependencies of the PySparkler library. This also installs the development dependencies.

make install

If you don't want to install the development dependencies, you need to install using poetry install --only main.

If you want to install the library on the host, you can run pip3 install -e . from the project root. If you wish to use a virtual environment, you can run poetry shell; Poetry will open a virtual environment with all the dependencies set up.

IDE Setup

To set up IDEA with Poetry:

Open up the Python project in IntelliJ
Make sure that you're on latest master (that includes Poetry)
Go to File -> Project Structure (⌘;)
Go to Platform Settings -> SDKs
Click the + sign -> Add Python SDK
Select Poetry Environment from the left-hand sidebar and hit OK
Downloading all the dependencies can take some time, depending on your internet connection
Go to Project Settings -> Project
Select the Poetry SDK from the SDK dropdown, and click OK

For IDEA ≤2021 you need to install the Poetry integration as a plugin.

Now you're set using Poetry, and all the tests will run in Poetry, and you'll have syntax highlighting in the pyproject.toml to indicate stale dependencies.

Linting

pre-commit is used for autoformatting and linting:

make lint

Pre-commit automatically fixes violations such as import order and formatting. Pylint errors have to be fixed manually.

Contrary to what the name suggests, pre-commit doesn't run the checks on commit by default. If you would like it to, you can set this up by running pre-commit install.

You can bump the integrations to the latest version using pre-commit autoupdate. This will check if there is a newer version of {black,mypy,isort,...} and update the yaml.

Testing

For Python, pytest is used as the testing framework, in combination with coverage to enforce 90%+ code coverage.

make test

To pass additional arguments to pytest, you can use PYTEST_ARGS. For example, to run pytest in verbose mode:

make test PYTEST_ARGS="-v"

Architecture

Why LibCST?

LibCST is a Python library that provides a concrete syntax tree (CST) for Python code. A CST preserves even the whitespace and comments of the source code, which is important because we want to modify only the code that needs upgrading and leave the formatting alone.
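
For contrast, the short sketch below round-trips a line through the standard-library ast module, which builds an abstract rather than concrete syntax tree. It is only meant to show what a CST avoids: the regenerated code loses the original spacing and the comment.

```python
import ast

source = "total  =  price*qty   # keep this note\n"

# Parsing to an AST and regenerating the code normalises the layout
# and drops the comment, so every line of a script would be rewritten,
# not just the statements that need upgrading.
print(ast.unparse(ast.parse(source)))
# total = price * qty
```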

How does it work?

The codemod module of LibCST simplifies the process of writing a PySpark migration script, as it allows us to write small, reusable transformers and chain them together to perform a sequence of transformations.

Why Transformer Codemod? Why not Visitor?

A LibCST visitor can only inspect the syntax tree, whereas a transformer can also replace nodes, which upgrading code requires. Beyond that, the main advantage of using a Transformer is that it allows fine-grained control over the transformation process. Transformer classes can be defined to apply specific transformations to specific parts of the codebase, and multiple Transformer classes can be combined to form a chain of transformations. This is useful when dealing with complex codebases where different parts of the code require different transformations.
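
The visitor/transformer distinction and the idea of chaining can be illustrated with the standard library's analogous classes, ast.NodeVisitor and ast.NodeTransformer. This is a sketch of the pattern only: it does not use LibCST, the class names ConfKeyCollector and RenameConfKey and the helper apply_chain are hypothetical, and the configuration keys are made up.

```python
import ast


class ConfKeyCollector(ast.NodeVisitor):
    """Visitor: records the first string argument of .set(...) calls; changes nothing."""

    def __init__(self):
        self.keys = []

    def visit_Call(self, node):
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "set"
                and node.args and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            self.keys.append(node.args[0].value)
        self.generic_visit(node)


class RenameConfKey(ast.NodeTransformer):
    """Transformer: returns a replacement node for a matching string literal."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Constant(self, node):
        if node.value == self.old:
            return ast.copy_location(ast.Constant(value=self.new), node)
        return node


def apply_chain(source, transformers):
    """Run each small transformer in order over the same tree."""
    tree = ast.parse(source)
    for transformer in transformers:
        tree = transformer.visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))


code = 'spark.conf.set("old.key", "true")'  # made-up configuration key
collector = ConfKeyCollector()
collector.visit(ast.parse(code))
print(collector.keys)  # ['old.key']
chain = [RenameConfKey("old.key", "mid.key"), RenameConfKey("mid.key", "new.key")]
print(apply_chain(code, chain))  # spark.conf.set('new.key', 'true')
```

The collector can report what it found but cannot alter the code, while each transformer hands back a modified tree that the next one in the chain builds on.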

This version: 0.8.0, released Jun 27, 2023