congruity

Migrating classic Spark applications, which can rely on the full power and flexibility of the RDD and SparkContext APIs, to the Spark Connect compatible DataFrame API can be challenging in many ways.

The goal of this library is to provide a compatibility layer that makes it easier to adopt Spark Connect. The library is designed to be simply imported into your application; once imported, it monkey-patches the existing API to provide the legacy functionality.
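
For illustration, the patching idea can be sketched as follows. This is a minimal, hypothetical sketch assuming PySpark >= 3.4: the Spark Connect session class and createDataFrame are real PySpark names, but the shim and the patched attribute name are illustrative and do not reflect congruity's actual implementation.

# Hypothetical sketch of the monkey-patching approach, not congruity's internals.
from pyspark.sql.connect.session import SparkSession as RemoteSparkSession

def _local_to_df(self, data, schema=None):
    # Route a local collection through the DataFrame API, which Spark Connect
    # supports natively, instead of building an RDD on the driver.
    return self.createDataFrame(data, schema)

# Importing the compatibility module attaches the helper as a side effect.
RemoteSparkSession.localToDataFrame = _local_to_df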

Non-Goals

This library is not intended to be a long-term solution. The goal is to provide a compatibility layer that becomes obsolete over time. In addition, we do not aim to provide compatibility for all methods and features but only a select subset. Lastly, we do not aim to match the performance of the native RDD APIs.

Usage

Spark JVM & Spark Connect compatibility library.

pip install spark-congruity
import congruity

Example

Here is code that works on Spark JVM:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()

This code doesn't work with Spark Connect. The congruity library rearranges the code under the hood, so the old syntax works on Spark Connect clusters as well:

import congruity  # noqa: F401
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()
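
Conceptually, the patched call above is routed through the DataFrame API. A rough, hand-written equivalent (a sketch of the idea, not congruity's actual rewrite) looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
# createDataFrame is supported natively by Spark Connect and yields the same
# two-column DataFrame (_1, _2) that parallelize(data).toDF() produces.
spark.createDataFrame(data).show()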

Contributing

We very much welcome contributions to this project. The easiest way to start is to pick any of the RDD or SparkContext methods below and implement the compatibility layer. Once you have done that, open a pull request and we will review it.
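
As a purely illustrative sketch of what such a contribution might look like (the RDDShim class below is hypothetical and does not reflect congruity's internal structure), a missing method can often be emulated by delegating to the DataFrame API:

from pyspark.sql import DataFrame

class RDDShim:
    """Hypothetical RDD-like wrapper around a DataFrame, for illustration only."""

    def __init__(self, df: DataFrame):
        self._df = df

    def distinct(self) -> "RDDShim":
        # distinct is currently unsupported in the table below; the DataFrame
        # API offers a natural, Spark Connect-compatible equivalent.
        return RDDShim(self._df.distinct())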

What's supported?

RDD

RDD API Comment
aggregate ✅
aggregateByKey ❌
barrier ❌
cache ❌
cartesian ❌
checkpoint ❌
cleanShuffleDependencies ❌
coalesce ❌
cogroup ❌
collect ✅
collectAsMap ❌
collectWithJobGroup ❌
combineByKey ❌
count ✅
countApprox ❌
countByKey ❌
countByValue ❌
distinct ❌
filter ✅
first ✅
flatMap ❌
fold ✅ First version
foreach ❌
foreachPartition ❌
fullOuterJoin ❌
getCheckpointFile ❌
getNumPartitions ❌
getResourceProfile ❌
getStorageLevel ❌
glom ✅
groupBy ✅
groupByKey ✅
groupWith ❌
histogram ✅
id ❌
intersection ❌
isCheckpointed ❌
isEmpty ❌
isLocallyCheckpointed ❌
join ❌
keyBy ✅
keys ✅
leftOuterJoin ❌
localCheckpoint ❌
lookup ❌
map ✅
mapPartitions ✅ First version, based on mapInArrow.
mapPartitionsWithIndex ❌
mapPartitionsWithSplit ❌
mapValues ✅
max ✅
mean ✅
meanApprox ❌
min ✅
name ❌
partitionBy ❌
persist ❌
pipe ❌
randomSplit ❌
reduce ✅
reduceByKey ❌
repartition ❌
repartitionAndSortWithinPartitions ❌
rightOuterJoin ❌
sample ❌
sampleByKey ❌
sampleStdev ✅
sampleVariance ✅
saveAsHadoopDataset ❌
saveAsHadoopFile ❌
saveAsNewAPIHadoopDataset ❌
saveAsNewAPIHadoopFile ❌
saveAsPickleFile ❌
saveAsTextFile ❌
setName ❌
sortBy ❌
sortByKey ❌
stats ✅
stdev ✅
subtract ❌
subtractByKey ❌
sum ✅ First version.
sumApprox ❌
take ✅ Ordering might not be guaranteed in the same way as it is in RDD.
takeOrdered ❌
takeSample ❌
toDF ✅
toDebugString ❌
toLocalIterator ❌
top ❌
treeAggregate ❌
treeReduce ❌
union ❌
unpersist ❌
values ✅
variance ✅
withResources ❌
zip ❌
zipWithIndex ❌
zipWithUniqueId ❌

SparkContext

SparkContext API Comment
parallelize ✅ Does not support numSlices yet.

Limitations

  • Error handling and checking is limited right now. We try to emulate the existing behavior, but this is not always possible because the invariants are not encoded in Python but rather in the underlying Scala implementation.
  • numSlices: we don't emulate this argument of parallelize for now.
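
With these limitations in mind, the supported and unsupported forms of parallelize currently look like this (the connection URL is illustrative):

import congruity  # noqa: F401
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Supported: parallelize a local collection without a partition count.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# Not emulated yet: controlling partitioning via numSlices.
# rdd = spark.sparkContext.parallelize([1, 2, 3, 4], numSlices=4)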

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spark_congruity-0.0.1rc3.tar.gz (15.2 kB)

Uploaded Source

Built Distribution

spark_congruity-0.0.1rc3-py3-none-any.whl (16.5 kB)

Uploaded Python 3

File details

Details for the file spark_congruity-0.0.1rc3.tar.gz.

File metadata

  • Download URL: spark_congruity-0.0.1rc3.tar.gz
  • Upload date:
  • Size: 15.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for spark_congruity-0.0.1rc3.tar.gz
Algorithm Hash digest
SHA256 4b48ddbcc47edd9e99b6c237c01883f83c29829b1a8b10f3131f6a4da3f16722
MD5 f892e8c5ec766da7a97b0dce3b4882ff
BLAKE2b-256 41de7ebccc812dc4472c36fd8300eaa18d57c2e1c016cedc90c1655691f762a4

See more details on using hashes here.
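
For example, a downloaded sdist can be checked against the SHA256 digest above with a few lines of Python (the file path is an example and assumes the archive is in the current directory):

import hashlib

expected = "4b48ddbcc47edd9e99b6c237c01883f83c29829b1a8b10f3131f6a4da3f16722"
with open("spark_congruity-0.0.1rc3.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
# A mismatch indicates a corrupted or tampered download.
assert actual == expected, "SHA256 mismatch"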

File details

Details for the file spark_congruity-0.0.1rc3-py3-none-any.whl.

File metadata

File hashes

Hashes for spark_congruity-0.0.1rc3-py3-none-any.whl
Algorithm Hash digest
SHA256 7dbf13b62a8b49c3762474ddef888cf4ab58bdf3928370a2c1d73471509376cb
MD5 1666213fba1f78fae1890aba3024a0ba
BLAKE2b-256 aebd25a3e26b8f45e2a552eac01cf9a9700b7e4ee64b4a83c966999d52c9767e

See more details on using hashes here.
