Project description

congruity

Migrating classic Spark applications, which can use the full power and flexibility of Spark, to the Spark Connect compatible DataFrame API can be challenging in many ways.

The goal of this library is to provide a compatibility layer that makes it easier to adopt Spark Connect. The library is designed to be simply imported into your application; it then monkey-patches the existing API to provide the legacy functionality.
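
As a rough sketch of what such a monkey-patch can look like (illustrative only, not congruity's actual implementation; all helper names are hypothetical and it assumes pyspark[connect] is installed), a sparkContext shim is attached to the Spark Connect session and translates legacy RDD-style calls into DataFrame API calls:

from pyspark.sql.connect.session import SparkSession as ConnectSparkSession


class _RDDShim:
    # Hypothetical stand-in for a tiny slice of the RDD API.
    def __init__(self, spark, data):
        self._spark = spark
        self._data = list(data)

    def toDF(self, schema=None):
        # Re-express the legacy RDD call through the DataFrame API.
        return self._spark.createDataFrame(self._data, schema)


class _SparkContextShim:
    # Hypothetical replacement for the SparkContext that Spark Connect lacks.
    def __init__(self, spark):
        self._spark = spark

    def parallelize(self, data):
        return _RDDShim(self._spark, data)


# Monkey-patch: expose the shim under the attribute name legacy code expects.
ConnectSparkSession.sparkContext = property(lambda self: _SparkContextShim(self))

With a patch along these lines in place, spark.sparkContext.parallelize(data).toDF() can be served entirely through the DataFrame API.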

Non-Goals

This library is not intended to be a long-term solution. The goal is to provide a compatibility layer that becomes obsolete over time. In addition, we do not aim to provide compatibility for all methods and features but only a select subset. Lastly, we do not aim to achieve the same performance as using some of the native RDD APIs.

Usage

Spark JVM & Spark Connect compatibility library.

pip install spark-congruity
import congruity

Example

Here is code that works on Spark JVM:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()

This code doesn't work with Spark Connect. The congruity library rearranges the code under the hood, so the old syntax works on Spark Connect clusters as well:

import congruity  # noqa: F401
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()

What's supported?

RDD

RDD API  Supported  Comment
aggregate ❌
aggregateByKey ❌
barrier ❌
cache ❌
cartesian ❌
checkpoint ❌
cleanShuffleDependencies ❌
coalesce ❌
cogroup ❌
collect ✅
collectAsMap ❌
collectWithJobGroup ❌
combineByKey ❌
count ❌
countApprox ❌
countByKey ❌
countByValue ❌
distinct ❌
filter ❌
first ✅
flatMap ❌
fold ❌
foreach ❌
foreachPartition ❌
fullOuterJoin ❌
getCheckpointFile ❌
getNumPartitions ❌
getResourceProfile ❌
getStorageLevel ❌
glom ❌
groupBy ❌
groupByKey ❌
groupWith ❌
histogram ❌
id ❌
intersection ❌
isCheckpointed ❌
isEmpty ❌
isLocallyCheckpointed ❌
join ❌
keys ❌
leftOuterJoin ❌
localCheckpoint ❌
lookup ❌
map ✅
mapPartitions ❌
mapPartitionsWithIndex ❌
mapPartitionsWithSplit ❌
mapValues ❌
max ❌
mean ❌
meanApprox ❌
min ❌
name ❌
partitionBy ❌
persist ❌
pipe ❌
randomSplit ❌
reduce ❌
reduceByKey ❌
repartition ❌
repartitionAndSortWithinPartitions ❌
rightOuterJoin ❌
sample ❌
sampleByKey ❌
sampleStdev ❌
sampleVariance ❌
saveAsHadoopDataset ❌
saveAsHadoopFile ❌
saveAsNewAPIHadoopDataset ❌
saveAsNewAPIHadoopFile ❌
saveAsPickleFile ❌
saveAsTextFile ❌
setName ❌
sortBy ❌
sortByKey ❌
stats ❌
stdev ❌
subtract ❌
subtractByKey ❌
sum ❌
sumApprox ❌
take ✅ Ordering might not be guaranteed in the same way as it is in RDD (see the sketch after this table).
takeOrdered ❌
takeSample ❌
toDF ❌
toDebugString ❌
toLocalIterator ❌
top ❌
treeAggregate ❌
treeReduce ❌
union ❌
unpersist ❌
values ❌
variance ❌
withResources ❌
zip ❌
zipWithIndex ❌
zipWithUniqueId ❌
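
Based on the support table above, here is a sketch of how the currently supported RDD methods can be used through congruity (the connection URL is a placeholder; exact results depend on your cluster):

import congruity  # noqa: F401
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)  # map is supported
print(squared.collect())            # collect is supported
print(squared.first())              # first is supported
print(squared.take(3))              # take is supported; ordering may differ from classic RDDs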

SparkContext

SparkContext API  Supported  Comment
parallelize ✅ Does not support numSlices yet (see the sketch below).
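
A short sketch of the limitation noted above: omit the numSlices argument when calling parallelize through congruity.

import congruity  # noqa: F401
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Supported: no explicit partition count.
df = spark.sparkContext.parallelize([("a", 1), ("b", 2)]).toDF()
df.show()

# Not yet supported according to the table above: passing numSlices.
# spark.sparkContext.parallelize([("a", 1), ("b", 2)], numSlices=4)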

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spark_congruity-0.0.1rc1.tar.gz (11.8 kB)


Built Distribution

spark_congruity-0.0.1rc1-py3-none-any.whl (13.5 kB)


File details

Details for the file spark_congruity-0.0.1rc1.tar.gz.

File metadata

  • Download URL: spark_congruity-0.0.1rc1.tar.gz
  • Upload date:
  • Size: 11.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for spark_congruity-0.0.1rc1.tar.gz
Algorithm Hash digest
SHA256 a92206ec492667c7e7e5985c5318ab3f7055f87d93de9f637ab553fe696d6443
MD5 d57beced983bbb7b1ffd51d8b9470209
BLAKE2b-256 fba78eda35c0974271e9ea235ce1c04b71b11e7f68490692e5adb531f2495108

See more details on using hashes here.
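
If you want to verify a downloaded file against the published hashes, one way (sketched here with Python's standard library) is to compute the SHA256 digest locally and compare it with the value listed above:

import hashlib

# SHA256 value published above for the source distribution.
expected = "a92206ec492667c7e7e5985c5318ab3f7055f87d93de9f637ab553fe696d6443"

with open("spark_congruity-0.0.1rc1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else "Hash mismatch!")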

File details

Details for the file spark_congruity-0.0.1rc1-py3-none-any.whl.

File metadata

File hashes

Hashes for spark_congruity-0.0.1rc1-py3-none-any.whl
Algorithm Hash digest
SHA256 d781e2f8f2cf77672f54fe3b1478f7d3b0851977f71002e73c7d6ba411cfdfeb
MD5 2aebc93017525e7577b9d685482af1bc
BLAKE2b-256 af69d3e9f26363c9c69ac100e43f05a96413e1ea16dd267650e5c953dd26d200

See more details on using hashes here.
