congruity
Migrating from classic Spark applications, which use the full power and flexibility of the Spark API, to only the Spark Connect compatible DataFrame API can be challenging in many ways.
The goal of this library is to provide a compatibility layer that makes it easier to adopt Spark Connect. The library is designed to be simply imported in your application and will then monkey-patch the existing API to provide the legacy functionality.
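To illustrate what monkey-patching on import means, here is a minimal, self-contained sketch in plain Python. It does not use pyspark and is not congruity's actual implementation: `Session`, `ContextShim`, `create_data_frame`, and `apply_patches` are hypothetical names chosen for illustration only.

```python
class Session:
    """Hypothetical stand-in for a session class that lacks `sparkContext`."""

    def create_data_frame(self, rows):
        # Stand-in for a DataFrame constructor; here it just copies the rows.
        return list(rows)


class ContextShim:
    """Hypothetical shim that offers a legacy-style `parallelize` method."""

    def __init__(self, session):
        self._session = session

    def parallelize(self, data):
        # Delegate the legacy call to the API the session still supports.
        return self._session.create_data_frame(data)


def apply_patches():
    # Monkey-patch: attach a new property to the existing class at runtime.
    # A compatibility library could run a step like this when it is imported.
    Session.sparkContext = property(lambda self: ContextShim(self))


had_attribute_before = hasattr(Session, "sparkContext")
apply_patches()

session = Session()
rows = session.sparkContext.parallelize([("Java", "20000"), ("Python", "100000")])
```

After `apply_patches()` runs, existing call sites that use `session.sparkContext.parallelize(...)` keep working without being rewritten, which is the general idea behind a patch-on-import compatibility layer.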
Non-Goals
This library is not intended to be a long-term solution. The goal is to provide a compatibility layer that becomes obsolete over time. In addition, we do not aim to provide compatibility for all methods and features but only a select subset. Lastly, we do not aim to achieve the same performance as using some of the native RDD APIs.
Usage
Spark JVM & Spark Connect compatibility library.
pip install spark-congruity
import congruity
Example
Here is code written against the classic Spark JVM API:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()
This code doesn't work with Spark Connect, because Spark Connect does not support sparkContext and the RDD API. The congruity library patches the API under the hood, so the old syntax works on Spark Connect clusters as well:
import congruity # noqa: F401
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
spark.sparkContext.parallelize(data).toDF()
Contributing
We very much welcome contributions to this project. The easiest way to start is to pick any of the RDD or SparkContext methods listed below and implement the compatibility layer for it. Once you have done that, open a pull request and we will review it.
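When implementing a method, it can help to first write down the semantics the compatibility layer has to reproduce. As a hedged example, the following plain-Python model sketches the documented behavior of RDD `aggregate`: `seqOp` folds the elements within each partition starting from the zero value, and `combOp` merges the per-partition results. The function name `aggregate_reference` and the explicit list-of-partitions input are assumptions made for illustration; this is not code from congruity.

```python
import copy
from functools import reduce


def aggregate_reference(partitions, zero_value, seq_op, comb_op):
    """Pure-Python model of aggregate over a list of partitions."""
    # Fold each partition independently, starting from a fresh copy of the zero value.
    per_partition = [
        reduce(seq_op, partition, copy.deepcopy(zero_value))
        for partition in partitions
    ]
    # Merge the partial results, again starting from the zero value.
    return reduce(comb_op, per_partition, copy.deepcopy(zero_value))


# Compute (sum, count) over the values 1..4 split into two partitions.
partitions = [[1, 2], [3, 4]]
total, count = aggregate_reference(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
```

A model like this can serve as an oracle in unit tests: run the same inputs through the new compatibility method and compare the results.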
What's supported?
RDD
RDD | API | Comment |
---|---|---|
aggregate | :white_check_mark: | |
aggregateByKey | :x: | |
barrier | :x: | |
cache | :x: | |
cartesian | :x: | |
checkpoint | :x: | |
cleanShuffleDependencies | :x: | |
coalesce | :x: | |
cogroup | :x: | |
collect | :white_check_mark: | |
collectAsMap | :x: | |
collectWithJobGroup | :x: | |
combineByKey | :x: | |
count | :white_check_mark: | |
countApprox | :x: | |
countByKey | :x: | |
countByValue | :x: | |
distinct | :x: | |
filter | :white_check_mark: | |
first | :white_check_mark: | |
flatMap | :x: | |
fold | :white_check_mark: | First version |
foreach | :x: | |
foreachPartition | :x: | |
fullOuterJoin | :x: | |
getCheckpointFile | :x: | |
getNumPartitions | :x: | |
getResourceProfile | :x: | |
getStorageLevel | :x: | |
glom | :white_check_mark: | |
groupBy | :white_check_mark: | |
groupByKey | :white_check_mark: | |
groupWith | :x: | |
histogram | :white_check_mark: | |
id | :x: | |
intersection | :x: | |
isCheckpointed | :x: | |
isEmpty | :x: | |
isLocallyCheckpointed | :x: | |
join | :x: | |
keyBy | :white_check_mark: | |
keys | :white_check_mark: | |
leftOuterJoin | :x: | |
localCheckpoint | :x: | |
lookup | :x: | |
map | :white_check_mark: | |
mapPartitions | :white_check_mark: | First version, based on mapInArrow. |
mapPartitionsWithIndex | :x: | |
mapPartitionsWithSplit | :x: | |
mapValues | :white_check_mark: | |
max | :white_check_mark: | |
mean | :white_check_mark: | |
meanApprox | :x: | |
min | :white_check_mark: | |
name | :x: | |
partitionBy | :x: | |
persist | :x: | |
pipe | :x: | |
randomSplit | :x: | |
reduce | :white_check_mark: | |
reduceByKey | :x: | |
repartition | :x: | |
repartitionAndSortWithinPartitions | :x: | |
rightOuterJoin | :x: | |
sample | :x: | |
sampleByKey | :x: | |
sampleStdev | :white_check_mark: | |
sampleVariance | :white_check_mark: | |
saveAsHadoopDataset | :x: | |
saveAsHadoopFile | :x: | |
saveAsNewAPIHadoopDataset | :x: | |
saveAsNewAPIHadoopFile | :x: | |
saveAsPickleFile | :x: | |
saveAsTextFile | :x: | |
setName | :x: | |
sortBy | :x: | |
sortByKey | :x: | |
stats | :white_check_mark: | |
stdev | :white_check_mark: | |
subtract | :x: | |
subtractByKey | :x: | |
sum | :white_check_mark: | First version. |
sumApprox | :x: | |
take | :white_check_mark: | Ordering might not be guaranteed in the same way as it is in RDD. |
takeOrdered | :x: | |
takeSample | :x: | |
toDF | :white_check_mark: | |
toDebugString | :x: | |
toLocalIterator | :x: | |
top | :x: | |
treeAggregate | :x: | |
treeReduce | :x: | |
union | :x: | |
unpersist | :x: | |
values | :white_check_mark: | |
variance | :white_check_mark: | |
withResources | :x: | |
zip | :x: | |
zipWithIndex | :x: | |
zipWithUniqueId | :x: |
SparkContext
SparkContext | API | Comment |
---|---|---|
parallelize | :white_check_mark: | Does not support numSlices yet. |
Limitations
- Error handling and checking are limited right now. We try to emulate the existing behavior, but this is not always possible because the invariants are not encoded in Python but rather somewhere in Scala.
- numSlices: we don't emulate this behavior for now.