Pyspark helper methods to maximize developer efficiency
Quinn
Pyspark helper methods to maximize developer productivity.
Quinn validates DataFrames, extends core classes, defines DataFrame transformations, and provides SQL functions.
Setup
Quinn is uploaded to PyPi and can be installed with this command:
pip install quinn
Pyspark Core Class Extensions
from quinn.extensions import *
Column Extensions
isFalsy()
source_df.withColumn("is_stuff_falsy", F.col("has_stuff").isFalsy())
Returns True if has_stuff is None or False.
isTruthy()
source_df.withColumn("is_stuff_truthy", F.col("has_stuff").isTruthy())
Returns True unless has_stuff is None or False.
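The two predicates are complements of each other. The rules can be restated in plain Python as a sketch of the semantics only. This is not Quinn's implementation: is_falsy and is_truthy are illustrative names that operate on ordinary Python values rather than Spark columns.

```python
def is_falsy(value):
    """True when the value is None or False, mirroring the isFalsy() description."""
    return value is None or value is False


def is_truthy(value):
    """The complement of is_falsy(), mirroring the isTruthy() description."""
    return not is_falsy(value)


# Each tuple is (value, is_falsy, is_truthy) for one possible has_stuff value.
for value in [True, False, None]:
    print(value, is_falsy(value), is_truthy(value))
```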
isNullOrBlank()
source_df.withColumn("is_blah_null_or_blank", F.col("blah").isNullOrBlank())
Returns True if blah is null or blank (the empty string or a string that only contains whitespace).
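The same rule for a single Python string might look like the sketch below. The name is_null_or_blank is illustrative and the function works on plain values, not on a Spark column.

```python
def is_null_or_blank(value):
    """True for None, the empty string, or a whitespace-only string."""
    return value is None or value.strip() == ""


for value in [None, "", "   ", "hi"]:
    print(repr(value), is_null_or_blank(value))
```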
isNotIn()
source_df.withColumn("is_not_bobs_hobby", F.col("fun_thing").isNotIn(bobs_hobbies))
Returns True if fun_thing is not included in the bobs_hobbies list.
nullBetween()
source_df.withColumn("is_between", F.col("age").nullBetween(F.col("lower_age"), F.col("upper_age")))
Returns True if age is between lower_age and upper_age. If lower_age is populated and upper_age is null, it will return True if age is greater than or equal to lower_age. If lower_age is null and upper_age is populated, it will return True if age is lower than or equal to upper_age.
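These branching rules are easier to check when written out as a plain-Python function. The sketch below is not Quinn's code: null_between is an illustrative name, None stands in for null, and the bounds are treated as inclusive. The text does not say what happens when both bounds are null or when age itself is null, so returning False in those cases is an assumption.

```python
def null_between(value, lower, upper):
    """Plain-Python restatement of the nullBetween() rules, with None as null."""
    if value is None:
        return False  # assumption: a null value is never "between"
    if lower is None and upper is None:
        return False  # assumption: no bounds at all means no match
    if upper is None:
        return value >= lower  # only the lower bound is populated
    if lower is None:
        return value <= upper  # only the upper bound is populated
    return lower <= value <= upper  # both bounds populated, inclusive


print(null_between(5, 1, 10))    # both bounds populated
print(null_between(5, 1, None))  # open upper bound
print(null_between(5, None, 3))  # open lower bound
```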
SparkSession Extensions
create_df()
spark.create_df(
[("jose", "a"), ("li", "b"), ("sam", "c")],
[("name", StringType(), True), ("blah", StringType(), True)]
)
Creates a DataFrame with a syntax that's less verbose than the built-in createDataFrame method.
DataFrame Extensions
transform()
source_df\
.transform(lambda df: with_greeting(df))\
.transform(lambda df: with_something(df, "crazy"))
Allows multiple custom DataFrame transformations to be chained, with each one receiving the DataFrame returned by the previous step.
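Chaining works because transform() only needs to call the supplied function with the current object and return the result. The sketch below shows the pattern without Spark. Frame is a hypothetical stand-in for a DataFrame, and the bodies of with_greeting and with_something are assumptions made for illustration.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = rows

    def transform(self, f):
        # Hand the current frame to f and return whatever f produces.
        return f(self)


def with_greeting(df):
    # Assumed behavior: add a constant "greeting" column.
    return Frame([{**row, "greeting": "hi"} for row in df.rows])


def with_something(df, something):
    # Assumed behavior: add a "something" column holding the given value.
    return Frame([{**row, "something": something} for row in df.rows])


result = (
    Frame([{"name": "jose"}])
    .transform(with_greeting)  # same as lambda df: with_greeting(df)
    .transform(lambda df: with_something(df, "crazy"))
)
print(result.rows)
```

A lambda is only needed when the transformation takes extra arguments, as with_something does here.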
Quinn Helper Functions
import quinn
DataFrame Validations
validate_presence_of_columns()
quinn.validate_presence_of_columns(source_df, ["name", "age", "fun"])
Raises an exception unless source_df contains the name, age, and fun columns.
validate_schema()
quinn.validate_schema(source_df, required_schema)
Raises an exception unless source_df contains all the StructFields defined in the required_schema.
validate_absence_of_columns()
quinn.validate_absence_of_columns(source_df, ["age", "cool"])
Raises an exception if source_df contains the age or cool columns.
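The column-name checks can be pictured as simple membership tests against the DataFrame's column list. The sketch below is illustrative only: the function and exception names are hypothetical, not Quinn's, and a plain list stands in for source_df.columns.

```python
class MissingColumnError(ValueError):
    """Hypothetical exception for required columns that are absent."""


class ProhibitedColumnError(ValueError):
    """Hypothetical exception for columns that must not be present."""


def check_presence_of_columns(columns, required_col_names):
    missing = [c for c in required_col_names if c not in columns]
    if missing:
        raise MissingColumnError(f"missing columns {missing}; found {columns}")


def check_absence_of_columns(columns, prohibited_col_names):
    extra = [c for c in prohibited_col_names if c in columns]
    if extra:
        raise ProhibitedColumnError(f"prohibited columns {extra}; found {columns}")


columns = ["name", "age"]  # stand-in for source_df.columns
check_absence_of_columns(columns, ["cool"])  # passes silently
try:
    check_presence_of_columns(columns, ["name", "age", "fun"])
except MissingColumnError as err:
    print(err)
```

Failing fast like this at the start of a transformation gives a clearer error than a missing-column failure deep inside a later step.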
Functions
single_space()
actual_df = source_df.withColumn(
"words_single_spaced",
quinn.single_space(col("words"))
)
Replaces all multispaces with single spaces (e.g. changes "this has   some" to "this has some").
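For a single Python string, the same effect can be sketched with the standard re module. This is an illustration of the behavior described, not Quinn's implementation, and it assumes that only runs of space characters are collapsed.

```python
import re


def single_space(s):
    """Collapse every run of spaces into one space."""
    return re.sub(r" +", " ", s)


print(single_space("this has   some"))
```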
remove_all_whitespace()
actual_df = source_df.withColumn(
"words_without_whitespace",
quinn.remove_all_whitespace(col("words"))
)
Removes all whitespace in a string (e.g. changes "this has some" to "thishassome").
anti_trim()
actual_df = source_df.withColumn(
"words_anti_trimmed",
quinn.anti_trim(col("words"))
)
Removes all inner whitespace, but doesn't delete leading or trailing whitespace (e.g. changes " this has some " to " thishassome ").
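The difference between removing all whitespace and removing only inner whitespace can be sketched with two regular expressions in plain Python. The patterns below are one way to get the described behavior and are an assumption, not a statement of how Quinn implements it. The anti_trim pattern only matches whitespace that has a word character on both sides.

```python
import re


def remove_all_whitespace(s):
    """Delete every whitespace character."""
    return re.sub(r"\s+", "", s)


def anti_trim(s):
    """Delete whitespace between words, keeping leading and trailing whitespace."""
    return re.sub(r"\b\s+\b", "", s)


text = " this has some "
print(repr(remove_all_whitespace(text)))
print(repr(anti_trim(text)))
```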
remove_non_word_characters()
actual_df = source_df.withColumn(
"words_without_nonword_chars",
quinn.remove_non_word_characters(col("words"))
)
Removes all non-word characters from a string (e.g. changes "si%$#@!#$!@#mpsons" to "simpsons").
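A plain-Python sketch of the same idea uses the \w character class. Whether whitespace counts as a character to keep is not stated in the description, so preserving it here is an assumption.

```python
import re


def remove_non_word_characters(s):
    """Drop everything that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]+", "", s)


print(remove_non_word_characters("si%$#@!#$!@#mpsons"))
```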
exists()
source_df.withColumn(
"any_num_greater_than_5",
quinn.exists(lambda n: n > 5)(col("nums"))
)
nums contains lists of numbers and exists() returns True if any of the numbers in the list are greater than 5. It's similar to the Python any function.
forall()
source_df.withColumn(
"all_nums_greater_than_3",
quinn.forall(lambda n: n > 3)(col("nums"))
)
nums contains lists of numbers and forall() returns True if all of the numbers in the list are greater than 3. It's similar to the Python all function.
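Both exists() and forall() are called in two steps: first with a predicate, then with the column. The sketch below reproduces that curried shape in plain Python on ordinary lists to show the link to any and all. It is not Quinn's implementation, which operates on Spark array columns.

```python
def exists(predicate):
    """Return a function that is True if any element satisfies predicate."""
    def inner(values):
        return any(predicate(v) for v in values)
    return inner


def forall(predicate):
    """Return a function that is True if every element satisfies predicate."""
    def inner(values):
        return all(predicate(v) for v in values)
    return inner


nums = [4, 5, 7]  # illustrative data
print(exists(lambda n: n > 5)(nums))
print(forall(lambda n: n > 3)(nums))
```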
multi_equals()
source_df.withColumn(
"are_s1_and_s2_cat",
quinn.multi_equals("cat")(col("s1"), col("s2"))
)
multi_equals returns True if s1 and s2 are both equal to "cat".
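The same two-step call shape can be sketched in plain Python: the first call fixes the value to compare against, and the second call takes the values to check. This is an illustration on ordinary values, not Quinn's implementation.

```python
def multi_equals(value):
    """Return a function that is True when every argument equals value."""
    def inner(*cols):
        return all(c == value for c in cols)
    return inner


both_cat = multi_equals("cat")
print(both_cat("cat", "cat"))
print(both_cat("cat", "dog"))
```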
Transformations
snake_case_col_names()
quinn.snake_case_col_names(source_df)
Converts all the column names in a DataFrame to snake_case. It's annoying to write SQL queries when columns aren't snake cased.
sort_columns()
quinn.sort_columns(source_df, "asc")
Sorts the DataFrame columns in alphabetical order. Wide DataFrames are easier to navigate when they're sorted alphabetically.
DataFrame Helpers
column_to_list()
quinn.column_to_list(source_df, "name")
Converts a column in a DataFrame to a list of values.
two_columns_to_dictionary()
quinn.two_columns_to_dictionary(source_df, "name", "age")
Converts two columns of a DataFrame into a dictionary. In this example, name is the key and age is the value.
to_list_of_dictionaries()
quinn.to_list_of_dictionaries(source_df)
Converts an entire DataFrame into a list of dictionaries.
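The shape of each helper's result is easiest to see on a small example. The sketch below uses a list of dicts as a stand-in for a collected DataFrame. The rows and ages are made up for illustration, and the functions are plain-Python analogues rather than Quinn's code.

```python
# Made-up rows standing in for a collected DataFrame.
rows = [
    {"name": "jose", "age": 1},
    {"name": "li", "age": 2},
    {"name": "sam", "age": 3},
]


def column_to_list(rows, col_name):
    """One value per row, in row order."""
    return [row[col_name] for row in rows]


def two_columns_to_dictionary(rows, key_col_name, value_col_name):
    """Map each row's key column to its value column."""
    return {row[key_col_name]: row[value_col_name] for row in rows}


print(column_to_list(rows, "name"))
print(two_columns_to_dictionary(rows, "name", "age"))
```

In this picture, the list-of-dictionaries form is the rows variable itself. All of these helpers bring data back to the driver, so they suit small DataFrames.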
Contributing
We are actively looking for feature requests, pull requests, and bug fixes.
Any developer who demonstrates excellence will be invited to be a maintainer of the project.