PySpark helper methods to maximize developer efficiency

Quinn

PySpark helper methods to maximize developer productivity.

Quinn validates DataFrames, extends core classes, defines DataFrame transformations, and provides SQL functions.

Setup

Quinn is published on PyPI and can be installed with this command:

pip install quinn

PySpark Core Class Extensions

from quinn.extensions import *

Column Extensions

isFalsy()

source_df.withColumn("is_stuff_falsy", F.col("has_stuff").isFalsy())

Returns True if has_stuff is None or False.

isTruthy()

source_df.withColumn("is_stuff_truthy", F.col("has_stuff").isTruthy())

Returns True unless has_stuff is None or False.
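To spell out the truth table, here is a plain-Python sketch of the rules described for isFalsy() and isTruthy(). The helper names is_falsy and is_truthy are hypothetical and work on ordinary Python values, not Spark columns; quinn's real methods build Column expressions.

```python
def is_falsy(value):
    """Mirror the described isFalsy() rule: True only for None or False."""
    return value is None or value is False


def is_truthy(value):
    """Mirror the described isTruthy() rule: the negation of is_falsy."""
    return not is_falsy(value)


# Print the truth table for the three possible values of a nullable boolean.
for value in (True, False, None):
    print(value, is_falsy(value), is_truthy(value))
```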

isNullOrBlank()

source_df.withColumn("is_blah_null_or_blank", F.col("blah").isNullOrBlank())

Returns True if blah is null or blank (the empty string or a string that only contains whitespace).
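The same rule can be sketched in plain Python. The function name is_null_or_blank is hypothetical and takes an ordinary string or None rather than a Spark column, so treat it as an illustration of the semantics only.

```python
def is_null_or_blank(value):
    """True for None, the empty string, or a whitespace-only string."""
    return value is None or value.strip() == ""


print(is_null_or_blank(None))
print(is_null_or_blank("   "))
print(is_null_or_blank(" blah "))
```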

isNotIn()

source_df.withColumn("is_not_bobs_hobby", F.col("fun_thing").isNotIn(bobs_hobbies))

Returns True if fun_thing is not included in the bobs_hobbies list.

nullBetween()

source_df.withColumn("is_between", F.col("age").nullBetween(F.col("lower_age"), F.col("upper_age")))

Returns True if age is between lower_age and upper_age. If lower_age is populated and upper_age is null, it returns True if age is greater than or equal to lower_age. If lower_age is null and upper_age is populated, it returns True if age is less than or equal to upper_age.
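The open-ended bound logic is easier to follow as code. The sketch below restates the rules above in plain Python, with None standing in for null. The function name null_between is hypothetical, and the result when the value itself is missing or when both bounds are missing is not stated above, so returning False in those cases is an assumption.

```python
def null_between(value, lower, upper):
    """Inclusive range check where a missing bound means 'unbounded on that side'."""
    # Assumption: a missing value, or two missing bounds, yields False.
    if value is None or (lower is None and upper is None):
        return False
    if lower is None:
        return value <= upper  # only the upper bound is known
    if upper is None:
        return value >= lower  # only the lower bound is known
    return lower <= value <= upper


print(null_between(25, 20, 30))
print(null_between(25, 20, None))
print(null_between(25, None, 20))
```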

SparkSession Extensions

create_df()

spark.create_df(
    [("jose", "a"), ("li", "b"), ("sam", "c")],
    [("name", StringType(), True), ("blah", StringType(), True)]
)

Creates a DataFrame with a syntax that's less verbose than the built-in createDataFrame method.

DataFrame Extensions

transform()

source_df\
    .transform(lambda df: with_greeting(df))\
    .transform(lambda df: with_something(df, "crazy"))

Allows multiple DataFrame transformations to be chained: each call passes the DataFrame to the given function and returns the result.
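The chaining pattern does not depend on Spark, so it can be sketched with a tiny stand-in class. MiniFrame is hypothetical (a list of row dictionaries, not a DataFrame), and the bodies of with_greeting and with_something are invented for illustration. The point is only that transform(func) amounts to func(self).

```python
class MiniFrame:
    """Hypothetical stand-in for a DataFrame: a list of row dictionaries."""

    def __init__(self, rows):
        self.rows = rows

    def transform(self, func):
        # Hand the frame to the function and return whatever it produces.
        return func(self)


def with_greeting(frame):
    # Illustrative transformation: add a constant "greeting" field.
    return MiniFrame([{**row, "greeting": "hi"} for row in frame.rows])


def with_something(frame, something):
    # Illustrative transformation that needs an extra argument.
    return MiniFrame([{**row, "something": something} for row in frame.rows])


result = (
    MiniFrame([{"name": "jose"}])
    .transform(with_greeting)  # one-argument functions can be passed directly
    .transform(lambda frame: with_something(frame, "crazy"))
)
print(result.rows)
```

A lambda is only needed when the transformation takes arguments besides the frame itself.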

Quinn Helper Functions

import quinn

DataFrame Validations

validate_presence_of_columns()

quinn.validate_presence_of_columns(source_df, ["name", "age", "fun"])

Raises an exception unless source_df contains the name, age, and fun columns.

validate_schema()

quinn.validate_schema(source_df, required_schema)

Raises an exception unless source_df contains all the StructFields defined in the required_schema.

validate_absence_of_columns()

quinn.validate_absence_of_columns(source_df, ["age", "cool"])

Raises an exception if source_df contains an age or cool column.
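The presence and absence checks reduce to comparing lists of column names. The sketch below works on a plain list of names (what df.columns would give you). The function names and the ColumnValidationError class are hypothetical and are not quinn's own.

```python
class ColumnValidationError(ValueError):
    """Hypothetical exception type used only in this sketch."""


def check_presence_of_columns(columns, required):
    """Raise unless every required name appears in columns."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ColumnValidationError(f"missing columns: {missing}")


def check_absence_of_columns(columns, prohibited):
    """Raise if any prohibited name appears in columns."""
    found = [name for name in prohibited if name in columns]
    if found:
        raise ColumnValidationError(f"prohibited columns present: {found}")


columns = ["name", "age", "fun"]
check_presence_of_columns(columns, ["name", "age", "fun"])  # passes silently
try:
    check_absence_of_columns(columns, ["age", "cool"])
except ColumnValidationError as error:
    print(error)
```

Failing early like this keeps a schema problem from surfacing later as a less readable Spark analysis error.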

Functions

single_space()

actual_df = source_df.withColumn(
    "words_single_spaced",
    quinn.single_space(col("words"))
)

Replaces all multispaces with single spaces (e.g. changes "this has   some" to "this has some").
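On an ordinary Python string, the same effect can be had with a regular expression. This is a sketch of the behavior, not quinn's implementation; the function here takes a str rather than a Column, and it assumes "multispaces" means runs of two or more space characters.

```python
import re


def single_space(text):
    """Collapse every run of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", text)


print(single_space("this  has   some"))
```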

remove_all_whitespace()

actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)

Removes all whitespace in a string (e.g. changes "this has some" to "thishassome").

anti_trim()

actual_df = source_df.withColumn(
    "words_anti_trimmed",
    quinn.anti_trim(col("words"))
)

Removes all inner whitespace, but doesn't delete leading or trailing whitespace (e.g. changes " this has some " to " thishassome ").
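Because anti_trim is the mirror image of a normal trim, a plain-Python sketch may help. The version below splits a str into its leading whitespace, core, and trailing whitespace, then strips whitespace from the core only. It illustrates the described behavior and is not quinn's implementation, which operates on Columns.

```python
import re


def anti_trim(text):
    """Remove inner whitespace while keeping leading and trailing whitespace."""
    core = text.strip()
    if not core:
        return text  # whitespace-only input has no inner part to collapse
    leading = text[: len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]
    return leading + re.sub(r"\s+", "", core) + trailing


print(repr(anti_trim(" this has some ")))
```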

remove_non_word_characters()

actual_df = source_df.withColumn(
    "words_without_nonword_chars",
    quinn.remove_non_word_characters(col("words"))
)

Removes all non-word characters from a string (e.g. changes "si%$#@!#$!@#mpsons" to "simpsons").
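A plain-Python sketch of this cleanup, using a regular expression on a str. It assumes "word characters" means the regex class \w (letters, digits, underscore) and that whitespace should be kept; both points are assumptions about the intended behavior.

```python
import re


def remove_non_word_characters(text):
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]+", "", text)


print(remove_non_word_characters("si%$#@!#$!@#mpsons"))
```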

exists()

source_df.withColumn(
    "any_num_greater_than_5",
    quinn.exists(lambda n: n > 5)(col("nums"))
)

nums contains lists of numbers and exists() returns True if any of the numbers in the list are greater than 5. It's similar to the Python any function.

forall()

source_df.withColumn(
    "all_nums_greater_than_3",
    quinn.forall(lambda n: n > 3)(col("nums"))
)

nums contains lists of numbers and forall() returns True if all of the numbers in the list are greater than 3. It's similar to the Python all function.
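The two-step call shape, exists(predicate)(column), is a curried function: the first call fixes the predicate, the second applies it. Here is a plain-Python sketch of that shape using any and all on ordinary lists. These are stand-ins for illustration, not quinn's Column-based functions.

```python
def exists(predicate):
    """Return a checker that is True if any element satisfies predicate."""
    def check(values):
        return any(predicate(value) for value in values)
    return check


def forall(predicate):
    """Return a checker that is True if every element satisfies predicate."""
    def check(values):
        return all(predicate(value) for value in values)
    return check


any_greater_than_5 = exists(lambda n: n > 5)
all_greater_than_3 = forall(lambda n: n > 3)

print(any_greater_than_5([1, 2, 7]))
print(all_greater_than_3([1, 4, 9]))
```

In this sketch an empty list gives False for exists and True for forall, matching Python's any and all; whether quinn treats empty arrays the same way is not stated here.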

multi_equals()

source_df.withColumn(
    "are_s1_and_s2_cat",
    quinn.multi_equals("cat")(col("s1"), col("s2"))
)

multi_equals returns True if s1 and s2 are both equal to "cat".
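The same comparison on ordinary Python values can be sketched as a curried function: fix the expected value first, then pass any number of values to compare. This stand-in is illustrative only; quinn's multi_equals takes Columns.

```python
def multi_equals(expected):
    """Return a checker that is True when every argument equals expected."""
    def check(*values):
        return all(value == expected for value in values)
    return check


both_cat = multi_equals("cat")
print(both_cat("cat", "cat"))
print(both_cat("cat", "dog"))
```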

Transformations

snake_case_col_names()

quinn.snake_case_col_names(source_df)

Converts all the column names in a DataFrame to snake_case. It's annoying to write SQL queries when columns aren't snake cased.
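As a rough illustration of the renaming, the sketch below converts a list of column names in plain Python. It assumes the conversion amounts to lower-casing and replacing spaces with underscores; that rule and the helper names are assumptions, so check quinn's source for the exact behavior.

```python
def to_snake_case(name):
    """Assumed rule: lower-case the name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def snake_case_names(columns):
    """Apply the assumed rule to every column name."""
    return [to_snake_case(name) for name in columns]


print(snake_case_names(["First Name", "Home City", "age"]))
```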

sort_columns()

quinn.sort_columns(source_df, "asc")

Sorts the DataFrame columns in alphabetical order. Wide DataFrames are easier to navigate when they're sorted alphabetically.
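The column ordering itself can be sketched on a plain list of names. Only "asc" appears in the example above; accepting "desc" for reverse order is an assumption, as is the helper name sorted_column_names.

```python
def sorted_column_names(columns, sort_order):
    """Return column names in alphabetical order ("asc") or reversed ("desc")."""
    if sort_order not in ("asc", "desc"):
        raise ValueError(f"unsupported sort order: {sort_order!r}")
    # "desc" is an assumed option; the example above only shows "asc".
    return sorted(columns, reverse=(sort_order == "desc"))


print(sorted_column_names(["name", "age", "fun"], "asc"))
```

With a real DataFrame, a list ordered this way could be passed to select to produce the reordered columns.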

DataFrame Helpers

column_to_list()

quinn.column_to_list(source_df, "name")

Converts a column in a DataFrame to a list of values.

two_columns_to_dictionary()

quinn.two_columns_to_dictionary(source_df, "name", "age")

Converts two columns of a DataFrame into a dictionary. In this example, name is the key and age is the value.

to_list_of_dictionaries()

quinn.to_list_of_dictionaries(source_df)

Converts an entire DataFrame into a list of dictionaries.
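To show the shape of what these three helpers return, here is a plain-Python sketch over a stand-in table: a list of column names plus a list of row tuples. The sample rows and ages are made up, and the functions below are illustrations that take that stand-in rather than a DataFrame.

```python
columns = ["name", "age"]
rows = [("jose", 31), ("li", 27), ("sam", 45)]  # made-up sample data


def column_to_list(columns, rows, column):
    """Collect one column's values into a list."""
    index = columns.index(column)
    return [row[index] for row in rows]


def two_columns_to_dictionary(columns, rows, key_column, value_column):
    """Map each key_column value to the value_column value on the same row."""
    key_index = columns.index(key_column)
    value_index = columns.index(value_column)
    return {row[key_index]: row[value_index] for row in rows}


def to_list_of_dictionaries(columns, rows):
    """Turn every row into a {column name: value} dictionary."""
    return [dict(zip(columns, row)) for row in rows]


print(column_to_list(columns, rows, "name"))
print(two_columns_to_dictionary(columns, rows, "name", "age"))
print(to_list_of_dictionaries(columns, rows))
```

With a real DataFrame these conversions pull the data onto the driver, so they are best suited to small results.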

Contributing

We are actively looking for feature requests, pull requests, and bug fixes.

Any developer who demonstrates excellence will be invited to be a maintainer of the project.
