Taking the Spark out of PySpark by converting to SQL
Project description
SQLFrame implements the PySpark DataFrame API in order to enable running transformation pipelines directly on database engines - no Spark clusters or dependencies required.
SQLFrame currently supports the following engines (many more in development):
SQLFrame also has a "Standalone" session that be used to generate SQL without any connection to a database engine.
SQLFrame is great for:
- Users who want to run PySpark DataFrame code without having to use a Spark cluster
- Users who want a SQL representation of their DataFrame code for debugging or sharing with others
- Users who want a DataFrame API that leverages the full power of their engine to do the processing
Installation
# BigQuery
pip install "sqlframe[bigquery]"
# DuckDB
pip install "sqlframe[duckdb]"
# Postgres
pip install "sqlframe[postgres]"
# Standalone
pip install sqlframe
See specific engine documentation for additional setup instructions.
Example Usage
from sqlframe.bigquery import BigQuerySession
from sqlframe.bigquery import functions as F
from sqlframe.bigquery import Window
session = BigQuerySession()
table_path = "bigquery-public-data.samples.natality"
# Top 5 years with the greatest year-over-year % change in new families with single child
df = (
session.table(table_path)
.where(F.col("ever_born") == 1)
.groupBy("year")
.agg(F.count("*").alias("num_single_child_families"))
.withColumn(
"last_year_num_single_child_families",
F.lag(F.col("num_single_child_families"), 1).over(Window.orderBy("year"))
)
.withColumn(
"percent_change",
(F.col("num_single_child_families") - F.col("last_year_num_single_child_families"))
/ F.col("last_year_num_single_child_families")
)
.orderBy(F.abs(F.col("percent_change")).desc())
.select(
F.col("year").alias("year"),
F.format_number("num_single_child_families", 0).alias("new families single child"),
F.format_number(F.col("percent_change") * 100, 2).alias("percent change"),
)
.limit(5)
)
>>> df.sql()
WITH `t94228` AS (
SELECT
`natality`.`year` AS `year`,
COUNT(*) AS `num_single_child_families`
FROM `bigquery-public-data`.`samples`.`natality` AS `natality`
WHERE
`natality`.`ever_born` = 1
GROUP BY
`natality`.`year`
), `t39093` AS (
SELECT
`t94228`.`year` AS `year`,
`t94228`.`num_single_child_families` AS `num_single_child_families`,
LAG(`t94228`.`num_single_child_families`, 1) OVER (ORDER BY `t94228`.`year`) AS `last_year_num_single_child_families`
FROM `t94228` AS `t94228`
)
SELECT
`t39093`.`year` AS `year`,
FORMAT('%\'.0f', ROUND(CAST(`t39093`.`num_single_child_families` AS FLOAT64), 0)) AS `new families single child`,
FORMAT('%\'.2f', ROUND(CAST((((`t39093`.`num_single_child_families` - `t39093`.`last_year_num_single_child_families`) / `t39093`.`last_year_num_single_child_families`) * 100) AS FLOAT64), 2)) AS `percent change`
FROM `t39093` AS `t39093`
ORDER BY
ABS(`percent_change`) DESC
LIMIT 5
>>> df.show()
+------+---------------------------+----------------+
| year | new families single child | percent change |
+------+---------------------------+----------------+
| 1989 | 1,650,246 | 25.02 |
| 1974 | 783,448 | 14.49 |
| 1977 | 1,057,379 | 11.38 |
| 1985 | 1,308,476 | 11.15 |
| 1975 | 868,985 | 10.92 |
+------+---------------------------+----------------+
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
sqlframe-1.2.0.tar.gz
(17.1 MB
view details)
Built Distribution
sqlframe-1.2.0-py3-none-any.whl
(114.3 kB
view details)
File details
Details for the file sqlframe-1.2.0.tar.gz
.
File metadata
- Download URL: sqlframe-1.2.0.tar.gz
- Upload date:
- Size: 17.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 25ad61af005aede9a3454902a5547d6c67ea0d653030c5f9ec21e9da36f7b920 |
|
MD5 | bf84976c0bfd14191847470435d90248 |
|
BLAKE2b-256 | 8b3d61db12becb01e175599a7b27a9cb776ef76fd3ccb4acd63ff5de957bb48a |
File details
Details for the file sqlframe-1.2.0-py3-none-any.whl
.
File metadata
- Download URL: sqlframe-1.2.0-py3-none-any.whl
- Upload date:
- Size: 114.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a861c21b82821e77ff6f7a2b2e2272b2c401b01cc92395c36a8b3b128d618b7a |
|
MD5 | 87dc2236f515ef6cc4e4c79fbeb66ff9 |
|
BLAKE2b-256 | 1280e0811b4c382e6cd6319adcd0427c27bf44769ff91d520f6fc2fea9d5a8ad |