# pyarrow-bigquery

A simple library to **write to** and **download from** BigQuery tables as PyArrow tables.
## Installation

```bash
pip install pyarrow-bigquery
```
## Quick Start

This guide will help you quickly get started with `pyarrow-bigquery`, a library that allows you to read from and write to Google BigQuery using PyArrow.
### Reading from BigQuery

`pyarrow-bigquery` exposes two methods to read BigQuery tables as PyArrow tables. Depending on your use case or the size of the table, you might want to use one method over the other.

#### Read the Whole Table

When the table is small enough to fit in memory, you can read it directly with `bq.read_table`.

```python
import pyarrow.bigquery as bq

table = bq.read_table("gcp_project.dataset.small_table")

print(table.num_rows)
```
#### Read in Batches

If the target table is larger than memory, or you have other reasons not to fetch the whole table at once, you can use the `bq.reader` iterator together with the `batch_size` parameter to limit how much data is fetched per iteration.

```python
import pyarrow.bigquery as bq

for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
    print(table.num_rows)
```
### Writing to BigQuery

Similarly, the package exposes two methods to write to BigQuery. Depending on your use case or the size of the table, you might want to use one method over the other.

#### Write the Whole Table

When you want to write a complete table at once, use the `bq.write_table` method.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

bq.write_table(table, 'gcp_project.dataset.table')
```
#### Write in Batches (Smaller Chunks)

If you need to write data in smaller chunks, use the `bq.writer` method with the `schema` parameter to define the table structure.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([
    ("integers", pa.int64())
])

# Example inputs to stream: a table and a record batch derived from it.
table = pa.Table.from_arrays([[1, 2, 3, 4]], names=["integers"])
record_batch = table.to_batches()[0]

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    w.write_batch(record_batch)
    w.write_table(table)
```
## API Reference

### pyarrow.bigquery.write_table

Writes a PyArrow `Table` to a BigQuery table. No return value.
**Parameters:**

- `table`: `pa.Table`
  PyArrow table.
- `where`: `str`
  Destination location in the BigQuery catalog.
- `project`: `str`, default `None`
  BigQuery execution project, also the billing project. If not provided, it is extracted from `where`.
- `table_create`: `bool`, default `True`
  Specifies whether the BigQuery table should be created.
- `table_expire`: `None | int`, default `None`
  Number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.
- `table_overwrite`: `bool`, default `False`
  If the table already exists, destroy it and create a new one.
- `worker_type`: `threading.Thread | multiprocessing.Process`, default `threading.Thread`
  Worker backend for writing data.
- `worker_count`: `int`, default `os.cpu_count()`
  Number of threads or processes to use for writing data to BigQuery.
- `batch_size`: `int`, default `100`
  Batch size used for writes. The table is automatically split into chunks of this size.
```python
bq.write_table(table, 'gcp_project.dataset.table')
```
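As a sketch of how the optional parameters above combine (the option values here are arbitrary and purely illustrative):

```python
import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=["integers"])

# Create the table if it does not exist and let it expire one hour after
# creation. All option values below are illustrative, not recommendations.
bq.write_table(
    table,
    "gcp_project.dataset.table",
    table_create=True,
    table_expire=3600,
    table_overwrite=False,
    worker_count=4,
    batch_size=100,
)
```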
### pyarrow.bigquery.writer

Context-manager version of the write method. Useful when the PyArrow table is larger than memory or the data is available in chunks.
**Parameters:**

- `schema`: `pa.Schema`
  PyArrow schema.
- `where`: `str`
  Destination location in the BigQuery catalog.
- `project`: `str`, default `None`
  BigQuery execution project, also the billing project. If not provided, it is extracted from `where`.
- `table_create`: `bool`, default `True`
  Specifies whether the BigQuery table should be created.
- `table_expire`: `None | int`, default `None`
  Number of seconds after which the created table will expire. Used only if `table_create` is `True`. Set to `None` to disable expiration.
- `table_overwrite`: `bool`, default `False`
  If the table already exists, destroy it and create a new one.
- `worker_type`: `threading.Thread | multiprocessing.Process`, default `threading.Thread`
  Worker backend for writing data.
- `worker_count`: `int`, default `os.cpu_count()`
  Number of threads or processes to use for writing data to BigQuery.
- `batch_size`: `int`, default `100`
  Batch size used for writes. The table is automatically split into chunks of this size.
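As an illustration of the worker-related options above, a writer could be configured with process-based workers and a smaller batch size; the option values below are arbitrary:

```python
import multiprocessing

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("integers", pa.int64())])

# Use process-based workers and a reduced write batch size.
# The specific values here are illustrative only.
with bq.writer(
    "gcp_project.dataset.table",
    schema=schema,
    worker_type=multiprocessing.Process,
    worker_count=2,
    batch_size=50,
) as w:
    w.write_table(pa.Table.from_pylist([{"integers": i} for i in range(500)]))
```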
Depending on the use case, you might want to use one of the methods below to write your data to a BigQuery table, using either `pa.Table` or `pa.RecordBatch`.
#### pyarrow.bigquery.writer.write_table

Context-manager method to write a table.

**Parameters:**

- `table`: `pa.Table`
  PyArrow table.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    for a in range(1000):
        w.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))
```
#### pyarrow.bigquery.writer.write_batch

Context-manager method to write a record batch.

**Parameters:**

- `batch`: `pa.RecordBatch`
  PyArrow record batch.

```python
import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    for a in range(1000):
        w.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))
```
### pyarrow.bigquery.read_table

Reads a whole BigQuery table into a single PyArrow `Table`.

**Parameters:**

- `source`: `str`
  BigQuery table location.
- `project`: `str`, default `None`
  BigQuery execution project, also the billing project. If not provided, it is extracted from `source`.
- `columns`: `str`, default `None`
  Columns to download. When not provided, all available columns are downloaded.
- `row_restrictions`: `str`, default `None`
  Row-level filtering executed on the BigQuery side. See the BigQuery documentation for details.
- `worker_type`: `threading.Thread | multiprocessing.Process`, default `threading.Thread`
  Worker backend for fetching data.
- `worker_count`: `int`, default `os.cpu_count()`
  Number of threads or processes to use for fetching data from BigQuery.
- `batch_size`: `int`, default `100`
  Batch size used for fetching. The table is automatically split into chunks of this size.
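For illustration, `row_restrictions` can be combined with `batch_size`; the filter expression below is a placeholder for any valid BigQuery row filter:

```python
import pyarrow.bigquery as bq

# Filter rows server-side before downloading; the expression is a placeholder.
table = bq.read_table(
    "gcp_project.dataset.table",
    row_restrictions="integers > 2",
    batch_size=100,
)

print(table.num_rows)
```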
### pyarrow.bigquery.reader

Reads a BigQuery table as an iterator of PyArrow tables, fetching the data in batches.

**Parameters:**

- `source`: `str`
  BigQuery table location.
- `project`: `str`, default `None`
  BigQuery execution project, also the billing project. If not provided, it is extracted from `source`.
- `columns`: `str`, default `None`
  Columns to download. When not provided, all available columns are downloaded.
- `row_restrictions`: `str`, default `None`
  Row-level filtering executed on the BigQuery side. See the BigQuery documentation for details.
- `worker_type`: `threading.Thread | multiprocessing.Process`, default `threading.Thread`
  Worker backend for fetching data.
- `worker_count`: `int`, default `os.cpu_count()`
  Number of threads or processes to use for fetching data from BigQuery.
- `batch_size`: `int`, default `100`
  Batch size used for fetching. The table is automatically split into chunks of this size.
```python
import pyarrow as pa
import pyarrow.bigquery as bq

parts = []

for part in bq.reader("gcp_project.dataset.table"):
    parts.append(part)

table = pa.concat_tables(parts)
```
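When the table does not need to be materialized in one piece, each batch can also be processed as it arrives; a minimal sketch:

```python
import pyarrow.bigquery as bq

total_rows = 0

# Accumulate a running row count instead of concatenating the batches.
for part in bq.reader("gcp_project.dataset.table", batch_size=100):
    total_rows += part.num_rows

print(total_rows)
```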
## File details

Details for the file `pyarrow_bigquery-0.1.0.tar.gz`.

### File metadata

- Download URL: pyarrow_bigquery-0.1.0.tar.gz
- Upload date:
- Size: 10.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.19

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 594d5baa5aed572e1ebf8368954346a0d995d2e4ecca80cb54383f466d1e1a70 |
| MD5 | dab2f22f0e0435369b4fa330bc2225ce |
| BLAKE2b-256 | d84ee42bd71b4b66a9a6f191a9951bd862a7bf88dcf220d8c83d1cb904ecd3e5 |
## File details

Details for the file `pyarrow_bigquery-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: pyarrow_bigquery-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.19

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | c04b09f03511fd881a077ce3d480f457766a22e8407db49b36509e3be2aea956 |
| MD5 | a73c34bab640ac418270d4518f51c53a |
| BLAKE2b-256 | a677259c10ce4d44a6608b0fc65d4d2aa416c138e01747dcb15a6a6f500a6dce |
BLAKE2b-256 | a677259c10ce4d44a6608b0fc65d4d2aa416c138e01747dcb15a6a6f500a6dce |