
pyarrow-bigquery

An extension library to write to and read from BigQuery tables as PyArrow tables.


Installation

pip install pyarrow-bigquery

Source Code

https://github.com/xando/pyarrow-bigquery/

Quick Start

This guide will help you quickly get started with pyarrow-bigquery, a library that allows you to read from and write to Google BigQuery using PyArrow.

Reading

pyarrow-bigquery offers four methods to read BigQuery tables as PyArrow tables. Choose the most suitable one based on your use case and the table size.

Read from a Table Location

When the table is small enough to fit in memory, you can read it directly using read_table.

import pyarrow.bigquery as bq

table = bq.read_table("gcp_project.dataset.small_table")

print(table.num_rows)

Read from a Query

Alternatively, if the query results are small enough to fit in memory, you can read them directly using read_query.

import pyarrow.bigquery as bq

table = bq.read_query(
    project="gcp_project",
    query="SELECT * FROM `gcp_project.dataset.small_table`"
)

print(table.num_rows)

Read in Batches

If the target table is larger than memory or you prefer not to fetch the entire table at once, you can use the bq.reader iterator method with the batch_size parameter to limit how much data is fetched per iteration.

import pyarrow.bigquery as bq

for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
    print(table.num_rows)

Read Query in Batches

Similarly, you can read data in batches from a query using reader_query.

import pyarrow.bigquery as bq

with bq.reader_query(
    project="gcp_project",
    query="SELECT * FROM `gcp_project.dataset.small_table`"
) as reader:
    print(reader.schema)
    for table in reader:
        print(table.num_rows)

Writing

The package provides two methods to write to BigQuery. Depending on your use case or the table size, you can choose the appropriate method.

Write the Entire Table

To write a complete table at once, use the bq.write_table method.

import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

bq.write_table(table, 'gcp_project.dataset.table')

Write in Batches

If you need to write data in smaller chunks, use the bq.writer method with the schema parameter to define the table structure.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([
    ("integers", pa.int64())
])

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=["integers"])
record_batch = pa.RecordBatch.from_arrays([pa.array([5, 6, 7, 8])], names=["integers"])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    writer.write_batch(record_batch)
    writer.write_table(table)

API Reference

Writing

pyarrow.bigquery.write_table

Writes a PyArrow Table to a BigQuery Table. No return value.

Parameters:

  • table: pa.Table
    The PyArrow table.

  • where: str
    The destination location in the BigQuery catalog.

  • project: str, default None
    The BigQuery execution project, also the billing project. If not provided, it will be extracted from where.

  • table_create: bool, default True
    Specifies if the BigQuery table should be created.

  • table_expire: None | int, default None
    The number of seconds after which the created table will expire. Used only if table_create is True. Set to None to disable expiration.

  • table_overwrite: bool, default False
    If the table already exists, it will be destroyed and a new one will be created.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    The worker backend for writing data.

  • worker_count: int, default os.cpu_count()
    The number of threads or processes to use for writing data to BigQuery.

  • batch_size: int, default 100
    The batch size used for writes. The table will be automatically split into batches of this size.

bq.write_table(table, 'gcp_project.dataset.table')
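
The optional table management parameters documented above can be combined. A sketch, with illustrative values, that recreates the destination table and lets it expire after one hour:

import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=["integers"])

# Drop any existing table, recreate it, and expire it after 3600 seconds (illustrative values).
bq.write_table(
    table,
    "gcp_project.dataset.table",
    table_overwrite=True,
    table_expire=3600,
)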

pyarrow.bigquery.writer (Context Manager)

Context manager counterpart of write_table. Useful when the PyArrow table is larger than memory or the data is available in chunks.

Parameters:

  • schema: pa.Schema
    The PyArrow schema.

  • where: str
    The destination location in the BigQuery catalog.

  • project: str, default None
    The BigQuery execution project, also the billing project. If not provided, it will be extracted from where.

  • table_create: bool, default True
    Specifies if the BigQuery table should be created.

  • table_expire: None | int, default None
    The number of seconds after which the created table will expire. Used only if table_create is True. Set to None to disable expiration.

  • table_overwrite: bool, default False
    If the table already exists, it will be destroyed and a new one will be created.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    The worker backend for writing data.

  • worker_count: int, default os.cpu_count()
    The number of threads or processes to use for writing data to BigQuery.

  • batch_size: int, default 100
    The batch size used for writes. The table will be automatically split into batches of this size.
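
As a sketch, the writer can be opened with non-default worker and batching options from the parameters above (the values here are illustrative, not recommendations):

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("integers", pa.int64())])

# Illustrative values: four worker threads and 500 rows per write batch.
with bq.writer(
    "gcp_project.dataset.table",
    schema=schema,
    worker_count=4,
    batch_size=500,
) as writer:
    writer.write_table(pa.Table.from_arrays([[1, 2, 3]], names=["integers"]))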

Depending on your use case, you might want to use one of the methods below to write your data to a BigQuery table, using either pa.Table or pa.RecordBatch.

pyarrow.bigquery.writer.write_table (Context Manager Method)

Context manager method to write a table.

Parameters:

  • table: pa.Table
    The PyArrow table.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    for a in range(1000):
        writer.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))

pyarrow.bigquery.writer.write_batch (Context Manager Method)

Context manager method to write a record batch.

Parameters:

  • batch: pa.RecordBatch
    The PyArrow record batch.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as writer:
    for a in range(1000):
        writer.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))

Reading

pyarrow.bigquery.read_table

Parameters:

  • source: str
    The BigQuery table location.

  • project: str, default None
    The BigQuery execution project, also the billing project. If not provided, it will be extracted from source.

  • columns: str, default None
    The columns to download. When not provided, all available columns will be downloaded.

  • row_restrictions: str, default None
    Row-level filtering executed on the BigQuery side. More information is available in the BigQuery documentation.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    The worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    The number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    The batch size used for fetching. The table will be automatically split into batches of this size.
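
A minimal usage sketch; the table location and the filter column in row_restrictions are placeholders:

import pyarrow.bigquery as bq

# Fetch the table, filtering rows on the BigQuery side ("value" is a hypothetical column).
table = bq.read_table(
    "gcp_project.dataset.table",
    row_restrictions="value > 0",
)

print(table.num_rows)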

pyarrow.bigquery.read_query

Parameters:

  • project: str
    The BigQuery query execution (and billing) project.

  • query: str
    The query to be executed.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    The worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    The number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    The batch size used for fetching. The table will be automatically split into batches of this size.

table = bq.read_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`")

pyarrow.bigquery.reader (Context Manager)

Parameters:

  • source: str
    The BigQuery table location.

  • project: str, default None
    The BigQuery execution project, also the billing project. If not provided, it will be extracted from source.

  • columns: str, default None
    The columns to download. When not provided, all available columns will be downloaded.

  • row_restrictions: str, default None
    Row-level filtering executed on the BigQuery side. More information is available in the BigQuery documentation.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    The worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    The number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    The batch size used for fetching. The table will be automatically split into batches of this size.

Attributes:

  • schema: pa.Schema
    Provides the schema of the resulting PyArrow table. Available only while the context manager is active (after __enter__ has been called).

import pyarrow as pa
import pyarrow.bigquery as bq

parts = []

with bq.reader("gcp_project.dataset.table") as r:

    print(r.schema)

    for batch in r:
        parts.append(batch)

table = pa.concat_tables(parts)

pyarrow.bigquery.reader_query (Context Manager)

Parameters:

  • project: str
    The BigQuery query execution (and billing) project.

  • query: str
    The query to be executed.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    The worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    The number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    The batch size used for fetching. The table will be automatically split into batches of this size.

Attributes:

  • schema: pa.Schema
    Provides the schema of the resulting PyArrow table. Available only while the context manager is active (after __enter__ has been called).

with bq.reader_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`") as r:
    for batch in r:
        print(batch.num_rows)

