Project description

pyarrow-bigquery

A simple library to write to and download from BigQuery tables as PyArrow tables.

Installation

pip install pyarrow-bigquery

Quick Start

This guide will help you quickly get started with pyarrow-bigquery, a library that allows you to read from and write to Google BigQuery using PyArrow.

Reading from BigQuery

pyarrow-bigquery exposes two methods to read BigQuery tables as PyArrow tables. Depending on your use case or the size of the table, you might want to use one method over the other.

Read the Whole Table

When the table is small enough to fit in memory, you can read it directly using bq.read_table.

import pyarrow.bigquery as bq

table = bq.read_table("gcp_project.dataset.small_table")

print(table.num_rows)

Read with Batches

If the target table is larger than memory or you have other reasons not to fetch the whole table at once, you can use the bq.reader iterator method along with the batch_size parameter to limit how much data is fetched per iteration.

import pyarrow.bigquery as bq

for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
    print(table.num_rows)

Writing to BigQuery

Similarly, the package exposes two methods to write to BigQuery. Depending on your use case or the size of the table, you might want to use one method over the other.

Write the Whole Table

When you want to write a complete table at once, you can use the bq.write_table method.

import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

bq.write_table(table, 'gcp_project.dataset.table')

Write in Batches (Smaller Chunks)

If you need to write data in smaller chunks, you can use the bq.writer method with the schema parameter to define the table structure.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([
    ("integers", pa.int64())
])

# Example data matching the schema above.
table = pa.Table.from_arrays([pa.array([1, 2, 3, 4])], names=["integers"])
record_batch = pa.RecordBatch.from_arrays([pa.array([5, 6, 7, 8])], names=["integers"])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    w.write_batch(record_batch)
    w.write_table(table)

API Reference

pyarrow.bigquery.write_table

Write a PyArrow Table to a BigQuery Table. No return value.

Parameters:

  • table: pa.Table
    PyArrow table.

  • where: str
    Destination location in BigQuery catalog.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from where.

  • table_create: bool, default True
    Specifies if the BigQuery table should be created.

  • table_expire: None | int, default None
    Number of seconds after which the created table will expire. Used only if table_create is True. Set to None to disable expiration.

  • table_overwrite: bool, default False
    If the table already exists, destroy it and create a new one.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for writing data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for writing data to BigQuery.

  • batch_size: int, default 100
    Batch size used for writes. The table will be automatically split into batches of this size.

bq.write_table(table, 'gcp_project.dataset.table')
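
As an illustration, here is a hedged sketch that combines several of the optional parameters documented above; the parameter names come from this reference, while the values are arbitrary:

import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([pa.array([1, 2, 3, 4])], names=["integers"])

# Overwrite any existing table and let the new table expire after one day.
bq.write_table(
    table,
    "gcp_project.dataset.table",
    table_overwrite=True,       # drop and recreate the table if it already exists
    table_expire=24 * 60 * 60,  # seconds until the created table expires
    worker_count=4,             # threads used for the upload
    batch_size=500,             # rows per write batch
)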

pyarrow.bigquery.writer

Context manager version of the write method. Useful when the PyArrow table is larger than available memory or the data is available in chunks.

Parameters:

  • schema: pa.Schema
    PyArrow schema.

  • where: str
    Destination location in BigQuery catalog.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from where.

  • table_create: bool, default True
    Specifies if the BigQuery table should be created.

  • table_expire: None | int, default None
    Number of seconds after which the created table will expire. Used only if table_create is True. Set to None to disable expiration.

  • table_overwrite: bool, default False
    If the table already exists, destroy it and create a new one.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for writing data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for writing data to BigQuery.

  • batch_size: int, default 100
    Batch size used for writes. The table will be automatically split into batches of this size.
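
For illustration, a hedged sketch of opening a writer with a few of the optional parameters listed above (parameter names as documented, values arbitrary):

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("integers", pa.int64())])

# Recreate the destination table and write with two worker threads.
with bq.writer(
    "gcp_project.dataset.table",
    schema=schema,
    table_overwrite=True,  # drop and recreate the table if it already exists
    worker_count=2,        # threads used for the upload
    batch_size=200,        # rows per write batch
) as w:
    w.write_table(pa.Table.from_arrays([pa.array([1, 2, 3, 4])], names=["integers"]))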

Depending on the use case, you might want to use one of the methods below to write your data to a BigQuery table, using either pa.Table or pa.RecordBatch.

pyarrow.bigquery.writer.write_table

Context manager method to write a table.

Parameters:

  • table: pa.Table
    PyArrow table.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    for a in range(1000):
        w.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))

pyarrow.bigquery.writer.write_batch

Context manager method to write a record batch.

Parameters:

  • batch: pa.RecordBatch
    PyArrow record batch.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    for a in range(1000):
        w.write_batch(pa.RecordBatch.from_pylist([{'value': [a] * 10}]))

pyarrow.bigquery.read_table

Read a whole BigQuery table into memory. Returns a pa.Table.

Parameters:

  • source: str
    BigQuery table location.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from source.

  • columns: str, default None
    Columns to download. When not provided, all available columns will be downloaded.

  • row_restrictions: str, default None
    Row-level filtering executed on the BigQuery side. See the BigQuery documentation for more details.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    Batch size used for fetching. The table will be automatically split into batches of this size.
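
For illustration, a hedged sketch of read_table using the filtering and batching parameters documented above; the table location, column names, and filter are made up, and it is assumed that columns accepts a list of column names:

import pyarrow.bigquery as bq

# Download two columns only, with a row filter applied on the BigQuery side.
table = bq.read_table(
    "gcp_project.dataset.table",
    columns=["name", "value"],     # hypothetical column names (assumed to be a list)
    row_restrictions="value > 0",  # hypothetical filter, evaluated server-side
    batch_size=500,                # rows fetched per batch
)

print(table.num_rows)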

pyarrow.bigquery.reader

Iterator version of the read method. Yields PyArrow tables in batches; useful when the target table is larger than memory.

Parameters:

  • source: str
    BigQuery table location.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from source.

  • columns: str, default None
    Columns to download. When not provided, all available columns will be downloaded.

  • row_restrictions: str, default None
    Row-level filtering executed on the BigQuery side. See the BigQuery documentation for more details.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    Batch size used for fetching. The table will be automatically split into batches of this size.

import pyarrow as pa
import pyarrow.bigquery as bq

parts = []
for part in bq.reader("gcp_project.dataset.table"):
    parts.append(part)

table = pa.concat_tables(parts)
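
A hedged variation of the example above that passes some of the documented optional parameters; the column names and filter are made up, and columns is again assumed to take a list:

import pyarrow as pa
import pyarrow.bigquery as bq

parts = []
for part in bq.reader(
    "gcp_project.dataset.table",
    columns=["name", "value"],     # hypothetical column names (assumed to be a list)
    row_restrictions="value > 0",  # hypothetical filter, evaluated server-side
    batch_size=1000,               # rows fetched per batch
):
    parts.append(part)

table = pa.concat_tables(parts)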

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyarrow_bigquery-0.1.0.tar.gz (10.6 kB)

Uploaded Source

Built Distribution

pyarrow_bigquery-0.1.0-py3-none-any.whl (11.0 kB)

Uploaded Python 3

File details

Details for the file pyarrow_bigquery-0.1.0.tar.gz.

File metadata

  • Download URL: pyarrow_bigquery-0.1.0.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.9.19

File hashes

Hashes for pyarrow_bigquery-0.1.0.tar.gz

  • SHA256: 594d5baa5aed572e1ebf8368954346a0d995d2e4ecca80cb54383f466d1e1a70
  • MD5: dab2f22f0e0435369b4fa330bc2225ce
  • BLAKE2b-256: d84ee42bd71b4b66a9a6f191a9951bd862a7bf88dcf220d8c83d1cb904ecd3e5


File details

Details for the file pyarrow_bigquery-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pyarrow_bigquery-0.1.0-py3-none-any.whl

  • SHA256: c04b09f03511fd881a077ce3d480f457766a22e8407db49b36509e3be2aea956
  • MD5: a73c34bab640ac418270d4518f51c53a
  • BLAKE2b-256: a677259c10ce4d44a6608b0fc65d4d2aa416c138e01747dcb15a6a6f500a6dce

