=========
Partridge
=========
.. image:: https://img.shields.io/pypi/v/partridge.svg
   :target: https://pypi-hypernode.com/pypi/partridge

.. image:: https://img.shields.io/travis/remix/partridge.svg
   :target: https://travis-ci.org/remix/partridge
Partridge is a Python library for working with `GTFS <https://developers.google.com/transit/gtfs/>`__ feeds using `pandas <https://pandas.pydata.org/>`__ DataFrames.
Partridge is heavily influenced by our experience at `Remix <https://www.remix.com/>`__ analyzing and debugging every GTFS feed we could find.
At the core of Partridge is a dependency graph rooted at ``trips.txt``. Disconnected data is pruned away according to this graph when reading the contents of a feed.
Feeds can also be filtered to create a view specific to your needs. It's most common to filter a feed down to specific dates (``service_id``) or routes (``route_id``), but any field can be filtered.
.. figure:: dependency-graph.png
   :alt: dependency graph
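The pruning idea can be illustrated with a small, self-contained sketch. This is not Partridge's implementation: the ``prune`` function is hypothetical, tables are plain lists of dicts standing in for DataFrames, only four files are covered, and each filter value is assumed to be a set of allowed values.

```python
# Conceptual sketch only; this is NOT Partridge's implementation.
# Tables are plain lists of dicts standing in for DataFrames.

def prune(tables, view):
    """Filter trips by the view, then drop rows unreachable from the kept trips."""
    # 1. Apply the view to trips.txt: keep rows whose values are all allowed.
    trip_filters = view.get("trips.txt", {})
    trips = [
        row for row in tables["trips.txt"]
        if all(row[field] in allowed for field, allowed in trip_filters.items())
    ]
    trip_ids = {row["trip_id"] for row in trips}
    route_ids = {row["route_id"] for row in trips}

    # 2. Follow the edges out of trips.txt.
    stop_times = [r for r in tables["stop_times.txt"] if r["trip_id"] in trip_ids]
    routes = [r for r in tables["routes.txt"] if r["route_id"] in route_ids]

    # 3. Follow the edge from stop_times.txt to stops.txt.
    stop_ids = {r["stop_id"] for r in stop_times}
    stops = [r for r in tables["stops.txt"] if r["stop_id"] in stop_ids]

    return {
        "trips.txt": trips,
        "routes.txt": routes,
        "stop_times.txt": stop_times,
        "stops.txt": stops,
    }


# Made-up toy feed for illustration.
tables = {
    "trips.txt": [
        {"trip_id": "t1", "route_id": "r1", "service_id": "weekday"},
        {"trip_id": "t2", "route_id": "r2", "service_id": "saturday"},
    ],
    "routes.txt": [{"route_id": "r1"}, {"route_id": "r2"}],
    "stop_times.txt": [
        {"trip_id": "t1", "stop_id": "s1"},
        {"trip_id": "t1", "stop_id": "s2"},
        {"trip_id": "t2", "stop_id": "s3"},
    ],
    "stops.txt": [{"stop_id": "s1"}, {"stop_id": "s2"}, {"stop_id": "s3"}],
}
view = {"trips.txt": {"service_id": {"weekday"}}}

pruned = prune(tables, view)
print([r["stop_id"] for r in pruned["stops.txt"]])    # ['s1', 's2']
print([r["route_id"] for r in pruned["routes.txt"]])  # ['r1']
```

Filtering on ``service_id`` removes trip ``t2``, so route ``r2`` and stop ``s3`` are no longer connected to any trip and are dropped as well.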
Philosophy
----------
The design of Partridge is guided by the following principles:
**As much as possible**
- Favor speed
- Allow for extension
- Succeed lazily on expensive paths
- Fail eagerly on inexpensive paths
**As little as possible**
- Do anything other than efficiently read GTFS files into DataFrames
- Take an opinion on the GTFS spec
Installation
------------
.. code:: console

    pip install partridge
Usage
-----
**Setup**
.. code:: python

    import partridge as ptg

    inpath = 'path/to/caltrain-2017-07-24/'
Inspecting the calendar
~~~~~~~~~~~~~~~~~~~~~~~
**The date with the most trips**
.. code:: python

    date, service_ids = ptg.read_busiest_date(inpath)
    # datetime.date(2017, 7, 17), frozenset({'CT-17JUL-Combo-Weekday-01'})
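When several dates tie for the most trips, the release notes describe ``ptg.read_busiest_date`` as returning the earliest one. A minimal sketch of that tie-break, assuming a mapping from date to trip count is already available; the ``busiest_date`` helper and the counts below are made up for illustration and are not part of Partridge:

```python
import datetime


def busiest_date(trip_counts_by_date):
    """Return the earliest date among those with the highest trip count."""
    # Sort key: higher count first, then earlier date.
    return min(trip_counts_by_date, key=lambda d: (-trip_counts_by_date[d], d))


# Hypothetical trip counts, not taken from a real feed.
counts = {
    datetime.date(2017, 7, 18): 92,
    datetime.date(2017, 7, 17): 92,
    datetime.date(2017, 7, 22): 36,
}
print(busiest_date(counts))  # 2017-07-17
```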
**The week with the most trips**
.. code:: python

    service_ids_by_date = ptg.read_busiest_week(inpath)
    # {datetime.date(2017, 7, 17): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    #  datetime.date(2017, 7, 18): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    #  datetime.date(2017, 7, 19): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    #  datetime.date(2017, 7, 20): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    #  datetime.date(2017, 7, 21): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    #  datetime.date(2017, 7, 22): frozenset({'CT-17JUL-Caltrain-Saturday-03'}),
    #  datetime.date(2017, 7, 23): frozenset({'CT-17JUL-Caltrain-Sunday-01'})}
**Dates with active service**
.. code:: python

    service_ids_by_date = ptg.read_service_ids_by_date(inpath)

    date, service_ids = min(service_ids_by_date.items())
    # (datetime.date(2017, 7, 15), frozenset({'CT-17JUL-Caltrain-Saturday-03'}))

    date, service_ids = max(service_ids_by_date.items())
    # (datetime.date(2019, 7, 20), frozenset({'CT-17JUL-Caltrain-Saturday-03'}))
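For readers unfamiliar with how GTFS encodes service dates, the sketch below shows one way to resolve ``calendar.txt`` and ``calendar_dates.txt`` rows into a date-to-``service_id`` mapping. It is a simplified illustration, not Partridge's code: the ``resolve_service_ids_by_date`` and ``parse_date`` helpers and the sample rows are hypothetical, rows are plain dicts of strings, and the sketch does not check whether a service has any trips.

```python
import datetime
from collections import defaultdict

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_date(value):
    """Parse a GTFS date string in YYYYMMDD form."""
    return datetime.datetime.strptime(value, "%Y%m%d").date()


def resolve_service_ids_by_date(calendar, calendar_dates):
    active = defaultdict(set)

    # Regular service: a day counts only when its weekday flag is "1".
    for row in calendar:
        day = parse_date(row["start_date"])
        end = parse_date(row["end_date"])
        while day <= end:
            if row[WEEKDAYS[day.weekday()]] == "1":
                active[day].add(row["service_id"])
            day += datetime.timedelta(days=1)

    # Exceptions: exception_type "1" adds service, "2" removes it.
    for row in calendar_dates:
        day = parse_date(row["date"])
        if row["exception_type"] == "1":
            active[day].add(row["service_id"])
        elif row["exception_type"] == "2":
            active[day].discard(row["service_id"])

    # Drop dates left without any service.
    return {day: frozenset(ids) for day, ids in active.items() if ids}


# Made-up sample rows: weekday service for one week, replaced on one day.
calendar = [{
    "service_id": "weekday",
    "monday": "1", "tuesday": "1", "wednesday": "1", "thursday": "1",
    "friday": "1", "saturday": "0", "sunday": "0",
    "start_date": "20170717", "end_date": "20170723",
}]
calendar_dates = [
    {"service_id": "weekday", "date": "20170719", "exception_type": "2"},
    {"service_id": "holiday", "date": "20170719", "exception_type": "1"},
]

resolved = resolve_service_ids_by_date(calendar, calendar_dates)
print(resolved[datetime.date(2017, 7, 19)])  # frozenset({'holiday'})
```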
**Dates with identical service**
.. code:: python

    dates_by_service_ids = ptg.read_dates_by_service_ids(inpath)

    busiest_date, busiest_service = ptg.read_busiest_date(inpath)
    dates = dates_by_service_ids[busiest_service]

    min(dates), max(dates)
    # datetime.date(2017, 7, 17), datetime.date(2019, 7, 19)
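Conceptually, the mapping from service IDs to dates is the inverse of the date-to-service mapping shown earlier. A minimal sketch of that inversion, assuming the input has the shape returned by ``ptg.read_service_ids_by_date``; the ``invert`` helper is hypothetical, and returning frozensets of dates is a choice made for this sketch rather than a statement about Partridge's return type:

```python
import datetime
from collections import defaultdict


def invert(service_ids_by_date):
    """Group dates by the exact set of service_ids active on them."""
    grouped = defaultdict(set)
    for date, service_ids in service_ids_by_date.items():
        # frozensets are hashable, so they can be used as dict keys.
        grouped[service_ids].add(date)
    return {ids: frozenset(dates) for ids, dates in grouped.items()}


# Three entries copied from the busiest-week output above.
service_ids_by_date = {
    datetime.date(2017, 7, 17): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    datetime.date(2017, 7, 18): frozenset({'CT-17JUL-Combo-Weekday-01'}),
    datetime.date(2017, 7, 22): frozenset({'CT-17JUL-Caltrain-Saturday-03'}),
}

dates_by_service_ids = invert(service_ids_by_date)
weekday_dates = dates_by_service_ids[frozenset({'CT-17JUL-Combo-Weekday-01'})]
print(min(weekday_dates), max(weekday_dates))  # 2017-07-17 2017-07-18
```

Dates that share a key in the result run identical service, which is why a single representative date is often enough for analysis.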
Reading a feed
~~~~~~~~~~~~~~
.. code:: python

    _date, service_ids = ptg.read_busiest_date(inpath)

    view = {
        'trips.txt': {'service_id': service_ids},
        'stops.txt': {'stop_name': 'Gilroy Caltrain'},
    }

    feed = ptg.load_feed(inpath, view)
Extracting a new feed
~~~~~~~~~~~~~~~~~~~~~
.. code:: python

    outpath = 'gtfs-slim.zip'

    date, service_ids = ptg.read_busiest_date(inpath)
    view = {'trips.txt': {'service_id': service_ids}}

    ptg.extract_feed(inpath, outpath, view)
    feed = ptg.load_feed(outpath)

    assert service_ids == set(feed.trips.service_id)
Features
--------
- Surprisingly fast :)
- Load only what you need into memory
- Built-in support for resolving service dates
- Easily extended to support fields and files outside the official spec
  (TODO: document this)
- Handle nested folders and bad data in zips
- Predictable type conversions
Thank You
---------
I hope you find this library useful. If you have suggestions for
improving Partridge, please open an `issue on
GitHub <https://github.com/remix/partridge/issues>`__.
History
=======
1.0.0 (2018-12-18)
------------------
This release is a combination of major internal refactorings and some minor interface changes. Overall, you should expect your upgrade from pre-1.0 versions to be relatively painless. A big thank you to @genhernandez and @csb19815 for their valuable design feedback.
Here is a list of interface changes:
* The class ``partridge.gtfs.feed`` has been renamed to ``partridge.gtfs.Feed``.
* The public interface for instantiating feeds is ``partridge.load_feed``. This function replaces the previously undocumented function ``partridge.get_filtered_feed``.
* A new function has been added for identifying the busiest week in a feed: ``partridge.read_busiest_week``
* The public function ``partridge.get_representative_feed`` has been removed in favor of using ``partridge.read_busiest_date`` directly.
* The public function ``partridge.writers.extract_feed`` is now available via the top level module: ``partridge.extract_feed``.
Miscellaneous minor changes:
* Character encoding detection is now done by the ``cchardet`` package instead of ``chardet``. ``cchardet`` is faster, but may not always return the same result as ``chardet``.
* Zip files are unpacked into a temporary directory instead of reading directly from the zip. These temporary directories are cleaned up when the feed is garbage collected or when the process exits.
* The code base is now annotated with type hints and the build runs ``mypy`` to verify the types.
* DataFrames are cached in a dictionary instead of the ``functools.lru_cache`` decorator.
* The ``partridge.extract_feed`` function now writes files concurrently to improve performance.
0.11.0 (2018-08-01)
-------------------
* Fix major performance issue related to encoding detection. Thank you to @cjer for reporting the issue and advising on a solution.
0.10.0 (2018-04-30)
-------------------
* Improved handling of non-standards-compliant file encodings
* Only require functools32 for Python < 3
* ``ptg.parsers.parse_date`` no longer accepts dates, only strings
0.9.0 (2018-03-24)
------------------
* Improves read time for large feeds by adding LRU caching to ``ptg.parsers.parse_time``.
0.8.0 (2018-03-14)
------------------
* Gracefully handle completely empty files. This change unifies the behavior of reading from a CSV with a header only (no data rows) and a completely empty (zero bytes) file in the zip.
0.7.0 (2018-03-09)
------------------
* Fix handling of nested folders and zips containing nested folders.
* Add ``ptg.get_filtered_feed`` for multi-file filtering.
0.6.1 (2018-02-24)
------------------
* Fix bug in ``ptg.read_service_ids_by_date``. Reported by @cjer in #27.
0.6.0 (2018-02-21)
------------------
* Published package no longer includes unnecessary fixtures to reduce the size.
* Naively write a feed object to a zip file with ``ptg.write_feed_dangerously``.
* Read the earliest, busiest date and its ``service_id`` values from a feed with ``ptg.read_busiest_date``.
* Bug fix: Handle ``calendar.txt``/``calendar_dates.txt`` entries without applicable trips.
0.6.0.dev1 (2018-01-23)
-----------------------
* Add support for reading files from a folder. Thanks again @danielsclint!
0.5.0 (2017-12-22)
------------------
* Easily build a representative view of a zip with ``ptg.get_representative_feed``. Inspired by `peartree <https://github.com/kuanb/peartree/blob/3bfc3f49ae6986d6020913b63c8ee32582b3dcc3/peartree/paths.py#L26>`_.
* Extract out GTFS zips by agency_id/route_id with ``ptg.extract_{agencies,routes}``.
* Read arbitrary files from a zip with ``feed.get('myfile.txt')``.
* Remove ``service_ids_by_date``, ``dates_by_service_ids``, and ``trip_counts_by_date`` from the feed class. Instead use ``ptg.{read_service_ids_by_date,read_dates_by_service_ids,read_trip_counts_by_date}``.
0.4.0 (2017-12-10)
------------------
* Add support for Python 2.7. Thanks @danielsclint!
0.3.0 (2017-10-12)
------------------
* Fix service date resolution for raw_feed. Previously raw_feed considered all days of the week from calendar.txt to be active regardless of 0/1 value.
0.2.0 (2017-09-30)
------------------
* Add missing edge from fare_rules.txt to routes.txt in default dependency graph.
0.1.0 (2017-09-23)
------------------
* First release on PyPI.