basic streaming text processing

====
pyin
====

It's like sed, but Python!

.. image:: https://travis-ci.org/geowurster/pyin.svg?branch=0.4
    :target: https://travis-ci.org/geowurster/pyin

.. image:: https://coveralls.io/repos/geowurster/pyin/badge.svg?branch=master
    :target: https://coveralls.io/r/geowurster/pyin?branch=master


Examples
========

See the `Cookbook <https://github.com/geowurster/pyin/blob/master/Cookbook.rst>`__ for more examples.

Change the newline character in a CSV:

.. code-block:: console

    $ more sample-data/csv-with-header.csv | pyin "line.replace('\n', '\r\n')" > output.csv

Extract a BigQuery schema from an existing table and pretty print it:

.. code-block:: console

    $ bq show --format=json ${DATASET}.${TABLE} | pyin -m json -m pprint "pprint.pformat(json.loads(line)['schema']['fields'])"
    [{u'mode': u'NULLABLE', u'name': u'mmsi', u'type': u'STRING'},
     {u'mode': u'NULLABLE', u'name': u'longitude', u'type': u'FLOAT'},
     {u'mode': u'NULLABLE', u'name': u'latitude', u'type': u'FLOAT'}
     ...]
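
For comparison, the expression in this example can be tried in plain Python. This is only a sketch of what the expression computes for one line of input. The ``schema_fields`` helper and the sample record are made up for illustration, and it assumes ``bq`` emits the table description as a single line of JSON.

```python
import json
import pprint


def schema_fields(line):
    """Parse one line of JSON and pretty-format its schema fields."""
    return pprint.pformat(json.loads(line)['schema']['fields'])


# Hypothetical record shaped like the output shown above.
sample = json.dumps({
    'schema': {
        'fields': [
            {'mode': 'NULLABLE', 'name': 'mmsi', 'type': 'STRING'},
        ]
    }
})

print(schema_fields(sample))
```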

Read the first 100K lines of a CSV and write only the lines where the
``Msg type`` column is equal to ``5``:

.. code-block:: console

    $ pyin -i ${INFILE} -o ${OUTFILE} \
        --true \
        --lines 100000 \
        --reader csv.DictReader \
        --import csv \
        --import newlinejson \
        --writer newlinejson.Writer \
        "line['Msg type'] == '5'"
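
A rough standard-library equivalent of the command above may help show what each option contributes. It is a sketch rather than a description of pyin's internals: ``json.dumps`` stands in for ``newlinejson.Writer``, ``filter_rows`` and the sample data are hypothetical, and the limit is assumed to count data rows rather than the header.

```python
import csv
import io
import itertools
import json


def filter_rows(infile, outfile, max_lines=100000):
    """Write rows whose 'Msg type' is '5' as newline-delimited JSON."""
    reader = csv.DictReader(infile)                   # --reader csv.DictReader
    for row in itertools.islice(reader, max_lines):   # --lines 100000
        if row['Msg type'] == '5':                    # the expression, with --true
            outfile.write(json.dumps(row) + '\n')     # stand-in for the writer


# Hypothetical in-memory input in place of ${INFILE} and ${OUTFILE}.
src = io.StringIO("Msg type,mmsi\n5,111\n1,222\n5,333\n")
dst = io.StringIO()
filter_rows(src, dst)
print(dst.getvalue(), end='')
```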


Installing
==========

Via pip:

.. code-block:: console

    $ pip install git+https://github.com/geowurster/pyin.git

From a clone of the master branch:

.. code-block:: console

    $ git clone https://github.com/geowurster/pyin
    $ cd pyin
    $ python setup.py install


Gotchas
=======

The result of the expression replaces the line, so it is easy to overwrite
the line content completely, newline included:

.. code-block:: console

    $ pyin -i sample-data/csv-with-header.csv "'operation'"
    operationoperationoperationoperationoperationoperation

Forgetting ``-t``, which writes only the lines for which the expression
evaluates as ``True``, prints the result of each expression instead:

.. code-block:: console

    $ pyin -i LICENSE.txt "'are' in line"
    FalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse

    $ pyin -i LICENSE.txt "'are' in line" -t
    modification, are permitted provided that the following conditions are met:
    derived from this software without specific prior written permission.
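
Both gotchas follow from evaluating the expression once per line and writing whatever it returns. The sketch below imitates that behavior in plain Python. It is an assumption about the general model and not pyin's actual code, and ``process`` is a hypothetical name.

```python
def process(lines, expression, only_true=False):
    """Evaluate ``expression`` against each line and collect the output."""
    out = []
    for line in lines:
        result = eval(expression, {}, {'line': line})
        if only_true:
            # Like -t: keep the original line when the result is truthy.
            if result:
                out.append(line)
        else:
            # Otherwise the result itself is written, with no newline added.
            out.append(str(result))
    return ''.join(out)


text = ["first line\n", "these are words\n"]
print(process(text, "'are' in line"))
print(process(text, "'are' in line", only_true=True), end='')
```

Without ``only_true`` the two results run together as ``FalseTrue``; with it, only the second line is written, unchanged.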

Values given as ``--reader-option key=val`` are parsed to their Python type.
Settings that cannot be expressed that way, such as which JSON library a
``newlinejson.Reader()`` instance should use, must be set with the
``--statement`` option:

.. code-block:: console

    $ pyin -i ${INFILE} -o ${OUTFILE} \
        --true \
        --import newlinejson \
        --import ujson \
        --reader newlinejson.Reader \
        --writer newlinejson.Writer \
        --statement "newlinejson.JSON = ujson" \
        "'type' in line and line['type'] == 5"
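
To see why a module cannot be passed as a ``key=val`` option, consider one plausible way to parse such values into Python types. This sketch uses ``ast.literal_eval`` and falls back to the raw string. It is an illustration under that assumption, not pyin's parser, and ``parse_option`` is a hypothetical helper.

```python
import ast


def parse_option(text):
    """Split 'key=val' and convert val to a Python literal when possible."""
    key, _, raw = text.partition('=')
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a literal (for example a bare name), so keep the string.
        value = raw
    return key, value


print(parse_option('skipinitialspace=True'))
print(parse_option('quoting=3'))
print(parse_option('json_lib=ujson'))
```

Under this scheme literals such as ``True`` or ``3`` are converted, but a bare name like ``ujson`` stays a string and never becomes the imported module, which is the kind of case a statement has to handle.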


Developing
==========

Install, then run the tests and style checks:

.. code-block:: console

    $ git clone https://github.com/geowurster/pyin
    $ cd pyin
    $ virtualenv venv
    $ source venv/bin/activate
    $ pip install -r requirements-dev.txt
    $ pip install -e .
    $ nosetests --with-coverage
    $ pep8 --max-line-length=120 pyin.py
