python-dwca-reader

A simple Python class to read Darwin Core Archive (DwC-A) files.

Status

It is currently considered alpha quality. It has helped its author a couple of times, but it should be improved and tested before widespread use.

Major limitations

Early support for DwC-A extensions.
It sometimes assumes the archive has been produced by GBIF’s IPT. For example, only zip compression is currently supported, even though the Darwin Core Archive format allows other compression formats.
No write support.
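
Because only zip compression is supported, it can be useful to check a file before handing it to the reader. The following is a minimal sketch that relies only on the standard library's `zipfile.is_zipfile`; the helper name `looks_like_zip_archive` is hypothetical and not part of python-dwca-reader. It is written for Python 3.

```python
import io
import zipfile


def looks_like_zip_archive(path_or_file):
    """Return True if the given path or file object is a zip file.

    Hypothetical helper: it only checks the container format,
    not whether the content is a valid Darwin Core Archive.
    """
    return zipfile.is_zipfile(path_or_file)


# Build a tiny zip archive in memory so the example needs no file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("meta.xml", "<archive/>")
buffer.seek(0)

zip_ok = looks_like_zip_archive(buffer)
not_zip = looks_like_zip_archive(io.BytesIO(b"plain text, not an archive"))
```

With a real file, the same helper can be called with a path such as 'my_archive.zip' before opening it with DwCAReader.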

Tutorial

Installation

Quite simply:

$ pip install python-dwca-reader

Example use

Basic use, access to metadata and “Core lines”

from dwca import DwCAReader
from dwca.darwincore.utils import qualname as qn

# Let's open our archive...
# Using the with statement ensures that resources are properly freed/cleaned after use.
with DwCAReader('my_archive.zip') as dwca:
    # We can now interact with the 'dwca' object

    # We can read scientific metadata (EML) through a BeautifulStoneSoup object in the 'metadata' attribute
    # BeautifulStoneSoup is provided by BS3: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
    print dwca.metadata.prettify()

    # We can inspect the archive to discover its Core type (Occurrence, Taxon, ...):
    print "Core type is: %s" % dwca.core_rowtype
    # => Core type is: http://rs.tdwg.org/dwc/terms/Occurrence

    # Check if a Darwin Core term is present in the core file
    if dwca.core_contains_term('http://rs.tdwg.org/dwc/terms/locality'):
        print "This archive contains the 'locality' term in its core file."
    else:
        print "Locality term is not present."

    # Using full qualnames for Darwin Core terms (such as 'http://rs.tdwg.org/dwc/terms/country') is verbose...
    # The qualname() helper function makes life easier for common terms.
    # (here, it has been imported as 'qn'):
    qn('locality')
    # => u'http://rs.tdwg.org/dwc/terms/locality'

    # Combined with the previous examples, this can be used to make things clearer.
    # For example:
    if dwca.core_contains_term(qn('locality')):
        pass

    # Or:
    if dwca.core_rowtype == qn('Occurrence'):
        pass

    # Finally, let's iterate over the archive lines and get the data:
    for line in dwca.each_line():
        # line is an instance of DwCALine

        # Print can be used for debugging purposes...
        print line

        # => --
        # => Rowtype: http://rs.tdwg.org/dwc/terms/Occurrence
        # => Source: Core file
        # => Line ID:
        # => Data: {u'http://rs.tdwg.org/dwc/terms/basisOfRecord': u'Observation',
        # =>        u'http://rs.tdwg.org/dwc/terms/family': u'Tetraodontidae',
        # =>        u'http://rs.tdwg.org/dwc/terms/locality': u'Borneo',
        # =>        u'http://rs.tdwg.org/dwc/terms/scientificName': u'tetraodon fluviatilis'}
        # => --

        # You can get the value of a specific Darwin Core term through
        # the "data" dict:
        print "Locality for this line is: %s" % line.data[qn('locality')]
        # => Locality for this line is: Borneo

    # Alternatively, we can get a list of core lines instead of using each_line():
    lines = dwca.lines

    # Or retrieve a specific line by its id:
    occurrence_number_three = dwca.get_line(3)
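
To make the role of the qn() helper more concrete, here is a minimal sketch of how a short term name could be expanded into a full URI and then used as a key in a line's data dict. This is not the library's implementation: `expand_term` and `DWC_TERMS_NAMESPACE` are hypothetical names, and the sketch assumes every term lives in the `http://rs.tdwg.org/dwc/terms/` namespace seen in the examples above. It is written for Python 3 and does not need the library installed.

```python
# Assumption: all terms share the namespace shown in the examples above.
DWC_TERMS_NAMESPACE = "http://rs.tdwg.org/dwc/terms/"


def expand_term(short_name):
    """Hypothetical stand-in for qualname(): prefix a short name with the namespace."""
    if short_name.startswith(("http://", "https://")):
        return short_name  # already a full URI, leave it untouched
    return DWC_TERMS_NAMESPACE + short_name


# A line's data is a plain dict keyed by full term URIs, as in the printed output above.
row_data = {
    expand_term("locality"): "Borneo",
    expand_term("family"): "Tetraodontidae",
}

locality = row_data[expand_term("locality")]
# dict.get() avoids a KeyError when a term is absent from the row.
country = row_data.get(expand_term("country"), "unknown")
```

Using `.get()` with a default is one way to handle rows where a term is missing; checking with core_contains_term() first, as shown earlier, is another.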

Using Darwin Core Archives with extensions (star schema)

from dwca import DwCAReader
from dwca.darwincore.utils import qualname as qn

with DwCAReader('archive_with_vernacularnames_extension.zip') as dwca:
    # Let's ask the archive what kind of extensions are in use:
    print dwca.extensions_rowtype
    # => [u'http://rs.gbif.org/terms/1.0/VernacularName']

    # For convenience
    core_lines = dwca.lines

    # a) Data access
    # Extension lines are accessible as a list of DwCALine instances in the 'extensions' attribute:
    for e in core_lines[0].extensions:
        # Display all extension lines that refer to the first core line
        print e

    # b) As we can now see, in a given archive a DwCALine can come from multiple sources...
    # So we can ask it where it comes from:
    print core_lines[0].from_core
    # => True
    print core_lines[0].extensions[0].from_extension
    # => True

    # ... and what its rowtype is:
    print core_lines[0].rowtype
    # => http://rs.tdwg.org/dwc/terms/Taxon
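
In a Darwin Core Archive, the meta.xml descriptor declares one core file and any number of extension files, each with its own rowType. The sketch below shows how those row types could be read with the standard library alone. It is an illustration rather than the way python-dwca-reader works internally: the function name `read_row_types` and the sample descriptor (including its file names) are made up for this example. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptors (meta.xml).
DWCA_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

# Made-up descriptor: a Taxon core with one VernacularName extension.
SAMPLE_META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxon.txt</location></files>
    <id index="0"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/VernacularName">
    <files><location>vernacularname.txt</location></files>
    <coreid index="0"/>
  </extension>
</archive>
"""


def read_row_types(meta_xml_text):
    """Return (core row type, list of extension row types) from a descriptor."""
    root = ET.fromstring(meta_xml_text)
    core = root.find(DWCA_TEXT_NS + "core")
    extensions = root.findall(DWCA_TEXT_NS + "extension")
    return core.get("rowType"), [ext.get("rowType") for ext in extensions]


core_row_type, extension_row_types = read_row_types(SAMPLE_META_XML)
```

The two returned values correspond to the kind of information exposed by core_rowtype and extensions_rowtype in the example above.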

Another example with multiple extensions (no new API here):

from dwca import DwCAReader
from dwca.darwincore.utils import qualname as qn

with DwCAReader('multiext_archive.zip') as dwca:
    lines = dwca.lines
    ostrich = lines[0]

    print "You'll find below all extension lines referring to Ostrich"
    print "There should be 3 vernacular names and 2 taxon descriptions"
    for ext in ostrich.extensions:
        print ext

    print "We can then simply filter by type..."
    for ext in ostrich.extensions:
        if ext.rowtype == 'http://rs.gbif.org/terms/1.0/VernacularName':
            print ext

    print "We can also use list comprehensions for this:"
    description_ext = [e for e in ostrich.extensions if
                       e.rowtype == 'http://rs.gbif.org/terms/1.0/Description']

    for ext in description_ext:
        print ext
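
When an archive uses several extensions, grouping the extension lines by row type once can be more convenient than filtering the list repeatedly. The sketch below does this with `collections.defaultdict`. So that it runs without the library or an archive, it uses `FakeLine`, a hypothetical stand-in that exposes only the rowtype and data attributes used in the examples above; the sample values and the `name` and `text` keys are invented. It is written for Python 3.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a DwCALine: only 'rowtype' and 'data' are modelled.
FakeLine = namedtuple("FakeLine", ["rowtype", "data"])

VERNACULAR = "http://rs.gbif.org/terms/1.0/VernacularName"
DESCRIPTION = "http://rs.gbif.org/terms/1.0/Description"

# Invented sample data standing in for ostrich.extensions.
extensions = [
    FakeLine(VERNACULAR, {"name": "Ostrich"}),
    FakeLine(DESCRIPTION, {"text": "Large flightless bird"}),
    FakeLine(VERNACULAR, {"name": "Autruche"}),
]


def group_by_rowtype(lines):
    """Return a dict mapping each row type to the list of lines of that type."""
    grouped = defaultdict(list)
    for line in lines:
        grouped[line.rowtype].append(line)
    return grouped


by_type = group_by_rowtype(extensions)
vernacular_count = len(by_type[VERNACULAR])
description_count = len(by_type[DESCRIPTION])
```

With a real archive, the same function could be called on ostrich.extensions, since it only reads the rowtype attribute of each line.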

Run the test suite

$ pip install nose
$ nosetests

Test coverage can easily be obtained after installing coverage.py:

$ nosetests --with-coverage --cover-erase --cover-package=dwca
.....................
Name                    Stmts   Miss  Cover   Missing
-----------------------------------------------------
dwca                        0      0   100%
dwca.darwincore             0      0   100%
dwca.darwincore.terms       1      0   100%
dwca.darwincore.utils       3      0   100%
dwca.dwca                 130     16    88%   23-45
dwca.utils                  5      1    80%   12
-----------------------------------------------------
TOTAL                     139     17    88%
----------------------------------------------------------------------
Ran 21 tests in 0.830s

OK

This version: 0.1.0 (May 28, 2013)
