
Get book information (by ISBN or by search) from real bookstores.


# Web scraping to get book information

This library fetches book information. We can search by keywords or by ISBN, use an advanced search, and paginate the results.

We get the data from existing bookstore websites that we scrape (see the scraper directories in this repository, e.g. frFR/librairiedeparis).

We get the title and the authors, the price, the publisher(s), the cover, etc.

![CLI search example](cli-search.png)

## Import data from an ods or csv file

If your file has an “isbn” and a “quantity” column, it’s easy: we will find all the book information.

If it only has the title and the publisher, it’s doable but error prone. We can still do it, but you should do an inventory of your stock afterwards.

See the odsimport module. It returns JSON. It’s your responsibility to add what you want to your database (this is how it’s done in Abelujo).
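Below is a minimal sketch of how this could be wired up. The module path, the function name and the exact shape of the returned data are assumptions for illustration only; check the odsimport module itself for the real API.

```python
# Hypothetical sketch: the import path and extract() below are assumptions,
# see the odsimport module in this repository for the actual names.
from bookshops.utils import odsimport  # assumed location of the module

def save_card(card):
    # Stand-in for your own persistence layer
    # (Abelujo stores this in its own database).
    print(card.get("isbn"), card.get("quantity"))

cards = odsimport.extract("stock.csv")  # assumed entry point, returning JSON-like dicts
for card in cards:
    save_card(card)
```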

Usable, but work in progress.

### Accepted format and columns

We can read ods and csv files.

- a file with an “isbn” and a “quantity” column,

- a file with the columns “title”, “publisher”, “isbn” (optional in this case), “shelf”, “distributor” and “quantity”. There is no “price” column. “authors” is also optional (it can help to fetch the correct book).

If the file has no headers, use the “odsettings.py” configuration file (in particular, to set the csv delimiter, either “,” or “;”).
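For instance, a minimal file for the first format, with headers and “,” as the delimiter, could look like this (the quantity is arbitrary and the ISBN is the one used in the example further down):

```csv
isbn,quantity
9782918059363,2
```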

## Why not Amazon ?

Amazon kills the book industry and its employees. Moreover, we can link to a good and independent bookshop from within the application, and… we learn a lot doing this!

## Why not Google books ?

It has very little data.

## Why not the BNF (Bibliothèque Nationale de France) ?

Because, for bookshops, we need recent books and up-to-date information, and the BNF takes a few months to reference new titles. There aren’t a lot of tools around it either.

# Install

This lib isn’t on pypi yet.

It is usable, but not mature.

It’s used as a git submodule in Abelujo: https://gitlab.com/vindarel/abelujo

# Use

So, for now you need to clone this repo.

To try it out, go to a scraper directory:

    cd frFR/librairiedeparis/
    python librairiedeparisScraper.py 9782918059363

and you get the above screenshot.

## Advanced search

Work in progress.

You can search `ed:agone` to filter on a specific publisher.
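For example, with the scraper script from the section above (assuming it accepts a free-text query the same way it accepts an ISBN):

    python librairiedeparisScraper.py ed:agone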

## Pagination

We do pagination.

# Develop and test

See http://dev.abelujo.cc/webscraping.html

Development mode:

    pip install -e .

Now you can edit the project and run the development version like the lib is meant to be run, i.e. with the entry_points: livres, libros, etc.
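For example (assuming the entry points take the same arguments as the scraper scripts, i.e. an ISBN or search keywords):

    livres 9782918059363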

doc: https://python-packaging-user-guide.readthedocs.org/en/latest/distributing/#working-in-development-mode

# Bugs and shortcomings

This is web scraping, so it doesn’t come without pitfalls:

- the site can go down. It has happened already.

- the site can change, in which case we would have to update our scraper too. This can be caught early with automated and frequent tests (work ongoing); see the sketch below.
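A minimal sketch of such a test, relying only on the command-line usage shown in the “Use” section (the script name, the directory and the sample ISBN come from there; the rest is an assumption to adapt):

```python
# Sketch of a periodic smoke test against the live website.
# It only reuses the CLI invocation shown in the "Use" section.
import subprocess

def test_librairiedeparis_scraper_still_returns_results():
    output = subprocess.check_output(
        ["python", "librairiedeparisScraper.py", "9782918059363"],
        cwd="frFR/librairiedeparis",
    )
    # If the website changed its markup, the scraper will most likely print nothing.
    assert output.strip(), "no results: the site may have changed or be down"
```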

# Licence

LGPLv3
