ETL for parsing scientific papers.

Scholaretl

An Extract, Transform and Load (ETL) API made to parse scientific papers. This package is meant to be used with scholarag, our Retrieval Augmented Generation (RAG) tool. It is mainly used to parse scientific papers coming from different sources and make them compatible with the usual databases.

  1. Quickstart
  2. List of endpoints
  3. Docker image
  4. Grobid parsing
  5. Funding and Acknowledgement

Quickstart

Step 1 : Install the package.

Install the package from PyPI with pip.

pip install scholaretl

You can also clone the GitHub repo and install the package yourself.

Step 2 : Run the FastAPI app.

A simple script is installed with the package and lets you run the app locally. By default the API listens on port 8000.

scholaretl-api

See the -h flag for non-default arguments.

Step 3 : Test the app.

Now that the server is running, you can either query it with curl to get information:

curl http://localhost:8000/settings

Or open http://localhost:8000/docs in a browser and try some of the endpoints. For example, use the /parse/pypdf endpoint to parse a local PDF file. Parsing XML files works out of the box, but keep in mind that the XML parsing endpoints are meant for files coming from specific scientific journals (see List of endpoints).
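The curl check above can also be scripted. The following is a minimal Python sketch using only the standard library. The helper names endpoint_url and fetch_settings are hypothetical, and the code assumes the /settings response body is JSON, which is typical for a FastAPI app but is not stated in this README.

```python
import json
from urllib.request import urlopen


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def fetch_settings(base_url: str = "http://localhost:8000", timeout: float = 5.0):
    """GET /settings from a running server and decode the body as JSON (assumed format)."""
    with urlopen(endpoint_url(base_url, "/settings"), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# This line only builds the URL and needs no server.
# Call fetch_settings() once scholaretl-api is running.
print(endpoint_url("http://localhost:8000", "/settings"))
```

If the server was started on a different port, pass the matching base URL to fetch_settings.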

List of endpoints

Once the app is deployed, the following endpoints are available:

  • /parse/pubmed_xml: Parses XMLs coming from PubMed.
  • /parse/jats_xml: Parses XMLs coming from PMC.
  • /parse/tei_xml: Parses XMLs produced by Grobid.
  • /parse/xocs_xml: Parses XMLs coming from Scopus (Elsevier).
  • /parse/pypdf: Parses PDFs without keeping the structure of the document.
  • /parse/grobidpdf: Parses PDFs while keeping the structure of the document (requires Grobid, see Grobid parsing).
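When calling the API from a script, it can help to keep the endpoint list in one place. The sketch below is one possible way to do that in Python. The endpoint paths are copied from the list above, while the short source labels and the parse_url helper are hypothetical names chosen for illustration.

```python
# Endpoint paths come from the list above; the dictionary keys are
# illustrative labels, not names used by scholaretl itself.
PARSE_ENDPOINTS = {
    "pubmed": "/parse/pubmed_xml",
    "pmc": "/parse/jats_xml",
    "grobid_tei": "/parse/tei_xml",
    "scopus": "/parse/xocs_xml",
    "pdf_plain": "/parse/pypdf",
    "pdf_structured": "/parse/grobidpdf",  # needs a Grobid server
}


def parse_url(source: str, base_url: str = "http://localhost:8000") -> str:
    """Return the full URL of the parsing endpoint for a given source label."""
    try:
        path = PARSE_ENDPOINTS[source]
    except KeyError:
        known = ", ".join(sorted(PARSE_ENDPOINTS))
        raise ValueError(f"unknown source {source!r}; expected one of: {known}") from None
    return base_url.rstrip("/") + path


print(parse_url("pubmed"))
```

The request format each endpoint expects is not described here; the interactive page at /docs shows it for every route.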

Docker image

If a Docker container is required, it can be built using the provided Dockerfile. Make sure you have Docker installed.

docker build -t scholaretl:latest . --platform linux/amd64

The --platform linux/amd64 flag depends on the target deployment and should be changed accordingly. The scholaretl:latest tag can be customized at will. The image can then be tested by running a container locally:

docker run -d -p 8080:8080 scholaretl:latest

The API will accept requests on port 8080, i.e. you can access the UI at http://localhost:8080/docs.
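Because the container is started in detached mode, the API may need a moment before it answers. The sketch below shows one way to poll the docs page from Python until it responds. The helper names http_ok and wait_until are hypothetical, and the timeout and interval values are arbitrary defaults, not values recommended by the project.

```python
import time
from http.client import HTTPException
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200, False on any error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, HTTPException):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


def wait_until(probe, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example call, to be run after `docker run`:
# wait_until(lambda: http_ok("http://localhost:8080/docs"))
```

Polling over HTTP rather than checking that the port is open avoids a false positive, since Docker can accept connections on a published port before the app inside the container is ready.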

Grobid parsing

The Grobid endpoint requires a running Grobid server. To deploy one, simply run:

docker run -p 8070:8070 -d lfoppiano/grobid:0.7.3

Then pass the server's URL to the script in a .env file:

echo SCHOLARETL__GROBID__URL=http://localhost:8070 > .env
scholaretl-api

You can also add the server's URL to the .env file manually. See the env.example file for more information.

If using Docker, pass the server's URL as an environment variable:

docker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest
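Before calling the Grobid endpoint, it can be useful to confirm that the configured server is reachable. The sketch below derives a health-check URL from the same SCHOLARETL__GROBID__URL variable used above. The helper name grobid_health_url is hypothetical; /api/isalive is Grobid's own liveness route, not a scholaretl endpoint.

```python
import os

# Variable name taken from the commands above.
GROBID_URL_VAR = "SCHOLARETL__GROBID__URL"


def grobid_health_url(env=None) -> str:
    """Build the Grobid liveness URL from a mapping of environment variables."""
    env = os.environ if env is None else env
    base_url = env.get(GROBID_URL_VAR)
    if not base_url:
        raise KeyError(f"{GROBID_URL_VAR} is not set")
    return base_url.rstrip("/") + "/api/isalive"


# Uses an explicit mapping so the example runs without touching the real environment.
print(grobid_health_url({GROBID_URL_VAR: "http://localhost:8070"}))
```

Opening the printed URL in a browser or with curl should return a short response once the Grobid container has started. Note that host.docker.internal only resolves from inside a container, so run the check from the host against http://localhost:8070.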

Funding and Acknowledgement

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

Copyright (c) 2024 Blue Brain Project/EPFL
