ETL for parsing scientific papers.

Scholaretl

An Extract, Transform and Load (ETL) API made to parse scientific papers. This package is meant to be used with scholarag, our Retrieval Augmented Generation (RAG) tool. It is mainly used to parse scientific papers coming from different sources and make them compatible with common databases.

  1. Quickstart
  2. List of endpoints
  3. Docker image
  4. Grobid parsing
  5. Funding and Acknowledgement

Quickstart

Step 1: Install the package.

Install the package from PyPI:

pip install scholaretl

You can also clone the GitHub repo and install the package yourself.

Step 2: Run the FastAPI app.

A simple script is installed with the package and lets you run the app locally. By default, the API listens on port 8000.

scholaretl-api

See the -h flag for non-default arguments.

Step 3: Test the app.

Now that the server is running, you can query it with curl to get information:

curl http://localhost:8000/settings
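The same check can be scripted. The snippet below is a minimal sketch using only the Python standard library. The helper names settings_url and fetch_settings are made up for this example and are not part of scholaretl. It returns the raw response body as text, since the shape of the settings payload is not described here.

```python
import urllib.request
from urllib.parse import urljoin


def settings_url(base_url: str = "http://localhost:8000") -> str:
    """Build the URL of the /settings endpoint for a given server root."""
    # Normalise the trailing slash so urljoin appends instead of replacing.
    return urljoin(base_url.rstrip("/") + "/", "settings")


def fetch_settings(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> str:
    """Return the raw body of GET /settings as text (hypothetical helper)."""
    with urllib.request.urlopen(settings_url(base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, assuming the server from Step 2 is running on port 8000:
# print(fetch_settings())
```

If the API runs on another port (see the -h flag), pass the matching base URL, for example fetch_settings("http://localhost:8080").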

Alternatively, open http://localhost:8000/docs in a browser and try some of the endpoints. For example, use the /parse/pypdf endpoint to parse a local PDF file. Parsing XML files works out of the box, but keep in mind that the XML parsing endpoints are meant to be used with files coming from specific scientific journals (see List of endpoints).

List of endpoints

Once the app is deployed, the following endpoints are available:

  • /parse/pubmed_xml: Parses XMLs coming from PubMed.
  • /parse/jats_xml: Parses XMLs coming from PMC.
  • /parse/tei_xml: Parses XMLs produced by Grobid.
  • /parse/xocs_xml: Parses XMLs coming from Scopus (Elsevier).
  • /parse/pypdf: Parses PDFs without keeping the structure of the document.
  • /parse/grobidpdf: Parses PDFs keeping the structure of the document (requires Grobid, see Grobid parsing).

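When calling the API from a script, it can help to keep the endpoint choice in one place. The sketch below maps a document source to the matching path from the list above. The dictionary keys ("pubmed", "pmc", and so on) and the parse_url helper are labels invented for this example, not names defined by scholaretl. The request body each endpoint expects is not covered here; check the /docs page for the exact schema.

```python
# Paths are copied from the endpoint list; the keys are hypothetical labels.
PARSE_ENDPOINTS = {
    "pubmed": "/parse/pubmed_xml",       # XMLs coming from PubMed
    "pmc": "/parse/jats_xml",            # XMLs coming from PMC
    "grobid_tei": "/parse/tei_xml",      # XMLs produced by Grobid
    "scopus": "/parse/xocs_xml",         # XMLs coming from Scopus (Elsevier)
    "pdf": "/parse/pypdf",               # PDFs, document structure not kept
    "pdf_structured": "/parse/grobidpdf",  # PDFs, structure kept (needs Grobid)
}


def parse_url(source: str, base_url: str = "http://localhost:8000") -> str:
    """Return the full URL of the parsing endpoint for a given source label."""
    try:
        path = PARSE_ENDPOINTS[source]
    except KeyError:
        known = ", ".join(sorted(PARSE_ENDPOINTS))
        raise ValueError(f"Unknown source {source!r}; expected one of: {known}") from None
    return base_url.rstrip("/") + path
```

For example, parse_url("pmc") gives the JATS endpoint on the default local server, and an unknown label raises a ValueError listing the supported ones.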
Docker image

If a Docker container is required, it can be built using the provided Dockerfile. Make sure you have Docker installed.

docker build -t scholaretl:latest . --platform linux/amd64

The --platform linux/amd64 flag depends on the target deployment and should be changed accordingly. The scholaretl:latest tag can be customized at will. The image can then be tested by running a container locally:

docker run -d -p 8080:8080 scholaretl:latest

The API will accept requests on port 8080, i.e. you can access the UI at http://localhost:8080/docs.

Grobid parsing

Parsing documents with the Grobid endpoint requires a running Grobid server. To deploy one, run:

docker run -p 8070:8070 -d lfoppiano/grobid:0.7.3
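The Grobid container can take a moment to start, so it may be worth confirming that it answers before pointing the API at it. The sketch below calls Grobid's /api/isalive health endpoint using only the Python standard library. The helper names isalive_url and grobid_is_alive are made up for this example, and the default URL assumes the port mapping from the command above.

```python
import urllib.request


def isalive_url(grobid_url: str = "http://localhost:8070") -> str:
    """Build the URL of Grobid's health-check endpoint."""
    return grobid_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(grobid_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the Grobid server answers its health check with 'true'."""
    try:
        with urllib.request.urlopen(isalive_url(grobid_url), timeout=timeout) as response:
            body = response.read().strip().lower()
            return response.status == 200 and body == b"true"
    except OSError:
        # Covers connection errors and timeouts (urllib.error.URLError is an OSError).
        return False


# Usage, assuming the container started above is running:
# print(grobid_is_alive())
```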

Then pass the server's URL to the script in a .env file:

echo SCHOLARETL__GROBID__URL=http://localhost:8070 > .env
scholaretl-api

You can also add the server's URL to the .env file manually. See the env.example file for more information.
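To double-check what a .env file contains before starting the API, a few lines of Python are enough. This is only an illustrative reader for simple KEY=VALUE lines; it is not how scholaretl loads its settings, and parse_env_text is a name invented for this example.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Usage, assuming a .env file exists in the current directory:
# with open(".env", encoding="utf-8") as handle:
#     print(parse_env_text(handle.read()).get("SCHOLARETL__GROBID__URL"))
```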

If using Docker, pass the server's URL as an environment variable:

docker run -p 8080:8080 -d -e SCHOLARETL__GROBID__URL=http://host.docker.internal:8070 scholaretl:latest

Funding and Acknowledgement

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

Copyright (c) 2024 Blue Brain Project/EPFL
