Ingestion service queue runner between the Plone REST API and ElasticSearch 8+ or OpenSearch 2+. It provides Celery tasks to asynchronously index Plone content.
Features:
- auto-create Open-/ElasticSearch
  - index,
  - mapping from the Plone schema, using a flexible conversions file (JSON),
  - ingest-attachment pipelines, using the same file as above;
- tasks to
  - index a content object with all given data, plus allowedRolesAndUsers and section (primary path),
  - unindex a content object;
- configuration from environment variables for
  - Celery,
  - ElasticSearch or OpenSearch,
  - Sentry logging (optional).
Installation
We recommend using a Python virtual environment. Create one with `python3 -m venv venv` and activate it in the current terminal session with `source venv/bin/activate`.
Install collective.elastic.ingest ready to use with Redis and OpenSearch:

```bash
pip install collective.elastic.ingest[redis,opensearch]
```
Depending on the queue server and index server used, the extra requirements vary:
- index server: `opensearch` or `elasticsearch`,
- queue server: `redis` or `rabbitmq`.
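For example, to use RabbitMQ as the queue server with ElasticSearch as the index server, install with the matching extras (names as listed above):

```bash
pip install collective.elastic.ingest[rabbitmq,elasticsearch]
```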
Configuration
Configuration is done via environment variables and JSON files.
Environment variables are:
- INDEX_SERVER
The URL of the ElasticSearch or OpenSearch server.
Default: localhost:9200
- INDEX_USE_SSL
Whether to use a secure connection or not.
Default: 0
- INDEX_OPENSEARCH
Whether to use OpenSearch (1) or ElasticSearch (0).
Default: 1
- INDEX_LOGIN
Username for the ElasticSearch 8+ or OpenSearch server.
Default: admin
- INDEX_PASSWORD
Password for the ElasticSearch 8+ or OpenSearch server.
Default: admin
- CELERY_BROKER
The broker URL for Celery. See docs.celeryq.dev for details.
Default: redis://localhost:6379/0
- PLONE_SERVICE
Base URL of the Plone server.
Default: http://localhost:8080
- PLONE_PATH
Path of the Plone site to be indexed at the Plone server.
Default: Plone
- PLONE_USER
Username for the Plone server; needs to have at least the Site Administrator role.
Default: admin
- PLONE_PASSWORD
Password for the Plone Server.
Default: admin
- MAPPINGS_FILE
Absolute path to the mappings configuration file. Configures field mappings from the Plone schema to ElasticSearch or OpenSearch.
No default, must be given.
- PREPROCESSINGS_FILE
Configures preprocessing of field values before indexing.
Default: the defaults file shipped with this package.
- ANALYSIS_FILE
(optional) Absolute path to the analysis configuration file.
- SENTRY_DSN
(optional) Sentry DSN for error reporting.
Default: disabled
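As an illustration, a minimal local configuration could be loaded into the shell like this (the values mirror the defaults above; MAPPINGS_FILE has no default and here points to the example mapping shipped with this repository):

```bash
# Illustrative minimal configuration for a local setup.
export INDEX_SERVER=localhost:9200
export INDEX_USE_SSL=0
export INDEX_OPENSEARCH=1
export INDEX_LOGIN=admin
export INDEX_PASSWORD=admin
export CELERY_BROKER=redis://localhost:6379/0
export PLONE_SERVICE=http://localhost:8080
export PLONE_PATH=Plone
export PLONE_USER=admin
export PLONE_PASSWORD=admin
# MAPPINGS_FILE has no default and must point to an existing file:
export MAPPINGS_FILE=$(pwd)/examples/mappings-basic.json
```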
Starting
Run the Celery worker:

```bash
celery -A collective.elastic.ingest.celery.app worker -c 1 -l info
```

Or with debug information:

```bash
celery -A collective.elastic.ingest.celery.app worker -c 1 -l debug
```
The number given with `-c` is the concurrency of the worker. For production use, it should be set to the number of Plone backends available to serve the indexing load.
OCI Image
For use in Docker, Podman, Kubernetes, …, an OCI image is provided at …
The environment variables above are used as configuration.
Additionally, the following environment variables are used:
- CELERY_CONCURENCY
The number of concurrent tasks to run.
Default: 1
- CELERY_LOGLEVEL
The log level for Celery.
Default: info
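A minimal sketch of running the image with Docker follows; the image reference is a hypothetical placeholder (the registry URL is elided above) and the values are examples only:

```bash
# Sketch only: replace the image reference with the published one.
docker run --rm \
  -e INDEX_SERVER=opensearch:9200 \
  -e INDEX_USE_SSL=1 \
  -e INDEX_OPENSEARCH=1 \
  -e CELERY_BROKER=redis://redis:6379/0 \
  -e PLONE_SERVICE=http://plone:8080 \
  -e PLONE_PATH=Plone \
  -e MAPPINGS_FILE=/configuration/mappings.json \
  -e CELERY_CONCURENCY=2 \
  -e CELERY_LOGLEVEL=info \
  -v "$(pwd)/mappings.json:/configuration/mappings.json" \
  ingestion-image:latest  # hypothetical image name
```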
Examples
Example configuration files are provided in the /examples directory.
OpenSearch with Docker Compose
A Docker Compose file docker-compose.yml and a Dockerfile to start an OpenSearch server are provided.
Preconditions:
- Docker and docker-compose are installed.
- The maximum virtual memory map count needs to be increased to run this: `sudo sysctl -w vm.max_map_count=262144` (not permanent, see the StackOverflow post on the topic).
Steps to start the example OpenSearch server with the ingest-attachment plugin installed:
- enter the directory: `cd examples`
- build the Docker image:

```bash
docker buildx use default
docker buildx build --tag opensearch-ingest-attachment:latest --file Dockerfile .
```

- start the server: `docker-compose up`
Now you have an OpenSearch server running on http://localhost:9200 and an OpenSearch Dashboard running on http://localhost:5601 (user/pass: admin/admin). The OpenSearch server has the ingest-attachment plugin installed. The plugin enables OpenSearch to extract text from binary files like PDFs.
Open another terminal.
A .env file is provided with the environment variables ready to use with the Docker Compose file. Run `source examples/.env` to load the environment variables. Then start the Celery worker with `celery -A collective.elastic.ingest.celery.app worker -l debug`.
In another terminal window, run a Plone backend at http://localhost:8080/Plone with the add-on collective.elastic.plone installed. There, create an item or modify an existing one. You should see the indexing task in the Celery worker's terminal window.
Basic Mappings
A very basic mappings file mappings-basic.json is provided. To use it, set `MAPPINGS_FILE=examples/mappings-basic.json` and then start the Celery worker, as sketched below.
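A minimal sketch of these two steps, run from the repository root:

```bash
# Point the worker at the provided basic mapping, then start it.
export MAPPINGS_FILE=$(pwd)/examples/mappings-basic.json
celery -A collective.elastic.ingest.celery.app worker -c 1 -l info
```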
Complex Mapping With German Text Analysis
A complex mappings file with German text analysis configured, mappings-german-analysis.json, is provided. It comes together with the matching analysis configuration file analysis-german.json and a stub lexicon file elasticsearch-lexicon-german.txt. Read the next section for more information about text analysis.
Text Analysis
Text analysis is optional and an advanced topic; skip this on a first installation.
Search results can be enhanced with a tailored text analysis. The simple fuzzy search, which can be used without any analysis configuration, has its limits. This is even more true for complex languages like German.
You can find detailed information about text analysis in the ElasticSearch documentation. We provide an example analysis configuration for a better search of German compound words.
Example: a document containing the string ‘Lehrstellenbörse’ can be found by querying ‘Lehrstelle’. With a decompounder using the word list ‘Lehrstelle, Börse’ and an additional stemmer, it shall also be found by querying ‘Börse’. The example analyzer configuration applies a stemmer, which can handle inflections of words. This is an important enhancement for better search results.
The analysis configuration is a configuration of analyzers. The example provided here uses two of them: german_analyzer and german_exact.
The first one decompounds words according to the word list in lexicon.txt and adds a stemmer. The second one additionally allows exact queries with a quoted search string.
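Such a lexicon is a plain word list with one entry per line; a stub matching the example above might contain (illustrative content, not the shipped file):

```
Lehrstelle
Börse
```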
These two analyzers are to be applied to fields. You can apply them in your mapping.
Example:

```json
"behaviors/plone.basic/title": {
    "type": "text",
    "analyzer": "german_analyzer",
    "fields": {
        "exact": {
            "type": "text",
            "analyzer": "german_exact_analyzer"
        }
    }
},
```
Check your configured analysis with:

```
POST {{elasticsearchserver}}/_analyze
{
    "text": "Lehrstellenbörse",
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "custom_dictionary_decompounder",
        "light_german_stemmer",
        "unique"
    ]
}
```
The response delivers the tokens for the analyzed text ‘Lehrstellenbörse’.
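The same check can be issued with curl, for example against the server from the Docker Compose example above (a sketch; adjust the URL and credentials to your setup, and note the custom filters are only known to the server once the analysis configuration is installed):

```bash
curl -X POST "http://localhost:9200/_analyze" \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Lehrstellenbörse",
    "tokenizer": "standard",
    "filter": ["lowercase", "custom_dictionary_decompounder", "light_german_stemmer", "unique"]
  }'
```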
Note: The file elasticsearch-lexicon.txt, containing the word list used by the decompounder of the sample analysis configuration in analysis.json.example, has to be located in the configuration directory of your ElasticSearch server.
Source Code
The sources are in a Git repository with its main branches at GitHub, where you can also report issues.
We’d be happy to see many forks and pull-requests to make this addon even better.
Maintainers are Jens Klein, Katja Suess, and the BlueDynamics Alliance developer team. We appreciate any contribution; if a release needs to be published on PyPI, please contact one of us. We also offer commercial support for training, coaching, integration, or adaptations.
Installation for development
- clone the source code repository,
- enter the repository directory,
- recommended: create a virtual environment: `python3 -m venv env`,
- install for development: `./env/bin/pip install -e .[test,redis,opensearch]`,
- load the environment configuration: `source examples/.env`.
License
The project is licensed under the GPLv2.