A CLI tool for S3 data synchronizations.
Solgate
Yet another data sync pipelines job runner.
A CLI utility designed to be automated via container-native workflow engines like Argo or Tekton.
Installation
pip install solgate
Configuration
Solgate relies on a configuration file that holds all the information required to perform the synchronization. This config file is expected to be in YAML format and should contain the following keys:
- source key. The value of this key specifies where the data is sourced from.
- destinations key. Its value is expected to be an array of locations. Their purpose is to define sync destinations.
- other top level keys for general configuration that is not specific to a single location.
General config section
All configuration in this section is optional. Use this section if you'd like to modify the default behavior. Default values are denoted below:
alerts_smtp_server: smtp.corp.redhat.com
alerts_from: solgate-alerts@redhat.com
alerts_to: dev-null@redhat.com
timedelta: 1d
Description:
- alerts_smtp_server, alerts_from, alerts_to are used for email alerting only.
- timedelta defines a time window in which the objects in the source bucket must have been modified to be eligible for the bucket listing. Only files modified within the last timedelta (counted back from now) are included.
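The time window check can be sketched in a few lines of Python. This is only an illustration, not solgate's actual implementation: the helper names parse_window and is_eligible are hypothetical, and the parser assumes only the d (days) and h (hours) suffixes that appear in this document's examples.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: only the "d" and "h" suffixes seen in this README are handled.
_UNITS = {"d": "days", "h": "hours"}


def parse_window(value: str) -> timedelta:
    """Convert a value such as '1d' or '5h' into a timedelta (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([dh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported timedelta value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_eligible(last_modified: datetime, window: timedelta, now: datetime) -> bool:
    """Return True if the object was modified within the window counted back from now."""
    return last_modified >= now - window


now = datetime(2021, 1, 2, 12, 0, tzinfo=timezone.utc)
window = parse_window("1d")
fresh = is_eligible(datetime(2021, 1, 2, 0, 0, tzinfo=timezone.utc), window, now)
stale = is_eligible(datetime(2020, 12, 31, 0, 0, tzinfo=timezone.utc), window, now)
```

With the default of 1d, an object modified twelve hours ago would be listed, while one modified two days ago would be skipped.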
Source key
source:
  aws_access_key_id: KEY_ID
  aws_secret_access_key: SECRET
  base_path: DH-PLAYPEN/storage/input # at least the bucket name is required, sub path within this bucket is optional
  endpoint_url: https://s3.amazonaws.com # optional, defaults to s3.amazonaws.com
  formatter: "{date}/{collection}.{ext}" # optional, defaults to None
If the formatter is not set, no repartitioning is expected to happen and the S3 object key is left intact, same as it is in the source bucket (within the base_path context). Specifying the formatter in the source section alone doesn't result in repartitioning of all objects; only those destinations that also have this option specified are eligible for object key modifications.
Destinations key
destinations:
  - aws_access_key_id: KEY_ID
    aws_secret_access_key: SECRET
    base_path: DH-PLAYPEN/storage/output # at least the bucket name is required, sub path within this bucket is optional
    endpoint_url: https://s3.upshift.redhat.com # optional, defaults to s3.upshift.redhat.com
    formatter: "{date}/{collection}.{ext}" # optional, defaults to None
    unpack: yes # optional, defaults to False/no
The endpoint_url defaults to a different value for destinations than for the source section. This reflects the usual data origin and the usual safe destination host.
If the formatter is not set, no repartitioning is expected to happen and the S3 object key is left intact, same as it is in the source bucket (within the base_path context). If repartitioning is desired, the formatter string must be defined in the source section as well, otherwise the object name can't be parsed properly from the source S3 object key.
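To illustrate why both formatter strings are needed, here is a minimal Python sketch of repartitioning: the source formatter is used to parse the object key into named parts, and the destination formatter is used to build the new key. The function names and the regex-based parsing are assumptions made for this example and may differ from how solgate does it internally.

```python
import re


def formatter_to_regex(formatter: str) -> "re.Pattern[str]":
    """Turn '{date}/{collection}.{ext}' into a regex with named groups (hypothetical helper)."""
    pattern = ""
    # re.split with a capturing group keeps the placeholder names at odd indexes.
    for i, part in enumerate(re.split(r"\{(\w+)\}", formatter)):
        if i % 2:
            pattern += f"(?P<{part}>[^/]+?)"  # placeholder: any run of non-slash characters
        else:
            pattern += re.escape(part)  # literal text between placeholders
    return re.compile(f"^{pattern}$")


def repartition(key: str, source_formatter: str, destination_formatter: str) -> str:
    """Parse key with the source formatter and rebuild it with the destination one."""
    match = formatter_to_regex(source_formatter).match(key)
    if match is None:
        raise ValueError(f"key {key!r} does not match {source_formatter!r}")
    return destination_formatter.format(**match.groupdict())


SOURCE = "{date}/{collection}.{ext}"
historic = repartition(
    "2020-01-01/users.csv", SOURCE, "{collection}/historic/{date}-{collection}.{ext}"
)
latest = repartition("2020-01-01/users.csv", SOURCE, "{collection}/latest/full_data.csv")
```

Without the source formatter there would be no way to tell which part of the key is the date and which is the collection, so the destination key could not be assembled.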
The unpack option specifies if gzipped archives should be unpacked during the transfer. The .gz suffix is automatically dropped from the resulting object key, no matter if repartitioning is on or off. Switching this option on results in weaker object validation, since the implicit metadata checksum and size checks can't be used to verify the file integrity.
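The effect of unpack on a single object can be sketched as follows. This is a simplified illustration that works on in-memory bytes with the standard gzip module; the function names are hypothetical and the real transfer most likely streams data rather than loading it whole.

```python
import gzip
from typing import Tuple


def unpacked_key(key: str) -> str:
    """Drop a trailing '.gz' from the object key, leave other keys untouched."""
    return key[: -len(".gz")] if key.endswith(".gz") else key


def unpack_object(key: str, body: bytes) -> Tuple[str, bytes]:
    """Return the destination key and the decompressed payload (hypothetical helper)."""
    return unpacked_key(key), gzip.decompress(body)


packed = gzip.compress(b"id,name\n1,alice\n")
new_key, new_body = unpack_object("2020-01-01/users.csv.gz", packed)
```

Because the decompressed payload differs in size and checksum from the stored source object, the source metadata can no longer be compared against the destination, which is why validation is weaker with this option on.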
Separate credentials into different files
If you don't want to inline aws_access_key_id and aws_secret_access_key in plaintext in the config file, you can separate these credentials into their own distinct files. If the credential keys are not found (inlined) in the config, solgate tries to locate them in the config folder (the same folder where the main config file is located).
The credentials file is expected to contain the following:
aws_access_key_id: KEY_ID
aws_secret_access_key: SECRET
For the source, the expected filename is source.creds.yaml; for destinations it is destination.X.creds.yaml, where X is the index in the destinations list in the main config file. For destinations we allow credentials sharing, therefore if destination.X.creds.yaml is not found, solgate tries to load destination.creds.yaml (not indexed).
Full example
Let's have this file structure in our /etc/solgate:
$ tree /etc/solgate
/etc/solgate
├── config.yaml
├── destination.0.creds.yaml
├── destination.creds.yaml
└── source.creds.yaml
And a main config file /etc/solgate/config.yaml looking like this:
source:
  base_path: DH-PLAYPEN/storage/input
destinations:
  - base_path: DH-PLAYPEN/storage/output0 # idx=0
  - base_path: DH-PLAYPEN/storage/output1 # idx=1
  - base_path: DH-PLAYPEN/storage/output2 # idx=2
    aws_access_key_id: KEY_ID
    aws_secret_access_key: SECRET
Solgate will use these credentials:
- For source the source.creds.yaml is read, because no credentials are inlined
- For destination idx=0 the destination.0.creds.yaml is used, because no credentials are inlined
- For destination idx=1 the destination.creds.yaml is used, because no credentials are inlined and there's no destination.1.creds.yaml file
- For destination idx=2 the inlined credentials are used
The resolution priority:
type | priority |
---|---|
source | inlined > source.creds.yaml |
destination | inlined > destination.INDEX.creds.yaml > destination.creds.yaml |
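The resolution priority in the table can be expressed as a short Python sketch. It only reports where the credentials would come from and does not parse any YAML. The function name credentials_origin and its arguments are hypothetical and are not part of the solgate API.

```python
from pathlib import Path
from typing import Optional

CREDENTIAL_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def credentials_origin(
    entry: dict, config_dir: Path, kind: str, index: Optional[int] = None
) -> str:
    """Return 'inlined' or the name of the credentials file that would be used.

    entry is one location from the main config (the source mapping or one
    destinations item), kind is 'source' or 'destination'.
    """
    if all(key in entry for key in CREDENTIAL_KEYS):
        return "inlined"
    if kind == "source":
        candidates = ["source.creds.yaml"]
    else:
        # Indexed file first, shared (not indexed) file as the fallback.
        candidates = [f"destination.{index}.creds.yaml", "destination.creds.yaml"]
    for name in candidates:
        if (config_dir / name).is_file():
            return name
    raise FileNotFoundError(f"no credentials found for {kind} in {config_dir}")
```

For the /etc/solgate layout shown above, calling credentials_origin({}, Path("/etc/solgate"), "destination", 1) would fall through to destination.creds.yaml, because destination.1.creds.yaml does not exist.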
Example config file
Here's a full configuration file example, all together.
alerts_smtp_server: smtp.corp.redhat.com
alerts_from: solgate-alerts@redhat.com
alerts_to: dev-null@redhat.com
timedelta: 1d
source:
  aws_access_key_id: KEY_ID
  aws_secret_access_key: SECRET
  endpoint_url: https://s3.upshift.redhat.com
  formatter: "{date}/{collection}.{ext}"
  base_path: DH-PLAYPEN/storage/input
destinations:
  - aws_access_key_id: KEY_ID
    aws_secret_access_key: SECRET
    endpoint_url: https://s3.upshift.redhat.com
    formatter: "{collection}/historic/{date}-{collection}.{ext}"
    base_path: DH-PLAYPEN/storage/output
  - aws_access_key_id: KEY_ID
    aws_secret_access_key: SECRET
    endpoint_url: https://s3.upshift.redhat.com
    formatter: "{collection}/latest/full_data.csv"
    base_path: DH-PLAYPEN/storage/output
    unpack: yes
Usage
Solgate is mainly intended for use in automation within Argo Workflows. However, it can also be used as a standalone CLI tool for manual transfers and (via extensions) for (TBD) manifest scaffold generation and (TBD) deployed instance monitoring.
List bucket for files ready to be transferred
Before the actual sync can be run, the source bucket has to be listed for files ready to be transferred:
solgate list
CLI option | Config file entry | Description |
---|---|---|
-o | | Output to a file instead of stdout. Creates a listing file. |
 | timedelta | Define a lookup restriction. Only files newer than this value are reported. Defaults to 1 day. |
Sync objects
solgate send KEY
CLI option | Description |
---|---|
-l, --listing-file | A listing file ingested by this command. Format is expected to be the same as solgate list output. If set, the KEY argument is ignored. |
Notification service
Send a workflow status alert via email from an Argo environment.
The command expects to be passed values matching the available Argo variable format as described here.
solgate report
Options can be set either via CLI argument or via environment variable:
- Options which map to Argo Workflow variables:
CLI option | Environment variable name | Value should map to Argo workflow variable | Description |
---|---|---|---|
--failures | WORKFLOW_FAILURES | {{workflow.failures}} | JSON serialized into a string listing all the failed workflow nodes |
-n, --name | WORKFLOW_NAME | {{workflow.name}} | Workflow instance name. |
--namespace | WORKFLOW_NAMESPACE | {{workflow.namespace}} | Project namespace where the workflow was executed. |
-s, --status | WORKFLOW_STATUS | {{workflow.status}} | Current status of the workflow execution. |
-t, --timestamp | WORKFLOW_TIMESTAMP | {{workflow.creationTimestamp}} | Workflow execution timestamp. |
- Options which map to config file entries. Priority order: CLI option > Environment variable > Config file entry > Default value
CLI option | Environment variable name | Config file entry | Description |
---|---|---|---|
--from | ALERT_SENDER | alerts_from | Email alert sender address. Defaults to solgate-alerts@redhat.com. |
--to | ALERT_RECIPIENT | alerts_to | Email alert recipient address. Defaults to data-hub-alerts@redhat.com. |
--smtp | SMTP_SERVER | alerts_smtp_server | SMTP server URL. Defaults to smtp.corp.redhat.com. |
- Other:
CLI option | Environment variable name | Description |
---|---|---|
--host | ARGO_UI_HOST | Argo UI external facing hostname. |
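The priority order for the options that map to config file entries (CLI option > Environment variable > Config file entry > Default value) can be sketched like this. The resolve_option helper is hypothetical and only illustrates the lookup order. The option, variable and entry names are taken from the table above.

```python
import os
from typing import Mapping, Optional


def resolve_option(
    cli_value: Optional[str],
    env_name: str,
    config: Mapping[str, str],
    config_key: str,
    default: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick a value following CLI option > env variable > config entry > default."""
    if cli_value is not None:
        return cli_value
    if env_name in environ:
        return environ[env_name]
    if config_key in config:
        return config[config_key]
    return default


config = {"alerts_from": "from-config@example.com"}
env = {"ALERT_SENDER": "from-env@example.com"}
default = "solgate-alerts@redhat.com"

from_cli = resolve_option("from-cli@example.com", "ALERT_SENDER", config, "alerts_from", default, env)
from_env = resolve_option(None, "ALERT_SENDER", config, "alerts_from", default, env)
from_config = resolve_option(None, "ALERT_SENDER", config, "alerts_from", default, {})
from_default = resolve_option(None, "ALERT_SENDER", {}, "alerts_from", default, {})
```

Passing the environment as an explicit mapping keeps the lookup easy to test. Leaving the argument out falls back to the process environment.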
Workflow manifests
In addition to the solgate package, this repository also features deployment manifests in the manifests folder. The current implementation of the Kubernetes manifests relies on Argo and Argo Events, and the manifests are structured in a Kustomize format. Environments for deployment are specified in the manifests/overlays/ENV_NAME folder.
Each environment features multiple solgate workflow instances. The config.ini configuration file and selected triggers are defined in an instance subfolder within the particular environment folder.
Deploy
Environment deployments are expected to be handled via Argo CD in AI-CoE SRE, however, they can be done manually as well.
Local prerequisites:
Already deployed platform and running services:
Build and deploy manifests
kustomize build --enable_alpha_plugins manifests/overlays/ENV_NAME | oc apply -f -
Create a new instance
Will be handled via scaffold in next version!
Prerequisites:
Import GPG keys EFDB9AFBD18936D9AB6B2EECBD2C73FF891FBC7E, A76372D361282028A99F9A47590B857E0288997C, 04DAFCD9470A962A2F272984E5EB0DA32F3372AC:
gpg --keyserver keyserver.ubuntu.com --recv EFDB9AFBD18936D9AB6B2EECBD2C73FF891FBC7E A76372D361282028A99F9A47590B857E0288997C 04DAFCD9470A962A2F272984E5EB0DA32F3372AC
- Create a new folder named after the instance in the selected environment overlay (make a copy of prod/TEMPLATE).
- Create a kustomization.yaml file in this new folder with the following content, change the NAME to your instance name:
  apiVersion: kustomize.config.k8s.io/v1beta1
  kind: Kustomization
  generators:
    - ./secret-generator.yaml
  commonLabels:
    app.kubernetes.io/name: NAME
  resources:
    - ./cronwf.yaml
- Create a secret-generator.yaml file in this new folder with the following content:
  apiVersion: viaduct.ai/v1
  kind: ksops
  metadata:
    name: secret-generator
  files:
    - secret.enc.yaml
- Create a secret.enc.yaml file in this folder and encrypt it via sops:
  apiVersion: v1
  kind: Secret
  metadata:
    name: solgate-NAME
  stringData:
    source.creds.yaml: |
      aws_access_key_id: KEY_ID_FOR_SOURCE
      aws_secret_access_key: SECRET_FOR_SOURCE
    destination.creds.yaml: |
      aws_access_key_id: DEFAULT_KEY_ID_FOR_DESTINATIONS
      aws_secret_access_key: DEFAULT_SECRET_FOR_DESTINATIONS
    destination.2.creds.yaml: |
      aws_access_key_id: KEY_ID_FOR_DESTINATION_ON_INDEX_2
      aws_secret_access_key: SECRET_FOR_DESTINATION_ON_INDEX_2
    config.yaml: |
      alerts_smtp_server: smtp.corp.redhat.com
      alerts_from: solgate-alerts@redhat.com
      alerts_to: dev-null@redhat.com
      timedelta: 5h
      source:
        endpoint_url: https://s3.upshift.redhat.com
        formatter: "{date}/{collection}.{ext}"
        base_path: DH-PLAYPEN/storage/input
      destinations:
        - endpoint_url: https://s3.upshift.redhat.com
          formatter: "{collection}/historic/{date}-{collection}.{ext}"
          base_path: DH-PLAYPEN/storage/output
          unpack: yes
        - endpoint_url: https://s3.upshift.redhat.com
          formatter: "{collection}/latest/full_data.csv"
          base_path: DH-PLAYPEN/storage/output
          unpack: yes
        - endpoint_url: https://s3.upshift.redhat.com
          base_path: DH-PLAYPEN/storage/output
  sops -e -i overlays/ENV_NAME/NEW_INSTANCE_NAME/INSTANCE_NAME.env.yaml
  Please make sure the *.creds.yaml entries in the secret are encrypted.
- Create cronwf.yaml with the following content, please change the name and config variable value to match the secret above:
  apiVersion: argoproj.io/v1alpha1
  kind: CronWorkflow
  metadata:
    generateName: solgate-NAME
    name: solgate-NAME
  spec:
    schedule:
    concurrencyPolicy: "Replace"
    workflowSpec:
      arguments:
        parameters:
          - name: config
            value: solgate-NAME
      workflowTemplateRef:
        name: solgate
- Update the resource and patch listing in the overlays/ENV_NAME/kustomization.yaml:
  resources:
    - ...
    - ./NEW_INSTANCE_NAME
Backfill
A backfill job ensures processing of all objects in the source bucket. This job assumes none of the objects were processed before and syncs them all, potentially overwriting any changes in the destination bucket.
There's a backfill.yaml available to be submitted directly. Please specify the config parameter before submitting. The value must match the name of a Secret config resource for the targeted pipeline.
argo submit -p config=solgate-NAME manifests/backfill.yaml
Workflow parameters
The CronWorkflow resource defined for each pipeline instance allows you to define 3 parameters:
Parameter | Value | Required | Description |
---|---|---|---|
config | string | yes | Define which config secret to mount to pods and pass to the solgate runtime |
is-backfil | string (boolean in quotes) | no | If set to true sync all data in the source bucket. Defaults to false |
split | string (int in quotes) | no | Define the number of files handled by a single sync pod. If there are more files to sync, the pipeline will spin up additional pods. Defaults to 5000 |
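How the split parameter translates into a number of sync pods can be illustrated with a small chunking sketch in Python. The chunk_listing function is hypothetical; in the deployed pipeline this fan-out is done by the workflow itself, not necessarily by code of this shape.

```python
from typing import Iterator, List, Sequence


def chunk_listing(keys: Sequence[str], split: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `split` keys, one batch per sync pod."""
    if split < 1:
        raise ValueError("split must be a positive integer")
    for start in range(0, len(keys), split):
        yield list(keys[start:start + split])


# Hypothetical listing of 12001 object keys.
listing = [f"2020-01-01/file-{i}.csv" for i in range(12001)]
batches = list(chunk_listing(listing))
batch_sizes = [len(batch) for batch in batches]
```

With the default of 5000, a listing of 12001 files would be handled by three pods, the last one receiving the remaining 2001 files.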
Developer setup
Local setup
Install pipenv and set up the environment:
pipenv sync -d
Install/enable pre-commit for this project:
pip install pre-commit
pre-commit install
Running tests
With local environment set up, you can run tests locally like this:
pipenv run pytest . --cov solgate
Building manifests
Install local prerequisites for kustomize manifests:
Use kustomize build --enable_alpha_plugins ... to build manifests.
CI/CD
We rely on the AICoE-CI GitHub application and bots to provide CI for us. Everything is configured via .aicoe-ci.yaml.
Releasing
If you're a maintainer, please release via GitHub issues. A new release:
- Creates a git release tag on GitHub.
- Pushes a new image to Quay.io thoth-station/solgate, tagged by the released version and latest.
- Releases to the PyPI solgate project.