
Parse S3 logs to more easily calculate usage metrics per asset.


DANDI S3 Log Parser

Badges: Ubuntu CI · Supported Python versions · codecov · PyPI latest release · License: BSD-3 · Code style: Black, Ruff

Extraction of minimal information from consolidated raw S3 logs for public sharing and plotting.

Developed for the DANDI Archive.

Read more about S3 logging on AWS.

A few summary facts as of 2024:

  • A single line of a raw S3 log file can be anywhere from 400 to more than 1,000 bytes.
  • Some of the busiest daily logs on the archive can have as many as 5,014,386 lines.
  • There are more than 6 TB of log files collected in total.
  • This parser reduces that total to less than 25 GB of final essential information on NWB assets (Zarr size TBD).

Installation

pip install dandi_s3_log_parser

Workflow

The process consists of three modular steps.

1. Reduction

Filter out:

  • Non-success status codes.
  • Excluded IP addresses.
  • Operation types other than the one specified (REST.GET.OBJECT by default).

Then, limit data extraction to a handful of specified fields from each full line of the raw logs; by default, object_key, timestamp, ip_address, and bytes_sent.

In the summer of 2024, this reduced 6 TB of raw logs to less than 170 GB.

The process is designed to be easily parallelized and interruptible, meaning that you can freely kill the processes while they are running and restart them later without losing much progress.
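
For illustration only, reducing a single raw log line might look something like the following Python sketch. This is not the package's actual implementation; the regular expression, the excluded IP address, and the success-status check are simplified assumptions about the AWS S3 server access log format.

import re

# Simplified pattern for an AWS S3 server access log line; real lines carry many
# more trailing fields and edge cases than this sketch handles.
LOG_PATTERN = re.compile(
    r'^\S+ \S+ \[(?P<timestamp>[^\]]+)\] (?P<ip_address>\S+) \S+ \S+ '
    r'(?P<operation>\S+) (?P<object_key>\S+) "[^"]*" (?P<status>\d{3}) \S+ '
    r'(?P<bytes_sent>\S+)'
)

EXCLUDED_IPS = {"192.0.2.1"}  # hypothetical internal IP address to exclude

def reduce_line(line: str) -> tuple[str, str, str, str] | None:
    """Return (object_key, timestamp, ip_address, bytes_sent), or None if filtered out."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    if match["operation"] != "REST.GET.OBJECT":  # keep only the specified operation type
        return None
    if not match["status"].startswith("2"):  # keep only success status codes (simplified)
        return None
    if match["ip_address"] in EXCLUDED_IPS:  # drop excluded IP addresses
        return None
    return match["object_key"], match["timestamp"], match["ip_address"], match["bytes_sent"]

Applying something like this to every line of every raw log file, and appending the surviving tuples to reduced output files, is the essence of this step.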

2. Binning

To make the mapping to Dandisets more efficient, the reduced logs are binned by their object keys (asset blob IDs) for fast lookup.

This step reduces the total file size from step (1) even further, since repeated object keys no longer need to be stored with every record, though it does create a large number of small files.

In the summer of 2024, this brought 170 GB of reduced logs down to less than 80 GB (20 GB of blobs spread across 253,676 files and 60 GB of Zarr spread across 4,775 files).
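
As a sketch of the idea (the actual file formats and directory layout used by the package may differ), binning amounts to a group-by on the object key followed by an append to one small file per key:

import csv
from collections import defaultdict
from pathlib import Path

def bin_reduced_log(reduced_log_path: Path, binned_folder: Path) -> None:
    """Group reduced records by object key and append each group to a per-key file.

    Assumes tab-separated rows of (object_key, timestamp, ip_address, bytes_sent);
    this is an illustrative format, not necessarily the package's own.
    """
    records_by_key = defaultdict(list)
    with reduced_log_path.open(newline="") as file:
        for object_key, timestamp, ip_address, bytes_sent in csv.reader(file, delimiter="\t"):
            records_by_key[object_key].append((timestamp, ip_address, bytes_sent))

    binned_folder.mkdir(parents=True, exist_ok=True)
    for object_key, records in records_by_key.items():
        # One small file per object key; the key itself no longer needs to be repeated
        # on every row, which is where the additional size reduction comes from.
        binned_file_path = binned_folder / f"{object_key.replace('/', '_')}.tsv"
        with binned_file_path.open("a", newline="") as file:
            csv.writer(file, delimiter="\t").writerows(records)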

3. Mapping

The final step is to scan through all currently known Dandisets and their versions, map the asset blob IDs to their filenames, and generate the most recently parsed usage logs in a form that can be shared publicly. It should be run periodically to keep the usage logs per Dandiset up to date.

In the summer of 2024, this brought 80 GB of binned logs down to around 20 GB of Dandiset logs.
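
In the same illustrative spirit, the mapping is essentially a join between the binned per-key files and each Dandiset's asset metadata. Here blob_id_to_asset_path is a hypothetical stand-in for the lookup the real package builds from the DANDI archive, and the file naming and output format are assumptions:

from pathlib import Path

def map_binned_logs_to_dandiset(
    binned_folder: Path,
    dandiset_folder: Path,
    blob_id_to_asset_path: dict[str, str],  # hypothetical lookup: asset blob ID -> asset path
) -> None:
    """Attach asset paths to the binned access records of a single Dandiset (sketch only)."""
    dandiset_folder.mkdir(parents=True, exist_ok=True)
    output_path = dandiset_folder / "usage.tsv"
    with output_path.open("w") as output_file:
        output_file.write("asset_path\ttimestamp\tip_address\tbytes_sent\n")
        for blob_id, asset_path in blob_id_to_asset_path.items():
            binned_file_path = binned_folder / f"blobs_{blob_id}.tsv"  # assumed naming scheme
            if not binned_file_path.exists():
                continue  # no successful GET requests were recorded for this asset
            for record in binned_file_path.read_text().splitlines():
                output_file.write(f"{asset_path}\t{record}\n")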

Usage

Reduction

To reduce:

reduce_all_dandi_raw_s3_logs \
  --raw_s3_logs_folder_path < base raw S3 logs folder > \
  --reduced_s3_logs_folder_path < reduced S3 logs folder path > \
  --maximum_number_of_workers < number of workers to use > \
  --maximum_buffer_size_in_mb < approximate amount of RAM to use > \
  --excluded_ips < comma-separated list of known IPs to exclude >

For example, on Drogon:

reduce_all_dandi_raw_s3_logs \
  --raw_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --maximum_number_of_workers 3 \
  --maximum_buffer_size_in_mb 3000 \
  --excluded_ips < Drogon's IP >

In the summer of 2024, this process took less than 10 hours to process all 6 TB of raw log data (using 3 workers at 3 GB buffer size).

Binning

To bin:

bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path < reduced S3 logs folder path > \
  --binned_s3_logs_folder_path < binned S3 logs folder path >

For example, on Drogon:

bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned

This process is not as friendly to random interruption as the reduction step is. If corruption is detected, the target binning folder will have to be cleaned before re-attempting.

The --file_limit < integer > flag can be used to limit the number of files processed in a single run, which can be useful for breaking the process up into smaller pieces, such as:

bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --file_limit 20

In the summer of 2024, this process took less than 5 hours to bin all 170 GB of reduced logs into the 80 GB of data per object key.

Mapping

To map:

map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path < binned S3 logs folder path > \
  --mapped_s3_logs_folder_path < mapped Dandiset logs folder > \
  --excluded_dandisets < comma-separated list of six-digit IDs to exclude > \
  --restrict_to_dandisets < comma-separated list of six-digit IDs to restrict mapping to >

For example, on Drogon:

map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --mapped_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-mapped \
  --excluded_dandisets 000108

In the summer of 2024, this process took less than 8 hours to complete for the blobs (with caches; 10 hours without) using one worker.

Some Dandisets take disproportionately longer than others to process. For this reason, the command also accepts --excluded_dandisets and --restrict_to_dandisets.

It is strongly suggested to use these to skip 000108 in the main run and process it separately (possibly on a different CRON cycle altogether), for example:

map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --mapped_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-mapped \
  --restrict_to_dandisets 000108

In the summer of 2024, this took ?? hours to complete.

The mapping process could in principle be made to run in parallel (and thus much faster), but this would take some effort to design. If interested, please open an issue to request this feature.
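
Purely to illustrate what that could look like (this is not a planned or tested design), per-Dandiset mapping might be farmed out to a process pool, reusing a per-Dandiset mapping function such as the sketch shown earlier; the dandiset_lookups argument is a hypothetical structure:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def map_all_dandisets_in_parallel(
    binned_folder: Path,
    mapped_folder: Path,
    dandiset_lookups: dict[str, dict[str, str]],  # hypothetical: Dandiset ID -> blob ID -> asset path
    max_workers: int = 3,
) -> None:
    """Map each Dandiset in its own worker process (illustrative sketch only)."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(
                map_binned_logs_to_dandiset,  # the per-Dandiset sketch from the Workflow section
                binned_folder=binned_folder,
                dandiset_folder=mapped_folder / dandiset_id,
                blob_id_to_asset_path=blob_id_to_asset_path,
            )
            for dandiset_id, blob_id_to_asset_path in dandiset_lookups.items()
        ]
        for future in futures:
            future.result()  # re-raise any exception from the workers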

Submit line decoding errors

Please email any line decoding errors collected under your local configuration directory (~/.dandi_s3_log_parser/errors) to the core maintainer before raising issues or submitting PRs that contribute them as examples; this makes it easier to correct any aspects that might require anonymization.

Download files

Source Distribution

dandi_s3_log_parser-0.4.0.tar.gz (32.5 kB, uploaded via twine/5.1.1 on CPython/3.9.19)

  • SHA256: 1e07228a5811c51305a3a7612c9795ba19d7b99a2c22e6b520fc04f971a5d6ab
  • MD5: e9747c0e56762a6691c2950cbc061ab1
  • BLAKE2b-256: 991a7e0ac3ed5af3a1a6aadab4f69165eba4c59059f8ddc4a988b579e6800078

Built Distribution

dandi_s3_log_parser-0.4.0-py3-none-any.whl (28.2 kB)

  • SHA256: cdebea0a305e0a6a2967e13f4990e7db0530bf29da82344682d358537201b889
  • MD5: 68cd9ae7b536ca458ad67b022d0aff42
  • BLAKE2b-256: 65106f7110d93e5fed16ecee0260455f47d91e9d9d126c65be647edbf2893970
