
Parse S3 logs to more easily calculate usage metrics per asset.

Project description

DANDI S3 Log Parser

Simple reductions of consolidated S3 logs (the consolidation step is not included in this repository) into the minimal information needed for public sharing and plotting.

Developed for the DANDI Archive.

A single line of a raw S3 log file typically ranges from 400 to over 1,000 bytes, and some of the busiest daily logs on the archive contain around 5 million lines. As of summer 2024, more than 6 TB of log files have been collected.

This parser can reduce these to tens of GB of consolidated and anonymized usage data, which is much more manageable for sharing and plotting.
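
As a rough illustration (this is not the package's internal implementation), the reduction amounts to keeping only a few fields from each matching request line. A minimal sketch in Python, using a simplified, made-up log line:

# Minimal sketch of the kind of reduction involved (simplified, hypothetical
# log line; real S3 server-access logs contain many more fields).
import re

raw_line = (
    '8787a3c5 examplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
    '- 3E57427F REST.GET.OBJECT blobs/abc123 '
    '"GET /examplebucket/blobs/abc123 HTTP/1.1" 200 - 512 512 70 50 "-" "-" -'
)

# Keep only the fields useful for usage plots: timestamp, object key, bytes sent.
pattern = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] (?P<ip>\S+) \S+ \S+ "
    r"(?P<operation>\S+) (?P<key>\S+) \"[^\"]*\" \d{3} \S+ (?P<bytes_sent>\d+)"
)

match = pattern.search(raw_line)
if match is not None and match["operation"] == "REST.GET.OBJECT":
    print(match["timestamp"], match["key"], match["bytes_sent"])
    # 06/Feb/2019:00:00:38 +0000 blobs/abc123 512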

Usage

To iterate over and parse all historical logs in a single run (parallelization with 10-15 GB of total RAM recommended):

parse_all_dandi_raw_s3_logs \
  --base_raw_s3_log_folder_path < base log folder > \
  --parsed_s3_log_folder_path < output folder > \
  --excluded_log_files < any log files to skip > \
  --excluded_ips < comma-separated list of known IPs to exclude > \
  --maximum_number_of_workers < number of CPUs to use > \
  --maximum_buffer_size_in_bytes < approximate amount of RAM to use >

For example, on Drogon:

parse_all_dandi_raw_s3_logs \
  --base_raw_s3_log_folder_path /mnt/backup/dandi/dandiarchive-logs \
  --parsed_s3_log_folder_path /mnt/backup/dandi/dandiarchive-logs-cody/parsed_7_13_2024/GET_per_asset_id \
  --excluded_log_files /mnt/backup/dandi/dandiarchive-logs/stats/start-end.log \
  --excluded_ips < Drogon's IP > \
  --maximum_number_of_workers 3 \
  --maximum_buffer_size_in_bytes 15000000000
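
If you prefer to drive this from Python, the CLI flags appear to map onto keyword arguments of a function with the same name. The import path and exact signature below are assumptions inferred from the CLI, so check the package itself before relying on them:

# Assumed Python equivalent of the CLI call above; the import path and
# keyword-argument names are inferred from the CLI flags, not verified.
from dandi_s3_log_parser import parse_all_dandi_raw_s3_logs

parse_all_dandi_raw_s3_logs(
    base_raw_s3_log_folder_path="/mnt/backup/dandi/dandiarchive-logs",
    parsed_s3_log_folder_path="/mnt/backup/dandi/parsed",  # hypothetical output folder
    maximum_number_of_workers=3,
    maximum_buffer_size_in_bytes=15_000_000_000,  # approximately 15 GB
)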

To parse only a single log file at a time, such as in a CRON job:

parse_dandi_raw_s3_log \
  --raw_s3_log_file_path < s3 log file path > \
  --parsed_s3_log_folder_path < output folder > \
  --excluded_ips < comma-separated list of known IPs to exclude >
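
For instance, a crontab entry along these lines (paths hypothetical; uses GNU date) would parse the previous day's log every night at 2 AM. Note that % must be escaped as \% inside a crontab:

0 2 * * * parse_dandi_raw_s3_log --raw_s3_log_file_path /mnt/logs/$(date -d yesterday +\%Y-\%m-\%d).log --parsed_s3_log_folder_path /mnt/parsed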

Submit line decoding errors

Please email line decoding errors collected in your local config file to the core maintainer before raising issues or submitting PRs that contribute them as examples; this makes it easier to scrub any details that might require anonymization.

Developer notes

The .log file suffix is typically ignored when working with Git, so when committing changes to the example log collection, you will have to forcibly include each file with

git add -f <example file name>.log
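
This is usually caused by an ignore rule along these lines (assumed; check the repository's actual .gitignore):

# .gitignore
*.log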

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
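
For most users, installing from PyPI with pip is simpler than downloading files manually:

pip install dandi_s3_log_parser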

Source Distribution

dandi_s3_log_parser-0.1.0.tar.gz (23.6 kB)

Uploaded Source

Built Distribution

dandi_s3_log_parser-0.1.0-py3-none-any.whl (22.1 kB)

Uploaded Python 3

File details

Details for the file dandi_s3_log_parser-0.1.0.tar.gz.

File metadata

  • Download URL: dandi_s3_log_parser-0.1.0.tar.gz
  • Upload date:
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for dandi_s3_log_parser-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a13c498451de7616fcfba8df4e301d1ac3ccbcffb2e00e25dfa0c5a1ec4750c5
MD5 9d33f81f4ac277a457111dfe929720fd
BLAKE2b-256 5aabd4f3366f7294269a7639f84acebef0abc5c021a1541e18cd14887a2afc46

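As a sketch, a published SHA256 digest can be checked against a downloaded file with nothing beyond the Python standard library:

# Verify the downloaded sdist against the published SHA256 digest.
import hashlib

expected = "a13c498451de7616fcfba8df4e301d1ac3ccbcffb2e00e25dfa0c5a1ec4750c5"
with open("dandi_s3_log_parser-0.1.0.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == expected, "hash mismatch: the download may be corrupted"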

File details

Details for the file dandi_s3_log_parser-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dandi_s3_log_parser-0.1.0-py3-none-any.whl
  • Size: 22.1 kB
  • Tags: Python 3

File hashes

Hashes for dandi_s3_log_parser-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 664b11207fca2af33f56842a88ed1914d1bd1c2deeb810eede55448ab55a741f
MD5 0cd4e102025e603d36c74ceee83ca05e
BLAKE2b-256 fe7603d973878686cce649716b1298d133a3a56c1a94db84bc3181a827c59d86

