Estimate the distribution of line lengths in files.
Project description
Qudth randomly samples lines from a large file and calculates statistics about the sampled lines. For example, given a 10-gigabyte text file, you might want to know how long a typical line is.
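To make the idea concrete, here is a minimal sketch of sampling line lengths without reading the whole file. It is not qudth's actual implementation: the function names `sample_line_lengths` and `summarize` are hypothetical, and the byte-offset strategy is an assumption. Note that jumping to a random byte offset and taking the next full line is not uniform over lines (a line is more likely to be picked when the line before it is long), so treat the result as a rough estimate.

```python
import os
import random
import statistics


def sample_line_lengths(path, n, seed=None):
    """Return the byte lengths (newline excluded) of n sampled lines.

    Assumed strategy, not qudth's confirmed method: seek to a random byte
    offset, discard the rest of the line we landed in, and measure the
    next full line. Sampling is with replacement.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    lengths = []
    if size == 0:
        return lengths
    with open(path, "rb") as f:
        for _ in range(n):
            offset = rng.randrange(size)
            f.seek(offset)
            if offset > 0:
                f.readline()  # discard the (possibly partial) current line
            line = f.readline()
            if not line:
                # We landed in the last line; wrap around to the first one.
                f.seek(0)
                line = f.readline()
            lengths.append(len(line.rstrip(b"\r\n")))
    return lengths


def summarize(lengths):
    """Return (min, median, max) of the sampled lengths."""
    return min(lengths), statistics.median(lengths), max(lengths)
```

Because each sample costs one seek and two short reads, the running time depends on the number of samples rather than on the file size.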
Line lengths
It is very convenient if line length is what you are interested in, as that is the only statistic implemented right now.
$ qudth qudth/cli.py -n 5 --bins 8
▁ ▁ ▂ ▁ ▁ ▁ ▃ ▃
01 52 59
Lengths of 5 lines in qudth/cli.py (simple random sample with replacement)
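The output above is a small histogram of the sampled lengths drawn with block characters. A rough sketch of how such a sparkline could be produced follows; the function name `sparkline`, the choice of glyphs, and the scaling rule are assumptions for illustration, and qudth's own rendering may differ.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block glyphs, shortest to tallest (assumed set)


def sparkline(values, bins=8):
    """Bin the values into equal-width bins and draw one glyph per bin."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1 when every value is identical.
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    peak = max(counts)
    # Scale each count relative to the fullest bin.
    return "".join(BARS[round(c / peak * (len(BARS) - 1))] for c in counts)
```

Calling `sparkline(lengths, bins=8)` on a list of sampled line lengths returns an eight-character string, one character per bin.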
Benchmarking
wc -l is a rough point of comparison for qudth’s line length estimation: wc -l has to read the entire file, whereas qudth only samples it, which makes qudth much faster on large files. In the timings below, big-file.csv is 1 gigabyte in size.
_:~ t$ time qudth big-file.csv > /dev/null
real 0m0.287s
user 0m0.161s
sys 0m0.032s
_:~ t$ time wc -l big-file.csv > /dev/null
real 0m2.515s
user 0m1.475s
sys 0m0.440s
Future work
A more conventional tool would perhaps emit a random sample of lines to stdout, possibly with support for different sampling strategies.
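As one illustration of what such a sampling strategy might look like, the sketch below uses reservoir sampling (Algorithm R) to draw a fixed-size sample without replacement from a stream of lines. This is not part of qudth; the names `reservoir_sample` and `write_sample` are hypothetical. Unlike the sampling described earlier, this approach reads the whole input once, so it trades speed for an exactly uniform sample.

```python
import random


def reservoir_sample(lines, k, rng=None):
    """Return k items chosen uniformly without replacement from an iterable.

    If the iterable has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


def write_sample(in_stream, out_stream, k, rng=None):
    """Copy a random sample of k lines from in_stream to out_stream."""
    out_stream.writelines(reservoir_sample(in_stream, k, rng))
```

For example, `write_sample(sys.stdin, sys.stdout, 10)` (after `import sys`) would print ten randomly chosen lines from standard input.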
Download files
Source Distribution
File details
Details for the file qudth-0.0.3.tar.gz.
File metadata
- Download URL: qudth-0.0.3.tar.gz
- Upload date:
- Size: 2.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6dfc361ccd72775fa12e91a558b1815b54920622464b35454e44312444a4ff2d
MD5 | 4f37cca1229a12091dcf6430545bc863
BLAKE2b-256 | 45ac4672947113652f4cec572697f40369594a175a13891f451e3ff6e4a39ba4