Estimate the distributions of line lengths of files.
Project description
Qudth randomly samples the lines within a large file and calculates statistics about each line. For example, in a 10-gigabyte text file, you might want to know how long a typical line is.
Line lengths
It would be very convenient if line length is what you are interested in, as that is the only thing we implement right now.
$ qudth qudth/cli.py -n 5 --bins 8 ▁ ▁ ▂ ▁ ▁ ▁ ▃ ▃ 01 52 59 Lengths of 5 lines in qudth/cli.py (simple random sample with replacement)
Benchmarking
wc -l is equivalent to qudth’s line length estimation, but qudth’s sampling makes it much faster on large files. big-file.csv is 1 gigabyte in size.
_:~ t$ time qudth big-file.csv > /dev/null real 0m0.287s user 0m0.161s sys 0m0.032s _:~ t$ time wc -l big-file.csv > /dev/null real 0m2.515s user 0m1.475s sys 0m0.440s
Future work
A more standard thing would perhaps be something that emitted a random sample to stdout. It could support different sampling strategies perhaps.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.