Sample lines from a file.
Project description
Sample lines from a file that has already been written.
Install
Install like so.
pip install sample-lines
How to
See the help for documentation.
$ sample-lines -h
usage: Randomly select lines from a file. [-h] [--sample-size N]
       [--method {simple-random,systematic}] [--repeat REPEAT] file

positional arguments:
  file

optional arguments:
  -h, --help            show this help message and exit
  --sample-size N, -n N
                        Number of lines to emit
  --method {simple-random,systematic}, -m {simple-random,systematic}
                        Sampling method
  --repeat REPEAT, -r REPEAT
                        Number of repetitions for systematic sampling
Samples are drawn with replacement and weighted by line length: the probability of selecting a line is proportional to its length. This allows us to sample very quickly, but it makes this approach appropriate only if your file has reasonably consistent line lengths or if you don't mind short lines being under-represented.
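One common way to get length-weighted sampling with replacement is to pick a uniformly random byte offset and emit whichever line contains it. The sketch below illustrates that idea only; it is an assumption about the general technique, not a copy of what sample-lines does internally, and the names sample_lines_by_offset and line_at are hypothetical.

```python
import os
import random


def line_at(f, offset, chunk=4096):
    """Return the full line (as bytes) that contains byte `offset`."""
    # Walk backwards in chunks to find the newline that precedes the offset.
    start = offset
    while start > 0:
        step = min(chunk, start)
        f.seek(start - step)
        block = f.read(step)
        nl = block.rfind(b"\n")
        if nl != -1:
            start = start - step + nl + 1  # line begins right after that newline
            break
        start -= step
    f.seek(start)
    return f.readline()


def sample_lines_by_offset(path, n, rng=random):
    """Draw n lines with replacement, weighted by line length in bytes.

    Hypothetical helper: each draw costs one seek and a short read,
    so the file is never scanned from start to end.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    lines = []
    with open(path, "rb") as f:
        for _ in range(n):
            offset = rng.randrange(size)  # uniform byte position in the file
            lines.append(line_at(f, offset))
    return lines
```

Because every byte is equally likely to be chosen, a line that is ten times longer than another is ten times more likely to be picked, and the same line can appear more than once.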
How fast
Consider this 1-gigabyte CSV file.
$ wc big-file.csv
2388430 27673790 1071895374 big-file.csv
Running wc took almost four seconds.
$ time wc big-file.csv
2388430 27673790 1071895374 big-file.csv

real    0m3.789s
user    0m3.560s
sys     0m0.190s
sample-lines is much faster. Here’s a simple random sample of 40 lines,
$ time sample-lines -n 40 -m simple-random big-file.csv > /dev/null

real    0m0.136s
user    0m0.113s
sys     0m0.018s
a systematic sample of 40 lines,
$ time sample-lines -n 40 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.148s
user    0m0.122s
sys     0m0.019s
and a repeated systematic sample, with 4 repeats of 10 lines each, for a total of 40 lines.
$ time sample-lines -n 10 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.175s
user    0m0.140s
sys     0m0.025s
Most of the time in the above examples was spent loading Python and the various modules; printing the help takes about as long as drawing a sample.
$ time sample-lines -h > /dev/null

real    0m0.157s
user    0m0.129s
sys     0m0.021s
So even a pretty big sample, here 50 repeats of 2000 lines each, still runs in under three seconds.
$ time sample-lines -n 2000 -m systematic -r 50 big-file.csv > /dev/null

real    0m2.695s
user    0m2.435s
sys     0m0.231s
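For readers unfamiliar with the systematic method timed above, here is a rough sketch of textbook repeated systematic sampling over byte positions: each repetition picks a random start and then steps through the file at a fixed interval. How sample-lines actually implements -m systematic and -r is not documented here, so treat this as an assumption; systematic_offsets is a hypothetical name, and each offset would still have to be mapped to the line that contains it.

```python
import random


def systematic_offsets(size, n, repeat=1, rng=random):
    """Return `repeat` passes of `n` evenly spaced byte offsets in [0, size).

    Hypothetical illustration of repeated systematic sampling; not taken
    from the sample-lines source.
    """
    if size <= 0 or n <= 0:
        return []
    interval = size / n  # spacing between consecutive picks in one pass
    offsets = []
    for _ in range(repeat):
        start = rng.uniform(0, interval)  # fresh random start for each pass
        for k in range(n):
            # Clamp so a start at the very end of the interval stays in range.
            offsets.append(min(int(start + k * interval), size - 1))
    return offsets


if __name__ == "__main__":
    # 4 passes of 10 offsets each over a file of 1071895374 bytes: 40 positions.
    print(len(systematic_offsets(1071895374, 10, repeat=4)))
```

Under this reading, -n 10 -r 4 yields 4 passes of 10 positions, which matches the total of 40 lines described above.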
Alternatives
Use sample if you want to sample from a stream.