
UNIX command-line tool for python line-based stream processing

Project description

Author: Pahaz White

Repo: https://github.com/pahaz/py3line/

Py3line is a UNIX command-line tool for bash one-liner scripts. It is a Python, line-based alternative to grep, sed, and awk.

This project is inspired by: pyfil, piep, pysed, pyline, pyp and the Jacob+Mark recipe.

WHY DID I MAKE IT?

Sometimes I have to use sed / awk / grep, usually for simple text processing: for example, to find a pattern inside a text file using a Python regexp, or to comment/uncomment a config line with a one-line bash command.

I always forget the necessary options and the sed / awk DSL. But I know Python, I like it, and I want to use it for these simple bash tasks. The default python -c is not enough to write readable bash one-liners.

Why not pyline?
  • It doesn't support Python 3

  • It has many options

  • It doesn't support command chaining

PRINCIPLES

  • BASH ONE-LINER SCRIPTS AS SIMPLE TO UNDERSTAND AS POSSIBLE

  • FEWER SCRIPT ARGUMENTS

  • AS EASY TO INSTALL AS POSSIBLE (CONTAINER FRIENDLY ???)

  • SMALL CODEBASE (less than 500 LOC)

  • AS LAZY AND EFFICIENT AS POSSIBLE

Installation

py3line is on PyPI, so simply run:

pip install py3line

or

sudo curl -L "https://61-63976011-gh.circle-artifacts.com/0/py3line-$(uname -s)-$(uname -m)" -o /usr/local/bin/py3line
sudo chmod +x /usr/local/bin/py3line

to have it installed in your environment.

For installing from source, clone the repo and run:

python setup.py install

Tutorial

Let's start with an example. We want to count the number of words in each line:

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "x = len(line.split(' ')); print(x, line)"
2 Here are
1 some
3 words for you.

Py3line processes the input stream line by line using Python code.

  • echo -e "Here are\nsome\nwords for you." – creates an input stream consisting of three lines

  • | – pipes the input stream to py3line

  • "x = len(line.split(' ')); print(x, line)" – defines 2 actions: "x = len(line.split(' '))" evaluates the number of words in each line, then "print(x, line)" prints the result. Each action is applied to the input stream step by step.

The example above can be represented as the following python code:

import sys

def process(stream):
    for line in stream:
        x = len(line.split(' '))  # action 1
        print(x, line)            # action 2
        yield line

stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = process(stream)
for line in stream: pass

You can also print the generated Python code with the --pycode argument.

$ ./py3line.py "x = len(line.split(' ')); print(x, line)" --pycode  #skipbashtest
...

Stream transform

Let's try a more complex example. We want to count the number of words in the whole file. This value is easy to calculate if you convert the input stream from a stream of lines into a stream of per-line word counts. Just override the line variable:

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "line = len(line.split()); print(sum(stream))"
6

Here “print(sum(stream))” is a stream transformation action: it works on the whole stream rather than on a single line.

The example above can be represented as the following python code:

import sys

def process(stream):
    for line in stream:
        line = len(line.split())  # action 1
        yield line

def transform(stream):
    print(sum(stream))            # action 2
    return stream

stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(process(stream))
for line in stream: pass

You can also print the generated Python code with the --pycode argument.

$ ./py3line.py "line = len(line.split()); print(sum(stream))" --pycode  #skipbashtest
...

Lazy as possible

Py3line does calculations only when necessary by using Python generators. This means that the input stream does not have to fit into memory, and you can easily process more data than your RAM allows.
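A minimal sketch of this laziness in plain Python. It is an illustration only, not py3line's generated code, and the names numbered_lines and word_counts are hypothetical. The generators below produce one element at a time, so only the current line is held in memory:

```python
import itertools

def numbered_lines(limit):
    """Hypothetical stand-in for a large input stream: yields lines lazily."""
    for i in range(limit):
        yield "line number %d" % i

def word_counts(stream):
    """Element processing: convert each line into its number of words."""
    for line in stream:
        yield len(line.split())

# Nothing is computed yet: a generator runs only when something consumes it.
stream = word_counts(numbered_lines(100000))

# Consume only the first three elements; the rest are not generated yet.
first_three = list(itertools.islice(stream, 3))
print(first_three)

# Consume the remaining elements one at a time, without building a list.
total = sum(stream)
print(total)
```

Every sample line has three words, so the second print shows the word count of the remaining 99997 lines only. The first three elements were already consumed and cannot be read again.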

But it also imposes limitations on how you can work with the data flow. You cannot use multiple aggregation functions at the same time. For example, suppose we want to calculate the maximum number of words in a line and the total number of words in the whole file at the same time:

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "line = len(line.split()); print(sum(stream)); print(max(stream))"  #skipbashtest
6
2019-05-05 14:55:09,353 | ERROR   | Traceback (most recent call last):
  File "<string>", line 15, in <module>
    stream = transform2(process1(stream))
  File "<string>", line 10, in transform2
    print(max(stream))
ValueError: max() arg is an empty sequence

We can see the empty sequence error. It is raised because our stream generator has already been exhausted by sum(stream), and max() cannot find a maximum value in an empty stream.
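The same behaviour can be reproduced in plain Python without py3line. This is a small sketch with a hypothetical word_counts generator, using the three sample lines from the example above:

```python
def word_counts(lines):
    """Yield the number of words in each line, lazily."""
    for line in lines:
        yield len(line.split())

stream = word_counts(["Here are", "some", "words for you."])

total = sum(stream)      # consumes every element of the generator
print(total)

message = None
try:
    print(max(stream))   # the generator is exhausted, nothing is left
except ValueError as error:
    # The exact wording of the message depends on the Python version.
    message = str(error)
    print("ValueError:", message)
```

The first aggregation sees all three values. The second one receives an exhausted generator and fails.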

stream memorization

We can solve this by converting the stream generator to a list of values in memory using Python's list(stream) function.

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "line = len(line.split()); stream = list(stream); print(sum(stream), max(stream))"
6 3

The example above can be represented as the following python code:

import sys

def process(stream):
    for line in stream:
        line = len(line.split())     # action 1
        yield line

def transform(stream):
    stream = list(stream)            # action 2
    print(sum(stream), max(stream))  # action 3
    return stream

stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(process(stream))
for line in stream: pass

evaluate on the fly

We can also solve it without putting the stream into memory. Just use auxiliary variables to accumulate the result while the stream is being processed.

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "s = 0; m = 0; num_of_words = len(line.split()); s += num_of_words; m = max(m, num_of_words); print(s, m)"
2 2
3 2
6 3

The example above can be represented as the following python code:

import sys

def process(stream):
    s = 0                                 # action 1
    m = 0                                 # action 2
    for line in stream:
        num_of_words = len(line.split())  # action 3
        s += num_of_words                 # action 4
        m = max(m, num_of_words)          # action 5
        print(s, m)                       # action 6
        yield line

stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = process(stream)
for line in stream: pass

But we want only the final result, not the intermediate ones. To get it, add a loop over all elements of the stream before printing: for line in stream: pass. Don't worry, this loop does not add unnecessary calculations, because we use Python generators. It simply forces the stream to be iterated before the print function is called.

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "s = 0; m = 0; num_of_words = len(line.split()); s += num_of_words; m = max(m, num_of_words); for line in stream: pass; print(s, m)"
6 3

The example above can be represented as the following python code:

import sys

def process(stream):
    global s, m
    s = 0                                 # action 1
    m = 0                                 # action 2
    for line in stream:
        num_of_words = len(line.split())  # action 3
        s += num_of_words                 # action 4
        m = max(m, num_of_words)          # action 5
        yield line

def transform(stream):
    global s, m
    for line in stream: pass              # action 6
    print(s, m)                           # action 7
    return stream

stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(process(stream))
for line in stream: pass

python generator laziness

Let's check Python generator laziness. Just run for line in stream: print(1) twice in a row:

$ echo -e "Here are\nsome\nwords for you." | ./py3line.py "for line in stream: print(1); for line in stream: print(1)"
1
1
1

As we can see, the Python generator can be iterated over only once. All subsequent iterations work with an exhausted generator, which is equivalent to looping over an empty list.

The example above can be represented as the following python code:

import sys

def transform(stream):
    for line in stream: print(1)          # action 1 (3 iterations)
    for line in stream: print(1)          # action 2 (0 iterations)
    return stream

stream = (line.rstrip("\r\n") for line in sys.stdin if line)
stream = transform(stream)
for line in stream: pass                  # (0 iterations)

work with a part of stream

TODO ….

Details

Let us define some terminology. Consider py3line "action1; action2; action3".

We have three actions: action1, action2 and action3. Each action has a type. It may be element processing or stream transformation.

We can determine the type of an action from the variables used in it. There are two such variables: line and stream. They are markers that define the type of the action.

Let's look at the types of some actions from the examples above:

x = line.split()                 -- element processing
print(x, line)                   -- element processing
print(sum(stream))               -- stream transformation
stream = list(stream)            -- stream transformation
print(sum(stream), max(stream))  -- stream transformation
s = 0                            -- unidentified
m = 0                            -- unidentified
num_of_words = len(line.split()) -- element processing
s += num_of_words                -- unidentified
m = max(m, num_of_words)         -- unidentified
print(s, m)                      -- unidentified
for line in stream: pass         -- stream transformation

[rule1] If an action has an unidentified type, it inherits its type from the previous action. [rule2] If there is no previous action, then the action is considered a stream transformation.

Examples:

s = 0                            -- stream transformation (because of [rule2])
num_of_words = len(line.split()) -- element processing (because of `line` marker)
s += num_of_words                -- element processing (because of [rule1])
print(s)                         -- element processing (because of [rule1])

And if we want to print at the end, we need some stream marker in one of the preceding actions.

s = 0                            -- stream transformation (because of [rule2])
num_of_words = len(line.split()) -- element processing (because of `line` marker)
s += num_of_words                -- element processing (because of [rule1])
stream                           -- stream transformation (because of `stream` marker)
print(s)                         -- stream transformation (because of [rule1])

Unfortunately, this is not so clear to people who are not familiar with the implementation. Therefore, it is better to use an action that is more explicit to readers, such as for line in stream: pass.

s = 0                            -- stream transformation (because of [rule2])
num_of_words = len(line.split()) -- element processing (because of `line` marker)
s += num_of_words                -- element processing (because of [rule1])
for line in stream: pass         -- stream transformation (because of `stream` marker)
print(s)                         -- stream transformation (because of [rule1])
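The marker rules above can be sketched as a small classifier. This is an illustration only and not py3line's actual implementation: the classify function is hypothetical, it matches the markers with a naive regular expression (which would also match inside string literals), and it assumes that the stream marker wins when both markers appear, as in for line in stream: pass.

```python
import re

ELEMENT = "element processing"
STREAM = "stream transformation"

def classify(actions):
    """Assign a type to each action using the marker rules described above."""
    types = []
    for action in actions:
        if re.search(r"\bstream\b", action):
            kind = STREAM        # `stream` marker (checked first: an assumption)
        elif re.search(r"\bline\b", action):
            kind = ELEMENT       # `line` marker
        elif types:
            kind = types[-1]     # [rule1]: inherit from the previous action
        else:
            kind = STREAM        # [rule2]: there is no previous action
        types.append(kind)
    return types

script = ("s = 0; num_of_words = len(line.split()); s += num_of_words; "
          "for line in stream: pass; print(s)")
actions = script.split("; ")

for action, kind in zip(actions, classify(actions)):
    print("%-33s -- %s" % (action, kind))
```

For the last listing above, this sketch assigns the same types: a stream transformation, two element processing actions, and two stream transformations.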

Some examples

# Print every line (null transform)
$ cat ./testsuit/test.txt | ./py3line.py "print(line)"
This is my cat,
 whose name is Betty.
This is my dog,
 whose name is Frank.
This is my fish,
 whose name is George.
This is my goat,
 whose name is Adam.
# Number every line
$ cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); print(line)"
(0, 'This is my cat,')
(1, ' whose name is Betty.')
(2, 'This is my dog,')
(3, ' whose name is Frank.')
(4, 'This is my fish,')
(5, ' whose name is George.')
(6, 'This is my goat,')
(7, ' whose name is Adam.')
# Number every line
$ cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); print(line[0], line[1])"
0 This is my cat,
1  whose name is Betty.
2 This is my dog,
3  whose name is Frank.
4 This is my fish,
5  whose name is George.
6 This is my goat,
7  whose name is Adam.

Or just cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); print(*line)"

# Print every first and last word
$ cat ./testsuit/test.txt | ./py3line.py "s = line.split(); print(s[0], s[-1])"
This cat,
whose Betty.
This dog,
whose Frank.
This fish,
whose George.
This goat,
whose Adam.
# Split into words and print as list (strip all non-word chars like comma, dot, etc.)
$ cat ./testsuit/test.txt | ./py3line.py "print(re.findall(r'\w+', line))"
['This', 'is', 'my', 'cat']
['whose', 'name', 'is', 'Betty']
['This', 'is', 'my', 'dog']
['whose', 'name', 'is', 'Frank']
['This', 'is', 'my', 'fish']
['whose', 'name', 'is', 'George']
['This', 'is', 'my', 'goat']
['whose', 'name', 'is', 'Adam']
# Split into words (strip all non-word chars like comma, dot, etc.)
$ cat ./testsuit/test.txt | ./py3line.py "print(*re.findall(r'\w+', line))"
This is my cat
whose name is Betty
This is my dog
whose name is Frank
This is my fish
whose name is George
This is my goat
whose name is Adam
# Find all three letter words
$ cat ./testsuit/test.txt | ./py3line.py "print(re.findall(r'\b\w\w\w\b', line))"
['cat']
[]
['dog']
[]
[]
[]
[]
[]
# Find all three letter words + skip empty lists
$ cat ./testsuit/test.txt | ./py3line.py "line = re.findall(r'\b\w\w\w\b', line); if not line: continue; print(line)"
['cat']
['dog']
# Regex matching with groups
$ cat ./testsuit/test.txt | ./py3line.py "line = re.findall(r' is ([A-Z]\w*)', line); if not line: continue; print(*line)"
Betty
Frank
George
Adam
# cat ./testsuit/test.txt | ./py3line.py "line = re.search(r' is ([A-Z]\w*)', line); if not line: continue; line.group(1)"
$ cat ./testsuit/test.txt | ./py3line.py "rgx = re.compile(r' is ([A-Z]\w*)'); line = rgx.search(line); if not line: continue; print(line.group(1))"
Betty
Frank
George
Adam
# head -n 2
# cat ./testsuit/test.txt | ./py3line.py "stream = enumerate(stream); if line[0] >= 2: break; print(line[1])"
$ cat ./testsuit/test.txt | ./py3line.py "stream = list(stream)[:2]; print(line)"
This is my cat,
 whose name is Betty.
# Print just the URLs in the access log
$ cat ./testsuit/nginx.log | ./py3line.py "print(shlex.split(line)[13])"
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
GET /admin/moktoring/session/add/ HTTP/1.1
GET /admin/jsi18n/ HTTP/1.1
GET /static/admin/img/icon-calendar.svg HTTP/1.1
GET /static/admin/img/icon-clock.svg HTTP/1.1
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
HEAD / HTTP/1.0
GET /logout/?reason=startApplication HTTP/1.1
GET / HTTP/1.1
GET /login/?next=/ HTTP/1.1
POST /admin/customauth/user/?q=%D0%9F%D0%B0%D1%81%D0%B5%D1%87%D0%BD%D0%B8%D0%BA HTTP/1.1
# Print the most commonly accessed URLs, keeping only those accessed at least 5 times
$ cat ./testsuit/nginx.log | ./py3line.py "line = shlex.split(line)[13]; stream = collections.Counter(stream).most_common(); if line[1] < 5: continue; print(line)"
('HEAD / HTTP/1.0', 10)
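The last example mixes element processing, a stream transformation, and element processing again. Following the process/transform pattern from the tutorial, it could be represented roughly as the Python code below. This is a sketch under assumptions: the function names are hypothetical, the sample data is made up (it is not taken from ./testsuit/nginx.log), and the shlex.split step is skipped because the sample already holds the extracted request strings. Use --pycode to see what py3line really generates.

```python
import collections

def process1(stream):
    for line in stream:
        # action 1 in the real example: line = shlex.split(line)[13]
        # Skipped here: the sample data is already the extracted request.
        yield line

def transform2(stream):
    # action 2: replace the stream with (request, count) pairs
    return collections.Counter(stream).most_common()

def process3(stream):
    for line in stream:
        if line[1] < 5:      # action 3: skip requests seen fewer than 5 times
            continue
        print(line)          # action 4
        yield line

# Hypothetical sample data
requests = (["HEAD / HTTP/1.0"] * 10
            + ["GET / HTTP/1.1"] * 2
            + ["GET /login/ HTTP/1.1"])

stream = process3(transform2(process1(iter(requests))))
kept = list(stream)          # forces the iteration, like the final loop in the tutorial
```

Note that collections.Counter has to consume the whole stream before the first pair is available, so this example is not lazy.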

Complex examples

# create directory tree
echo -e "y1\nx2\nz3" | ./py3line.py "pathlib.Path('/DATA/' + line +'/db-backup/').mkdir(parents=True, exist_ok=True)"

group by 3 lines ... (https://askubuntu.com/questions/1052622/separate-log-text-according-to-paragraph)

HELP

$ ./py3line.py --help
usage: py3line.py [-h] [-v] [-q] [--version] [--pycode]
                  [expression [expression ...]]

Py3line is a UNIX command-line tool for a simple text stream processing by the
Python one-liner scripts. Like grep, sed and awk.

positional arguments:
  expression     python comma separated expressions

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose
  -q, --quiet
  --version      print the version string
  --pycode       show generated python code

