Skip to main content

Universal errors-and-erasures Reed Solomon codec (error correcting code) in pure Python with extensive documentation

Project description

PyPI-Status PyPI-Versions PyPI-Downloads

Build-Status Coverage

UniReedSolomon is a pure-Python universal Reed-Solomon error correction codec with fully documented code and mathematical nomenclatura, compatible with Python 2.7 up to 3.7 and also PyPy 2 and 3.

Compatible also with Python 3.8 but requires installing Cython even if not used (this will be fixed in the future).


Installation

pip install --upgrade unireedsolomon

Quickstart

>>> import unireedsolomon as rs
>>> coder = rs.RSCoder(20,13)
>>> c = coder.encode("Hello, world!")
>>> print repr(c)
'Hello, world!\x8d\x13\xf4\xf9C\x10\xe5'
>>>
>>> r = "\0"*3 + c[3:]
>>> print repr(r)
'\x00\x00\x00lo, world!\x8d\x13\xf4\xf9C\x10\xe5'
>>>
>>> coder.decode(r)
'Hello, world!'

Description

This library implements a pure-Python documented universal Reed-Solomon error correction codec with a mathematical nomenclatura, compatible with Python 2.7 up to 3.7+ and also with PyPy 2 and PyPy 3.

The project aims to keep a well commented and organized code with an extensive documentation and mathematical clarity of the various arithmetic operations.

How does this library differs from other Reed-Solomon libraries?

  • universal: compatibility with (almost) any other Reed-Solomon codec. This means that you can choose the parameters so that you can either encode data and decode it with another RS codec, or on the opposite encode data with another RS codec and decode this data with this library.

  • errors-and-erasures decoding allows to decode both erasures (where you know the position of the corrupted characters) and errors (where you don’t know where are the corrupted characters) at the same time (because you can decode more erasures than errors, so if you can provide even a few know corrupted characters positions, this can help a lot the decoder to repair the message).

  • documented: following literate programming guidelines, you should understand everything you need about RS by reading the code and the comments.

  • mathematical nomenclatura: for example, contrary to most other RS libraries, you will see a clear distinction between the different mathematical constructs, such as the Galois Fields numbers are clearly separated from the generic Polynomial objects, and both are separated from the Reed-Solomon algorithm, which makes use of both of those constructs. For this purpose, object-oriented programming was chosen to design the architecture of the library, although obviously at the expense of a bit of performance. However, this library favors mathematical clarity and documentation over performance (even if performance is optimized whenever possible).

  • pure-Python means that there are no dependencies whatsoever apart from the Python interpreter. This means that this library should be resilient in the future (since it doesn’t depend on external libraries who can become broken with time, see software rot), and you can use it on any system where Python can be installed (including online cloud services).

The authors tried their best to extensively document the algorithms. However, a lot of the math involved is non-trivial and we can’t explain it all in the comments. To learn more about the algorithms, see these resources:

Also, here’s a copy of the presentation one of the authors gave to the class Spring 2010 on his experience implementing this library. The LaTeX source is in the presentation directory.

http://www.cs.duke.edu/courses/spring10/cps296.3/decoding_rs.pdf

The code was lately updated to support errors-and-erasures decoding (both at the same time), and to be universal (you can supply the parameters to be compatible with almost any other RS codec).

The codec has decent performances if you use PyPy with the fast methods (~1 MB/s), but it would be faster if we drop the object-oriented design (implementing everything in functions), but this would be at the expense of mathematical clarity. If you are interested, see the reedsolo library by Tomer Filiba, which is exactly the same implementation but only functional without objects (results in about 5x speedup).

Files

rs.py

Holds the Reed-Solomon Encoder/Decoder object

polynomial.py

Contains the Polynomial object (pure-python)

ff.py

Contains the GF2int object representing an element of the GF(2^p) field, with p being 8 by default (pure-python)

polynomial.pyx

Cython implementation of polynomial.py with equivalent functions (optional)

ff.pyx

Cython implementation of ff.py with equivalent functions (optional)

Documentation

unireedsolomon.rs.RSCoder(n, k, generator=3, prim=0x11b, fcr=1, c_exp=8)

Creates a new Reed-Solomon Encoder/Decoder object configured with the given n and k values. n is the length of a codeword, must be less than 256 k is the length of the message, must be less than n generator, prim and fcr parametrize the Galois Field that will be built c_exp is the Galois Field range (ie, 8 means GF(2^8) = GF(256)), which is both the limit for one symbol’s value, and the maximum length of a message+ecc.

The code will have error correcting power (ie, maximum number of repairable symbols) of 2*e+v <= (n-k), where e is the number of errors and v the number of erasures.

The typical RSCoder is RSCoder(255, 223)

RSCoder.encode(message, poly=False, k=None)

Encode a given string with reed-solomon encoding. Returns a byte string with the k message bytes and n-k parity bytes at the end.

If a message is < k bytes long, it is assumed to be padded at the front with null bytes (ie, a shortened Reed-Solomon code).

The sequence returned is always n bytes long.

If poly is not False, returns the encoded Polynomial object instead of the polynomial translated back to a string (useful for debugging)

You can change the length (number) of parity/ecc bytes at encoding by setting k to any value between [1, n-1]. This allows to create only one RSCoder and then use it with a variable redundancy rate.

RSCoder.encode_fast(message, poly=False, k=None)

Same as encode() but using faster algorithms and optimization tricks.

RSCoder.decode(message_ecc, nostrip=False, k=None, erasures_pos=None, only_erasures=False):

Given a received string or byte array message_ecc (composed of a message string + ecc symbols at the end), attempts to decode it. If it’s a valid codeword, or if there are no more than 2*e+v <= (n-k) erratas (called the Singleton bound), the message is returned.

You can provide the erasures positions as a list to erasures_pos. For example, if you have “hella warld” and you know that a is an erasure, you can provide the list erasures_pos=[4, 7]. You can correct twice as many erasures than errors, and if some provided erasures are wrong (they are correct symbols), then there’s no problem, they will be repaired just fine (but will count towards the Singleton bound). You can also specify that you are sure there are only erasures and no errors at all by setting only_erasures=True.

A message always has k bytes, if a message contained less it is left padded with null bytes (punctured RS code). When decoded, these leading null bytes are stripped, but that can cause problems if decoding binary data. When nostrip is True, messages returned are always k bytes long. This is useful to make sure no data is lost when decoding binary data.

Note that RS can correct errors both in the message and the ecc symbols.

RSCoder.decode_fast(message_ecc, nostrip=False, k=None, erasures_pos=None, only_erasures=False):

Same as decode() but using faster algorithms and optimization tricks.

RSCoder.check(message_ecc, k=None)

Verifies the codeword (message + ecc symbols at the end) is valid by testing that the code as a polynomial code divides g, or that the syndrome is all 0 coefficients. The result is not foolproof: if it’s False, you’re sure the message was corrupted (or that you used the wrong RS parameters), but if it’s True, it’s either that the message is correct, or that there are too many errors (ie, more than the Singleton bound) for RS to do anything about it. returns True/False

RSCoder.check_fast(message_ecc, k=None)

Same as check() but using faster algorithms and optimization tricks.

unireedsolomon.ff.find_prime_polynomials(generator=2, c_exp=8, fast_primes=False, single=False)

Compute the list of prime polynomials for the given generator and galois field characteristic exponent. You can then use this prime polynomial to specify the mandatory “prim” parameter, particularly if you are using a larger Galois Field (eg, 2^16).

Internal API

Besides the main RSCoder object, two other objects are used in this implementation: Polynomial and GF2int. Their use is not specifically tied to the coder or even to the Reed-Solomon algorithm, they are just generic mathematical constructs respectively representing polynomials and Galois field’s number of base 2.

You do not need to know about the internal API to use the RS codec, this is just left as a documentation for the reader interested into dwelling inside the mathematical constructs.

polynomial.Polynomial(coefficients=[], **sparse)

There are three ways to initialize a Polynomial object. 1) With a list, tuple, or other iterable, creates a polynomial using the items as coefficients in order of decreasing power

2) With keyword arguments such as for example x3=5, sets the coefficient of x^3 to be 5

3) With no arguments, creates an empty polynomial, equivalent to Polynomial([0])

>>> print Polynomial([5, 0, 0, 0, 0, 0])
5x^5
>>> print Polynomial(x32=5, x64=8)
8x^64 + 5x^32
>>> print Polynomial(x5=5, x9=4, x0=2)
4x^9 + 5x^5 + 2

Polynomial objects export the following standard functions that perform the expected operations using polynomial arithmetic. Arithmetic of the coefficients is determined by the type passed in, so integers or GF2int objects could be used, the Polynomial class is agnostic to the type of the coefficients.

__add__
__divmod__
__eq__
__floordiv__
__hash__
__len__
__mod__
__mul__
__ne__
__neg__
__sub__
evaluate(x)
degree()
    Returns the degree of the polynomial
get_coefficient(degree)
    Returns the coefficient of the specified term
ff.GF2int(value)

Instances of this object are elements of the field GF(2^p) and instances are integers in the range 0 to (2^p)-1. By default, the field is GF(2^8) and instances are integers in the range 0 to 255 and is defined using the irreducable polynomial 0x11b or in binary form: x^8 + x^4 + x^3 + x + 1 and using 3 as the generator for the exponent table and log table.

You can however use other parameters for the Galois Field, using the init_lut() function.

ff.find_prime_polynomials(generator=2, c_exp=8, fast_primes=False, single=False)

Find the list of prime polynomials to use to generate the look-up tables for your field.

ff.init_lut(generator=3, prim=0x11b, c_exp=8)

Generate the look-up tables given the parameters. This effectively parametrize your Galois Field (ie, generator=2, prim=0x1002d, c_exp=16) will generate a GF(2^16) field.

The GF2int class inherits from int and supports all the usual integer operations. The following methods are overridden for arithmetic in the finite field GF(2^p)

__add__
__div__
__mul__
__neg__
__pow__
__radd__
__rdiv__
__rmul__
__rsub__
__sub__
inverse()
    Multiplicative inverse in GF(2^p)

Example implementations

Image Encoder

imageencode.py is an example script that encodes codewords as rows in an image. It requires PIL to run.

Usage: python imageencode.py [-d] <image file>

Without the -d flag, imageencode.py will encode text from standard in and output it to the image file. With -d, imageencode.py will read in the data from the image and output to standard out the decoded text.

An example is included: exampleimage.png. Try decoding it as-is, then open it up in an image editor and paint some vertical stripes on it. As long as no more than 16 pixels per row are disturbed, the text will be decoded correctly. Then draw more stripes such that more than 16 pixels per row are disturbed and verify that the message is decoded improperly.

Notice how the parity data looks different–the last 32 pixels of each row are colored differently. That’s because this particular image contains encoded ASCII text, which generally only has bytes from a small range (the alphabet and printable punctuation). The parity data, however, is binary and contains bytes from the full range 0-255. Also note that either the data area or the parity area (or both!) can be disturbed as long as no more than 16 bytes per row are disturbed.

Cython implementation

If either a C compiler or Cython is found, rs.py will automatically load the Cython implementations (the *.pyx files). These are provided as optimized versions of the pure-python implementations, with equivalent functionalities. The goal was to get a speedup, which is the case, but using PyPy on the pure-python implementation provides a significantly higher speedup than the Cython implementation. The Cython implementations are still provided for the interested reader, but the casual user is not advised to use them. If you want to encode and decode fast, use PyPy.

Authors and licence

Written from scratch by Andrew Brown <brownan@gmail.com> <brownan@cs.duke.edu> (c) 2010.

Upgraded and maintained by Stephen Karl Larroque <LRQ3000@gmail.com> in 2015-2020.

Licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unireedsolomon-1.0.2.tar.gz (295.9 kB view details)

Uploaded Source

Built Distributions

unireedsolomon-1.0.2.win-amd64-py3.7.exe (814.2 kB view details)

Uploaded Source

unireedsolomon-1.0.2-cp37-cp37m-win_amd64.whl (210.6 kB view details)

Uploaded CPython 3.7m Windows x86-64

File details

Details for the file unireedsolomon-1.0.2.tar.gz.

File metadata

  • Download URL: unireedsolomon-1.0.2.tar.gz
  • Upload date:
  • Size: 295.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for unireedsolomon-1.0.2.tar.gz
Algorithm Hash digest
SHA256 0f300814829ab078b4aea13b0f7da31024d1aba919db56f9aa8b58a73f331e75
MD5 bb00fda89670d5ffbde4c5629cddb4e0
BLAKE2b-256 a61941c4c676e3eaf7c8a092eeeed1f94f56740a0e8b8f6a66292f3c1a6d3be1

See more details on using hashes here.

File details

Details for the file unireedsolomon-1.0.2.win-amd64-py3.7.exe.

File metadata

  • Download URL: unireedsolomon-1.0.2.win-amd64-py3.7.exe
  • Upload date:
  • Size: 814.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for unireedsolomon-1.0.2.win-amd64-py3.7.exe
Algorithm Hash digest
SHA256 c49d9a537332408ac516b344ced8e42c204a9df38072ccdb60a4fe01e1fff143
MD5 c74eeadef96c5596c326ba49da1f7fa6
BLAKE2b-256 1a770b25559e71f0a6cf871022a484ab278a313bab78d44f9de7fab8f822e8e6

See more details on using hashes here.

File details

Details for the file unireedsolomon-1.0.2-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: unireedsolomon-1.0.2-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 210.6 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for unireedsolomon-1.0.2-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 be422b7e8ed94e6420fc563788ea291d034475f2eafea947d524b85394bc904f
MD5 c41f9f14d95eed84f0f1bf72c28cc7c4
BLAKE2b-256 244a9aaa4115a7e3c9af509bc47321fb4569f450fdae1a7ca3f263846dd34e03

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page