
Ugly CSV generator

A Python package to automatically uglify CSVs. Why? To help test pipelines that must cope with badly malformed input data.

All the malformations automated here are non-destructive: they introduce confusion in the data but do not mangle or destroy information.

The automated malformations are all inspired by real-life CSVs (sigh).

Humans will always surprise us with ever-new malformed input data, but hey, we can do our best to ruin the test CSVs!
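To make "non-destructive" concrete, here is a minimal standard-library sketch. It is not this package's implementation: the helpers `pad_cells`, `add_empty_rows` and `clean` are hypothetical names made up for illustration. It applies two harmless malformations (trailing spaces and empty rows) and then checks that a simple cleaning step recovers the original table exactly.

```python
def pad_cells(rows, padding=" "):
    """Append trailing whitespace to every cell (hypothetical helper)."""
    return [[cell + padding for cell in row] for row in rows]


def add_empty_rows(rows, width):
    """Add an all-empty row after every original row (hypothetical helper)."""
    ugly = []
    for row in rows:
        ugly.append(row)
        ugly.append([""] * width)
    return ugly


def clean(rows):
    """Undo both malformations: strip whitespace, then drop all-empty rows."""
    stripped = [[cell.strip() for cell in row] for row in rows]
    return [row for row in stripped if any(row)]


original = [
    ["region", "province", "surname"],
    ["Calabria", "Catanzaro", "Rossi"],
    ["Sicilia", "Ragusa", "Pinna"],
]

ugly = add_empty_rows(pad_cells(original), width=3)
recovered = clean(ugly)
```

Because nothing was overwritten, `recovered` equals `original`. A pipeline under test is expected to achieve the same kind of round trip on uglified input.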

How do I install this package?

As usual, just install it using pip:

pip install ugly_csv_generator

Usage example

To ruin a CSV, you can use the following snippet. This example uses the random_csv_generator package to generate a random "healthy" CSV.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(csv)

The initial CSV will look something like:

region	province	surname
Calabria	Catanzaro	Rossi
Sicilia	Ragusa	Pinna
Lombardia	Varese	Sbrana
Lazio	Roma	Mair
Sicilia	Messina	Ferrari

The resulting uglified CSV will look something like this:

	1	2	3	4	5	6
0	////	#RIF!	#RIF!	0	....	0
1	"('surname',)('.',)(0,)"	region	province	surname	"('province',)('_',)(1,)"
2	////////	region	"province "	"surname "	0	0
3	///////	"region "	"province "	"surname "	#RIF!	#RIF!
4		Calabria	"Catanzaro "	"Rossi "	0	--------
5	" "	Sicilia	Ragusa	"Pinna "	" "
6	-------		#RIF!	#RIF!	0	" "
7	/////////	"Lombardia "	"Varese "	Sbrana	///////////
8	---------	"Lazio "	"Roma "	"Mair "
9	--------	0	/////	---	0	/////
10	#RIF!	"Sicilia "	Messina	"Ferrari "	0
11	0		-----	" "	--------	0

Available uglifications

Let's take a look at the available uglifications! All of these options are available as keyword arguments in the uglify function.

We start from the same example as before, this time spelling out all of the available options:

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

ugly = uglify(
    csv,
    empty_columns = True,
    empty_rows = True,
    duplicate_schema = True,
    empty_padding = True,
    nan_like_artefacts = True,
    replace_zeros = True,
    replace_ones = True,
    satellite_artefacts = False,
    random_spaces = True,
    include_unicode = True,
    verbose = True,
    seed = 42,
)

Let's break down each of the available options with an example. In all cases, we will use the following CSV as a starting point, obtained from the random_csv_generator package:

from random_csv_generator import random_csv

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

The initial CSV will look something like:

	region	province	surname
0	Veneto	Vicenza	Sacco
1	Abruzzo	L' Aquila	Sala
2	Sicilia	Messina	Sanna
3	Marche	Ancona	Gallo
4	Lazio	Frosinone	Gallo

Empty columns

In the following example we add only empty columns to the CSV. This often happens when whoever entered the data left empty columns in the middle of the table.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = True,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    seed = 424,
)
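
As a rough illustration of the idea behind this option, the sketch below inserts empty columns at seeded random positions and then removes them again. It uses only the standard library and plain lists of rows. The helpers `insert_empty_columns` and `drop_empty_columns` are hypothetical and are not how ugly_csv_generator itself works.

```python
import random


def insert_empty_columns(rows, count, seed):
    """Insert `count` empty columns at random positions, the same in every row."""
    rng = random.Random(seed)
    width = len(rows[0])
    # Pick the insertion points once so that all rows stay aligned.
    positions = sorted(rng.randint(0, width) for _ in range(count))
    ugly = []
    for row in rows:
        new_row = list(row)
        # Insert from the right so that earlier positions are not shifted.
        for position in reversed(positions):
            new_row.insert(position, "")
        ugly.append(new_row)
    return ugly


def drop_empty_columns(rows):
    """Remove every column whose cells are all empty."""
    keep = [i for i in range(len(rows[0])) if any(row[i] for row in rows)]
    return [[row[i] for i in keep] for row in rows]


table = [
    ["region", "province", "surname"],
    ["Veneto", "Vicenza", "Sacco"],
    ["Abruzzo", "L' Aquila", "Sala"],
]

ugly = insert_empty_columns(table, count=2, seed=424)
restored = drop_empty_columns(ugly)
```

Here `ugly` has five columns instead of three, and `restored` equals `table` again. Note that this cleaning step would also drop a genuinely empty column from the original data, so a real pipeline may need a more careful rule.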