Skip to main content

Python package to generate ugly real-looking csvs.

Project description

Ugly CSV generator

Pypi project Pypi total project downloads LICENSE Python version Github Actions Codacy Badge

Python package to automatically uglify CSVs. Why? To improve the testing capabilities of pipelines that must be able to support strongly malformed input data.

All the malformation automated here are non-destructive, meaning they introduce confusion in the data but do not mangle or destroy information.

The inspiration for the automated malformation are all from real-life CSVs (sigh)

Humans will always surprise us with the ever-new malformed input data, but hey, we can try to best ruining the test CSVs!

How do I install this package?

As usual, just download it using pip:

pip install ugly_csv_generator

Usage example

To ruin a CSV you can use the following snippet. In the following example we use a random_csv_generator to generate a random "healthy" csv.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(csv)

The initial CSV will look something like:

region province surname
Calabria Catanzaro Rossi
Sicilia Ragusa Pinna
Lombardia Varese Sbrana
Lazio Roma Mair
Sicilia Messina Ferrari

The result uglified CSV will look something like this:

1 2 3 4 5 6
0 //// #RIF! #RIF! 0 .... 0
1 "('surname',)('.',)(0,)" region province surname "('province',)('_',)(1,)"
2 //////// region "province " "surname " 0 0
3 /////// "region " "province " "surname " #RIF! #RIF!
4 Calabria "Catanzaro " "Rossi " 0 --------
5 " " Sicilia Ragusa "Pinna " " "
6 ------- #RIF! #RIF! 0 " "
7 ///////// "Lombardia " "Varese " Sbrana ///////////
8 --------- "Lazio " "Roma " "Mair "
9 -------- 0 ///// --- 0 /////
10 #RIF! "Sicilia " Messina "Ferrari " 0
11 0 ----- " " -------- 0

Available uglifications

Let's take a look at the available uglifications! All of these options are available as keyword arguments in the uglify function.

We start by taking a look at the same example from before, but now we expand all of the available options:

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

ugly = uglify(
    csv,
    empty_columns = True,
    empty_rows = True,
    duplicate_schema = True,
    empty_padding = True,
    nan_like_artefacts = True,
    replace_zeros = True,
    replace_ones = True,
    satellite_artefacts = False,
    random_spaces = True,
    include_unicode = True,
    verbose = True,
    seed = 42,
)

Let's break down all of the available options with adequate examples. In all cases, we will use the following CSV as a starting point, obtained from the random_csv_generator package:

from random_csv_generator import random_csv

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

The initial CSV will look something like:

region province surname
0 Veneto Vicenza Sacco
1 Abruzzo L' Aquila Sala
2 Sicilia Messina Sanna
3 Marche Ancona Gallo
4 Lazio Frosinone Gallo

Empty columns

In the following example we will solely add empty columns to the CSV. This phenomenon is common when the data-entry person leaves empty columns in the middle of the table.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = True,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    seed = 424,
)

The result will look something like:

region_2 region_0 1 region region_0 province surname
0 Veneto Vicenza Sacco
1 Abruzzo L Aquila Sala
2 Sicilia Messina Sanna
3 Marche Ancona Gallo
4 Lazio Frosinone Gallo

Empty rows

In the following example we will solely add empty rows to the CSV. This phenomenon is common when the data-entry person leaves empty rows in the middle of the table.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = True,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    seed = 424,
)

The result will look something like:

region province surname
0 Veneto Vicenza Sacco
1 Abruzzo L Aquila Sala
2 Sicilia Messina Sanna
3
4 Marche Ancona Gallo
5 Lazio Frosinone Gallo
6

Duplicate schema

In the following example we will solely duplicate the schema of the CSV. This phenomenon is common when the data-entry person copies the header of the table multiple times, or several CSVs are concatenated together without removing the header.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = True,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    seed = 424,
)

The result will look something like:

region province surname
0 Veneto Vicenza Sacco
1 Abruzzo L Aquila Sala
2 Sicilia Messina Sanna
3 region province surname
4 Marche Ancona Gallo
5 Lazio Frosinone Gallo
6 region province surname

Empty padding

In the following example we will solely add empty padding to the CSV. Padding in this context means adding empty cells around the CSV, represing when the data-entry person started the table somewhere in the middle of a sheet document.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = True,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    seed = 424,
)

The result will look something like:

0 1 2 3 4 5
0 region province surname
1 Veneto Vicenza Sacco
2 Abruzzo L Aquila Sala
3 Sicilia Messina Sanna
4 Marche Ancona Gallo
5 Lazio Frosinone Gallo
6
7
8
9
10
11

NaN-like artefacts

In the following example we will solely add NaN-like artefacts to the CSV. This phenomenon is common when the data-entry person follows some custom notation, which may be their own or office standard, to represent missing values. In some cases, this may be a string like "N/A", "NaN", "NULL", or even (one or more) "-", "\n", or "\t". Since the objective of this package is to not destroy information, it will solely replace NaN values with NaN-like artefacts.

In the example we considered earlier, we do not have any NaN values, so we will add some to the CSV by also enabling the empty_rows option.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = True,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = True,
    satellite_artefacts = False,
    random_spaces = False,
    seed = 424,
)

The result will look something like:

region province surname
0 Veneto Vicenza Sacco
1 Abruzzo L Aquila Sala
2 Sicilia Messina Sanna
3 " " ... ----
4 Marche Ancona Gallo
5 Lazio Frosinone Gallo
6 " -------

Unicode variant

The NaN-like artefacts can also be applied with unicode characters. This is useful to test the robustness of the CSV reader to unicode characters.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = True,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = True,
    satellite_artefacts = False,
    random_spaces = False,
    include_unicode = True,
    seed = 424,
)

The result will look something like:

region province surname
0 Calabria Catanzaro Rossi
1 Sicilia Ragusa Pinna
2 Lombardia Varese Sbrana
3 . 0
4 Lazio Roma Mair
5 Sicilia Messina Ferrari
6 ________

Replace zeros

In the following example we will solely replace zeros with a custom value. In different places in the word and different offices, zeros may be represented in different ways. Characters for zero from different alphabets, or even different symbols, may be used to represent zero. Note that this latter functionality is only enabled if the include_unicode option is set to True.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

# We add a column with zeros
csv["zero"] = 0

ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    replace_zeros = True,
    include_unicode = False,
    seed = 424,
)

The result will look something like:

region province surname zero
0 Calabria Catanzaro Rossi 0
1 Sicilia Ragusa Pinna o
2 Lombardia Varese Sbrana 0
3 Lazio Roma Mair 0
4 Sicilia Messina Ferrari O

Unicode variant

The replace zeros can also be applied with unicode characters. This is useful to test the robustness of the CSV reader to unicode characters.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

# We add a column with zeros
csv["zero"] = 0

ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    replace_zeros = True,
    include_unicode = True,
    seed = 424,
)

The result will look something like:

region province surname zero
0 Calabria Catanzaro Rossi o
1 Sicilia Ragusa Pinna
2 Lombardia Varese Sbrana ο
3 Lazio Roma Mair 𝟘
4 Sicilia Messina Ferrari

Replace ones

In the following example we will solely replace ones with a custom value. In different places in the word and different offices, ones may be represented in different ways. Characters for one from different alphabets, or even different symbols, may be used to represent one. Note that this latter functionality is only enabled if the include_unicode option is set to True.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

# We add a column with ones
csv["one"] = 1

ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    replace_ones = True,
    include_unicode = False,
    seed = 424,
)

The result will look something like:

region province surname one
0 Calabria Catanzaro Rossi 1
1 Sicilia Ragusa Pinna l
2 Lombardia Varese Sbrana 1
3 Lazio Roma Mair 1
4 Sicilia Messina Ferrari I

Unicode variant

The replace ones can also be applied with unicode characters. This is useful to test the robustness of the CSV reader to unicode characters.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example

# We add a column with ones
csv["one"] = 1

ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = False,
    replace_ones = True,
    include_unicode = True,
    seed = 424,
)

The result will look something like:

region province surname one
0 Calabria Catanzaro Rossi ¹
1 Sicilia Ragusa Pinna
2 Lombardia Varese Sbrana l
3 Lazio Roma Mair 1
4 Sicilia Messina Ferrari

Satellite artefacts

In the following example we will solely add satellite artefacts to the CSV. A satellite artefact is likely the quirkiest and most annoying artefact to deal with. It represents the situation where the data-entry person adds some notes on the side of the table. A real-world example of this which I have encountered is when the data-entry person adds the office lunch order on the side of the table and forgets to remove it.

The package offers a few satellite artefacts encountered in the wild.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = True,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = True,
    random_spaces = False,
    seed = 424,
)

The result will look something like:

0 1 2 3 4
0 random
1 random
2 caso
3 region province surname
4 Veneto Vicenza Sacco
5 Abruzzo L Aquila Sala
6 Sicilia Messina Sanna
7 Marche Ancona Gallo
8 Lazio Frosinone Gallo
9
10
11
12
13
14
15 person food
16 Jerry kebab
17 Steven rice with paprika
18 Vale pizza mit ananas

Random spaces

In the following example we will solely add random spaces around the values in the CSV. This phenomenon is common when the data-entry person is not careful with the spaces around the values in the table and adds some random spaces, for instance to visually align the values.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = True,
    seed = 424,
)

The result will look something like:

region province surname
0 " Veneto " " Vicenza " " Sacco "
1 " Abruzzo " " L Aquila " " Sala "
2 " Sicilia " " Messina " " Sanna "
3 " Marche " " Ancona " " Gallo "
4 " Lazio " " Frosinone " " Gallo "

Unicode variant

The random spaces uglification can also be applied with unicode characters. This is useful to test the robustness of the CSV reader to unicode characters.

from random_csv_generator import random_csv
from ugly_csv_generator import uglify

csv = random_csv(5) # CSV with 5 lines
csv = csv[csv.columns[:3]] # We will use only the first 3 columns for this example
ugly = uglify(
    csv,
    empty_columns = False,
    empty_rows = False,
    duplicate_schema = False,
    empty_padding = False,
    nan_like_artefacts = False,
    satellite_artefacts = False,
    random_spaces = True,
    include_unicode = True,
    seed = 424,
)

Due to limitations of the markdown rendering, we cannot show the result here. You can run the code snippet to see the result. It's just that damn cursed!

Contributing

You have encountered a new type of uglification that you would like to add to the package? You have a suggestion for a new feature or improvement? You have found a bug? Open an issue or a pull request, I will be happy to help you!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ugly_csv_generator-1.1.2.tar.gz (21.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page