Python package to remove common ugliness from a CSV-like file

✂️ CSV Trimming

CSV Trimming is a Python package that takes messy CSVs (the kind you get from scraping websites, legacy systems, or poorly managed data) and transforms them into clean, well-formatted CSVs with a single method call. It needs no complex setup and no large language models. It is simple, straightforward, and generally gets the job done.

How do I install this package?

As usual, install it with pip:

pip install csv_trimming

How do I use this package?

The package is simple to use: load your CSV and pass it to the trimmer.

import pandas as pd
from csv_trimming import CSVTrimmer

# Load your csv
csv = pd.read_csv("tests/documents/noisy/sicilia.csv")
# Instantiate the trimmer
trimmer = CSVTrimmer()
# And trim it
trimmed_csv = trimmer.trim(csv)
# That's it!

For instance, your input CSV to clean up may look like this at the beginning:

	0	1	2	3	4
0	#RIF!	#RIF!	.........	///	-----
1	('surname',)('-',)(0,)	region	(""('surname',)('-',)(0,"),)(' ',)(1,)	province	surname
2	------	#RIF!	#RIF!
3	#RIF!	Calabria	-------	Catanzaro	Rossi
4	0	Sicilia	_____	Ragusa	Pinna
5	""	Lombardia	------	Varese	Sbrana
6	0	Lazio	__	Roma	Mair
7	_	Sicilia	#RIF!	Messina	Ferrari
8	-----	..	""	0	---------

And after the trimming, it will look like this:

	region	province	surname
0	Calabria	Catanzaro	Rossi
1	Sicilia	Ragusa	Pinna
2	Lombardia	Varese	Sbrana
3	Lazio	Roma	Mair
4	Sicilia	Messina	Ferrari

Magic!

Advanced trimming with row correlation

Sometimes the CSVs you are working with have correlated rows, meaning that part of a given row has been moved into the next row. This is common when a data-entry clerk wants the whole table to fit on their screen and splits each row in two to achieve that. While this is clearly bad practice, it happens in the real world, and the CSV Trimmer can handle it with a little help.

You just need to provide a function that defines which rows are correlated, and the CSV Trimmer will take care of the rest. While in this example we are using a rather simple function and a relatively clean CSV, the package can handle more complex cases.

from typing import Tuple
import pandas as pd
from csv_trimming import CSVTrimmer

def simple_correlation_callback(
    current_row: pd.Series,
    next_row: pd.Series
) -> Tuple[bool, pd.Series]:
    """Return the correlation between two rows.
    
    Parameters
    ----------
    current_row : pd.Series
        The current row being analyzed in the DataFrame.
    next_row : pd.Series
        The next row in the DataFrame.

    Returns
    -------
    Tuple[bool, pd.Series]
        A tuple with a boolean indicating if the rows are correlated
        and a Series with the merged row.
    """

    # All of the rows that have a subsequent correlated row are
    # non-empty, and the subsequent correlated rows are always
    # with the first cell empty.
    if pd.isna(next_row.iloc[0]) and all(pd.notna(current_row)):
        return True, pd.concat(
            [
                current_row,
                pd.Series({"surname": next_row.iloc[-1]}),
            ]
        )

    return False, current_row

csv = pd.read_csv("tests/test.csv")
trimmer = CSVTrimmer(simple_correlation_callback)
result = trimmer.trim(csv)

In this case, our CSV looked like this at the beginning:

	region	province
0	Campania	Caserta
1		Ferrero
2	Liguria	Imperia
3		Conti
4	Puglia	Bari
5		Fabris
6	Sardegna	Medio Campidano
7		Conti
8	Lazio	Roma
9		Fabbri

And after the trimming, it will look like this:

	region	province	surname
0	Campania	Caserta	Ferrero
1	Liguria	Imperia	Conti
2	Puglia	Bari	Fabris
3	Sardegna	Medio Campidano	Conti
4	Lazio	Roma	Fabbri
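
To make the callback contract more concrete, here is a minimal sketch in plain Python of how a function with this shape could be driven over a table. It is an illustration only: `merge_if_continuation` and `apply_row_callback` are hypothetical names, rows are modelled as dictionaries with `None` for empty cells rather than as `pd.Series`, and the loop is an assumption about the general idea, not the package's actual implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def merge_if_continuation(current: Row, following: Row) -> Tuple[bool, Row]:
    """Hypothetical callback: the following row continues the current one
    when its first cell is empty and the current row has no empty cells."""
    first_cell = next(iter(following.values()))
    if first_cell is None and all(v is not None for v in current.values()):
        last_cell = list(following.values())[-1]
        return True, {**current, "surname": last_cell}
    return False, current


def apply_row_callback(
    rows: List[Row],
    callback: Callable[[Row, Row], Tuple[bool, Row]],
) -> List[Row]:
    """Walk the rows in order, merging each row with its successor
    whenever the callback reports the two as correlated."""
    merged: List[Row] = []
    i = 0
    while i < len(rows):
        if i + 1 < len(rows):
            correlated, combined = callback(rows[i], rows[i + 1])
            if correlated:
                merged.append(combined)
                i += 2  # the successor was absorbed, so skip it
                continue
        merged.append(rows[i])
        i += 1
    return merged


rows = [
    {"region": "Campania", "province": "Caserta"},
    {"region": None, "province": "Ferrero"},
    {"region": "Liguria", "province": "Imperia"},
    {"region": None, "province": "Conti"},
]
result = apply_row_callback(rows, merge_if_continuation)
```

Each pair of rows collapses into one row that carries the extra `surname` value, which mirrors the before and after tables shown above.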

More examples

Here follow some examples of the package in action.

Case with duplicated schemas

Sometimes, when multiple CSVs are concatenated carelessly, you may end up with duplicated schemas, that is, the header row repeated among the data rows. The CSV Trimmer detects rows that match the detected header, and it can (optionally) remove them.

import pandas as pd
from csv_trimming import CSVTrimmer

# Load your csv
csv = pd.read_csv("tests/documents/noisy/duplicated_schema.csv")
# Instantiate the trimmer
trimmer = CSVTrimmer()
# And trim it
trimmed_csv = trimmer.trim(csv, drop_duplicated_schema=True)
# That's it!

For instance, your input CSV to clean up may look like this at the beginning:

	0	1	2	3	4	5	6	7
0	#RIF!	////	#RIF!	#RIF!	0	....	0	0
1		('surname',)('.',)(0,)	region	province	surname	('province',)('_',)(1,)		0
2	0	////////	region	province	surname	0	0
3	_____	///////	region	province	surname	#RIF!	#RIF!
4			Puglia	Bari	Zanetti	0	--------
5	0		Piemonte	Alessandria	Fabbri
6	0	-------		#RIF!	#RIF!	0		----
7	/////////	/////////	Sicilia	Agrigento	Ferretti	//////////		----------
8	__	--------	Campania	Napoli	Belotti		///
9		--------	0	/////	---	0	/////	----------
10	-----	#RIF!	Liguria	Savona	Casini	0		#RIF!
11	...	0		-----		--------	0	0

And after the trimming, it will look like this:

	region	province	surname
0	Puglia	Bari	Zanetti
1	Piemonte	Alessandria	Fabbri
2	Sicilia	Agrigento	Ferretti
3	Campania	Napoli	Belotti
4	Liguria	Savona	Casini
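
The idea of dropping rows that repeat the header can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: `drop_repeated_header_rows` is a hypothetical helper, the header is assumed to be already known, and the comparison ignores case and surrounding whitespace. The package's own detection logic may work differently.

```python
from typing import List


def drop_repeated_header_rows(
    header: List[str], rows: List[List[str]]
) -> List[List[str]]:
    """Return the rows that are not a repetition of the header.

    Cells are compared after stripping whitespace and lowercasing.
    """
    normalized_header = [cell.strip().lower() for cell in header]
    return [
        row
        for row in rows
        if [cell.strip().lower() for cell in row] != normalized_header
    ]


header = ["region", "province", "surname"]
rows = [
    ["region", "province", "surname"],
    ["Puglia", "Bari", "Zanetti"],
    ["region", "province", "surname"],
    ["Piemonte", "Alessandria", "Fabbri"],
]
cleaned = drop_repeated_header_rows(header, rows)
```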

Case with only padding

Sometimes the data-entry clerk starts filling in a table offset from the top-left corner and exports it together with the empty cells all around it. We call such cells "padding". The CSV Trimmer can detect and remove them.

import pandas as pd
from csv_trimming import CSVTrimmer

# Load your csv
csv = pd.read_csv("tests/documents/noisy/padding.csv")

# Instantiate the trimmer
trimmer = CSVTrimmer()

# And trim it
trimmed_csv = trimmer.trim(csv, drop_padding=True)

For instance, your input CSV to clean up may look like this at the beginning:

	region	province	surname
0
1
2	region	province	surname
3	Campania	Caserta	Ferrero
4	Liguria	Imperia	Conti
5	Puglia	Bari	Fabris
6	Sardegna	Medio Campidano	Conti
7	Lazio	Roma	Fabbri
8
9
10
11

And after the trimming, it will look like this:

	region	province	surname
0	Campania	Caserta	Ferrero
1	Liguria	Imperia	Conti
2	Puglia	Bari	Fabris
3	Sardegna	Medio Campidano	Conti
4	Lazio	Roma	Fabbri
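
As a rough illustration of what removing padding means, the following plain-Python sketch drops every fully empty row and column from a table stored as a list of lists. `strip_padding` is a hypothetical helper, not part of the package, and it removes all empty rows and columns rather than only the surrounding ones. For a table whose only empty cells are padding, the result is the same.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def strip_padding(table: Table) -> Table:
    """Drop rows and columns in which every cell is empty.

    A cell counts as empty when it is None or contains only whitespace.
    """

    def is_empty(cell: Optional[str]) -> bool:
        return cell is None or cell.strip() == ""

    # Remove fully empty rows first.
    rows = [row for row in table if not all(is_empty(c) for c in row)]
    if not rows:
        return []

    # Pad short rows so that every row has the same width.
    width = max(len(row) for row in rows)
    rows = [row + [None] * (width - len(row)) for row in rows]

    # Keep only the columns that contain at least one non-empty cell.
    keep = [j for j in range(width) if not all(is_empty(r[j]) for r in rows)]
    return [[row[j] for j in keep] for row in rows]


table = [
    [None, None, None, None],
    [None, "region", "province", "surname"],
    [None, "Campania", "Caserta", "Ferrero"],
    [None, None, None, None],
]
trimmed = strip_padding(table)
```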

Command Line Interface

The package also provides a command line interface to trim CSVs. It is registered by the package's setup.py, so once you have installed the package with pip you can use it immediately from the command line.

You can use it by running the following command:

csv-trim tests/documents/noisy/sicilia.csv tests/documents/noisy/sicilia_trimmed.csv

It supports the following options to keep it from attempting some trimmings:

--keep-padding: Do not attempt to remove padding.
--keep-duplicated-schema: Do not attempt to remove duplicated schemas.
--no-restore-header: Do not attempt to restore the header.

For instance:

csv-trim tests/documents/noisy/sicilia.csv tests/documents/noisy/sicilia_trimmed.csv --keep-padding
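
If you need to call the command line interface from a Python script, one possible approach is to assemble the argument list and hand it to `subprocess.run`. The sketch below uses only the flags listed above. The helper names `build_csv_trim_command` and `run_csv_trim` are hypothetical, and the sketch assumes that `csv-trim` is available on your `PATH`.

```python
import subprocess
from typing import List


def build_csv_trim_command(
    input_path: str,
    output_path: str,
    keep_padding: bool = False,
    keep_duplicated_schema: bool = False,
    restore_header: bool = True,
) -> List[str]:
    """Assemble the csv-trim argument list from the documented flags."""
    command = ["csv-trim", input_path, output_path]
    if keep_padding:
        command.append("--keep-padding")
    if keep_duplicated_schema:
        command.append("--keep-duplicated-schema")
    if not restore_header:
        command.append("--no-restore-header")
    return command


def run_csv_trim(command: List[str]) -> None:
    """Run the assembled command, raising if it exits with an error."""
    subprocess.run(command, check=True)


command = build_csv_trim_command(
    "tests/documents/noisy/sicilia.csv",
    "tests/documents/noisy/sicilia_trimmed.csv",
    keep_padding=True,
)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.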

How do I contribute to this package?

If you have found a corner case that the package does not handle, or you have a suggestion for a new feature, feel free to open an issue. If you want to contribute code, open an issue describing the feature you intend to add and then submit a pull request.

License

This package is released under the MIT license.

Release history

1.1.1: Sep 2, 2024 (this version)
1.1.0: Jul 22, 2024
1.0.10: Nov 12, 2020
1.0.9: Sep 17, 2020
1.0.8: Jul 31, 2020
1.0.7: Jul 27, 2020
1.0.6: Jul 27, 2020
1.0.5: Jul 25, 2020
1.0.4: Apr 1, 2020
1.0.2: Mar 26, 2020
1.0.1: Mar 26, 2020

Download files

Source Distribution

csv_trimming-1.1.1.tar.gz (17.4 kB), uploaded Sep 2, 2024

Hashes for csv_trimming-1.1.1.tar.gz

Algorithm	Hash digest
SHA256	`17fa6a276a6a0e9c0eadd328ecf1947e638f580515870f351d32a225026b7ab3`
MD5	`088457ce0258be47ff5b7bb13f91e06e`
BLAKE2b-256	`6a3b6473544d571b9d405f7eec84c1729d6e774b02fe56a482f4e5b8409a26b5`