Fixes mojibake and other problems with Unicode, after the fact

ftfy: fixes text for you

>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

The full documentation of ftfy is available at ftfy.readthedocs.org. The documentation covers a lot more than this README.

Testimonials

  • “My life is livable again!” — @planarrowspace
  • “A handy piece of magic” — @simonw
  • “Saved me a large amount of frustrating dev work” — @iancal
  • “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
  • “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
  • “9.2/10” — pylint

What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
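
To make the idea concrete, here is a minimal standard-library sketch of the round trip. It is not ftfy's implementation, which relies on heuristics and a wider set of encodings. The helper names `make_mojibake` and `undo_mojibake` are made up for this illustration, and Windows-1252 is assumed to be the encoding that was wrongly used.

```python
def make_mojibake(text: str) -> str:
    """Simulate the mix-up: UTF-8 bytes wrongly decoded as Windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


def undo_mojibake(text: str) -> str:
    """Reverse the mix-up if, and only if, the result is valid UTF-8."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The bytes are not valid UTF-8, so this text was probably
        # not mojibake of this kind. Leave it alone.
        return text


garbled = make_mojibake("✔ No problems")
restored = undo_mojibake(garbled)
print(garbled)   # âœ” No problems
print(restored)  # ✔ No problems

# Correctly decoded text fails the UTF-8 validity check and is kept as-is.
print(undo_mojibake("café"))  # café
```

The last call shows why UTF-8 makes misuse obvious: the single byte for "é" in Windows-1252 is not a valid UTF-8 sequence, so the sketch refuses to touch it.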

ftfy can fix multiple layers of mojibake simultaneously:

>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."

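One way to picture layered mojibake is as the same mistake applied several times, which can be peeled off one layer at a time. The sketch below uses only the standard library and assumes every layer is UTF-8 misread as Windows-1252; `add_layer` and `peel_layers` are hypothetical helpers, not ftfy functions. Unlike ftfy, it does not uncurl the quote afterwards.

```python
def add_layer(text: str) -> str:
    """Apply one layer of mojibake: UTF-8 bytes read as Windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


def peel_layers(text: str, max_layers: int = 10):
    """Undo layers until the text stops being decodable or stops changing."""
    layers = 0
    for _ in range(max_layers):
        try:
            candidate = text.encode("windows-1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further layer can be removed
        if candidate == text:
            break  # pure ASCII or otherwise unchanged: nothing left to do
        text = candidate
        layers += 1
    return text, layers


original = "The Mona Lisa doesn\u2019t have eyebrows."
garbled = add_layer(add_layer(add_layer(original)))
fixed, layers = peel_layers(garbled)
print(garbled)
print(fixed, layers)
```
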
It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

>>> ftfy.fix_text("l’humanitÃ©")
"l'humanité"

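The following standard-library sketch suggests why the curly quote gets in the way. It assumes the mojibake is UTF-8 misread as Windows-1252; `uncurl` and `try_decode` are illustrative names, and ftfy's own handling of this case is more careful than a blanket quote replacement.

```python
from typing import Optional

# Map curly quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def uncurl(text: str) -> str:
    return text.translate(CURLY_TO_STRAIGHT)


def try_decode(text: str) -> Optional[str]:
    """Return the re-decoded text, or None if it is not valid UTF-8."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# A real curly apostrophe followed by mojibake for "é".
mixed = "l\u2019humanit\u00c3\u00a9"

# The curly quote becomes a lone byte that is invalid in UTF-8,
# so the whole string fails to decode.
print(try_decode(mixed))          # None

# Once the quote is uncurled, the rest decodes cleanly.
print(try_decode(uncurl(mixed)))  # l'humanité
```
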
ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

>>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
'à perturber la réflexion'
>>> ftfy.fix_text('Ã perturber la rÃ©flexion')
'à perturber la réflexion'
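
As a rough illustration of the second case, the sketch below puts the lost U+A0 back before decoding. It is a simplification under a strong assumption: that "Ã" followed by a plain space always stood for "Ã" + U+A0 + space. ftfy itself is more selective about when to do this. The helper names are hypothetical.

```python
import re
from typing import Optional


def decode_as_utf8(text: str) -> Optional[str]:
    """Return the re-decoded text, or None if it is not valid UTF-8."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def restore_lost_nbsp(text: str) -> str:
    """Assumption: 'Ã' + space was originally 'Ã' + U+A0 + space."""
    return re.sub("\u00c3 ", "\u00c3\u00a0 ", text)


collapsed = "\u00c3 perturber la r\u00c3\u00a9flexion"

# Without the U+A0, the byte for 'Ã' has no valid continuation byte.
print(decode_as_utf8(collapsed))                     # None

# With the U+A0 restored, the text decodes as intended.
print(decode_as_utf8(restore_lost_nbsp(collapsed)))  # à perturber la réflexion
```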

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

>>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
>>> ftfy.fix_text('P&EACUTE;REZ')
'PÉREZ'
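
For comparison, Python's own `html.unescape` follows the standard and leaves the all-caps entity untouched. The sketch below shows one possible fallback, assuming that an all-uppercase name is a shouted version of a lowercase entity; `unescape_sloppy` is a hypothetical helper and not how ftfy is necessarily implemented.

```python
import html
import re

ENTITY_RE = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def unescape_sloppy(text: str) -> str:
    """Decode named entities, tolerating ALL-CAPS entity names."""
    def replace(match):
        entity = match.group(0)
        decoded = html.unescape(entity)
        if decoded != entity:
            return decoded  # a standard entity such as &Eacute;
        name = entity[1:-1]
        if name.isupper():
            lowered = "&" + name.lower() + ";"
            decoded = html.unescape(lowered)
            if decoded != lowered:
                # Decode the lowercase entity, then uppercase the result.
                return decoded.upper()
        return entity  # unknown entity: leave it as written

    return ENTITY_RE.sub(replace, text)


print(html.unescape("P&EACUTE;REZ"))    # P&EACUTE;REZ (not a standard entity)
print(unescape_sloppy("P&EACUTE;REZ"))  # PÉREZ
print(unescape_sloppy("P&Eacute;REZ"))  # PÉREZ
```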

These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

>>> ftfy.fix_text('IL Y MARQUÉ…')
'IL Y MARQUÉ…'
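
The risk is easy to reproduce with a naive fixer that re-decodes anything that happens to be valid UTF-8. This standard-library sketch is only an illustration of the false positive described above; `naive_fix` is a made-up name and deliberately lacks the plausibility checks that ftfy applies.

```python
def naive_fix(text: str) -> str:
    """Re-decode as UTF-8 whenever the bytes allow it, with no sanity check."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "IL Y MARQUÉ…"
damaged = naive_fix(original)
print(damaged)  # IL Y MARQUɅ

# The bytes for "É…" in Windows-1252 happen to form the valid UTF-8
# sequence for U+0245, so the naive approach corrupts good text.
print(damaged == original)  # False
```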

Installing

ftfy is a Python 3 package that can be installed using pip:

pip install ftfy

(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

You can also clone this Git repository and install it with python setup.py install.

Who maintains ftfy?

I'm Robyn Speer, also known as Elia Robyn Lake. You can find me on GitHub or Twitter.

Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.

ftfy has a citable record on Zenodo. A citation of ftfy may look like this:

Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

@misc{speer-2019-ftfy,
  author       = {Robyn Speer},
  title        = {ftfy},
  note         = {Version 5.5},
  year         = 2019,
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.2591652},
  url          = {https://doi.org/10.5281/zenodo.2591652}
}

