a Terrible Case Once Again of Bigotry of Low Expectations.
ftfy: fixes text for you
    >>> from ftfy import fix_encoding
    >>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
    (ง'⌣')ง
The full documentation of ftfy is available at ftfy.readthedocs.org. The documentation covers a lot more than this README, so here are some links into it:
- Fixing problems and getting explanations
- Configuring ftfy
- Encodings ftfy can handle
- "Fixer" functions
- Is ftfy an encoding detector?
- Heuristics for detecting mojibake
- Support for "bad" encodings
- Command-line usage
- Citing ftfy
Testimonials
- "My life is livable again!" — @planarrowspace
- "A handy piece of magic" — @simonw
- "Saved me a large amount of frustrating dev work" — @iancal
- "ftfy did the right thing right away, with no faffing about. Excellent piece of work, solving a very tricky real-world (whole-world!) problem." — Brennan Young
- "I have no idea when I'm gonna need this, but I'm definitely bookmarking it." — /u/ocrow
- "9.2/10" — pylint
What it does
Here are some examples (found in the real world) of what ftfy can do:
ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
    >>> import ftfy
    >>> ftfy.fix_text('âœ” No problems')
    '✔ No problems'
Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
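To see why recovery is possible, here is a minimal sketch that uses only Python's standard codecs, not ftfy itself. It assumes the mix-up was exactly "UTF-8 bytes decoded as Windows-1252", which is only one of the cases ftfy looks for, and it does none of ftfy's detection work.

```python
# Simulate the mix-up: text saved as UTF-8 but read back as Windows-1252.
original = "✔ No problems"
garbled = original.encode("utf-8").decode("windows-1252")
# garbled is now 'âœ” No problems'

# Nothing was lost: each garbled character still maps back to one byte,
# so reversing the two steps restores the original string.
recovered = garbled.encode("windows-1252").decode("utf-8")
assert recovered == original
```

The hard part, which this sketch skips, is deciding whether a given string is mojibake in the first place and which encoding pair was involved.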
ftfy can fix multiple layers of mojibake simultaneously:
    >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
    "The Mona Lisa doesn't have eyebrows."
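Layered mojibake can be imitated with the standard library alone. The sketch below is an illustration under assumptions, not ftfy's algorithm: `make_mojibake` and `peel_layers` are hypothetical helper names, and every layer is assumed to be the same UTF-8 / Windows-1252 confusion.

```python
def make_mojibake(text, layers):
    """Mis-decode UTF-8 as Windows-1252 the given number of times."""
    for _ in range(layers):
        text = text.encode("utf-8").decode("windows-1252")
    return text


def peel_layers(text, max_layers=5):
    """Reverse the mis-decoding until it stops working or stops changing."""
    for _ in range(max_layers):
        try:
            candidate = text.encode("windows-1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text no longer looks like mis-decoded UTF-8
        if candidate == text:
            break  # pure ASCII round-trips unchanged; nothing left to do
        text = candidate
    return text


garbled = make_mojibake("doesn’t", 3)
# garbled == 'doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t'
restored = peel_layers(garbled)
# restored == 'doesn’t'
```

Unlike `ftfy.fix_text`, this sketch leaves the curly apostrophe in place and applies no heuristics to avoid false positives.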
It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:
    >>> ftfy.fix_text("l’humanitÃ©")
    "l'humanité"
ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:
    >>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
    'à perturber la réflexion'
    >>> ftfy.fix_text('Ã perturber la rÃ©flexion')
    'à perturber la réflexion'
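The lost U+A0 matters because 'à' is the two bytes C3 A0 in UTF-8, which Windows-1252 shows as 'Ã' followed by a non-breaking space. A rough sketch of putting that byte back, using a hypothetical helper `restore_a0_and_decode` and the naive assumption that every 'Ã' followed by a plain space originally had a U+A0 there:

```python
def restore_a0_and_decode(text):
    # Assumption: 'Ã' + space always stands for 'Ã' + U+A0 + space.
    # Real text can legitimately contain 'Ã ' so this is not safe in general.
    candidate = text.replace("Ã ", "Ã\xa0 ")
    return candidate.encode("windows-1252").decode("utf-8")


damaged = "Ã perturber la rÃ©flexion"
fixed = restore_a0_and_decode(damaged)
# fixed == 'à perturber la réflexion'
```

ftfy is far more cautious than this before it reinserts a character; the sketch only shows why the repair is possible at all.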
ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:
    >>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
    >>> ftfy.fix_text('P&EACUTE;REZ')
    'PÉREZ'
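For comparison, Python's standard `html.unescape` follows the HTML 5 entity table and leaves the wrongly capitalized entity alone. The sketch below adds a lenient fallback for all-uppercase names; `unescape_lenient` is a hypothetical helper written for illustration, it handles named entities only, and it is not necessarily how ftfy implements this.

```python
import html
import re
from html.entities import html5  # maps names such as 'eacute;' to characters

_ENTITY_RE = re.compile(r"&([A-Za-z0-9]+;)")


def unescape_lenient(text):
    """Decode named entities, also accepting ALL-CAPS spellings."""
    def replace(match):
        name = match.group(1)  # entity name including the trailing ';'
        if name in html5:
            return html5[name]
        # Fallback: treat '&EACUTE;' as the uppercase form of '&eacute;'.
        if name.isupper() and name.lower() in html5:
            return html5[name.lower()].upper()
        return match.group(0)  # unknown entity: leave it untouched
    return _ENTITY_RE.sub(replace, text)


strict = html.unescape("P&EACUTE;REZ")      # stays 'P&EACUTE;REZ'
lenient = unescape_lenient("P&EACUTE;REZ")  # becomes 'PÉREZ'
```

Uppercasing the looked-up character is a simplification: for a few characters the uppercase form is more than one letter, so a production version would need to handle those cases deliberately.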
These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.
The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.
    >>> ftfy.fix_text('IL Y MARQUÉ…')
    'IL Y MARQUÉ…'
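The false positive described here is easy to reproduce with the standard codecs. This short sketch shows what a fixer would produce if it blindly re-decoded every string that happens to be valid UTF-8 after encoding as Windows-1252:

```python
text = "IL Y MARQUÉ…"

# 'É' and '…' are the bytes C9 85 in Windows-1252, and C9 85 happens to be
# a valid UTF-8 sequence for the single character 'Ʌ' (U+0245).
reinterpreted = text.encode("windows-1252").decode("utf-8")
# reinterpreted == 'IL Y MARQUɅ'
```

The conversion succeeds without any error, which is why a successful decode alone is not enough evidence that the text was mojibake.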
Installing
ftfy is a Python 3 package that can be installed using pip:
    pip install ftfy
(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)
Local development
ftfy is developed using poetry. Its setup.py is vestigial and is not the recommended way to install it.
Install Poetry, check out this repository, and run poetry install to install ftfy for local development, such as experimenting with the heuristic or running tests.
Who maintains ftfy?
I'm Robyn Speer, also known as Elia Robyn Lake. You can find me on GitHub or Twitter.
Citing ftfy
ftfy has been used as a crucial data processing step in major NLP research.
It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.
ftfy has a citable record on Zenodo. A citation of ftfy may look like this:
Robyn Speer. (2019). ftfy (Version 5.5). Zenodo. http://doi.org/10.5281/zenodo.2591652
In BibTeX format, the citation is:
    @misc{speer-2019-ftfy,
      author = {Robyn Speer},
      title = {ftfy},
      note = {Version 5.5},
      year = 2019,
      howpublished = {Zenodo},
      doi = {10.5281/zenodo.2591652},
      url = {https://doi.org/10.5281/zenodo.2591652}
    }
Source: https://github.com/rspeer/python-ftfy