Suppose I have the following string:
s = 'Pingüino: Málaga es una ciudad fantástica y en Logroño me pica el... moño'
For my purposes, I want to remove all the accents and umlauts so that it looks like:
s = 'Pinguino: Malaga es una ciudad fantastica y en Logroño me pica el... moño'
# ^ ^ ^
I have discovered the unidecode library, which does exactly this:
>>> unidecode.unidecode(s)
'Pinguino: Malaga es una ciudad fantastica y en Logrono me pica el... mono'
But unfortunately it also replaces the ñ with n ( Logroño → Logrono , moño → mono ).
Is there another library that allows this substitution, changing only the accents and umlauts? Otherwise, I understand that what I need is a regular expression that makes this change.
The technique is generally the same: take the decomposed Unicode normalization form (NFD), remove the marks you don't want, and return to the composed form (NFC).
Decomposed form?? In Unicode, a character (actually a "grapheme") can be broken down into its equivalent base character followed by its combining marks. For example:
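As a quick illustration (the snippet is mine, not part of the original answer), this is what the two forms look like with the standard library's unicodedata module:

```python
import unicodedata

s = "ñ"  # a single grapheme

# Composed form (NFC): one code point
nfc = unicodedata.normalize("NFC", s)
print([hex(ord(c)) for c in nfc])   # ['0xf1'] -> U+00F1 LATIN SMALL LETTER N WITH TILDE

# Decomposed form (NFD): base character + combining mark
nfd = unicodedata.normalize("NFD", s)
print([hex(ord(c)) for c in nfd])   # ['0x6e', '0x303'] -> 'n' + U+0303 COMBINING TILDE
```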
The decomposed (D) and composed (C) forms are equivalent (Unicode canonical equivalence): their bytes are different, but they print the same (they are still the same grapheme, and there are algorithms to compare strings across forms).
In the NFD form, diacritics are code points separate from their base character (the first code point)... This is the key to being able to remove what you don't want! After removing them, the string could be printed in that form (D), but it is convenient to return to the composed form to avoid problems.
What to remove? All the options are valid. Once you understand them, pick whichever suits your case and your preferences.
ChemaCortes in his answer chose to remove all non-ASCII characters (that's why he temporarily replaces the ñ with another ASCII string, which is not removed). FJSevilla in his answer aimed the rifles directly at the acute accent (´) and the diaeresis (¨). With those bases covered, I was left to show the most fundamentalist option: exterminate all diacritical marks.
Delete all diacritics except the ñ
All the combining diacritical marks are in one block, the range U+0300-U+036F (Combining Diacritical Marks). And we are going to make an exception for U+0303, the combining tilde (~), but only over an n (still replacing it over other base characters, as in ã, or over an ñ that carries additional marks, like ñ͚͡), with a regex where the first group is the base character and the diacritics are outside the group:

([^n\u0300-\u036f])[\u0300-\u036f]+

a character that is neither an n nor a diacritic, followed by diacritics; or

(n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+

an n that is not followed by ~ (unless that ~ is in turn followed by another diacritic), in which case it does match all the diacritics that follow it. Replacing each match with \1, we are left with the base letter without its diacritics.

Code

https://ideone.com/YcXaQD
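The full code is at the link above; here is a self-contained sketch of the approach. It combines the two alternatives into a single pattern, so the replacement uses a function that picks whichever group actually matched (rather than a plain \1):

```python
import re
import unicodedata

def remove_diacritics_keep_n_tilde(s: str) -> str:
    # Work on the decomposed (NFD) form so marks are separate code points
    nfd = unicodedata.normalize("NFD", s)
    pattern = re.compile(
        # a base char that is neither 'n' nor a diacritic, followed by diacritics...
        "([^n\u0300-\u036f])[\u0300-\u036f]+"
        # ...or an 'n' not followed by a lone combining tilde (U+0303)
        "|(n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+"
    )
    # Keep only the base character (group 1 or group 2, whichever matched)
    stripped = pattern.sub(lambda m: m.group(1) or m.group(2), nfd)
    # Recompose (NFC) to avoid problems downstream
    return unicodedata.normalize("NFC", stripped)

s = "Pingüino: Málaga es una ciudad fantástica y en Logroño me pica el... moño"
print(remove_diacritics_keep_n_tilde(s))
# Pinguino: Malaga es una ciudad fantastica y en Logroño me pica el... moño
```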
Another possible idea, also pulling only from the standard library, is to use unicodedata to get the decomposed normalized form (NFD) of the unicode string. This takes "á", for example, from u"\u00E1" to u"\u0061\u0301". Then simply use str.translate to remove the Unicode code points we want, in this case U+0308 (combining diaeresis) and U+0301 (combining acute accent). In Python 3, you can simply do this:
Unfortunately this does not work in Python 2 or earlier: there you would also have to import the string module to be able to use string.maketrans(), and when applying it, it would tell you that the strings a and b do not have the same length (in fact len(a) = 12 while len(b) = 6).
Nothing more than the standard library is needed to "clean" the string:
To prevent the ñ's from being lost, the simplest thing is to first replace them with a symbol that you know will not be used:
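A minimal sketch of that idea, assuming \x00 and \x01 as the placeholder characters (pick any symbols you know won't appear in your text):

```python
import unicodedata

s = "Pingüino: Málaga es una ciudad fantástica y en Logroño me pica el... moño"

# Hide ñ/Ñ behind placeholders that won't appear in the text (assumed placeholders)
protected = s.replace("ñ", "\x00").replace("Ñ", "\x01")

# Decompose, then drop every non-ASCII code point (the combining marks)
ascii_only = (
    unicodedata.normalize("NFKD", protected)
    .encode("ascii", "ignore")
    .decode("ascii")
)

# Restore the ñ/Ñ
result = ascii_only.replace("\x00", "ñ").replace("\x01", "Ñ")
print(result)
```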