Suppose I have the following string:
s = 'Pingüino: Málaga es una ciudad fantástica y en Logroño me pica el... moño'
For my purposes, I want to remove all the accents and umlauts so that it looks like:
s = 'Pinguino: Malaga es una ciudad fantastica y en Logroño me pica el... moño'
# ^ ^ ^
I have discovered the unidecode library, which does exactly this:
>>> unidecode.unidecode(s)
'Pinguino: Malaga es una ciudad fantastica y en Logrono me pica el... mono'
But unfortunately it also replaces the ñ with n ( Logroño → Logrono , moño → mono ).
Is there another library that allows this substitution, changing only the accents and umlauts? Otherwise, I understand that what I need is a regular expression that makes this change.
The technique is generally the same: take the decomposed Unicode normalization form (NFD), remove the marks you don't want, and return to the composed form (NFC).
Decomposed form?? In Unicode, a character (actually a "grapheme") can be broken down into its equivalent base character followed by its combining marks. For example:
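As a quick illustration (the snippet is mine, not part of the original answer), this is what the two forms look like with the standard library's unicodedata module:

```python
import unicodedata

s = "ñ"  # a single grapheme

# Composed form (NFC): one code point
nfc = unicodedata.normalize("NFC", s)
print([hex(ord(c)) for c in nfc])   # ['0xf1'] -> U+00F1 LATIN SMALL LETTER N WITH TILDE

# Decomposed form (NFD): base character + combining mark
nfd = unicodedata.normalize("NFD", s)
print([hex(ord(c)) for c in nfd])   # ['0x6e', '0x303'] -> 'n' + U+0303 COMBINING TILDE
```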
The decomposed (D) and composed (C) forms are equivalent (Unicode canonical equivalence): their bytes are different, but they print the same (they are still the same grapheme, and there are algorithms to compare strings across forms).
In the NFD form, diacritics are code points separate from their base character (the first code point)... This is the key to being able to remove what you don't want! After removing them, the string could be printed in that form (D), but it is convenient to return to the composed form to avoid problems.
What to remove? All the options are valid. Once you understand them, pick whichever suits your case and your preferences.
ChemaCortes in his answer chose to remove all non-ASCII characters (that's why he temporarily replaces the ñ with another ASCII string, which is not removed). FJSevilla in his answer aimed the rifles directly at the acute accent (´) and the diaeresis (¨). With those bases covered, I was left to show the most fundamentalist option: exterminate all diacritical marks.
Delete all diacritics except the ñ
All the combining diacritical marks are in one block, the range U+0300-U+036F (Combining Diacritical Marks). And we are going to make an exception for U+0303, the combining tilde (~), but only over an n (still replacing it over other base characters, as in ã, or over an ñ that carries additional marks, like ñ͚͡), with a regex where the first group is the base character and the diacritics are outside the group:

([^n\u0300-\u036f])[\u0300-\u036f]+

a character that is neither an n nor a diacritic, followed by diacritics; or

(n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+

an n that is not followed by ~ (unless that ~ is in turn followed by another diacritic), in which case it does match all the diacritics that follow it. Replacing each match with \1, we are left with the base letter without its diacritics.

Code

https://ideone.com/YcXaQD
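The full code is at the link above; here is a self-contained sketch of the approach. It combines the two alternatives into a single pattern, so the replacement uses a function that picks whichever group actually matched (rather than a plain \1):

```python
import re
import unicodedata

def remove_diacritics_keep_n_tilde(s: str) -> str:
    # Work on the decomposed (NFD) form so marks are separate code points
    nfd = unicodedata.normalize("NFD", s)
    pattern = re.compile(
        # a base char that is neither 'n' nor a diacritic, followed by diacritics...
        "([^n\u0300-\u036f])[\u0300-\u036f]+"
        # ...or an 'n' not followed by a lone combining tilde (U+0303)
        "|(n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+"
    )
    # Keep only the base character (group 1 or group 2, whichever matched)
    stripped = pattern.sub(lambda m: m.group(1) or m.group(2), nfd)
    # Recompose (NFC) to avoid problems downstream
    return unicodedata.normalize("NFC", stripped)

s = "Pingüino: Málaga es una ciudad fantástica y en Logroño me pica el... moño"
print(remove_diacritics_keep_n_tilde(s))
# Pinguino: Malaga es una ciudad fantastica y en Logroño me pica el... moño
```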
Another possible idea, also pulling only from the standard library, is to use unicodedata to get the decomposed normalized form (NFD) of the unicode string. This takes "á", for example, from u"\u00E1" to u"\u0061\u0301". Then simply use str.translate to remove the Unicode code points we want, in this case U+0308 (combining diaeresis) and U+0301 (combining acute accent). In Python 3, you can simply do this:
Unfortunately this does not work in Python 2 or earlier: there you would also have to import the string module to be able to use string.maketrans(), and when applying it, it would tell you that the strings a and b do not have the same length (in fact len(a) = 12 while len(b) = 6).
Nothing more than the standard library is needed to "clean" the string:
To prevent the ñ's from being lost, the simplest thing is to first replace them with a symbol that you know will not be used:
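A minimal sketch of that idea, assuming \x00 and \x01 as the placeholder characters (pick any symbols you know won't appear in your text):

```python
import unicodedata

s = "Pingüino: Málaga es una ciudad fantástica y en Logroño me pica el... moño"

# Hide ñ/Ñ behind placeholders that won't appear in the text (assumed placeholders)
protected = s.replace("ñ", "\x00").replace("Ñ", "\x01")

# Decompose, then drop every non-ASCII code point (the combining marks)
ascii_only = (
    unicodedata.normalize("NFKD", protected)
    .encode("ascii", "ignore")
    .decode("ascii")
)

# Restore the ñ/Ñ
result = ascii_only.replace("\x00", "ñ").replace("\x01", "Ñ")
print(result)
```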