
Question

Asked: 2020-10-22 14:22:34 +0800 CST

Recognize the Ñ in a character string


I need to recognize the Ñ in a character string.

cadena = "DADEVVEÑWE"

If I do:

for letra in cadena:
    if letra == 'D':
        print 'Letra D'
    elif letra == 'Ñ':
        print 'Letra Ñ'

Why is the Ñ skipped and never detected, while the rest of the letters are detected (if I add the corresponding conditional)? Where do I apply the encoding so that the Ñ is recognized?

2 Answers


abulafia · Answer 1 · 2020-10-22T14:40:44+08:00

It's due to the way Python 2 works with non-ASCII characters.

It turns out that when you put in your source code a line like this:

cadena = "EÑE"

what actually goes into the variable depends on what editor you used.

If you use an editor that uses an ISO-8859-1 (Latin-1) encoding (for example, many Windows editors), three bytes will be stored in the variable, since in that encoding each letter is a single byte (and the code of the letter Ñ would be d1).

If you use an editor that uses UTF-8 encoding (the most common today, on Linux and Mac and even on Windows depending on which editor you use), then four bytes would go in the variable, because in that encoding some letters occupy only one byte (the ASCII ones) and others take two, three or four. The Ñ in particular would be encoded as two bytes with values c3 and 91.

This causes all kinds of problems. For example, len(cadena) can return 4 or 3 depending on which editor you used.
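The byte counts above can be checked directly. This is only a minimal sketch: the Ñ is written as the escape \u00d1 so the result does not depend on how the file was saved, and the variable names are illustrative. It should behave the same on Python 3 and Python 2.7.

```python
# u"\u00d1" is the letter N-with-tilde written as an escape sequence,
# so the editor's encoding cannot affect the result.
text = u"E\u00d1E"

latin1_bytes = text.encode("latin-1")  # one byte per letter
utf8_bytes = text.encode("utf-8")      # the N-with-tilde takes two bytes

print(len(latin1_bytes))  # 3
print(len(utf8_bytes))    # 4
print(repr(latin1_bytes))
print(repr(utf8_bytes))
```

The two lengths differ even though both byte strings represent the same three letters.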

Therefore, as soon as your program has to handle text that may include non-ASCII characters, you should always work with Unicode.

In Python 2, unicode is a separate type, different from str. To make a string literal Unicode, you put a u in front of the quotes. So:

cadena = u'Eñe'

In this case, Python stores every character of the string with the same fixed width: 16 or 32 bits each, depending on how the interpreter was built (although this is really transparent to us). The len() function on a unicode string tells you how many letters it has, not how many bytes, so it returns 3 regardless of which editor you typed it in.

If you have a string that is not part of the program's source code but has been read from outside (from a file, from a socket, or via raw_input()), it will be a str, that is, a sequence of bytes. To handle it in your program and compare it with Unicode strings, you must convert it to Unicode as well. For example, like this:

nombre = raw_input("Como te llamas? ")
nombre = unicode(nombre, "utf8")

The problem with this conversion is that, as you can see, you have to specify the encoding of the str string. In this example I used "utf8", assuming that the terminal the text is read from uses UTF-8. If it uses another encoding, the conversion may fail. The same applies when reading from a file: you must know which encoding the file is in.
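How a wrong encoding guess fails can be sketched as follows. The to_text helper is hypothetical, named only for this illustration, and the bytes are hard-coded instead of being read from a terminal or file. On Python 2, unicode(data, encoding) performs the same conversion as data.decode(encoding).

```python
# "E", N-with-tilde, "E" as Latin-1 bytes, standing in for external input.
raw = b"E\xd1E"


def to_text(data, encoding):
    """Hypothetical helper: decode a byte string into a Unicode string."""
    return data.decode(encoding)


# Correct guess: the bytes decode to the expected three letters.
print(to_text(raw, "latin-1") == u"E\u00d1E")  # True

# Wrong guess: 0xd1 followed by "E" is not a valid UTF-8 sequence.
try:
    to_text(raw, "utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8: %s" % exc.reason)
```

A wrong guess does not always raise an error; some combinations decode without complaint and produce the wrong letters, which is why the encoding has to be known rather than guessed.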

Back to your example. You have code equivalent to this:

cadena = "EÑE"
for letra in cadena:
    if letra == 'E':
        print 'Letra E'
    elif letra == 'Ñ':
        print 'Letra Ñ'
    else:
        print 'Otra letra'  # <-- I added this

As you can see, cadena is of type str (because it has no u in front). If you write this code in an editor that uses UTF-8, the string will contain four bytes, as explained above. This means the loop runs four times (as you can see if you run it). In each iteration, letra is a single byte. In none of the iterations is letra == "Ñ" true, since "Ñ" is two bytes, as explained before.

Now its mysterious output makes sense:

Letra E
Otra letra
Otra letra
Letra E
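The same effect can be reproduced without depending on the editor. This is a sketch under stated assumptions: the UTF-8 bytes are built explicitly with encode(), and slicing is used so that each item is a one-byte string on both Python 2.7 and Python 3 (where iterating over bytes directly would yield integers).

```python
text = u"E\u00d1E"                   # three characters
data = text.encode("utf-8")          # what a Python 2 str holds in a UTF-8 file
target = u"\u00d1".encode("utf-8")   # the two bytes of the N-with-tilde

# Byte by byte: four items, and none of them equals the two-byte target.
byte_items = [data[i:i + 1] for i in range(len(data))]
print(len(byte_items))                        # 4
print(any(b == target for b in byte_items))   # False

# Character by character: three items, and the comparison succeeds.
print(len(text))                              # 3
print(any(ch == u"\u00d1" for ch in text))    # True
```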

It would be fixed like this:

cadena = u"EÑE"
for letra in cadena:
    if letra == u'E':
        print 'Letra E'
    elif letra == u'Ñ':
        print 'Letra Ñ'
    else:
        print 'Otra letra'

Even so, we are left with a mix of ordinary strings and unicode strings, which is easy to get wrong. Python 3 simplifies this by making all strings Unicode by default.
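For comparison, a Python 3 version of the loop might look like the sketch below. It assumes the file is saved as UTF-8 (the Python 3 default), and the function name describe is made up for this example. It collects the messages in a list instead of printing inside the loop.

```python
def describe(cadena):
    """Return one label per character; every str is Unicode in Python 3."""
    etiquetas = []
    for letra in cadena:
        if letra == "E":
            etiquetas.append("Letra E")
        elif letra == "Ñ":
            etiquetas.append("Letra Ñ")
        else:
            etiquetas.append("Otra letra")
    return etiquetas


for etiqueta in describe("EÑE"):
    print(etiqueta)
```

No u prefix is needed here, and the loop runs three times, once per letter.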

javrd · Answer 2 · 2020-10-22T15:30:43+08:00

In addition to @abulafia's answer, which explains very well why this happens and how to avoid it, I would add that, to avoid this kind of mistake, the encoding can be declared in a comment at the beginning of the file. For example, for UTF-8, which is the default in Python 3:

# -*- coding: utf-8 -*-

utf-8 can be replaced with another encoding if desired, such as latin-1, as long as it matches the encoding the editor actually uses to save the file.
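Put together with the u prefix from the first answer, a complete file might look like this sketch. It assumes the editor saves the file as UTF-8, and the declaration must sit on the first or second line of the file to take effect. Python 3.3 and later also accept the u prefix, so the same file should run there.

```python
# -*- coding: utf-8 -*-
# With the declaration above, the interpreter knows how to read the
# non-ASCII literal below (assuming the file really is saved as UTF-8).
cadena = u"EÑE"

print(len(cadena))               # 3
print(cadena[1] == u"\u00d1")    # True
```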
