Recognizing the letter Ñ in a character string
cadena = "DADEVVEÑWE"
If I do:
for letra in cadena:
    if letra == 'D':
        print 'Letra D'
    elif letra == 'Ñ':
        print 'Letra Ñ'
Why does it skip the Ñ and never detect it, while it does detect the rest of the letters (if I add the corresponding conditional)? Where should I apply the encoding so that it recognizes the Ñ?
This is due to the way Python 2 handles non-ASCII characters.

When your source code contains a line that assigns a string literal with an Ñ in it to a variable (say, a three-letter word with one Ñ), what actually ends up in the variable depends on the editor you used.

If the editor uses an ISO encoding such as ISO-8859-1 (as many Windows editors do), three bytes are stored in the variable, since in that encoding each letter is a single byte (the code for the letter Ñ would be d1).

If the editor uses UTF-8 (the most common choice today on Linux and Mac, and on Windows too depending on the editor), four bytes go into the variable, because in that encoding ASCII letters occupy a single byte while other characters take two or more. The Ñ in particular is encoded as two bytes with values c3 and 91.

This causes all kinds of problems; for example, len(cadena) can return 4 or 3 depending on which editor you used.

Therefore, as soon as your program has to handle text that may include non-ASCII characters, you should always work with Unicode.
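The byte counts described above can be checked directly. This is a minimal sketch that assumes a hypothetical three-letter sample, "AÑO", written with a \u escape so the result does not depend on the editor's encoding. It runs on Python 2.7 and Python 3.

```python
import binascii

# Hypothetical sample word "AÑO"; \u00d1 is the code point of Ñ.
texto = u"A\u00d1O"

# What an ISO-8859-1 (Latin-1) editor would store for the same word.
as_latin1 = texto.encode("latin-1")
# What a UTF-8 editor would store for the same word.
as_utf8 = texto.encode("utf-8")

latin1_hex = binascii.hexlify(as_latin1).decode("ascii")
utf8_hex = binascii.hexlify(as_utf8).decode("ascii")

print(len(as_latin1))  # 3
print(latin1_hex)      # 41d14f
print(len(as_utf8))    # 4
print(utf8_hex)        # 41c3914f
```

The d1 in the first dump and the c3 91 pair in the second are the two possible byte representations of the Ñ.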
In Python 2, Unicode is a separate type, different from str. To write a Unicode string, put a u in front of the quotes. In that case Python stores every character of the string uniformly: each one takes 16 or 32 bits, depending on how Python was built (although this is transparent to us). Calling len() on a Unicode string tells you how many letters it has, not how many bytes, so it returns 3 regardless of the editor you typed it in.

If you have a string that is not part of the program's source code but was read from outside (from a file, from a socket, or via raw_input()), it will be a str, that is, a sequence of bytes. To handle it in your program and compare it with Unicode strings, you must convert it to Unicode as well, for example with its decode() method.

The catch with this conversion is that you have to specify the encoding that the str is in. I would pass "utf8" here, assuming that the terminal the text is read from uses UTF-8; with a different encoding the conversion might fail. The same goes for reading from a file: you must know which encoding the file is in.

Going back to your example, you have code equivalent to this:
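A sketch of that situation, assuming the hypothetical three-letter word "AÑO" saved by a UTF-8 editor. The byte string is written out explicitly and sliced one byte at a time, which mimics what a Python 2 for loop does over a str and also keeps the snippet runnable on Python 3.

```python
# The bytes a UTF-8 editor stores for the Python 2 literal "AÑO" (no u prefix).
cadena = b"A\xc3\x91O"
# The two bytes that "Ñ" occupies in UTF-8.
enie = b"\xc3\x91"

iterations = 0
matches = 0
for i in range(len(cadena)):
    letra = cadena[i:i + 1]  # one byte per iteration, like Python 2's loop over a str
    iterations += 1
    if letra == enie:        # a single byte can never equal a two-byte sequence
        matches += 1

print(iterations)  # 4
print(matches)     # 0
```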
As you can see, cadena is of type str (because it has no u in front). If you write this code in an editor that uses UTF-8, the string will contain four bytes, as explained above. This means the loop runs four times (as you can see if you run it). In each iteration, letra is a single byte. In none of the iterations is letra == "Ñ" true, since "Ñ" is two bytes, as explained before.

Now its mysterious output makes sense:
It would be fixed like this:
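A minimal sketch of the fix, assuming the file is saved as UTF-8 and declares it on its first line. Every literal carries the u prefix, so the loop sees whole characters instead of bytes. The list mensajes is only there to make the result easy to inspect.

```python
# -*- coding: utf-8 -*-
cadena = u"DADEVVEÑWE"  # Unicode string: 10 characters

mensajes = []
for letra in cadena:
    if letra == u"D":
        mensajes.append(u"Letra D")
    elif letra == u"Ñ":  # now compares one character with one character
        mensajes.append(u"Letra Ñ")

for mensaje in mensajes:
    print(mensaje)  # Letra D, Letra D, Letra Ñ
```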
We still have a mix of plain strings and Unicode strings, which is easy to get wrong. Python 3 simplifies this by making all strings Unicode by default.
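One way to limit that mixing, in line with the earlier advice about decoding external input, is to convert bytes to Unicode once, right where they enter the program. The sketch below assumes UTF-8 input; the helper name to_unicode is hypothetical and not part of any library.

```python
def to_unicode(data, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass Unicode through unchanged."""
    if isinstance(data, bytes):  # in Python 2, bytes is just another name for str
        return data.decode(encoding)
    return data

# Stand-in for what raw_input() would return in Python 2 on a UTF-8 terminal.
raw = b"A\xc3\x91O"
text = to_unicode(raw)

print(len(raw))               # 4 bytes
print(len(text))              # 3 characters
print(text[1] == u"\u00d1")   # True

# Guessing the wrong encoding can fail: these are Latin-1 bytes, not UTF-8.
try:
    to_unicode(b"A\xd1O")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                 # True
```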
In addition to @abulafia's answer, which explains very well why this happens and how to avoid it, I would add that, to avoid this kind of mistake, the encoding can be declared at the beginning of the file in a comment. For example, for UTF-8, which is the default in Python 3:
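A sketch of such a declaration, assuming the file really is saved as UTF-8. The comment tells Python how to decode the source file, which is what lets u"..." literals come out right. Plain str literals in Python 2 still remain byte strings.

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file.

cadena = u"DADEVVEÑWE"

print(len(cadena))      # 10
print(u"Ñ" in cadena)   # True
```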
utf-8 can be replaced with another encoding if desired, such as latin-1 or cp1252.