I am doing sentiment analysis by country using python2.7, on a JSON file that I obtained with the Twitter API. My problem is that, despite assigning the default encoding as suggested in various forums, and also encoding the text, I can't 'translate' the 'rare' characters. I assign the default encoding:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
When assigning values to the place variable, if I don't force any encoding change, the resulting country names show the strange characters:
try:
    jsonLine = json.loads(line)
    place = jsonLine["place"].get('country')
    text = jsonLine["text"]
    score = self.tweet_Score(text, weights)
    yield (place, score)
except:
    pass
Result Example:
"Mexico" 217.41 "El Salvador" 7.78 "United Arab Emirates" 0 "Spain" 300.62
If instead I decode during the assignment to place, using .decode('utf-8').encode('utf-8'):
try:
    jsonLine = json.loads(line)
    place = jsonLine["place"].get('country').decode('utf-8').encode('utf-8')
    text = jsonLine["text"]
    score = self.tweet_Score(text, weights)
    yield (place, score)
except:
    pass
With this last version, the records with strange characters disappear from my results and no longer receive the scores that had been calculated for them (which is not correct). I have tried different combinations of decode and encode, but the behavior is as described.

I have considered doing some replace calls to patch the most frequent cases, but that would not be appropriate, because I have the same problem in the content of the texts that I analyze for scoring, where there are many more cases. So I suppose there must be some solution through the encoding, but I don't know what else to try.

Thanks in advance for the help!
P.S. To give additional information, this is what the country field I'm using for the example looks like, taken from my actual input file:

"country": "Espa\u00f1a"
The \u sequence in javascript (and JSON)
Those "strange characters" are not an error. They are the way JSON decides to represent non-ascii characters in a way that doesn't depend on the encoding .
Let me explain, taking the case of "España" as an example. The character "ñ" is not part of ASCII, so when putting it into a JSON string we have two options:

1. Store the "ñ" directly, using some encoding. This encoding will typically be UTF-8, and therefore Unicode. In Unicode "ñ" has the code point U+00F1, but when encoded in UTF-8, in which the basic unit is the byte, it occupies two bytes with the values C3 and B1 (hexadecimal). Whoever reads this character string must know that the chosen encoding was UTF-8, in order to "reassemble" those two bytes into a single character (U+00F1) and thus recover the "ñ". If they instead assume an encoding like latin1, where each byte is a single character, they will erroneously decode it as two characters: "Ã±".
2. Use the escape character \. This character serves multiple purposes: it lets you put characters into a string that would otherwise not be visible or would cause confusion. The most typical case is \n for the new-line, but we also have \r for the carriage return, \b for the backspace, \t for the tabulator, etc. And the one that concerns us here, \u, for a Unicode character. It must be followed by four hexadecimal digits that encode the character in question. In our case, the sequence of six characters \u00f1 therefore represents a single one: the eñe.

The second option is preferable, because no encoding was needed to store the Unicode character; it was simply represented by another ASCII sequence. It is as if in HTML you had written &ntilde;, which is also an ASCII sequence that the browser will display as ñ.
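Both forms represent exactly the same string, as a quick check with python's json module shows (an illustrative session, assuming a UTF-8 terminal):

>>> import json
>>> json.loads('{"country": "España"}') == json.loads('{"country": "Espa\\u00f1a"}')
True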
The fact that a JSON text contains "Espa\u00f1a" is therefore not a problem. It contains the correct string, and when a JavaScript program tries to display it, the eñe is shown correctly.

The \u sequence in python2
This sequence has no special meaning for python2. If a string contains \u00f1, it is displayed as is, as six literal characters:
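>>> s = "\u00f1"    # a byte string: \u is not an escape here
>>> print s
\u00f1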
But if it is a unicode string (with a u in front of the opening quote), then the sequence is recognized and processed:
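>>> s = u"\u00f1"
>>> print s
ñ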
The most common form in python, though, is not \u00f1 but \xf1, which is also recognized:
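>>> s = u"\xf1"
>>> print s
ñ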
Be careful, though: either of these two forms stores in the string the Unicode character that represents the ñ, and not the character sequence \u00f1 or \xf1. Those sequences are processed and converted into the corresponding character. If we wanted to store those literal sequences, we would have to escape the \ with another \, to prevent it from being processed (and then a single \ is stored in the string). So:
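>>> s = u"\\u00f1"   # the backslash is escaped: this string contains no eñe
>>> print s
\u00f1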
The difference between putting only one \ or putting two is that in the second case the resulting text no longer contains any eñe, but simply an ASCII sequence (of which the characters \, u, 0, etc., among others, form part). This is best understood by looking at the length of these strings:
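>>> len(u"\u00f1")    # one Unicode character: the eñe
1
>>> len(u"\\u00f1")   # six ASCII characters: \, u, 0, 0, f, 1
6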
json and python
And finally we get to the heart of the matter. We have in JSON a string that contains \u00f1, which, as we have seen, is legal in JSON, and we want to read it from python. For example, suppose we have read from a file (or from a socket, it doesn't matter) the string, which we have stored in line, and which is the following:
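>>> line = '"Espa\\u00f1a"'   # illustrative: the country field from your file, pure ASCII with a literal \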
In python 2, reading from a file (or from a socket) produces a str, which is a byte string, not a Unicode string. We can try to convert it to Unicode, for which in general we would need to know the encoding of the file it was read from. But in this case, for the reasons above, the encoding is irrelevant, since the eñe has been represented as the ASCII sequence \uXXXX. So the following should work without errors:
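>>> uline = line.decode('ascii')   # any codec would do: the text is pure ASCII
>>> uline
u'"Espa\\u00f1a"'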
As you can see, there were no errors, but it doesn't seem to have worked either. Actually it did work (the displayed string is no longer of type str but of type unicode), but the character \u00f1 was not displayed as expected. This is because what we have is the equivalent of having typed the text into python with \\u, since our string literally contains a \ and not a unicode character.

This last part can be difficult to understand, but it doesn't really matter, since the first thing you will do after receiving a JSON string is decode it with json.loads(), and this method already takes care of detecting the \u sequence and converting it into the corresponding python character:
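>>> import json
>>> json.loads(line)
u'Espa\xf1a'
>>> print json.loads(line)
España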
Therefore, it works.
Why do you think it doesn't work for you?
Perhaps instead of printing a string (as in the example above) you are printing a data structure, such as your jsonLine result variable. If you do this, you will apparently see strange things:
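>>> jsonLine = json.loads('{"country": "Espa\\u00f1a"}')   # e.g. one of your parsed lines
>>> print jsonLine
{u'country': u'Espa\xf1a'}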
This is because when you print a dictionary or a list, python shows you the representation of the data inside it. Here we can see, for example, that both the key and the value are Unicode strings (they have a u in front of the quotes), and that within those strings the non-ASCII characters are displayed in their "python representation" (\xf1). But this is only how it is shown. Internally \xf1 is a Unicode ñ, and it will appear as such as soon as you print that string on its own.

It may also be that instead of printing the string from python, you are converting your results back to JSON. In that case the encoder json.dumps() will re-encode each non-ASCII character into the standard JSON form, which is \uXXXX:
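>>> print json.dumps(jsonLine)
{"country": "Espa\u00f1a"}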
But again, there is no error here. It is the correct behavior. The JSON generated this way is pure ASCII, it does not depend on encodings, and when a JavaScript client consumes it and tries to display it, it will correctly show "España".
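You can even check the round trip from python itself (a quick illustrative session, continuing from the jsonLine above):

>>> print json.loads(json.dumps(jsonLine))[u'country']
España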