I am doing sentiment analysis by country using python2.7, on a JSON file that I obtained with the Twitter API. My problem is that, despite assigning the default encoding as suggested in various forums, and also encoding the text, I can't 'translate' the 'rare' characters. I assign the default encoding:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
When assigning values to the place variable, if I don't force any encoding change, the resulting country names show the strange characters:
try:
    jsonLine = json.loads(line)
    place = jsonLine["place"].get('country')
    text = jsonLine["text"]
    score = self.tweet_Score(text, weights)
    yield (place, score)
except:
    pass
Result Example:
"Mexico" 217.41 "El Salvador" 7.78 "United Arab Emirates" 0 "Spain" 300.62
If instead I decode during the assignment to place, using .decode('utf-8').encode('utf-8'):
try:
    jsonLine = json.loads(line)
    place = jsonLine["place"].get('country').decode('utf-8').encode('utf-8')
    text = jsonLine["text"]
    score = self.tweet_Score(text, weights)
    yield (place, score)
except:
    pass
With this last version, the records with strange characters disappear from my results and no longer receive the scores that had been calculated for them (which is not correct). I have tried different combinations of decode and encode, but the behavior is as described.

I have considered doing some replace calls to patch the most frequent cases, but that would not be appropriate, because I have the same problem in the content of the texts that I analyze for scoring, where there are many more cases. So I suppose there must be some solution through the encoding, but I don't know what else to try.

Thanks in advance for the help!
P.S. To give additional information, this is what the country field I'm using for the example looks like, taken from my actual input file:

"country": "Espa\u00f1a"
The \u sequence in javascript (and JSON)
Those "strange characters" are not an error. They are the way JSON decides to represent non-ascii characters in a way that doesn't depend on the encoding .
Let me explain, taking the case of "España" as an example. The character "ñ" is not part of ASCII, so when putting it into a JSON string we have two options:

1. Store the "ñ" directly, using some encoding. This encoding will typically be UTF-8, and therefore Unicode. In Unicode "ñ" has the code point U+00F1, but when encoded in UTF-8, in which the basic unit is the byte, it occupies two bytes with the values C3 and B1 (hexadecimal). Whoever reads this character string must know that the chosen encoding was UTF-8, in order to "reassemble" those two bytes into a single character (U+00F1) and thus recover the "ñ". If they instead assume an encoding like latin1, where each byte is a single character, they will erroneously decode it as two characters: "Ã±".
2. Use the escape character \. This character serves multiple purposes: it lets you put characters into a string that would otherwise not be visible or would cause confusion. The most typical case is \n for the new-line, but we also have \r for the carriage return, \b for the backspace, \t for the tabulator, etc. And the one that concerns us here, \u, for a Unicode character. It must be followed by four hexadecimal digits that encode the character in question. In our case, the sequence of six characters \u00f1 therefore represents a single one: the eñe.

The second option is preferable, because no encoding was needed to store the Unicode character; it was simply represented by another ASCII sequence. It is as if in HTML you had written &ntilde;, which is also an ASCII sequence that the browser will display as ñ.
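Both forms represent exactly the same string, as a quick check with python's json module shows (an illustrative session, assuming a UTF-8 terminal):

>>> import json
>>> json.loads('{"country": "España"}') == json.loads('{"country": "Espa\\u00f1a"}')
True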
The fact that a JSON text contains "Espa\u00f1a" is therefore not a problem. It contains the correct string, and when a JavaScript program tries to display it, the eñe is shown correctly.

The \u sequence in python2
This sequence has no special meaning for python2. If a string contains \u00f1, it is displayed as is, as six literal characters:
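>>> s = "\u00f1"    # a byte string: \u is not an escape here
>>> print s
\u00f1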
But if it is a unicode string (with a u in front of the opening quote), then the sequence is recognized and processed:
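>>> s = u"\u00f1"
>>> print s
ñ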
The most common form in python, though, is not \u00f1 but \xf1, which is also recognized:
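>>> s = u"\xf1"
>>> print s
ñ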
Be careful, though: either of these two forms stores in the string the Unicode character that represents the ñ, and not the character sequence \u00f1 or \xf1. Those sequences are processed and converted into the corresponding character. If we wanted to store those literal sequences, we would have to escape the \ with another \, to prevent it from being processed (and then a single \ is stored in the string). So:
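>>> s = u"\\u00f1"   # the backslash is escaped: this string contains no eñe
>>> print s
\u00f1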
The difference between putting only one \ or putting two is that in the second case the resulting text no longer contains any eñe, but simply an ASCII sequence (of which the characters \, u, 0, etc., among others, form part). This is best understood by looking at the length of these strings:
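>>> len(u"\u00f1")    # one Unicode character: the eñe
1
>>> len(u"\\u00f1")   # six ASCII characters: \, u, 0, 0, f, 1
6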
json and python
And finally we get to the heart of the matter. We have in JSON a string that contains \u00f1, which, as we have seen, is legal in JSON, and we want to read it from python. For example, suppose we have read from a file (or from a socket, it doesn't matter) the string, which we have stored in line, and which is the following:
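>>> line = '"Espa\\u00f1a"'   # illustrative: the country field from your file, pure ASCII with a literal \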
In python 2, reading from a file (or from a socket) produces a str, which is a byte string, not a Unicode string. We can try to convert it to Unicode, for which in general we would need to know the encoding of the file it was read from. But in this case, for the reasons above, the encoding is irrelevant, since the eñe has been represented as the ASCII sequence \uXXXX. So the following should work without errors:
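>>> uline = line.decode('ascii')   # any codec would do: the text is pure ASCII
>>> uline
u'"Espa\\u00f1a"'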
As you can see, there were no errors, but it doesn't seem to have worked either. Actually it did work (the displayed string is no longer of type str but of type unicode), but the character \u00f1 was not displayed as expected. This is because what we have is the equivalent of having typed the text into python with \\u, since our string literally contains a \ and not a unicode character.

This last part can be difficult to understand, but it doesn't really matter, since the first thing you will do after receiving a JSON string is decode it with json.loads(), and this method already takes care of detecting the \u sequence and converting it into the corresponding python character:
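>>> import json
>>> json.loads(line)
u'Espa\xf1a'
>>> print json.loads(line)
España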
Therefore, it works.
Why do you think it doesn't work for you?
Perhaps instead of printing a string (as in the example above) you are printing a data structure, such as your jsonLine result variable. If you do this, you will apparently see strange things:
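>>> jsonLine = json.loads('{"country": "Espa\\u00f1a"}')   # e.g. one of your parsed lines
>>> print jsonLine
{u'country': u'Espa\xf1a'}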
This is because when you print a dictionary or a list, python shows you the representation of the data inside it. Here we can see, for example, that both the key and the value are Unicode strings (they have a u in front of the quotes), and that within those strings the non-ASCII characters are displayed in their "python representation" (\xf1). But this is only how it is shown. Internally \xf1 is a Unicode ñ, and it will appear as such as soon as you print that string on its own.

It may also be that instead of printing the string from python, you are converting your results back to JSON. In that case the encoder json.dumps() will re-encode each non-ASCII character into the standard JSON form, which is \uXXXX:
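>>> print json.dumps(jsonLine)
{"country": "Espa\u00f1a"}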
But again, there is no error here. It is the correct behavior. The JSON generated this way is pure ASCII, it does not depend on encodings, and when a JavaScript client consumes it and tries to display it, it will correctly show "España".
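You can even check the round trip from python itself (a quick illustrative session, continuing from the jsonLine above):

>>> print json.loads(json.dumps(jsonLine))[u'country']
España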