Question

XBoss

Asked: 2020-08-02 02:22:52 +0800 CST

go from utf8 to "readable" representation

I've looked at a thousand different sites, Stack Overflow posts, blogs, etc., and I can't find a solution for my situation.

I have a database where I store this message:

"Hola, me llamo العربية" and it is stored as UTF-8:

"\x48\x6F\x6C\x61\x2C\x20\x6D\x65\x20\x6C\x6C\x61\x6D\x6F\x20\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9"

I want to be able to store that value in a variable and display the letters as I wrote them. For now it shows me something like this:

"Hola, me llamo Ø§ÙØ¹Ø±Ø¨ÙØ©"

I have tried to encode and decode with all the encodings I know, but nothing...

If I do a print, it shows me the text correctly, but if I try to store that variable in some external file, it is stored with the "strange" characters.

Does anyone have any way to guide me to the solution?

Thank you very much

--------------- edit -------------

If instead of storing the variable in the file, I store the text directly, it is displayed correctly, so it is not a problem with the encoding of the destination file.

2 Answers

abulafia · Answer 1 · 2020-08-02T08:07:57+08:00

After a long session of "chat debugging" I managed to diagnose the problem and give it a solution. I will leave the result documented here, although I doubt that it will be useful to anyone else as it is a very particular problem of this user and difficult to extrapolate to other cases.

Problem

The user had simplified the problem, because the real one involved reading from one database to write to another, and the information read was in a binary format from which the relevant part, the text message (in Arabic or other scripts), had to be extracted with heuristics.

The user thought that since the problem appeared in the encoding of these messages, the question could focus on this point only and change the databases to files to simplify.

The problem is that some of the omitted details were relevant. Basically the problem originated from the fact that what was being read from the database was not just text messages. Sometimes there were also messages encoded in binary that were not text, but data structures.

The user intended to save both kinds in another database, in MongoDB. The problem was that some of the messages failed to be sent to MongoDB, because MongoDB only accepts valid UTF-8 text (ASCII included), while the messages with binary structures did not contain valid text.

Faced with this problem, the user tried encoding with different encodings until he found one that stopped giving him errors in this binary data. But of course, the consequence was that the text data was no longer saved correctly.
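As an illustration of that failure mode, here is a minimal Python sketch. It assumes the permissive encoding was Latin-1; the thread does not say which one the user ended up with, although the garbled output in the question is consistent with it. Latin-1 maps every byte to a character, so decoding never raises an error, but UTF-8 text comes out garbled. The variable names are made up for the example.

```python
# Bytes from the question: "Hola, me llamo " followed by Arabic text in UTF-8.
raw = (b"\x48\x6F\x6C\x61\x2C\x20\x6D\x65\x20\x6C\x6C\x61\x6D\x6F\x20"
       b"\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9")

# Assumption: Latin-1 stands in for "an encoding that stopped giving errors".
as_latin1 = raw.decode("latin-1")  # never fails: one character per byte
as_utf8 = raw.decode("utf-8")      # the intended text

# Same bytes, different lengths: each Arabic letter takes two bytes in UTF-8.
print(len(raw), len(as_latin1), len(as_utf8))

# The damage is reversible as long as the characters were not altered afterwards.
repaired = as_latin1.encode("latin-1").decode("utf-8")
print(repaired == as_utf8)
```

The round trip on the last lines only works if nothing modified the wrongly decoded string in between, so it is a repair of last resort rather than a substitute for decoding correctly in the first place.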

Solution

When you read the data from the original database, you have to detect whether the message is text or some other binary structure. The text messages can be sent to MongoDB without modification (because they will be valid UTF-8, whether in Arabic or in any other alphabet). Those that are binary data must be converted, for example to base64, so that MongoDB can store them.

The user will possibly have more information to decide if what is read from the DB is a message or not, but a simple heuristic could be to try to decode it as UTF-8. If it fails, it is assumed to be binary:

import base64

msg = obtener_mensaje_de_base_de_datos(query)
try:
    txt = msg.decode("utf-8")
except UnicodeDecodeError:
    # Not text. Re-encode it, for example as base64 (an ASCII-only str).
    txt = base64.b64encode(msg).decode("ascii")
From there, txt can be sent to MongoDB.
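Putting the heuristic together, here is a minimal sketch of how both kinds of message could be prepared for storage. The function name to_storable and the "kind"/"value" keys are hypothetical, chosen only for this example, and the actual MongoDB insert call is left out.

```python
import base64


def to_storable(msg: bytes) -> dict:
    """Classify raw bytes as UTF-8 text or binary and return a storable form."""
    try:
        return {"kind": "text", "value": msg.decode("utf-8")}
    except UnicodeDecodeError:
        # Not valid UTF-8: keep the payload as base64, which is ASCII-only text.
        encoded = base64.b64encode(msg).decode("ascii")
        return {"kind": "base64", "value": encoded}


# A text message (the one from the question) and a made-up binary payload.
text_doc = to_storable("Hola, me llamo العربية".encode("utf-8"))
binary_doc = to_storable(b"\xff\xfe\x00\x01")

# The original bytes of a binary message can be recovered later.
restored = base64.b64decode(binary_doc["value"])
print(text_doc["kind"], binary_doc["kind"], restored == b"\xff\xfe\x00\x01")
```

Recording the kind next to the value makes it possible to tell later which entries need base64 decoding. Keep in mind that the heuristic can misclassify binary data that happens to be valid UTF-8, so any extra information available about the message type is worth using.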

Pablo Lozano · Answer 2 · 2020-08-02T03:46:41+08:00

The supposed values of the Arabic characters seem suspiciously low to me: two hexadecimal digits means that each one is a number less than 256, that is, a single byte.

I know your problem is with Python but, keeping the encoding, in any language we should have similar results, so I'll use Javascript for the immediacy of the results to show you what you have in the database:

I have removed the ", me llamo" part to simplify; the texts are "Hola " + <problematic part>:

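// Characters built from the escaped values stored in the database
let textoHex = "\x48\x6F\x6C\x61\x20\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9";
console.log(textoHex);

// The same text typed directly
let textoPlano = "Hola العربية";
console.log(textoPlano);

// Print the code unit of each character of the plain text as \xNN
let hex = '';
for (let i = 0; i < textoPlano.length; i++) {
  hex += '\\x' + textoPlano.charCodeAt(i).toString(16).toUpperCase();
}
console.log(hex);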

You can see that the Arabic characters don't match: the Unicode numbers are totally different (much higher, they are all \x6__), so I'm afraid what you're saving to the database is wrong. At some intermediate step the values are being transformed into another format.
