I've looked at a thousand different sites, Stack Overflow posts, blogs, etc., and I can't find a solution for my situation.
I have a database where I store this message:
"Hola, me llamo العربية"
and it is stored as UTF-8:
"\x48\x6F\x6C\x61\x2C\x20\x6D\x65\x20\x6C\x6C\x61\x6D\x6F\x20\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9"
I want to be able to store that value in a variable and display the letters as I wrote them. For now it shows me something like this:
"Hola, me llamo اÙعربÙØ©"
I have tried to encode and decode with all the encodings I know, but nothing...
If I do a print, it shows me the text correctly, but if I try to store that variable in some external file, it is stored with the "strange" characters.
Does anyone have any way to guide me to the solution?
Thank you very much
--------------- edit -------------
If instead of storing the variable in the file, I store the text directly, it is displayed correctly, so it is not a problem with the encoding of the destination file.
After a long session of "chat debugging" I managed to diagnose the problem and find a solution. I will leave the result documented here, although I doubt it will be useful to anyone else, as it is a very particular problem of this user and difficult to extrapolate to other cases.
Problem
The problem had been simplified by the user, because the real problem involved reading from one database to write to another, and the information read was in a binary format from which the relevant information (the text message, in Arabic or other scripts) had to be extracted with heuristics.
The user thought that since the problem appeared in the encoding of these messages, the question could focus on this point only and change the databases to files to simplify.
The problem is that some of the omitted details were relevant. Basically the problem originated from the fact that what was being read from the database was not just text messages. Sometimes there were also messages encoded in binary that were not text, but data structures.
The user intended to save both kinds of message in another MongoDB database. And the problem was that some of the messages failed to be sent to MongoDB because MongoDB only accepts ASCII or UTF-8 text, while messages with binary structures did not contain valid text.
Faced with this problem, the user tried encoding with different encodings until he found one that stopped giving him errors in this binary data. But of course, the consequence was that the text data was no longer saved correctly.
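To illustrate why that workaround corrupts the text, here is a minimal sketch. It assumes the encoding the user ended up with was a single-byte one such as Latin-1 (the thread does not say which one it was). Such an encoding accepts every possible byte, so it never raises an error, but it turns each byte of a multi-byte UTF-8 character into a separate character.

```python
# The message from the question, as the UTF-8 bytes stored in the database.
raw = (
    b"\x48\x6F\x6C\x61\x2C\x20\x6D\x65\x20\x6C\x6C\x61\x6D\x6F\x20"
    b"\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9"
)

# Assumption: Latin-1 stands in for "the encoding that stopped giving errors".
# It maps every byte to exactly one character, so decoding cannot fail.
wrong = raw.decode("latin-1")

# Decoding with the encoding the bytes were actually written in.
right = raw.decode("utf-8")

print(repr(wrong))  # the Arabic part becomes two "strange" characters per letter
print(right)        # the original message
print(len(wrong), len(right))  # one character per byte vs. one per letter
```

The ASCII part of the message survives either way, which is why only the Arabic portion looks broken.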
Solution
When you read the data from the original database, you have to detect whether the message is text or another binary structure. The text ones can be sent to MongoDB without modifications (because they will be valid UTF-8, either in Arabic or in whatever alphabet). Those that are binary data must be converted, for example, to base64 so that MongoDB can store them.
The user will possibly have more information to decide whether what is read from the DB is a message or not, but a simple heuristic could be to try to decode it as UTF-8. If that fails, it is assumed to be binary.
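A minimal sketch of that heuristic follows. The helper name `to_storable_text` and the variable `raw` are hypothetical, and the sketch assumes the value read from the source database arrives as a `bytes` object.

```python
import base64


def to_storable_text(raw: bytes) -> str:
    """Return a str that MongoDB can store (hypothetical helper)."""
    try:
        # Text messages are valid UTF-8 and pass through unchanged.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Anything else is treated as a binary structure and base64-encoded.
        return base64.b64encode(raw).decode("ascii")


# Example input: the message from the question as raw bytes.
raw = (
    b"\x48\x6F\x6C\x61\x2C\x20\x6D\x65\x20\x6C\x6C\x61\x6D\x6F\x20"
    b"\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9"
)
txt = to_storable_text(raw)
print(txt)
```

The heuristic is not perfect: a binary structure that happens to consist only of valid UTF-8 sequences would be treated as text, so any extra information about the record type is worth using first.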
From there, txt can be sent to MongoDB.
The supposed values of the Arabic characters seem suspiciously low to me: two hexadecimal digits means that they are numbers less than 256, so each one is a single byte.
I know your problem is with Python but, keeping the encoding, in any language we should have similar results, so I'll use JavaScript for the immediacy of the results to show you what you have in the database:
I have removed the part ", me llamo" to simplify; the texts are "Hola " + <problematic part>.
You can see that the Arabic characters don't match: the Unicode numbers are totally different (much higher, they are all \x6__), so I'm afraid what you're saving to the database is wrong. At some intermediate step the values are being transformed to another format.
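The comparison described above can be sketched in Python instead of JavaScript. This is only an illustration: the variable names are made up, and the stored value is copied from the question with ", me llamo" removed.

```python
# The expected text, with ", me llamo" removed as described above.
text = "Hola العربية"

# Unicode code point of every character in the expected text.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The values stored in the database, taken from the question.
stored = b"\x48\x6F\x6C\x61\x20\xD8\xA7\xD9\x84\xD8\xB9\xD8\xB1\xD8\xA8\xD9\x8A\xD8\xA9"
byte_values = [f"{b:02X}" for b in stored]

print(code_points)  # the Arabic letters all fall in the U+06xx range
print(byte_values)  # every stored value is a single byte, below 0x100

# One way to check how the two lists relate: decode the bytes as UTF-8.
print(stored.decode("utf-8") == text)
```

The first four entries (for "Hola") are the same in both lists, while each Arabic letter corresponds to one code point in the first list and two byte values in the second.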