In a question that will probably be closed due to its low quality and the user's lack of interest, the following text was pasted. It was clearly corrupted, because "strange" characters appeared in place of the accented vowels and the letter ñ:
with an iron rod) that there are too many redundancies and that only the argument should suffice. What I can say is my name. My name is"
Apparently the user received a text in some unknown encoding, opened it in an editor that assumed a different (also unknown) encoding, and copied and pasted what that editor displayed into the question.
Probably their operating system, regardless of the encoding used by the editor, converted the text to Unicode in order to place it on the clipboard, so the version that was finally pasted into the question is the UTF-8 representation of those Unicode characters.
The question is: how can the original encoding of the data be determined, so that the text can be restored to how it should look?
My solution uses Python 3 and a bit of detective work. We start by assigning the text copied from the question to a variable:
Python has libraries for auto-detecting the encoding of a sequence of bytes, such as the `chardet` module. However, this kind of solution does not work here, because we do not have access to the original byte sequence, only to the result of pasting the text into Stack Overflow, which was then converted to UTF-8. In fact, `chardet.detect()` expects a byte string as its parameter, but all we have in this case is a character string. We would have to convert it to bytes with something like `texto_mal.encode(...)`, and there we would have to specify an encoding, which is part of what we want to discover. What to do then?
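Before answering, the obstacle itself can be made concrete. The following is only a small sketch using the standard library: the value given to `texto_mal` is a single-character stand-in chosen for illustration (not the full pasted text), and the three codecs are arbitrary examples. It shows that the same character string yields different bytes depending on the encoding we pick, so a byte-level detector would only ever see our own guess.

```python
# Hypothetical stand-in for the pasted text: one of the "strange" characters.
texto_mal = "‡"

# The same str becomes different byte sequences under different encodings,
# so whatever we fed to a byte-based detector would reflect our own choice.
candidates = {
    codec: texto_mal.encode(codec)
    for codec in ("utf-8", "cp1252", "mac_roman")
}

for codec, raw in candidates.items():
    print(f"{codec:10} {raw!r}")
```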
We can use a heuristic, relying on the fact that the text is in Spanish and that we can mostly read its content. In fact, we can deduce that `‡` represents the letter `á`, `’` represents `í`, `ƒ` is probably `É`, and so on. Let us focus for the moment on just one of these characters. We can then reformulate the question as: which pair of encodings turns an `á` into a `‡`?
To answer it, I tried encoding the symbol `"‡"` with every encoding supported by Python to get a byte (or sequence of bytes), which I then decoded with each of the possible encodings, to see which combinations produced an `"á"` (ignoring all those that raised encoding errors, of course).
The result was a set of 44 pairs of encodings. One of them, which I chose because it seemed the most likely, was `cp1252`, `mac_roman`.
This means that the user (always hypothetically) received a text file encoded in `mac_roman` (used on older Macs) but opened it with an editor that used `cp1252` (probably Windows Notepad), and so saw all those weird characters. When they were copied and pasted into Stack Overflow, they were received as Unicode ("utf-8"), which complicates the problem further, since the original bytes can then no longer be seen.
Thus, after encoding the text given by the user with `cp1252` and decoding it again with `mac_roman`, it becomes readable.
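The search and the repair described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code originally used: the function names `find_encoding_pairs` and `repair` are made up for this example, taking the codec names from `encodings.aliases` is an assumption about how to enumerate "all encodings", and the sample string `"‡rbol"` is hypothetical. The number of pairs found can vary between Python versions and platforms, so it is printed rather than assumed.

```python
import encodings.aliases


def find_encoding_pairs(seen: str, wanted: str) -> list[tuple[str, str]]:
    """Return (enc, dec) pairs such that seen.encode(enc).decode(dec) == wanted."""
    # Canonical codec names known to this Python build (assumed source of names).
    names = sorted(set(encodings.aliases.aliases.values()))
    pairs = []
    for enc in names:
        try:
            raw = seen.encode(enc)
        except (LookupError, ValueError):
            # Not a text codec, not available here, or cannot encode the symbol.
            continue
        for dec in names:
            try:
                if raw.decode(dec) == wanted:
                    pairs.append((enc, dec))
            except (LookupError, ValueError):
                # Same reasons as above, on the decoding side.
                continue
    return pairs


def repair(text: str, shown_as: str = "cp1252", stored_as: str = "mac_roman") -> str:
    """Undo the mix-up: recover the bytes the editor misread, then decode them properly."""
    return text.encode(shown_as).decode(stored_as)


pairs = find_encoding_pairs("‡", "á")
print(len(pairs), "candidate pairs")
print(("cp1252", "mac_roman") in pairs)  # True
print(repair("‡rbol"))  # hypothetical sample; prints "árbol"
```

Many of the candidate pairs are close relatives of one another (for instance, several Windows and Mac code pages share the same byte for a given character), which is why picking among them still takes some judgment about which combination is plausible for the user's setup.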