
Question

abulafia

Asked: 2020-05-09 07:36:36 +0800 CST

How can I find out the encoding of this text?


In a question that will probably be closed for its low quality and its author's lack of interest, the following text was pasted. It is clearly corrupted, because "strange" characters appear in place of accented vowels and the letter ñ:

con una vara de fierro) que las redundancias sobran y que s—lo debe bastar con el argumento. Lo que s’ puedo decir es mi nombre. Me llamo"

Apparently the user received a text in some unknown encoding, opened it in an editor that assumed another (also unknown) encoding, and copied and pasted what that editor displayed into the question.

Probably the operating system, regardless of the encoding used by the editor, converted the text to Unicode in order to place it on the clipboard, so the version finally pasted into the question is the UTF-8 representation of those Unicode characters.

The question is: how can the original encoding of the data be determined, so that the text can be restored to how it should look?

1 Answer


abulafia · Answer 1 · 2020-05-09T07:46:29+08:00

My solution uses Python 3 and a bit of detective work. We start by assigning the text copied from the question to a variable:

texto_mal = "ƒsta ser‡ una historia de terror. Ser‡ una historia polic’aca, un relato de serie negra y de terror. Pero no lo parecer‡. No lo parecer‡ porque soy yo la que lo cuenta. Soy yo la que habla y por eso no lo parecer‡. Pero en el fondo es la historia de un crimen atroz. Yo soy la amiga de todos los mexicanos. Podr’a decir: soy la madre de la poes’a mexicana, pero mejor no lo digo. Yo conozco a todos los poetas y todos los poetas me conocen a m’. As’ que podr’a decirlo. Podr’a decir: soy la madre y corre un cŽfiro de la chingada desde hace siglos, pero mejor no lo digo. Podr’a decir, por ejemplo: yo conoc’ a Arturito Belano cuando Žl ten’a diecisiete a–os y era un ni–o t’mido que escrib’a obras de teatro y poes’a y no sab’a beber, pero ser’a de algœn modo una redundancia y a m’ me ense–aron (con un l‡tigo me ense–aron, con una vara de fierro) que las redundancias sobran y que s—lo debe bastar con el argumento. Lo que s’ puedo decir es mi nombre. Me llamo"

In Python there are libraries that autodetect the encoding of a sequence of bytes, such as the chardet module. However, this type of solution does not work here, because we do not have access to the original byte sequence, only to the result of pasting the text into Stack Overflow, which was then converted to UTF-8.

In fact, chardet.detect() expects a byte string as its parameter, but all we have in this case is a character string. We would have to convert it to bytes with something like texto_mal.encode(...), and there we would have to specify an encoding, which is part of what we want to discover.
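To illustrate why the choice of encoding cannot be skipped, here is a minimal sketch using only the standard library. The variable names are illustrative and not part of the original answer. It shows that one and the same character produces different bytes depending on the encoding chosen, so any detector fed with those bytes would only be analysing our own guess.

```python
# One character, three encodings, three different byte sequences.
symbol = "‡"

as_cp1252 = symbol.encode("cp1252")
as_mac_roman = symbol.encode("mac_roman")
as_utf8 = symbol.encode("utf-8")

for name, raw in [("cp1252", as_cp1252),
                  ("mac_roman", as_mac_roman),
                  ("utf-8", as_utf8)]:
    # Show the raw bytes as hexadecimal for easy comparison.
    print(name, raw.hex())
```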

What to do then?

We can use a heuristic, relying on the fact that the text is in Spanish and that we can mostly read it. In fact, we can deduce that ‡ represents the letter á, that ’ represents í, that ƒ is probably É, and so on.
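A quick way to see which symbols need explaining is to count the non-ASCII characters in the garbled text. The following is only a sketch: the helper name suspicious_characters is hypothetical, and it runs on a short excerpt of the text rather than on the whole of texto_mal.

```python
from collections import Counter

# Short excerpt of the garbled text (the full text would work the same way).
sample = "ƒsta ser‡ una historia de terror. Ser‡ una historia polic’aca"

def suspicious_characters(text):
    """Count every non-ASCII character found in the text."""
    return Counter(ch for ch in text if ord(ch) > 127)

counts = suspicious_characters(sample)
for ch, n in counts.most_common():
    # Print the character, its Unicode code point and how often it occurs.
    print(repr(ch), f"U+{ord(ch):04X}", n)
```

Each symbol listed can then be matched by eye with the Spanish letter it replaces in a recognisable word.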

Let us focus for the moment on just one of these characters. We can then reformulate the question as:

Which pair of encodings places the symbol ‡ and the letter á at the same byte value?

To answer it, I encoded the symbol "‡" with every encoding supported by Python to get a byte (or sequence of bytes), and then decoded those bytes with each of the encodings in turn, to see which ones produced "á" (ignoring, of course, every combination that raised an encoding error):

codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig']

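for e in codecs:
    # Encode the garbled symbol once per candidate "editor" encoding.
    try:
        r = "‡".encode(e)
    except (UnicodeError, LookupError):
        continue  # this encoding cannot represent the symbol
    for d in codecs:
        # Decode the same bytes with each candidate "original" encoding.
        try:
            b = r.decode(d)
        except (UnicodeError, LookupError):
            continue  # these bytes are not valid in this encoding
        if b == 'á':
            print(e, d)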

The result was a set of 44 pairs of encodings. One of them, which I chose because it seemed the most likely, was cp1252 mac_roman.
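One possible way to narrow the candidates, which the answer above does not do, is to require that a pair explain several substitutions at once instead of only ‡ → á. The sketch below makes two assumptions: the mapping comes from reading words such as "cŽfiro", "a–os", "s—lo" and "algœn" in the garbled text, and the list of encodings is a small subset chosen for brevity. The name consistent_pairs is hypothetical.

```python
# Substitutions read off the garbled text (an assumption based on the
# Spanish words that can still be recognised in it).
mapping = {"‡": "á", "’": "í", "Ž": "é", "–": "ñ",
           "—": "ó", "œ": "ú", "ƒ": "É"}

# A small subset of Python's encodings, chosen only to keep the example short.
candidates = ["cp437", "cp850", "cp1250", "cp1252", "latin_1",
              "iso8859_15", "mac_roman", "mac_latin2", "utf_8"]

def consistent_pairs(mapping, codecs):
    """Return the (encode, decode) pairs that explain every substitution."""
    pairs = []
    for e in codecs:
        for d in codecs:
            try:
                ok = all(bad.encode(e).decode(d) == good
                         for bad, good in mapping.items())
            except (UnicodeError, LookupError):
                continue  # some symbol cannot be encoded or decoded
            if ok:
                pairs.append((e, d))
    return pairs

pairs = consistent_pairs(mapping, candidates)
for e, d in pairs:
    print(e, d)
```

A pair that survives this stricter test is a much stronger candidate than one that only matches a single character.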

This means that the user (always hypothetically) received a text file encoded in mac_roman (used on older Macs) but opened it with an editor that assumed cp1252 (probably Windows Notepad), and so saw all those weird characters. When they were copied and pasted into Stack Overflow, they were received as Unicode ("utf-8"), which complicates the problem further, since the original bytes are then no longer visible.

Thus, encoding the text given by the user with cp1252 and decoding it again with mac_roman makes it readable:
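The hypothesis can be checked by running it forwards: take correct Spanish text, encode it as mac_roman, and read the bytes as cp1252. This is only a sketch, and the function names garble and repair are hypothetical. If the hypothesis holds, the result should look like the text pasted in the question.

```python
def garble(text, real="mac_roman", assumed="cp1252"):
    """Simulate opening `real`-encoded bytes in an editor that assumes `assumed`."""
    return text.encode(real).decode(assumed)

def repair(text, real="mac_roman", assumed="cp1252"):
    """Undo garble(): recover the bytes with `assumed`, then read them as `real`."""
    return text.encode(assumed).decode(real)

original = "Ésta será una historia policíaca"
broken = garble(original)
print(broken)          # should resemble the beginning of texto_mal
print(repair(broken))  # should give back the original sentence
```

Note that this sketch would raise an error for characters whose mac_roman byte has no assignment in Python's cp1252 codec, so it is not a general-purpose repair tool.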

texto = texto_mal.encode("cp1252").decode("mac_roman")
print(texto)

This will be a horror story. It will be a detective story, a noir and horror tale. But it won't look like it. It won't seem like it because I'm the one who tells it. It's me who speaks and that's why it won't seem like it. But deep down it is the story of a heinous crime. I am the friend of all Mexicans. I could say: I am the mother of Mexican poetry, but I better not say it. I know all the poets and all the poets know me. So I could say it. I could say: I'm the mother and a zephyr of shit has been running for centuries, but it's better not to say it. I could say, for example: I met Arturito Belano when he was seventeen years old and he was a shy boy who wrote plays and poetry and didn't know how to drink, but it would be somewhat redundant and they taught me (with a whip they taught me, with an iron rod) that there are too many redundancies and that only the argument should suffice. What I can say is my name. My name is
