I would like to know how to convert a string like this: U+1F601 to this format: \xF0\x9F\x98\x81
We can see an example on this page: https://apps.timwhitlock.info/emoji/tables/unicode
That page lists each character's Unicode code point together with its value in bytes.
I use Python 2.7.
This website does what I want, but I don't know how it works internally: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=1F601&mode=hex
Python's Unicode strings have an .encode() method that lets you specify which encoding to convert to. In your case it is enough to specify utf8. But the question remains of how to put an arbitrary Unicode character (in your case U+1F601) inside the string. How to do it depends on the character's code point.
If the code fits in 8 bits, you write \xHH, where HH is the hexadecimal representation of those 8 bits. Note that we are talking about the Unicode code point, not its UTF-8 encoding. So, for example, the code for the eñe (ñ) is U+00F1, but since the high part is 00, we only need to specify F1, which fits in eight bits, so it would be \xf1. Its UTF-8 representation is a different matter: it takes two bytes, which we can obtain by calling .encode("utf8") on the string.
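As a minimal sketch of that last step (the variable names are only illustrative): note the u prefix, which in Python 2.7 is what makes the literal a Unicode string rather than a byte string. Without it, calling .encode("utf8") on "\xf1" would fail with a UnicodeDecodeError.

```python
# -*- coding: utf-8 -*-
# The eñe, U+00F1, written with the 8-bit \xHH escape inside a Unicode literal
enye = u"\xf1"

# Encoding to UTF-8 turns the single character into two bytes: C3 B1
enye_utf8 = enye.encode("utf8")

print(len(enye))        # 1 character
print(len(enye_utf8))   # 2 bytes
print(repr(enye_utf8))  # shows the escaped bytes \xc3\xb1
```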
If it doesn't fit in 8 bits but does fit in 16, such as the code for the euro sign (€), which is U+20AC, you can use the form \uXXXX, where XXXX is the hexadecimal representation of those 16 bits. Its UTF-8 encoding is obtained the same way as before, with .encode("utf8").

Finally, if it doesn't fit in 16 bits either, as is the case with emojis and your example, you have to represent it with 32 bits using the form
\UXXXXXXXX, where XXXXXXXX is the hexadecimal representation of those 32 bits. Your example, U+1F601, would be written as \U0001F601. To get the bytes of its UTF-8 encoding, call .encode("utf8") as before.

Note that the last option is the most general of all, since what fits in 8 bits also fits in 32. Therefore, the eñe could be written as
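\xf1 and also as \U000000f1.

Update. If what you have is a string of the form "U+XXXXX" and you want the UTF-8 version of the character it represents, you don't need any of the above. It is enough to extract what comes after the U+, parse it as a hexadecimal integer, and use unichr() (chr() in Python 3) to obtain the Unicode character for that code point. Once you have the character, call .encode("utf8") on it to get its encoding. For example, U+1F601 yields the bytes \xF0\x9F\x98\x81. Note that on a narrow Python 2.7 build unichr() raises ValueError for code points above U+FFFF, which includes emojis; in that case you can decode the equivalent \UXXXXXXXX escape with the unicode-escape codec instead.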
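The steps from the update could be sketched roughly like this. The function name codepoint_to_utf8 is hypothetical, the unichr alias is there only so the same code also runs on Python 3, and the fallback branch assumes it is reached only on a narrow Python 2.7 build.

```python
# -*- coding: utf-8 -*-
try:
    unichr            # exists on Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a Unicode character


def codepoint_to_utf8(label):
    """Return the UTF-8 bytes for a string such as "U+1F601" (hypothetical helper)."""
    if not label.upper().startswith("U+"):
        raise ValueError("expected a string of the form U+XXXX")
    code = int(label[2:], 16)          # the hexadecimal digits after "U+"
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("code point out of range")
    try:
        char = unichr(code)
    except ValueError:
        # Assumption: only a narrow Python 2 build gets here, because its
        # unichr() stops at 0xFFFF. Decode the \UXXXXXXXX escape instead.
        char = ("\\U%08x" % code).decode("unicode-escape")
    return char.encode("utf8")


print(repr(codepoint_to_utf8("U+1F601")))   # bytes F0 9F 98 81
print(repr(codepoint_to_utf8("U+20AC")))    # bytes E2 82 AC
print(repr(codepoint_to_utf8("U+00F1")))    # bytes C3 B1
```

The results match what you get by writing the escapes directly, for example u"\U0001F601".encode("utf8").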