Question

XBoss

Asked: 2020-07-26 02:19:48 +0800 CST

Convert U+XXXX to hexadecimal UTF-8

I would like to know how to convert a string like U+1F601 to this format: \xF0\x9F\x98\x81

We can see an example on this page: https://apps.timwhitlock.info/emoji/tables/unicode

It lists each character's Unicode code point alongside its value in bytes.

I use Python 2.7.

This website does what I want, but I don't know how it works internally: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=1F601&mode=hex

1 Answer

abulafia · Answer 1 · 2020-07-26T03:10:56+08:00

Python's Unicode strings have an .encode() method that lets you specify the encoding to convert to. In your case, specifying utf8 is enough. The remaining question is how to put an arbitrary Unicode character (in your case U+1F601) inside the string.

How to do it depends on the character code.

If the code fits in 8 bits, you write \xHH, where HH is the hexadecimal representation of those 8 bits. Note that this is the Unicode code point, not its UTF-8 encoding. For example, the code for the letter ñ is U+00F1; since the high byte is 00, only F1 is needed, which fits in eight bits, so it is written \xf1.

Its UTF-8 representation is a different matter: it takes two bytes, which we can obtain with:
```
>>> u'\xf1'.encode("utf8")
b'\xc3\xb1'
```
If it doesn't fit in 8 bits but does fit in 16, as with the euro sign (€), which is U+20AC, you can use the form \uXXXX, where XXXX is the hexadecimal representation of those 16 bits. Its UTF-8 encoding is obtained the same way as before:
```
>>> u'\u20ac'.encode("utf8")
b'\xe2\x82\xac'
```
Finally, if it doesn't fit in 16 bits either, as is the case with emojis and your example, you have to represent it with 32 bits using the form \UXXXXXXXX, where XXXXXXXX is the hexadecimal representation of those 32 bits. Your example, U+1F601, would be written \U0001F601. To get the bytes of its UTF-8 encoding, do the same as before:
```
>>> u'\U0001F601'.encode("utf8")
b'\xf0\x9f\x98\x81'
```

Note that the last form is the most general of all, since whatever fits in 8 bits also fits in 32. The ñ could therefore be written as \xf1 and also as \U000000f1.
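
The equivalence of the three escape forms can be checked directly. The following is a minimal sketch, assuming Python 3, where string literals are Unicode by default (in Python 2.7 each literal would need the `u` prefix). The variable names are only illustrative.

```python
# All three escapes denote the same one-character string (code point U+00F1).
short_form = '\xf1'
medium_form = '\u00f1'
long_form = '\U000000f1'

same = short_form == medium_form == long_form
print(same)                       # True
print(hex(ord(long_form)))        # 0xf1
print(long_form.encode("utf8"))   # b'\xc3\xb1'
```

The escape only affects how the literal is typed in source code; the resulting string, and therefore its UTF-8 bytes, is identical.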

Update: If what you have is a string in the form "U+XXXXX" and you want the UTF-8 version of the character it denotes, you don't need any of the above. It is enough to extract what follows the U+, parse it as a hexadecimal integer, and use chr() to obtain the (Unicode) character with that code. Once you have the character, call .encode("utf8") to get its encoding. Note that chr() accepts any code point only in Python 3; in Python 2.7 the equivalent is unichr(), which on narrow builds rejects code points above U+FFFF. So:

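```
def unicode_to_utf8(unicode_point):
    # Drop the leading "U+" and parse the rest as a hexadecimal integer
    code = int(unicode_point[2:], 16)
    # Build the character for that code point and encode it as UTF-8
    return chr(code).encode("utf8")
```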

Examples:

```
>>> unicode_to_utf8("U+F1")
b'\xc3\xb1'
>>> unicode_to_utf8("U+20AC")
b'\xe2\x82\xac'
>>> unicode_to_utf8("U+1F601")
b'\xf0\x9f\x98\x81'
```
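
Since the question also asks how the conversion works internally and wants the result as text in the form \xF0\x9F\x98\x81, here is a rough sketch of both steps. It assumes Python 3, and the helper names `utf8_encode_manual` and `to_hex_escapes` are hypothetical, introduced only for illustration. In practice .encode("utf8") does the first step for you; the manual version just shows the bit layout UTF-8 uses for each code point range.

```python
def utf8_encode_manual(code):
    # Encode a single Unicode code point (an int) as UTF-8 bytes by hand.
    if code < 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % code)
    if code < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code])
    if code < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code >> 18),
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


def to_hex_escapes(data):
    # Render each byte as a backslash, an "x", and two uppercase hex digits.
    return "".join("\\x%02X" % byte for byte in data)


code = int("U+1F601"[2:], 16)
encoded = utf8_encode_manual(code)
print(encoded == chr(code).encode("utf8"))  # True
print(to_hex_escapes(encoded))              # \xF0\x9F\x98\x81
```

The formatter works on any bytes object, so it can equally be applied to the value returned by .encode("utf8").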
