I'm writing a C++ code tokenizer, in C++ :-)
At the moment I only support UTF-8-encoded source files, so to process the input file I read it line by line and use a simple const char * to access each individual character.
To locate string literals with a prefix, I use the following:
static size_t parseIdentifier( CppToken &tk, const char *beg, const char *end ) {
    while( ( beg != end ) && ( *beg <= ' ' ) ) ++beg;
    if( beg == end ) return 0;
    // strncmp( ) returns 0 on a match
    if( strncmp( beg, "L\"", 2 ) == 0 ) { return parseStringLiteralL( tk, beg, end ); }
    if( strncmp( beg, "u\"", 2 ) == 0 ) { return parseStringLiteralu( tk, beg, end ); }
    if( strncmp( beg, "U\"", 2 ) == 0 ) { return parseStringLiteralU( tk, beg, end ); }
    ...
}
And here is my question: inside those parseStringLiteralX( ) functions, should I assume that the prefixed strings in the source file are still UTF-8 encoded, or, on the contrary, must they already be encoded according to the prefix used?
The documentation that I have found does not clarify what to do:
Phase 5
1) All characters in character literals and string literals are converted from the source character set to the execution character set (which may be a multibyte character set such as UTF-8, as long as the 96 characters of the basic source character set listed in phase 1 have single-byte representations).
And I don't know how to apply this to my parseStringLiteralX( ) functions. That is, should the different functions start like this, with a type conversion:
static size_t parseStringLiteralL( CppToken &tk, const char *beg, const char *end ) {
    const wchar_t *wbeg = reinterpret_cast< const wchar_t * >( beg );
    const wchar_t *wend = reinterpret_cast< const wchar_t * >( end );
    ...
}
or, conversely, should I assume that the string literals are still in UTF-8, and that I am the one who has to transform them to the type indicated by the prefix?
static size_t parseStringLiteralL( CppToken &tk, const char *beg, const char *end ) {
    std::wstring value;
    while( *beg != '"' ) { value.append( 1, utf8_to_wchart( *beg ) ); ++beg; }
    ...
}
Note: the actual source code is not like this; it is just illustrative.
EDIT
I've tried this little test to try to clear it up:
#include <string>
#include <iostream>

int main( ) {
    const wchar_t *test = L"el niño y la niña\n";
    std::cout << reinterpret_cast< const char * >( test ) << '\n';
    std::wcout << test << std::endl;
    return 0;
}
I expected the text to display correctly at least once. However, I get the following:
y
el niño y la niña
Well, I just realized that it will depend on the encoding that WandBox supports... but I'll leave it there, in case it's useful to someone.
That's it. After thinking about it for a bit...
Output:
If we look closely, the first attempt gives totally wrong output. And I know that the source file is in UTF-8, and the terminal used also supports UTF-8.
The only possible explanation is that the string is transformed by the compiler: in the source file, the string is still UTF-8 encoded regardless of the width prefix. It is the compiler that transforms it to the type indicated by the prefix.
Therefore, of the 2 possible options in my question, the 2nd one is the right choice:
Keep reading the file as UTF-8, and transform the literals ourselves to the type indicated by the prefix :-)