I'm writing a C++ code tokenizer, in C++ :-)
At the moment I only support UTF-8-encoded source files, so to process the input file I read it line by line and use a simple const char * to access each individual character.
To locate string literals with a prefix, I use the following:
static size_t parseIdentifier( CppToken &tk, const char *beg, const char *end ) {
    while( ( beg != end ) && ( *beg <= ' ' ) ) ++beg;
    if( beg == end ) return 0;
    // strncmp( ) returns 0 on a match
    if( strncmp( beg, "L\"", 2 ) == 0 ) { return parseStringLiteralL( tk, beg, end ); }
    if( strncmp( beg, "u\"", 2 ) == 0 ) { return parseStringLiteralu( tk, beg, end ); }
    if( strncmp( beg, "U\"", 2 ) == 0 ) { return parseStringLiteralU( tk, beg, end ); }
    ...
}
And here is my question: inside those parseStringLiteralX( ) functions, should I assume that the prefixed strings in the source file are still UTF-8 encoded, or, on the contrary, must they already be encoded according to the prefix used?
The documentation that I have found does not clarify what to do:
Phase 5
1) All characters in character literals and string literals are converted from the source character set to the execution character set (which may be a multibyte character set such as UTF-8, as long as the 96 characters of the basic source character set listed in phase 1 have single-byte representations).
And I don't know how to apply this to my parseStringLiteralX( ) functions. That is, should the different functions start like this, with a type conversion:
static size_t parseStringLiteralL( CppToken &tk, const char *beg, const char *end ) {
    const wchar_t *wbeg = reinterpret_cast< const wchar_t * >( beg );
    const wchar_t *wend = reinterpret_cast< const wchar_t * >( end );
    ...
}
or, conversely, should I assume that the string literals are still in UTF-8, and that I am the one who has to transform them to the type indicated by the prefix?
static size_t parseStringLiteralL( CppToken &tk, const char *beg, const char *end ) {
    std::wstring value;
    while( *beg != '"' ) { value.append( 1, utf8_to_wchart( *beg ) ); ++beg; }
    ...
}
Note: the actual source code is not like this; it is just illustrative.
EDIT
I've tried this little test to try to clear it up:
#include <string>
#include <iostream>

int main( ) {
    const wchar_t *test = L"el niño y la niña\n";
    std::cout << reinterpret_cast< const char * >( test ) << '\n';
    std::wcout << test << std::endl;
    return 0;
}
I expected the text to display correctly at least once. However, I get the following:
y
el niño y la niña
Well, I just realized that it will depend on the encoding that WandBox supports... but I'll leave it there, in case it's useful to someone.
That's it. After thinking about it for a bit...
Output:
If we look closely, the first attempt gives totally wrong output. And I know that the source file is in UTF-8, and the terminal used also supports UTF-8.
The only possible explanation is that the string is transformed by the compiler: in the source file, the string is still UTF-8 encoded regardless of the width prefix. It is the compiler that transforms it to the type indicated by the prefix.
Therefore, of the 2 possible options in my question, the 2nd one is the right choice:
Keep reading the file as UTF-8, and transform the literals ourselves to the type indicated by the prefix :-)