Why doesn't cout show accented vowels or "ñ" with gcc 4.9.4?
I have no idea why this happens. Whenever the program processes the characters of a string that has accented vowels or ñ, it transforms them and they are not displayed properly.
You cannot display UTF-8 characters as if they were ASCII bytes.
The only solution is to check, one by one, that the characters are valid ASCII (7 bits). Any character that does not meet that rule occupies more than 1 byte, and you have to output all of its bytes together.
Every byte of a multi-byte UTF-8 character has its most significant bit (value 128) set to 1, so the check is simple:
if( character & 128 ) { /* this byte is part of a multi-byte UTF-8 character */ }
If any byte passes that check, you are dealing with a multi-byte UTF-8 character.
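As a rough sketch of that check, assuming the input is already UTF-8 (the function names `is_ascii_byte` and `has_multibyte_utf8` are made up for illustration):

```cpp
#include <string>

// True when the byte is plain 7-bit ASCII (most significant bit clear).
bool is_ascii_byte(unsigned char byte) {
    return (byte & 128) == 0;
}

// True when at least one byte of the string belongs to a multi-byte
// UTF-8 sequence, i.e. the text cannot be treated as pure ASCII.
bool has_multibyte_utf8(const std::string& text) {
    for (unsigned char byte : text) {
        if (!is_ascii_byte(byte)) {
            return true;
        }
    }
    return false;
}
```

The cast to `unsigned char` matters because plain `char` may be signed, which makes bit tests on bytes above 127 easy to get wrong.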
For characters of this type, you have to use some library to extract the character and convert it into a string, and then display that string.
Keep in mind that you can find more than one multi-byte UTF-8 character in a row, so you can't take the easy way of appending bytes to an auxiliary string for as long as the check succeeds. You may also run into invalid UTF-8 sequences.
I think Windows provides functions for these things. On Linux, you can use ICU.
EDIT
I never had the need to extract individual characters from a ::std::string... until reading this question ;-)
After a few unexpected annoyances with template< >, I wrote this code, which lets you iterate over the individual characters of UTF-8 strings, whether they are in a const char *VAR="..." or a ::std::string( "..." ). It's not the coolest thing in the world, but it illustrates how to check whether a character is multi-byte UTF-8 and how to treat it depending on its width. It does not take into account possible errors in the UTF-8 encoding; it is only for training purposes:
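The listing itself is not reproduced here. As a stand-in, here is a minimal sketch of the same idea, not the author's code: it reads the width of each character from its first byte and cuts the string accordingly. The names `utf8_char_width` and `split_utf8` are invented for this example, and like the original it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 character that starts with `lead`.
// Assumes `lead` really is the first byte of a well-formed sequence.
std::size_t utf8_char_width(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;  // stray continuation or invalid byte: step over it
}

// Split a UTF-8 string into one std::string per character.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> characters;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t width =
            utf8_char_width(static_cast<unsigned char>(text[pos]));
        // substr clamps the length, so a truncated tail cannot overrun.
        characters.push_back(text.substr(pos, width));
        pos += width;
    }
    return characters;
}
```

Because `split_utf8` takes a `const std::string&`, a `const char *` argument is converted implicitly, so both kinds of string can be passed to it.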
You can iterate over the "bytes" of a string that is in UTF-8 and output those bytes elsewhere.
What you can never do is "interleave" characters/bytes (in this case end of line: the "endl") between those bytes that you are iterating, since there are characters that are made up of two bytes (the ñ, the á, etc) and are not "separable".
To illustrate what I say above, this code works only for Unicode characters below 0x800 (that is, below 8*256; 'ñ' and 'á' are below 1*256), which UTF-8 encodes in at most two bytes:
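#include <iostream>
#include <string>
using namespace std;

int main()
{
    // Walk the string byte by byte (not character by character).
    for (auto const& l : string("áaéeiíóúñ")) {
        cout << l;
        // A byte of the form 11xxxxxx starts a multi-byte character,
        // so hold the line break until its second byte is written.
        if ((l & 0xc0) != 0xc0)
            cout << endl;
    }
}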
Output:
á
a
é
e
i
í
ó
ú
ñ
I have interleaved line breaks only in "some cases" between the output "bytes".
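The same trick can be extended beyond two-byte characters by breaking the line only when the next byte is not a continuation byte (bit pattern 10xxxxxx). A minimal sketch, assuming well-formed UTF-8 input; `one_char_per_line` is a name invented for this example:

```cpp
#include <cstddef>
#include <string>

// Copy `text`, adding '\n' after each complete UTF-8 character.
std::string one_char_per_line(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        out += text[i];
        // The character is complete when the input ends or when the
        // next byte is not a continuation byte (10xxxxxx).
        bool last = (i + 1 == text.size());
        if (last ||
            (static_cast<unsigned char>(text[i + 1]) & 0xC0) != 0x80) {
            out += '\n';
        }
    }
    return out;
}
```

Sending the returned string to `std::cout` should then print one character per line, whatever the width of each character.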
This is due to the locale your program is running with. An example of setting the locale would be:
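The original snippet is missing here. A minimal sketch of what setting the locale could look like, using the standard `std::setlocale`; the helper name `select_locale` is invented for illustration, and whether this changes what is displayed depends on the platform and on the terminal's own encoding:

```cpp
#include <clocale>

// Ask the C runtime to switch to the named locale for all categories.
// An empty name ("") selects the locale configured in the user's
// environment. Returns false when the request cannot be honored.
bool select_locale(const char* name) {
    return std::setlocale(LC_ALL, name) != nullptr;
}
```

Calling `select_locale("")` at the start of `main`, before any output, adopts the environment's locale.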
You can see more information about this at:
Localization functions in C
For quick understanding:
A small test/example program, showing the iterator's use:
After compiling it with g++ -I . -std=c++11 -Wall -pedantic main.cpp, it shows the following result:
It properly displays individual characters, both in char * and std::string, regardless of the bytes they occupy.
I don't know if you solved this issue, but as I see comments like this:
I used gnu++11's std::locale, then cout.imbue( locale( "" ) ); it still shows me the characters incorrectly...
You can make use of the following to display it how you want:
Test: Ideone
Info: wstring, wcout, sync_with_stdio
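The snippet this answer refers to is missing. Going only by the three names listed under Info, a minimal sketch of wide-character output might look like the following. The function name `setup_wide_output` and its `locale_name` parameter are hypothetical, and the sketch assumes the requested locale exists on the system (otherwise `std::locale` throws `std::runtime_error`):

```cpp
#include <iostream>
#include <locale>
#include <string>

// Prepare std::wcout to convert wide characters using the named locale.
// Pass "" to use the locale configured in the environment.
// Meant to be called once, before the program writes any output.
void setup_wide_output(const char* locale_name) {
    // Detach the C++ streams from C stdio so that the locale imbued
    // below governs the conversion done by std::wcout.
    std::ios_base::sync_with_stdio(false);
    std::wcout.imbue(std::locale(locale_name));
}
```

After `setup_wide_output("")`, text held in a `std::wstring` can be written with `std::wcout << text << std::endl;`. It is safer not to mix `std::cout` and `std::wcout` in the same program.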