NOTE
Although it doesn't affect the template code (utf8iterator.hpp), main.cpp only works correctly if it is saved in the right encoding. On my system I work with UTF-8, and main.cpp is saved in that same encoding.
Thanks to user @asdasdasd for pointing this out in the comments.
END OF NOTE
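As a side note on that encoding dependency: one way to make the test string independent of how main.cpp is saved is to spell the non-ASCII bytes out with hex escapes. This is only a minimal sketch, assuming 'ñ' is encoded as the two UTF-8 bytes 0xC3 0xB1; the helper name encodingIndependentTest is hypothetical and not part of the original code.

```cpp
#include <cstring>

// Hypothetical helper (not in the original code): returns "abcdeññ"
// with each 'ñ' written as its explicit UTF-8 bytes 0xC3 0xB1, so the
// result does not depend on the encoding of the source file.
const char *encodingIndependentTest( ) {
    return "abcde\xC3\xB1\xC3\xB1";
}
```

The returned string is 9 bytes long (5 ASCII bytes plus 2 bytes for each 'ñ'), whatever encoding the source file is saved in.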
After reading the question "Why doesn't cout show vowels with tilde or “ñ” with gcc 4.9.4?", I felt the uncontrollable urge to iterate over the individual characters of a ::std::string or a const char[].
After doing some research on the UTF-8 Wikipedia page, I coded this simple template:
// utf8iterator.hpp
#ifndef UTF8ITERATOR_HPP
#define UTF8ITERATOR_HPP
#include <cstddef>
template< typename T > struct utf8iterator {
  //static constexpr char ReplacementCharacter[4] { '\xEF', '\xBF', '\xBD', '\x00' };
  T ptr;
  ::size_t size; // Size of the character, in bytes. == 0 -> ptr has changed.
                 // Its only purpose is to avoid unnecessary writes.
  char bytes[5]; // A UTF-8 character is at most 4 bytes. Leave room for the trailing 0.

  utf8iterator( const T &p ) :
    ptr( p ),
    size( 0 )
  {
    bytes[4] = 0; // Done only once. It is never overwritten.
  }

  utf8iterator &operator=( const T &iter ) {
    ptr = iter;
    size = 0;
    return *this;
    // 'bytes[4] = 0' was already done in the constructor.
  }

  bool operator==( const utf8iterator< T > &other ) const noexcept { return ptr == other.ptr; }
  bool operator!=( const utf8iterator< T > &other ) const noexcept { return ptr != other.ptr; }

  ::size_t calculateSize( ) const {
    if( ( *ptr & 248 ) == 240 ) { // 11110
      return 4;
    } else if( ( *ptr & 240 ) == 224 ) { // 1110
      return 3;
    } else if( ( *ptr & 224 ) == 192 ) // 110
      return 2;

    return 1;
  }

  utf8iterator &operator++( ) {
    if( size ) {
      ptr += size;
      size = 0; // Changing 'ptr' invalidates 'size'.
    } else
      ptr += calculateSize( ); // 'size' is already invalid.

    return *this;
  }

  utf8iterator operator++( int ) {
    utf8iterator tmp( *this );

    if( size ) {
      ptr += size;
      size = 0; // Changing 'ptr' invalidates 'size'.
    } else
      ptr += calculateSize( ); // 'size' is already invalid.

    return tmp;
  }

  operator const char *( ) {
    // If 'size' is invalid, we have to compute the character's size in bytes.
    if( !size ) {
      ::size_t c;
      T iter( ptr );

      size = calculateSize( );

      // Could be optimized by specializing for < const char * > and using ::std::memcpy( ).
      // Copy the number of bytes given by 'size' into the 'bytes' buffer.
      for( c = 0; c != size; ++c ) {
        bytes[c] = *iter;
        ++iter;
      }

      // In the constructor we did 'bytes[4] = 0'. Writes are expensive.
      // Only write the 0 if 'size != 4'.
      if( size != 4 )
        bytes[size] = 0;
    }

    return bytes;
  }
};
#endif
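The size calculation in calculateSize( ) rests on the UTF-8 lead-byte patterns: 0xxxxxxx starts a 1-byte sequence, 110xxxxx a 2-byte one, 1110xxxx a 3-byte one and 11110xxx a 4-byte one. As a standalone sketch of that same check, using the same masks written in hexadecimal (the function name utf8SequenceLength is hypothetical, and continuation or invalid bytes are simply treated as length 1, as in the template):

```cpp
#include <cstddef>

// Hypothetical standalone version of calculateSize( ): returns the byte
// length of a UTF-8 sequence, judged only from its lead byte.
std::size_t utf8SequenceLength( unsigned char lead ) {
    if( ( lead & 0xF8 ) == 0xF0 ) return 4; // 11110xxx (248 / 240)
    if( ( lead & 0xF0 ) == 0xE0 ) return 3; // 1110xxxx (240 / 224)
    if( ( lead & 0xE0 ) == 0xC0 ) return 2; // 110xxxxx (224 / 192)
    return 1; // ASCII; continuation or invalid bytes also end up here
}
```

For example, the lead byte of 'ñ' (0xC3) gives 2, while 'a' gives 1. The sketch does not validate the continuation bytes that follow the lead byte.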
Along with a small test program:
// main.cpp
#include <iostream>
#include "utf8iterator.hpp"
int main( void ) {
  const char *test = "abcdeññ";
  utf8iterator< const char * > charIter( test );

  while( *charIter ) {
    std::cout << charIter.size( ) << ": ";
    std::cout << *charIter << "\n";
    ++charIter;
  }

  std::cout << std::endl;
  return 0;
}
All of this compiles correctly with:
g++ -I . -std=c++11 -Wall -pedantic -o test test.cpp
The expected result would be:
1: a
1: b
1: c
1: d
1: e
2: ñ
2: ñ
However, the actual result is:
1: a
1: b
1: c
1: d
1: e
2:
2:
I'm pretty sure the bug is in const char *utf8iterator::operator*( ), but I can't pin it down.
Any suggestions?
EDIT
Heh, in the end the problem wasn't there after all, but in how I print it in the test; my C++ is somewhat rusty. I'll leave the question unanswered for a while.
This works for me: replacing
std::cout << *charIter << "\n";
with
std::cout << charIter << "\n";
I imagine that is the behavior I wanted from operator<<.
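My understanding of why the change matters: the template only defines a conversion to const char *, so writing *charIter first converts the iterator to a pointer and then dereferences that pointer, which yields a single char (the first byte of the sequence). Printing the lone lead byte of 'ñ' is not valid UTF-8, so nothing useful shows up. A minimal sketch of the effect, where Wrapper, firstByteOnly and wholeSequence are hypothetical stand-ins and not part of the original code:

```cpp
#include <string>

// Hypothetical stand-in for utf8iterator: only the conversion operator matters here.
struct Wrapper {
    const char *text;
    operator const char *( ) const { return text; }
};

// '*w' converts 'w' to const char * and then dereferences the pointer,
// so the result is one char: the first byte only.
char firstByteOnly( const Wrapper &w ) { return *w; }

// Using the conversion without dereferencing keeps the whole
// null-terminated byte sequence.
std::string wholeSequence( const Wrapper &w ) {
    return std::string( static_cast< const char * >( w ) );
}
```

With a Wrapper holding the two bytes of 'ñ', firstByteOnly returns just 0xC3, while wholeSequence returns both bytes, which is what operator<< receives when the iterator is streamed without the asterisk.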
If you want to correctly interpret UTF-8 characters, you can use wstring instead of trying to reinvent the wheel:
To check that it works: