Asked: 2020-03-21 02:00:46 +0800 CST

Error extracting character from UTF-8 string

NOTE

Although it doesn't affect the template code (utf8iterator.hpp), main.cpp depends on the encoding used when saving it in order to work correctly. On my system I work with UTF-8, and main.cpp is saved in that same encoding.

Thanks to user @asdasdasd for letting me know with his comments.

END OF NOTE
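
One way to make the test string independent of the encoding in which main.cpp is saved is to spell out the UTF-8 bytes with escape sequences. This is only a minimal sketch; the function name testString is hypothetical and not part of the original code. It relies on the fact that ñ (U+00F1) is encoded in UTF-8 as the two bytes 0xC3 0xB1.

```cpp
#include <cstring>

// Hypothetical helper, not part of the original code: returns the test
// string "abcdeññ" with each 'ñ' (U+00F1) written as its UTF-8 bytes
// 0xC3 0xB1, so the result does not depend on how this file is saved.
const char *testString( ) {
  return "abcde\xC3\xB1\xC3\xB1";
}
```

With this literal, ::std::strlen( testString( ) ) is 9: five single-byte letters plus two two-byte characters.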

After reading this question: Why doesn't cout show accented vowels or “ñ” with gcc 4.9.4?, I felt the uncontrollable urge to iterate over the individual characters within a ::std::string, or within a const char[].

After doing some research on the UTF-8 Wikipedia page, I coded this simple template:

// utf8iterator.hpp

#ifndef UTF8ITERATOR_HPP
#define UTF8ITERATOR_HPP

#include <cstddef>

template< typename T > struct utf8iterator {
  //static constexpr char ReplacementCharacter[4] { '\xEF', '\xBF', '\xBD', '\x00' };

  T ptr;
  ::size_t size; // Size of the character, in bytes. == 0 -> ptr has changed.
                 // Its only purpose is to avoid unnecessary writes.
  char bytes[5]; // A UTF-8 character is at most 4 bytes. Leave room for the trailing 0.

  utf8iterator( const T &p ) :
    ptr( p ),
    size( 0 )
  {
    bytes[4] = 0; // Done only once. It is never overwritten.
  }
  utf8iterator &operator=( const T &iter ) {
    ptr = iter;
    size = 0;
    return *this;
    // 'bytes[4] = 0' was already done in the constructor.
  }

  bool operator==( const utf8iterator< T > &other ) const noexcept { return ptr == other.ptr; }
  bool operator!=( const utf8iterator< T > &other ) const noexcept { return ptr != other.ptr; }

  ::size_t calculateSize( ) const {
    if( ( *ptr & 248 ) == 240 ) { // 11110
      return 4;
    } else if( ( *ptr & 240 ) == 224 ) { // 1110
      return 3;
    } else if( ( *ptr & 224 ) == 192 ) // 110
      return 2;

    return 1;
  }
  utf8iterator &operator++( ) {
    if( size ) {
      ptr += size;
      size = 0; // Changing 'ptr' invalidates 'size'.
    } else
      ptr += calculateSize( ); // 'size' is already invalid.

    return *this;
  }
  utf8iterator operator++( int ) {
    utf8iterator tmp( *this );

    if( size ) {
      ptr += size;
      size = 0; // Changing 'ptr' invalidates 'size'.
    } else
      ptr += calculateSize( ); // 'size' is already invalid.

    return tmp;
  }

  operator const char *( ) {
    // If 'size' is invalid, we have to compute the size of the character, in bytes.
    if( !size ) {
      ::size_t c;
      T iter( ptr );

      size = calculateSize( );

      // Could be optimized by specializing for < const char * > and using ::std::memcpy( ).
      // Copy the number of bytes given by 'size' into the 'bytes' buffer.
      for( c = 0; c != size; ++c ) {
        bytes[c] = *iter;
        ++iter;
      }

      // In the constructor we did 'bytes[4] = 0'. Writes are expensive.
      // Only write the 0 if 'size != 4'.
      if( size != 4 )
        bytes[size] = 0;
    }

    return bytes;
  }
};

#endif
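
The lead-byte rule that calculateSize( ) relies on can be checked in isolation. The following is only a sketch under stated assumptions: the names utf8SequenceLength and utf8CountCharacters are hypothetical and not part of the template above, and the code inspects bit patterns only, without validating overlong or otherwise ill-formed sequences. Unlike calculateSize( ), it works on unsigned char and reports continuation bytes (10xxxxxx) separately instead of treating them as one-byte characters.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the original template): returns the
// length in bytes of the UTF-8 sequence announced by 'lead', or 0 if the
// bit pattern of 'lead' cannot start a sequence (for example, a
// continuation byte 10xxxxxx). Overlong encodings are not detected.
std::size_t utf8SequenceLength( unsigned char lead ) {
  if( lead < 0x80 ) return 1;             // 0xxxxxxx: ASCII
  if( ( lead & 0xE0 ) == 0xC0 ) return 2; // 110xxxxx
  if( ( lead & 0xF0 ) == 0xE0 ) return 3; // 1110xxxx
  if( ( lead & 0xF8 ) == 0xF0 ) return 4; // 11110xxx
  return 0;                               // continuation or unusable byte
}

// Hypothetical helper: counts the characters in a null-terminated UTF-8
// string by jumping from lead byte to lead byte.
std::size_t utf8CountCharacters( const char *s ) {
  std::size_t count = 0;
  while( *s ) {
    std::size_t length = utf8SequenceLength( static_cast< unsigned char >( *s ) );
    if( length == 0 )
      length = 1; // Stray byte: step over it and count it as one character.
    // Advance byte by byte so a truncated sequence cannot run past the terminator.
    for( std::size_t i = 0; i != length && *s; ++i )
      ++s;
    ++count;
  }
  return count;
}
```

For the bytes of "abcdeññ" in UTF-8, utf8CountCharacters returns 7, while the string itself occupies 9 bytes.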

It is accompanied by a small test program:

// main.cpp

#include <iostream>

#include "utf8iterator.hpp"

int main( void ) {
  const char *test = "abcdeññ";

  utf8iterator< const char * > charIter( test );

  while( *charIter ) {
    std::cout << charIter.calculateSize( ) << ": ";
    std::cout << *charIter << "\n";
    ++charIter;
  }

  std::cout << std::endl;

  return 0;
}

All of this compiles correctly with:

g++ -I . -std=c++11 -Wall -pedantic -o test main.cpp

The expected result would be:

1: a
1: b
1: c
1: d
1: e
2: ñ
2: ñ

However, the actual output is:

1: a
1: b
1: c
1: d
1: e
2:
2:

I'm fairly sure the bug is in const char *utf8iterator::operator*( ), but I can't put my finger on it.

Any suggestions?

EDIT

Heh, in the end the problem isn't there after all, but in how I print the character in the test; my C++ is somewhat rusty. I'll leave the question unanswered for a while.

2 Answers

Angel Angel · Answer 1 · 2020-03-21T14:33:01+08:00

This works for me:

int main( void ) {
  const char *test = "abcdeññ";

  utf8iterator< const char * > charIter( test );

  while( *charIter ) {

    std::cout << charIter.calculateSize( )  << ": ";  
    std::cout << charIter << "\n";

    ++charIter;
  }

  return 0;
}

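Change std::cout << *charIter << "\n"; to std::cout << charIter << "\n"; I imagine that is the behavior you wanted. With *charIter, the iterator is first converted to const char * and then dereferenced, so operator<< receives only the first byte of the character. Without the *, operator<< receives the null-terminated buffer and prints the whole character.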
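
The difference between the two statements can be reproduced without the iterator at all. This is a minimal sketch; the function names streamDereferenced and streamPointer are hypothetical, and a ::std::ostringstream stands in for std::cout so that the number of bytes written can be inspected.

```cpp
#include <sstream>
#include <string>

// '*text' has type char, so operator<< writes a single byte:
// for a multi-byte UTF-8 character, only its first byte.
std::string streamDereferenced( const char *text ) {
  std::ostringstream out;
  out << *text;
  return out.str( );
}

// 'text' has type const char *, so operator<< writes every byte
// up to the terminating 0: the whole character.
std::string streamPointer( const char *text ) {
  std::ostringstream out;
  out << text;
  return out.str( );
}
```

For the two-byte sequence "\xC3\xB1" (ñ in UTF-8), streamDereferenced yields a string of 1 byte, which is not a valid character on its own, while streamPointer yields both bytes.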

Jose D. Jurado · Answer 2 · 2020-03-21T02:23:49+08:00

If you want to correctly interpret UTF-8 characters, you can use wstring instead of trying to reinvent the wheel:

// main.cpp

#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main( void ) {
    ios_base::sync_with_stdio(false);
    wcout.imbue(locale("en_US.UTF-8"));

    for (auto const &t : wstring(L"áéíóúññ")) {
        wcout << t;
    }

    wcout << endl;
    return 0;
}

To check that it works:

$ g++ -I . -std=c++11 -Wall -pedantic -o test main.cpp
$ ./test 

áéíóúññ
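
The reason the range-based for loop above visits whole characters is that each of these letters fits in a single wchar_t, whereas in UTF-8 each one takes two bytes. The sketch below illustrates this; the function names wideLength and narrowLength are hypothetical, and the literals use escape sequences so the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// Number of wchar_t elements in the wide string "áéíóúññ".
// Each letter is written as a universal character name (U+00E1, ...).
std::size_t wideLength( ) {
  return std::wstring( L"\u00E1\u00E9\u00ED\u00F3\u00FA\u00F1\u00F1" ).size( );
}

// Number of bytes in the same text encoded as UTF-8.
// Each letter is written as its two UTF-8 bytes (0xC3 0xA1, ...).
std::size_t narrowLength( ) {
  return std::string( "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA\xC3\xB1\xC3\xB1" ).size( );
}
```

Here wideLength( ) is 7, one element per character, while narrowLength( ) is 14.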
