I have a local MySQL DB with utf8 encoding.
Then I have a txt file whose content looks like this, containing Ñ, accented vowels and other special characters:
9950543SÃO MATEUS MAD PT0830
My intention is to read this 50,000-line file record by record and insert each record into the MySQL table.
To do this I have written a Laravel command that reads the file, calls the PostalCode model and tries to insert the data.
public function __construct()
{
    parent::__construct();
    $content = fopen(storage_path("app\fichero_codigos_postales.TXT"), 'r');
    while (!feof($content)) {
        $line = utf8_decode(fgets($content));
        $cp = substr($line, 0, 7);
        $localidad = substr($line, 8, 24);
        $pais = substr($line, 32, 2);
        $param = substr($line, 34, 4);
        try {
            $item = PostalCode::create([
                'postal_code' => $cp,
                'town' => $localidad,
                'country' => $pais,
                'extra_param' => $param
            ]);
        } catch (\Exception $exception) {
            var_dump($exception->getMessage());
            dd($exception);
        }
    }
}
The command runs for me, but in MySQL the accented characters and the Ñ show up as "?".
Start by determining the encoding of the file. On Linux, a tool such as `file` will give you that information. For this specific case the exact tool does not matter; if instead you had to process an uncertain collection of files with variable encodings, you would have to automate that diagnosis.
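If you want to run that diagnosis from PHP itself, a minimal sketch with mb_detect_encoding() in strict mode could look like this. The byte strings here are illustrative samples (both encodings of "Ñ"), not your real file:

```php
<?php
// "\xC3\x91" is "Ñ" encoded in UTF-8; "\xD1" is "Ñ" in ISO-8859-1.
$utf8Bytes   = "\xC3\x91";
$latin1Bytes = "\xD1";

// Strict mode (third argument true) makes the UTF-8 check reliable:
// invalid UTF-8 byte sequences fall through to ISO-8859-1.
echo mb_detect_encoding($utf8Bytes,   ['UTF-8', 'ISO-8859-1'], true); // UTF-8
echo "\n";
echo mb_detect_encoding($latin1Bytes, ['UTF-8', 'ISO-8859-1'], true); // ISO-8859-1
```

In your case you would feed it a chunk of the file (e.g. the first few KB read with file_get_contents()) instead of these sample strings.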
If step 1 shows that the original file is in UTF-8, and the destination table is also in UTF-8, then don't transform the data at all, because otherwise you are converting it to the wrong encoding yourself.
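To see why (a minimal sketch, not your exact command), this is what utf8_decode() does to a UTF-8 "Ñ", and it is exactly why the table ends up with "?":

```php
<?php
// "\xC3\x91" is "Ñ" encoded in UTF-8.
$line = "\xC3\x91";

// If the file is already UTF-8, this call is the bug: it re-encodes
// the text as ISO-8859-1 bytes, which your utf8 table cannot interpret.
$wrong = utf8_decode($line);
echo bin2hex($wrong); // d1   (ISO-8859-1 "Ñ")
echo "\n";

// The fix is simply not converting at all:
$right = $line;
echo bin2hex($right); // c391 (UTF-8 "Ñ")
```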
If instead step 1 indicates that the file is ISO-8859-1, then what you are looking for is the inverse conversion: ISO-8859-1 in, UTF-8 out.
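A sketch of that inverse direction, again on an illustrative byte string rather than your file:

```php
<?php
// "\xD1" is "Ñ" encoded in ISO-8859-1.
$latin1Line = "\xD1";

// The inverse of utf8_decode(): Latin-1 bytes in, UTF-8 bytes out.
// (Note that utf8_encode()/utf8_decode() are deprecated since PHP 8.2.)
$utf8Line = utf8_encode($latin1Line);
echo bin2hex($utf8Line); // c391
```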
But in my experience utf8_decode() and utf8_encode() are the most brittle of PHP's encoding-conversion functions: they only know about ISO-8859-1, so any input with characters outside that range (UTF-8 can use up to 4 bytes per character) will corrupt the output. mb_convert_encoding() is preferable, and even better is iconv().
iconv() is not infallible, but it is very versatile with character sets, and it also accepts modifiers along the lines of "if a character has no equivalent, skip it instead of giving me a �".
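A sketch of iconv() with those modifiers. The //IGNORE and //TRANSLIT suffixes go on the *target* charset; the exact //TRANSLIT substitution depends on the underlying iconv implementation, so I deliberately don't promise a specific output for it:

```php
<?php
// Plain conversion: "España" from ISO-8859-1 ("\xF1" = "ñ") to UTF-8.
$latin1 = "Espa\xF1a";
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo $utf8; // España
echo "\n";

// "€" (U+20AC) does not exist in ISO-8859-1, so a plain conversion
// would fail. //IGNORE drops the unmappable character instead of
// erroring; //TRANSLIT tries to approximate it (implementation-dependent).
$euro = "\xE2\x82\xAC"; // "€" in UTF-8
echo iconv('UTF-8', 'ISO-8859-1//IGNORE', $euro);
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $euro);
```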
There is one last edge case that can account for corrupt characters even when working with a UTF-8 file and a utf8 table, and that is because utf8_general_ci, and every MySQL utf8_xxxxxx_yy collation, is not really UTF-8.
As I said before, UTF-8 can use up to four bytes per character; MySQL's utf8 only stores up to 3. If you find yourself at that depth of the rabbit hole and are already questioning your sanity, the utf8mb4 charset ("mb4" for multibyte-4) is real UTF-8, and it is what is used today, for example, when installing Laravel. (I think WordPress keeps installing the old encoding.)
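In a current Laravel install that looks roughly like this (a fragment of config/database.php, from memory, so double-check it against your version):

```php
// config/database.php (fragment)
'connections' => [
    'mysql' => [
        // ...
        'charset'   => 'utf8mb4',
        'collation' => 'utf8mb4_unicode_ci',
        // ...
    ],
],
```

With utf8mb4 on both the connection and the table, 4-byte characters (emoji, rare CJK characters, etc.) survive the round trip instead of being mangled.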