While following several Pandas tutorials I ran into this issue: when reading CSV files I sometimes get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 17: invalid continuation byte
In all the videos they suggest opening the file with Sublime and then saving it as UTF-8.
Now, on the other hand, I found the following approach: open the file, read it, and apply chardet's detect function so that it works out what encoding the file uses.
import chardet

# Read the raw bytes and let chardet guess the file's encoding
with open('data/atp-tour/data.csv', 'rb') as f:
    result = chardet.detect(f.read())

result['encoding']
Output: 'Windows-1252'
Once this is done, I can pass the detected encoding to read_csv and the file loads normally:
import pandas as pd

datos = pd.read_csv('data/atp-tour/data.csv', encoding=result['encoding'])
Which is the better way to handle the encoding: the quick fix of re-saving with Sublime, or detecting the encoding and passing it explicitly?
On the other hand, when reading the file with the detected encoding, I sometimes get a warning that stops appearing when I add the parameter low_memory=False:
DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.
What does this mean? From what I've investigated, the very same line sometimes produces the warning and sometimes doesn't when I run it.
FILE ENCODING
Opening the file with Sublime and re-saving it as UTF-8 is the simplest option. Alternatively, you can pass the encoding directly to read_csv; if that doesn't work, you can import chardet and detect the encoding in the same script, so everything stays in one place. This last option takes longer the more records the file has, because chardet has to scan the data.
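A minimal sketch of the in-script approach (the file path comes from the question; sampling only the first 100 KB is my own assumption, to keep detection fast on large files):

import chardet
import pandas as pd

# Detect the encoding from a sample of the raw bytes rather than the
# whole file; a reasonably large sample usually gives the same answer
# and avoids the slowdown on big files.
with open('data/atp-tour/data.csv', 'rb') as f:
    result = chardet.detect(f.read(100_000))

datos = pd.read_csv('data/atp-tour/data.csv', encoding=result['encoding'])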
Although that was my question (I'm just starting out and I'm an intern), they have since handed down the requirement that files be delivered in UTF-8, so I no longer have that problem; but for practice you can try what I describe above.
LOW MEMORY
I am leaving the translated information below; see the source provided by @abulafia.
The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently (see source).
The reason you get this low_memory warning is that guessing the dtype for each column is very memory intensive. Pandas tries to determine what dtype to set by analysing the data in each column.
Pandas can only determine what dtype a column should have once the entire file has been read. This means nothing can really be parsed before the whole file is read, unless you risk having to change the dtype of that column when you read the last value.
Consider the example of a file that has a column named user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know that it is only numbers, it will probably keep it as the original strings until it has read the entire file.
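To see that fallback in action, here is a small sketch with invented data: a column that mixes numbers and text ends up with the generic object dtype, i.e. plain strings.

import io
import pandas as pd

# Tiny in-memory CSV whose user_id column mixes numbers and text.
csv_data = io.StringIO("user_id,score\n1,10\n2,20\nfoobar,30\n")

df = pd.read_csv(csv_data)
print(df['user_id'].dtype)  # object: pandas kept the values as strings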
Specifying dtypes (which should always be done) by adding the dtype argument to the pd.read_csv() call lets pandas know, from the moment it starts reading the file, that this column contains only integers.
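The code for that dtype argument did not survive in the quote; judging from the surrounding text, it was presumably along these lines (the file name here is a placeholder):

import pandas as pd

# Declare the column's type up front so pandas never has to guess it
# (the user_id column name comes from the example above).
df = pd.read_csv('data.csv', dtype={'user_id': int})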
It's also worth noting that if the last line of the file had "foobar" written in the user_id column, loading would crash if the above dtype was specified.
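As a quick illustration with invented data, forcing the dtype makes pandas fail as soon as it meets the bad value instead of silently falling back to strings:

import io
import pandas as pd

# 'foobar' cannot be converted to the declared integer dtype,
# so read_csv raises a ValueError instead of loading the file.
csv_data = io.StringIO('user_id\n1\n2\nfoobar\n')
df = pd.read_csv(csv_data, dtype={'user_id': int})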