I am loading very large CSV files that contain information I do not use. What I currently do is convert the columns to lists and then filter them with conditions.
For example, one column contains "A" and "B", but I only want the rows that have "A", so I filter out the "B"s and do the same for all the other lists that I load from Pandas. I think what I'm doing is very inefficient: loading with Pandas and then converting the columns to lists in order to filter them.
Is there a more efficient way to filter the data? I load the data like this in Pandas:
import pandas as pd
df = pd.read_csv('C:\\Users\\4209456\\Downloads\\TECNICO\\BBDD FD.csv', header=0, sep=';',parse_dates = ['Date'],dayfirst = True)
and then I create lists of some columns:
FD=df["FD Text"]
Fecha=pd.to_datetime(df["Date"],dayfirst = True)
I then filter both lists according to a condition on the list "FD", creating another list "FD2"; within the same condition I use the list "Date" to create a corrected list "Date2". This way I have the lists I need to start my code, "FD2" and "Date2", with the data that survives the filter keeping its original positions.
There are several ways to filter rows in Pandas. The simplest is to create a mask (a boolean array) from a condition on the column in question and then use it to select the rows of the DataFrame. It is essentially the same procedure used in NumPy to filter arrays, and it is known as boolean indexing.
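Applied to the columns from the question, the idea might look like the sketch below. The inline CSV text is a hypothetical stand-in for the real file (its rows and the extra "Other" column are assumptions), and `usecols` is used here only to illustrate skipping columns that are not needed.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'BBDD FD.csv'; the rows and the 'Other' column are assumptions.
csv_text = (
    "Date;FD Text;Other\n"
    "01/02/2021;A;x\n"
    "15/02/2021;B;y\n"
    "03/03/2021;A;z\n"
)

# usecols loads only the columns that are actually used.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    sep=";",
    usecols=["Date", "FD Text"],
    parse_dates=["Date"],
    dayfirst=True,
)

# Boolean mask: True where "FD Text" equals "A".
filtrado = df[df["FD Text"] == "A"]

# Both columns are filtered together and keep their original index labels.
FD2 = filtrado["FD Text"]
Date2 = filtrado["Date"]
print(filtrado)
```

Because the mask is applied to the whole DataFrame, "FD2" and "Date2" stay aligned without converting anything to lists.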
We can create a small example to see it:
We therefore have the following DataFrame:
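A small DataFrame consistent with the mask discussed in this answer could be built as in the following sketch. The values are hypothetical, and the column name "Edad" (age) is an assumption; only "Sexo" and "Fecha" are named in the text.

```python
import pandas as pd

# Hypothetical sample data; 'Edad' is an assumed column name for the age.
df = pd.DataFrame({
    "Sexo": ["M", "F", "M", "F", "M", "M"],
    "Edad": [25, 17, 40, 32, 15, 18],
    "Fecha": pd.to_datetime(
        ["01/03/2021", "15/03/2021", "02/04/2021",
         "20/04/2021", "05/05/2021", "30/05/2021"],
        format="%d/%m/%Y",
    ),
})
print(df)
```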
We can filter on the column Sexo to obtain another DataFrame that contains only the rows corresponding to women simply by doing:
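A minimal sketch of that filter, using hypothetical sample data (the "Edad" column and all values are assumptions):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Sexo": ["M", "F", "M", "F", "M", "M"],
    "Edad": [25, 17, 40, 32, 15, 18],
})

# The comparison builds a boolean Series; indexing with it keeps the True rows.
mujeres = df[df["Sexo"] == "F"]
print(mujeres)
```

The result keeps the original index labels of the selected rows, so their positions in the source data are not lost.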
df['Sexo'] == 'F' simply creates a mask, in this case a Pandas Series containing a single column of booleans, [False, True, False, True, False, False], which is the result of checking whether each value of the column equals 'F'. We can filter with any other iterable of boolean values, for example a NumPy array, a list, a column from another DataFrame, etc.
Another example, keeping only those who are 18 or older:
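The age filter could be sketched as follows, again on hypothetical data with an assumed "Edad" column. The same mask is also passed as a NumPy array and as a plain list to show that any boolean iterable of the right length works.

```python
import pandas as pd

# Hypothetical sample data; 'Edad' is an assumed column name.
df = pd.DataFrame({
    "Sexo": ["M", "F", "M", "F", "M", "M"],
    "Edad": [25, 17, 40, 32, 15, 18],
})

# Boolean Series: True where the age is 18 or more.
mask = df["Edad"] >= 18
adultos = df[mask]

# The same selection using a NumPy array and a plain Python list as the mask.
adultos_np = df[mask.to_numpy()]
adultos_lista = df[mask.tolist()]
print(adultos)
```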
Edit:
You can filter by dates following the same idea. For example, we can keep the rows whose Fecha lies between 30 days ago and the current date. Several conditions can also be combined, for example the previous date condition together with being a woman:
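Both filters might look like the sketch below. The sample dates are hypothetical and are generated relative to today so that the 30-day window always selects the same rows. Note that each condition needs its own parentheses when combined with `&`.

```python
import pandas as pd

hoy = pd.Timestamp.now().normalize()   # today at midnight
hace_30 = hoy - pd.Timedelta(days=30)  # start of the 30-day window

# Hypothetical sample data: dates expressed as "days before today".
df = pd.DataFrame({
    "Sexo": ["M", "F", "M", "F", "M", "M"],
    "Fecha": [hoy - pd.Timedelta(days=d) for d in (90, 5, 10, 45, 1, 60)],
})

# Rows with Fecha in the last 30 days (both ends included).
en_rango = (df["Fecha"] >= hace_30) & (df["Fecha"] <= hoy)
recientes = df[en_rango]

# Same date condition, restricted to women.
mujeres_recientes = df[en_rango & (df["Sexo"] == "F")]
print(recientes)
print(mujeres_recientes)
```

`df["Fecha"].between(hace_30, hoy)` would be an equivalent way to write the date condition.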
The same kind of filtering can be done on the index, or with loc. For example, if our column Fecha were the DataFrame index (a DatetimeIndex), we could do:
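A sketch of index-based filtering, assuming hypothetical sample data with Fecha moved into the index via `set_index` (the specific dates and the "Edad" column are assumptions):

```python
import pandas as pd

# Hypothetical sample data with Fecha as a sorted DatetimeIndex.
df = pd.DataFrame({
    "Sexo": ["M", "F", "M", "F", "M", "M"],
    "Edad": [25, 17, 40, 32, 15, 18],
    "Fecha": pd.to_datetime(
        ["01/03/2021", "15/03/2021", "02/04/2021",
         "20/04/2021", "05/05/2021", "30/05/2021"],
        format="%d/%m/%Y",
    ),
}).set_index("Fecha")

# Label-based slice on the DatetimeIndex; both end points are included.
abril = df.loc["2021-04-01":"2021-04-30"]

# Boolean mask built from the index.
desde_abril = df[df.index >= pd.Timestamp("2021-04-01")]

# Index condition combined with a column condition inside loc.
mujeres_desde_abril = df.loc[
    (df.index >= pd.Timestamp("2021-04-01")) & (df["Sexo"] == "F")
]
print(abril)
```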