I've been stuck on questions that stem from this problem:
Using this code:
import random
import pandas as pd
from datetime import datetime

inicio = datetime(2019, 1, 1)
final = datetime(2019, 3, 21)

# 10,000 random dates between inicio and final, formatted as YYYY-MM-DD
datos = []
for i in range(0, 10000):
    fechaRandom = inicio + (final - inicio) * random.random()
    datos.append(fechaRandom.strftime('%Y-%m-%d'))
df = pd.DataFrame(datos)
df.rename(columns={0: "Fecha"}, inplace=True)

# Process names: Proceso1 ... Proceso10
procesos = []
for a in range(1, 11):
    procesos.append('Proceso' + str(a))

# Repeat each process name 1,000 times
total = 0
proceso = []
for i in range(0, 10):
    for j in range(0, 1000):
        proceso.append(procesos[total])
    total += 1
datosProceso = pd.DataFrame(proceso)
datosProceso.rename(index=str, columns={0: "Proceso"}, inplace=True)

# Join the two frames row by row and drop the leftover index columns
result = pd.merge(datosProceso.reset_index(),
                  df.reset_index(),
                  left_index=True,
                  right_index=True)
result = result.drop(columns=['index_x', 'index_y'])
I get the following:
A DataFrame with 10,000 rows and two columns, Proceso and Fecha (10,000 random dates split across 10 processes of 1,000 records each).
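For anyone who wants to reproduce that frame more compactly, here is a minimal sketch that builds an equivalent `result` without the loops or the merge. It is not the code from the question: the NumPy generator, the fixed seed, and the whole-second resolution of the random offsets are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Fixed seed: an assumption, only so the example is reproducible
rng = np.random.default_rng(0)

inicio = pd.Timestamp(2019, 1, 1)
final = pd.Timestamp(2019, 3, 21)

# Random offsets, in whole seconds, inside [inicio, final)
segundos = int((final - inicio).total_seconds())
offsets = pd.to_timedelta(rng.integers(0, segundos, size=10_000), unit="s")

result = pd.DataFrame({
    # Proceso1 ... Proceso10, each repeated 1,000 times in order
    "Proceso": np.repeat([f"Proceso{a}" for a in range(1, 11)], 1000),
    # Same YYYY-MM-DD string format as the original code
    "Fecha": (inicio + offsets).strftime("%Y-%m-%d"),
})
```

The column names and the 10 x 1,000 layout match the description above; only the way the rows are generated differs.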
What I need now is to remove the duplicates and keep one row per process and date, with an extra column holding the number of times that date was repeated for that process. In short, it is this:
(result
 .groupby(["Proceso", "Fecha"])
 ["Fecha"].count())
The problem is that I need the result in a new DataFrame with this format:
Fixed:
I was overcomplicating things; it was just this. I'm leaving the question up because it may be useful to someone.
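For readers landing here, a minimal sketch of one common way to get the counts into a flat DataFrame. It is an illustration rather than necessarily the exact fix used: the small sample frame stands in for `result`, and the names `conteo` and `Total` are hypothetical.

```python
import pandas as pd

# Small stand-in for `result` (assumed sample data)
result = pd.DataFrame({
    "Proceso": ["Proceso1", "Proceso1", "Proceso1", "Proceso2", "Proceso2"],
    "Fecha": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-01", "2019-01-01"],
})

# Count rows per (Proceso, Fecha) pair, then turn the grouped index
# back into ordinary columns; "Total" is a hypothetical column name
conteo = (
    result
    .groupby(["Proceso", "Fecha"])
    .size()
    .reset_index(name="Total")
)
```

Giving the count column its own name matters: calling `reset_index()` directly on the `["Fecha"].count()` result fails, because the counts Series is also named `Fecha` and collides with the index level of the same name.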
Another possibility, in case you need to convert the result to a list:
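As a sketch of what such a conversion could look like (an assumption, not necessarily the code the answer had in mind), the counted frame can be turned into a list of rows or a list of dictionaries. The sample data and the names `conteo`, `Total`, `filas`, and `registros` are hypothetical.

```python
import pandas as pd

# Small stand-in for `result` (assumed sample data)
result = pd.DataFrame({
    "Proceso": ["Proceso1", "Proceso1", "Proceso1", "Proceso2", "Proceso2"],
    "Fecha": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-01", "2019-01-01"],
})

# One row per (Proceso, Fecha) pair with its count
conteo = (
    result
    .groupby(["Proceso", "Fecha"])
    .size()
    .reset_index(name="Total")
)

# List of rows: [Proceso, Fecha, Total]
filas = conteo.to_numpy().tolist()

# List of dictionaries keyed by column name
registros = conteo.to_dict("records")
```

The list of rows is handy for positional access, while the list of dictionaries keeps the column names attached to each value.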