I have a CSV file with the following structure:
country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year, gdp_for_year ($) ,gdp_per_capita ($),generation
Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers
Albania,1987,female,75+ years,1,35600,2.81,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,female,35-54 years,6,278800,2.15,Albania1987,,2156624900,796,Silent
Albania,1987,female,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers
Albania,1987,male,55-74 years,1,137500,0.73,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,female,5-14 years,0,311000,0,Albania1987,,2156624900,796,Generation X
Albania,1987,female,55-74 years,0,144600,0,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,male,5-14 years,0,338200,0,Albania1987,,2156624900,796,Generation X
Albania,1988,female,75+ years,2,36400,5.49,Albania1988,,2126000000,769,G.I. Generation
Albania,1988,male,15-24 years,17,319200,5.33,Albania1988,,2126000000,769,Generation X
I am trying to script a for loop to get the number of suicides by country, year and gender. For now, the loop I have done is the following:
#!/bin/bash
# First, we save the distinct countries, years and sexes into separate CSV files.
tail -n +2 suicidios_final.csv | cut -d "," -f1 | sort | uniq > country.csv
tail -n +2 suicidios_final.csv | cut -d "," -f2 | sort | uniq > year.csv
tail -n +2 suicidios_final.csv | cut -d "," -f3 | sort | uniq > sex.csv
# We create arrays from the files above with the mapfile command:
mapfile -t countries < country.csv
mapfile -t years < year.csv
mapfile -t sex < sex.csv
# Finally, we iterate with a for loop over the countries, a nested for loop over the years and a third one over the sex.
# We also add a different color for each variable, to tell them apart easily:
for i in "${countries[@]}"; do
    echo -e "\e[36m== $i ==\e[0m"
    for j in "${years[@]}"; do
        echo -e " \e[33m$j\e[0m"
        for k in "${sex[@]}"; do
            echo -e " \e[31m$k\e[0m"
            tail -n +2 suicidios_final.csv | grep -F "$i" | grep -F "$j" | grep -F "$k" > bucle.csv
            suicidios=$(cat bucle.csv | cut -d "," -f5 | paste -s -d "+" | bc)
            echo -e " \e[34mNúmero de suicidios: $suicidios\e[0m"
        done
    done
done
However, when I run the script, the output is not what I expect: the loop sums the suicides correctly for the "female" category of the sex variable, but for the "male" category it adds up the rows regardless of whether they are "male" or "female":
./script.sh
== Albania ==
1985
female
Número de suicidios: 15
male
Número de suicidios: 50
1986
female
Número de suicidios:
male
Número de suicidios:
1987
female
Número de suicidios: 25
male
Número de suicidios: 73
1988
female
Número de suicidios: 22
male
Número de suicidios: 63
1989
female
Número de suicidios: 15
male
Número de suicidios: 68
1990
female
Número de suicidios:
male
Número de suicidios:
1991
female
Número de suicidios:
male
Número de suicidios:
1992
female
Número de suicidios: 14
male
Número de suicidios: 47
1993
female
Número de suicidios: 27
male
Número de suicidios: 73
1994
female
Número de suicidios: 15
male
Número de suicidios: 50
1995
female
Número de suicidios: 34
male
Número de suicidios: 88
1996
female
Número de suicidios: 39
male
Número de suicidios: 89
......
Actually, the very first result is already wrong, because I have no data for Albania in the year 1985. I have checked the results for other countries and years and do not see this kind of error there, so I assume it comes from the data in the file itself.
Regardless, the error I do not understand is the one with the sex variable: for the "female" category the number of suicides is added up correctly, but for "male" I get the sum of both the "male" and the "female" rows. I know my question is a bit difficult, because it is a triple nested for loop and the error is not easy to spot at first glance, but if someone knows what could be going on and can tell me, I would really appreciate it.
I restate your problem in two different ways: the first using GNU Datamash, and the second using awk. The example file I worked from is called suicidios_final.csv.
With datamash
In a single line, I ask datamash to group by fields 1, 2 and 3, and then sum field 5, using the "," character as the field separator. The result is one line per country, year and sex, followed by the total of field 5 for that group.
If you don't have datamash, install it with sudo apt install datamash.
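The datamash call described above might look like the sketch below. The function name sum_by_group and the inline sample (a few rows from the question, trimmed to five columns) are assumptions for illustration, not part of the original answer. tail -n +2 drops the header row, and -s asks datamash to sort the input so that each group is contiguous.

```shell
#!/bin/bash
# Sketch only: sum_by_group is a hypothetical helper name.
# Reads a CSV with a header row on stdin and prints country,year,sex,total.
sum_by_group() {
    # tail -n +2 : skip the header row
    # -t,        : use "," as the field separator
    # -s         : sort the input so each group is contiguous
    # -g 1,2,3   : group by country, year and sex
    # sum 5      : add up suicides_no within each group
    tail -n +2 | datamash -t, -s -g 1,2,3 sum 5
}

# A few rows from the question, trimmed to the first five columns.
sample='country,year,sex,age,suicides_no
Albania,1987,male,15-24 years,21
Albania,1987,female,15-24 years,14
Albania,1987,male,35-54 years,16
Albania,1987,female,35-54 years,6'

if command -v datamash >/dev/null 2>&1; then
    printf '%s\n' "$sample" | sum_by_group
else
    echo "datamash is not installed" >&2
fi
```

On the real file the call would be sum_by_group < suicidios_final.csv.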
Using awk
Here I just adapted the script from the official documentation that displays ("walks" through) the contents of a multidimensional array.
Then I used the respective fields as the keys of the array. That way awk does all the work: since an array cannot contain repeated keys, the grouping happens automatically due to the nature of the array keys, and the sum only needs the += operator on the fifth field.
The script goes in a file called main.awk and is then run from the terminal.
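The grouping idea can be sketched as follows. This is not the original main.awk: instead of the multidimensional array and the documentation's walk function, it uses a single composite key (country, year and sex joined by the field separator), which is an assumption made here so that it runs in any POSIX awk. The helper name sum_with_awk and the inline sample are also illustrative.

```shell
#!/bin/bash
# Sketch only: sum_with_awk is a hypothetical helper name.
# Reads a CSV with a header row on stdin and prints country,year,sex,total.
sum_with_awk() {
    awk -F, '
        NR > 1 {                          # skip the header row
            total[$1 FS $2 FS $3] += $5   # key = country,year,sex
        }
        END {
            for (key in total)            # iteration order is unspecified
                print key FS total[key]
        }
    ' | LC_ALL=C sort                     # sort for a stable, readable order
}

# A few rows from the question, trimmed to the first five columns.
sample='country,year,sex,age,suicides_no
Albania,1987,male,15-24 years,21
Albania,1987,female,15-24 years,14
Albania,1987,male,75+ years,1
Albania,1988,female,75+ years,2
Albania,1988,male,15-24 years,17'

printf '%s\n' "$sample" | sum_with_awk
```

Because each distinct key exists only once in the array, rows with the same country, year and sex accumulate into the same element. No temporary files or nested loops are needed.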
Note:
Chaining a large mix of GNU/Linux tools can be tempting in the first few years, but it is highly inefficient (and highly unattractive): each program opens file descriptors and then closes them, and the pipeline can create temporary files that you later have to delete. Keep in mind that Bash should not be used lightly as if it were a general-purpose programming language; it is rather a great tool for orchestrating programs.
For this reason, it is better to use only utilities dedicated to the task at hand, or powerful languages such as sed, awk, python, perl, etc.
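As for why the "male" totals in the question include the "female" rows: grep -F "male" selects every line that contains the substring male, and "female" contains it. For example, the question's 73 for male in 1987 is exactly the 48 male cases plus the 25 female ones. The same caveat applies to the year, since grep -F "$j" matches those digits anywhere in the line, which may explain totals for years that have no data. The sketch below, using a few rows from the question trimmed to five columns, contrasts the substring match with an exact comparison on the third field.

```shell
#!/bin/bash
# A few rows from the question, trimmed to the first five columns.
sample='Albania,1987,male,15-24 years,21
Albania,1987,female,15-24 years,14
Albania,1987,male,35-54 years,16
Albania,1987,female,35-54 years,6'

# Substring match: "female" contains "male", so every row is counted.
substring_count=$(printf '%s\n' "$sample" | grep -c -F "male")

# Exact comparison on the third field: only the male rows are counted.
exact_count=$(printf '%s\n' "$sample" |
    awk -F, '$3 == "male" { n++ } END { print n + 0 }')

echo "rows matched by grep -F male: $substring_count"
echo "rows where field 3 is male:   $exact_count"
```

Comparing whole fields, as both approaches above do through their grouping keys, avoids this kind of accidental match.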