I have a CSV file referring to suicides by country, which has 12 variables. The structure is as follows:
country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year, gdp_for_year ($) ,gdp_per_capita ($),generation
Albania,1987,Hombre,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
Albania,1987,Hombre,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
Albania,1987,Mujer,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
Albania,1987,Hombre,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,Hombre,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers
Albania,1987,Mujer,75+ years,1,35600,2.81,Albania1987,,2156624900,796,G.I. Generation
Albania,1987,Mujer,35-54 years,6,278800,2.15,Albania1987,,2156624900,796,Silent
Albania,1987,Mujer,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers
Albania,1987,Hombre,55-74 years,1,137500,0.73,Albania1987,,2156624900,796,G.I. Generation
When I run the following awk command to get the number of suicides by age group:
tail -n +2 fichero.csv | awk -F, '{a[$4 ","]+=$5} END{for (i in a) print i, a[i]}' | sort -t, -nr -k2
I get the desired result:
35-54 years, 2452141
55-74 years, 1658443
25-34 years, 1123912
15-24 years, 808542
75+ years, 653118
5-14 years, 52264
However, when I run the same command, this time to get the number of suicides by generation:
tail -n +2 fichero.csv | awk -F, '{a[$12 ","]+=$5} END{for (i in a) print i, a[i]}'
the output I get is:
, 2284498
, 15906ion Z
, 1532804n X
, 510009eration
, 623459ls
, 1781744
Why is the output garbled, with the numeric values written over the category names, when the same code works without any problem for the age groups?
Thanks in advance!
I tried to replicate your example and got the expected result.
Explanation
I think the problem is that your .csv file uses the CRLF (carriage return + line feed) pair as a line terminator, which corresponds to ASCII 13 and 10, typically represented by \r\n.
The Unix machine on which you run awk, however, expects lines to end with just the LF character (\n). awk strips the \n because it treats it as the record terminator, but the generation name, being the last field of the line, still carries a trailing \r that nothing has removed. This is also why the age-group version works: $4 sits in the middle of the line, so it never picks up the \r.
For example, one of the lines says:
(I have written out the trailing \r\n explicitly to make it visible.)
When awk processes it, it removes the \n, since that is what it considers the end of the line. The generation name available in $12 is therefore Boomers\r, with that \r still at the end.
When the output is printed, what should look like "Boomers, 13" is actually sent to the terminal as the string "Boomers\r, 13". The \r character (carriage return) moves the cursor back to the beginning of the line, so the comma and the 13 are written over what was already there. As a result you see ", 13ers".
To verify this hypothesis, I saved the .csv file using CRLF as the line terminator and ran your command again. Sure enough, the output is garbled in the same way as yours.
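If you want to check for the stray \r on your own machine, here is a minimal sketch. The two-row sample and the temporary file are made up for illustration (they reuse the Boomers rows from the question, with CRLF endings added), and od -c is only there to make the control characters visible.

```shell
# Hypothetical two-row sample with CRLF line endings, written to a temp file.
sample=$(mktemp)
printf '%s\r\n' \
  'Albania,1987,Hombre,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers' \
  'Albania,1987,Mujer,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers' \
  > "$sample"

# "Boomers" has 7 characters; a length of 8 means a \r is still attached to $12.
len=$(awk -F, 'NR == 1 { print length($12) }' "$sample")
echo "length of \$12: $len"

# Same aggregation as in the question, piped through od -c so that the
# carriage return shows up as a literal "\r" instead of moving the cursor.
dump=$(awk -F, '{a[$12 ","]+=$5} END{for (i in a) print i, a[i]}' "$sample" | od -c)
printf '%s\n' "$dump"

rm -f "$sample"
```

On your real file, the same od -c trick applied to the first couple of lines (head -n 2 fichero.csv | od -c) should show \r \n at the end of each line if the file really is CRLF.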
Solution
You can convert the .csv to "LF" format with the dos2unix tool before processing it with awk.
You can also tell awk to use the \r\n pair as the line terminator instead of the default \n. It would be like this:
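A minimal sketch of that second option, under some assumptions: a multi-character RS is an extension supported by gawk and mawk among others, not something POSIX guarantees, and the two-row sample file is made up for illustration. The sub() variant is a portable alternative that I am adding here; it strips the trailing \r from each record instead of changing RS.

```shell
# Hypothetical two-row sample with CRLF line endings, written to a temp file.
sample=$(mktemp)
printf '%s\r\n' \
  'Albania,1987,Hombre,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers' \
  'Albania,1987,Mujer,25-34 years,4,257200,1.56,Albania1987,,2156624900,796,Boomers' \
  > "$sample"

# Option A: treat CRLF as the record separator.
# Assumes an awk that accepts a multi-character RS (e.g. gawk or mawk).
by_rs=$(awk -F, -v RS='\r\n' \
  '{a[$12 ","]+=$5} END{for (i in a) print i, a[i]}' "$sample")

# Option B: keep the default RS and remove a trailing \r from each record.
# Modifying $0 makes awk re-split the fields, so $12 is clean afterwards.
by_sub=$(awk -F, \
  '{sub(/\r$/, "")} {a[$12 ","]+=$5} END{for (i in a) print i, a[i]}' "$sample")

printf '%s\n' "$by_rs" "$by_sub"

rm -f "$sample"
```

On the real file you would keep the header skip from the question, for example tail -n +2 fichero.csv | awk -F, -v RS='\r\n' '{a[$12 ","]+=$5} END{for (i in a) print i, a[i]}'.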