I have to do a report on the schemas and tables that Amazon Redshift
considers to have bad statistics. There is a process that runs every weekend and takes care of applying the corresponding operations, but I need to export the names of those schemas and tables to a .csv file.
The thing is, this process generates a report, and the lines of it that interest me look like this:
-- 2019-04-28 07:05:06.538818 [73589] [73589] Running 200 out of 214 commands: analyze schema_owner."nombre_tabla_mala"
I am collecting the lines that meet this pattern in the following way:
while read linea
do
SCHEMA="schema_owner"
FILTRO="commands: analyze $SCHEMA"
if [[ $linea =~ $FILTRO ]]
then
...missing code...
fi
done < /ruta_del_fichero_log
The problem is that, obviously, I capture the entire line, and I only need to keep the part schema_owner."nombre_tabla_mala"
How can I discard the rest of the string?
Here are the first twenty lines of the log file in question:
-- 2019-04-28 05:54:53.830738 [73589] [73589] Running 1 out of 1 commands: set wlm_query_slot_count = 4
-- 2019-04-28 05:54:53.833469 [73589] Success.
-- 2019-04-28 05:54:53.833531 [73589] [73589] Running 1 out of 1 commands: set statement_timeout = '36000000'
-- 2019-04-28 05:54:53.836162 [73589] Success.
-- 2019-04-28 05:54:53.836190 [73589] [73589] Running 1 out of 1 commands: set application_name to 'AnalyzeVacuumUtility-v.9.1.6'
-- 2019-04-28 05:54:53.838700 [73589] Success.
-- 2019-04-28 05:54:53.838788 [73589] Extracting Candidate Tables for Vacuum...
-- 2019-04-28 05:55:57.850685 [73589] Found 0 Tables requiring Vacuum and flagged by alert
-- 2019-04-28 05:55:57.850795 [73589] Extracting Candidate Tables for Vacuum ...
-- 2019-04-28 05:56:34.908067 [73589] Found 107 Tables requiring Vacuum due to stale statistics
-- 2019-04-28 05:56:34.908263 [73589] [73589] Running 1 out of 214 commands: vacuum FULL schema_owner."t_ed_p" ; /* Size : 120 MB, Unsorted_pct : N/A */ ;
-- 2019-04-28 05:56:47.588342 [73589] Success.
-- 2019-04-28 05:56:47.588401 [73589] [73589] Running 2 out of 214 commands: analyze schema_owner."t_ed_p"
-- 2019-04-28 05:56:50.363655 [73589] Success.
-- 2019-04-28 05:56:50.363711 [73589] [73589] Running 3 out of 214 commands: vacuum FULL schema_owner."t_ed_p_estados" ; /* Size : 120 MB, Unsorted_pct : N/A */ ;
-- 2019-04-28 05:57:03.430064 [73589] Success.
-- 2019-04-28 05:57:03.430124 [73589] [73589] Running 4 out of 214 commands: analyze schema_owner."t_ed_p_estados"
-- 2019-04-28 05:57:06.024933 [73589] Success.
-- 2019-04-28 05:57:06.025023 [73589] [73589] Running 5 out of 214 commands: vacuum FULL schema_owner."t_ed_p_tps_actividad" ; /* Size : 120 MB, Unsorted_pct : N/A */ ;
-- 2019-04-28 05:57:06.024933 [73589] Success.
In the end, what I need to obtain is the schema and the table. That is, from the ones that appear in this example, I would need to send the following to the .csv file:
schema_owner."t_ed_p"
schema_owner."t_ed_p_estados"
schema_owner."t_ed_p_tps_actividad"
It looks like the task is to take the string schema_owner. followed by "something" enclosed in double quotes. So let's leave the job to grep together with -o, so that it shows only the match, using the regular expression schema_owner\."[^"]*": the text schema_owner, followed by a period (escaped because otherwise it would match any character), followed by a string enclosed in double quotes.
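Something like this, using the log path from your script (the exact invocation is just a sketch):
grep -o 'schema_owner\."[^"]*"' /ruta_del_fichero_log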
I notice that there are duplicate entries (each table shows up in both a vacuum and an analyze command). If you want to remove them, you can pipe the result to sort -u so that only one entry of each is shown:
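For example (the .csv file name here is only illustrative):
grep -o 'schema_owner\."[^"]*"' /ruta_del_fichero_log | sort -u > tablas_malas.csv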
With the -o option and this regular expression you can output only the part of your interest: -o returns just the portion that matches the regular expression. Here I assume that the table name is enclosed in double quotes and that the schema contains only alphabetic characters and underscores (_). For example, taking the line you provide:
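The exact expression is not reproduced above, but something along these lines fits that description:
echo '-- 2019-04-28 07:05:06.538818 [73589] [73589] Running 200 out of 214 commands: analyze schema_owner."nombre_tabla_mala"' | grep -oE '[a-z_]+\."[^"]+"'
which prints:
schema_owner."nombre_tabla_mala"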
And if you have many repeated records you can use sort | uniq. You could also count them and order them from the most frequent to the least frequent with uniq -c | sort -rn:
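For instance, applied to the whole log (again a sketch, reusing the regular expression assumed above):
grep -oE '[a-z_]+\."[^"]+"' /ruta_del_fichero_log | sort | uniq -c | sort -rn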
A little late, but with different options.
With awk:
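The code block itself did not survive here; a sketch consistent with the explanation below (it relies on GNU awk, whose match() accepts an array as third argument) would be:
# store each match as a key of un; keys are unique, so duplicates collapse
gawk 'match($0, /schema_owner\."[^"]*"/, gr) { un[gr[0]]++ } END { for (t in un) print t }' /ruta_del_fichero_log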
What I do here is use match to test each line ($0) against the regular expression already mentioned, assigning the text found to the array gr. Then, for each line that comes in and each match found, I fill the array un using those matches as keys and increment their values by 1. This step is just to take advantage of the unique nature of array keys; the values don't matter to me, so by filling the array with any value its keys will always be distinct. Finally, at the end of the script, I iterate over this array and print its keys.
Variation of the previous answers
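The idea is to reuse grep for the extraction and let awk handle the deduplication; a possible form:
# grep extracts the matches; awk prints each one only the first time it appears
grep -o 'schema_owner\."[^"]*"' /ruta_del_fichero_log | awk '!un[$0]++'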
Here I use the usual regular expression mentioned in the other answers, with the same use of grep; the difference is that, to show only the unique lines, I add in awk the condition of printing only when the value stored for that key is NOT greater than 0, that is, the first time the line is seen.
With perl:
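Not the original code, but a one-liner along the lines described below:
# every match ($&) becomes a key of the hash %un; at the end the unique keys are printed
perl -ne '$un{$&}++ if /schema_owner\."[^"]*"/; END { print "$_\n" for keys %un }' /ruta_del_fichero_log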
This option is similar: I look for the desired pattern and then assign everything that is matched (available in $&) as a key of the hash %un, which by its nature has unique keys, so there will be no duplicates. At the end of the script, I print the keys of %un.
In all cases the result is something of the form:
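schema_owner."t_ed_p"
schema_owner."t_ed_p_estados"
schema_owner."t_ed_p_tps_actividad"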
The advantage of using only one program (in the case of awk or perl) is that it is much faster and uses less processing. If there were a lot of lines (hundreds of thousands or millions of log entries), they would all go through grep, then the matches would go through sort, then those sorted lines would go through uniq, and so on. And each of these programs creates processes, opens file descriptors, closes file descriptors, and so on.