How can you read a string of text, reading its characters three at a time?
Specifically, I am looking to read the trinucleotides of a DNA sequence, and be able to count how many there are.
Given the string:
AAGACAGAGTAGACAAGTACAGTAGACAGATGACGGGTAGCAT
I would like to split it into
AAG
ACA
GAG
...
But the problem is that I don't have a delimiter to use with the cut command.
How could I fix it?
The programming language I use is Bash.
I have tried the following:
#!/bin/bash
a=$(cat $1|wc -m)
b=$(cat $1)
for ((i=0;i<$a;i=i+3));do
echo ${b:i:(i+3)}
done
But it prints pieces that run from three characters up to the end of the entire string, not three at a time. The argument $1 is a file containing the DNA sequence.
The third parameter of the expansion is the length of the substring, not its end position. You can check this in the help with man bash.
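To see the difference between a length and an end position, here is a quick check in Bash with a short hypothetical string (the variable b mirrors the name used in the question):

```bash
b='AAGACAGAGT'      # short hypothetical string

echo "${b:3:3}"     # offset 3, length 3 -> ACA
echo "${b:3:6}"     # offset 3, length 6 -> ACAGAG (what i+3 asks for when i=3)
```

With i+3 as the third field, the requested length grows on every iteration, which is why the pieces get longer and longer.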
So your code would be:
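The original listing is not preserved in this copy. A minimal sketch of the corrected script, assuming the only changes needed are a fixed length of 3 in the expansion plus the $(<"$1") and ${#b} idioms mentioned in this answer (the function name split_triplets is hypothetical, added only so the loop can be reused):

```bash
#!/bin/bash
# Hypothetical helper: print the contents of a file three characters at a time.
split_triplets() {
    local b a i
    b=$(<"$1")     # load the file into a variable (trailing newlines are dropped)
    a=${#b}        # length of the string
    for ((i = 0; i < a; i += 3)); do
        # ${b:offset:length}: the third field is a length, so it stays at 3
        echo "${b:i:3}"
    done
}

# Run only when a file argument is given, e.g. ./script.sh sequence.txt
if [ "$#" -gt 0 ]; then
    split_triplets "$1"
fi
```

If the number of characters is not a multiple of three, the last line printed is shorter than three characters.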
If what you want is to count the trinucleotides, it might be enough to calculate ($a + 2) / 3.
I want to highlight the use of $(<"$1") to load the contents of a file into a variable (in quotes to support file names containing white space) and ${#b} to get the length of a variable.
If you are only interested in the count and never need to display the content, then it is better to do:
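The command that followed is not preserved here. A sketch of the idea, assuming the byte count is taken with wc -c and stored in a variable a as in the question (count_triplets is a hypothetical name):

```bash
# Hypothetical helper: number of blocks of three in a file, rounding up
# so that a trailing incomplete block is also counted.
# Usage: count_triplets sequence.txt
count_triplets() {
    local a
    a=$(wc -c < "$1")          # byte count only, no file name in the output
    echo $(( (a + 2) / 3 ))    # integer ceiling of a / 3
}
```

Adding 2 before the integer division rounds the result up, so 10 bytes give 4 blocks and 6 bytes give 2.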
Or, in a reduced form, $((( $(wc -c < "$1") + 2) / 3)).
Note the use of wc -c < [file] to prevent the file name from appearing along with the count.
Note that wc -m is much slower than wc -c on very large files, because wc -c does not need to read the content to count multibyte characters. The letter ñ is one character, but it takes up two bytes in a UTF-8 encoded file.
Also, keep in mind that both (-m and -c) would count all line feeds (\n), if any.
Use grep to split the string into blocks of three. Since . matches any character, the regular expression ... matches three characters. With the -o flag, grep displays each match on a separate line, so you can then do whatever you want with it: count lines, add...
You can even do the following, using process substitution to pretend to read a file line by line:
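The loop itself is missing from this copy. A minimal sketch, assuming the sequence is stored in a file whose path is passed as the first argument (each_triplet is a hypothetical name, and the echo is only a placeholder for the real work):

```bash
# Hypothetical helper: read the output of grep -o one match per iteration,
# as if it were a file with one trinucleotide per line.
# Usage: each_triplet sequence.txt
each_triplet() {
    local trio
    while IFS= read -r trio; do
        echo "trinucleotide: $trio"   # placeholder for the real work
    done < <(grep -o '...' "$1")
}
```

Because the loop reads from a process substitution rather than from a pipe, it runs in the current shell, so variables modified inside it remain visible afterwards.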
That way you can work with each trinucleotide in each iteration:
In your case:
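As an illustration with the sequence from the question, held in a shell variable instead of a file (the names dna, trio and count are hypothetical), a sketch that counts the complete trinucleotides:

```bash
dna='AAGACAGAGTAGACAAGTACAGTAGACAGATGACGGGTAGCAT'
count=0

while IFS= read -r trio; do
    count=$((count + 1))          # one iteration per complete trinucleotide
done < <(grep -o '...' <<< "$dna")

echo "$count"                     # prints 14
```

The sequence has 43 characters, and grep -o only reports complete matches, so the single character left at the end is skipped.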
You can use sed or awk.
With sed:
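The sed command is not preserved in this copy. A sketch that fits the description, assuming GNU sed (the -E flag and the \n escape in the replacement are the assumptions here; other sed implementations may treat \n differently) and a hypothetical function name split_with_sed:

```bash
# Hypothetical helper: insert a newline after every group of three characters.
# Usage: split_with_sed sequence.txt
split_with_sed() {
    sed -E 's/(.{3})/\1\n/g' "$1"
}
```

When the line length is an exact multiple of three, this leaves an empty line at the end, because a newline is also added after the last group.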
Here the regular expression (.{3}) matches each group of three characters, and the replacement \1\n adds a newline after that group.
With awk:
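The awk program is also missing here. A sketch under the assumption that the awk in use splits a record into single characters when FS is empty (gawk and mawk do this, but POSIX leaves it unspecified); split_with_awk is a hypothetical name:

```bash
# Hypothetical helper: print each line of a file three characters per line.
# Usage: split_with_awk sequence.txt
split_with_awk() {
    awk 'BEGIN { FS = "" }                      # empty separator: one field per character
         {
             for (i = 1; i <= NF; i++) {
                 printf "%s", $i
                 if (i % 3 == 0) printf "\n"    # line break after every third character
             }
             if (NF % 3 != 0) printf "\n"       # close a trailing incomplete block
         }' "$1"
}
```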
Here the field separator is redefined to the empty string so that the loop iterates over individual characters. During the loop, a line break is printed whenever the character number is a multiple of 3 (modulo 3 equals zero).
In your case, using the sequence of nitrogenous bases that you showed.
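The output shown at this point is not preserved. A sketch that applies the sed expression described above to that sequence, again assuming GNU sed (the variable names dna and triplets are hypothetical):

```bash
dna='AAGACAGAGTAGACAAGTACAGTAGACAGATGACGGGTAGCAT'

# 43 characters: 14 complete trinucleotides, then a final line with the leftover T.
triplets=$(printf '%s\n' "$dna" | sed -E 's/(.{3})/\1\n/g')
printf '%s\n' "$triplets"
```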
It seems that a thymine was "left over" at the end of the sequence you entered.