I recently found myself needing to extract all the values between two specified points in a string, in this case everything inside the parentheses "()".
What would be the most optimal or adequate way to do this?
string cadena = string.Empty, resultado = string.Empty;
I have an email with a predefined format, in which I only care about the values that are between ().
Example cadena:
Hola, amigo X, ..........
bla bla bla bla
.......
('A','B','valorX','valorY',N...) // what I want to extract
.......
mas texto...
....
Se despide, atentamente, Pedro...
Looking for different ways to do it, I found that any of the approaches below solves it:
1- Using Split:
resultado = cadena.Split('(', ')')[1];
or
resultado = cadena.Split("()".ToCharArray())[1];
2- With regular expressions (Regex.Match):
resultado = Regex.Match(cadena, @"\(([^)]*)\)").Groups[1].Value;
3- With Substring applying a bit of math:
int posInicial = cadena.LastIndexOf("(") + 1;
int longitud = cadena.IndexOf(")") - posInicial;
resultado = cadena.Substring(posInicial, longitud);
Each of those ways of doing it yields the same result:
#resultado 'A','B','valorX','valorY',N...
Honestly, it's hard for me to understand how regular expressions work. I always see them as a bunch of indecipherable hieroglyphics...
So: What would be the most optimal or appropriate way to do this?
Just do a complexity analysis.
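The most efficient algorithm in terms of memory and speed would be the fourth, the Substring one. (I am counting your two Split variants as the first and second algorithms, Regex as the third, and Substring as the fourth.) Basically, you have to look at the running time and memory consumption of each algorithm.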
In the first algorithm:
The string is traversed in linear time, looking for each of the separator characters passed as parameters to Split, and for each separator it scans up to N, where N is the length of the string. It then has to go through the results and create M temporary variables for the pieces cut at each Split character, and finally build a list of values that is accessed by index in constant time, O(1). As a result you get O((N * M) + 1), where N is the length of the string and M is the number of substrings generated by each Split operation.
The second algorithm:
It is basically the same procedure as the first algorithm, except that it consumes more memory, because it has to create a character array and a temporary variable and iterate over the source string, which in this case is "()".
The third algorithm:
It is a double-edged sword. Its cost depends on the length and complexity of the pattern. It should only be used when the rule is somewhat complex: validating emails, addresses, number formats, mentions and hashtags, and so on. For example, if you did not use Regex to find mentions or hashtags in a string, you would have to write a huge algorithm, plus an Interval Tree, to obtain the indices where each mention or hashtag occurs. On very large inputs you would spend a lot of memory extracting every substring that is a mention or hashtag. Regular expressions should be used as a validator for complex strings, because they save you from writing that huge algorithm. In this case, though, it is clearly the option with the greatest complexity and memory consumption.
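Since the pattern from the question can look cryptic, here is a minimal sketch that spells it out piece by piece. It is the same pattern, only spread over several lines using RegexOptions.IgnorePatternWhitespace; the class name RegexBreakdown and the method FirstGroup are made up for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class RegexBreakdown
{
    // Same pattern as in the question, one piece per line.
    // IgnorePatternWhitespace lets the pattern contain whitespace and
    // # comments without changing what it matches.
    private static readonly Regex Parens = new Regex(@"
        \(        # a literal opening parenthesis
        (         # start of capture group 1
          [^)]*   # zero or more characters that are not a closing parenthesis
        )         # end of capture group 1
        \)        # a literal closing parenthesis
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns the text captured by group 1 for the
    // first match, or an empty string when nothing matches.
    public static string FirstGroup(string input)
    {
        return Parens.Match(input).Groups[1].Value;
    }
}
```

Calling RegexBreakdown.FirstGroup(cadena) should behave like the one-line Regex.Match call in the question: group 1 holds only what sits between the parentheses, because the parentheses themselves are matched outside the group.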
For the fourth algorithm:
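You have to traverse the length N of the string twice (once for each index search) and then obtain the result in N (the Substring copy), so the complexity would be O((2 * N) + N). Since Big O notation drops constant factors, this simplifies to O(N).
So, ranked from best to worst: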
1- O((2 * N) + N): the fourth algorithm.
2- O((N * M) + 1): the first algorithm.
3- O((N * M) + 1): the second algorithm. It consumes more memory than the first.
4- O(?): the third algorithm. Regex is the most complicated and the one that consumes the most memory. You can tell in advance that it has the greatest complexity because of the process it involves.
Note that in your example these times are insignificant (none reaches 1 ms of processing time), so to see the difference more clearly you would have to try it with a very long string.
This answer is based on my experience with algorithms. If someone is willing to document a rebuttal or finds an error, I am available to discuss it.
For more on the analysis of algorithms, you can read Understanding Big O Notation, or this link, which is more complete.
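To try this on a very long string, a rough sketch along these lines could be used. It assumes the input contains exactly one "(...)" pair. The class ParenBenchmark and the helpers BuildInput and Time are names invented here, and the Substring variant searches both indexes from the left with IndexOf, which is a small change from the code in the question. Stopwatch timings like these are only indicative, not a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ParenBenchmark
{
    // From the question: split on both parentheses and take element 1.
    public static string WithSplit(string s) => s.Split('(', ')')[1];

    // From the question: Regex with one capture group.
    public static string WithRegex(string s) =>
        Regex.Match(s, @"\(([^)]*)\)").Groups[1].Value;

    // Substring variant. Assumption: both indexes are searched from the
    // left, so the first "(...)" pair is the one returned.
    public static string WithSubstring(string s)
    {
        int start = s.IndexOf('(') + 1;
        int end = s.IndexOf(')', start);
        return s.Substring(start, end - start);
    }

    // Hypothetical helper: long input with one "(...)" pair in the middle.
    public static string BuildInput(int paddingLength)
    {
        string padding = new string('x', paddingLength);
        return padding + "('A','B','valorX','valorY')" + padding;
    }

    // Hypothetical helper: runs one extractor repeatedly and returns the
    // elapsed time in milliseconds.
    public static long Time(Func<string, string> extract, string input, int iterations)
    {
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            extract(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

For example, ParenBenchmark.Time(ParenBenchmark.WithSubstring, ParenBenchmark.BuildInput(1000000), 100) returns the milliseconds spent on 100 runs of the Substring variant. Running the same call with WithSplit and WithRegex gives numbers to compare on your own machine.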