Today, while answering a question on this site, I found a very interesting possible solution: I accidentally deleted part of my solution, and it still worked even though that didn't make sense to me.
Without further ado:
const regex = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/
const strings = [
  'AAAA_BBBB_CCCC_1_15_17'
  ,'AAAA_BBBB_1'
  ,'AAAA_BBBB_15_17'
  ,'AAAA_BBBB_CCCC_1_2'
]
strings.forEach(string => {
  // Drop the full match and keep only the captured groups
  const [fullMatch, ...groups] = string.match(regex)
  console.log(groups)
})
As you can see, I captured a non-capturing group using ((?:_\d+)+), and on the regex101 site it works in all the flavors offered there, which as of now are:
- pcre (php)
- javascript
- python
- golang
Note: since not everyone reads all the available information, the important point is that I am getting the behavior of
/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/
using
/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/
which is strange, because without the non-capturing group nested inside the capture, the captured group is just the last part that matches:
const regex = /(_\d+)+/g;
const str = `_1_2_3_4_5_6_7`;
let m;
while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}
I would like someone to explain why using a double capture worked, and what the implications (positive or negative) are of capturing a non-capturing group the way I did.
And it will work for you in any regex dialect that supports (?:…).
In fact, using the first form would be an error, since you are unnecessarily repeating the (?:_\d+)* at the end, which will never match anything: the preceding construct, (?:_\d+)+, has already consumed every occurrence there was, leaving nothing for the last one. This can be corroborated with an example by adding one more group around the last (?:_\d+)*.
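A minimal sketch of that check, assuming the question's first form with one extra capturing group (group 4, added here only for illustration) wrapped around the trailing (?:_\d+)*:

```javascript
// First form from the question, with one extra capturing group (group 4)
// wrapped around the trailing (?:_\d+)* purely for illustration.
const probe = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+((?:_\d+)*))/;

const inputs = [
  'AAAA_BBBB_CCCC_1_15_17',
  'AAAA_BBBB_1',
  'AAAA_BBBB_15_17',
  'AAAA_BBBB_CCCC_1_2',
];

// Collect what group 4 captured for each input.
const trailing = inputs.map(input => input.match(probe)[4]);
console.log(trailing); // every entry is the empty string
```

Group 4 takes part in the match but always captures an empty string, because the greedy (?:_\d+)+ before it has already taken every _digits block.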
You are not using a double capture. In ((?:_\d+)+), only the outer group captures, and (?:…) is precisely a group without capture.
A structure like ((?:_\d+)+) is perfectly normal and frequently used. Think of it this way: it's the same as (\d+), only what's repeated in ((?:_\d+)+) isn't just digits but underscores followed by digits.
Nesting groups (capturing or not) is just as valid as, and pretty much the same as, using nested loops in your code. Simple as that.
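To make the difference concrete, here is a small sketch comparing a repeated capturing group with a capturing group wrapped around a repeated non-capturing one. The sample string is the one from the question's second snippet; the variable names are only illustrative.

```javascript
const sample = '_1_2_3_4_5_6_7';

// Repeating a capturing group: each iteration overwrites the capture,
// so group 1 ends up holding only the last repetition.
const repeated = sample.match(/(_\d+)+/);

// Capturing around the repeated non-capturing group: group 1 holds the whole run.
const wrapped = sample.match(/((?:_\d+)+)/);

console.log(repeated[1]); // _7
console.log(wrapped[1]);  // _1_2_3_4_5_6_7
```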
None. Neither positive nor negative. You wouldn't have achieved the same result without nesting a non-capturing group inside a capturing one like that... Again, it's a completely normal structure.
In fact, as a general rule of thumb, you should always use non-capturing groups (?:...) when you don't need the text that was matched. A non-capturing group does not take up unnecessary memory (neither for capturing the text nor for generating the indices of the start and end positions).
By the way, one more correction. Using a structure like
[a-zA-Z]+_?[a-zA-Z]+?
is a mistake. You are consecutively repeating 2 constructs that match the same thing. Since the _ is optional, the regex can reduce to [a-zA-Z]+[a-zA-Z]+?, and such a construct is the perfect recipe for catastrophic backtracking. This is an issue that won't throw an error in the cases you're seeing, but with a slightly more complicated regex, longer texts, and input that doesn't match, it could cause the browser to freeze without returning a result.
Let's look at a test, not so drastic, but obvious enough:
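One way to sketch such a test in JavaScript follows. It makes several assumptions that are not in the original: both patterns are anchored with ^ so that only one starting position is tried, the input is a long run of letters with no trailing _digits so the match must fail, and the second pattern is a hypothetical rewrite in which the second run of letters exists only when the _ is present. Timings depend on the engine and the machine, so none are shown.

```javascript
// Hypothetical helper: run one match attempt and report how long it took.
function timeAttempt(regex, text) {
  const start = Date.now();
  const result = regex.exec(text);
  return { matched: result !== null, ms: Date.now() - start };
}

// A long run of letters with no trailing _digits, so every attempt must fail.
const text = 'AAAA_' + 'B'.repeat(5000);

// Second group as written in the question: two adjacent letter runs
// separated by an optional _ (anchored with ^ for this test only).
const ambiguous = /^([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/;

// Assumed alternative: the second letter run is allowed only after a _.
const unambiguous = /^([a-zA-Z]+)_([a-zA-Z]+(?:_[a-zA-Z]+)?)((?:_\d+)+)/;

const slow = timeAttempt(ambiguous, text);
const fast = timeAttempt(unambiguous, text);

console.log('ambiguous:  ', slow);
console.log('unambiguous:', fast);
```

With the ambiguous form, a backtracking engine tries every way of splitting the run of B's between the two letter runs before giving up, so the work grows roughly with the square of the run length. The assumed alternative has only one way to split it.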
And this, if it were part of a more complicated regex, could bring you serious problems.
Also, by using ([a-zA-Z]+_?[a-zA-Z]+?), you're requiring that group to be at least 2 characters long, so it wouldn't match something like A_B_1.
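This is easy to check with the regex from the question; the two inputs here are only illustrative.

```javascript
const pattern = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/;

// The second group needs at least two letters: one for each [a-zA-Z]+.
const oneLetter = pattern.test('A_B_1');
const twoLetters = pattern.test('A_BB_1');

console.log(oneLetter);  // false
console.log(twoLetters); // true
```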
The truth is that it has no implications. A non-capturing group is simply used to group an expression for convenience, without the result being returned as a group; this does not mean that it cannot be part of another group.
Consider the following example:
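A minimal sketch of such an example, assuming the pattern c(a|b)*c and the sample string cababac (both chosen here for illustration):

```javascript
// The capturing group is repeated, so it keeps only its last iteration.
const lastOnly = 'cababac'.match(/c(a|b)*c/);

console.log(lastOnly[0]); // cababac
console.log(lastOnly[1]); // a
```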
I don't get the group of a's and b's, but a group that I may not be interested in: the last a or last b matched by the expression a|b.
If I use a non-capturing group:
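A sketch of the non-capturing version, assuming the pattern c(?:a|b)*c and the sample string cababac (both illustrative):

```javascript
// With (?:...) the match result contains only the full match, no groups.
const noGroup = 'cababac'.match(/c(?:a|b)*c/);

console.log(noGroup[0]);     // cababac
console.log(noGroup.length); // 1
```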
I don't get any group.
But if I am interested in knowing the complete string between the two c's, I am forced to add a group that completely encloses the expression of interest, including the *:
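A sketch of that enclosing group, assuming the pattern c((?:a|b)*)c and the sample string cababac (both illustrative):

```javascript
// The outer group wraps the repeated non-capturing group together with its *,
// so it captures everything between the two c's.
const whole = 'cababac'.match(/c((?:a|b)*)c/);

console.log(whole[1]); // ababa
```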
Getting the full set of a's and b's.
EDIT:
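If what you are interested in is comparing your 2 expressions:
/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/
and
/([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/
let me tell you that they are completely equivalent.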
Take the last group in both and write A = _\d+. In the first, (?:A)+(?:A)* is equivalent to A+A*, which is undoubtedly the same as A+, the form used in the second.
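The equivalence can also be checked empirically. The sketch below runs both expressions from the question over the question's own sample strings and compares the full match and every captured group; the variable names are illustrative.

```javascript
// The two expressions from the question.
const doubled = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+(?:_\d+)*)/;
const single = /([a-zA-Z]+)_([a-zA-Z]+_?[a-zA-Z]+?)((?:_\d+)+)/;

const samples = [
  'AAAA_BBBB_CCCC_1_15_17',
  'AAAA_BBBB_1',
  'AAAA_BBBB_15_17',
  'AAAA_BBBB_CCCC_1_2',
];

// Serialize each match result (full match plus groups) and compare them.
const sameCaptures = samples.every(
  s => JSON.stringify(s.match(doubled)) === JSON.stringify(s.match(single))
);

console.log(sameCaptures); // true
```

A handful of samples is not a proof, of course; the argument with A+A* and A+ is what shows the equivalence in general.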
Notice that you are not even capturing the non-capturing group itself, but a different expression: in ((?:A)+), the + quantifier makes what is captured, (?:A)+, a different expression from (?:A). And even if it were the same expression, nothing prevents capturing that same group: ((A)) is as valid as ((?:A)).
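A short sketch of that last point, taking A = _\d+ and the illustrative input _7:

```javascript
// Two nested capturing groups: both capture the same text.
const both = '_7'.match(/((_\d+))/);

// Outer capturing group around a non-capturing one: a single capture.
const outerOnly = '_7'.match(/((?:_\d+))/);

console.log(both.slice(1));      // [ '_7', '_7' ]
console.log(outerOnly.slice(1)); // [ '_7' ]
```

Both forms are valid; the only difference is how many groups come back in the result.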