I have to process a string of comma-separated values. Each value is a triplet of components that must be translated at run time into a different type depending on its content. The input data would look similar to:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each sub-string should be processed as follows:
1x2y3z would be processed as Vector3 with x = 1, y = 2 and z = 3.
80r160g255b would be processed as Color with r = 80, g = 160 and b = 255.
48h30m50s would be processed as Time with h = 48, m = 30 and s = 50.
The problem I run into is that each component is optional (although the components always appear in the same order), so the following strings are also valid Vector3, Color and Time values:
1x3z would be processed as Vector3 with x = 1, y = 0 and z = 3.
255b would be processed as Color with r = 0, g = 0 and b = 255.
1h would be processed as Time with h = 1, m = 0 and s = 0.
What have I tried so far?
All components as optional.
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The characters A, B and C would be replaced by the correct letter in each case. This expression works fine, except that it returns twice as many results as expected (one for the searched string, plus an empty string right after the first match). For example:
1h1m1s gives two matches: "1h1m1s" and "".
11x50z gives two matches: "11x50z" and "".
11111h gives two matches: "11111h" and "".
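The doubled results can be reproduced with a minimal sketch like the one below. The helper name all_matches is hypothetical, and A, B, C are assumed to have been replaced by h, m, s. It only collects whatever std::sregex_iterator reports.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every match that std::sregex_iterator
// reports for `pattern` inside `text` (default ECMAScript grammar).
std::vector<std::string> all_matches(const std::string& text,
                                     const std::string& pattern) {
    const std::regex re(pattern);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        found.push_back(it->str());  // may be an empty string
    }
    return found;
}
```

Calling all_matches("1h1m1s", R"(((?:\d+h)?(?:\d+m)?(?:\d+s)?))") should give the full string followed by an empty match at the end of the input, because every component of the expression is optional.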
I can't say I didn't expect it. After all, an empty string matches the regular expression when all components are absent. So, to fix this issue, I tried the following:
Quantifier from 1 to 3 elements.
((?:\d+[ABC]){1,3})
But with this expression, strings are captured with the wrong order or even with repeated elements:
1s1m1h is a match, but it should not match (wrong order).
11z50z is a match, but it should not match (repeated components).
1r1r1b is a match, but it should not match (repeated components).
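A small sketch of why the quantifier version is too permissive. The function name is hypothetical, and A, B, C are assumed to be h, m, s. The character class accepts any unit at any position, so neither order nor uniqueness is enforced.

```cpp
#include <regex>
#include <string>

// Hypothetical check using the quantifier attempt with the units h, m, s.
// std::regex_match requires the whole string to match the expression.
bool matches_quantifier_attempt(const std::string& value) {
    static const std::regex re(R"(((?:\d+[hms]){1,3}))");
    return std::regex_match(value, re);
}
```

With this check, "1h1m1s" is accepted as intended and the empty string is rejected, but "1s1m1h" and "1h1h1h" are accepted as well.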
So I made another try with a modified version of my first try:
Match from start ^ to end $ of the string.
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
It works better than the first version but still matches empty strings, with the added disadvantage that I must first split the string at each comma (,) and run the expression over each sub-string.
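A sketch of that split-then-match approach, assuming the units h, m, s and a hypothetical function name. It splits with std::getline and keeps whatever the anchored expression accepts, which shows that an empty piece still gets through.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `input` on commas and keep the pieces that the
// anchored expression accepts (A, B, C replaced by h, m, s).
std::vector<std::string> accepted_pieces(const std::string& input) {
    static const std::regex re(R"(^((?:\d+h)?(?:\d+m)?(?:\d+s)?)$)");
    std::vector<std::string> accepted;
    std::istringstream stream(input);
    std::string piece;
    while (std::getline(stream, piece, ',')) {
        // An empty piece is accepted too, since every component is optional.
        if (std::regex_match(piece, re)) accepted.push_back(piece);
    }
    return accepted;
}
```

For the input "48h30m50s,,1s1m1h,1h" this keeps "48h30m50s", the empty piece and "1h", and drops "1s1m1h". An explicit check such as !piece.empty() would be one way to filter out the empty piece, although that is a workaround outside the regular expression itself.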
Using Lookahead
The attempt using Lookahead:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
The results are very good: it detects valid matches without adding false positives. Unfortunately, every time it finds a string that doesn't match the expression, it adds an empty string just before the invalid string (it finds "" before "1s1m1h", "1h1h1h", "adfank" and "12322134445688"), so I've made one last try by modifying the lookahead condition:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
This expression removes the empty strings found before any string that does not match (?:\d+[ABC]){1,3} (the empty strings before "adfank" and "12322134445688"), but the empty strings before "1s1m1h" and "1h1h1h" are still caught.
So my question is: Is there a regular expression that matches value triplets in a given order, without repetitions, with all optional components but composed of at least one element, and doesn't match empty strings?
The regex tool I'm using is <regex> from C++11.
Let us start from the expression in which each of the three magnitudes is optional.
1. Anchor to the beginning of a value
To ensure that a value starts at the start of the text or at a comma, we add both options to the beginning.
2. Avoid empty matches
As you showed in your last attempt, a positive assertion (positive lookahead) can be used to guarantee that there is some character before the comma, without consuming that character in the overall match. We just need to verify that there is at least one digit (\d).
3. Only match if it matches the entire pattern
Now, as you mentioned in your last comment, such a pattern could satisfy the lookahead and then match an empty string. To prevent that, we require the match to be followed by a comma or by the end of the string. We use another assertion here, so that the next comma is not consumed (and remains available for the next match).
Demo on regex101.com
4. Capture numbers and units separately
For practicality, we should use groups (in parentheses) to capture each of the values separately.
Demo on regex101.com
Code
Result:
Demo on ideone.com
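The four steps above can be sketched in C++11 roughly as follows. This is a reconstruction under assumptions, not necessarily the exact expression or code from this answer: the expression is built directly from the steps (start of text or comma, lookahead for a digit, three optional captured components, lookahead for comma or end), and the Time struct, component and parse_times are hypothetical names for the h/m/s case.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical target type for the h/m/s triplet.
struct Time { int h; int m; int s; };

// Convert a capture group to int, using 0 when the component is absent.
// Note: std::stoi throws if the number does not fit in an int.
static int component(const std::ssub_match& group) {
    return group.matched ? std::stoi(group.str()) : 0;
}

std::vector<Time> parse_times(const std::string& input) {
    // (?:^|,)   step 1: start of the text or a comma
    // (?=\d)    step 2: at least one digit follows, so no empty values
    // (\d+)h .. step 4: each optional component captured separately
    // (?=,|$)   step 3: a comma or the end must follow (not consumed)
    static const std::regex re(
        R"((?:^|,)(?=\d)(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?(?=,|$))");
    std::vector<Time> times;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        times.push_back(Time{component(m[1]), component(m[2]), component(m[3])});
    }
    return times;
}
```

Because the leading comma is part of the overall match, the sketch reads the capture groups rather than the whole match. For "48h30m50s,1h,1s1m1h,,adfank,1443s" it should yield three values (48h30m50s, 1h and 1443s) and skip the out-of-order, empty and non-numeric entries.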
Let's take one of the three possible groups, since the solution should be easy to extend later:
Which can be schematized as:
Now, each of the three components of the group is optional, although to avoid false positives we must assume that at least one will always be present. That is, the group must contain at least \d+x or \d+y or \d+z. This assumption has certain implications:
After \d+x, it is possible that we find \d+y and \d+z.
After \d+y, we may find \d+z, but we will never find \d+x.
After \d+z, we will find neither \d+x nor \d+y.
Translated into a regular expression, this would look like:
This solution avoids retrieving empty strings since it always forces there to be at least one element.
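One way to write that reasoning down is an alternation with one branch per possible first component. The sketch below is an assumption about the shape of such an expression, not necessarily the one from this answer, and is_vector3 is a hypothetical name. It checks a single value that has already been separated from the comma-separated input.

```cpp
#include <regex>
#include <string>

// Hypothetical check for one x/y/z value. Each branch starts with a
// mandatory component, followed only by the components allowed after it.
bool is_vector3(const std::string& value) {
    static const std::regex re(
        R"(\d+x(?:\d+y)?(?:\d+z)?|\d+y(?:\d+z)?|\d+z)");
    return std::regex_match(value, re);
}
```

Since every branch begins with a mandatory component, the empty string cannot match, while inputs such as "1x3z" or "2y" are accepted and "3z1x" or "1x1x" are rejected.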
As a result of a conversation in the chat with @Mariano, another option:
This would only be advisable if you can guarantee that the received data is correct, since it would swallow, for example, 1x2345abracadabra; but of course it should be faster.
The above expression could be expanded to be slightly less forgiving: