Added Details
I found on the internet the following two ways to negate:
?!
[^\w]
But I can't find documentation, in Spanish or English, that describes how they work. What I did find is too advanced for me to understand what each one means and how to use it properly to get the expected result. The current answer fixes the problem but does not explain how these constructs are used.
Problem Statement
I want to select all those words that are not in quotes within a text. I know how to do the opposite.
Example:
Lorem ipsum "dolor sit amet", consectetur adipiscing elit, maecenas est felis "sit amet".
With the following regular expression you can take the words that are in quotes:
/"([\w\s]+)"/gim
The result
[
1 => 'dolor sit amet',
2 => 'sit amet'
]
What I look for
[
1 => 'Lorem ipsum ',
2 => ', consectetur adipiscing elit, maecenas est felis ',
3 => '.',
]
Another example would be, from the following list:
- Hello
- hello
- hello
- hi*
- hello
Print or select the entries that use non-alphanumeric characters (I already know how to select the opposite). In general: take everything that is not an alphanumeric character, take all the words that do not contain an "l", take everything that does not start with the letter "z", and so on.
Working example: http://www.regextester.com/15
I would expect to do something like this for everything that doesn't start with "a":
/!^a.*/
But obviously that doesn't work. I look forward to your feedback.
Clarification
I would also like to understand the proposed solution, not just copy and paste it to solve the problem.
Note: The regular expression I quote here to obtain the text works for me in PHP and JavaScript (the languages I am using to solve the problem). I have seen that regular expressions vary slightly between languages, but between these two the differences are not substantial. I would therefore like the proposed solution to work in at least one of the two.
I add this answer as information related to regular expressions. It is my first answer on SO in Spanish and it is not a translation, so if it is not correct I can delete it or correct it.
Regarding what you commented in your question:
These are two different concepts. On the one hand you have what is known as a lookaround, and on the other hand a character class. They work like this:
Lookarounds
Lookarounds can be understood as different ways of checking whether a pattern is (or is not) preceded or followed by another pattern. For example, the expression
hola(?!chau)
will match the word hola as long as it is not immediately followed by chau. For instance, it matches hola in holamundo but not in holachau.
Your question is about how to negate, but I also wanted to mention that lookarounds are divided into:
- Positive lookahead: hola(?=chau) matches the word hola only if chau comes right after it
- Negative lookahead: hola(?!chau) matches the word hola only if chau does NOT come right after it
- Positive lookbehind: (?<=chau)hola matches the word hola only if chau comes right before it
- Negative lookbehind: (?<!chau)hola matches the word hola only if chau does NOT come right before it
It is important to mention that lookbehinds are not supported by JavaScript in all browsers (see compatibility).
You can find more information about lookarounds at:
http://www.regular-expressions.info/lookaround.html
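As a rough illustration of the four lookaround types in JavaScript, here is a minimal sketch. The sample strings and the names `samples`, `patterns` and `matchesByPattern` are made up for this example, and the lookbehind patterns assume an engine that supports them (for instance a recent Node.js or browser).

```javascript
// Hypothetical sample inputs built from the words used in the examples above.
const samples = ["holachau", "holamundo", "chauhola", "mundohola"];

// One pattern per lookaround type.
const patterns = {
  positiveLookahead: /hola(?=chau)/,   // "hola" followed by "chau"
  negativeLookahead: /hola(?!chau)/,   // "hola" NOT followed by "chau"
  positiveLookbehind: /(?<=chau)hola/, // "hola" preceded by "chau"
  negativeLookbehind: /(?<!chau)hola/, // "hola" NOT preceded by "chau"
};

// For each pattern, collect the samples in which it finds a match.
const matchesByPattern = {};
for (const [name, re] of Object.entries(patterns)) {
  matchesByPattern[name] = samples.filter((s) => re.test(s));
}
console.log(matchesByPattern);

// A lookaround only checks its condition; it is not part of the matched text.
const matchedText = "holamundo".match(/hola(?!chau)/)[0];
console.log(matchedText); // "hola"
```

Note that the negative lookahead also accepts "chauhola": nothing follows hola there, so the condition "not followed by chau" holds.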
Character classes
Character classes, on the other hand, are sets of characters, written between square brackets: [...]
In other words, [aeiou] matches any single unaccented vowel. A class can also be negated, as you mentioned, by putting ^ at the beginning, so [^aeiou] matches any single character that is not an unaccented vowel.
Here is more information about character classes:
http://www.regular-expressions.info/charclass.html
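To make the difference between a class and its negation concrete, here is a small JavaScript sketch. The sample string and the variable names are illustrative assumptions, not part of the question.

```javascript
// Hypothetical sample text.
const text = "hola, mundo!";

// [aeiou] matches any single unaccented vowel.
const vowels = text.match(/[aeiou]/g);

// [^aeiou] matches any single character that is NOT one of those vowels,
// which includes consonants but also spaces and punctuation.
const nonVowels = text.match(/[^aeiou]/g);

// [^\w] (the same as \W) matches anything that is not a letter, digit or underscore.
const nonWord = text.match(/[^\w]/g);

console.log(vowels);    // [ 'o', 'a', 'u', 'o' ]
console.log(nonVowels); // [ 'h', 'l', ',', ' ', 'm', 'n', 'd', '!' ]
console.log(nonWord);   // [ ',', ' ', '!' ]
```

Keep in mind that a negated class still consumes exactly one character; it does not mean "a string without vowels".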
Verbs
Now that you have a bit of context: if you want to use regular expressions to match all the words that are not in quotes, PCRE (Perl Compatible Regular Expressions, supported by PHP, R, Delphi and others) has verbs that are very useful in your case.
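The best known are (*SKIP) and (*FAIL), which are usually used together.
Practical example
These patterns are often called the discard technique, and they always take the same form: the pattern to discard followed by (*SKIP)(*FAIL), then OR (|), then the pattern to keep.
Thus, the expression
".*?"(*SKIP)(*FAIL)|(\w+)
discards every match of whatever comes before (*SKIP)(*FAIL), here ".*?", and captures the last alternative (the one in parentheses, since parentheses are used to capture content).
The regular expression
".*?"(*SKIP)(*FAIL)|(\w+)
explained would be:
- ".*?" matches a quoted string: a quote, as few characters as possible, and the closing quote
- (*SKIP) marks this position, so that when the match fails afterwards the engine resumes searching from here instead of retrying inside the quoted string
- (*FAIL) forces the match to fail at this point, so the quoted string is discarded
- | is the alternation (OR)
- (\w+) captures one or more word characters (letters, digits or underscore) in group 1
Therefore, in the link above, when that expression is applied to the text: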
The following content is captured:
In conclusion, regular expressions are, in my opinion, spectacular, but only if you know how to use them. Personally, I can't live without them, but as with everything, to drive a nail you need a hammer and not a screwdriver. Regexes are great for pattern matching, but if you need logic then this is definitely not the tool to use.
It's best in these cases to take the easy way out (regex is hell). Since you already know how to find what you don't want (the quoted text), the easiest thing is to use preg_split to remove everything that matches that expression.
When you run this on your example string, it returns three blocks, which are the blocks that are not enclosed in quotes.
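Since the question also mentions JavaScript, here is a rough analogue of the split approach using String.prototype.split. It is a sketch that assumes the sample sentence and the pattern from the question; the variable names are illustrative.

```javascript
// Sample sentence from the question.
const text =
  'Lorem ipsum "dolor sit amet", consectetur adipiscing elit, maecenas est felis "sit amet".';

// The question's pattern without the capturing group: it matches each quoted block.
const quoted = /"[\w\s]+"/;

// Splitting on the quoted blocks leaves only the text outside the quotes.
const outside = text.split(quoted);
console.log(outside);
// [ 'Lorem ipsum ', ', consectetur adipiscing elit, maecenas est felis ', '.' ]

// With a capturing group, split() also returns the captured (quoted) text.
const withQuoted = text.split(/"([\w\s]+)"/);
console.log(withQuoted.length); // 5
```

The pattern only accepts word characters and whitespace between the quotes, so a quoted block containing punctuation would not be treated as a delimiter.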
If you also want to get the text that will be removed, first call preg_match() and then do a normal split of the string with explode; there is no need for preg_split. Of course you can use preg_split, but it would waste processing cycles.
The other case is a bit easier.
Anything with non-alphanumeric characters
Simply use a negated class, that is, an expression that matches any non-alphanumeric character. With that expression you can get the entries that match using preg_grep. The output then contains only the entries that have at least one non-alphanumeric character.
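In JavaScript, Array.prototype.filter can play roughly the role of preg_grep. The sketch below assumes a made-up list loosely based on the one in the question, and an ASCII-only idea of "alphanumeric" (accented letters would count as symbols here).

```javascript
// Hypothetical list, loosely based on the one in the question.
const entries = ["Hello", "hello", "hi*", "he-llo", "hola2"];

// [^a-z0-9] with the i flag matches any character that is not an ASCII letter or digit.
const nonAlphanumeric = /[^a-z0-9]/i;

// Entries that contain at least one non-alphanumeric character.
const withSymbols = entries.filter((entry) => nonAlphanumeric.test(entry));

// The opposite selection: entries made only of letters and digits.
const onlyAlphanumeric = entries.filter((entry) => !nonAlphanumeric.test(entry));

console.log(withSymbols);      // [ 'hi*', 'he-llo' ]
console.log(onlyAlphanumeric); // [ 'Hello', 'hello', 'hola2' ]
```

Negating the result of test() is often simpler than trying to negate the pattern itself.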
Everything that doesn't start with "a"
With this expression:
^[^a]+
the output contains only the entries whose first character is not "a".
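The same selection can be sketched in JavaScript, which also shows why the /!^a.*/ attempt from the question cannot work and how a negative lookahead compares with the negated class. The list and the variable names are assumptions made for the example.

```javascript
// Hypothetical list of entries.
const entries = ["amigo", "zorro", "Hello", "hello", "ala"];

// A "!" in front of the pattern is just a literal character. Since ^ can only
// match at the start of the input, nothing can come before it, so this never matches.
const literalBang = /!^a.*/;

// ^[^a] : there is a first character and it is not "a".
const negatedClass = /^[^a]/;

// ^(?!a) : at the start, what follows is not "a" (an empty string also passes).
const negativeLookahead = /^(?!a)/;

const byBang = entries.filter((e) => literalBang.test(e));
const byClass = entries.filter((e) => negatedClass.test(e));
const byLookahead = entries.filter((e) => negativeLookahead.test(e));

console.log(byBang);      // []
console.log(byClass);     // [ 'zorro', 'Hello', 'hello' ]
console.log(byLookahead); // [ 'zorro', 'Hello', 'hello' ]
```

The two working patterns differ only on the empty string: the negated class needs one character to consume, while the lookahead consumes nothing.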
If you use the following regular expression:
Or more exactly something similar to the following code:
You get the following output:
View demo online.
If you take a look at the documentation for the preg_split function, you'll find that the PREG_SPLIT_NO_EMPTY flag removes empty strings from the output, and the PREG_SPLIT_DELIM_CAPTURE flag also returns, in the result, the text matched by the parenthesized parts of the regular expression.
Discard technique (also called "the best regex trick" by RexEgg). Works in JavaScript.
It is very simple, it consists of
That is all!
This "trick" relies on also matching what you do not want to match, but here is the catch: that part is not captured! That subtle difference is what tells us whether the regex matched our exception or matched what we actually wanted.
The parentheses in (esto sí) create a group and, like any group, when it matches text it captures it. That means the captured text is returned separately in the result of RegExp.exec() or of String.matchAll(). So it's just a matter of checking whether something was captured in group 1 or not.
Let's take the example from the question: select all the text except the parts in quotes.
Code:
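A minimal sketch of how the technique could look in JavaScript (not necessarily the answer's original code). It assumes the sample sentence from the question; the pattern and the names `re` and `outsideQuotes` are illustrative choices.

```javascript
// Sample sentence from the question.
const text =
  'Lorem ipsum "dolor sit amet", consectetur adipiscing elit, maecenas est felis "sit amet".';

// Left alternative: what we want to discard (a quoted block), NOT captured.
// Right alternative: what we want to keep (a run of non-quote characters), captured in group 1.
const re = /"[^"]*"|([^"]+)/g;

const outsideQuotes = [];
for (const match of text.matchAll(re)) {
  // Group 1 is undefined whenever the left (discarded) alternative matched.
  if (match[1] !== undefined) {
    outsideQuotes.push(match[1]);
  }
}

console.log(outsideQuotes);
// [ 'Lorem ipsum ', ', consectetur adipiscing elit, maecenas est felis ', '.' ]
```

The order of the alternatives matters: the quoted block is tried first, so the text inside the quotes is consumed by the uncaptured branch and never reaches the capturing one.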