I am trying to write a regular expression that removes comments of the style // and /**/. At the moment I am using one taken from this site:
(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
The problem appears when the comment marker is inside a string literal (with either single or double quotes), e.g.:
var a = "//Holaaa";
So I tried using lookbehind and lookahead together to exclude both kinds of quotes, and it came out like this:
(?<!\"|\')((/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*))(?!\"|\')
The problem with this is that it doesn't work for cases like the following:
var ar = "asasasas /*dsdsdsd*/ "
var ar = "asaasas //dsdsdsd"
var ar = "asasasas /*dsdsdsd*/ dsadasdsda"
var ar = "asaasas //dsdsdsd asdsadsadsad"
I tried changing (?<!\"|\') to (?<!\"|\'.*) and (?!\"|\') to (?!.*\"|\'), but that didn't work either.
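The behaviour of the first lookaround attempt can be checked with a few lines of Java. This is only a sketch for reproducing it: the class name LookaroundAttempt and the helper strip are illustrative, and the pattern is the first attempt shown above, escaped for a Java string literal.

```java
import java.util.regex.Pattern;

final class LookaroundAttempt {
    // The first lookaround attempt, escaped for a Java string literal.
    static final Pattern ATTEMPT = Pattern.compile(
        "(?<!\"|')((/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/)|(//.*))(?!\"|')");

    static String strip(String source) {
        return ATTEMPT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // The quote sits directly before the slashes, so the lookbehind blocks the match.
        System.out.println(strip("var a = \"//Holaaa\";"));
        // A space precedes the marker, so text inside the string is still removed.
        System.out.println(strip("var ar = \"asasasas /*dsdsdsd*/ \""));
        System.out.println(strip("var ar = \"asaasas //dsdsdsd\""));
    }
}
```

The lookbehind only protects a comment marker that touches the quote. As soon as any other character sits in between, the text inside the string is removed.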
What am I missing?
Note: the idea is to use it in Java, but the answer does not have to follow Java's syntax. Once I know the expression, I can adapt it on my own.
Problem
Honestly, the regex you pulled from that page is poor, both in what it fails to cover and in terms of efficiency. Trying to fix it with assertions (lookahead / lookbehind) is a reasonable instinct, but it is a strategy that doesn't work well here. The full explanation of why is too long, but in summary: something like (?<!"|') only checks 1 character back from the current position and, even if it could be given a variable length (it can't), you still would not be able to tell whether the previous quote opens or closes a string. In short: wrong strategy (one we have all fallen for).
Solution
For this kind of case, where all the syntax before the position of a candidate match is relevant, the way to reach that position is to consume each part of the text while validating each structure.
The regex should be anchored at the beginning of the text, or at the end of the previous match (\G), and match the text in which a comment marker has no meaning, up to the point where a comment is found. Broadly speaking, each match consists of the preceding text, which is captured in a group, followed by the comment, and the replacement writes back only the captured text.
Regular expression
Now, finding everything that is not a comment involves matching all characters except those with special meaning, and adding rules to match each of those exceptions (a \ that escapes a character, quoted text, etc.).
To simplify the explanation, I annotated the regex with comments stating the purpose of each structure:
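A pattern along these lines, written as a sketch in Java's free-spacing mode (Pattern.COMMENTS), could look as follows. It illustrates the approach and is not necessarily the exact regex behind the demo link. The names CommentStripper, STRIP and strip are assumptions, and the sketch only understands single- and double-quoted strings with backslash escapes.

```java
import java.util.regex.Pattern;

final class CommentStripper {
    // Free-spacing mode: whitespace in the pattern is ignored and '#' starts a note.
    static final Pattern STRIP = Pattern.compile(
          "\\G                          # start of text, or end of the previous match\n"
        + "(                            # group 1: the text that is not a comment\n"
        + "  (?:\n"
        + "      [^'\"/]++               # characters with no special meaning\n"
        + "    | \"(?:[^\"\\\\]|\\\\.)*+\"   # double-quoted string with backslash escapes\n"
        + "    | '(?:[^'\\\\]|\\\\.)*+'     # single-quoted string with backslash escapes\n"
        + "    | /(?![/*])               # a slash that does not open a comment\n"
        + "  )*+                         # possessive: never give back consumed text\n"
        + ")\n"
        + "(?:\n"
        + "    //.*                      # line comment, up to the end of the line\n"
        + "  | /\\*(?s:.*?)\\*/            # block comment, possibly spanning lines\n"
        + ")",
        Pattern.COMMENTS);

    static String strip(String source) {
        // "$1" keeps the captured non-comment text and drops the comment.
        return STRIP.matcher(source).replaceAll("$1");
    }
}
```

Replacing each match with $1 keeps the captured text and drops the comment. In this sketch, an unterminated string or block comment simply stops the matching, leaving the rest of the text untouched.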
Or in a line without comments:
With escaped slashes and quotes for Java:
Demo:
https://regex101.com/r/wDg8LJ/1/
Java code
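One possible way to apply a consuming pattern of this kind from Java is sketched below, using a one-line form escaped for a string literal. The class StripCommentsDemo, the method removeComments and the trailing "real" comments in the sample input are assumptions made for illustration, and the pattern is not necessarily identical to the one behind the links in this answer.

```java
import java.util.regex.Pattern;

final class StripCommentsDemo {
    // Group 1 consumes strings and other non-comment text; the comment follows it.
    private static final Pattern COMMENT = Pattern.compile(
        "\\G((?:[^'\"/]++|\"(?:[^\"\\\\]|\\\\.)*+\"|'(?:[^'\\\\]|\\\\.)*+'|/(?![/*]))*+)"
        + "(?://.*|/\\*(?s:.*?)\\*/)");

    static String removeComments(String source) {
        // "$1" writes back the non-comment text and drops the comment itself.
        return COMMENT.matcher(source).replaceAll("$1");
    }

    public static void main(String[] args) {
        // The four lines from the question, two of them with a real comment appended.
        String source = String.join("\n",
            "var ar = \"asasasas /*dsdsdsd*/ \"",
            "var ar = \"asaasas //dsdsdsd\"",
            "var ar = \"asasasas /*dsdsdsd*/ dsadasdsda\" // real comment",
            "var ar = \"asaasas //dsdsdsd asdsadsadsad\" /* another one */");
        System.out.println(removeComments(source));
    }
}
```

The string contents survive unchanged, while the two appended comments are removed. The whitespace that preceded each comment is kept.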
Result
Demo:
http://ideone.com/NSGmCL