I want to get the code between the form tags of a web page using Google Apps Script, for which I am using match with the regular expression /<form(.*?)<\/form>/g
function test() {
  var html = UrlFetchApp.fetch('https://docs.google.com/forms/d/1awKpg_diniayS6360kNXrcgihk36azQ3DJEaZqXDY7A/viewform?embedded=true').getContentText();
  var form = html.match(/<form(.*?)<\/form>/g);
  Logger.log(form);
}
I've tried replacing the g modifier with other modifiers like m and s, but the Google Apps Script editor doesn't accept them.
Is it possible to use these modifiers? How?
The regex you're using doesn't match the form because, in .*?, the period matches any character except line breaks. As for your question, the /g and /m modifiers have nothing to do with this behavior: /g causes all matches to be returned, and /m exclusively changes the behavior of ^ and $.
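The behavior described above can be checked with a small sketch that runs in any JavaScript engine. The sample string is made up for illustration and is not taken from the page in the question.

```javascript
// Hypothetical sample: a form whose content spans several lines.
var sample = '<form action="/x">\n<input name="a">\n</form>';

// The dot stops at line breaks, so neither pattern can span the whole form.
var withG = sample.match(/<form(.*?)<\/form>/g); // null
var withM = sample.match(/<form(.*?)<\/form>/m); // null: /m only affects ^ and $

// What /m actually does: ^ also matches at the start of every line.
var lineStarts = sample.match(/^</gm).length; // 3, one "<" per line
```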
I show 2 methods to get all the forms in an HTML document: the way I think is correct, and then how it could be done with regex (not as reliable, but significantly more efficient and less dangerous than using [\s\S]* for the content of a tag).
Method 1: Accessing the DOM (the correct way to do it)
With XmlService we can use the DOM (Document Object Model), which allows us to select all the <form> tags present in the document.
XmlService has the .parse() method to process a plain structured string as an XML document. However, it is not permissive at all: it only accepts valid, well-formed XML (something that almost no web page complies with). To get around this, we can use a trick. An older service, the XML Service, is much more forgiving and accepts plain HTML, converting it to valid XML. Once processed, we can pass the result to XmlService. The bad news: the XML Service is deprecated and there's no telling how long it will be around. But for now it works.
This approach is safe, since any problem with the HTML would be detected after Xml.parse(), avoiding incorrect text output.
The key is that, once the page is converted to XML, we have functions to navigate, add, or modify specific parts of each element. We use the getDescendants() method to get all the nodes, and getName() to check whether each one is a form.
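As a rough sketch of the traversal step only, assuming the page has already been converted to well-formed XML and parsed with XmlService: the helper name findForms is illustrative, and the lenient HTML-to-XML conversion is left out because it depends on the deprecated service.

```javascript
// Hypothetical helper: collects every <form> element below `root`.
// `root` is assumed to be an XmlService Element, for example
// XmlService.parse(xml).getRootElement() in Apps Script.
function findForms(root) {
  var forms = [];
  var nodes = root.getDescendants();
  for (var i = 0; i < nodes.length; i++) {
    var el = nodes[i].asElement(); // null when the node is not an element
    if (el && el.getName().toLowerCase() === 'form') {
      forms.push(el);
    }
  }
  return forms;
}
```

Text nodes and comments are skipped because asElement() returns null for them, so only real elements are compared against the tag name.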
Method 2: Regex (may fail)
The following regex matches the text of a <form> from beginning to end (demo on regex101.com):
/<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi
Description
<\s*form matches < followed by 0 or more spaces and then form.
\b matches at a position that is a whole-word boundary (so it matches form and not formosa).
[\s\S]*? matches any character repeated 0 or more times. The trailing ? makes the repetition non-greedy, that is, it repeats as few times as possible (this syntax works around the lack of /.*?/s).
<\s*\/\s*form\b is the pattern for </form, which can have spaces around the slash and must be a whole word.
[^>]* consumes any characters except >.
> matches the end of the tag.
/gi sets the modes: global, so that it returns all the results it finds, and case-insensitive.
Code
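A minimal sketch of how the pattern could be applied, with the regex reconstructed from the description above. The helper name extractForms is an assumption made for illustration.

```javascript
// Regex reconstructed from the description above.
var FORM_REGEX = /<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi;

// Hypothetical helper: returns every <form>...</form> block found in `html`.
function extractForms(html) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return html.match(FORM_REGEX) || [];
}

// In Apps Script this could be called as (not executed here):
// var html = UrlFetchApp.fetch(url).getContentText();
// Logger.log(extractForms(html));
```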
When could it fail?
There is a lot of information on the web about why you shouldn't use regex to process HTML. Very broadly, it fails with any structure that obfuscates the <form> tags or allows them to appear without being evaluated as such. To go no further, the regular expression would fail when there is a form inside a comment. This can be easily fixed, but then we'll have another odd case that makes it fail and could be fixed, and then another, and another, and so on.
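The comment case can be illustrated with a short sketch. The page content is invented for the example, and the pattern is the one reconstructed from the description earlier in this answer.

```javascript
var FORM_REGEX = /<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi;

// Hypothetical page: the first form is commented out, the second one is live.
var page = '<!-- <form id="old"></form> -->\n<form id="live"></form>';

// The regex knows nothing about comments, so it reports both forms,
// although a browser would only treat the second one as a form.
var matches = page.match(FORM_REGEX);
```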
HTML does not meet the conditions to be a regular language, so it is not recommended to use regular expressions to process it.
With jQuery you can do it easily: you select all the forms on the page, iterate over them, and collect the inner HTML of each one. If you need a specific form, you can modify the selector. This is one way; you can use other frameworks, but regular expressions are not recommended.
NOTE: The fact that HTML is not a regular language does not mean that no regular expression will work, but rather that it does not meet the conditions for them to work in every case.
According to answer #1 in Issue 2098: Multiline regular expression modifier not working, there are problems that prevent the use of the m modifier; however, something like [\s\S] can be used instead. The following code does successfully get the code between the form tags:
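A minimal sketch of that approach, replacing the dot in the pattern from the question with [\s\S]. The helper name getForms is an assumption made for illustration.

```javascript
// Hypothetical helper: [\s\S] matches any character, line breaks included,
// so the pattern can span a form written across several lines.
function getForms(html) {
  return html.match(/<form[\s\S]*?<\/form>/g) || [];
}

// In Apps Script this could be called as (not executed here):
// var html = UrlFetchApp.fetch(url).getContentText();
// Logger.log(getForms(html));
```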