
Question

Rubén

Asked: 2020-04-22 14:40:22 +0800 CST

Using Regular Expression Modifiers in Google Apps Script

I want to get the code between the form tags of a web page using Google Apps Script, for which I am using match with the regular expression /<form(.*?)<\/form>/g

function test() {
  var html = UrlFetchApp.fetch('https://docs.google.com/forms/d/1awKpg_diniayS6360kNXrcgihk36azQ3DJEaZqXDY7A/viewform?embedded=true').getContentText();
  var form = html.match(/<form(.*?)<\/form>/g);
  Logger.log(form);
}

I've tried replacing the g modifier with other modifiers such as m and s, but the Google Apps Script editor doesn't accept them.

Is it possible to use these modifiers? How?

3 Answers

Mariano · Answer 1 · 2020-07-29T16:33:31+08:00

The regex you're using doesn't match the form because, in .*?, the period matches any character except line breaks. As for your question, the /g and /m modifiers have nothing to do with this behavior. /g causes all matches to be returned, and /m exclusively changes the behavior of ^ and $.
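A minimal sketch of the behavior described above, in plain JavaScript. The sample strings are made up for illustration and are not taken from the page in the question.

```javascript
// Hypothetical input: a form whose markup spans several lines.
var sample = '<form>\n  <input name="q">\n</form>';

// `.` stops at line breaks, so the pattern from the question finds nothing.
var dotMatch = sample.match(/<form(.*?)<\/form>/g);

// [\s\S] matches any character, line breaks included.
var anyCharMatch = sample.match(/<form([\s\S]*?)<\/form>/g);

// /m only changes where ^ and $ may match (at every line, not just the whole string).
var withoutM = 'a\nb'.match(/^b$/);
var withM = 'a\nb'.match(/^b$/m);

console.log(dotMatch);            // null
console.log(anyCharMatch.length); // 1
console.log(withoutM, withM);
```

Neither /g nor /m makes the period cross a line break; only the character class changes what is matched.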

I show two methods to get all the forms in an HTML document: the way I think is correct, and then how it could be done with regex (not as reliable, but significantly more efficient and less dangerous than using [\s\S]* for the content of a tag).

Method 1: Accessing the DOM (the correct way to do it)

With XmlService we can use the DOM (Document Object Model), which allows us to select all the <form> tags present in the document.

XmlService has the method .parse() to process a plain structured string as an XML document. However, it is not permissive at all: it only accepts valid, well-formed XML (something that almost no web page complies with). To get around this, we can use a trick. An older service, Xml, is much more forgiving and accepts plain HTML, converting it to valid XML. So, once the HTML has been processed by Xml, we can pass the result to XmlService. The bad news: Xml is deprecated and there's no telling how long it will be around. But for now it works.

This approach is safe and reliable, since any problem with the HTML would be detected at Xml.parse(), avoiding incorrect text output.

function doGet() {
  // Instead of fetching the html with UrlFetchApp.fetch(), this string is used as an example
  var html = '<html>'
           + '  <body>'
           + '    <p>Texto a borrar</p>'
           + '    <form>'
           + '      <input type="radio" name="sexo" value="masculino" checked="1"> Masculino<br>'
           + '      <input type="radio" name="sexo" value="femenino"> Femenino'
           + '    </form>'
           + '    <p>Esto no debe aparecer</p>'
           + '    <form>'
           + '      Segundo form <input type="button" value="Funciona">'
           + '    </form>'
           + '  </body>'
           + '</html>';
  
  // Build the document
  var doc = Xml.parse(html, true);        // Xml.parse is deprecated but still works, and is more lenient than XmlService
  var body = doc.html.body.toXmlString(); // trick so that XmlService works (otherwise it rejects HTML that is not valid XML)
  doc = XmlService.parse(body);
  var root = doc.getRootElement();
  var i, resultado = '';
  
  // Get all the forms
  var forms = getElementsByTagName(root, 'form');
  
  // Join them into a single string
  for(i in forms) resultado += XmlService.getRawFormat().format(forms[i]);
  
  // Send the result as the script's output
  return HtmlService.createHtmlOutput(resultado);
}

function getElementsByTagName(element, tagName) {
  // Source: https://sites.google.com/site/scriptsexamples/learn-by-example/parsing-html
  var data = [];
  var descendants = element.getDescendants();
  for (var i in descendants) {
    var elt = descendants[i].asElement(); // null when the node is not an element (e.g. text)
    if (elt != null && elt.getName() == tagName) data.push(elt);
  }
  return data;
}

The key to this is that, once the HTML is converted to XML, we have functions to navigate, add, or modify specific parts of each element. We use the method getDescendants() to get all the nodes, and getName() to check whether each one is a form.

Result:

<body><form> <input type="radio" name="sexo" value="masculino" checked="1"> Masculino<br> <input type="radio" name="sexo" value="femenino"> Femenino </form><form> Segundo form <input type="button" value="Funciona"> </form></body>

Method 2: Regex (may fail)

The following regex matches the text of a <form> from beginning to end.

/<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi

Demo on regex101.com

Description

<\s*form matches < followed by zero or more whitespace characters and then form.
\b matches at a word boundary (so it matches form but not formosa).
[\s\S]*? matches any character, including line breaks, zero or more times. The trailing ? makes the repetition lazy, that is, it repeats as few times as possible (this construct stands in for /.*?/s, which is not available here).
<\s*\/\s*form\b is the pattern for </form, which can have whitespace around the slash and must be a whole word.
[^>]* consumes any run of characters other than >.
> matches the end of the tag.
/gi sets the modes: global, so that all matches are returned, and case-insensitive.
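As a rough illustration of the parts described above, the following sketch runs the same regex over a made-up string (the input is hypothetical) that includes odd spacing, upper case, and a <formosa> tag that should be ignored.

```javascript
var formRegex = /<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi;

// Hypothetical input: a <formosa> tag, a form with unusual spacing and case, and a plain form.
var html = '<formosa>x</formosa>'
         + '< FORM id="a">\n one \n</ form >'
         + '<p>skip</p>'
         + '<form>two</form>';

// match() returns null when nothing is found, so fall back to an empty array.
var found = html.match(formRegex) || [];

console.log(found.length); // 2
console.log(found);
```

The \b keeps <formosa> out of the results, the \s* parts tolerate the extra spaces, and the i flag accepts FORM.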

Code

function doGet() {
  // Instead of fetching the html with UrlFetchApp.fetch(), this string is used as an example
  var html = '<html>'
           + '  <body>'
           + '    <p>Texto a borrar</p>'
           + '    <form>'
           + '      <input type="radio" name="sexo" value="masculino" checked="1"> Masculino<br>'
           + '      <input type="radio" name="sexo" value="femenino"> Femenino'
           + '    </form>'
           + '    <p>Esto no debe aparecer</p>'
           + '    <form>'
           + '      Segundo form <input type="button" value="Funciona">'
           + '    </form>'
           + '  </body>'
           + '</html>';
  
  var regex = /<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi;
  
  // Extract all the forms from the html and join them into a single string
  var forms = html.match(regex) || []; // match() returns null when there are no matches
  var resultado = forms.join('');
  return HtmlService.createHtmlOutput(resultado);
}

When could it fail?

There is a lot of information on the web about why you shouldn't use regex to process HTML. Very broadly, it fails on any structure that obscures the <form> tags or lets them appear without being evaluated as tags. Without going far, the regular expression would fail when a form contains a comment with a closing tag inside it:

<form>
  Form with comments
  <!-- Comment with "</form>" inside -->
</form>

And this can be easily fixed, but then we'll have another weird case that would make it fail and could be fixed, and then another, and another, and so on.
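The failure can be reproduced with a short sketch. The input below is a hypothetical variation of the commented form, with an extra <input> added so the truncation is visible.

```javascript
var formRegex = /<\s*form\b[\s\S]*?<\s*\/\s*form\b[^>]*>/gi;

// Hypothetical form whose comment contains a closing tag.
var html = '<form>\n'
         + '  <!-- a comment mentioning </form> -->\n'
         + '  <input name="q">\n'
         + '</form>';

var found = html.match(formRegex) || [];

// The match stops at the </form> inside the comment, so the input is lost.
console.log(found[0]);
console.log(found[0] === html); // false
```

A DOM parser would treat the comment as a single node and would not be misled by its contents.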

Mikel · Answer 2 · 2020-08-06T04:09:51+08:00

HTML does not meet the conditions to be a regular language, so using regular expressions to process it is not recommended.

With jQuery you can do it easily:

$('form').each(function(){
   var html = $( this ).html();
})

This selects all the forms on the page, iterates over them, and collects the inner HTML of each one. If you need a specific form, modify the selector. This is just one way; you can use other frameworks, but regular expressions are not recommended.

NOTE: The fact that HTML is not a regular language does not mean that no regular expression will work on it, but rather that regular expressions cannot be guaranteed to work in every case.

Rubén · Answer 3 · 2020-04-22T14:40:22+08:00

According to answer #1 in Issue 2098: Multiline regular expression modifier not working, there are problems that prevent the use of the m modifier; however, something like [\s\S] can be used instead.

The following code does successfully get the code between the form tags:

function doGet() {
  var html = UrlFetchApp.fetch('https://docs.google.com/forms/d/1awKpg_diniayS6360kNXrcgihk36azQ3DJEaZqXDY7A/viewform?embedded=true').getContentText();
  var output = html.match(/<form[\s\S]*form>/g) || []; // match() returns null when nothing is found
  return HtmlService.createHtmlOutput(output.join(''));
}
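One thing to keep in mind with the pattern above is that [\s\S]* is greedy. The sketch below uses a made-up page with two forms (not the page from the question) to compare it with a lazy variant, [\s\S]*?, which is an alternative added here for illustration.

```javascript
// Hypothetical page with two forms and a paragraph between them.
var page = '<form>one</form><p>not a form</p><form>two</form>';

// Greedy: runs from the first "<form" to the last "form>".
var greedy = page.match(/<form[\s\S]*form>/g) || [];

// Lazy: stops at the first "form>" after each "<form".
var lazy = page.match(/<form[\s\S]*?form>/g) || [];

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```

On a page with a single form, both variants return the same text.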
