I need to extract the text from an HTML string that may contain malicious code, so the method must not execute scripts, download external resources, etc.
Example HTML:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top: 0;margin-bottom: 0;}</style>
<script>alert('Cuidado script!')</script>
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
</head>
<body dir="ltr">
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Buenos días Señor X.</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Muchas gracias por el envió.</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Cordialmente</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Sr Y </div>
<div id="DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2"><br>
<table style="border-top: 1px solid #D3D4DE;">
<tbody>
<tr>
<td style="width: 55px; padding-top: 18px;">
<a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail" target="_blank"><img onload="alert('Cuidado imagen!')" onerror="alert('Cuidado error!')" alt="" width="46" height="29" style="width: 46px; height: 29px;" src="https://ipmcdn.avast.com/images/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif"></a>
</td>
<td style="width: 470px; padding-top: 17px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">
Libre de virus. <a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail" target="_blank" style="color: #4453ea;">
www.avast.com</a> </td>
</tr>
</tbody>
</table>
<a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"></a>
</div>
</body>
</html>
Expected result:
No script should run.
External resources (images, styles, etc.) should not be downloaded.
The result should be the text:
Buenos días Señor X. Muchas gracias por el envió. Cordialmente Sr Y Libre de virus. www.avast.com
One solution is to use DOMParser.
Example:
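A minimal sketch of the DOMParser approach, assuming a browser environment. The helper name `extractText` and its optional `parser` argument are hypothetical, added only so the function can be exercised outside a browser. `DOMParser.parseFromString` builds a separate document whose scripts are not executed. As far as I know, browsers also do not fetch images or stylesheets for a document that is never attached to the page, but that is worth verifying in the browsers you target.

```javascript
// Sketch: extract the text of an untrusted HTML string without rendering it.
// `extractText` and the `parser` parameter are illustrative names, not a standard API.
function extractText(html, parser = new DOMParser()) {
  // Parse into a detached document; it is never inserted into the live page.
  const doc = parser.parseFromString(html, 'text/html');

  // textContent includes the source of <script> and <style> elements,
  // so remove them before reading the text.
  doc.querySelectorAll('script, style').forEach((node) => node.remove());

  // Collapse runs of whitespace (newlines, indentation) into single spaces.
  return doc.body.textContent.replace(/\s+/g, ' ').trim();
}
```

In a browser, calling `extractText(htmlString)` on the example above should give the expected text on a single line. The result is plain text, but it is still untrusted input: assign it with `textContent`, not `innerHTML`, if it is shown on a page.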
Escaping every possible malicious script is quite complicated. The OWASP filter evasion cheat sheet lists many known XSS vectors: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet but the list is not exhaustive, because new ones keep being added.
The malicious code is not necessarily JavaScript: it could also be code meant to be executed by the server-side language.
Ideally you would not do this at all, but if you do, you have to take every possibility into account. See the OWASP article on preventing XSS attacks: https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html
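If the text is later written back into HTML markup, it needs to be encoded for that output context. A minimal sketch for HTML element content only, assuming a hypothetical helper named `escapeHtml`; it does not cover attribute, URL, CSS, or JavaScript contexts, which need their own encoding.

```javascript
// Sketch: encode the characters that are significant in HTML element content.
// `escapeHtml` is an illustrative name, not a built-in function.
function escapeHtml(text) {
  const entities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#x27;',
  };
  // Replace each special character with its entity in a single pass.
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}
```

For example, `escapeHtml('<img onerror="alert(1)">')` returns `&lt;img onerror=&quot;alert(1)&quot;&gt;`, which a browser displays as text instead of parsing as a tag.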