I need to get the text inside a string HTML
which may contain malicious code, so I need the method not to execute scripts
, download external resources, etc.
Example of HTML
:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top: 0;margin-bottom: 0;}</style>
<script>alert('Cuidado script!')</script>
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
</head>
<body dir="ltr">
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Buenos días Señor X.</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Muchas gracias por el envió.</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Cordialmente</div>
<div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
Sr Y </div>
<div id="DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2"><br>
<table style="border-top: 1px solid #D3D4DE;">
<tbody>
<tr>
<td style="width: 55px; padding-top: 18px;">
<a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail" target="_blank"><img onload="alert('Cuidado imagen!')" onerror="alert('Cuidado error!')" alt="" width="46" height="29" style="width: 46px; height: 29px;" src="https://ipmcdn.avast.com/images/icons/icon-envelope-tick-round-orange-animated-no-repeat-v1.gif"></a>
</td>
<td style="width: 470px; padding-top: 17px; color: #41424e; font-size: 13px; font-family: Arial, Helvetica, sans-serif; line-height: 18px;">
Libre de virus. <a href="https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail" target="_blank" style="color: #4453ea;">
www.avast.com</a> </td>
</tr>
</tbody>
</table>
<a href="#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2" width="1" height="1"></a>
</div>
</body>
</html>
Expected result:
should not run any
script
External resources ( images, styles, etc ) should not be downloaded
The result should be the text:
Buenos días Señor X. Muchas gracias por el envió. Cordialmente Sr Y Libre de virus. www.avast.com