
Question

Miquel Coll

Asked: 2020-07-01 06:57:03 +0800 CST

Convert docx document to html


I have a docx document already stored as a byte[] and I need to convert it to HTML so I can display it on a page.

I am using Visual Studio with .NET to develop it in C#.

Currently this works for me with PDF, which is easy to transform to HTML, but that is not the case with docx or any other Microsoft format, because I cannot use the native interop library: there is no guarantee that the server has it installed.

The end result is:

strFinalDoc = strFinalDoc.Replace("<body>", "<body>" + documentInfoHtml + "<BR /><BR />");

Here documentInfoHtml is the result of transforming the byte[] to HTML, and strFinalDoc is simply the page content whose body receives it.

I have found some solutions, but practically all of them use either interop or paid libraries.

Do you know any way to do it with free software or open projects?

I also have to do the same for xls and xlsx files.

The current answer is very good, but it only covers doc files and not docx.

It is also important to keep the existing styles as much as possible, so answers that simply extract the content so that I can generate the HTML myself are not enough, because all the formatting would be lost.

3 Answers


jasilva · Answer 1 · 2020-07-01T10:01:22+08:00

With Apache POI this is relatively easy to do, so we can use NPOI, its .NET port, to do the transformation in C#.

Based on this answer from Convert Word to HTML with Apache POI

Java version:

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(
        new FileInputStream("D:\\temp\\seo\\1.doc"));

WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);

TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();

// Decode with the same charset the serializer was told to write
String result = out.toString("UTF-8");
System.out.println(result);

Let's convert this to C#:

HWPFDocumentCore wordDocument = WordToHtmlUtils.LoadDoc(@"D:\Hola.doc"); 

WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
    new XmlDocument());

wordToHtmlConverter.ProcessDocument(wordDocument);

XmlDocument htmlDocument = wordToHtmlConverter.Document;

htmlDocument.Save(@"D:\Hola.html");

I recommend that you do not install NPOI through NuGet (current version 2.2.1) and instead use version 2.1.3.1 from the official page, since two more files are needed that do not come in the NuGet package: NPOI.ScratchPad.HSSF.dll and NPOI.ScratchPad.HWPF.dll. Both are compiled against .NET Framework 2.x, so the other libraries need to be the 2.x builds too. These two files can be downloaded from the NPOI GitHub repository.

In my tests, the NPOI version seems to have a bug in the resulting HTML. To reproduce the formatting, the converter creates style rules named with the first letter of the tag type and an incremental number:

<!-- POI (Java) example -->
span.s1{color:red;}
...
<span class="s1">Hola</span>

but for some reason the .NET version does not write the class attribute on the elements, so the rules never apply:

<!-- NPOI (C#) example -->
span.s1{color:red;}
...
<span>Hola</span>

Maybe it has to do with the Transformer, but I don't know what the equivalent would be in C#.

By keeping a manual count of each element type you can add the missing class attributes yourself, which may be enough to make the output look right:

    ....
    XmlNode node = htmlDocument.FirstChild.LastChild; // locate the body element
    EditNode(node); // recursive edit method
    htmlDocument.Save(@"D:\tmp18\Hola.html");
}

// Running count per element name, e.g. "span" -> 2
private readonly Dictionary<string, int> cuenta = new Dictionary<string, int>();

private void EditNode(XmlNode node) {
    // Only elements can carry a class attribute; text nodes and comments are skipped
    XmlElement xe = node as XmlElement;
    if (xe == null) { return; }

    int n;
    cuenta.TryGetValue(xe.LocalName, out n); // n is 0 the first time a name is seen
    n += 1;
    cuenta[xe.LocalName] = n;
    // Mimic the generated rule names: first letter of the tag plus the counter (s1, p1, ...)
    xe.SetAttribute("class", xe.LocalName.Substring(0, 1) + n);

    foreach (XmlNode x in node.ChildNodes) {
        EditNode(x);
    }
}

Alfonso Carrasco · Answer 2 · 2020-07-01T10:29:32+08:00

Well, a Word document is made up of XML, so why not start from there and simply convert that XML to HTML? The MSDN page shows the structure of a Word document in XML. Here is a sample:

 <?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
  <CoreProperties xmlns="http://schemas.microsoft.com/package/2005/06/md/core-properties"> 
   <Title>Word Document Sample</Title> 
   <Subject>Microsoft Office Word 2007</Subject> 
   <Creator>2007 Microsoft Office System User</Creator> 
   <Keywords/> 
   <Description>2007 Microsoft Office system .docx file</Description> 
   <LastModifiedBy>2007 Microsoft Office System User</LastModifiedBy> 
   <Revision>2</Revision> 
   <DateCreated>2005-05-05T20:01:00Z</DateCreated> 
   <DateModified>2005-05-05T20:02:00Z</DateModified> 
  </CoreProperties>

MSDN also gives an example of how to use the XmlDocument class:

 using System;
 using System.IO;
 using System.Xml;

 public class Sample
 {
   public static void Main()
   {
     //Create the XmlDocument.
     XmlDocument doc = new XmlDocument();
     doc.LoadXml("<?xml version='1.0' ?>" +
            "<book genre='novel' ISBN='1-861001-57-5'>" +
            "<title>Pride And Prejudice</title>" +
            "</book>");

     //Display the document element.
     Console.WriteLine(doc.DocumentElement.OuterXml);
  }
 }

Now, to access the nodes you can do it like this:

  public XmlNode GetBook(string uniqueAttribute, XmlDocument doc)
  {
      XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
      nsmgr.AddNamespace("bk", "http://www.contoso.com/books");
      string xPathString = "//bk:books/bk:book[@ISBN='" + uniqueAttribute + "']";
      XmlNode xmlNode = doc.DocumentElement.SelectSingleNode(xPathString, nsmgr);
      return xmlNode;
  }

From there you can concatenate the contents of the nodes into your HTML. The code samples come from the MSDN documentation for the XmlDocument class.
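As a rough illustration of that idea, the sketch below walks the paragraphs of a WordprocessingML document and concatenates their text into HTML. The class and method names (DocumentXmlToHtml, ParagraphsToHtml) are made up for this example, and it assumes you already have the main document XML as a string. It only extracts text and ignores runs, styles, tables and images, so on its own it does not preserve formatting.

```csharp
using System.Net;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class DocumentXmlToHtml
{
    // Namespace of the WordprocessingML main schema (the "w:" prefix).
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ParagraphsToHtml(string documentXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(documentXml);

        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", W);

        var html = new StringBuilder();
        // Each w:p is a paragraph; its visible text lives in w:t elements.
        foreach (XmlNode p in doc.SelectNodes("//w:p", ns))
        {
            html.Append("<p>");
            foreach (XmlNode t in p.SelectNodes(".//w:t", ns))
            {
                // Encode so that characters like < and & stay valid HTML.
                html.Append(WebUtility.HtmlEncode(t.InnerText));
            }
            html.Append("</p>");
        }
        return html.ToString();
    }
}
```

Keeping the formatting would mean also mapping run properties (w:rPr) and paragraph styles to CSS, which is a much larger job than this sketch suggests.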

Emilio Platzer · Answer 3 · 2020-07-10T05:59:58+08:00

Conversion

As you have already realized, a docx is nothing more than zipped XML, so it can be converted to HTML.
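To make the "zipped XML" point concrete, here is a minimal sketch that opens the byte[] as a zip archive and reads the main document part. It assumes .NET Framework 4.5 or later (for System.IO.Compression.ZipArchive) and that the main part is at the conventional path word/document.xml, which is where Word normally writes it. The class name DocxPeek is made up for this example.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of any library.
public static class DocxPeek
{
    // Returns the raw XML of the main document part,
    // or null if the archive has no word/document.xml entry.
    public static string ReadMainDocumentXml(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            // Assumption: conventional location of the main part.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null) { return null; }

            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

The returned string can be loaded into an XmlDocument. Styles live in a separate part (typically word/styles.xml), so a conversion that keeps formatting needs to read that one as well.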

Sending to the client

To send the result to the client (and make sure the browser does not show it as plain text) you have to remember to send the headers first:

Content-Type: text/html; charset=utf-8
Content-Length: 12345

Set charset to the encoding you actually use, and Content-Length to the real size of the body, in bytes and not in characters (remember that a character encoded in UTF-8 can take more than one byte). The length tells the browser how many bytes to expect, so it can show progress when the document is long.
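The difference between bytes and characters is easy to check. This small sketch (HtmlLength and ContentLengthFor are made-up names) computes the value that would go in Content-Length for an HTML string sent as UTF-8:

```csharp
using System.Text;

// Hypothetical helper, not part of any library.
public static class HtmlLength
{
    // Number of bytes the string occupies once encoded as UTF-8,
    // which is what Content-Length must report.
    public static int ContentLengthFor(string html)
    {
        return Encoding.UTF8.GetByteCount(html);
    }
}
```

For example, "<p>\u00F1</p>" (a paragraph containing ñ) has 8 characters, but ContentLengthFor returns 9 because ñ takes two bytes in UTF-8.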

Compression

Once it works without compression, you could consider using a middleware or a module to send the response compressed (with gzip, for example).
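If no middleware or server module is available, compressing by hand is a few lines. This is only a sketch with a made-up class name (HtmlGzip); in most setups the web server's own compression setting is the simpler choice.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of any library.
public static class HtmlGzip
{
    // Encodes the HTML as UTF-8 and returns the gzip-compressed bytes.
    public static byte[] Compress(string html)
    {
        byte[] raw = Encoding.UTF8.GetBytes(html);
        using (var output = new MemoryStream())
        {
            // Disposing the GZipStream flushes the final gzip block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the stream has been closed.
            return output.ToArray();
        }
    }
}
```

When sending these bytes, the response also needs a Content-Encoding: gzip header, and Content-Length has to be the size of the compressed array rather than of the original HTML.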
