I have a docx document already stored as a bytes[], and I need to convert it to html so I can display it on a page.
I am using Visual Studio with .NET to develop it in C#.
Currently it works for me with pdf, which is easy to transform to html, but this is not the case with docx. I cannot rely on any Microsoft product or on the native interop library, since there is no guarantee that the server has it installed.
The end result is:
strFinalDoc = strFinalDoc.Replace("<body>", "<body>" + documentInfoHtml + "<BR /><BR />");
Where documentInfoHtml is the result of transforming the bytes[] to html, and strFinalDoc is simply the content that replaces the body of a page.
I have found solutions, but practically all of them use either interop or paid libraries.
Do you know any way to do it with free software or open projects?
I also have to do the same process for xls and xlsx files.
The current answer is very good, but it only covers doc files and not docx.
It is also important to keep the existing styles as much as possible, so answers that simply extract the content and leave me to generate the HTML myself are not enough, since that would lose all the formatting.
With Apache POI this is relatively easy to do, so we can use NPOI, its .NET port, to do the transformation in C#.
This is based on an answer to "Convert Word to HTML with Apache POI".
Let's convert it to C#.
I recommend that you do not install NPOI through NuGet (current version 2.2.1) and instead use version 2.1.3.1 from the official page, since two more files are needed that do not come in the NuGet package: NPOI.ScratchPad.HSSF.dll and NPOI.ScratchPad.HWPF.dll. Both are compiled against .NET Framework 2.x, so the other libraries need to be the 2.x versions too. These two files can be downloaded from the NPOI GitHub repository.
From my tests, it seems that this version of NPOI has a bug in the resulting HTML: to reproduce the formatting it creates each style name from the first letter of the tag type plus an incremental number, but for some reason the .NET version does not render them correctly. Maybe it has to do with Transformer, but I do not know what the equivalent would be in C#.
By doing a manual count, you may no longer need to make the output look good.
Well, a Word document is made up of XML, so why not start from there and just convert your XML to HTML? The MSDN page shows the structure of a Word document in XML, and MSDN also gives an example of how to use the XmlDocument class, which you can then use to access the nodes.
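As a rough illustration of reading the nodes with XmlDocument, here is a minimal sketch. The class and method names (WordXmlToHtml, ParagraphsToHtml) are made up for this example, and it assumes you already have the contents of the main document part as a string. It only emits the text of each paragraph, so styles, tables and images are lost; preserving formatting would need much more work on top of this.

```csharp
using System.Net;
using System.Text;
using System.Xml;

public static class WordXmlToHtml
{
    // Main WordprocessingML namespace used by the document XML.
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: turns each paragraph (w:p) into an HTML <p>
    // containing the concatenated text of its w:t nodes. No formatting is kept.
    public static string ParagraphsToHtml(string documentXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(documentXml);

        // XPath needs a prefix-to-namespace mapping to find w:* elements.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", W);

        var html = new StringBuilder();
        foreach (XmlNode paragraph in doc.SelectNodes("//w:p", ns))
        {
            var text = new StringBuilder();
            foreach (XmlNode t in paragraph.SelectNodes(".//w:t", ns))
                text.Append(t.InnerText);

            // Encode the text so characters like < and & stay valid HTML.
            html.Append("<p>")
                .Append(WebUtility.HtmlEncode(text.ToString()))
                .Append("</p>");
        }
        return html.ToString();
    }
}
```

The returned string could then be used as documentInfoHtml in the Replace call from the question.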
At that point you concatenate everything you read into your HTML. I took the code from the MSDN XmlDocument Class page.
Conversion
As you have already realized, the docx is nothing more than a zipped xml, and therefore easily convertible to HTML.
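To get at that XML from the bytes[] without interop, something along these lines might work. It is only a sketch: DocxReader and ReadMainDocumentXml are names invented here, it assumes .NET Framework 4.5 or later (for System.IO.Compression.ZipArchive), and it assumes the main part sits at the conventional path word/document.xml.

```csharp
using System.IO;
using System.IO.Compression;

public static class DocxReader
{
    // Hypothetical helper: opens the docx bytes as a ZIP archive and
    // returns the raw XML of the main document part.
    public static string ReadMainDocumentXml(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            // Assumption: the main part uses the conventional entry name.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException(
                    "word/document.xml not found; is this a docx file?");

            using (var reader = new StreamReader(entry.Open()))
                return reader.ReadToEnd();
        }
    }
}
```

Styles live in other entries of the same archive (for example word/styles.xml), so keeping the formatting means reading and mapping those as well.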
Sending to the client
To send the information to the client (and make sure the browser does not show it as plain text), remember to send the headers first, in particular the content type with its charset and the content length.
The charset must match the encoding you actually use, and the content length must be given in bytes, not in chars (remember that a UTF-8 char can take more than one byte). The length lets the browser know how many bytes to expect, so it can show a progress bar when the document is long.
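The bytes-versus-chars point can be sketched like this, assuming UTF-8 and an HTML body. HtmlResponse and BuildHeaders are hypothetical names; in a real application you would set these values on your framework's response object instead of a dictionary.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class HtmlResponse
{
    // Hypothetical helper: encodes the HTML as UTF-8 and returns the
    // headers that describe exactly those bytes.
    public static Dictionary<string, string> BuildHeaders(string html, out byte[] body)
    {
        // Encode first: Content-Length must count bytes, not chars.
        body = Encoding.UTF8.GetBytes(html);

        return new Dictionary<string, string>
        {
            // The charset must match the encoding used above.
            { "Content-Type", "text/html; charset=utf-8" },
            { "Content-Length", body.Length.ToString(CultureInfo.InvariantCulture) }
        };
    }
}
```

For a string such as "<p>ñ</p>", html.Length is 8 chars but the UTF-8 body is 9 bytes, so using the string length would declare one byte too few.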
Compression
Once it works without compression, you could consider using a middleware or a module to send the information compressed (gzip, for example).
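If you end up compressing by hand rather than through a server module, a minimal sketch with GZipStream could look like the following. GzipHelper is a hypothetical name, and in most setups the web server or framework can do this for you.

```csharp
using System.IO;
using System.IO.Compression;

public static class GzipHelper
{
    // Hypothetical helper: returns the gzip-compressed form of the body.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // leaveOpen: true so the MemoryStream survives disposing the
            // GZipStream, which is what flushes the gzip trailer.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
                gzip.Write(data, 0, data.Length);

            return output.ToArray();
        }
    }
}
```

When sending the compressed bytes, the response also needs the header Content-Encoding: gzip, and the content length must then be the length of the compressed body.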