Everyone, I hope you understand that this is a special situation. This question does not follow the site's rules, but right now the world is a mess and I really need someone to give me a hand with this. Downvote me if you want, but please don't close the question.
At my work we are manually uploading some files referring to the complaints reported by people regarding COVID-19. The issue is that this virus is growing exponentially, which is why our work is also growing, and we are behind in statistics, which is partly why the number of known cases is lower than it actually is. Without further ado, I'll show you what I have:
We receive a .doc file. (I asked in the chat before posting the question here, and they asked me which version of Word we use. We actually work with different versions: at home I have the latest one, because I am a student and Microsoft gives it away and keeps it updated, but at work I have 2010. I think the version may not matter, since I only want to work with the text; I could even copy it into Notepad, for example.)
This .doc file has about 100 texts that have more or less this form:
Illicit Investigation – Illicit Port (COVID-19 Coronavirus) – Senior Female 03/19 – PU. 81027 – High 07:06 am – CP Campana; 05:00 a.m. She received a call to XXXX (Dda. XXXX 1000), realizing that her neighbor XXX hers, returned from Brazil the day before, not complying with the isolation protocol. Personal proceeded to provide the corresponding telephone numbers for such cases, giving notice to Health personnel.
Each of these texts must be entered into a database table that has about 20 fields. I would like to at least be able to extract three values: the date (03/19 in the example), the Urgent Part number, which always comes after PU. (81027), and the time, which always comes after the word High (07:06). There may be other times in the paragraph, but the one I'm looking for is the first that appears, so either criterion works: the first occurrence, or the one after "High".
If you don't know how to do it but you know of tools that can, that would also be a great help to me.
The thing is that we are entering them by hand, and this virus is growing by leaps and bounds, and the complaints with it, so we are not managing to keep up with uploading them.
The idea is that each new text (which could be identified by the appearance of the string PU.) creates a new row (in an Excel or Access table, it doesn't matter) with those 3 fields filled in.
I hope you can understand an exceptional question at an exceptional time.
One approach, using JavaScript (NodeJS), is to create a script that parses the Word document (.docx) and writes the result to an Excel workbook (.xlsx). I implemented this code very quickly to try to provide a solution to the problem, since I understand it is an emergency.
You need:
I don't do case sensitive checks, this is a first approximation to a possible solution.
Using the word-text-parser module, we get the paragraphs of the Word document. We filter the paragraphs to keep only those that contain both strings: PU. and High.
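As a minimal sketch of that filtering step, assuming the paragraphs are already available as plain strings: the `isReportParagraph` helper and the sample array are made up for illustration and are not taken from the repository. The match is exact and case-sensitive.

```javascript
// Markers that every report paragraph is expected to contain.
const REQUIRED_MARKERS = ['PU.', 'High'];

// Hypothetical helper: true only when the text contains every marker.
function isReportParagraph(text) {
  return REQUIRED_MARKERS.every((marker) => text.includes(marker));
}

// Illustrative input; in the real script these come from the Word document.
const paragraphs = [
  'Senior Female 03/19 – PU. 81027 – High 07:06 am – CP Campana',
  'Daily summary of calls received',
  'PU. 81030 pending review',
];

const reports = paragraphs.filter(isReportParagraph);
console.log(reports.length); // 1
```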
We also create a regular expression to get the date, which we assume is always in the format XX/XX, where each X is a digit, so XX ranges from 00 to 99. (The date is not validated, so 03/12 is accepted just as 12/03 is.)
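The date pattern described above could look roughly like this. It is a sketch, not necessarily the expression used in the repository, and `DATE_PATTERN` and `extractDate` are illustrative names. As stated, it only checks the shape of the value, not whether it is a real calendar date.

```javascript
// Two digits, a slash, two digits (e.g. "03/19").
// The word boundaries avoid matching inside longer runs of digits.
const DATE_PATTERN = /\b\d{2}\/\d{2}\b/;

// Returns the first XX/XX occurrence in the paragraph, or null if none.
function extractDate(paragraph) {
  const match = paragraph.match(DATE_PATTERN);
  return match ? match[0] : null;
}

const sample = 'Senior Female 03/19 – PU. 81027 – High 07:06 am';
console.log(extractDate(sample)); // "03/19"
```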
Then we separate each paragraph into words, and we search for the words we want according to the aforementioned criteria.
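One way to sketch the word-based search is shown below. It assumes the PU number and the time are always the tokens immediately after `PU.` and `High`; `wordAfter` and `extractFields` are hypothetical names, and the comparison is exact and case-sensitive.

```javascript
// Returns the token right after the first occurrence of `marker`, or null.
function wordAfter(words, marker) {
  const index = words.indexOf(marker);
  return index !== -1 && index + 1 < words.length ? words[index + 1] : null;
}

// Splits the paragraph on whitespace and picks the two values of interest.
function extractFields(paragraph) {
  const words = paragraph.split(/\s+/).filter(Boolean);
  return {
    pu: wordAfter(words, 'PU.'),
    time: wordAfter(words, 'High'),
  };
}

const text =
  'Senior Female 03/19 – PU. 81027 – High 07:06 am – CP Campana; 05:00 a.m.';
console.log(extractFields(text)); // { pu: '81027', time: '07:06' }
```

Because `indexOf` returns the first match, later times in the paragraph (such as 05:00 in the sample) are ignored.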
Each value is placed in an object with the following structure:
A list of these values is saved and an Excel workbook is generated using the ExcelJS library.
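Writing the list to a workbook with ExcelJS might be sketched as follows. The column names, the `date`/`pu`/`time` record keys, and the `writeWorkbook` function are assumptions made for illustration; the structure used in the repository may differ. The `exceljs` package has to be installed with npm before calling the function.

```javascript
// Column layout for the sheet; each key must match a property of the records.
const COLUMNS = [
  { header: 'Date', key: 'date', width: 10 },
  { header: 'PU', key: 'pu', width: 12 },
  { header: 'Time', key: 'time', width: 10 },
];

// Writes one row per record to a new .xlsx file at `outputPath`.
async function writeWorkbook(records, outputPath) {
  // Required here so the dependency is only loaded when a file is written.
  const ExcelJS = require('exceljs');
  const workbook = new ExcelJS.Workbook();
  const sheet = workbook.addWorksheet('Reports');
  sheet.columns = COLUMNS;
  records.forEach((record) => sheet.addRow(record));
  await workbook.xlsx.writeFile(outputPath);
}
```

It could then be called as `writeWorkbook([{ date: '03/19', pu: '81027', time: '07:06' }], 'reports.xlsx')`, where the file name is just an example.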
The final code may look like this:
I have a repository with this implementation on GitHub. It includes one test Word file (demo.docx) with arbitrary, unstructured content. I hope you can make progress with this. If you have any questions, just ask.
With LibreOffice you can convert a .doc to .txt by command line:
From there you can get what you need with regular expressions, either from the same command line or with another programming language, by first dividing the text on the separator " - ". The data you need could even be extracted from the database, depending on the version.
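As a rough sketch of the split-then-match idea in JavaScript, assuming each entry has already been read from the .txt file as a single string: `parseEntry` and `SEPARATOR` are illustrative names. The sample text in the question uses an en dash rather than a plain hyphen, so the separator pattern accepts both.

```javascript
// A hyphen or en dash with whitespace on both sides separates the fields.
// "COVID-19" is left intact because its hyphen has no surrounding spaces.
const SEPARATOR = /\s+[-–]\s+/;

// Splits one entry into its fields and extracts date, PU number and time.
function parseEntry(entry) {
  const parts = entry.split(SEPARATOR).map((part) => part.trim());
  // Returns the first capture group of the first part that matches.
  const find = (pattern) => {
    for (const part of parts) {
      const match = part.match(pattern);
      if (match) return match[1];
    }
    return null;
  };
  return {
    date: find(/\b(\d{2}\/\d{2})\b/),
    pu: find(/^PU\.\s*(\d+)/),
    time: find(/^High\s+(\d{1,2}:\d{2})/),
  };
}

const entry =
  'Illicit Port (COVID-19 Coronavirus) – Senior Female 03/19 – PU. 81027 – High 07:06 am – CP Campana; 05:00 a.m.';
console.log(parseEntry(entry)); // { date: '03/19', pu: '81027', time: '07:06' }
```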