I have a .docx file of articles published in marketing magazines that contain scales, and I would like to extract them.
For example, from the following scale I would like to obtain the title ("Ten-Item and Five-Item Personality Inventories"), the questions ("I see myself as:"), and the answers.
{
"title":"Ten-Item and Five-Item Personality Inventories",
"scales":{
"I see myself as":{
"answer0":"1. Extraverted, enthusiastic",
"answer1":"2. Critical, quarrelsome",
"answer2":"3. Dependable, self-disciplined",
...
},
"I see myself as":{
"answer0":"Extraverted, enthusiastic (that is, sociable ... ",
"answer1":"2. Agreeable, kind ...",
...
}
}
}
or something similar if duplicate scale names are a problem.
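On the duplicate-name concern, here is a minimal sketch of why repeated keys are a problem and of one possible alternative layout. The list-based structure and the field names "question" and "answers" are assumptions made for illustration, not a required format.

```python
import json

# The same scale name used twice as an object key.
raw = (
    '{"I see myself as": {"answer0": "first"},'
    ' "I see myself as": {"answer0": "second"}}'
)

# Python's json module keeps only the last value for a repeated name,
# so the first scale is silently lost.
parsed = json.loads(raw)
print(len(parsed))

# Hypothetical alternative: store the scales in a list, so repeated
# names cannot collide. The field names are illustrative assumptions.
document = {
    "title": "Ten-Item and Five-Item Personality Inventories",
    "scales": [
        {
            "question": "I see myself as",
            "answers": ["Extraverted, enthusiastic", "Critical, quarrelsome"],
        },
        {
            "question": "I see myself as",
            "answers": ["Extraverted, enthusiastic", "Agreeable, kind"],
        },
    ],
}

# Both scales survive a round trip through JSON.
round_tripped = json.loads(json.dumps(document))
print(len(round_tripped["scales"]))
```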
So far I can only extract the pieces of content separately, using the following code:
!pip install tika
!pip install python-docx

import pandas as pd
from tika import parser
from docx import Document

# If you use Colab, the file must be in a "Books" folder on your Drive.
document = Document('/content/drive/My Drive/Books/handbook-of-marketing-scale-2011.docx')

dict_open = False
qas = {}  # all questionnaires: question -> {number: answer}
qa = {}   # answers of the questionnaire currently being read
last_line_is_bold_or_italic = False
last_line_is_digit = False

for para in document.paragraphs:
    for run in para.runs:
        try:
            # We have a question and nothing before it: beginning of a questionnaire.
            # The first line must be bold or italic; the following ones must start with a number.
            # Note: `and` binds tighter than `or`, so this condition reads as
            # run.bold or (run.italic and not last_line_is_digit).
            if run.bold or run.italic and last_line_is_digit == False:
                last_line_is_bold_or_italic = True
                question = run.text
            # We have an answer, and the last line was a question (last_line_is_bold_or_italic)
            # or an answer (last_line_is_digit): we are inserting an answer into a questionnaire.
            elif run.text[0].isdigit() and (last_line_is_bold_or_italic or last_line_is_digit):
                last_line_is_bold_or_italic = False
                last_line_is_digit = True
                number = run.text.split(".")[0]
                answer = run.text.split(".")[-1]
                qa[number] = answer
            # We have a question and the last line was an answer: we close the preceding
            # question and insert the questionnaire into the dictionary of all questionnaires.
            # Because of the precedence noted above, bold runs never reach this branch.
            elif run.bold or run.italic and last_line_is_digit:
                last_line_is_digit = False
                # close dict
                qas[question] = qa
                qa = {}
                dict_open = False
            # We might have something bold, italic, ... but who knows what it is?
            # We reset.
            else:
                last_line_is_bold_or_italic = False
                last_line_is_digit = False
                question = None
        except IndexError:
            # run.text was empty, so run.text[0] failed: reset the state.
            last_line_is_bold_or_italic = False
            last_line_is_digit = False
            question = None
            if dict_open:
                qa = {}
                dict_open = False
But I get:
{' ': {'11, 186–93': '',
'2002': ' Professor Bearden served as the University SEC Faculty Athletics Representative from 2006 to 2010 and received the first Distinguished Service Award from the ',
'23, 387–93': ''},
' slight agreement;': {'0, ': '0, ',
'26 (June), 85–98': '',
'34 (June), 456–67': ''},
' somewhat characteristic, ': {'1982 by the American Psychological Association': '',
'3': '3',
'42 (1), 116–31': ''},
' somewhat true, but with exceptions; ': {'4': '4'},
' uncertain, ': {'4': '4'},
'.': {'05), respectively': '39 ',
'28) was reportedfor the scale(Houseand Rizzo 1972,p': '',
'7, 467–505': ''},
'Academy of Marketing Science, ': {'1': ' In general, do you talk to your friends and neighbors about cable television:',
'24 (2)': '24 (2)',
'31, 55–64': '',
'98\tHANDBOOK OF MARKETING SCALES': '98\tHANDBOOK OF MARKETING SCALES'},
'Analytic/Holistic Thinking Scale: AHS': {'01)': '',
'12, 341–52': '',
'12, 663–82': '',
'13, 121–37': '',
'1995 by John Wiley & Sons, Inc': '',
'1996 by Elsevier Science': '',
'22, 41–53': '',
'24–38': '',
'288\tHANDBOOK OF MARKETING SCALES': '288\tHANDBOOK OF MARKETING SCALES',
'51, 407–15': ''},
'Assertiveness and Aggressiveness': {'432\tHANDBOOK OF MARKETING SCALES': '432\tHANDBOOK OF MARKETING SCALES'},
'Attention to Social Comparison Information: ATSCI': {'128\tHANDBOOK OF MARKETING SCALES': '128\tHANDBOOK OF MARKETING SCALES',
'224 sample': '',
'30, 526–37': '',
'4 ': '4 ',
'50; and consumer behavior measures) show strong support for the validity of the ATSCI': '',
'76, 461–71': ''},
'Attitude Toward Private Label Products Scale': {'01) with measures of experiential shopping motives, compulsive buying, pleasure, and arousal, respec-tively': '',
'3 (3), 239–49': '',
'370\tHANDBOOK OF MARKETING SCALES': '370\tHANDBOOK OF MARKETING SCALES',
'40 (August), 310–20': '',
'5-point Likert-type scales': '',
'7-point scales with endpoints indicated above': '',
'79 (Summer), 77–95': '',
'93 (college students)': ''},
'Behavioral Identification Form: BIF': {'110 (3), 403–21': '',
'290\tHANDBOOK OF MARKETING SCALES': '290\tHANDBOOK OF MARKETING SCALES',
'57 (4), 660–71': ''},
'Business Research, ': {'31': '31'},
'Consumer Attitudes Toward Marketplace Globalization': {'27, 37–65': '',
'29, 83–100': '',
'392\tHANDBOOK OF MARKETING SCALES': '392\tHANDBOOK OF MARKETING SCALES',
'6': '6'},
'Consumer Involvement Profiles: CIP': {'18, 392–401': '',
'1991 by University of Chicago Press': '',
'230': '230',
'234': '234',
'240\tHANDBOOK OF MARKETING SCALES': '240\tHANDBOOK OF MARKETING SCALES',
'30, 3–23': '',
'31 (1), 3–23': '',
'5 ': '5 ',
'51 (July), 5–15': ''},
'Consumer’s Need for Uniqueness: CNFU': {'44\tHANDBOOK OF MARKETING SCALES': '44\tHANDBOOK OF MARKETING SCALES',
'58 subjects': '',
'86 (October), 518–27': ''},
'Customer-Based Reputation of a Service Firm: CBR Scale': {'23 (September), 227–39': '',
'35, 127–43': '',
'396\tHANDBOOK OF MARKETING SCALES': '396\tHANDBOOK OF MARKETING SCALES',
'62, 924–30': ''},
'Electronic Service Quality: E-S-QUAL': {'412\tHANDBOOK OF MARKETING SCALES': '412\tHANDBOOK OF MARKETING SCALES',
'7 (February), 213–33': ''},
'Emotions: Dimensions of Emotions: PAD': {'1997 by the University of Chicago': '',
'24, 127–46': '',
'310\tHANDBOOK OF MARKETING SCALES': '310\tHANDBOOK OF MARKETING SCALES',
'91 (4), 780–95': ''},
'Ethics: Improving Evaluations of Business Ethics': {'448\tHANDBOOK OF MARKETING SCALES': '448\tHANDBOOK OF MARKETING SCALES'},
'Ethnocentrism: Consumer Ethnocentrism: CETSCALE': {'25 (1), 26–37': '',
'28, 320–27': '',
'92\tHANDBOOK OF MARKETING SCALES': '92\tHANDBOOK OF MARKETING SCALES'},
'Gender Dimensions of Brand Personality': {'1997 by the American Marketing Association': '',
'209 university students': '',
'3': ' Overall quality of the original brand (1 ',
'34 (August), 347–56': '',
'34, 347–56': '',
'342\tHANDBOOK OF MARKETING SCALES': '342\tHANDBOOK OF MARKETING SCALES',
'346\tHANDBOOK OF MARKETING SCALES': '346\tHANDBOOK OF MARKETING SCALES',
'4': ' Perceived difficulty in designing and making the extension (1 ',
'46 (January), 105–19': ''},
'General Self-Control': {'1, 2, 4, 8, and 11 compose the “hedonic” subscale': '',
'78\tHANDBOOK OF MARKETING SCALES': '78\tHANDBOOK OF MARKETING SCALES'},
'Horizontal and Vertical Individualism and Collectivism': {'51 (April), 407–15': '',
'54\tHANDBOOK OF MARKETING SCALES': '54\tHANDBOOK OF MARKETING SCALES',
'74 (1), 118–28': ''},
'Innovativeness: Use Innovativeness': {'05) to “new product trial': '”',
'116\tHANDBOOK OF MARKETING SCALES': '116\tHANDBOOK OF MARKETING SCALES',
'118\tHANDBOOK OF MARKETING SCALES': '118\tHANDBOOK OF MARKETING SCALES',
'1995 by Lawrence Erlbaum Associates, Inc': '',
'4 (4), 329–45': ''},
'Job Characteristic Inventory: JCI': {'20 (March), 31–44': '',
'38 (May), 269–77': '',
'4 ': '4 ',
'456\tHANDBOOK OF MARKETING SCALES': '456\tHANDBOOK OF MARKETING SCALES',
'480\tHANDBOOK OF MARKETING SCALES': '480\tHANDBOOK OF MARKETING SCALES',
'84)': '',
'9, 639–53': ''},
'Journal of Consumer Research,': {'16': '16', '24)': ''},
'Leadership: Transactional and Transformational Leadership': {'1996 by the American Marketing Association': '',
'526\tHANDBOOK OF MARKETING SCALES': '526\tHANDBOOK OF MARKETING SCALES',
'60, 89–105': ''},
'Long-Term Orientation: LTO': {'01)': '',
'10, 1–22': '',
'16 (February), 64–73': '',
'16 (February), 6–17': '',
'18 (May), 133–45': '',
'24 (4), 366–74': '',
'26\tHANDBOOK OF MARKETING SCALES': '26\tHANDBOOK OF MARKETING SCALES',
'28 (4), 674–89': '',
'28 (June), 121–34': '',
'31 (June), 209–19': '',
'34, 100–17': '',
'4': '4',
'5-point scale labeled 1 ': '5-point scale labeled 1 ',
'56 (2), 131–49': '',
'7 (3), 309–19': '',
'70 (1), 172–94': '',
'78, 98–104': '',
'88 (5), 879–903': '',
'9 (June), 139–64': '',
'9, 1–26': ''},
'Meaning of Branded Products Scale': {'25, 82–93': '',
'34, 347–56': '',
'352\tHANDBOOK OF MARKETING SCALES': '352\tHANDBOOK OF MARKETING SCALES'},
'Need to Evaluate: NES': {'05)': '',
'1996 by the American Psychological Association': '',
'38\tHANDBOOK OF MARKETING SCALES': '38\tHANDBOOK OF MARKETING SCALES',
'5 ': '5 ',
'70 (1), 172–94': ''},
'Note: ': {'1 to 7': '1 to 7'},
'Notes: ': {'1': '1',
'15 (January), Pages 77–91': '',
'29 (March), 551–65': ''},
'Opinion Leadership': {'96\tHANDBOOK OF MARKETING SCALES': '96\tHANDBOOK OF MARKETING SCALES'},
'Organizational Commitment': {'108, 17–94': '',
'16, 321–38': '',
'27, 333–44': '',
'538\tHANDBOOK OF MARKETING SCALES': '538\tHANDBOOK OF MARKETING SCALES',
'64, 295–314': ''},
'Organizational Justice': {'540\tHANDBOOK OF MARKETING SCALES': '540\tHANDBOOK OF MARKETING SCALES'},
'Personality and Social Psychology, ': {'01) between a measure of impulsivity and sensory innovativeness': '',
'1990 by Elsevier Science': '',
'20, 293–315': '',
'42': '42'},
'Positive and Negative Affect Scales (PANAS)': {'114 (all nonstudents)': '',
'12, 281–300': '',
'26 (February), 30–43': '',
'316\tHANDBOOK OF MARKETING SCALES': '316\tHANDBOOK OF MARKETING SCALES',
'54, 1063–70': ''},
'Power: Dependence-Based Measure of Interfirm Power in Channels': {'01, one-tailed) indicated a reasonable degree of stability': '',
'12, 177–87': '',
'13': '',
'2': '',
'21': '',
'34 (June), 324–40': '',
'546\tHANDBOOK OF MARKETING SCALES': '546\tHANDBOOK OF MARKETING SCALES',
'558\tHANDBOOK OF MARKETING SCALES': '558\tHANDBOOK OF MARKETING SCALES',
'8': ''},
'Pricing Tactic Persuasion Knowledge: PTPK': {'12 (December), 341–52': '',
'382\tHANDBOOK OF MARKETING SCALES': '382\tHANDBOOK OF MARKETING SCALES'},
'Purchasing Involvement: PI': {'1985 by the American Marketing Association': '',
'268\tHANDBOOK OF MARKETING SCALES': '268\tHANDBOOK OF MARKETING SCALES',
'49, 72–82': ''},
'Reference Group Influence: Consumer Susceptibility to Reference Group Influence': {'140\tHANDBOOK OF MARKETING SCALES': '140\tHANDBOOK OF MARKETING SCALES',
'7 (November), 1–15': ''},
'Research, ': {'16': '', '31 (December), 551–56': '31 (December), 551–56'},
'Response Profile: Viewer Response Profile: VRP': {'19, 37–46': '',
'324\tHANDBOOK OF MARKETING SCALES': '324\tHANDBOOK OF MARKETING SCALES',
'70), respectively': ''},
'Salesperson Performance': {'1': '',
'1993 by the American Marketing Association': '',
'50 (1), 1–28': '',
'512\tHANDBOOK OF MARKETING SCALES': '512\tHANDBOOK OF MARKETING SCALES',
'520\tHANDBOOK OF MARKETING SCALES': '520\tHANDBOOK OF MARKETING SCALES',
'57, 70–80': ''},
'Satisfaction-Channel Satisfaction: SATIND and SATDIR': {'15), GFI ': '15), GFI ',
'1984 by the American Marketing Association': '',
'226–33': '',
'32, 534–52': '',
'35 (September), 382–97': '',
'4 ': '4 ',
'45 (1), 215–33': '',
'54, 80–93': '',
'586\tHANDBOOK OF MARKETING SCALES': '586\tHANDBOOK OF MARKETING SCALES',
'76 (Spring), 11–32': ''},
'Self-Concept Clarity: SCC': {'01)': '',
'1996 by the American Psychological Association': '',
'58\tHANDBOOK OF MARKETING SCALES': '58\tHANDBOOK OF MARKETING SCALES',
'70 (1), 141–56': '',
'82 Canadian students at the University of British Columbia': ''},
'Service Convenience: SERVCON': {'35 (4), 144–56': '',
'38 (May), 269–77': '',
'418\tHANDBOOK OF MARKETING SCALES': '418\tHANDBOOK OF MARKETING SCALES',
'64, 12–40': ''},
'Service Quality of Retail Stores': {'1996 by Sage Publications': '',
'24 (1), 3–16': '',
'408\tHANDBOOK OF MARKETING SCALES': '408\tHANDBOOK OF MARKETING SCALES',
'64, 12–40': ''},
'Style of Processing Scale: SOP': {'1, 109–26': '',
'12, 125–34': '',
'1985 by University of Chicago Press': '',
'296\tHANDBOOK OF MARKETING SCALES': '296\tHANDBOOK OF MARKETING SCALES',
'36 (June), 56–72': '',
'5 ': '5 '},
'TV Program Connectedness Scale': {'148\tHANDBOOK OF MARKETING SCALES': '148\tHANDBOOK OF MARKETING SCALES',
'30 (4), 526–37': ''},
'Tension: Job-Induced Tension': {'1 ': '1 ',
'510\tHANDBOOK OF MARKETING SCALES': '510\tHANDBOOK OF MARKETING SCALES'},
'The Technology Readiness Index (or Techqual™)': {'122\tHANDBOOK OF MARKETING SCALES': '122\tHANDBOOK OF MARKETING SCALES',
'2 (May),307–20': ''},
'Value Consciousness and Coupon Proneness: VC and CP': {'30, 234–45': '',
'386\tHANDBOOK OF MARKETING SCALES': '386\tHANDBOOK OF MARKETING SCALES'},
'Vanity: Trait Aspects of Vanity': {'1995 by University of Chicago Press': '',
'21, 612–26': '',
'64\tHANDBOOK OF MARKETING SCALES': '64\tHANDBOOK OF MARKETING SCALES'},
'Work-Family Conflict and Family-Work Conflict Scales': {'1996 by the American Psychological Association': '',
'506\tHANDBOOK OF MARKETING SCALES': '506\tHANDBOOK OF MARKETING SCALES',
'81 (4), 400–10': '',
'86, CFI ': '86, CFI '},
'exceptions; ': {'1974 by the American Psychological Association': '',
'3': '3',
'30 (4), 526–37': '',
'51, 125–39': ''},
'medium relevance': {'3, ': '3, '}}
I think this can help:
This script reads the file handbook-of-marketing-scale-2011.docx from the current directory and writes the resulting JSON to standard output. If the code is saved as script.py, it can be run with: python3 script.py. To save the output to a file, run: python3 script.py > output.json.
The script iterates over the document's paragraphs, ignoring empty ones. Only the paragraphs between the start of chapter 2 and the appendix are considered. The output is an array of objects, each with title, scale, and answers properties.
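As an illustration of that output shape, a single record might look like the sketch below. The values are borrowed from the question's example rather than from the script's actual output, and representing answers as a list of strings is an assumption.

```python
import json

# Hypothetical record: the values come from the question's example,
# not from the script's real output.
record = {
    "title": "Ten-Item and Five-Item Personality Inventories",
    "scale": "I see myself as",
    "answers": [
        "Extraverted, enthusiastic",
        "Critical, quarrelsome",
        "Dependable, self-disciplined",
    ],
}

# The script's output is described as an array of such objects.
results = [record]
output = json.dumps(results, indent=2, ensure_ascii=False)
print(output)
```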
The title property comes from the last centered text found. The scale property comes from the last text found before a numbered list. The answers property comes from the paragraphs that make up the numbered list.
Some answers have options. The script stores them in subanswers and appends them to the answer text, preceded by a colon and separated by commas. Example: Make gifts instead of buying: never, occasionally, frequently, usually, always.
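The joining of options onto an answer can be sketched as follows. The function name join_subanswers is hypothetical; only the resulting format (a colon, then comma-separated options) is taken from the description above.

```python
from typing import Sequence


def join_subanswers(answer: str, subanswers: Sequence[str]) -> str:
    """Append the options to the answer text: 'answer: opt1, opt2, ...'."""
    if not subanswers:
        # An answer without options is left unchanged.
        return answer
    return f"{answer}: {', '.join(subanswers)}"


joined = join_subanswers(
    "Make gifts instead of buying",
    ["never", "occasionally", "frequently", "usually", "always"],
)
print(joined)
```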
The key to solving this problem was finding a way to distinguish numbered-list paragraphs from other paragraphs. With python-docx, I found this can be done by checking paragraph._p.pPr.numPr. This approach has the drawback of relying on the paragraphs' internal attribute _p, but it appears to be the only option available at the moment given the library's limitations.
After a scale has been processed, the following condition determines whether it is valid:
len(scale) > 1 and len(answers) > 2
Because of the document's structure, some numbered lists whose preceding text (the scale name) is too short, or that have too few answers, would otherwise be interpreted as scales. This condition excludes those candidates from the results.
Here you can see the JSON result.
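The numbering check and the validity condition can be sketched together as below. This is not the original script: the helper names is_numbered, is_valid_scale, collect_scales, and fake_paragraph are hypothetical, title detection is left out, and the stand-in objects only mimic the paragraph._p.pPr.numPr attribute chain so the sketch runs without a document. With real python-docx paragraphs, numbering inherited from a paragraph style would presumably not be caught by this check.

```python
from types import SimpleNamespace


def is_numbered(paragraph) -> bool:
    """True if the paragraph carries its own numbering properties (numPr)."""
    pPr = paragraph._p.pPr  # internal attribute; None when there are no properties
    return pPr is not None and pPr.numPr is not None


def is_valid_scale(scale: str, answers) -> bool:
    """The validity condition quoted above."""
    return len(scale) > 1 and len(answers) > 2


def collect_scales(paragraphs):
    """Group each numbered list under the last non-numbered text before it."""
    results = []
    last_text = ""  # most recent non-numbered paragraph
    answers = []

    def flush():
        # Keep the pending list only if it passes the validity condition.
        if is_valid_scale(last_text, answers):
            results.append({"scale": last_text, "answers": list(answers)})
        answers.clear()

    for paragraph in paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue  # ignore empty paragraphs
        if is_numbered(paragraph):
            answers.append(text)
        else:
            flush()  # a non-numbered paragraph ends any list in progress
            last_text = text
    flush()
    return results


def fake_paragraph(text, numbered):
    """Stand-in for a python-docx paragraph, for demonstration only."""
    pPr = SimpleNamespace(numPr=object() if numbered else None)
    return SimpleNamespace(text=text, _p=SimpleNamespace(pPr=pPr))


demo = [
    fake_paragraph("I see myself as:", False),
    fake_paragraph("Extraverted, enthusiastic", True),
    fake_paragraph("Critical, quarrelsome", True),
    fake_paragraph("Dependable, self-disciplined", True),
    fake_paragraph("Notes:", False),
    fake_paragraph("1 to 7", True),  # a single item: rejected by the condition
]
scales = collect_scales(demo)
print(scales)
```

With a real file, the same function could be given Document(path).paragraphs from python-docx in place of the stand-ins.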