I am trying to scrape questions and answers from Google Forms whose URLs are listed in a CSV file. Here is an excerpt of links_y_temas.csv:
Link,Task
https://docs.google.com/forms/d/1MYSjxAMCXMXB02GPXqkewVx35_ptzv0XO7GRZQGwyGE/edit?usp=sharing,Hotel ABC
https://docs.google.com/forms/d/1VQpzX1GqsnI92J1trlP37v3GjVgNMp1h-1Rh_n8orII/edit?usp=sharing,Airline XYZ
https://docs.google.com/forms/d/1z-qyHp7O4eTI848b6L8_vh59IJ-y0bFpAo--zwPKJxY/edit?usp=sharing,Airline XYZ
https://docs.google.com/forms/d/1IoDbsif5qorINuUrF1Dl9iMtIdnAsTvVA3vMVsVHjy8/edit?usp=sharing,Airline XYZ
I want to collect the quiz questions and their answers and write them to a CSV file that looks like this:
pickle_file_name,id,question,answer_1,answer_2,answer_3,answer_4,answer_5,answer_6,answer_7,answer_8,answer_9,answer_10,answer_11,answer_12,answer_13,answer_14
applicantHotel_ABC_c,1,How do you feel about your next vacation after COVID-19?,,,,,,,,,,,,,,
applicantHotel_ABC_c,2,When do you think your next vacation can start?,In next 3 months,In next 6 months,In next 1 year,Only once COVID-19 is under control,Only once COVID-19 vaccine is developed,,,,,,,,,
applicantHotel_ABC_c,3,What are your preferences regarding medical treatment policy (with additional cost)?,Doctor's availability in hotel,Ventilator availability in hotel,Tie-ups with nearby hospitals,Availability of medical rooms with primary first aid care,,,,,,,,,,
applicantHotel_ABC_c,4,What is your preferences of complementary breakfast?,Buffet breakfast with social distancing,Buffet breakfast replaced with Ala-carte with limited options,Breakfast to be delivered in room with limited options (chargeable),Packaged breakfast only,,,,,,,,,,
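The target layout above could be produced along these lines. This is only a sketch: `questionnaire_to_rows` and `write_csv` are hypothetical helper names, and it assumes each questionnaire is a dict mapping a question to a list of answers, which is the shape the script prints in its output.

```python
import csv
import io

N_ANSWERS = 14  # number of answer_N columns in the target layout

HEADER = ["pickle_file_name", "id", "question"] + [
    "answer_%d" % i for i in range(1, N_ANSWERS + 1)
]


def questionnaire_to_rows(pickle_file_name, questionnaire):
    """Turn {question: [answers]} into rows padded to N_ANSWERS answer columns."""
    rows = []
    for index, (question, answers) in enumerate(questionnaire.items(), start=1):
        # Drop empty placeholders, then truncate and pad to a fixed width.
        answers = [a for a in answers if a][:N_ANSWERS]
        padding = [""] * (N_ANSWERS - len(answers))
        rows.append([pickle_file_name, index, question] + answers + padding)
    return rows


def write_csv(rows, stream):
    """Write the header and the rows to any text stream (file or StringIO)."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)


example = {
    "How do you feel about your next vacation after COVID-19?": [],
    "When do you think your next vacation can start?": [
        "In next 3 months", "In next 6 months", "",
    ],
}
rows = questionnaire_to_rows("applicantHotel_ABC_c", example)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to a stream rather than a path keeps the sketch easy to try out; passing an open file (opened with `newline=""`) works the same way.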
However, there are two types of forms: published ones (the Google Forms a respondent normally sees) and ones for which I have access to the backend (the editor view). When I loop over both types, I cannot get the questions from the published forms.
In practice I get the following kinds of problems:
StaleElementReferenceException: This exception is difficult to understand. It seems to occur when reading a published Google Form: I can read the first question and its answers, but not the rest. This can happen when a DOM operation on the page temporarily makes the element inaccessible. To account for these cases, I will try to access the element several times in a loop before finally raising the exception.
UnexpectedAlertPresentException: This seems to happen because the page shows the button to modify the form, for example at this location: https://docs.google.com/forms/d/1HAMUvDpYiz-SpQpKwUOxyHqn3Be7OV9vLER3K5ltrxg/edit?usp=sharing
content_area.get_attribute("aria-label"): I feel like I am going over the same thing again and again.
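The retry idea mentioned for StaleElementReferenceException could look roughly like the sketch below. `retry` and `FlakyLookup` are hypothetical names, and `LookupError` merely stands in for the Selenium exception so that the example runs without a browser. One caveat: with Selenium the retried action should look the element up again, because calling a method on the same stale WebElement will keep failing.

```python
import time


def retry(action, retry_on, attempts=3, delay=0.5):
    """Call action() up to `attempts` times, retrying on `retry_on` exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the last exception propagate
            time.sleep(delay)


class FlakyLookup:
    """Stand-in for an element lookup that fails a few times before succeeding."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise LookupError("element went stale")
        return "question text"


lookup = FlakyLookup(failures=2)
print(retry(lookup, LookupError, attempts=3, delay=0))
```

In the scraper, `retry_on` would be `StaleElementReferenceException` and `action` a small function that re-finds the container by index and returns its text.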
Enough talk! Here is my code, which manages to extract a little over 5% of the Google Forms:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from selenium.common.exceptions import ElementNotInteractableException, NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common import exceptions
import pickle
WDWTIME = 20
USER = '[email protected]' # some Google Forms require a login
PWD = "yourpassword"
def setup_chromedriver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # Raw string so the backslashes in the Windows path are not read as escapes;
    # the options are passed in so that --headless actually takes effect.
    driver = webdriver.Chrome(r"C:\Programs\chromedriver.exe", options=chrome_options)
    # Some Google Forms require a logged-in account
    url = 'https://www.google.com/accounts/'
    driver.get(url)
    # Find the login field
    login_field = WebDriverWait(driver, WDWTIME).until(
        EC.presence_of_element_located((By.ID, 'identifierId')))
    login_field.send_keys(USER)
    # Click the "Next" button
    driver.find_element_by_id('identifierNext').click()
    # Find the password field
    time.sleep(4)
    driver.set_page_load_timeout(50)
    driver.set_script_timeout(50)
    password_field = WebDriverWait(driver, WDWTIME).until(
        EC.presence_of_element_located((By.ID, 'password')))
    password_field = password_field.find_element_by_tag_name('input')
    password_field.send_keys(PWD)
    # Click the "Next" button
    driver.find_element_by_id('passwordNext').click()
    driver.set_page_load_timeout(30)
    driver.set_script_timeout(30)
    return driver
def load_data():
    df = pd.read_csv("research_assistant_intern_recruitment_an.csv")
    filter_col = ["Link"]
    return df, filter_col
def get_published_questionnaire():
    print("published questionnaire")
    questionnaire = {}
    btns = driver.find_elements_by_css_selector(".appsMaterialWizButtonEl")
    # The "Next" button; *warning*: "Request edit access" is matched as well.
    next_btns = driver.find_elements_by_class_name("appsMaterialWizButtonPaperbuttonContent.exportButtonContent")
    if next_btns:
        next_btns[-1].click()
    next_btns = driver.find_elements_by_class_name("appsMaterialWizButtonPaperbuttonContent.exportButtonContent")
    while next_btns != []:
        containers = driver.find_elements_by_class_name(
            "freebirdFormviewerViewNumberedItemContainer"
        )
        len_containers = len(containers)
        for container in containers:
            len_containers -= 1
            print("len_containers: ", len_containers)
            try:
                question = container.find_element_by_class_name(
                    "freebirdFormviewerViewItemsItemItemTitle.exportItemTitle.freebirdCustomFont"
                )
            except NoSuchElementException:
                print("No question, NoSuchElementException")
                continue
            except exceptions.StaleElementReferenceException:
                print("No question, StaleElementReferenceException")
                continue
            responses = container.find_elements_by_class_name(
                "docssharedWizToggleLabeledLabelText"
            )
            extracted_text = [response.text for response in responses]
            questionnaire[question.text] = extracted_text
            # short text inputs
            content_areas = driver.find_elements_by_class_name(
                "quantumWizTextinputSimpleinputInput.exportInput"
            )
            for content_area in content_areas:
                skip = ["Document title", "Titre du document", "Adresse e-mail valide"]
                if content_area.get_attribute("aria-label") in skip and not content_area.get_attribute("aria-label").isspace():
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                else:
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                    content_area.send_keys("10102015")
            # date, numeric and labelled inputs
            content_areas = driver.find_elements_by_class_name(
                "quantumWizTextinputPaperinputInput.exportInput"
            )
            for content_area in content_areas:
                if content_area.get_attribute("type") == "date" and not content_area.get_attribute("type").isspace():
                    condition = content_area.get_attribute("type")
                    if condition == "date":
                        content_area.send_keys("10102015")
                elif content_area.get_attribute("max") and not content_area.get_attribute("max").isspace():
                    max_value = content_area.get_attribute("max")
                    content_area.send_keys(max_value)
                elif content_area.get_attribute("aria-label") and not content_area.get_attribute("aria-label").isspace():
                    condition = content_area.get_attribute("aria-label")
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                    if condition == "State (Two letter Abbreviation)":
                        content_area.send_keys("CA")
                    else:
                        content_area.send_keys("10102015")
            for content_area in content_areas:
                skip = ["Document title", "Titre du document", "Adresse e-mail valide"]
                if content_area.get_attribute("aria-label") in skip and not content_area.get_attribute("aria-label").isspace():
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                else:
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                    content_area.send_keys("10102015")
            # radio buttons
            btns_answers = driver.find_elements_by_css_selector(".appsMaterialWizToggleRadiogroupElContainer")
            for btn_answer in btns_answers:
                try:
                    driver.execute_script('arguments[0].scrollIntoView(true);', btn_answer)
                    btn_answer.click()
                except ElementNotInteractableException:
                    pass
                except exceptions.ElementClickInterceptedException:
                    continue
            # long answers
            content_areas = driver.find_elements_by_class_name(
                "quantumWizTextinputPapertextareaInput.exportTextarea"
            )
            for content_area in content_areas:
                content_area.send_keys(
                    "This restaurant is really good! Me and my boyfriend went there on our holiday "
                    "we had dinner there at 3 of February food was 100% And the service vas 150% "
                    "And i really want to thank Asie for a really good service as for his coworkers. "
                    "We highly recommended this restaurant!"
                )
            # check boxes (only the first one is ticked)
            btn_check_boxes = driver.find_elements_by_class_name(
                "docssharedWizToggleLabeledContainer.freebirdFormviewerViewItemsCheckboxContainer"
            )
            for btn_check_box in btn_check_boxes:
                btn_check_box.click()
                break
            # btn_check_box[-1].click()
            # other weird check boxes
            btn_check_boxes = driver.find_elements_by_class_name(
                "docssharedWizToggleLabeledLabelText.exportLabel.freebirdFormviewerViewItemsCheckboxLabel"
            )
            for btn_check_box in btn_check_boxes:
                btn_check_box.click()
                break
            # btns[-1].click()
            # click "Next" to move to the following page
            next_btns = driver.find_elements_by_class_name(
                "appsMaterialWizButtonPaperbuttonContent.exportButtonContent")
            if next_btns != []:
                next_btns[-1].click()
                next_btns = []
            else:
                continue
    print("questionnaire: ", questionnaire)
    return questionnaire
def get_backend_questionnaire():
    print("backend questionnaire")
    # Sometimes we start on something that looks like a published page with a "Next" button
    # if driver.find_element_by_id('identifierNext'):
    #     driver.find_element_by_id('identifierNext').click()
    questionnaire = {}
    # Get all the cards that contain a question and its answers
    containers = driver.find_elements_by_class_name(
        "freebirdFormeditorViewItemContentWrapper"
    )
    driver.set_page_load_timeout(30)
    driver.set_script_timeout(30)
    # For each card
    for container in containers:
        try:
            question = container.find_element_by_css_selector(".exportTextarea[aria-label='Intitulé de la question']")
        except NoSuchElementException:
            print("NoSuchElementException in " + str(container))
            continue
        # Get the answers
        responses = container.find_elements_by_css_selector(
            ".quantumWizTextinputSimpleinputInput.exportInput"
        )
        extracted_responses = [response.get_attribute("data-initial-value") for response in responses]
        questionnaire[question.text] = extracted_responses
    driver.set_page_load_timeout(30)
    driver.set_script_timeout(30)
    print("questionnaire backend: ", questionnaire)
    return questionnaire
def extract(driver, df, survey):
    count_questionnaires = 0
    result = []
    count_not_empty = 0.0
    print("survey: ", survey)
    for location, task in zip(df.Link, df.Task):
        if task == survey:
            print("location: ", location)
            questionnaire = {}
            if "docs.google.com" in str(location):
                count_questionnaires += 1.0
                driver.get(location)
                # test if it is a published version
                try:
                    ask_access_btn = driver.find_elements_by_class_name(
                        "freebirdFormviewerViewNavigationHeaderButtonContent"
                    )
                except exceptions.UnexpectedAlertPresentException:
                    print("UnexpectedAlertPresentException")
                    # NOTE: bare reference, not a call; ask_access_btn is left
                    # unchanged (or undefined if this happens on the first form)
                    get_published_questionnaire
                if ask_access_btn:
                    questionnaire = get_published_questionnaire()
                else:
                    questionnaire = get_backend_questionnaire()
            if questionnaire not in [{}, {'': ''}]:
                count_not_empty += 1.0
                print(questionnaire)
            result.append({str(count_questionnaires): questionnaire})
            count_questionnaires += 1
    print("count_questionnaires: ", count_questionnaires)
    if count_questionnaires != 0:
        print("count_not_empty/count_questionnaires: ", count_not_empty/count_questionnaires)
    return result
if __name__ == '__main__':
    # A Google account login is needed to access certain questionnaires.
    # This also sets up chromedriver to run headless.
    driver = setup_chromedriver()
    published_questionnaires = []  # tracking published ones
    # Load the CSV of Google Forms
    df, columns = load_data()
    surveys = ['Hotel ABC', "Airline XYZ", "The Ministry of Tourism of France"]
    for survey in surveys:
        result = extract(driver, df, survey)
        survey = survey.replace(" ", "_")
        # The context manager closes the file even if pickling fails
        with open("applicant" + survey + "_c.p", "wb") as pickle_out:
            pickle.dump(result, pickle_out)
    print("published_questionnaires: ", published_questionnaires)
The output is:
C:\Users\antoi\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\python3.7.exe C:/Users/antoi/Documents/Programming/Scraping/Python/new_scraper.py
survey: Hotel ABC
location: https://docs.google.com/forms/d/1MYSjxAMCXMXB02GPXqkewVx35_ptzv0XO7GRZQGwyGE/edit?usp=sharing
backend questionnaire
NoSuchElementException in <selenium.webdriver.remote.webelement.WebElement (session="bb78e4006674ba0bd9384f86518d8b95", element="7271d9f5-478a-4dfc-8f9e-6d9248da5bcc")>
...
NoSuchElementException in <selenium.webdriver.remote.webelement.WebElement (session="bb78e4006674ba0bd9384f86518d8b95", element="3ff21bcf-6e8d-4b88-b64d-10cf5a11b18f")>
NoSuchElementException in <selenium.webdriver.remote.webelement.WebElement (session="bb78e4006674ba0bd9384f86518d8b95", element="16e5fa0b-3502-4ab7-a7f7-a47b1b8fa388")>
NoSuchElementException in <selenium.webdriver.remote.webelement.WebElement (session="bb78e4006674ba0bd9384f86518d8b95", element="fdc1bafb-51c9-4cc7-8f43-675ea9061338")>
questionnaire backend: {'How do you feel about your next vacation after COVID-19?': [], 'When do you think your next vacation can start?': ['In next 3 months', 'In next 6 months', 'In next 1 year', 'Only once COVID-19 is under control', 'Only once COVID-19 vaccine is developed', ''], ... , 'Education Level': ['No higher education', 'Diploma', "Bachelor's", "Master's", 'PhD', "Other's", ''], 'Annual Income': ['< £ 30,000', '£ 30,000 to £ 50,000', '£ 50,000 to £ 80,000', '£ 80,000 to £ 120,000', '> £ 120,000', ''], 'Feedback / Comments': [''], 'Email (Optional)': ['']}
location: https://docs.google.com/forms/d/1_iRBtfJANF5MGWqoIMQUxBdeuAa4ePMltdIsVRmdY5Y/edit?usp=sharing
published questionnaire
len_containers: 9
No question, NoSuchElementException
len_containers: 8
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
len_containers: 7
No question, StaleElementReferenceException
len_containers: 6
No question, StaleElementReferenceException
len_containers: 5
No question, StaleElementReferenceException
len_containers: 4
No question, StaleElementReferenceException
len_containers: 3
No question, StaleElementReferenceException
len_containers: 2
No question, StaleElementReferenceException
len_containers: 1
No question, StaleElementReferenceException
len_containers: 0
No question, StaleElementReferenceException
questionnaire: {'Age': ['Under 18', '18-24', '25-34', '35-44', '45-54', 'Over 55']}
{'Age': ['Under 18', '18-24', '25-34', '35-44', '45-54', 'Over 55']}
location: https://docs.google.com/forms/d/1j0nk_Oo-_pfJBM4UcWITDPXT97-qX5mZpb3uVyKS3CA/edit?usp=sharing
published questionnaire
len_containers: 13
No question, NoSuchElementException
len_containers: 12
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
len_containers : ...
...
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
len_containers: 0
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
questionnaire: {'On average, how many times per year do you travel for 2 days or more? *': ['1 to 2', '3 to 4', '5 or more'], ... , 'What are your expectations from the accommodations’ Pinterest page? (Please select all that apply) *': ['See pictures of customer service staff', 'See pictures of staff in general', 'See pictures of the destination', 'See pictures of all the types of rooms', 'See pictures of services available', 'See pictures that are not on the hotel’s website', "I don't have a Pinterest account", "I don't use Pinterest for these purposes"]}
location: https://docs.google.com/forms/d/1kq5dhHvftF6tWmRk_7cG4H0Mkzr6xnUhpTV0j2xIYeE/edit?usp=sharing
UnexpectedAlertPresentException
published questionnaire
len_containers: 13
No question, NoSuchElementException
len_containers: 12
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
len_containers: 11
...
len_containers: 1
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
len_containers: 0
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
content_area.get_attribute("aria-label"): Other response
questionnaire: {'On average, how many times per year do you travel for 2 days or more? *': ['1 to 2', '3 to 4', '5 or more'], ... 'What are your expectations from the accommodations’ Pinterest page? (Please select all that apply) *': ['See pictures of customer service staff', 'See pictures of staff in general', 'See pictures of the destination', 'See pictures of all the types of rooms', 'See pictures of services available', 'See pictures that are not on the hotel’s website', "I don't have a Pinterest account", "I don't use Pinterest for these purposes"]}
location: https://docs.google.com/forms/d/1IFqdsm9yO8h17JsJPN4c84vpQP06PxIquWfTmRN-TVw/edit?usp=sharing
So it looks like we're lost in a loop...
Each form contains a JavaScript variable that holds all the questions and their associated answers. BeautifulSoup seems to be a better option than Selenium in this case: it is easier to get the questions, and you do not need to navigate through the elements of the document.
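A minimal sketch of that idea follows. Several things here are assumptions rather than documented behaviour: the variable name `FB_PUBLIC_LOAD_DATA_`, the position of the question list at `data[1][1]`, and the per-item layout (title at index 1, answer fields at index 4) are what public form pages have been observed to contain, and they may change without notice. `parse_form_html` is a hypothetical helper name. To keep the example self-contained it uses only `re` and `json` on an HTML string; in practice the page would be downloaded first, and BeautifulSoup could be used to locate the script tag.

```python
import json
import re

# Assumption: the public page embeds the form definition as
#   var FB_PUBLIC_LOAD_DATA_ = [...];</script>
VARIABLE_PATTERN = re.compile(
    r"var\s+FB_PUBLIC_LOAD_DATA_\s*=\s*(\[.*?\])\s*;\s*</script>", re.DOTALL
)


def parse_form_html(html):
    """Return {question title: [answer options]} from a form page's HTML."""
    match = VARIABLE_PATTERN.search(html)
    if match is None:
        return {}
    data = json.loads(match.group(1))
    questionnaire = {}
    for item in data[1][1]:  # assumed location of the list of items
        title = item[1] or ""
        fields = item[4] if len(item) > 4 and item[4] else []
        options = []
        for field in fields:
            # field[1] is assumed to hold the options; it is null for free text
            for option in field[1] or []:
                options.append(option[0])
        questionnaire[title] = options
    return questionnaire


# Synthetic page mimicking the assumed layout (not real form data).
age_item = [101, "Age", None, 2, [[201, [["Under 18"], ["18-24"]], 1]]]
comment_item = [102, "Feedback / Comments", None, 1, [[202, None, 0]]]
sample_data = [None, ["description", [age_item, comment_item]]]
sample_html = (
    "<html><body><script>var FB_PUBLIC_LOAD_DATA_ = %s;</script></body></html>"
    % json.dumps(sample_data)
)

print(parse_form_html(sample_html))
```

Because nothing is clicked and no page navigation happens, this approach avoids stale element references entirely, at the cost of depending on an undocumented page structure.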