I have two notebooks in Databricks: notebook1 does text processing (tokenizing, removing stopwords, ...) and outputs a clean text file, and notebook2 reads that clean text and performs sentiment analysis.
Here is the output schema of the notebook1 DataFrame:
cleanDF.printSchema
Out[25]: <bound method DataFrame.printSchema of DataFrame[ID: string, Year: int, TypeComment: string, NewText: string, ExecutionName: string, ExecutionTime: string]>
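(Note that I call the method without parentheses; cleanDF.printSchema() would print the schema as a tree instead of the bound-method repr, but the column names and types are visible either way.)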
The output DataFrame from notebook1 looks like this:
+------------------+----+-----------+--------------------+----------------+--------------------+
| ID|Year|TypeComment| NewText| ExecutionName| ExecutionTime|
+------------------+----+-----------+--------------------+----------------+--------------------+
|aaaaaaaaaaaaadWUAQ|2020| General|limpieza general....|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaae2UAA|2020| General| todo correcto...|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaaxUUAQ|2020| General| correcto|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaaEJUAY|2020| General| bien|TextPretreatment|2020-08-10 08:28:...|
|a0aaaaaaaaaaaaaUAQ|2020| General|rocio ventas trad...|TextPretreatment|2020-08-10 08:28:...|
+------------------+----+-----------+--------------------+----------------+--------------------+
only showing top 5 rows
Locally, to generate the output file I use this instruction, which produces a single CSV with the indicated file name:
cleanDF.toPandas().to_csv("./Test/Outputs/TextPre-treatment.csv", header=True)
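In the local run, notebook2 then reads that fixed file name back with the same spark.read.csv call shown further below, only pointed at the local path; roughly (illustrative only, the schema is the one I define for notebook2 below):

df = spark.read.csv("./Test/Outputs/TextPre-treatment.csv", schema=schema, header=True)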
Locally everything works correctly, because each notebook uses paths on my machine for its inputs and outputs. However, when moving to Databricks, I have problems running notebook2, since its input file is the output of notebook1, which, when written to DBFS, produces a directory with several files instead of a single named file.
Following posts like write-single-csv-file-using-spark-csv and save-content-of-spark-dataframe-as-a-single-csv-file, in Databricks I have replaced the file-generation line; this is the code with which I store the output DataFrame from notebook1 in DBFS:
dbutils.fs.rm("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv", True)
cleanDF.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
I have also tried this alternative:
cleanDF.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
In both cases it stores a single part file with the results (because my output is not large), but instead of a single CSV with the requested name it creates a directory at that path containing the actual part file (part-00000-tid-*.csv, shown further below) plus Spark's marker files.
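One workaround suggested in the posts linked above, which I have not fully verified in my environment (the output directory is the one from my write above; the fixed target name is just an example), is to locate the single part file after the write and copy it to a stable name:

outDir = "dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"
# find the single part file produced by repartition(1)/coalesce(1)
partFile = [f.path for f in dbutils.fs.ls(outDir) if f.name.startswith("part-")][0]
# copy it to a fixed name that notebook2 could rely on
dbutils.fs.cp(partFile, "dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment-single.csv")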
The problem comes when I run notebook2 and try to read the output of notebook1 from that path, since I cannot point at a specific file name: the part-file name changes after each execution. In notebook2 I do this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

pathText = "/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"

schema = StructType([
    StructField("Index", StringType()),
    StructField("ID", StringType()),
    StructField("Year", IntegerType()),
    StructField("TypeComment", StringType()),
    StructField("NewText", StringType()),
    StructField("ExecutionName", StringType()),
    StructField("ExecutionTime", StringType())
])

df = spark.read.csv(pathText, schema=schema, header=True)
df.show(5)
It apparently reads the file, but everything comes back as null:
+-----+----+----+-----------+-------+-------------+-------------+
|Index| ID|Year|TypeComment|NewText|ExecutionName|ExecutionTime|
+-----+----+----+-----------+-------+-------------+-------------+
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
+-----+----+----+-----------+-------+-------------+-------------+
only showing top 5 rows
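Just to inspect what Spark parses on its own, I can also read the same path without my explicit schema (a quick check, not part of the real pipeline):

checkDF = spark.read.option("header", "true").csv(pathText)
checkDF.printSchema()
checkDF.show(5, truncate=False)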
If I check the content of the part-00000-tid-*.csv file in DBFS, it does have content.
I would like to know how to get notebook2 to correctly read the output file(s) of notebook1, and whether there is a more appropriate way to handle this kind of file reading in Databricks, when the input of one notebook depends on the output of another.
My configuration is:
Spark NLP version: 2.5.5
Apache Spark version: 2.4.5
Databricks Runtime: 6.5.x-cpu-ml-scala2.11