I have a couple of notebooks in Databricks: notebook1 does text processing (tokenizing, removing stopwords, ...) and outputs a clean text file, and notebook2 reads that clean text and performs sentiment analysis.
Here is the output schema of the dataframe from notebook1:
cleanDF.printSchema
Out[25]: <bound method DataFrame.printSchema of DataFrame[ID: string, Year: int, TypeComment: string, NewText: string, ExecutionName: string, ExecutionTime: string]>
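(As a side note, I referenced printSchema above without parentheses, which is why the cell shows the bound method instead of the schema tree; calling it prints the schema:)
cleanDF.printSchema()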
And the output dataframe from notebook1 looks like this:
+------------------+----+-----------+--------------------+----------------+--------------------+
| ID|Year|TypeComment| NewText| ExecutionName| ExecutionTime|
+------------------+----+-----------+--------------------+----------------+--------------------+
|aaaaaaaaaaaaadWUAQ|2020| General|limpieza general....|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaae2UAA|2020| General| todo correcto...|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaaxUUAQ|2020| General| correcto|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaaEJUAY|2020| General| bien|TextPretreatment|2020-08-10 08:28:...|
|a0aaaaaaaaaaaaaUAQ|2020| General|rocio ventas trad...|TextPretreatment|2020-08-10 08:28:...|
+------------------+----+-----------+--------------------+----------------+--------------------+
only showing top 5 rows
Locally, to generate the output file I use this instruction, which produces a single CSV with the indicated file name:
cleanDF.toPandas().to_csv("./Test/Outputs/TextPre-treatment.csv", header=True)
Locally everything works correctly, because each notebook uses paths on my machine for its inputs and outputs. However, when moving this to Databricks, I have problems running notebook2, since its input file is the output of notebook1, which, when written to DBFS, ends up as a directory of generated files rather than a single named file.
Following some posts like write-single-csv-file-using-spark-csv and save-content-of-spark-dataframe-as-a-single-csv-file, in Databricks I have replaced the file-generation line; this is the code with which I store the output dataframe of notebook1 in DBFS:
dbutils.fs.rm("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv", True)
cleanDF.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
I have also tried this alternative:
cleanDF.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
In both cases it stores a single file with the results (my output is not large), but instead of a file named TextPre-treatment.csv it creates a directory with that name, containing a part-00000-tid-*.csv file plus Spark's marker files:
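Listing that path with dbutils.fs.ls shows something like this (a sketch of one run; the part file name changes on every execution):
display(dbutils.fs.ls("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"))
# _SUCCESS
# _committed_<id>
# _started_<id>
# part-00000-tid-<id>-<uuid>-c000.csv
The problem comes when I run notebook2 and try to read the output of notebook1 from that path: I cannot give a specific file name, because the part file's name changes after each execution. In notebook2 I do this: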
pathText= "/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"
schema = StructType([
StructField("Index", StringType()),
StructField("ID", StringType()),
StructField("Year", IntegerType()),
StructField("TypeComment", StringType()),
StructField("NewText", StringType()),
StructField("ExecutionName", StringType()),
StructField("ExecutionTime", StringType())
])
df= spark.read.csv(pathText, schema=schema , header=True)
df.show(5)
It apparently reads the file, but every value comes back null:
+-----+----+----+-----------+-------+-------------+-------------+
|Index| ID|Year|TypeComment|NewText|ExecutionName|ExecutionTime|
+-----+----+----+-----------+-------+-------------+-------------+
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
+-----+----+----+-----------+-------+-------------+-------------+
only showing top 5 rows
If I check the content of the part-00000-tid-*.csv file in DBFS, it does have content:
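For example, something along these lines is how I peek at it (a sketch; I just take whichever part file is in the directory):
files = dbutils.fs.ls("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
part = [f.path for f in files if f.name.startswith("part-")][0]
print(dbutils.fs.head(part, 500))  # prints the header line and the first rows of the CSV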
I would like to know how to get notebook2 to correctly read the output file(s) of notebook1, and whether there is a more appropriate way to handle this kind of read when, in Databricks, the input of one notebook depends on the output of another.
My configuration is:
Spark NLP version: 2.5.5
Apache Spark version: 2.4.5
Databricks Runtime: 6.5.x-cpu-ml-scala2.11
I would try removing the Index column, because it looks like an error in the schema definition: the Spark writer does not add a pandas-style index column, so your schema expects one more column than the CSV actually has, and Spark then nulls out every row. It seems to me you have the same problem as in this question: NULL values when trying to import CSV in Azure Databricks DBFS
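A sketch of what I mean, keeping the same directory path and only dropping Index from your schema (untested):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

pathText = "/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"

# same schema as before, minus the Index column that the Spark writer never produces
schema = StructType([
    StructField("ID", StringType()),
    StructField("Year", IntegerType()),
    StructField("TypeComment", StringType()),
    StructField("NewText", StringType()),
    StructField("ExecutionName", StringType()),
    StructField("ExecutionTime", StringType())
])

df = spark.read.csv(pathText, schema=schema, header=True)
df.show(5)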