I have two notebooks in Databricks: notebook1 does text processing (tokenizing, removing stopwords, ...) and outputs a clean text file, and notebook2 reads that clean text and performs sentiment analysis.
Here is the output schema of the notebook1 DataFrame:
cleanDF.printSchema
Out[25]: <bound method DataFrame.printSchema of DataFrame[ID: string, Year: int, TypeComment: string, NewText: string, ExecutionName: string, ExecutionTime: string]>
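(Note that I call the method without parentheses; cleanDF.printSchema() would print the schema as a tree instead of the bound-method repr, but the column names and types are visible either way.)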
The output DataFrame from notebook1 looks like this:
+------------------+----+-----------+--------------------+----------------+--------------------+
| ID|Year|TypeComment| NewText| ExecutionName| ExecutionTime|
+------------------+----+-----------+--------------------+----------------+--------------------+
|aaaaaaaaaaaaadWUAQ|2020| General|limpieza general....|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaae2UAA|2020| General| todo correcto...|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaaxUUAQ|2020| General| correcto|TextPretreatment|2020-08-10 08:28:...|
|aaaaaaaaaaaaaEJUAY|2020| General| bien|TextPretreatment|2020-08-10 08:28:...|
|a0aaaaaaaaaaaaaUAQ|2020| General|rocio ventas trad...|TextPretreatment|2020-08-10 08:28:...|
+------------------+----+-----------+--------------------+----------------+--------------------+
only showing top 5 rows
Locally, to generate the output file I use this instruction, which produces a single CSV with the indicated file name:
cleanDF.toPandas().to_csv("./Test/Outputs/TextPre-treatment.csv", header=True)
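In the local run, notebook2 then reads that fixed file name back with the same spark.read.csv call shown further below, only pointed at the local path; roughly (illustrative only, the schema is the one I define for notebook2 below):

df = spark.read.csv("./Test/Outputs/TextPre-treatment.csv", schema=schema, header=True)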
Locally everything works correctly, because each notebook uses paths on my machine for its inputs and outputs. However, when moving to Databricks, I have problems running notebook2, since its input file is the output of notebook1, which, when written to DBFS, produces a directory with several files instead of a single named file.
Following posts like write-single-csv-file-using-spark-csv and save-content-of-spark-dataframe-as-a-single-csv-file, in Databricks I have replaced the file-generation line; this is the code with which I store the output DataFrame from notebook1 in DBFS:
dbutils.fs.rm("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv", True)
cleanDF.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
I have also tried this alternative:
cleanDF.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv")
In both cases it stores a single part file with the results (because my output is not large), but instead of a single CSV with the requested name it creates a directory at that path containing the actual part file (part-00000-tid-*.csv, shown further below) plus Spark's marker files.
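One workaround suggested in the posts linked above, which I have not fully verified in my environment (the output directory is the one from my write above; the fixed target name is just an example), is to locate the single part file after the write and copy it to a stable name:

outDir = "dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"
# find the single part file produced by repartition(1)/coalesce(1)
partFile = [f.path for f in dbutils.fs.ls(outDir) if f.name.startswith("part-")][0]
# copy it to a fixed name that notebook2 could rely on
dbutils.fs.cp(partFile, "dbfs:/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment-single.csv")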
The problem comes when I run notebook2 and try to read the output of notebook1 from that path, since I cannot point at a specific file name: the part-file name changes after each execution. In notebook2 I do this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

pathText = "/FileStore/tables/Test/Outputs/Pretreatment/TextPre-treatment.csv"

schema = StructType([
    StructField("Index", StringType()),
    StructField("ID", StringType()),
    StructField("Year", IntegerType()),
    StructField("TypeComment", StringType()),
    StructField("NewText", StringType()),
    StructField("ExecutionName", StringType()),
    StructField("ExecutionTime", StringType())
])

df = spark.read.csv(pathText, schema=schema, header=True)
df.show(5)
It apparently reads the file, but everything comes back as null:
+-----+----+----+-----------+-------+-------------+-------------+
|Index| ID|Year|TypeComment|NewText|ExecutionName|ExecutionTime|
+-----+----+----+-----------+-------+-------------+-------------+
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
| null|null|null| null| null| null| null|
+-----+----+----+-----------+-------+-------------+-------------+
only showing top 5 rows
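Just to inspect what Spark parses on its own, I can also read the same path without my explicit schema (a quick check, not part of the real pipeline):

checkDF = spark.read.option("header", "true").csv(pathText)
checkDF.printSchema()
checkDF.show(5, truncate=False)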
If I check the content of the part-00000-tid-*.csv file in DBFS, it does have content.
I would like to know how to get notebook2 to correctly read the output file(s) of notebook1, and whether there is a more appropriate way to handle this kind of file reading in Databricks, when the input of one notebook depends on the output of another.
My configuration is:
Spark NLP version: 2.5.5
Apache Spark version: 2.4.5
Databricks Runtime: 6.5.x-cpu-ml-scala2.11