I'm using Spark 2.4.5 and I need to calculate a sentiment score from a list of tokens (the MeaningfulWords column) in df1, based on the words and scores in df2 (a sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the average sentiment (sum of scores / total words) of each record.
The dataframes look like this:
df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
| ID| MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003| [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+
df2.show(5)
+-----+----------+
|score| word|
+-----+----------+
| 1.68|abandonado|
| 3.18| abejas|
| 2.8| aborto|
| 2.46| abrasador|
| 8.13| abrazo|
+-----+----------+
The result of the new columns should be something like this:
+------------------+---------------------+
| MeanScore| ScoreList|
+------------------+---------------------+
| 2.95|[3.10, 2.50, 1.28,...|
| 2.15|[1.15, 3.50, 2.75,...|
| 2.75|[4.20, 1.00, 1.75,...|
| 3.25|[3.25, 2.50, 3.20,...|
| 3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+
I have reviewed several options using .join, including the approach in https://stackoverflow.com/questions/36576196/joining-pyspark-dataframes-on-nested-field , but I can't do a direct join between the two columns because they have different data types (an array of strings versus a string), and it gives an error.
I've also tried converting the DataFrames to RDDs and mapping a function over them, like so:
def map_words_to_values(review_words, afinn_dict):
    return [afinn_dict[word] for word in review_words if word in afinn_dict]

RDD1 = swRemoved.rdd.map(list)
RDD2 = Dict_df.rdd.map(list)
# RDD2 is referenced inside the map() below, i.e. inside a transformation
reviewsRDD_afinn_values = RDD1.map(lambda tup: (tup[0], map_words_to_values(tup[1], RDD2)))
reviewsRDD_afinn_values.take(3)
But with this last option I get the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
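From what I've read, the standard way around SPARK-5063 is to stop referencing RDD2 inside the transformation altogether, e.g. by collecting the (small) dictionary to the driver and broadcasting it as a plain Python dict. A minimal sketch of that idea, assuming df2 fits in driver memory and that spark is the active SparkSession:
# Broadcast the dictionary as a plain dict so the map() below never
# references another RDD (which is what triggers SPARK-5063).
afinn_map = Dict_df.rdd.map(lambda r: (r["word"], r["score"])).collectAsMap()
bc_dict = spark.sparkContext.broadcast(afinn_map)

reviewsRDD_afinn_values = swRemoved.rdd.map(
    lambda row: (row["ID"], map_words_to_values(row["MeaningfulWords"], bc_dict.value)))
reviewsRDD_afinn_values.take(3)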
That said, I know how to solve this with pandas, but I would like to find the correct way to do it in Spark without sacrificing performance.
My problem was solved at https://stackoverflow.com/questions/61687997/calculate-new-column-in-spark-dataframe-crossing-a-tokens-list-column-in-df1-wi :
You can do this first with a join using array_contains(MeaningfulWords, word), then a groupBy with collect_list of all the words that joined, and finally the higher-order functions transform and aggregate to get the mean score (valid in Spark 2.4+). The higher-order function aggregate only accepts integer values, so transform was needed to convert the scores to integers (scaling by 100), dividing back by 100 at the end (assuming a maximum of 2 decimal places, e.g. 2.81).
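A sketch of that approach, reconstructed from the description above (the exact code is in the linked answer) and using the column names from the examples:
from pyspark.sql import functions as F

# Attach a score to every token that appears in the dictionary, regroup per
# review, then average with the SQL higher-order functions (Spark 2.4+).
result = (df1
    .join(df2, F.expr("array_contains(MeaningfulWords, word)"), "left")
    .groupBy("ID", "MeaningfulWords")
    .agg(F.collect_list("score").alias("ScoreList"))
    .withColumn(
        "MeanScore",
        # aggregate's accumulator takes its type from the start value (0, an
        # integer), so the scores are scaled to ints first and rescaled after.
        F.expr("""aggregate(transform(ScoreList, x -> int(x * 100)),
                            0, (acc, x) -> acc + x)
                  / (size(ScoreList) * 100)""")))
result.select("ID", "MeanScore", "ScoreList").show(5)
The left join keeps reviews whose tokens match nothing in the dictionary; for those rows ScoreList comes out empty and MeanScore is null, which may need separate handling.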