Question

EJS

Asked: 2020-05-09 10:09:21 +0800 CST

Calculate a new column in a Spark DataFrame with PySpark, crossing a token-list column in df1 with a text column in df2

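I am using Spark 2.4.5 and need to compute a sentiment score from a list of tokens in df1 (the MeaningfulWords column), based on the words and scores in df2 (a sentiment dictionary). In df1 I must create a new column holding the list of token scores, and another column holding the mean sentiment of each record (sum of scores / total words).

The DataFrames look like this: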

df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
|                ID|               MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003|                 [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+

df2.show(5)
+-----+----------+
|score|      word|
+-----+----------+
| 1.68|abandonado|
| 3.18|    abejas|
|  2.8|    aborto|
| 2.46| abrasador|
| 8.13|    abrazo|
+-----+----------+

The new columns should look like this:

+------------------+---------------------+
|         MeanScore|            ScoreList|
+------------------+---------------------+
|              2.95|[3.10, 2.50, 1.28,...|
|              2.15|[1.15, 3.50, 2.75,...|
|              2.75|[4.20, 1.00, 1.75,...|
|              3.25|[3.25, 2.50, 3.20,...|
|              3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+

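I have looked at several options using .join, but it fails because the columns have different data types.

I have also checked options such as https://stackoverflow.com/questions/36576196/joining-pyspark-dataframes-on-nested-field, but I cannot join the two columns directly because they have different data types.

I also tried converting the DataFrames to RDDs and using a function, as follows: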

def map_words_to_values(review_words, afinn_dict):
    return [afinn_dict[word] for word in review_words if word in afinn_dict]

RDD1=swRemoved.rdd.map(list) 
RDD2=Dict_df.rdd.map(list)

reviewsRDD_afinn_values = RDD1.map(lambda tupple: (tupple[0], map_words_to_values(tupple[1], RDD2)))
reviewsRDD_afinn_values.take(3)

But with this last option I get the following error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I know how to solve this with pandas, but I would like to find the right way to do it in Spark without hurting performance.

1 Answer

EJS · Answer 1 · 2020-05-10T03:09:47+08:00

My problem was solved at https://stackoverflow.com/questions/61687997/calculate-new-column-in-spark-dataframe-crossing-a-tokens-list-column-in-df1-wi:

You could first join using array_contains(MeaningfulWords, word), then groupBy ID and collect_list the scores of all the words that matched in the join, and then use the higher-order functions transform and aggregate to get the mean score (works in Spark 2.4+).

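Because the accumulator starts from the integer literal 0, the higher-order function aggregate only accepts integer values here, so the scores are first converted with transform (multiplied by 100 and cast to int) and the sum is divided by 100 at the end (assuming at most 2 decimal places, e.g. 2.81).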
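The arithmetic behind this scaling can be checked outside Spark. The following is only a plain-Python sketch of the same three steps (scale, sum as integers, rescale and divide); the function name mean_score_scaled and the decimals parameter are illustrative and are not part of the original answer.

```python
def mean_score_scaled(scores, decimals=2):
    """Mean of `scores` computed through an integer sum (illustrative helper)."""
    factor = 10 ** decimals
    # "transform" step: scale every score to an integer.
    # round() guards against values such as 114.99999999999999.
    scaled = [int(round(s * factor)) for s in scores]
    # "aggregate" step: integer accumulator that starts at 0.
    acc = 0
    for x in scaled:
        acc += x
    # "finish" step: undo the scaling, then divide by the number of scores.
    return (acc / factor) / len(scores)


print(mean_score_scaled([1.68, 2.8, 1.03, 3.68]))  # 2.2975
```

The scaled values are 168, 280, 103 and 368, which sum to 919, so the result is 9.19 / 4 = 2.2975.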

df1.show()

#+------------------+----------------------------+
#|ID                |MeaningfulWords             |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien]               |
#+------------------+----------------------------+

df2.show()

#+-----+---------+
#|score|     word|
#+-----+---------+
#| 1.68|     casa|
#|  2.8|  alejado|
#| 1.03|     buen|
#| 3.68|    gusto|
#| 0.68| clientes|
#|  2.1|contentos|
#| 2.68|    servi|
#| 1.18|    resto|
#| 1.98|     bien|
#+-----+---------+


from pyspark.sql import functions as F
# Join each df1 row with every dictionary word found in its token list.
(df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"))
    .groupBy("ID")
    .agg(F.first("MeaningfulWords").alias("MeaningfulWords"),
         F.collect_list("score").alias("ScoreList"))
    # Scale to ints (round avoids float truncation), sum, then rescale and divide by the count.
    .withColumn("MeanScore", F.expr("""
        aggregate(transform(ScoreList, x -> int(round(x * 100))),
                  0, (acc, x) -> acc + x, acc -> (acc / 100) / size(ScoreList))"""))
    .show(truncate=False))

#+------------------+----------------------------+-----------------------+---------+
#|ID                |MeaningfulWords             |ScoreList              |MeanScore|
#+------------------+----------------------------+-----------------------+---------+
#|abcde00000qMQ00003|[resto, bien]               |[1.18, 1.98]           |1.58     |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68]      |1.82     |
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975   |
#+------------------+----------------------------+-----------------------+---------+
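As for the PicklingError in the RDD attempt from the question: the lookup function was handed an RDD (RDD2) inside another RDD's map, which Spark does not allow. A possible workaround, sketched here in plain Python and not taken from the original answer, is to give the function an ordinary dict instead. The helper mean_or_none and the sample dict are hypothetical.

```python
def map_words_to_values(review_words, score_dict):
    """Return the scores of the words that appear in score_dict (a plain dict)."""
    return [score_dict[w] for w in review_words if w in score_dict]


def mean_or_none(values):
    """Mean of the matched scores, or None when no word matched (illustrative helper)."""
    return sum(values) / len(values) if values else None


# Hypothetical sample data standing in for the sentiment dictionary.
scores = {"resto": 1.18, "bien": 1.98}
print(map_words_to_values(["resto", "bien", "otro"], scores))  # [1.18, 1.98]

# Assumed Spark wiring (not run here): collect df2 on the driver into a dict,
# e.g. dict(df2.rdd.map(lambda r: (r["word"], r["score"])).collect()),
# share it with sc.broadcast(...), and call map_words_to_values with the
# broadcast's .value inside rdd.map. This only suits a dictionary that fits in memory.
```

Unlike the join above, this keeps records with no matching words, which then need explicit handling (here, a None mean).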
