Question

EJS

Asked: 2020-07-30 14:13:11 +0800 CST

Replace different very similar texts, by a fixed value using pyspark

I'm doing NLP with PySpark on some customer reviews, and I want to hide the company's brand name by replacing every match in the text with a fixed value (e.g. "marca", Spanish for "brand"). I have tried writing a function that uses regexp_replace, but because the brand can be written in many ways this is not very practical, and what I have feels clumsy.

The function I have is something like this:

def anonimization(column):
    col=regexp_replace(column,'las tres hermanas','marca')
    col=regexp_replace(col,'treshermanas','marca')
    col=regexp_replace(col,'tres hermanas','marca')
    col=regexp_replace(col,'la tres hermana','marca')
    col=regexp_replace(col,'3hermanas','marca')
    col=regexp_replace(col,'las tres herman','marca')
                         
    return col

and the call is this:

cleaned_text=cleaned_text.select('ID','Year',anonimization(col('text')).alias('text'),'TypeComment')

To begin with, it does not replace the matches correctly. Secondly, I don't think it is the best approach: any small variation (e.g. a typo) would no longer be recognized, so the list in the function could grow very long.

I would like to find a more efficient way to use regexp_replace to solve this problem, or to know whether another method would work better.

The input text is already lowercase and extraneous characters have been removed.

I am using PySpark on top of Spark 2.4.5.

1 Answer

Jino Michel Aque · Answer 1 · 2020-08-06T20:19:02+08:00

Rather than trying to enumerate every spelling, I would measure the similarity between each string and the brand name and filter on that. I use the Levenshtein distance (the number of single-character insertions, deletions, or substitutions needed to turn one string into the other) and treat any value within a fixed distance as a match.
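As a rough illustration of the metric, here is a plain-Python sketch. It is not Spark's implementation, and the helper names levenshtein_distance and normalize are made up for this example. Values are lowercased and stripped of spaces before comparison.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lowercase and remove spaces (hypothetical helper for this sketch)."""
    return text.lower().replace(" ", "")


control = normalize("LAS TRES HERMANAS")  # "lastreshermanas"
print(levenshtein_distance(normalize("treshermanas"), control))  # 3
print(levenshtein_distance(normalize("3hermanas"), control))     # 7
```

Under this normalization the farthest sample variant, '3hermanas', is 7 edits from the reference, which is consistent with the threshold of 7 used in the code that follows. A threshold that large is also permissive, so it is worth checking it against unrelated phrases in your own data.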

I changed the function so that it takes the DataFrame and the brand to match as parameters. Here is my solution, with an example DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, col, regexp_replace, levenshtein, lit, when

spark = SparkSession.builder.master("local").appName("String Distance").getOrCreate()

df = spark.createDataFrame(
    [(1, 33, 'las tres hermanas'),
    (2, 45, 'treshermanas'),
    (3, 12, 'tres hermanas'),
    (4, 14, 'la tres hermana'),
    (5, 73, '3hermanas'),
    (6, 62, 'las tres herman'),
    (7, 14, 'la tres HERMANAS'),
    (8, 14, 'son todas diferntes'),
    (10, 12, 'tres tristes tigres'),], ['id', 'cantidad', 'marca'])

df.show()
"""
+---+--------+-------------------+
| id|cantidad|              marca|
+---+--------+-------------------+
|  1|      33|  las tres hermanas|
|  2|      45|       treshermanas|
|  3|      12|      tres hermanas|
|  4|      14|    la tres hermana|
|  5|      73|          3hermanas|
|  6|      62|    las tres herman|
|  7|      14|   la tres HERMANAS|
|  8|      14|son todas diferntes|
| 10|      12|tres tristes tigres|
+---+--------+-------------------+
"""

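def anonimization(dataframe, marca, max_distance=7):
    # Normalize the reference brand like the column values: lowercase, no spaces.
    marca_control = marca.lower().replace(" ", "")
    # Compute the edit distance between each normalized value and the brand.
    stringDistanceDf = dataframe.\
        withColumn("marca_limpia", regexp_replace(lower(col("marca")), " ", "")).\
        withColumn("control_str", lit(marca_control)).\
        withColumn("string_distance", levenshtein(col("marca_limpia"), col("control_str")))

    # Values within max_distance edits count as the brand; the rest become "desconocido" (unknown).
    new_column_2 = when(col("string_distance") <= max_distance, lit("marca")).otherwise(lit("desconocido"))
    # Keep only the anonymized column and drop the original and helper columns.
    finalDf = stringDistanceDf.\
        withColumn("marca_anom", new_column_2).\
        drop("marca","marca_limpia","control_str","string_distance")
    return finalDf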

marca = "LAS TRES HERMANAS"
testDf = anonimization(df, marca)
testDf.show()
"""
+---+--------+-----------+
| id|cantidad| marca_anom|
+---+--------+-----------+
|  1|      33|      marca|
|  2|      45|      marca|
|  3|      12|      marca|
|  4|      14|      marca|
|  5|      73|      marca|
|  6|      62|      marca|
|  7|      14|      marca|
|  8|      14|desconocido|
| 10|      12|desconocido|
+---+--------+-----------+
"""

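If the regexp_replace route from the question is preferred, the spellings listed there can also be folded into one pattern instead of six chained calls. The sketch below uses Python's re module only to demonstrate the pattern. BRAND_PATTERN and mask_brand are hypothetical names, and the pattern covers just the variants listed in the question, so it will not catch arbitrary typos. The same pattern string should also be acceptable to regexp_replace, which uses Java regular expressions, but that is untested here.

```python
import re

# One pattern for the spellings listed in the question (input is assumed
# to be lowercase already, as stated there).
BRAND_PATTERN = re.compile(
    r"\b(?:las?\s*)?"     # optional article "la"/"las", with or without a space
    r"(?:tres|3)\s*"      # "tres" or the digit 3, then optional whitespace
    r"herman(?:as|a)?\b"  # "herman", "hermana" or "hermanas"
)


def mask_brand(text: str, replacement: str = "marca") -> str:
    """Replace every occurrence of the brand inside text (hypothetical helper)."""
    return BRAND_PATTERN.sub(replacement, text)


variants = [
    "las tres hermanas",
    "treshermanas",
    "tres hermanas",
    "la tres hermana",
    "3hermanas",
    "las tres herman",
]
for variant in variants:
    print(variant, "->", mask_brand(variant))

print(mask_brand("me encanta la tres hermana de siempre"))
```

Unlike the distance filter, this replaces the brand inside longer text and leaves the rest of the review intact, at the cost of having to extend the pattern by hand when a new spelling appears.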