I'm doing NLP with PySpark on some customer reviews and I want to hide the company's brand name by replacing every match in the text with a fixed value (e.g. "brand"). I have tried writing a function that uses regexp_replace, but since there are many ways to write the brand, this is not very practical, and what I am using seems a bit clumsy.
The function I have is something like this:
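from pyspark.sql.functions import col, regexp_replace

def anonimization(column):
    # Replace each known spelling of the brand with the placeholder 'marca'
    result = regexp_replace(column, 'las tres hermanas', 'marca')
    result = regexp_replace(result, 'treshermanas', 'marca')
    result = regexp_replace(result, 'tres hermanas', 'marca')
    result = regexp_replace(result, 'la tres hermana', 'marca')
    result = regexp_replace(result, '3hermanas', 'marca')
    result = regexp_replace(result, 'las tres herman', 'marca')
    return result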
and the call is this:
cleaned_text=cleaned_text.select('ID','Year',anonimization(col('text')).alias('text'),'TypeComment')
To begin with, it is not replacing all the matches correctly. Secondly, I don't think this is the best approach: any small variation (e.g. a typo) would no longer be matched, which means the list in the function could grow very long.
I would like to find a more efficient way to use regexp_replace to solve this problem, or another method altogether.
The input text is already lowercase and extraneous characters have been removed.
I am using PySpark on top of Spark 2.4.5.
For what you are asking, rather than trying to list every possible spelling, I would measure the similarity between each candidate string and the brand name and filter on that. I'm going to use the Levenshtein distance and filter by a fixed threshold.
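To make the idea concrete, here is a minimal plain-Python sketch of distance-based matching, independent of Spark. The helper names (`levenshtein`, `anonymize`), the brand string "tres hermanas", and the threshold of 2 are illustrative assumptions, not taken from the question. It slides a window of one or two words over the text and replaces any window whose edit distance to the brand is within the threshold.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def anonymize(text, brand="tres hermanas", placeholder="marca", max_distance=2):
    """Replace word windows within max_distance edits of brand.

    All default values are assumptions chosen for illustration.
    """
    words = text.split()
    max_window = len(brand.split())
    result = []
    i = 0
    while i < len(words):
        # Try the widest window first, then shrink down to a single word,
        # so that "treshermanas" (one word) can still match the brand.
        for size in range(max_window, 0, -1):
            window = words[i:i + size]
            if levenshtein(" ".join(window), brand) <= max_distance:
                result.append(placeholder)
                i += len(window)
                break
        else:
            # No window starting here is close enough: keep the word as is.
            result.append(words[i])
            i += 1
    return " ".join(result)


print(anonymize("me encanta tres hermanas"))
print(anonymize("compre en treshermanas ayer"))
print(anonymize("la tres hermana es buena"))
```

A function like this could probably be wrapped in a `pyspark.sql.functions.udf` and applied to the text column, at the cost of running Python code per row. Spark also ships a built-in `pyspark.sql.functions.levenshtein` that compares two string columns, which may be faster when the candidate strings are already in their own column. Note that a threshold this loose would also match unrelated phrases such as "tres hermanos", and that a variant like "3hermanas" is too far from the brand to be caught, so the threshold and the brand string need tuning against real data.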
I changed the function to take the DataFrame and the brand to filter as parameters. Here is my solution with an example DataFrame: