Hi, does anyone know a method for finding non-exact (fuzzy) matches between text strings?
For example:
I have the text "STATUS MSG PACK ACM L" (column 1) and the search should return "PACK L" (column 2).
I have two lists: one written by a person, containing longer texts, and another containing the correct messages to search for.
I attach an example of the two lists. Each element of column 1 should be looked up in column 2, returning the most closely related element of column 2:
https://drive.google.com/file/d/0B11sJdX_AaJBd2lvWGszaFpXM2c/view?usp=sharing
For fuzzy searches there are many tools and methods, but plain Python already ships with the standard-library module `difflib`, which lets us obtain a similarity `ratio` between strings. For example, if we measure the similarity of `Hola Mundo` against other strings, we see that, logically, `Hola Mundo!` obtains a higher similarity ratio than `Hola Mundo cruel`.

The idea, then, is to go through the first list and, for each element, compute the ratio against every element of the second list; the element with the largest ratio is the most similar. If the results are collected in a list such as `matches`, sorted from most to least similar, its first element should be the best candidate.

Important: this approach always finds some "similarity". As an additional improvement, you may need to set a minimum similarity `ratio` below which no match is considered to have been found; this value can only be determined by experimenting.

Even better is the approach suggested by FjSevilla, because it is more compact and already includes the logic for evaluating the minimum ratio.
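A minimal sketch of the steps above, not necessarily the exact code of the answer. The helper names `similarity`, `rank_candidates` and `best_match`, and the candidate `ENGINE FIRE`, are made up for illustration. I am also assuming that the compact variant is `difflib.get_close_matches`, whose `cutoff` parameter plays the role of the minimum ratio.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Similarity ratio between two strings, in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def rank_candidates(text, candidates):
    """Return the candidates sorted from most to least similar to `text`."""
    return sorted(candidates, key=lambda c: similarity(text, c), reverse=True)


def best_match(text, candidates, cutoff=0.6):
    """Return the closest candidate, or None if nothing reaches `cutoff`.

    Assumption: this wraps difflib.get_close_matches, which already
    discards candidates whose ratio is below `cutoff`.
    """
    found = get_close_matches(text, candidates, n=1, cutoff=cutoff)
    return found[0] if found else None


if __name__ == "__main__":
    # Step 1: compare one string against a couple of variants.
    for other in ["Hola Mundo!", "Hola Mundo cruel"]:
        print(other, similarity("Hola Mundo", other))

    # Steps 2 and 3: manual ranking, then the compact variant.
    # "ENGINE FIRE" is a hypothetical entry, not taken from the real list.
    short_messages = ["PACK L", "ENGINE FIRE"]
    long_text = "STATUS MSG PACK ACM L"
    matches = rank_candidates(long_text, short_messages)
    print(matches[0])
    print(best_match(long_text, short_messages, cutoff=0.4))
```

Keep in mind that for the pair from the question the ratio is only about 0.44, so with the default `cutoff=0.6` nothing would be returned. When a long text is compared against a short message the threshold probably has to be lowered, and the right value still has to be found by experimenting with the real data.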
As a curiosity, it is worth noting that `difflib` is largely based on the Ratcliff/Obershelp algorithm, described in the late 1980s under the name "Pattern Matching: The Gestalt Approach".
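To make that a bit more concrete: the ratio that `difflib` reports is 2·M/T, where M is the number of matching characters and T is the combined length of both strings. A small sketch that recomputes it from the matching blocks (the function name `ratio_from_blocks` is just illustrative):

```python
from difflib import SequenceMatcher


def ratio_from_blocks(a, b):
    """Recompute the similarity ratio as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each block describes a run of equal characters; the last one is a
    # zero-length sentinel, so it does not affect the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0


if __name__ == "__main__":
    a, b = "Hola Mundo", "Hola Mundo!"
    print(ratio_from_blocks(a, b), SequenceMatcher(None, a, b).ratio())
```

Both printed values should agree, which shows that the ratio only depends on how many characters fall inside the common blocks found by the algorithm.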