I have a data frame similar to this:
library(stringr) # for the str_extract function
ejemplo <- data.frame(columna= c("{ em , 680/3 }", "no 659-11", "no funciona la 2507",
"a-4-2 no funciona", "p-8 con presencia de arena", " no 2-5c12s si sirve",
"mecanica no 1ty -22s brinca"))
I am interested in extracting the part of each string that is:
- number
- letter plus number
- number plus letter
and, in general, strings referring to "identifiers".
Using a simple regular expression, I can extract only the first run of digits in each string:
ejemplo$col2 <- ifelse(grepl("[0-9]+", ejemplo$columna), str_extract(ejemplo$columna, "[0-9]+"), NA)
ejemplo
                      columna col2
1              { em , 680/3 }  680
2                   no 659-11  659
3         no funciona la 2507 2507
4           a-4-2 no funciona    4
5  p-8 con presencia de arena    8
6         no 2-5c12s si sirve    2
7 mecanica no 1ty -22s brinca    1
This gives the right result only for the third string. Honestly, I feel lost: I don't know whether everything I want can be done in one go or whether I will have to build several regular expressions to achieve it. I would greatly appreciate any guidance on this.
The desired output would be:
deseado
1 680/3
2 659-11
3 2507
4 a-4-2
5 p-8
6 2-5c12s
7 1ty-22s
Well, the following expression occurs to me, which is quite complex but I don't see how to simplify it:
It would read more or less like this: letters, digits, the sign - or the sign /.
Basically, the idea is to force at least one digit to appear among the letters, because a simpler expression such as
[0-9a-zA-Z\/-]+
would also have matched any word in the text, even one with no digits. Demo on regex101.
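The expression itself is not reproduced above, so what follows is only a sketch of a pattern consistent with the description, not necessarily the one from the answer. It assumes the allowed characters are ASCII letters, digits, `/` and `-`, and requires one digit somewhere in the run. The names `permitidos` and `patron` are made up for illustration.

```r
library(stringr)

ejemplo <- data.frame(columna = c(
  "{ em , 680/3 }", "no 659-11", "no funciona la 2507",
  "a-4-2 no funciona", "p-8 con presencia de arena",
  " no 2-5c12s si sirve", "mecanica no 1ty -22s brinca"
))

# Assumed pattern (not the answer's original expression):
# any allowed characters, then one mandatory digit, then any allowed characters.
permitidos <- "[0-9a-zA-Z/-]"
patron <- paste0(permitidos, "*[0-9]", permitidos, "*")

# str_extract() already returns NA when nothing matches,
# so the ifelse()/grepl() guard from the question is not needed.
ejemplo$col2 <- str_extract(ejemplo$columna, patron)
ejemplo$col2
```

With this sketch, rows 1 to 6 come out as in the desired output. Row 7 yields `1ty` rather than `1ty-22s`, because the space in `1ty -22s` is not an allowed character and therefore ends the match.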
Abulafia did a great job finding a monolithic expression that solves all the cases you have presented. Another way is to write several more exact patterns, one for each kind of id, placing the most complex and specific ones at the beginning and the most generic ones at the end. We build a vector with the patterns and paste them into a single string with a | (or). This is much less performant than abulafia's solution, but in some cases it can make the logic clearer.
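The patterns from this answer are not shown here, so the vector below is a hypothetical set written only to illustrate the technique: one pattern per kind of id, ordered from most specific to most generic, then joined with `|`. The name `patrones` and every individual pattern are assumptions, and they only cover lowercase letters, as in the sample data.

```r
library(stringr)

ejemplo <- data.frame(columna = c(
  "{ em , 680/3 }", "no 659-11", "no funciona la 2507",
  "a-4-2 no funciona", "p-8 con presencia de arena",
  " no 2-5c12s si sirve", "mecanica no 1ty -22s brinca"
))

# Hypothetical patterns, most specific first, most generic last.
patrones <- c(
  "[a-z]+-[0-9]+-[0-9]+",  # letters-number-number, e.g. a-4-2
  "[0-9]+-[0-9a-z]+",      # number-alphanumeric,   e.g. 659-11, 2-5c12s
  "[0-9]+/[0-9]+",         # number/number,         e.g. 680/3
  "[a-z]+-[0-9]+",         # letters-number,        e.g. p-8
  "[0-9]+[a-z]+",          # number then letters,   e.g. 1ty
  "[0-9]+"                 # bare number,           e.g. 2507
)

# Join the alternatives into a single regular expression.
patron <- paste(patrones, collapse = "|")

ejemplo$col2 <- str_extract(ejemplo$columna, patron)
ejemplo$col2
```

The order matters because, at a given starting position, the alternatives are tried from left to right and the first one that matches wins. If the bare-number pattern came first, `680/3` would be cut down to `680`. As with the single-expression approach, row 7 gives `1ty`, since none of these patterns spans the space.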