Asked: 2020-04-10 11:41:26 +0800 CST

Extract data from a Regex Parser in Python

I need to extract data from a natural-language string of vehicle sales phrases and obtain a list of dictionaries with two elements each, of the form:

[
  {vehiculo:'Car', Cantidad: 1},
  {vehiculo:'Motorbike', Cantidad: 1}
]

I have almost everything done except the easiest part, which is extracting the labels produced by the RegexpParser grammar.

This is what I have so far, with the input phrase "I sold a car and a motorbike":

1.- Segment the phrase, which gives:

['\nI sold a car and a motorbike']

2.- Tokenized:

['I', 'sold', 'a', 'car', 'and', 'a', 'motorbike']

3.- POS tagger morphological analysis:

[('I', 'PRP'), ('sold', 'VBD'), ('a', 'DT'), ('car', 'NN'), ('and', 'CC'), ('a', 'DT'), ('motorbike', 'NN')]

4.- RegexpParser with the following grammar:

    grammar = r'''
    Vehiculo: {<CD>*<NN>+}  
    {<JJ>*<NN>+}
    {<CD>*<NN><IN>*<NN>+}  
    Cantidad: {<JJ>}
    {<CD>}
    {<DT>}
    '''

And I get:

Parsed Sentence =  (S
  I/PRP
  sold/VBD
  (Cantidad a/DT)
  (Vehiculo car/NN)
  and/CC
  (Cantidad a/DT)
  (Vehiculo motorbike/NN))

My question is how to obtain dictionaries like the following by extracting the labels and data from the parse result above, using some method of the result rather than manually searching for text within its string form:

[
  {vehiculo:'Car', Cantidad: 1},
  {vehiculo:'Motorbike', Cantidad: 1}
]

Thank you and regards,

2 Answers

abulafia · Answer 1 · 2020-04-11T04:21:27+08:00

The result of the RegexpParser is a Tree, and as such it has methods to loop through it, flatten it, and perform many other operations on it. Without knowing exactly what structure all your example sentences can have, or whether a sentence can contain quantities other than "a", it is impossible to give a general solution. In any case, here is a code example that works for this case, which you can adapt to your needs.

First, so that the code is reproducible for everyone, here are all the necessary imports and the earlier steps of the analysis:

import nltk
from nltk.chunk import RegexpParser

# Tokenizer and tagger models required by word_tokenize and pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "I sold a car and a motorbike"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
grammar=r'''
    Vehiculo: {<CD>*<NN>+}  
    {<JJ>*<NN>+}
    {<CD>*<NN><IN>*<NN>+}  
    Cantidad: {<JJ>}
    {<CD>}
    {<DT>}
    '''
resultado = RegexpParser(grammar).parse(tagged)

If you try to just print that result, what you get is its representation as a string:

>>> print(resultado)
(S
  I/PRP
  sold/VBD
  (Cantidad a/DT)
  (Vehiculo car/NN)
  and/CC
  (Cantidad a/DT)
  (Vehiculo motorbike/NN))

But resultado is actually of type Tree. That lets us, at a minimum, iterate over its elements and act accordingly. If an element is a "leaf" (a terminal node), it is a tuple whose element [0] is the word and whose element [1] is its part-of-speech tag. If it is not a leaf, it is an intermediate node with branches of its own (as happens with the Cantidad and Vehiculo chunks). Such a node has a .label() method that returns "Cantidad" or "Vehiculo", and its own sub-nodes, which here are already leaves.
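To make the leaf/node distinction concrete, here is a minimal sketch. It assumes a tree built by hand to mirror the printed parse above (so no tagger or model download is involved), and the helper name describe is made up for illustration. It relies only on Tree.label() and Tree.leaves().

```python
from nltk import Tree

# Hand-built tree mirroring the parse printed above (an assumption made
# so the sketch runs without the tokenizer or tagger models).
resultado = Tree('S', [
    ('I', 'PRP'),
    ('sold', 'VBD'),
    Tree('Cantidad', [('a', 'DT')]),
    Tree('Vehiculo', [('car', 'NN')]),
    ('and', 'CC'),
    Tree('Cantidad', [('a', 'DT')]),
    Tree('Vehiculo', [('motorbike', 'NN')]),
])


def describe(tree):
    """Return one (kind, label_or_tag, words) triple per top-level element."""
    rows = []
    for node in tree:
        if isinstance(node, Tree):
            # Intermediate node: it has a label and its own leaves.
            words = [word for word, _tag in node.leaves()]
            rows.append(('chunk', node.label(), words))
        else:
            # Leaf: a plain (word, tag) tuple.
            word, tag = node
            rows.append(('leaf', tag, [word]))
    return rows


rows = describe(resultado)
for row in rows:
    print(row)
```

Checking isinstance(node, Tree) is equivalent here to checking that the element is not a tuple; it just states the intent more directly.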

With this information we can set up a loop like the following:

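data = []
for nodo in resultado:
    # Top-level leaves are (word, tag) tuples: not part of any chunk, skip them
    if isinstance(nodo, tuple):
        continue
    tipo = nodo.label()
    for elemento in nodo:
        # Skip nested subtrees; only leaves carry a word
        if not isinstance(elemento, tuple):
            continue
        palabra, categoria = elemento
        if tipo == 'Vehiculo':
            data.append(dict(vehiculo=palabra, cantidad=1))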

At the end, data will contain:

[{'cantidad': 1, 'vehiculo': 'car'}, {'cantidad': 1, 'vehiculo': 'motorbike'}]

Naturally, as I said, this is not very general: every time a vehicle appears I count it with cantidad=1. I don't know whether you need to support phrases like "I sold two cars and three motorbikes", which would complicate things considerably, or whether your structures can have deeper levels of nesting. In any case, this should give you a starting point. See also the Tree documentation.
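As a rough idea of how the numeric case might be handled, the sketch below pairs each Vehiculo chunk with the quantity found either in the preceding Cantidad chunk or in a CD token inside the chunk itself. The NUMBER_WORDS table and the extract_sales function are hypothetical names introduced for illustration, and the input tree is built by hand on the assumption that an adjusted grammar would produce chunks of this shape (the grammar in the question does not match plural NNS tags).

```python
from nltk import Tree

# Hypothetical lookup table of number words; extend it as needed.
NUMBER_WORDS = {'a': 1, 'an': 1, 'one': 1, 'two': 2,
                'three': 3, 'four': 4, 'five': 5}


def extract_sales(tree):
    """Collect {'vehiculo': ..., 'cantidad': ...} dicts from a chunk tree."""
    data = []
    pending = None  # quantity from the most recent Cantidad chunk, if any
    for node in tree:
        if not isinstance(node, Tree):
            continue  # plain (word, tag) leaf outside any chunk
        if node.label() == 'Cantidad':
            word = node.leaves()[0][0].lower()
            pending = NUMBER_WORDS.get(word, 1)
        elif node.label() == 'Vehiculo':
            cantidad = pending if pending is not None else 1
            nouns = []
            for word, tag in node.leaves():
                if tag == 'CD':
                    # A number inside the chunk overrides the pending one.
                    cantidad = NUMBER_WORDS.get(word.lower(), cantidad)
                else:
                    nouns.append(word)
            data.append({'vehiculo': ' '.join(nouns), 'cantidad': cantidad})
            pending = None  # a quantity applies to one vehicle only
    return data


# Assumed parse of "I sold two cars and three motorbikes".
arbol = Tree('S', [
    ('I', 'PRP'),
    ('sold', 'VBD'),
    Tree('Cantidad', [('two', 'CD')]),
    Tree('Vehiculo', [('cars', 'NNS')]),
    ('and', 'CC'),
    Tree('Vehiculo', [('three', 'CD'), ('motorbikes', 'NNS')]),
])

ventas = extract_sales(arbol)
print(ventas)
```

Unknown number words fall back to 1 here, which is a simplification. Digits such as "3" or larger numbers would need their own handling.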

Kelvinator · Answer 2 · 2020-01-21T05:59:04+08:00

In the end I solved it this way. I suppose there are simpler ways to do it, with better results, but my knowledge at the time was what it was. I hope it helps you:

# -*- coding: utf-8 -*-
"""
Created on Sat Mar 30 19:46:27 2019

Practice using REGEX TAGGER

@author: Luis Martinez Martin
"""

# Import the libraries we are going to work with
import nltk
from nltk.chunk.util import conlltags2tree, tree2conlltags
#from nltk import ChunkParserI
import nltk.chunk, nltk.tag
from nltk.corpus import conll2000

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
       train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] 
                     for sent in train_sents]

       self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
       pos_tags = [pos for (word,pos) in sentence]
       tagged_pos_tags = self.tagger.tag(pos_tags)
       chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
       conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                    in zip(sentence, chunktags)] 
       return nltk.chunk.conlltags2tree(conlltags)    


class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)] for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (words, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                    in zip(sentence, chunktags)] 
        return nltk.chunk.conlltags2tree(conlltags)    

class TrigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]
        for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
        in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags) 


# Sentence segmentation function
def Segmentacion(menu):
    sentences = nltk.tokenize.sent_tokenize(menu)
    return sentences

# Tokenization function
def Tokenizacion(sentences):
    tokens = nltk.word_tokenize(sentences, "spanish")
    return tokens

# Morphological analysis function (POS tagger)
def Pos_Tag(tokens):
    tagged = nltk.pos_tag(tokens)
    return tagged

# RegexpParser function
def RegPar(menu):  

    grammar = r'''
    Comida: {<CD>*<NN>+}  # optional number + one or more nouns (1 bocadillo)
    {<JJ>*<NN>+}
    {<CD>*<NN><IN>*<NN>+}  # optional number + noun + preposition + noun (1 bocadillo de calamares)
    Cantidad: {<JJ>}
    {<CD>}
    {<DT>}
    {<NN>}
    '''

    regex_parser = nltk.RegexpParser(grammar)
    parsed_sentence = regex_parser.parse(menu)

    return(parsed_sentence)

def GeneraArray(resultado):
    # Number words mapped to their value (the code expects them tagged NN or NNS)
    numeros = {'dos': 2, 'tres': 3, 'cuatro': 4, 'cinco': 5, 'seis': 6,
               'siete': 7, 'ocho': 8, 'nueve': 9, 'diez': 10}
    # Words that must never be reported as food
    excluidas = set(numeros) | {'y', ',', '.'}

    data = []
    for nodo in resultado:
        if type(nodo) == tuple:
            continue
        tipo = nodo.label()
        cant = 1
        for elemento in nodo:
            if type(elemento) != tuple:
                continue
            palabra, categoria = elemento

            if categoria == 'JJ' and palabra in ('un', 'una'):
                cant = 1
            if categoria in ('NN', 'NNS') and palabra in numeros:
                cant = numeros[palabra]

            if tipo == 'Comida' and palabra not in excluidas:
                data.append(dict(comida=palabra, cantidad=cant))
    return data

def carga_corpus():
    corpus = "Quisiera pedir un hamburguesa,Quiero una tortilla y una cerveza,Me pones un pollo y una ensalada,Quiero una paella,Quiero un bocadillo,Quiero una pizza,Ponme una sopa,Quiero un filete,Quisiera pedir una ensalada,Quiero cinco bocadillos,Quisiera una empanada,Quiero unas croquetas,Quisiera morcilla,Quiero pedir un solomillo,Quiero unos macarrones,Quiero una Lasagna,Quiero una hamburguesa, una de patatas fritas y una cerveza,Quiero un lenguado,Quiero un bonito,Quisiera una sepia,Quiero cinco cervezas,Quiero tres sidras y tres pinchos,Quiero cinco manzanas y tres melocotones,Quisiera cuatro solomillos,Quiero una naranja y dos peras"

    return (corpus)


# Main function
def main():
    # load the corpus of restaurant orders
    corpora = carga_corpus()

    Segm = Segmentacion(corpora)
    print ("\n\n1. Frases:",Segm)

    tok = Tokenizacion(corpora)
    print ("\n\n2. Tokens:",tok)

    ptag = Pos_Tag(tok)
    print ("\n\n3. Analisis Morfologico:",ptag)

    # Build the Regex Parser
    RegexParser = RegPar(ptag)
    print("\n\n4 Parsed Sentence = ", RegexParser)  

    GeneraSalida = GeneraArray(RegexParser)
    print("\n\n5 Salida = ",GeneraSalida)

    iob_tags = tree2conlltags(RegexParser)
    print ("\n\n6 IOB Tags = ",iob_tags)

    tree = conlltags2tree(iob_tags)
    print  ("\n\n7 Tree = ",tree)


    test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

    Uchunker = UnigramChunker(train_sents)
    print("\n\n8 Acierto con unigramas: ", Uchunker.evaluate(test_sents))
    print("\n\n9 SENTENCE: ", Uchunker.parse(ptag))  

    Bchunker = BigramChunker(train_sents)
    print("\n\n10 Acierto con Bigramas: ", Bchunker.evaluate(test_sents))
    print("\n\n11 SENTENCE: ", Bchunker.parse(ptag))  

    Tchunker = TrigramChunker(train_sents)
    print("\n\n12 Acierto con Trigramas: ", Tchunker.evaluate(test_sents))
    print("\n\n13 SENTENCE: ", Tchunker.parse(ptag))          


if __name__ == "__main__":
    main()
