Module dopg :: Class GoodmanDOP
[hide private]
[frames] | no frames]

Class GoodmanDOP

source code

Instance Methods [hide private]
 
__init__(self, treebank, rootsymbol='S', wrap=False, cnf=True, cleanup=True, parser=<class 'nltk.parse.pchart.InsideChartParser'>, **parseroptions)
initialize a DOP model given a treebank.
source code
 
goodmanfd(self, tree, ids, nonterminalfd) source code
 
decorate_with_ids(self, tree, ids)
add unique identifiers to each non-terminal of a tree.
source code
 
nodefreq(self, tree, nonterminalfd, leaves=1)
count frequencies of nodes by calculating the number of subtrees headed by each node.
source code
 
goodman(self, tree, utree, bitparfmt=True)
given a parsetree from a treebank, yield a goodman reduction of eight rules per node (in the case of a binary tree).
source code
 
probabilities(self, cfg, fd)
merge cfg and frequency distribution into a pcfg with the right probabilities.
source code
 
frequencies(self, cfg, fd)
merge cfg and frequency distribution into a list of weighted productions with frequencies as weights (as expected by bitpar).
source code
 
removeids(self, tree)
remove unique IDs introduced by the Goodman reduction
source code
 
parse(self, sent)
memoize parse trees.
source code
 
mostprobableparse(self, sent, sample=None)
warning: this problem is NP-complete.
source code
 
mostconstituentscorrect(self, sent)
not working yet.
source code
Method Details [hide private]

__init__(self, treebank, rootsymbol='S', wrap=False, cnf=True, cleanup=True, parser=<class 'nltk.parse.pchart.InsideChartParser'>, **parseroptions)
(Constructor)

source code 
initialize a DOP model given a treebank. uses the Goodman
reduction of a STSG to a PCFG.  after initialization,
self.parser will contain an InsideChartParser.

>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> print d.grammar
        Grammar with 12 productions (start state = S)
                NP -> 'mary' [1.0]
                NP@1 -> 'mary' [1.0]
                S -> NP VP [0.25]
                S -> NP VP@2 [0.25]
                S -> NP@1 VP [0.25]
                S -> NP@1 VP@2 [0.25]
                S@0 -> NP VP [0.25]
                S@0 -> NP VP@2 [0.25]
                S@0 -> NP@1 VP [0.25]
                S@0 -> NP@1 VP@2 [0.25]
                VP -> 'walks' [1.0]
                VP@2 -> 'walks' [1.0]
>>> print d.parser.parse("mary walks".split())
(S (NP mary) (VP@2 walks)) (p=0.25)             

@param treebank: a list of Tree objects. Caveat lector:
        terminals may not have (non-terminals as) siblings.
@param wrap: boolean specifying whether to add the start symbol
        to each tree
@param parser: a class which will be instantiated with the DOP 
        model as its grammar. Support BitParChartParser.

instance variables:
- self.grammar a WeightedGrammar containing the PCFG reduction
- self.fcfg a list of strings containing the PCFG reduction 
  with frequencies instead of probabilities
- self.parser an InsideChartParser object
- self.exemplars dictionary of known parse trees (memoization)

decorate_with_ids(self, tree, ids)

source code 

add unique identifiers to each non-terminal of a tree.

>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> d.decorate_with_ids(tree, count())
Tree('S@0', [Tree('NP@1', ['mary']), Tree('VP@2', ['walks'])])
Parameters:
  • ids - an iterator yielding a stream of IDs

nodefreq(self, tree, nonterminalfd, leaves=1)

source code 

count frequencies of nodes by calculating the number of subtrees headed by each node. updates "nonterminalfd" as a side effect

>>> fd = FreqDist()
>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> d.nodefreq(tree, fd)
4
>>> fd.items()
[('S', 4), ('NP', 1), ('VP', 1)]

#[('S', 9), ('NP', 2), ('VP', 2), ('mary', 1), ('walks', 1)]

Parameters:
  • nonterminalfd - the FreqDist to store the counts in.

goodman(self, tree, utree, bitparfmt=True)

source code 

given a parsetree from a treebank, yield a goodman reduction of eight rules per node (in the case of a binary tree).

>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> utree = d.decorate_with_ids(tree, count())
>>> sorted(d.goodman(tree, utree, False))
[(NP, ('mary',)), (NP@1, ('mary',)), (S, (NP, VP)), (S, (NP, VP@2)),
(S, (NP@1, VP)), (S, (NP@1, VP@2)), (S@0, (NP, VP)), 
(S@0, (NP, VP@2)), (S@0, (NP@1, VP)), (S@0, (NP@1, VP@2)), 
(VP, ('walks',)), (VP@2, ('walks',))]

probabilities(self, cfg, fd)

source code 

merge cfg and frequency distribution into a pcfg with the right probabilities.

Parameters:
  • cfg - a list of Productions
  • nonterminalfd - a FreqDist of (non)terminals (with and without IDs)

frequencies(self, cfg, fd)

source code 

merge cfg and frequency distribution into a list of weighted productions with frequencies as weights (as expected by bitpar).

Parameters:
  • cfg - a list of Productions
  • nonterminalfd - a FreqDist of (non)terminals (with and without IDs)

parse(self, sent)

source code 

memoize parse trees. TODO: maybe add option to add every parse tree to the set of exemplars, ie., incremental learning. this uses the most probable derivation (not very good).

mostprobableparse(self, sent, sample=None)

source code 

warning: this problem is NP-complete. using an unsorted chart parser avoids unnecessary sorting (since we need all derivations anyway).

Parameters:
  • sent - a sequence of terminals
  • sample - None or int; if int then sample that many parses

mostconstituentscorrect(self, sent)

source code 

not working yet. almost verbatim translation of Goodman's (1996) most constituents correct parsing algorithm, except for python's zero-based indexing. needs to be modified to return the actual parse tree. expects a pcfg in the form of a dictionary from productions to probabilities