dopg.GoodmanDOP

init(self, treebank, rootsymbol=`'S'`, wrap=False, cnf=True, cleanup=True, parser=<class 'nltk.parse.pchart.InsideChartParser'>, **parseroptions)
(Constructor)

initialize a DOP model given a treebank. uses the Goodman
reduction of a STSG to a PCFG.  after initialization,
self.parser will contain an InsideChartParser.

>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> print d.grammar
        Grammar with 12 productions (start state = S)
                NP -> 'mary' [1.0]
                NP@1 -> 'mary' [1.0]
                S -> NP VP [0.25]
                S -> NP VP@2 [0.25]
                S -> NP@1 VP [0.25]
                S -> NP@1 VP@2 [0.25]
                S@0 -> NP VP [0.25]
                S@0 -> NP VP@2 [0.25]
                S@0 -> NP@1 VP [0.25]
                S@0 -> NP@1 VP@2 [0.25]
                VP -> 'walks' [1.0]
                VP@2 -> 'walks' [1.0]
>>> print d.parser.parse("mary walks".split())
(S (NP mary) (VP@2 walks)) (p=0.25)             

@param treebank: a list of Tree objects. Caveat lector:
        terminals may not have (non-terminals as) siblings.
@param wrap: boolean specifying whether to add the start symbol
        to each tree
@param parser: a class which will be instantiated with the DOP 
        model as its grammar. Support BitParChartParser.

instance variables:
- self.grammar a WeightedGrammar containing the PCFG reduction
- self.fcfg a list of strings containing the PCFG reduction 
  with frequencies instead of probabilities
- self.parser an InsideChartParser object
- self.exemplars dictionary of known parse trees (memoization)

decorate_with_ids(self, tree, ids)

source code

add unique identifiers to each non-terminal of a tree.

>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> d.decorate_with_ids(tree, count())
Tree('S@0', [Tree('NP@1', ['mary']), Tree('VP@2', ['walks'])])

Parameters:

ids - an iterator yielding a stream of IDs

nodefreq(self, tree, nonterminalfd, leaves=1)

source code

count frequencies of nodes by calculating the number of subtrees headed by each node. updates "nonterminalfd" as a side effect

>>> fd = FreqDist()
>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> d.nodefreq(tree, fd)
4
>>> fd.items()
[('S', 4), ('NP', 1), ('VP', 1)]

#[('S', 9), ('NP', 2), ('VP', 2), ('mary', 1), ('walks', 1)]

Parameters:

nonterminalfd - the FreqDist to store the counts in.

goodman(self, tree, utree, bitparfmt=True)

source code

given a parsetree from a treebank, yield a goodman reduction of eight rules per node (in the case of a binary tree).

>>> tree = Tree("(S (NP mary) (VP walks))")
>>> d = GoodmanDOP([tree])
>>> utree = d.decorate_with_ids(tree, count())
>>> sorted(d.goodman(tree, utree, False))
[(NP, ('mary',)), (NP@1, ('mary',)), (S, (NP, VP)), (S, (NP, VP@2)),
(S, (NP@1, VP)), (S, (NP@1, VP@2)), (S@0, (NP, VP)), 
(S@0, (NP, VP@2)), (S@0, (NP@1, VP)), (S@0, (NP@1, VP@2)), 
(VP, ('walks',)), (VP@2, ('walks',))]

probabilities(self, cfg, fd)

source code

merge cfg and frequency distribution into a pcfg with the right probabilities.

Parameters:

cfg - a list of Productions
nonterminalfd - a FreqDist of (non)terminals (with and without IDs)

frequencies(self, cfg, fd)

source code

merge cfg and frequency distribution into a list of weighted productions with frequencies as weights (as expected by bitpar).

Parameters:

cfg - a list of Productions
nonterminalfd - a FreqDist of (non)terminals (with and without IDs)

mostprobableparse(self, sent, sample=None)

source code

warning: this problem is NP-complete. using an unsorted chart parser avoids unnecessary sorting (since we need all derivations anyway).

Parameters:

sent - a sequence of terminals
sample - None or int; if int then sample that many parses

mostconstituentscorrect(self, sent)

source code

not working yet. almost verbatim translation of Goodman's (1996) most constituents correct parsing algorithm, except for python's zero-based indexing. needs to be modified to return the actual parse tree. expects a pcfg in the form of a dictionary from productions to probabilities

Class GoodmanDOP

init(self, treebank, rootsymbol=`'S'`, wrap=False, cnf=True, cleanup=True, parser=<class 'nltk.parse.pchart.InsideChartParser'>, **parseroptions)
(Constructor)

decorate_with_ids(self, tree, ids)

nodefreq(self, tree, nonterminalfd, leaves=1)

goodman(self, tree, utree, bitparfmt=True)

probabilities(self, cfg, fd)

frequencies(self, cfg, fd)

parse(self, sent)

mostprobableparse(self, sent, sample=None)

mostconstituentscorrect(self, sent)

Class GoodmanDOP

__init__(self, treebank, rootsymbol='S', wrap=False, cnf=True, cleanup=True, parser=<class 'nltk.parse.pchart.InsideChartParser'>, **parseroptions) (Constructor)

decorate_with_ids(self, tree, ids)

nodefreq(self, tree, nonterminalfd, leaves=1)

goodman(self, tree, utree, bitparfmt=True)

probabilities(self, cfg, fd)

frequencies(self, cfg, fd)

parse(self, sent)

mostprobableparse(self, sent, sample=None)

mostconstituentscorrect(self, sent)

init(self, treebank, rootsymbol=`'S'`, wrap=False, cnf=True, cleanup=True, parser=<class 'nltk.parse.pchart.InsideChartParser'>, **parseroptions)
(Constructor)