Module arbobanko

Treebank conversion script; expects no arguments and uses stdin and stdout. Input is in VISL horizontal tree format (see http://beta.visl.sdu.dk/treebanks.html#The_source_format). Output is an s-expression, i.e., a tree in bracket notation. TODO: turn this into a nltk.Corpus reader.

Functions
 
cnf(tree)
make sure all terminals have POS tags; invent one if necessary ("parent_word")
 
leaves(xx)
include "non-terminals" if they have no children
 
parse(input, stripmorph=True)
parse a horizontal tree into an s-expression (i.e., WSJ format).
 
clean(a, stripmorph=True)
 
reparse(tree)
following code contributed by Alex Martelli at StackOverflow: http://stackoverflow.com/questions/2815020/converting-a-treebank-of-vertical-trees-to-s-expressions
 
main()
take a treebank from stdin in horizontal tree format, and output it in s-expression format (i.e., bracket notation, WSJ format).
Variables
  example = 'X:np\n=H:n("konsekvenco" <*> <ac> P NOM)\tKonsekven...
  example2 = 'STA:fcl\n=S:np\n==DN:pron-dem("tia" <*> <Dem> <Du>...
  example3 = 'STA:par\n=CJT:fcl\n==fA:adv("krome" <*>) Krome\n...
  relinelev = re.compile(r'(=*)(.*)')
  reclean = re.compile(r'\s*\((\S+)[^\)]*\)')
  __package__ = None
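
The two compiled patterns listed above can be tried on a single line of the horizontal format. The snippet below is only an illustrative sketch: the patterns are copied from the variable listing, but how the module itself applies them is not shown on this page, so the use of `sub` to strip the morphology block is an assumption.

```python
import re

# Same patterns as the module-level variables relinelev and reclean.
relinelev = re.compile(r'(=*)(.*)')
reclean = re.compile(r'\s*\((\S+)[^\)]*\)')

# One leaf line taken from the "example" variable: depth markers,
# function:form label, lemma block, a tab, then the surface word.
line = '==H:prp("de")\tde'

# relinelev separates the run of '=' signs (the nesting depth) from the rest.
marks, rest = relinelev.match(line).groups()
depth = len(marks)

# reclean matches the parenthesised lemma/morphology block and captures
# its first token (the quoted lemma).
lemma = reclean.search(rest).group(1)

# Assumed usage: substituting the block away leaves label and word only.
stripped = reclean.sub('', rest)

print(depth, repr(rest), lemma, repr(stripped))
```

Because `(=*)(.*)` can match the empty string, `relinelev.match` succeeds on every line, so the depth of a root line such as `X:np` simply comes out as 0.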
Function Details

parse(input, stripmorph=True)

parse a horizontal tree into an s-expression (i.e., WSJ format). Defaults to stripping morphology information. Parentheses in the input are converted to braces.

>>> print example
X:np
=H:n("konsekvenco" <*> <ac> P NOM)      Konsekvencoj
=DN:pp
==H:prp("de")   de
==DP:np
===DN:adj("ekonomia" <Deco> P NOM)      ekonomiaj
===H:n("transformo" P NOM)      transformoj     
>>> parse(example.splitlines())
'(X:np (H:n Konsekvencoj) (DN:pp (H:prp de) (DP:np (DN:adj ekonomiaj) (H:n transformoj))))'     
>>> print example2
STA:fcl
=S:np
==DN:pron-dem("tia" <*> <Dem> <Du> <dem> DET P NOM)     Tiaj
==H:n("akuzo" <act> <sd> P NOM) akuzoj
=fA:adv("certe")        certe
=P:v-fin("dauxri" <va+TEMP> <mv> FUT VFIN)      dauxros
>>> parse(example2.splitlines())
'(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros))'
>>> parse(example3.splitlines())
'(STA:par (CJT:fcl (fA:adv Krome) (,) (S:np (DN:art la) (H:n savo) (DN:pp (H:prp de) (DP:np (H:n konkuranto)))) (P:v-fin helpos (((DN:prop Microsoft))))) CJT:icl (P:v-pcp2 refuti) (Od:np (H:n akuzojn) (DN:pp (H:prp pri) (DP:n monopolismo))))'

reparse(tree)

following code contributed by Alex Martelli at StackOverflow: http://stackoverflow.com/questions/2815020/converting-a-treebank-of-vertical-trees-to-s-expressions

parse a horizontal tree into an s-expression (i.e., WSJ format). Defaults to stripping morphology information. Parentheses in the input are converted to braces.

>>> reparse(example.splitlines())
'(X:np (H:n Konsekvencoj) (DN:pp (H:prp de) (DP:np (DN:adj ekonomiaj) (H:n transformoj))))'     
>>> reparse(example2.splitlines())
'(STA:fcl (S:np (DN:pron-dem Tiaj) (H:n akuzoj)) (fA:adv certe) (P:v-fin dauxros))'
>>> reparse(example3.splitlines())
'(STA:par (CJT:fcl (fA:adv Krome) (,) (S:np (DN:art la) (H:n savo) (DN:pp (H:prp de) (DP:np (H:n konkuranto)))) (P:v-fin helpos (DN:prop Microsoft))) (CJT:icl (P:v-pcp2 refuti) (Od:np (H:n akuzojn) (DN:pp (H:prp pri) (DP:n monopolismo)))))'
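
The conversion exercised by the doctests above can be approximated with a small stack of open depths. The following is a minimal sketch under stated assumptions, not the code contributed on StackOverflow and not the module's own implementation: `to_sexpr` and `open_levels` are hypothetical names, morphology is always stripped, and the parenthesis-to-brace conversion mentioned in the docstring is left out.

```python
import re

LEVEL = re.compile(r'(=*)(.*)')            # same pattern as relinelev
MORPH = re.compile(r'\s*\((\S+)[^\)]*\)')  # same pattern as reclean


def to_sexpr(lines):
    """Hypothetical helper: turn horizontal-tree lines into one bracketed string."""
    out = []
    open_levels = []  # depths of the nodes that are still open
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        marks, rest = LEVEL.match(line).groups()
        level = len(marks)
        # A node at depth d ends every open node at depth >= d.
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
            out.append(')')
        # Drop the lemma/morphology block and normalise tab/space runs.
        label = ' '.join(MORPH.sub('', rest).split())
        out.append(('(' if not out else ' (') + label)
        open_levels.append(level)
    # Close whatever is still open at the end of the tree.
    out.append(')' * len(open_levels))
    return ''.join(out)


example = ('X:np\n'
           '=H:n("konsekvenco" <*> <ac> P NOM)\tKonsekvencoj\n'
           '=DN:pp\n'
           '==H:prp("de")\tde\n'
           '==DP:np\n'
           '===DN:adj("ekonomia" <Deco> P NOM)\tekonomiaj\n'
           '===H:n("transformo" P NOM)\ttransformoj')

print(to_sexpr(example.splitlines()))
```

Keeping a stack of depths rather than a single counter means a line that jumps more than one level deeper still closes the right number of brackets later on.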

main()

take a treebank from stdin in horizontal tree format, and output it in s-expression format (i.e., bracket notation, WSJ format). Checks whether the original sentence and the leaves of the tree match, and discards the tree if they don't. Also removes trees marked as problematic with the tag "CAVE" in the comments. Example input:

<s_id=812>
SOURCE: id=812
ID=812 Necesus adapti la metodon por iuj alilandaj klavaroj.
A1
STA:fcl
=P:v-fin("necesi" <*> <mv> COND VFIN)	Necesus
=S:icl
==P:v-inf("adapti" <mv>)	adapti
==Od:np
===DN:art("la")	la
===H:n("metodo" <ac> S ACC)	metodon
===DN:pp
====H:prp("por" <aquant>)	por
====DP:np
=====DN:pron("iu" <quant> DET P NOM)	iuj
=====DN:adj("alilanda" P NOM)	alilandaj
=====H:n("klavaro" <cc-h> <tool-mus> P NOM)	klavaroj
.

</s>
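
The two filters described for main() (sentence and leaves must match, trees flagged "CAVE" are dropped) might look roughly like the sketch below. Everything here is an assumption for illustration: `leaf_words` and `keep_tree` are hypothetical names, the comparison ignores all whitespace, and non-terminals without children are skipped rather than counted as leaves, which is simpler than what leaves(xx) is documented to do.

```python
import re

# Assumed shape of a bare non-terminal line, e.g. "STA:fcl" or "DN:pp".
NONTERMINAL = re.compile(r'\S+:\S+')


def leaf_words(tree_lines):
    """Collect surface words: text after the tab, or bare punctuation lines."""
    words = []
    for line in tree_lines:
        body = line.lstrip('=').strip()
        if not body:
            continue
        if '\t' in body:
            words.append(body.rsplit('\t', 1)[-1].strip())
        elif not NONTERMINAL.fullmatch(body):
            words.append(body)
    return words


def keep_tree(sentence, tree_lines, comment_lines=()):
    """Return True if the tree passes both (assumed) filters."""
    # Filter 1: drop trees whose comments carry the CAVE tag.
    if any('CAVE' in comment for comment in comment_lines):
        return False
    # Filter 2: sentence and leaves must agree once whitespace is ignored.
    squash = lambda text: ''.join(text.split())
    return squash(sentence) == squash(' '.join(leaf_words(tree_lines)))


# Shortened, illustrative tree built from lines of the example input above.
tree = ['STA:fcl',
        '=P:v-fin("necesi" <*> <mv> COND VFIN)\tNecesus',
        '=S:icl',
        '==P:v-inf("adapti" <mv>)\tadapti',
        '.']

print(leaf_words(tree))
print(keep_tree('Necesus adapti.', tree))
```

Ignoring whitespace sidesteps the question of whether punctuation is attached to the preceding word in the sentence line, at the cost of accepting trees whose tokenisation differs from the sentence.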


Variables Details

example

Value:
'''X:np
=H:n("konsekvenco" <*> <ac> P NOM)\tKonsekvencoj
=DN:pp
==H:prp("de")\tde
==DP:np
===DN:adj("ekonomia" <Deco> P NOM)\tekonomiaj
===H:n("transformo" P NOM)\ttransformoj'''

example2

Value:
'''STA:fcl
=S:np
==DN:pron-dem("tia" <*> <Dem> <Du> <dem> DET P NOM)     Tiaj
==H:n("akuzo" <act> <sd> P NOM) akuzoj
=fA:adv("certe")        certe
=P:v-fin("dauxri" <va+TEMP> <mv> FUT VFIN)      dauxros'''

example3

Value:
'''STA:par
=CJT:fcl
==fA:adv("krome" <*>)   Krome
==,
==S:np
===DN:art("la") la
===H:n("savo" <act> <event> S NOM)      savo
===DN:pp
...