@ Enriching DOP with morphology: 

@ a pilot using a constructed language

0440949 _Andreas van Cranenburgh_ Cognitive Models of Language project 

@ Abstract

Esperanto is a constructed language with a rich and regular morphology.  It
seems likely that taking its morphology into account when parsing syntax will
improve accuracy. I will investigate the effects of considering morphological
and phrase structure analysis as separate, autonomous steps, versus combining
them into a single DOP model. I will assume a hierarchical representation for
both syntax and morphology.

Since there is no gold standard treebank with phrase structures for Esperanto,
I will construct a small toy corpus for testing. Furthermore, experiments with
U-DOP for both morphology and phrase structure are possible.

Lastly, previous work with Esperanto has resulted in a highly successful (>95%
precision on a small test corpus) constraint grammar (Bick 2007), and a formal
model of morphology and syntax in the form of an adpositional grammar (Gobbo
2009); an adpositional grammar is a dependency grammar combining directed
dependencies with the dimension of trajector/landmark from construction
grammar. These provide a means of comparison and a potential treebank.

@ Research questions:

* Does making morphology transparent to syntax improve parsing results for
syntax?
* Are morphology and syntax autonomous? I.e., is morphology opaque or
transparent to syntax? These possibilities correspond to a modularist (cf.
Pinker 1994, Jackendoff 2003) vs. an interactionist approach (cf.  MacWhinney
1987). Here modularism refers to the functionalist hypothesis of the autonomy
of syntax from other levels, both the stronger claim of processing autonomy,
and its weaker form of representational autonomy. In effect the issue at stake
here is the nature of the morphology-syntax interface. On the one hand there is
the extreme of syntax seeing only a Part-of-Speech tag (and possibly the word
as well if syntax is lexicalised), on the other hand there is the other extreme
where morphology is literally a part of syntax that has been conveniently
ignored in the majority of work in (computational) linguistics to date.
Jackendoff's (2003) parallel architecture suggests a compromise where
interfaces of different autonomous levels are possible (e.g.,
phonology-semantics to deal with focus effects).
* Are words the smallest units of syntax, or is it perhaps morphemes?

Because this is a pilot project, only the first question shall be answered,
but this may provide a hint as to the other questions. In addition the answer
to the first question shall only concern Esperanto.

@ Approach:

* Construct a corpus of sentences annotated with phrase structures, and a 
lexicon of words annotated with morphological structures. The assumption is
that while syntax may use information in morphology, morphology does not need
information from syntax, hence the possibility of constructing a morphological
corpus independent of the text corpus; note that this amounts to assuming that
for the purpose of constructing a corpus the morphology is context-free.
* Divide the corpus into training and test sets; train on the former with DOP1 or 
DOP* (Zollman 2005; the latter only given a sufficiently large corpus)
* Evaluate morphology: evaluate performance of morphology; should be good 
enough to continue with syntax
* Morphology transparent to syntax: take treebank corpus, merge phrase 
structure trees with morphological analyses, construct a single DOP model
* Morphology opaque to syntax: construct a DOP model for morphology, taking one 
word at a time, and a DOP model for syntax, producing phrase structure trees
without morphology.  Morphological structure and phrase structure can be parsed
in parallel and independent of each other.

@ Morphology

Much work in Computational Linguistics focuses exclusively on syntax; this is
a form of syntactocentrism, a term coined in Generative Linguistics 
(Jackendoff 2003). This also goes for Data-Oriented Parsing (DOP), although
excursions into semantics have been made. In this project I will go in the
other direction and turn to the stratum of morphology. Most accounts of
morphology in Computational Linguistics seem to present the structure of words
as a sequence of morpheme-feature pairs (e.g., Jurafsky & Martin 2000),
as parsed by a (Stochastic) Finite State Transducer (cf. Schmid et al. 2004).

However, due to the complexity and potentially unlimited productivity of
morphology in Esperanto such a representation will necessarily contain only
part of the structural information of words in Esperanto (more on this in the
next section). Such an approach is to the representation of the present project
what POS tagged sentences are to hierarchical phrase structure trees. Although
the present project focuses on Esperanto, the method of adding morphology to
DOP should generalize to other languages, especially to languages such as
English which display only a very limited amount of morphological productivity
and hence exhibit only a subset of the derivational complexity in
morphologically richer languages.

@ About Esperanto

Esperanto is a constructed language (also referred to as a planned language).
The term ``artificial language" that is sometimes employed is inappropriate, as
its artificial design is just a point in time in its century-long continuous
usage and evolution. It is a spoken language with its own literature and culture, so while it
may not be a ``natural language" strictly speaking (Gobbo (2009) uses the term
Quasi-Natural Language), it is certainly a human language that performs all the
communicative and expressive functions of ethnic languages, albeit mostly as a
second language used by a diverse and scattered speech community. 

Typologically Esperanto has the unique character of being a morphologically
agglutinative and synthetic language with a vocabulary largely based on Romance
languages (apart from some German and Russian words, and schematic function words
as well). Its word formation is highly compositional (i.e., its word formation
is fully transparent). Its syntax is schematic (designed) and allows for a
relatively free word order through obligatory case marking, though in practice
a default word order of SVO has emerged, with systematic deviations triggered
by complex constituents and by pragmatics to express focus; these findings
accord with relatively universal features found in natural languages (Jansen
2007). Cases are marked either through inflection, in the case of the accusative, or
through a set of prepositions intended to be unambiguous (e.g., the English
preposition "with" translates in two ways in Esperanto: the instrumental
"per", and "kun", "with" as in "together with").

Concerning prepositions, the initial intention was to express some rather vague
relations such as "believing _in_ God" (which is neither spatial nor temporal,
it would appear) with a semantically neutral preposition for an unspecified
relation, the preposition "je"; however, this seems to have fallen into disuse,
probably through interference from ethnic languages. However, an interesting
hypothesis could be that this reflects an evolutionary pressure for
distinctions and ambiguities to correspond with the meanings that are actually
expressed (the prior probability of wanting to express some meaning) -- while
an abstruse philosophical treatise may theoretically discuss "believing" while
residing spatially or temporally "in God", this possibility is vanishingly rare
so that making the distinction is wasted effort.

While Esperanto's morphology is agglutinative and synthetic (it has an index of
agglutination of 1.0 and an average synthesis index (morpheme-to-word ratio) of
1.8-2, as reported by Wells 1989), it is not polysynthetic like the Inuit
languages; single words cannot express what is denoted by a whole phrase
in other languages, and grammatical roles are not marked, nor is the nature of
the relation between elements that make up a word specified. Concerning the
relations between morphemes, consider the Dutch word "zoektechnieken", which
could be translated as "techniques for search," though "for" is not specified in
the Dutch word. In an agglutinative language invariant morphemes that express
only a single grammatical meaning are concatenated unmodified, such that
identifying the elements that make up a word is relatively easy (although
ambiguities may arise through overlap; i.e., when concatenating two smaller
morphemes results in a string of characters that coincides with a larger
morpheme). The process of word formation is completely productive and without
exceptions; the only proviso is that a formation should make sense semantically
when considering the meaning of its constituent elements. 

There is obligatory agreement in number and case within noun phrases. Verb
paradigms are simple: tense is marked with the ending, person and number
through the personal pronoun.

Esperanto's productive morphology can be summarized using a regular
grammar. The following is adapted from Schubert (1993; caveat lector: Schubert
incorrectly characterizes this grammar as recursive), which in turn is based
on Kalocsay's (1980) account. I have translated it into a regular grammar,
proving that Esperanto's lexicon of word forms can be enumerated by a regular
language; to my knowledge this is the first such description to date. The
grammar for function words: 

{{{
function_word := adverb | preposition | numeral
adverb := prefix adverb
preposition := prefix preposition
numeral := numeral numeral{U+002A}
prefix := mal | ne | ...
suffix := il | et | ...
}}}

Content words are a little more involved (ibid):

{{{
word := prefix{U+002A} left{U+002A} right ending (epsilon | declension)
left := right (epsilon | ending)
right := prefix{U+002A} root suffix{U+002A}
ending := o | a | e 
declension := j | n
verb-ending := as | is | os | us | u
root := akv | far | ...
}}}

In these rules, "prefix" and "suffix" refer to closed classes of affixes;
"(verb-)ending" refers to a one- or two-character ending marking the
Part-of-Speech; "declension" refers to either a null marking (nominative,
singular) or the accusative and/or plurality marking. Furthermore, "*" is the
Kleene star, "|" is the alternation operator, and lastly concatenation is
implied. This grammar incorporates three processes of word-formation in
Esperanto: derivation (concatenating elements to form words), compounding
(concatenating elements to words to form more complex words), and POS category
change.  The latter refers to nominalizations and other possible mappings
between Parts-of-Speech.
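
Because the content-word grammar uses only concatenation, alternation and the Kleene star, it can be transcribed directly into a regular expression. The following is a minimal sketch under stated assumptions: the morpheme lists are tiny toy inventories (the root "bon" is added purely for illustration), the rules are transcribed exactly as written above, and the names `alternation`, `WORD` and `is_content_word` are hypothetical helpers, not part of the project's code.

```python
import re

# Toy inventories (assumption); the "..." in the grammar stands for many more.
PREFIXES = ["mal", "ne"]
SUFFIXES = ["il", "et"]
ROOTS = ["akv", "far", "bon"]  # "bon" is an added illustrative root


def alternation(items):
    """Build a non-capturing group matching any one of the given morphemes."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"


prefix = alternation(PREFIXES)
suffix = alternation(SUFFIXES)
root = alternation(ROOTS)
ending = "(?:o|a|e)"
declension = "(?:j|n)"

# right := prefix* root suffix*
right = f"(?:{prefix}*{root}{suffix}*)"
# left := right (epsilon | ending)
left = f"(?:{right}{ending}?)"
# word := prefix* left* right ending (epsilon | declension)
WORD = re.compile(f"{prefix}*{left}*{right}{ending}{declension}?")


def is_content_word(form):
    """Return True if the whole string is generated by the toy grammar."""
    return WORD.fullmatch(form) is not None


for form in ["akvo", "malbona", "farilo", "akvofaro", "akv"]:
    print(form, is_content_word(form))
```

The expression only recognizes word forms; it does not assign them a structure, which is precisely the flatness and ambiguity that motivates a hierarchical, data-oriented treatment.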

While this grammar should in all likelihood exhaust Esperanto's morphology, 
it is of little use for computational linguistics because of its ambiguity
and flat structure.  Whereas POS-tagging can be done practically error-free
using a rule-based algorithm (save for proper names and foreign words), deeper
morphological structure will depend on the morphemes in question, and possibly
their semantics as well. However, in this project it is assumed that the latter
does not play a major role as doing semantics is infeasible (it is my
contention that semantics relies on extensive extra-linguistic world
knowledge). We will assume that derivations and compound words are constructed
in a stochastic process that can be learned from examples (words with their
appropriate structure, that is).

Another way in which the grammar falls short is that it does not consider the
grammatical character of roots in Esperanto (Schubert 1993). Although initially
controversial, the thesis that bare roots (without their grammatical endings)
have a grammatical category to which they belong has by now been almost
universally accepted in Esperantology. In effect this entails that roots in
Esperanto belong to a prototypical semantic class (sometimes several).  These
classes are verbal, adjectival and noun-like (adverbial roots are part of the
adjectival roots, arguably they are part of the ``qualities" class). The
typical examples are "MARTEL" and "TOND", the roots for hammer and cutting,
respectively. The category of the former is a noun and thus "martelo" means a
hammer, and the derived "marteli" means to hammer. The latter is a verb root,
with "tondi" meaning to cut, and the derivation "tondilo" meaning a
tool to cut or a pair of scissors, requiring an affix to denote a tool derived from a
verb (directly affixing a noun ending to the root would mean ``a cut"). Without
recording the grammatical category of roots, a model of Esperanto morphology
would not be able to predict the correct derivations.
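
The role of root categories in the "MARTEL"/"TOND" example can be made concrete with a small sketch. It is only an illustration under assumptions: the lexicon contains just these two roots, the function name `tool_noun` is hypothetical, and the rule that a noun root already names the tool holds for this example, not for noun roots in general.

```python
# Grammatical category of each bare root, as described in the text.
ROOT_CATEGORY = {"martel": "N", "tond": "V"}


def tool_noun(root):
    """Form the noun naming the tool associated with a root.

    A noun root such as "martel" takes the noun ending directly, whereas a
    verb root such as "tond" first needs the tool affix "il".
    """
    category = ROOT_CATEGORY[root]
    if category == "N":
        return root + "o"
    if category == "V":
        return root + "il" + "o"
    raise ValueError("no rule for category " + category)


print(tool_noun("martel"))
print(tool_noun("tond"))
```

Without the category lookup the two roots would be treated alike, and one of the two derivations would come out wrong.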

The present work glosses over a related feature of Esperanto roots, the fact
that verbs are transitive or intransitive (valency), requiring an affix to change from
the one to the other meaning. The reason for glossing over this aspect is that
this information should become part of a more general account of argument
structure (i.e., including prepositional arguments) that is beyond the scope of
this project. Take these examples:

(1) "La akvo bolas" (the water boils)

(2) "Mi boligas la akvon" (I boil the water)

(3) "Mi finis la libron" (I finished the book)

(4) "La libro finiĝis" (the book finished)

Sentences (1) and (3) contain the original verb, while (2) and (4) contain
affixed verbs with a different subcategorization frame. This feature of
Esperanto has been criticized both as a needless distinction (common sense
usually yields the correct meaning, as for example English demonstrates)
and for the rather arbitrary choices that have been made as to
transitivity, which require a language user to memorize them by
rote. It has also resulted in confusing paronyms such as ``pesi'' (to weigh
something) and ``pezi'' (to weigh X kilos, to be heavy). It is however an
unchangeable part of the language.

@ About Data-Oriented Parsing

Data-Oriented Parsing (Scha 1990; Bod & Scha 1996, henceforth DOP) is a
computational framework for modeling natural language processing (NLP) and
other hierarchical cognitive phenomena. Its basic assumptions are:

* knowledge of language is made up of a corpus of concrete experiences
rather than abstract rules; this concrete experience is stored in
exemplars, pairings of surface forms and their structure.
* when faced with a new sentence, all fragments of past experiences can be
consulted to analyze the given sentence
* fragments can be combined using one or more operations which obtain with a
certain (estimated) probability
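
To make the notion of a fragment concrete, the sketch below enumerates all fragments of a small tree in the DOP1 sense: connected subtrees in which every node keeps either all of its children or none. The representation is an assumption made for illustration (nested tuples `(label, child, ...)` with words as plain strings, and a cut-off nonterminal written as a one-element tuple); the function names are hypothetical.

```python
from itertools import product


def fragments_at(node):
    """All fragments rooted at this node.

    Each nonterminal child is either cut off, leaving a frontier
    nonterminal, or expanded by one of its own fragments.
    """
    label, children = node[0], node[1:]
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])  # words are always kept
        else:
            options.append([(child[0],)] + fragments_at(child))
    return [(label,) + combo for combo in product(*options)]


def all_fragments(tree):
    """Fragments rooted at every internal node of the tree."""
    result = fragments_at(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            result.extend(all_fragments(child))
    return result


tree = ("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))
print(len(all_fragments(tree)))
```

Even this two-word sentence yields 15 fragments, which shows how quickly the number of fragments grows with the size of the trees.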

Two crucial aspects are the representation used to describe the concrete
experiences and the method for ranking the possible analyses. Most research in
Computational Linguistics currently focuses on isolated sentences annotated
with phrase-structure trees; this project will follow the same approach with
the addition of morphological structure. Various methods for selecting the best
parse tree exist for DOP; the best performing methods combine a notion of
simplicity (the derivation requiring the fewest fragments) with
likelihood (estimated probability); e.g., the most likely among the n shortest
derivations.
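
The combination of simplicity and likelihood can be sketched as follows. This is a hedged illustration, not the ranking used in any particular DOP implementation: derivations are assumed to be given as `(tree, number_of_fragments, probability)` triples, and summing the probabilities of derivations that yield the same tree is an assumption of this sketch.

```python
from collections import defaultdict


def best_parse(derivations, n):
    """Pick the most likely tree among the n shortest derivations."""
    # Simplicity: keep the n derivations that use the fewest fragments.
    shortest = sorted(derivations, key=lambda d: d[1])[:n]
    # Likelihood: accumulate probability mass per resulting tree.
    scores = defaultdict(float)
    for tree, _, probability in shortest:
        scores[tree] += probability
    return max(scores, key=scores.get)


# Hypothetical derivations; trees are abbreviated to labels.
derivations = [("t1", 1, 0.1), ("t2", 2, 0.2), ("t2", 2, 0.15), ("t3", 5, 0.9)]
print(best_parse(derivations, n=3))
print(best_parse(derivations, n=1))
```

With `n=1` the criterion reduces to the shortest derivation; with a very large `n` it approaches plain most-probable-parse selection.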

It should be noted that Esperanto, as a free word-order language, is more
suitably described using dependency structures. However, given the extent of
previous work on DOP with phrase-structure trees, I have opted to assume such
hierarchical representations instead. This is merely a pragmatically motivated
assumption. Work on combining DOP and dependency structures is forthcoming.

What makes DOP so promising is that if any computational approach to language
can be said to successfully learn a language given enough data (i.e., without
recourse to innate knowledge), DOP is bound to be one of them.  This is because
the Data in Data-Oriented Parsing refers to exploiting all of the available
data. Whereas more traditional methods in Computational Linguistics such as
Probabilistic Context-Free Grammars (PCFG) derive abstract rules from a
treebank, throwing away valuable contextual information, DOP retains all
exemplars and their fragments (modulo some potential pruning method
corresponding to memory decay depending on usage and age etc.). This allows for
the recognition of long-range dependencies such as in the construction "more X
than Y." Also, compared to a PCFG, the statistical independence assumptions of
DOP are less strong, because they can be spread over different derivations
resulting in the same parse tree (i.e., the assumptions made by each of the
derivations of the most probable parse are corroborated by its other
derivations). Prescher et al. (2004) observe that DOP combines the memory-based
aspects of non-probabilistic machine learning techniques such as k-nearest
neighbor with a probabilistic approach to deal with unseen (novel) exemplars;
thus DOP provides a way to deal with the spectrum ranging from stock phrases
that can be memorized by rote to completely novel sentences. The larger the
fragments used in a derivation, the fewer independence assumptions need to be
made; however, novel sentences can be parsed by backing off to smaller
fragments. Thus, in the limit (as the corpus size approaches infinity) DOP does
not make any independence assumptions at all. 

A fascinating parallel could be said to exist between DOP and the human immune
system:

[[[
	"Edelman received the Nobel prize in 1972 for his model of the
	recognition processes of the immune system. Recognition of bacteria is
	based on competitive selection in a population of antibodies. This
	process has several intriguing properties (p. 78): 
 
	1) There is more than one way to recognize successfully any particular shape;
	2) No two people have identical antibodies;
	3) The system exhibits a form of memory at the cellular level (prior to
	antibody reproduction).

	Edelman extends this theory to a more general "science of recognition": 
 
	By "recognition," I mean the continual adaptive matching or fitting of
	elements in one physical domain to novelty occurring in elements of
	another, more or less independent physical domain, a matching that
	occurs without prior instruction. [T]here is no explicit information
	transfer between the environment and organisms that causes the
	population to change and increase its fitness. (p. 74)" 
	-- Clancey (1991)
]]]

The general theory hinted at here is Edelman's Neural Darwinism,
a theory of competition describing the development of the human brain and
the development of consciousness. The "species" selected for might be mental
categories, conceptualizations, linguistic exemplars, etc. 

DOP's notion of _spurious ambiguities_ (different ways of deriving the same
parse tree) accords perfectly with 1).  While DOP does not explicitly claim
that "no two people have identical [exemplars]", this might very well be the case (which
dramatically changes the scope of DOP from a potentially purely linguistic
account modeling a language to a necessarily psychological one modeling an
idiolect); certainly no two individuals will have the exact same corpus. I am
unsure exactly how to interpret 3), but reliance on memory is certainly the
defining trait of DOP (as opposed to other formalisms which are typically
biased to computation over memory).

@ DOP and Esperanto

The appropriateness of DOP for Esperanto should be noted. In contrast
with the earlier a priori, philosophical languages published as completed
projects (Maat 1999), Esperanto was presented in a modest brochure
(Zamenhof 1887) purporting to fully describe its grammar in 16 rules, along
with examples of original and translated prose and poetry, inviting the reader
to start building and using the language by following its examples. That
Zamenhof summarized his language in 16 rules may well have been a nod to the
rival constructed language Volapük (Schleyer 1884), a popular but highly
complex language of bygone days purportedly communicated to its author by God.
The complexity of Volapük is demonstrated by the fact that its verb paradigm
contains 1584 conjugations, by combining tense, aspect, voice, person, number
and gender, among others. Such features made Volapük difficult to learn and
use, just like the philosophical languages. During the first Esperanto congress
the _Fundamento_ (Zamenhof 1905) was ratified as the untouchable foundation of
the language, containing the 16 grammar rules, a dictionary with 2600 words and
translations in six languages, and a collection of exercises; all of these had
been published at least a decade earlier and were already sanctioned through
practice. In effect, the _Fundamento_ can be considered as the authoritative
corpus on Esperanto, to which only new vocabulary is to be added as needed,
provided that it follows its orthography. Concerning morphology in particular,
Schubert (1993) notes, after referring to Zamenhof's instruction of consulting
the supplied dictionary of roots and affixes:

[[[
	"Apart from this recipe for deciphering Esperanto texts,
	Zamenhof did not tell the users of his language exactly HOW to
	build complex words. He relied on providing a vast number of models and
	examples" (emphasis in the original)
]]]

Further on, Schubert notes:

[[[
	"Zamenhof may have intuitively felt the impossibility of describing a
	language exhaustively by means of rules. Such an insight would make his
	thinking very modern indeed. In any case he preferred to give examples
	rather than working out a detailed word grammar."
]]]

This clearly justifies our intention of analyzing Esperanto using an
exemplar-based model, not only pragmatically because of DOP's success, but
historically as well, since it accords with Esperanto's emergence. An
interesting sidenote is that in the years after its publication, Esperanto's
word formation processes appear to have regularized (Schubert 1989), favoring
new coinings such as "aspekti" (to appear) over Germanisms such as "elrigardi"
(literally to look out; Wennergren 2005) in the sense of to appear (Dutch "er uitzien", German
"aussehen"); naturally the literal sense of looking out of, e.g., a window remains.
Interestingly, this is the opposite of creolization, where a pidgin acquires a
relatively complex rule system; a more important argument against the
creolization of Esperanto is that creolization is by definition driven by a
newly formed, geographically homogeneous community of native speakers, which
Esperanto certainly does not have. Furthermore, if Esperanto were a
pidgin (it is not; cf. Haitao 2001), it would be one of an extremely curious
sort: a pidgin with an authoritative corpus and a language academy. As Miner
(2008) remarks, the latter is something which Chomsky could have facetiously
remarked; instead he has claimed (quite incorrectly) that Esperanto is not a
language for it lacks a generative grammar, putatively because it "parasitizes"
on other languages (footnote: paraphrased from an interview transcript
available at http://www3.sympatico.ca/mlgr/chomsky.pdf); this clearly betrays
his ignorance of Esperanto, as well as being an obvious non sequitur (perhaps
Chomsky implicitly believes that _real_ languages develop _de novo_ without any
interlinguistic interaction to speak of).

@ Tag set

The tag set for the hand-annotated corpora, inspired by the Penn Treebank, is as follows:

* Constituents: VP, PP, NP, N' (constituents that behave like a noun), 
NC (conjunction + NN/N'), NPC (conjunction + NP), VPC (conjunction + VP), 
SC (conjunction + S), S' (if/that + S). 
* Part-of-speech (simplified version of Penn tagset): NN, VB, PR, JJ, DT, RB, PRP, CC
* Morphology, open class: N (noun), V (verb), J (adjectival), 
closed class: P (prepositional), A (affix), and auto-generated unique tags for
all grammatical endings and declensions (o, j, n, etc.).

The Monato treebank uses a different tag set, based on the 
{http://beta.visl.sdu.dk/visl/eo/index.php EspGram} constraint 
grammar. The POS tags of the morphology corpus should be adapted 
to fit those of the Monato treebank.

Annotated example sentences:

{{{
(S (S (NP (DT la) (N' (JJ venontajn) (N' (JJ apartajn) (NN pecojn)))) 
(VP (NP (PRP mi)) (VP (VBP donas)))) (S' (IN ke) (S (NP (DT la) (NN lernantoj)) 
(VP (VB povu) (VP (VP (VP (VB ripeti) (RB praktike)) (NP (NP (DT la) (NN regulojn)) 
(PP (IN de) (NP (DT l') (N' (NN gramatiko) (JJ internacia)))))) 
(VPC (CC kaj) (VP (VP (VB kompreni) (RB bone)) (NP (NP (NP (DT la) (NN signifon)) 
(NPC (CC kaj) (NP (DT la) (NN uzon)))) (PP (IN de) 
(NP (DT l') (N' (NN sufiksoj) (NC (CC kaj) (NN prefiksoj)))))))))))))

(S (NP (NN amiko)) (VP (VB venis)))
}}}

{i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN amiko]] [VP [VB venis]]])}

{i(/phpsyntaxtree/pngtree.php?data=[S [S [NP [DT la] [N' [JJ venontajn] [N' [JJ apartajn] [NN pecojn]]]] [VP [NP [PRP mi]] [VP [VBP donas]]]] [S' [IN ke] [S [NP [DT la] [NN lernantoj]] [VP [VB povu] [VP [VP [VP [VB ripeti] [RB praktike]] [NP [NP [DT la] [NN regulojn]] [PP [IN de] [NP [DT l'] [N' [NN gramatiko] [JJ internacia]]]]]] [VPC [CC kaj] [VP [VP [VB kompreni] [RB bone]] [NP [NP [NP [DT la] [NN signifon]] [NPC [CC kaj] [NP [DT la] [NN uzon]]]] [PP [IN de] [NP [DT l'] [N' [NN sufiksoj] [NC [CC kaj] [NN prefiksoj]]]]]]]]]]]]])}

Annotated example words:

{{{
(JJ (JJ (V (V (P en) (V konduk)) (V it)) a) j)

(NN (N (J (J (A mal) (J riĉ)) (A eg)) (A ul)) o)

(VB (P al) (VB (V glu) i))
}}}

{i(/phpsyntaxtree/pngtree.php?data=[JJ [JJ [V [V [P en] [V konduk]] [V it]] a] j])} {i(/phpsyntaxtree/pngtree.php?data=[NN [N [J [J [A mal] [J rich]] [A eg]] [A ul]] o])} {i(/phpsyntaxtree/pngtree.php?data=[VB [P al] [VB [V glu] i]])}
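
Both the sentence and the word annotations use the same bracketed notation, so a single reader suffices for either corpus. The sketch below is a minimal, self-contained reader written for illustration (the project itself relies on NLTK); the nested-tuple representation and the names `parse_tree` and `leaves` are assumptions of this sketch.

```python
import re


def parse_tree(text):
    """Read one bracketed tree into nested tuples (label, child, ...)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def read_node():
        nonlocal position
        position += 1  # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word or morpheme
                position += 1
        position += 1  # skip ")"
        return (label, *children)

    return read_node()


def leaves(tree):
    """Read off the words (or morphemes) of a tree from left to right."""
    result = []
    for child in tree[1:]:
        if isinstance(child, str):
            result.append(child)
        else:
            result.extend(leaves(child))
    return result


word = parse_tree("(NN (N (J (J (A mal) (J riĉ)) (A eg)) (A ul)) o)")
print(leaves(word))
```

Reading off the leaves of a morphological tree gives the segmentation of the word into morphemes; for a sentence tree it gives the tokenized sentence.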

@ Implementation

* Goodman reduction: {https://unstable.nl/andreas/dopg.py own implementation}, using 
{http://groups.google.com/group/nltk-dev/browse_thread/thread/86ca038723195978/c112b8d171b33d25 NLTK}. 
Possible additions: backoff DOP or DOP*; fast PCFG parsing using 
{http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html bitpar} 
(Schmid 2004), a bit-vector-based chart parser.

Implementation details: memory usage is low (248 MB), as the grammar is written directly
to disk (although a ramdisk is preferable, like /tmp in most systems); this is
made possible by deriving the rules using a generator object (a form of
explicit lazy evaluation). Applying the Goodman reduction to a 2000-sentence
treebank takes about 15 minutes and produces a grammar of about 850 MB. The
process is fully CPU-bound; further speedup is possible through an optimizing
compiler (psyco), or by re-implementing key parts in C (e.g., using Cython).  A
trade-off of speed for memory is possible by building up the grammar in memory
instead of directly to disk; this requires 1.7 GB of memory, which will quickly
become unmanageable with larger treebanks.
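
The streaming idea can be sketched independently of the Goodman reduction itself. In the sketch below, plain CFG productions read off the trees stand in for the reduction's rules (an explicit simplification); the point is only that a generator yields one rule at a time, so that nothing but the current rule needs to be held in memory. The function names and the nested-tuple tree representation are assumptions of this sketch.

```python
import io


def productions(tree):
    """Lazily yield one rule (lhs, rhs) per internal node of a tree."""
    label, children = tree[0], tree[1:]
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def write_grammar(treebank, out):
    """Stream all rules to a file-like object; return the number written."""
    count = 0
    for tree in treebank:
        for lhs, rhs in productions(tree):
            out.write(f"{lhs} -> {' '.join(rhs)}\n")
            count += 1
    return count


treebank = [("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))]
buffer = io.StringIO()  # stands in for a file opened on disk
print(write_grammar(treebank, buffer))
print(buffer.getvalue())
```

Replacing the `io.StringIO` buffer by a file opened for writing gives the disk-based variant; collecting the rules in a list instead corresponds to the in-memory variant.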

Previously considered possibilities:

* {http://staff.science.uva.nl/~simaan/dopdis/ dopdis} (C): already has Goodman reduction and DOP*;
* {http://sourceforge.net/projects/lilian/ lilian} (Java): has Goodman reduction, no DOP*; also has U-DOP.
* Gideon Borensztajn's {http://staff.science.uva.nl/~gideon/sourcecode/DOPParser.tar.gz DOPParser} (Java): has Goodman reduction

@ Segmentation

Before a morphological structure can be assigned to a word, it must be
segmented into morphemes (similar to tokenization before parsing syntax). While
it is claimed that in agglutinative languages in general and in Esperanto in
particular it is ``trivial" to recover the segments that make up a word (e.g., Schubert
1993), this is a rather informal remark which is not borne out in practice.
Morpheme boundaries are not marked, and ambiguities may arise due to
overlapping roots.

I have devised a form of "Data-Oriented Segmentation" to expand the
coverage of segmentation beyond that of the words in the morphology corpus. The
algorithm works as follows:

* take the set of segmented words in the corpus by reading off the leaves of their trees
* construct a dictionary from positions to the set of morphemes occurring at that position
* generate possible words by taking the cartesian product of all morphemes occurring
 at position 0 and 1, corresponding to all possible 2-morpheme words using the available
 vocabulary of roots.
* repeat until position n, where n is the highest number of morphemes in the treebank, to
 generate all possible words with n+1 morphemes.

Unfortunately this algorithm suffers from overgeneration. This should be remedied by 
discarding any segmentations contradicting the initial set of (supervised) segmentations. 
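
The position-based procedure described above can be sketched as follows. It is a minimal illustration under assumptions: segmented words are given as tuples of morphemes, the function name `positional_candidates` is hypothetical, and the filtering step against the supervised segmentations is left out.

```python
from itertools import product


def positional_candidates(segmented_words):
    """Map each generated word form to its possible segmentations."""
    # Dictionary from position to the morphemes seen at that position.
    by_position = {}
    for morphemes in segmented_words:
        for index, morpheme in enumerate(morphemes):
            by_position.setdefault(index, set()).add(morpheme)
    # Cartesian products over positions 0..k, for words of 2 or more morphemes.
    candidates = {}
    for length in range(2, len(by_position) + 1):
        columns = [sorted(by_position[i]) for i in range(length)]
        for combination in product(*columns):
            candidates.setdefault("".join(combination), set()).add(combination)
    return candidates


corpus = [("amik", "o"), ("ven", "is"), ("mal", "bon", "a")]
candidates = positional_candidates(corpus)
print(candidates["veno"])
print("amikbon" in candidates)
```

The unseen form "veno" is covered, but so are impossible forms such as "amikbon", which illustrates the overgeneration.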

An alternative method of generating segmentations:

* take the set of segmented words in the corpus by reading off the leaves of their trees
* construct a dictionary from number of morphemes to words with that number of morphemes
* generate possible words with n morphemes by taking the pointwise cartesian product of
 all words with n morphemes (i.e., cartpi(zip(words[n])) )

This still overgenerates, though less so (e.g., word class, plural and
accusative endings in the wrong order; it may be necessary to treat endings
separately). A third way would be to use a bigram model and produce every
possible sequence up to a certain length, which avoids these issues. A fourth
way would be to employ the regular grammar described above to generate all
valid words up to a certain length given a collection of roots along with their
categories.
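
The alternative method, which groups words by their number of morphemes before taking the pointwise cartesian product, can be sketched in the same style. As before this is only an illustration: the input format (tuples of morphemes) and the function name `same_length_candidates` are assumptions.

```python
from collections import defaultdict
from itertools import product


def same_length_candidates(segmented_words):
    """Recombine morphemes only among words with the same morpheme count."""
    by_length = defaultdict(list)
    for morphemes in segmented_words:
        by_length[len(morphemes)].append(tuple(morphemes))
    candidates = set()
    for words in by_length.values():
        # zip(*words) turns the words into one column per position.
        columns = [sorted(set(column)) for column in zip(*words)]
        candidates.update(product(*columns))
    return candidates


corpus = [("amik", "o"), ("ven", "is"), ("mal", "bon", "a")]
candidates = same_length_candidates(corpus)
print(("ven", "o") in candidates)
print(("amik", "bon") in candidates)
```

On this toy input the unseen segmentation ("ven", "o") is still generated, while the spurious ("amik", "bon") is not, because "bon" only occurs in a three-morpheme word.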

@ DOP model composition

In order to produce a combined morphology-syntax model, it is necessary to be able
to compose a DOP model and a treebank. This is defined in the following manner:

* let M be a DOP model and S a treebank, where for example M contains morphology and S contains
phrase structure trees.
* the composition S o M yields a new DOP model by generating a new treebank S' based 
on the trees in the treebank S annotated with analyses of words parsed with M
(assuming correct segmentation).
* treebank S' is generated by iterating over the POS tags of the trees in S 
and substituting each POS tag with a tree from M.
* the morphology-syntax model is obtained by instantiating a DOP model from S'

Note that this procedure assumes that disambiguation of morphology is context-free
and perfect: the most probable parse is used for decorating the syntax treebank.
This assumption should be empirically verified. 

Example:

{{{
S := {{ (S (NP (NN amiko)) (VP (VB venis))) }
M := {{ (NN (N amik) o) (VB (V ven) is) }

S o M = {{ (S (NP (NN (N amik) o)) (VP (VB (V ven) is))) }
}}}

S := {i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN amiko]] [VP [VB venis]]])}
M := {i(/phpsyntaxtree/pngtree.php?data=[NN [N amik] o])}  {i(/phpsyntaxtree/pngtree.php?data=[VB [V ven] is])}
S o M = {i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN [N amik] o]] [VP [VB [V ven] is]]])}
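
The substitution step in the example above can be sketched as a short recursive function. It is a hedged illustration rather than the project's implementation: trees are assumed to be nested tuples `(label, child, ...)`, the morphology model is stood in for by a dictionary mapping each word to its single most probable analysis, and the name `compose` is hypothetical.

```python
def compose(tree, analyses):
    """Replace each POS-tagged word by its morphological tree."""
    label, children = tree[0], tree[1:]
    # A preterminal dominates exactly one word.
    if len(children) == 1 and isinstance(children[0], str):
        # Words without an analysis are left unchanged.
        return analyses.get(children[0], tree)
    return (label,) + tuple(
        child if isinstance(child, str) else compose(child, analyses)
        for child in children
    )


sentence = ("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))
analyses = {
    "amiko": ("NN", ("N", "amik"), "o"),
    "venis": ("VB", ("V", "ven"), "is"),
}
print(compose(sentence, analyses))
```

Applying `compose` to every tree in the syntax treebank yields the decorated treebank S' from which the combined model is instantiated.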

@ Corpora

Toy corpora:

* morphology: hand-annotated list of 290 words, containing all closed class words and affixes, 
and various open class roots and derivations. Compiled from various more or less naturalistic 
sources (e.g., Wennergren 2005, Miner 2006)
* syntax: hand-annotated list of 14 sentences (first paragraph of Zamenhof's Dua Libro). 
  Coverage of morphology is 100% with respect to this corpus.

Treebanks:

* morphology: semi-supervised corpus generated from dictionaries (TBD)
* syntax: Monato treebank (Bick, personal communication), a corpus parsed with EspGram (Bick 2007).
Number of sentences: 1995, tokens: 30,397, types: 9247. Average sentence length: 15.338. The resulting
grammar is 859 MB. I have not been able to parse with it yet, because it requires too much memory (perhaps
a packed parse forest chart parser would be better?). The treebank requires preprocessing (TBD).

@ Results on toy corpora

Using syntax and morphology corpora that do not contain the word "ven'as", 
but with a morphology model that can derive it from "don'as" and the past tense "ven'is":

{{{
sentence: amiko venas
morphology:
(NN (N@222 amik) o) (p=0.00417101147028)
(VB (V ven) as) (p=0.000334168755221)
syntax:
error Grammar does not cover some of the input words: "'venas'".
morphology + syntax combined:
['amik', 'o', 'ven', 'as']
(S (NP@91 (NN (N amik) o)) (VP@94 (VB (V ven) as))) (p=1.12188584593e-28)
}}}

The corpus contains the plural "prefiksoj"; here the same root appears in the accusative, "prefikson":

{{{
sentence: mi donas prefikson
morphology:
(PRP@170 mi) (p=1.0)
(VB (V@173 don) as) (p=0.0350877192982)
(NN (NN (N@219 prefiks) o) n) (p=6.08906783983e-05)
syntax:
error Grammar does not cover some of the input words: "'prefikson'".
morphology + syntax combined:
['mi', 'don', 'as', 'prefiks', 'o', 'n']
(S
  (NP@293 (PRP mi))
  (VP
    (VB (V@22 don) as)
    (NP@256 (NN (NN (N@89 prefiks) o) n)))) (p=9.85999896556e-46)
}}}

However, it is perhaps unfair not to assign categories to unknown words. 
In the following results, unknown words may be assigned any open class 
POS tag (with uniform probability, for now). This is not an elegant solution: 
POS tags are transparently marked in Esperanto, so perfect tagging
can be performed with a rule-based approach. The problem is that such tagging 
should be integrated into the parsing algorithm, because the alternative of
parsing sequences of POS tags instead of words is a cop-out I do not want to make.
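
To illustrate how transparently the endings mark the category, here is a minimal sketch of such a rule-based tagger. It is not part of the current implementation. The function `guess_tag`, the tiny closed-class table and the tag `RB` for adverbs are assumptions made for the example; the other tags follow the ones used in the parses on this page.

```python
import re

# Closed-class words have to be looked up first, because some of them end
# in letters that look like endings (e.g. "mi", "de").
# This table is illustrative only; a real one would list all of them.
CLOSED_CLASS = {'mi': 'PRP', 'de': 'IN', 'per': 'IN'}

# Open-class endings, tried in order. Nouns and adjectives may carry the
# plural -j and the accusative -n after the category ending.
ENDINGS = [
    (re.compile(r'.+(as|is|os|us|u|i)$'), 'VB'),  # finite verbs, infinitive
    (re.compile(r'.+oj?n?$'), 'NN'),              # nouns
    (re.compile(r'.+aj?n?$'), 'JJ'),              # adjectives, participles
    (re.compile(r'.+en?$'), 'RB'),                # adverbs (tag is an assumption)
]


def guess_tag(word):
    """Guess the POS tag of a word from its ending; None if nothing fits."""
    word = word.lower()
    if word in CLOSED_CLASS:
        return CLOSED_CLASS[word]
    for pattern, tag in ENDINGS:
        if pattern.match(word):
            return tag
    return None


sentence = "Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj"
print([(word, guess_tag(word)) for word in sentence.split()])
```

A tagger like this could restrict the tags proposed for an unknown word to a single candidate instead of a uniform distribution over all open class tags. How to integrate it into the parser remains the open question.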

Here is a longer sentence from later in the "Dua Libro" (which is, fittingly, about compounding in Esperanto):

{{{
sentence: Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj
morphology:
Vortoj (NN (NN (N@420 Vort) (NN_o@421 o)) (NN_j@825 j))
kunmetitaj (JJ (J (V kunmetit) (J_a@28 a)) (JJ_j@29 j))
estas (VB (V est) (VB_as as))
kreataj (JJ (J (V kreat) (J_a@28 a)) (JJ_j@29 j))
per (IN@1177 per)
simpla (JJ (J simpl) (JJ_a@613 a))
kunligado (NN (J kunligad) (NN_o@855 o))
de (IN de)
simplaj (JJ (J (V simpl) (J_a@28 a)) (JJ_j@29 j))
vortoj (NN (NN (N@420 vort) (NN_o@421 o)) (NN_j@825 j))
morphology + syntax combined:
['Vort', 'o', 'j', 'kunmetit', 'a', 'j', 'est', 'as', 'kreat', 'a', 'j', 'per', 'simpl', 'a', 'kunligad', 'o', 'de', 'simpl', 'a', 'j', 'vort', 'o', 'j']
(S
  (NP@128 (NN (NN (N@320 Vort) (NN_o@321 o)) (NN_j@27 j)))
  (VP
    (NP (JJ (J (V kunmetit) (J_a@11 a)) (JJ_j@12 j)))
    (VP
      (VB (V@255 est) (VB_as@256 as))
      (VP
        (JJ (J (V kreat) (J_a@11 a)) (JJ_j@12 j))
        (NP
          (NP
            (JJ (V (N per) (V simpl)) (JJ_a@171 a))
            (NN (N kunligad) (NN_o@451 o)))
          (PP
            (IN de)
            (N\'
              (JJ (J (V simpl) (J_a@11 a)) (JJ_j@12 j))
              (NN (NN (N@320 vort) (NN_o@321 o)) (NN_j@27 j)))))))))
}}}

{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=[S [NP@128 [NN [NN [N@320 Vort] [NN_o@321 o]] [NN_j@27 j]]] [VP[NP [JJ [J [V kunmetit]  [J_a@11 a]] [JJ_j@12 j]]][VP [VB [V@255 est] [VB_as@256 as]] [VP[JJ [J [V kreat] [J_a@11 a]] [JJ_j@12 j]][NP [NP[JJ [V [N per] [V simpl]] [JJ_a@171 a]][NN [N kunligad] [NN_o@451 o]]] [PP[IN de][N' [JJ [J [V simpl] [J_a@11 a]] [JJ_j@12 j]] [NN [NN [N@320 vort] [NN_o@321 o]] [NN_j@27 j]]]]]]]]])}

(Translation: Compound words are created using simple concatenation of simple words [NB: "words" here means roots])

There are some segmentation mistakes (the intended analyses are kun-met-it, kre-at, per simpl-a, kun-lig-ad). 
The phrase structure has mistakes as well, e.g. "vortoj kunmetitaj" should be a constituent, and 
"per simpla..." should be a PP, but this is overlooked because "per" received an incorrect
POS tag. But given that the syntax corpus contains only 14 sentences, it is perhaps 
striking that a parse was produced at all.

The modularist approach yields the following parse tree:

{{{
syntax & morphology separate:
Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj 
(S
  (NP
    (NN (NN (N Vort) (NN_o o)) (NN_j j))
    (JJ (J (V kunmetit) (J_a a)) (JJ_j j)))
  (VP
    (VP
      (VB (V est) (VB_as as))
      (NP (JJ (J (V kreat) (J_a a)) (JJ_j j)) (IN per)))
    (NP
      (NP (JJ (J simpl) (JJ_a a)) (NN (J kunligad) (NN_o o)))
      (PP
        (IN de)
        (N\'
          (JJ (J (V simpl) (J_a a)) (JJ_j j))
          (NN (NN (N vort) (NN_o o)) (NN_j j)))))))
}}}

{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=[S [NP[NN [NN [N Vort] [NN_o o]] [NN_j j]][JJ [J [V kunmetit] [J_a a]] [JJ_j j]]] [VP[VP [VB [V est] [VB_as as]] [NP [JJ [J [V kreat] [J_a a]] [JJ_j j]] [IN per]]][NP [NP [JJ [J simpl] [JJ_a a]] [NN [J kunligad] [NN_o o]]] [PP[IN de][N' [JJ [J [V simpl] [J_a a]] [JJ_j j]] [NN [NN [N vort] [NN_o o]] [NN_j j]]]]]]])}

The morphology is identical, but the syntactic analysis differs a little, 
e.g. the first noun and adjective are now together in an NP. However, the preposition 
"per" appears oddly at the end of an NP, instead of introducing a PP (in the 
previous tree it ended up as a prefix inside an NP, because the combined model cannot 
distinguish word boundaries from morpheme boundaries).

@ Todo

* parse bitpar chart output into NLTK (currently only most probable derivation; we need most probable parse and maybe shortest derivation, SL-DOP etc.)
* automate testing & evaluation, apply to toy corpus
* use Reta Vortaro / ergane Esperanto dictionary and root lists 
  to induce segmentation / morphology model in a semi-supervised fashion.
* check morphology coverage against vocabulary of Monato treebank
* distinguish between morpheme and word boundaries (how?),
 possibly by having a trailing space as part of a morphological analysis 
 (but this should not block inflection for plural and accusative, +j and +n respectively).
* write report (maybe convert wiki to LaTeX?), including evaluation & conclusion.
 Write about Dasgupta (2008) & the pragmatic motivation for assuming hierarchical phrase-structure trees.
* look at DOP* / U-DOP

@ Evaluation

TBD. Ten-fold cross-validation on the Monato treebank, with and without regard for morphology.
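
As a sketch of how the ten folds could be produced (nothing here is implemented yet; the function name `ten_fold` and the striding assignment of sentences to folds are assumptions):

```python
def ten_fold(items, k=10):
    """Yield k (train, test) pairs; every item is in exactly one test fold."""
    for i in range(k):
        # item j goes to test fold j mod k, so folds differ in size by at most 1
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


# stand-in for the 1995 sentences of the Monato treebank
sentences = list(range(1995))
for train, test in ten_fold(sentences):
    print(len(train), len(test))
```

Using the same folds for the runs with and without morphology would keep the two conditions directly comparable.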

@ Tentative conclusion

Esperanto & DOP are awesome.

@ References

Bick, Eckhard (2007), ``Tagging and Parsing an Artificial Language: an
annotated web-corpus of Esperanto,'' in: _Proceedings of Corpus Linguistics_ ,
Birmingham, UK.  http://beta.visl.sdu.dk/pdf/CorpusLinguistics2007_esp.pdf

Bod, Rens & Scha, Remko (1996) ``Data-Oriented Language Processing: an
overview.'' Research reports, Institute for Logic, Language and Computation,
University of Amsterdam. http://dare.uva.nl/document/1144

Clancey, W.J. (1991), ``The biology of consciousness: Comparative review of
Israel Rosenfield, The Strange, Familiar, and Forgotten: An anatomy of
Consciousness and Gerald M. Edelman, Bright Air, Brilliant Fire: On the Matter
of the Mind,'' _Artificial Intelligence_ vol. 60, pp. 313--356

Gobbo, Federico (2009), ``Adpositional Grammars: a multilingual grammar
formalism for NLP,'' PhD dissertation, Università degli Studi dell'Insubria.

Goodman, Joshua (1996), ``Efficient Algorithms for Parsing the DOP Model''.
_Proceedings Empirical Methods in Natural Language Processing_ pp. 143-152.
http://acl.ldc.upenn.edu/W/W96/W96-0214.pdf

Jackendoff, Ray (2003), ``Précis of Foundations of Language: Brain, Meaning,
Grammar, Evolution,'' _Behavioral and Brain Sciences_ 26(6), pp. 651--665,
Cambridge University Press.

Jansen, W. (2007), ``Woordvolgorde in het Esperanto: normen, taalgebruik en
universalia'' (Word order in Esperanto: norms, usage and universals). PhD
thesis, LOT Utrecht.

Jurafsky, D. & Martin, J.H. (2000), ``Speech and Language Processing: An
introduction to natural language processing, computational linguistics, and
speech recognition,'' Pearson Education.

Haitao, Liu (2001), ``Creoles, Pidgins, and Planned Languages,'' _Interface.
Journal of Applied Linguistics / Tijdschrift voor Toegepaste Linguïstiek_ 15(2), pp. 121--177.

Kalocsay, Kálmán & Waringhien, Gaston (1980), Plena Analiza Gramatiko de
Esperanto (Complete, analyzed Grammar of Esperanto), Rotterdam, Universala
Esperanto-Asocio.

Maat, Jaap (1999), ``Philosophical Languages in the Seventeenth Century:
Dalgarno, Wilkins, Leibniz,'' Amsterdam, Institute for Logic, Language and
Computation.

MacWhinney, B. (1987), ``Mechanisms of Language Acquisition,'' Lawrence Erlbaum Associates, NJ.

Miner, Ken (2006), ``Rimarkoj pri `En la komenco estas la vorto' de Geraldo
Mattos (fina versio),'' (Comments on `In the beginning was the word' by Geraldo
Mattos (final version)). http://www.sunflower.com/~miner/EKVO_package/ekvo.html

Miner, Ken (2008), ``La neebleco de priesperanto lingvoscienco,'' (The
impossibility of Esperanto linguistics). October 2008.
http://www.sunflower.com/~miner/LINGVISTIKO_package/lingvistiko.html
Also published in ``La arto labori kune : festlibro por Humphrey Tonkin'' (The
art of working together: Festschrift for Humphrey Tonkin). Rotterdam, Universala
Esperanto Asocio, January 2010.

Pinker, S. (1994). The language instinct: How the mind creates language. New York: W. Morrow.

Prescher, D., Scha, R., Sima'an, K., Zollmann, A. (2004), ``On the statistical
consistency of DOP estimators.'' In _Proceedings of the 14th Meeting of
Computational Linguistics in the Netherlands_, Antwerp, Belgium.

Scha, Remko (1990), ``Taaltheorie en Taaltechnologie; Competence en
Performance'' (Language theory and language technology: Competence and
Performance), in Q.A.M. de Kort and G.L.J. Leerdam (eds.),
_Computertoepassingen in de Neerlandistiek_ pp. 7-22, Almere: Landelijke
Vereniging van Neerlandici (LVVN-jaarboek). English translation:
http://www.hum.uva.nl/computerlinguistiek/scha/IAAA/rs/cv.html

Schleyer, Johan Martin (1884), ``Volapük. Grammatik der Universalsprache für
alle gebildete Erdbewohner,'' Überlingen am Bodensee: Buchdruckerei August
Feyel, Buchhandlung Aug. Schoy. Third edition.

Schmid, Helmut (2004), ``Efficient Parsing of Highly Ambiguous Context-Free
Grammars with Bit Vectors,'' _Proceedings of the 20th International Conference
on Computational Linguistics_ (COLING 2004), Geneva, Switzerland.
http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/COLING04/BitPar.pdf

Schmid, Helmut, Fitschen, Arne & Heid, Ulrich (2004), ``SMOR: A German Computational
Morphology Covering Derivation, Composition, and Inflection,'' _Proceedings of the
IVth International Conference on Language Resources and Evaluation_ (LREC 2004),
pp. 1263-1266, Lisbon, Portugal.
http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/LREC04/smor.pdf

Schubert, Klaus (1989), ``An unplanned development in planned languages,'' in
Klaus Schubert (ed.), _Interlinguistics: Aspects of the Science of Planned
Languages_ (Trends in Linguistics: Studies and Monographs 42), Mouton de Gruyter.

Schubert, Klaus (1993), ``Semantic compositionality: Esperanto word-formation
for language technology.'' _Linguistics_ 31: 311-365.

Wells, John (1989), ``Lingvistikaj aspektoj de Esperanto,'' Universala
Esperanto Asocio, Rotterdam. Second edition.

Wennergren, Bertilo (2005), ``Plena Manlibro de Esperanta Gramatiko,''
(Complete handbook of Esperanto Grammar), version 13.0, 14th of April 2005.
Available online at http://bertilow.com/pmeg/.

Zamenhof, Dr. L. L. (1887/1968), ``Internationale Sprache. Vorrede und
Vollständiges Lehrbuch,'' Warschau, photographic reprint from 1968
(Saarbrücken: Artur E. Illtis). German translation of the original Russian
brochure.

Zamenhof, Dr. L. L. (1905/1963), ``Fundamento de Esperanto.'' Ninth edition
with Introduction, Notes and Linguistics comments, edited by Dr. A. Albault
(Esperantaj Francaj Eldonoj: Marmande, 1963).

Zollmann, Andreas & Sima'an, Khalil (2005), ``A Consistent and Efficient
Estimator for DOP.''  _Journal of Automata, Languages and Combinatorics_ vol.
10, pp. 367.  http://staff.science.uva.nl/~simaan/D-Papers/JALCsubmit.pdf

@ Needed references

everything seems to be there.

@ Possible references 

Dasgupta, Probal (2008), ``Interlexical studies: a cognitive approach,'' talk
delivered on 18th of April 2008, Amsterdam Centre for Language and
Communication.

(from Miner 2006)

Sakaguchi, Alicja, 1996. Die Dichotomie "künstlich" vs. "natürlich" und das historische Phänomen einer funktionierenden Plansprache. Language Problems and Language Planning 20:1.

Gledhill, Christopher, 2000. The Grammar of Esperanto: A Corpus-Based Description. Lincom Europa.

Grimley-Evans, Edmundo, 1997. "Vortfarado" (Word derivation), La Brita Esperantisto, March-April 1997, pp. 57-59.