@ Enriching DOP with morphology:
@ a pilot using a constructed language

0440949
_Andreas van Cranenburgh_
Cognitive Models of Language project

@ Abstract

Esperanto is a constructed language with a rich and regular morphology. It seems likely that taking its morphology into account when parsing syntax will improve accuracy. I will investigate the effects of treating morphological and phrase structure analysis as separate, autonomous steps, versus combining them into a single DOP model. I will assume a hierarchical representation for both syntax and morphology. Since there is no gold standard treebank with phrase structures for Esperanto, I will construct a small toy corpus for testing. Furthermore, experiments with U-DOP for both morphology and phrase structure are possible. Lastly, previous work on Esperanto has resulted in a highly successful constraint grammar (>95% precision on a small test corpus; Bick 2007), and a formal model of morphology and syntax in the form of an adpositional grammar (Gobbo 2009); an adpositional grammar is a dependency grammar combining directed dependencies with the dimension of trajector/landmark from construction grammar. These provide a means of comparison and a potential treebank.

@ Research questions:

* Does making morphology transparent to syntax improve parsing results for syntax?
* Are morphology and syntax autonomous? I.e., is morphology opaque or transparent to syntax? These possibilities correspond to a modularist (cf. Pinker 1994, Jackendoff 2003) vs. an interactionist approach (cf. MacWhinney 1987). Here modularism refers to the functionalist hypothesis of the autonomy of syntax from other levels, both the stronger claim of processing autonomy, and its weaker form of representational autonomy. In effect the issue at stake here is the nature of the morphology-syntax interface.
At one extreme, syntax sees only a Part-of-Speech tag (and possibly the word as well, if syntax is lexicalised); at the other extreme, morphology is literally a part of syntax that has been conveniently ignored in the majority of work in (computational) linguistics to date. Jackendoff's (2003) parallel architecture suggests a compromise where interfaces between different autonomous levels are possible (e.g., phonology-semantics to deal with focus effects).
* Are words the smallest units of syntax, or is it perhaps morphemes?

Because this is a pilot project, only the first question will be answered, but this may provide a hint as to the other questions. In addition, the answer to the first question will only concern Esperanto.

Approach:

* Construct a corpus of sentences annotated with phrase structures, and a lexicon of words annotated with morphological structures. The assumption is that while syntax may use information in morphology, morphology does not need information from syntax, hence the possibility of constructing a morphological corpus independent of the text corpus; note that this amounts to assuming that for the purpose of constructing a corpus the morphology is context-free.
* Divide the corpus into a training and a test set, and train on the former with DOP1 or DOP* (Zollman 2005; the latter only given a sufficiently large corpus)
* Evaluate morphology: the performance of the morphology model should be good enough to continue with syntax
* Morphology transparent to syntax: take the treebank corpus, merge phrase structure trees with morphological analyses, and construct a single DOP model
* Morphology opaque to syntax: construct a DOP model for morphology, taking one word at a time, and a DOP model for syntax, producing phrase structure trees without morphology. Morphological structure and phrase structure can be parsed in parallel and independently of each other.
@ Morphology Much work in Computational Linguistics focuses exclusively on syntax; this is a form of syntactocentrism, a term coined in Generative Linguistics (Jackendoff 2003). This also goes for Data-Oriented Parsing (DOP), although excursions into semantics have been made. In this project I will go in the other direction and turn to the stratum of morphology. Most accounts of morphology in Computational Linguistics seem to present the structure of words as a sequence of morpheme-feature pairs (e.g., Jurafsky & Martin 2000), as parsed by a (Stochastic) Finite State Transducer (cf., Schmid et al. 2004). However, due to the complexity and potentially unlimited productivity of morphology in Esperanto such a representation will necessarily contain only part of the structural information of words in Esperanto (more on this in the next section). Such an approach is to the representation of the present project what POS tagged sentences are to hierarchical phrase structure trees. Although the present project focuses on Esperanto, the method of adding morphology to DOP should generalize to other languages, especially to languages such as English which display only a very limited amount of morphological productivity and hence exhibit only a subset of the derivational complexity in morphologically richer languages. @ About Esperanto Esperanto is a constructed language (also referred to as a planned language). The term ``artificial language" that is sometimes employed is inappropriate, as its artificial design is just a point in time of its century long continuous usage and evolution. It is a spoken language with its own literature and culture, so while it may not be a ``natural language" strictly speaking (Gobbo (2009) uses the term Quasi-Natural Language), it is certainly a human language that performs all the communicative and expressive functions of Ethnic languages, albeit mostly as a second language used by a diverse and scattered speech community. 
Typologically, Esperanto has the unique character of being a morphologically agglutinative and synthetic language with a vocabulary largely based on Romance languages (apart from some German and Russian words, and schematic function words as well). Its word formation is highly compositional (i.e., its word formation is fully transparent). Its syntax is schematic (designed) and allows for a relatively free word order through obligatory case marking, though in practice a default word order of SVO has emerged, with systematic deviations triggered by complex constituents and by pragmatics to express focus; these findings accord with relatively universal features found in natural languages (Jansen 2007). Cases are marked either through inflection, in the case of the accusative, or through a set of prepositions intended to be unambiguous (e.g., the English preposition "with" translates in two ways in Esperanto: the instrumental "per", and "kun", meaning "with" as in "together with"). Concerning prepositions, the initial intention was to express some rather vague relations such as "believing _in_ God" (which is neither spatial nor temporal, it would appear) with a semantically neutral preposition for an unspecified relation, the preposition "je"; however, this seems to have fallen into disuse, probably through interference from Ethnic languages. However, an interesting hypothesis could be that this reflects an evolutionary pressure for distinctions and ambiguities to correspond with the meanings that are actually expressed (the prior probability of wanting to express some meaning) -- while an abstruse philosophical treatise may theoretically discuss "believing" while residing spatially or temporally "in God", this possibility is vanishingly rare so that making the distinction is wasted effort.
While Esperanto's morphology is agglutinative and synthetic (it has an index of agglutination of 1.0 and an average synthesis index, i.e., the number of morphemes per word, of 1.8-2, as reported by Wells 1989), it is not polysynthetic like the Inuit languages; single words cannot express what is denoted by a whole phrase in other languages, and grammatical roles are not marked, nor is the nature of the relation between the elements that make up a word specified. Concerning the relations between morphemes, consider the Dutch word "zoektechnieken", which could be translated as "techniques for search," though "for" is not specified in the Dutch word. In an agglutinative language, invariant morphemes that each express only a single grammatical meaning are concatenated unmodified, such that identifying the elements that make up a word is relatively easy (although ambiguities may arise through overlap, i.e., when concatenating two smaller morphemes results in a string of characters that coincides with a larger morpheme). The process of word formation is completely productive and without exceptions; the only proviso is that a formation should make sense semantically when considering the meaning of its constituent elements. There is obligatory agreement in number and case within noun phrases. Verb paradigms are simple: tense is marked with the ending, person and number through the personal pronoun.

Esperanto's productive morphology can be summarized using a regular grammar. The following is adapted from Schubert (1993; caveat lector: Schubert incorrectly characterizes this grammar as recursive), which in turn is based on Kalocsay's (1980) account. I have translated it into a regular grammar, showing that Esperanto's lexicon of word forms can be described by a regular language; to my knowledge this is the first such description to date.
The grammar for function words:

{{{
function_word := adverb | preposition | numeral
adverb := prefix adverb
preposition := prefix preposition
numeral := numeral numeral*
prefix := mal | ne | ...
suffix := il | et | ...
}}}

Content words are a little more involved (ibid):

{{{
word := prefix* left* right ending (epsilon | declension)
left := right (epsilon | ending)
right := prefix* root suffix*
ending := o | a | e
declension := j | n
verb-ending := as | is | os | us | u
root := akv | far | ...
}}}

In these rules, "prefix" and "suffix" refer to closed classes of affixes; "(verb-)ending" refers to a one- or two-character ending marking the Part-of-Speech; "declension" refers to either a null marking (nominative, singular) or the accusative and/or plural marking. Furthermore, "*" is the Kleene star, "|" is the alternation operator, and concatenation is implied.

This grammar incorporates three processes of word formation in Esperanto: derivation (concatenating elements to form words), compounding (concatenating elements to words to form more complex words), and POS category change. The latter refers to nominalizations and other possible mappings between Parts-of-Speech.

While this grammar should in all likelihood exhaust Esperanto's morphology, it is of little use for computational linguistics because of its ambiguity and flat structure. Whereas POS tagging can be done practically error-free using a rule-based algorithm (save for proper names and foreign words), deeper morphological structure will depend on the morphemes in question, and possibly their semantics as well. However, in this project it is assumed that the latter does not play a major role, as doing semantics is infeasible (it is my contention that semantics relies on extensive extra-linguistic world knowledge).
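To make the flatness of the content-word grammar concrete, the following is a minimal sketch that transcribes its rules into a Python regular expression. The morpheme inventories are toy assumptions chosen for illustration, and the helper names (`alternation`, `is_content_word`) are hypothetical. The recognizer answers only whether a string is a possible word; it assigns no internal structure.

```python
import re

# Toy morpheme inventories; the real closed classes and the root list are far
# larger. These particular entries are illustrative assumptions.
PREFIXES = ["mal", "ne"]
SUFFIXES = ["il", "et", "eg", "ul"]
ROOTS = ["akv", "far", "tond", "martel", "riĉ"]


def alternation(morphemes):
    """Build a non-capturing group that matches any one of the morphemes."""
    longest_first = sorted(morphemes, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(m) for m in longest_first) + ")"


prefix = alternation(PREFIXES)
suffix = alternation(SUFFIXES)
root = alternation(ROOTS)
ending = "[oae]"
declension = "[jn]"  # transcribed literally from the rule: j | n

# right := prefix* root suffix*
right = f"(?:{prefix}*{root}{suffix}*)"
# left := right (epsilon | ending)
left = f"(?:{right}{ending}?)"
# word := prefix* left* right ending (epsilon | declension)
WORD = re.compile(f"{prefix}*{left}*{right}{ending}{declension}?")


def is_content_word(word):
    """Return True if the whole string is generated by the word rule."""
    return WORD.fullmatch(word) is not None


for candidate in ["malriĉegulo", "tondilo", "akvofaro", "tond", "xyzo"]:
    print(candidate, is_content_word(candidate))
```

The sketch accepts "malriĉegulo" without indicating whether "mal" attaches to "riĉ" or to the word as a whole, which is the kind of information the hierarchical analyses in this project are meant to supply.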
We will assume that derivations and compound words are constructed in a stochastic process that can be learned from examples (words with their appropriate structure, that is).

Another way in which the grammar falls short is that it does not consider the grammatical character of roots in Esperanto (Schubert 1993). Although initially controversial, the thesis that bare roots (without their grammatical endings) have a grammatical category to which they belong has by now been almost universally accepted in Esperantology. In effect this entails that roots in Esperanto belong to a prototypical semantic class (sometimes several). These classes are verbal, adjectival and noun-like (adverbial roots are part of the adjectival roots; arguably they are part of the "qualities" class). The typical examples are "MARTEL" and "TOND", the roots for hammer and cutting, respectively. The category of the former is a noun and thus "martelo" means a hammer, and the derived "marteli" means to hammer. The latter is a verbal root, with "tondi" meaning to cut, and the derivation "tondilo" meaning a tool to cut with, or a pair of scissors; it requires an affix to denote a tool derived from a verb (directly affixing a noun ending to the root would mean "a cut"). Without recording the grammatical category of roots, a model of Esperanto morphology would not be able to predict the correct derivations.

The present work glosses over a related feature of Esperanto roots, the fact that verbs are transitive or intransitive (valency), requiring an affix to change from the one to the other meaning. The reason for glossing over this aspect is that this information should become part of a more general account of argument structure (i.e., including prepositional arguments) that is beyond the scope of this project.
Take these examples:

(1) "La akvo bolas" (the water boils)
(2) "Mi boligas la akvon" (I boil the water)
(3) "Mi finis la libron" (I finished the book)
(4) "La libro finiĝis" (the book finished)

Sentences (1) and (3) contain the original verb, while (2) and (4) contain affixed verbs with a different subcategorization frame. This feature of Esperanto has been criticized as a needless distinction (common sense usually yields the correct meaning, as for example English demonstrates), and for the rather arbitrary choices that have been made as to transitivity, which require a language user to memorize them by rote. It has also resulted in confusing paronyms such as "pesi" (to weigh something) and "pezi" (to weigh X kilos, to be heavy). It is however an unchangeable part of the language.

@ About Data-Oriented Parsing

Data-Oriented Parsing (Scha 1990; Bod & Scha 1996, henceforth DOP) is a computational framework for modeling natural language processing (NLP) and other hierarchical cognitive phenomena. Its basic assumptions are:

* knowledge of language is made up of a corpus of concrete experiences rather than abstract rules; this concrete experience is stored in exemplars, pairings of surface forms and their structure.
* when faced with a new sentence, all fragments of past experiences can be consulted to analyze the given sentence
* fragments can be combined using one or more operations which obtain with a certain (estimated) probability

Two crucial aspects are the representation used to describe the concrete experiences and the method for ranking the possible analyses. Most research in Computational Linguistics currently focuses on isolated sentences annotated with phrase structure trees; this project will follow the same approach with the addition of morphological structure.
Various methods for selecting the best parse tree exist for DOP; the best performing methods combine a notion of simplicity (the derivation requiring the fewest fragments) with likelihood (estimated probability); e.g., the most likely of the n shortest derivations.

It should be noted that Esperanto, as a free word-order language, is more suitably described using dependency structures. However, given the extent of previous work on DOP with phrase structure trees, I have opted to assume such hierarchical representations instead. This is merely a pragmatically motivated assumption. Work on combining DOP and dependency structures is forthcoming.

What makes DOP so promising is that if any computational approach to language can be said to successfully learn a language given enough data (i.e., without recourse to innate knowledge), DOP is bound to be one of them. This is because the Data in Data-Oriented Parsing refers to exploiting all of the available data. Whereas more traditional methods in Computational Linguistics such as Probabilistic Context-Free Grammars (PCFG) derive abstract rules from a treebank, throwing away valuable contextual information, DOP retains all exemplars and their fragments (modulo some potential pruning method corresponding to memory decay depending on usage, age, etc.). This allows for the recognition of long-range dependencies such as in the construction "more X than Y." Also, compared to a PCFG, the statistical independence assumptions of DOP are less strong, because they can be spread over different derivations resulting in the same parse tree (i.e., the assumptions made by each of the derivations of the most probable parse are corroborated by its other derivations). Prescher et al.
(2004) observe that DOP combines the memory-based aspects of non-probabilistic machine learning techniques such as k-nearest neighbor with a probabilistic approach to deal with unseen (novel) exemplars; thus DOP provides a way to deal with the spectrum ranging from stock phrases that can be memorized by rote to completely novel sentences. The larger the fragments used in a derivation, the fewer independence assumptions need to be made; however, novel sentences can be parsed by backing off to smaller fragments. Thus, in the limit (as the corpus size approaches infinity) DOP does not make any independence assumptions at all.

A fascinating parallel could be said to exist between DOP and the human immune system:

[[[
"Edelman received the Nobel prize in 1972 for his model of the recognition processes of the immune system. Recognition of bacteria is based on competitive selection in a population of antibodies. This process has several intriguing properties (p. 78): 1) There is more than one way to recognize successfully any particular shape; 2) No two people have identical antibodies; 3) The system exhibits a form of memory at the cellular level (prior to antibody reproduction). Edelman extends this theory to a more general "science of recognition": By "recognition," I mean the continual adaptive matching or fitting of elements in one physical domain to novelty occurring in elements of another, more or less independent physical domain, a matching that occurs without prior instruction. [T]here is no explicit information transfer between the environment and organisms that causes the population to change and increase its fitness. (p. 74)" -- Clancey (1991)
]]]

The general theory hinted at here is Edelman's Neural Darwinism, a theory of competition describing the development of the human brain and the development of consciousness. The "species" selected for might be mental categories, conceptualizations, linguistic exemplars, etc.
DOP's notion of _spurious ambiguities_ (different ways of deriving the same parse tree) accords perfectly with 1). While DOP does not explicitly claim that "no two people have identical [exemplars]", this might very well be the case (which dramatically changes the scope of DOP from a potentially purely linguistic account modeling a language to a necessarily psychological one modeling an idiolect); certainly no two individuals will have the exact same corpus. I am unsure exactly how to interpret 3), but reliance on memory is certainly the defining trait of DOP (as opposed to other formalisms, which are typically biased towards computation over memory).

@ DOP and Esperanto

The appropriateness of DOP for Esperanto should be noted. In contrast with the earlier a priori, philosophical languages published as completed projects (Maat 1999), Esperanto was presented in a modest brochure (Zamenhof 1887) purporting to fully describe its grammar in 16 rules, along with examples of original and translated prose and poetry, inviting the reader to start building and using the language by following its examples. That Zamenhof summarized his language in 16 rules may well have been a nod to the rival constructed language Volapük (Schleyer 1884), a popular but highly complex language of bygone days purportedly communicated to its author by God. The complexity of Volapük is demonstrated by the fact that its verb paradigm contains 1584 conjugations, obtained by combining tense, aspect, voice, person, number and gender, among others. Such features made Volapük difficult to learn and use, just like the philosophical languages. During the first Esperanto congress the _Fundamento_ (Zamenhof 1905) was ratified as the untouchable foundation of the language, containing the 16 grammar rules, a dictionary with 2600 words and translations in six languages, and a collection of exercises; all of these had been published at least a decade earlier and were already sanctioned through practice.
In effect, the _Fundamento_ can be considered as the authoritative corpus on Esperanto, to which only new vocabulary is to be added as needed, provided that it follows its orthography. Concerning morphology in particular, Schubert (1993) notes, after referring to Zamenhof's instruction to consult the supplied dictionary of roots and affixes:

[[[
"Apart from this recipe for deciphering Esperanto texts, Zamenhof did not tell the users of his language exactly HOW to build complex words. He relied on providing a vast number of models and examples" (emphasis in the original)
]]]

Further on, Schubert notes:

[[[
"Zamenhof may have intuitively felt the impossibility of describing a language exhaustively by means of rules. Such an insight would make his thinking very modern indeed. In any case he preferred to give examples rather than working out a detailed word grammar."
]]]

This clearly justifies our intention of analyzing Esperanto using an exemplar-based model, not only pragmatically because of DOP's success, but historically as well, since it accords with Esperanto's emergence.

An interesting side note is that in the years after its publication, Esperanto's word formation processes appear to have regularized (Schubert 1989), favoring new coinings such as "aspekti" (to appear) over Germanisms such as "elrigardi" (Wennergren 2005; literally to look out) in the sense of to appear (Dutch "er uitzien", German "aussehen"); naturally, the literal sense of looking out of, e.g., a window remains. Interestingly, this is the opposite of creolization, where a pidgin acquires a relatively complex rule system; a more important argument against the creolization of Esperanto is that creolization is by definition driven by a newly formed, geographically homogeneous community of native speakers, which Esperanto certainly does not have. Furthermore, if Esperanto were to be a pidgin (it is not; cf.
Haitao 2001), it would be one of an extremely curious sort: a pidgin with an authoritative corpus and a language academy. As Miner (2008) remarks, the latter is something which Chomsky could have facetiously remarked; instead, he has claimed (quite incorrectly) that Esperanto is not a language because it lacks a generative grammar, putatively because it "parasitizes" on other languages (footnote: paraphrased from an interview transcript available at http://www3.sympatico.ca/mlgr/chomsky.pdf). This clearly betrays his ignorance of Esperanto, and is an obvious non-sequitur as well (perhaps Chomsky implicitly believes that _real_ languages develop _de novo_ without any interlinguistic interaction to speak of).

@ Tag set

The tag set for the hand-annotated corpora, inspired by the Penn Treebank, is as follows:

* Constituents: VP, PP, NP, N' (constituents that behave like a noun), NC (conjunction + NN/N'), NPC (conjunction + NP), VPC (conjunction + VP), SC (conjunction + S), S' (if/that + S).
* Part-of-speech (simplified version of Penn tagset): NN, VB, PR, JJ, DT, RB, PRP, CC
* Morphology, open class: N (noun), V (verb), J (adjectival), closed class: P (prepositional), A (affix), and auto-generated unique tags for all grammatical endings and declensions (o, j, n, etc.).

The Monato treebank uses a different tag set, based on the {http://beta.visl.sdu.dk/visl/eo/index.php EspGram} constraint grammar. The POS tags of the morphology corpus should be adapted to fit those of the Monato treebank.
Annotated example sentences: {{{ (S (S (NP (DT la) (N' (JJ venontajn) (N' (JJ apartajn) (NN pecojn)))) (VP (NP (PRP mi)) (VP (VBP donas)))) (S' (IN ke) (S (NP (DT la) (NN lernantoj)) (VP (VB povu) (VP (VP (VP (VB ripeti) (RB praktike)) (NP (NP (DT la) (NN regulojn)) (PP (IN de) (NP (DT l') (N' (NN gramatiko) (JJ internacia)))))) (VPC (CC kaj) (VP (VP (VB kompreni) (RB bone)) (NP (NP (NP (DT la) (NN signifon)) (NPC (CC kaj) (NP (DT la) (NN uzon)))) (PP (IN de) (NP (DT l') (N' (NN sufiksoj) (NC (CC kaj) (NN prefiksoj))))))))))))) (S (NP (NN amiko)) (VP (VB venis))) }}} {i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN amiko]] [VP [VB venis]]])} {i(/phpsyntaxtree/pngtree.php?data=[S [S [NP [DT la] [N' [JJ venontajn] [N' [JJ apartajn] [NN pecojn]]]] [VP [NP [PRP mi]] [VP [VBP donas]]]] [S' [IN ke] [S [NP [DT la] [NN lernantoj]] [VP [VB povu] [VP [VP [VP [VB ripeti] [RB praktike]] [NP [NP [DT la] [NN regulojn]] [PP [IN de] [NP [DT l'] [N' [NN gramatiko] [JJ internacia]]]]]] [VPC [CC kaj] [VP [VP [VB kompreni] [RB bone]] [NP [NP [NP [DT la] [NN signifon]] [NPC [CC kaj] [NP [DT la] [NN uzon]]]] [PP [IN de] [NP [DT l'] [N' [NN sufiksoj] [NC [CC kaj] [NN prefiksoj]]]]]]]]]]]]])} Annotated example words: {{{ (JJ (JJ (V (V (P en) (V konduk)) (V it)) a) j) (NN (N (J (J (A mal) (J riĉ)) (A eg)) (A ul)) o) (VB (P al) (VB (V glu) i)) }}} {i(/phpsyntaxtree/pngtree.php?data=[JJ [JJ [V [V [P en] [V konduk]] [V it]] a] j])} {i(/phpsyntaxtree/pngtree.php?data=[NN [N [J [J [A mal] [J rich]] [A eg]] [A ul]] o])} {i(/phpsyntaxtree/pngtree.php?data=[VB [P al] [VB [V glu] i]])} @ Implementation * Goodman reduction: {https://unstable.nl/andreas/dopg.py own implementation}, using {http://groups.google.com/group/nltk-dev/browse_thread/thread/86ca038723195978/c112b8d171b33d25 NLTK}. maybe add backoff DOP or DOP*; fast PCFG parsing using {http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html bitpar} (Schmid 2004), a bit vector based chart parser. 
Implementation details: low memory usage (248 MB), grammar is written directly to disk (although a ramdisk is preferable, like /tmp in most systems); this is made possible by deriving the rules using a generator object (a form of explicit lazy evaluation). Applying the Goodman reduction to a 2000 sentence treebank takes about 15 minutes and produces a grammar of about 850 MB. The process is fully CPU-bound, further speedup is possible through an optimizing compiler (psyco), or re-implementing key parts in C (eg. using Cython). A trade-off of speed for memory is possible by building up the grammar in memory instead of directly to disk; this requires 1.7 GB of memory, which will quickly become unmanageable with larger treebanks. Previously considered possibilities: * {http://staff.science.uva.nl/~simaan/dopdis/ dopdis} (C): already has Goodman reduction and DOP*; * {http://sourceforge.net/projects/lilian/ lilian} (Java): has Goodman reduction, no DOP*; also has U-DOP. * Gideon Borensztajn's {http://staff.science.uva.nl/~gideon/sourcecode/DOPParser.tar.gz DOPParser} (Java): has Goodman reduction @ Segmentation Before a morphological structure can be assigned to a word, it must be segmented into morphemes (similar to tokenization before parsing syntax). While it is claimed that in agglutinative languages in general and in Esperanto in particular it is ``trivial" to recover the segments that make up a word (eg. Schubert 1993), this is a rather informal remark which is not borne out in practice. Morpheme boundaries are not marked, and ambiguities may arise due to overlapping roots. I have devised a form of "Data-Oriented Segmentation" to expand the coverage of segmentation beyond that of the words in the morphology corpus. 
The algorithm works as follows:

* take the set of segmented words in the corpus by reading off the leaves of their trees
* construct a dictionary from positions to the set of morphemes occurring at that position
* generate possible words by taking the cartesian product of all morphemes occurring at position 0 and 1, corresponding to all possible 2-morpheme words using the available vocabulary of roots.
* repeat up to position n, the highest morpheme position (counting from 0) occurring in the treebank, to generate all possible words with up to n+1 morphemes.

Unfortunately this algorithm suffers from overgeneration. This should be remedied by discarding any segmentations contradicting the initial set of (supervised) segmentations. An alternative method of generating segmentations:

* take the set of segmented words in the corpus by reading off the leaves of their trees
* construct a dictionary from number of morphemes to words with that number of morphemes
* generate possible words with n morphemes by taking the pointwise cartesian product of all words with n morphemes (i.e., cartpi(zip(words[n])) )

This still overgenerates, though less so (e.g., word class, plural and accusative endings in the wrong order; it may be necessary to treat endings separately). A third way would be to use a bigram model and produce every possible sequence up to a certain length, which avoids these issues. A fourth way would be to employ the regular grammar described above to generate all valid words up to a certain length, given a collection of roots along with their categories.

@ DOP model composition

In order to produce a combined morphology-syntax model, it is necessary to be able to compose a DOP model and a treebank. This is defined in the following manner:

* let M be a DOP model and S a treebank, where for example M contains morphology and S contains phrase structure trees.
* the composition M o S yields a new DOP model by generating a new treebank S' based on the trees in the treebank S annotated with analyses of words parsed with M (assuming correct segmentation).
* treebank S' is generated by iterating over the POS tags of the trees in S and substituting each POS tag with a tree from M.
* the morphology-syntax model is obtained by instantiating a DOP model from S'

Note that this procedure assumes that disambiguation of morphology is context-free and perfect; the most probable parse is used for decorating the syntax treebank. This assumption should be empirically verified. Example:

{{{
S := {{ (S (NP (NN amiko)) (VP (VB venis))) }
M := {{ (NN (N amik) o) (VB (V ven) is) }
M o S = {{ (S (NP (NN (N amik) o)) (VP (VB (V ven) is))) }
}}}

S := {i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN amiko]] [VP [VB venis]]])}
M := {i(/phpsyntaxtree/pngtree.php?data=[NN [N amik] o])} {i(/phpsyntaxtree/pngtree.php?data=[VB [V ven] is])}
M o S = {i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN [N amik] o]] [VP [VB [V ven] is]]])}

@ Corpora

Toy corpora:

* morphology: hand-annotated list of 290 words, containing all closed class words and affixes, and various open class roots and derivations. Compiled from various more or less naturalistic sources (e.g., Wennergren 2005, Miner 2006)
* syntax: hand-annotated list of 14 sentences (first paragraph of Zamenhof's Dua Libro). Coverage of morphology is 100% with respect to this corpus.

Treebanks:

* morphology: semi-supervised corpus generated from dictionaries (TBD)
* syntax: Monato treebank (Bick, personal communication), a corpus parsed with EspGram (Bick 2007). Number of sentences: 1995, tokens: 30,397, types: 9247. Average sentence length: 15.338. The resulting grammar is 859 MB. I have not yet been able to parse with it, because it requires too much memory (perhaps a packed parse forest chart parser is better?). The treebank requires preprocessing (TBD).
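Before turning to the results, the substitution step of the model composition defined above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: trees are plain nested lists rather than NLTK objects, the function names (`parse`, `unparse`, `compose`) are hypothetical, and a lookup table of analyses stands in for the most probable parse that the morphology model M would return for each word.

```python
def parse(text):
    """Parse a bracketed tree like "(NN (N amik) o)" into nested lists.

    A node is [label, child, ...]; a leaf is a plain string.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token != "(":
            return token  # a leaf (word or morpheme)
        node = [tokens[position]]  # the node label
        position += 1
        while tokens[position] != ")":
            node.append(read())
        position += 1  # skip the closing bracket
        return node

    return read()


def unparse(tree):
    """Turn nested lists back into bracket notation."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


def compose(tree, analyses):
    """Substitute each (POS word) node with the morphological tree of word.

    `analyses` maps a word to its analysis; it is a stand-in (assumption)
    for parsing the word with the morphology model M.
    """
    if isinstance(tree, str):
        return tree
    children = tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        # Preterminal: keep it unchanged if the word has no analysis.
        return analyses.get(children[0], tree)
    return [tree[0]] + [compose(child, analyses) for child in children]


sentence = parse("(S (NP (NN amiko)) (VP (VB venis)))")
analyses = {
    "amiko": parse("(NN (N amik) o)"),
    "venis": parse("(VB (V ven) is)"),
}
print(unparse(compose(sentence, analyses)))
```

Applying `compose` to every tree in S yields the treebank S' from which the combined model would be instantiated.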
@ Results on toy corpora Using a syntax and morphological corpus that do not contain the word "ven'as", but with a morphology model that can derive it from "don'as" and the past tense "ven'is": {{{ sentence: amiko venas morphology: (NN (N@222 amik) o) (p=0.00417101147028) (VB (V ven) as) (p=0.000334168755221) syntax: error Grammar does not cover some of the input words: "'venas'". morphology + syntax combined: ['amik', 'o', 'ven', 'as'] (S (NP@91 (NN (N amik) o)) (VP@94 (VB (V ven) as))) (p=1.12188584593e-28) }}} The corpus contains the plural "prefiksoj", which is inflected to an accusative here: {{{ sentence: mi donas prefikson morphology: (PRP@170 mi) (p=1.0) (VB (V@173 don) as) (p=0.0350877192982) (NN (NN (N@219 prefiks) o) n) (p=6.08906783983e-05) syntax: error Grammar does not cover some of the input words: "'prefikson'". morphology + syntax combined: ['mi', 'don', 'as', 'prefiks', 'o', 'n'] (S (NP@293 (PRP mi)) (VP (VB (V@22 don) as) (NP@256 (NN (NN (N@89 prefiks) o) n)))) (p=9.85999896556e-46) }}} However, it is perhaps unfair not to assign categories to unknown words. In the following results I let unknown words be assigned any open class POS tag (uniform probability for now). This is not an elegant solution because POS tags are transparently marked in Esperanto, so perfect tagging can be performed with a rule-based approach; the problem is that this should be integrated into the parsing algorithm, because the alternative of parsing sequences of POS tags instead of words is a cop out I do not want to make. 
Here is a longer sentence from later in the "Dua Libro" (which is, fittingly, about compounding in Esperanto):

{{{
sentence: Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj
morphology:
Vortoj (NN (NN (N@420 Vort) (NN_o@421 o)) (NN_j@825 j))
kunmetitaj (JJ (J (V kunmetit) (J_a@28 a)) (JJ_j@29 j))
estas (VB (V est) (VB_as as))
kreataj (JJ (J (V kreat) (J_a@28 a)) (JJ_j@29 j))
per (IN@1177 per)
simpla (JJ (J simpl) (JJ_a@613 a))
kunligado (NN (J kunligad) (NN_o@855 o))
de (IN de)
simplaj (JJ (J (V simpl) (J_a@28 a)) (JJ_j@29 j))
vortoj (NN (NN (N@420 vort) (NN_o@421 o)) (NN_j@825 j))
morphology + syntax combined:
['Vort', 'o', 'j', 'kunmetit', 'a', 'j', 'est', 'as', 'kreat', 'a', 'j', 'per', 'simpl', 'a', 'kunligad', 'o', 'de', 'simpl', 'a', 'j', 'vort', 'o', 'j']
(S (NP@128 (NN (NN (N@320 Vort) (NN_o@321 o)) (NN_j@27 j))) (VP (NP (JJ (J (V kunmetit) (J_a@11 a)) (JJ_j@12 j))) (VP (VB (V@255 est) (VB_as@256 as)) (VP (JJ (J (V kreat) (J_a@11 a)) (JJ_j@12 j)) (NP (NP (JJ (V (N per) (V simpl)) (JJ_a@171 a)) (NN (N kunligad) (NN_o@451 o))) (PP (IN de) (N\' (JJ (J (V simpl) (J_a@11 a)) (JJ_j@12 j)) (NN (NN (N@320 vort) (NN_o@321 o)) (NN_j@27 j)))))))))
}}}

{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=[S [NP@128 [NN [NN [N@320 Vort] [NN_o@321 o]] [NN_j@27 j]]] [VP[NP [JJ [J [V kunmetit] [J_a@11 a]] [JJ_j@12 j]]][VP [VB [V@255 est] [VB_as@256 as]] [VP[JJ [J [V kreat] [J_a@11 a]] [JJ_j@12 j]][NP [NP[JJ [V [N per] [V simpl]] [JJ_a@171 a]][NN [N kunligad] [NN_o@451 o]]] [PP[IN de][N' [JJ [J [V simpl] [J_a@11 a]] [JJ_j@12 j]] [NN [NN [N@320 vort] [NN_o@321 o]] [NN_j@27 j]]]]]]]]])}

(Translation: Compound words are created by simple concatenation of simple words [NB: "words" means roots here].)

There are some segmentation mistakes (the correct segmentations are kun-met-it, kre-at, per simpl-a, kun-lig-ad). The phrase structure has mistakes as well, e.g. "vortoj kunmetitaj" should be a constituent, and "per simpla..." should be a PP, but this is overlooked because "per" got an incorrect POS tag.
But given that the syntax corpus contains only 14 sentences, it is perhaps striking that a parse was produced at all. The modularist approach yields the following parse tree:

{{{
syntax & morphology separate:
Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj
(S (NP (NN (NN (N Vort) (NN_o o)) (NN_j j)) (JJ (J (V kunmetit) (J_a a)) (JJ_j j))) (VP (VP (VB (V est) (VB_as as)) (NP (JJ (J (V kreat) (J_a a)) (JJ_j j)) (IN per))) (NP (NP (JJ (J simpl) (JJ_a a)) (NN (J kunligad) (NN_o o))) (PP (IN de) (N\' (JJ (J (V simpl) (J_a a)) (JJ_j j)) (NN (NN (N vort) (NN_o o)) (NN_j j)))))))
}}}

{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=[S [NP[NN [NN [N Vort] [NN_o o]] [NN_j j]][JJ [J [V kunmetit] [J_a a]] [JJ_j j]]] [VP[VP [VB [V est] [VB_as as]] [NP [JJ [J [V kreat] [J_a a]] [JJ_j j]] [IN per]]][NP [NP [JJ [J simpl] [JJ_a a]] [NN [J kunligad] [NN_o o]]] [PP[IN de][N' [JJ [J [V simpl] [J_a a]] [JJ_j j]] [NN [NN [N vort] [NN_o o]] [NN_j j]]]]]]])}

The morphology is identical to the stand-alone morphological analysis, but syntactically the results are a little different, e.g. the first noun and adjective are together in an NP. However, the preposition "per" appears oddly at the end of an NP, instead of introducing a PP (in the previous tree it was analysed as a prefix of "simpla" inside an NP, because the model cannot distinguish word boundaries from morpheme boundaries).

@ Todo

* parse bitpar chart output into NLTK (currently only most probable derivation; we need most probable parse and maybe shortest derivation, SL-DOP etc.)
* automate testing & evaluation, apply to toy corpus
* use Reta Vortaro / Ergane Esperanto dictionary and root lists to induce segmentation / morphology model in a semi-supervised fashion.
* check morphology coverage against vocabulary of Monato treebank
* distinguish between morpheme and word boundaries (how?). Possibly by having a trailing space as part of a morphological analysis (but: this should not block inflection for plurality and accusative, +j and +n respectively).
* write report (maybe convert wiki to LaTeX?): evaluation & conclusion. Write about Dasgupta (2008) & the pragmatic motivation for assuming hierarchical phrase-structure trees.
* look at DOP* / U-DOP

@ Evaluation

TBD. Tenfold cross-validation on the Monato treebank, with and without regard for morphology.

@ Tentative conclusion

Esperanto & DOP are awesome.

@ References

Bick, Eckhard (2007), ``Tagging and Parsing an Artificial Language: an annotated web-corpus of Esperanto,'' in: _Proceedings of Corpus Linguistics_, Birmingham, UK. http://beta.visl.sdu.dk/pdf/CorpusLinguistics2007_esp.pdf

Bod, Rens & Scha, Remko (1996), ``Data-Oriented Language Processing: an overview.'' Research reports, Institute for Logic, Language and Computation, University of Amsterdam. http://dare.uva.nl/document/1144

Clancey, W.J. (1991), ``The biology of consciousness: Comparative review of Israel Rosenfield, The Strange, Familiar, and Forgotten: An anatomy of Consciousness and Gerald M. Edelman, Bright Air, Brilliant Fire: On the Matter of the Mind,'' _Artificial Intelligence_ vol. 60, pp. 313--356.

Gobbo, Federico (2009), ``Adpositional Grammars: a multilingual grammar formalism for NLP,'' PhD dissertation, Università degli Studi dell'Insubria.

Goodman, Joshua (1996), ``Efficient Algorithms for Parsing the DOP Model,'' _Proceedings Empirical Methods in Natural Language Processing_ pp. 143-152. http://acl.ldc.upenn.edu/W/W96/W96-0214.pdf

Jackendoff, Ray (2003), ``Précis of Foundations of Language: Brain, Meaning, Grammar, Evolution,'' _Behavioral and Brain Sciences_ 26:6, pp. 651-665, Cambridge University Press.

Jansen, W. (2007), ``Woordvolgorde in het Esperanto: normen, taalgebruik en universalia'' (Word order in Esperanto: norms, usage and universals). PhD thesis, LOT Utrecht.

Jurafsky, D. & Martin, J.H. (2000), ``Speech & Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition,'' Pearson Education.
Kalocsay, Kálmán & Waringhien, Gaston (1980), Plena Analiza Gramatiko de Esperanto (Complete, analyzed Grammar of Esperanto), Rotterdam, Universala Esperanto-Asocio.

Liu, Haitao (2001), ``Creoles, Pidgins, and Planned Languages.'' Interface. Journal of Applied Linguistics / Tijdschrift voor Toegepaste Linguïstiek 15 [2]. pp. 121--177.

Maat, Jaap (1999), ``Philosophical Languages in the Seventeenth Century: Dalgarno, Wilkins, Leibniz,'' Amsterdam, Institute for Logic, Language and Computation.

MacWhinney, B. (1987), ``Mechanisms of Language Acquisition,'' Lawrence Erlbaum Associates, NJ.

Miner, Ken (2006), ``Rimarkoj pri `En la komenco estas la vorto' de Geraldo Mattos (fina versio),'' (Comments on `In the beginning was the word' by Geraldo Mattos (final version)). http://www.sunflower.com/~miner/EKVO_package/ekvo.html

Miner, Ken (2008), ``La neebleco de priesperanto lingvoscienco,'' (The impossibility of Esperanto linguistics). October 2008. http://www.sunflower.com/~miner/LINGVISTIKO_package/lingvistiko.html Also published in ``La arto labori kune : festlibro por Humphrey Tonkin'' (The art of working together: Festschrift for Humphrey Tonkin). Rotterdam, Universala Esperanto Asocio, January 2010.

Pinker, S. (1994). The language instinct: How the mind creates language. New York: W. Morrow.

Prescher, D., Scha, R., Sima'an, K., Zollmann, A. (2004), ``On the statistical consistency of DOP estimators.'' In _Proceedings of the 14th Meeting of Computational Linguistics in the Netherlands_, Antwerp, Belgium.

Scha, Remko (1990), ``Taaltheorie en Taaltechnologie; Competence en Performance'' (Language theory and language technology: Competence and Performance), in Q.A.M. de Kort and G.L.J. Leerdam (eds.), _Computertoepassingen in de Neerlandistiek_ pp. 7-22, Almere: Landelijke Vereniging van Neerlandici (LVVN-jaarboek). English translation http://www.hum.uva.nl/computerlinguistiek/scha/IAAA/rs/cv.html

Schleyer, Johan Martin (1884), ``Volapük.
Grammatik der Universalsprache für alle gebildete Erdbewohner,'' Überlingen am Bodensee: Buchdruckerei August Feyel, Buchhandlung Aug. Schoy. Third edition.

Schmid, Helmut (2004), ``Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors,'' _Proceedings of the 20th International Conference on Computational Linguistics_ (COLING 2004), Geneva, Switzerland. http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/COLING04/BitPar.pdf

Schmid, Helmut, Fitschen, Arne & Heid, Ulrich (2004), ``SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection,'' _Proceedings of the IVth International Conference on Language Resources and Evaluation_ (LREC 2004), pp. 1263-1266, Lisbon, Portugal. http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/LREC04/smor.pdf

Schubert, Klaus (1989), ``An unplanned development in planned languages,'' in Klaus Schubert (ed.), _Interlinguistics: Aspects of the Science of Planned Languages_ [= Trends in Linguistics: Studies and Monographs 42], Mouton de Gruyter.

Schubert, Klaus (1993), ``Semantic compositionality: Esperanto word-formation for language technology.'' _Linguistics_ 31: 311-365.

Wells, John (1989), ``Lingvistikaj aspektoj de Esperanto,'' Universala Esperanto Asocio, Rotterdam. Second edition.

Wennergren, Bertilo (2005), ``Plena Manlibro de Esperanta Gramatiko,'' (Complete handbook of Esperanto Grammar), version 13.0, 14th of April 2005. Available online at http://bertilow.com/pmeg/.

Zamenhof, Dr. L. L. (1887/1968), ``Internationale Sprache. Vorrede und Vollständiges Lehrbuch,'' Warschau, photographic reprint from 1968 (Saarbrücken: Artur E. Illtis). German translation of the original Russian brochure.

Zamenhof, Dr. L. L. (1905/1963), ``Fundamento de Esperanto.'' Ninth edition with Introduction, Notes and Linguistics comments, edited by Dr. A. Albault (Esperantaj Francaj Eldonoj: Marmande, 1963).
Zollmann, Andreas & Sima'an, Khalil (2005), ``A Consistent and Efficient Estimator for DOP.'' _Journal of Automata, Languages and Combinatorics_ vol. 10, pp. 367. http://staff.science.uva.nl/~simaan/D-Papers/JALCsubmit.pdf

@ Needed references

Everything seems to be there.

@ Possible references

Dasgupta, Probal (2008), ``Interlexical studies: a cognitive approach,'' talk delivered on 18th of April 2008, Amsterdam Centre for Language and Communication.

(from Miner 2006)

Sakaguchi, Alicja, 1996. Die Dichotomie "künstlich" vs. "natürlich" und das historische Phänomen einer funktionierenden Plansprache. Language Problems and Language Planning 20:1.

Gledhill, Christopher, 2000. The Grammar of Esperanto: A Corpus-Based Description. Lincom Europa.

Grimley-Evans, Edmundo, 1997. "Vortfarado", (Word derivation) La Brita Esperantisto, marto-aprilo 1997, ppĝ 57-59.