\documentclass[10pt,a4paper]{article}
%\usepackage{linguex}
\usepackage{verbatim}
\usepackage{amsmath}
\usepackage[pdftex]{graphicx}
\usepackage[english]{babel}
\usepackage{synttree}
\usepackage[colorlinks=true, linkcolor=black, urlcolor=blue, pdfborder={0 0 0}]{hyperref} %pdfborderstyle={/S/U/W 1}
%\usepackage{fullpage}
%\usepackage[utf8x]{inputenc}

\makeatletter
\def\s@btitle{\relax}
\def\subtitle#1{\gdef\s@btitle{#1}}
\def\@maketitle{%
  \newpage
  \null   \vskip 2em%
  \begin{center}%
  \let \footnote \thanks     
    {\LARGE \@title \par}%
         \if\s@btitle\relax
         \else\typeout{[subtitle]}%
             \vskip .5pc
             \begin{large}%
                 \textsl{\s@btitle}% 
                 \par 
             \end{large}% 
         \fi
   \vskip 1.5em%
   {\large  
    \lineskip .5em%
    \begin{tabular}[t]{c}%
     \@author 
    \end{tabular}\par}%   
   \vskip 1em% 
   {\large \@date}%
  \end{center}%   
  \par 
  \vskip 1.5em} 
\makeatother

\begin{document}

\title{Enriching DOP with morphology}

\subtitle{a pilot using a constructed language}

\author{Andreas van Cranenburgh\footnote{acranenb@science.uva.nl, 0440949}}

\maketitle

\begin{center}
Cognitive Models of Language project, April 2010, \\
Master of Logic, University of Amsterdam 
\end{center}

\vspace{3em}
\abstract{
Esperanto is a constructed language with a rich and regular morphology.  It
seems likely that taking its morphology into account when parsing syntax will
improve accuracy. I will investigate the effects of considering morphological
and phrase structure analysis as separate, autonomous steps, versus combining
them into a single DOP model. I will assume a hierarchical representation for
both syntax and morphology.

Since there is no gold standard treebank with phrase structures for Esperanto,
I will construct a small toy corpus for testing. Furthermore, experiments with
U-DOP for both morphology and phrase structure are possible.

Lastly, previous work with Esperanto has resulted in a highly successful
($>95\%$ precision on a small test corpus) constraint grammar (Bick 2007), and
a formal model of morphology and syntax in the form of an adpositional grammar
(Gobbo 2009); an adpositional grammar is a dependency grammar combining
directed dependencies with the dimension of trajector/landmark from
construction grammar. These provide a means of comparison and a potential
treebank.}
\footnote{{\bf Acknowledgements}: I wish to thank the following people (in
reverse chronological order): Federico Sangati for practical advice on DOP, Ken
Miner for advice on morphology and suggesting the application of DOP to
Esperanto, Eckhard Bick for the Monato treebank, Rens Bod for teaching me
Data-Oriented Parsing, and last but not least Wim Jansen for teaching me
Esperanto.}

\newpage

\tableofcontents

\section{Introduction}
\subsection{Research questions}

\begin{itemize}
\item Does making morphology transparent to syntax improve parsing results for
syntax?
\item Are morphology and syntax autonomous? I.e., is morphology opaque or
transparent to syntax? These possibilities correspond to a modularist (cf.
Pinker 1994, Jackendoff 2003) vs. an interactionist approach (cf.  MacWhinney
1987). Here modularism refers to the functionalist hypothesis of the autonomy
of syntax from other levels, both the stronger claim of processing autonomy,
and its weaker form of representational autonomy. In effect the issue at stake
here is the nature of the morphology-syntax interface. On the one hand there is
the extreme of syntax seeing only a Part-of-Speech tag (and possibly the word
as well if syntax is lexicalised), on the other hand there is the other extreme
where morphology is literally a part of syntax that has been conveniently
ignored in the majority of work in (computational) linguistics to date.
Jackendoff's (2003) parallel architecture suggests a compromise where
interfaces of different autonomous levels are possible (e.g.,
phonology-semantics to deal with focus effects).
\item Are words the smallest units of syntax, or is it perhaps morphemes?
\end{itemize}

Because this is a pilot project, only the first question shall be answered,
but this may provide a hint as to the other questions. In addition the answer
to the first question shall only concern Esperanto.

Approach:

\begin{itemize}
\item Construct a corpus of sentences annotated with phrase structures, and a 
lexicon of words annotated with morphological structures. The assumption is
that while syntax may use information in morphology, morphology does not need
information from syntax, hence the possibility of constructing a morphological
corpus independent of the text corpus; note that this amounts to assuming that
for the purpose of constructing a corpus the morphology is context-free.
\item Divide the corpus into a training set and a test set, and train on the
former with DOP1 or DOP* (Zollman 2005); the latter only given a sufficiently
large corpus.
\item Evaluate morphology: the performance of the morphological model should be
good enough to continue with syntax.
\item Morphology transparent to syntax: take treebank corpus, merge phrase 
structure trees with morphological analyses, construct a single DOP model
\item Morphology opaque to syntax: construct a DOP model for morphology, taking one 
word at a time, and a DOP model for syntax, producing phrase structure trees
without morphology.  Morphological structure and phrase structure can be parsed
in parallel and independent of each other.
\end{itemize}
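
The ``transparent'' setup in the list above can be illustrated with a minimal
Python sketch. It assumes that trees are stored as nested tuples of the form
\texttt{(label, child, ...)} and that the morphological lexicon maps each word
form to a tree whose root is its POS tag. The names \texttt{is\_preterminal}
and \texttt{merge\_morphology} and the two lexicon entries are hypothetical
illustrations, not part of the actual corpus or implementation.

```python
def is_preterminal(node):
    """A preterminal is a POS tag dominating a single word (a string)."""
    return len(node) == 2 and isinstance(node[1], str)


def merge_morphology(node, lexicon):
    """Replace every (POS word) node by the morphological tree of the word.

    Words missing from the lexicon are left as plain preterminals.
    """
    if is_preterminal(node):
        return lexicon.get(node[1], node)
    label, children = node[0], node[1:]
    return (label,) + tuple(merge_morphology(c, lexicon) for c in children)


# Phrase structure tree as it would appear with morphology opaque to syntax.
syntax_tree = ("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))

# Hypothetical morphological analyses, one per word form.
lexicon = {
    "amiko": ("NN", ("N", "amik"), "o"),
    "venis": ("VB", ("V", "ven"), "is"),
}

merged_tree = merge_morphology(syntax_tree, lexicon)
print(merged_tree)
```

In the opaque setup the two kinds of trees would simply be kept apart and fed
to two separate models; the merged tree is what a single DOP model would be
trained on.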

\section{Background}
\subsection{Morphology}

Much work in Computational Linguistics focuses exclusively on syntax; this is
a form of syntactocentrism, a term coined in Generative Linguistics 
(Jackendoff 2003). This also goes for Data-Oriented Parsing (DOP), although
excursions into semantics have been made. In this project I will go in the
other direction and turn to the stratum of morphology. Most accounts of
morphology in Computational Linguistics seem to present the structure of words
as a sequence of morpheme-feature pairs (e.g., Jurafsky \& Martin 2000),
as parsed by a (Stochastic) Finite State Transducer (cf. Schmid et al. 2004).

However, due to the complexity and potentially unlimited productivity of
morphology in Esperanto such a representation will necessarily contain only
part of the structural information of words in Esperanto (more on this in the
next section). Such an approach is to the representation of the present project
what POS tagged sentences are to hierarchical phrase structure trees. Although
the present project focuses on Esperanto, the method of adding morphology to
DOP should generalize to other languages, especially to languages such as
English which display only a very limited amount of morphological productivity
and hence exhibit only a subset of the derivational complexity in
morphologically richer languages.

\subsection{About Esperanto}

Esperanto is a constructed language (also referred to as a planned language).
The term ``artificial language'' that is sometimes employed is inappropriate, as
its artificial design is just a point in time in its century-long continuous
usage and evolution. It is a spoken language with its own literature and
culture, so while it may not be a ``natural language'' strictly speaking (Gobbo
(2009) uses the term Quasi-Natural Language), it is certainly a human language
that performs all the communicative and expressive functions of ethnic
languages, albeit mostly as a second language used by a diverse and scattered
speech community. 

Typologically Esperanto has the unique character of being a morphologically
agglutinative and synthetic language with a vocabulary largely based on
Romance languages (apart from some German \& Russian words, and schematic
function words as well). Its word formation is extremely compositional (i.e.,
fully transparent); I go as far as to contend that it is the
most compositional spoken language in use today. Its syntax is schematic
(designed) and allows for a relatively free word-order through obligatory case
marking, although in practice a default word-order of SVO has emerged, with
systematic deviations triggered by complex (heavy) constituents and by
pragmatics to express focus; these findings accord with relatively universal
features found in natural languages (Jansen 2007). Cases are marked, viz.
through a null-marking for the nominative and direct object, an inflection in
case of the accusative, and through a set of prepositions initially intended
to be unambiguous (e.g., the English preposition ``with'' translates in two ways
in Esperanto: through the instrumentalis ``per'', or with ``kun'', meaning
together).

The qualification ``relatively'' is a commonly made one for the freeness of
word-order. To be specific, it refers here to the fact that within
constituents word-order is fixed for determiners and prepositions and negation
and degree particles, while being free for adjectives and nouns. The order of
constituents has a higher amount of freedom, but the order of prepositional
phrases does reflect their argument structure. Lastly wh-question formation
co-occurs with a word-order transformation, i.e., if the wh-constituent is in
the accusative it is moved to sentence-initial position.  Strangely enough
this does not happen for polar questions, which are marked with a
polar-forming morpheme. This leaves Esperanto in the perhaps uncommon position
of marking one type of question with word-order (since the wh-words also
denote relative pronouns), and the other with a morpheme. We should also
consider the ambiguities introduced by relaxing word-order. Since the
accusative is marked through a declension with obligatory agreement, it is
trivial to distinguish subjects and objects. However, boundaries between other
constituents such as the nominative and the direct object or the end of
prepositional phrases are unmarked, and can result in ambiguity due to
the aforementioned underspecification.

Concerning prepositions, the initial intention was to express some rather vague
relations such as ``believing {\em in} God'' (which is neither spatial nor
temporal, it would appear) with a semantically neutral preposition for an
unspecified relation, the preposition ``je''; however, this seems to have fallen
into disuse, probably through interference from ethnic languages. An
interesting alternative hypothesis is that this reflects an evolutionary
pressure for distinctions and ambiguities to correspond with the meanings that
are actually expressed (the prior probability of wanting to express some
meaning): while an abstruse philosophical treatise may theoretically discuss
``believing'' while residing spatially or temporally ``in God'', this
possibility is vanishingly rare, so that making the distinction is wasted
effort.

While Esperanto's morphology is agglutinative and synthetic (it has an index of
agglutination of 1.0 and an average synthesis index (morphemes per word) of
1.8 to 2, as reported by Wells 1989), it is not polysynthetic like the Inuit
languages; single words cannot express what is denoted by a whole phrase
in other languages, and grammatical roles are not marked, nor is the nature of
the relation between elements that make up a word specified. It also does not
feature incorporation such as in Catalan. Concerning the underspecification of
relations between morphemes, consider the Dutch word ``zoektechnieken'', which
could be translated as ``techniques for search,'' though ``for'' is not
specified in the Dutch word. In an agglutinative language invariant morphemes
that express only a single grammatical meaning are concatenated unmodified,
such that identifying the elements that make up a word is relatively easy
(although ambiguities may arise through overlap; i.e., when concatenating two
smaller morphemes results in a string of characters that coincides with a
larger morpheme). The process of word formation is completely productive and
without exceptions; the only proviso is that a formation should make sense
semantically when considering the meaning of its constituent elements (i.e.,
the principle of compositionality modulo the Gricean maxim of manner). 

There is obligatory agreement in number and declension within noun phrases.
Verb paradigms are simple: tense is marked with the ending, person and number
solely through the subject.

Esperanto's productive morphology can be summarized using a regular
grammar. The following is adapted from Schubert (1993; caveat lector: Schubert
incorrectly characterizes this grammar as recursive), which in turn is based
on Kalocsay's (1980) account. I have translated it into a regular grammar,
proving that the word forms in the lexicon of Esperanto can be enumerated by a
regular language; to my knowledge this is the first such description to date.
The grammar for function words: 

\begin{verbatim}
function_word := adverb | preposition | numeral
adverb := prefix adverb
preposition := prefix preposition
numeral := numeral numeral*
prefix := mal | ne | ...
suffix := il | et | ...
\end{verbatim}

Content words are a little more involved (ibid.):

%\begin{verbatim}
\texttt{
word := prefix* left* right ending ($\epsilon$ | declension) \\
left := right ($\epsilon$ | ending) \\
right := prefix* root suffix* \\
ending := o | a | e \\
declension := j | n \\
verb-ending := as | is | os | us | u \\
root := akv | far | ...
}
%\end{verbatim}

In these rules, ``prefix'' and ``suffix'' refer to closed classes of affixes;
``(verb-)ending'' refers to a one- or two-character ending marking the
Part-of-Speech; ``declension'' refers to either a null marking (nominative,
singular) or the accusative and/or plurality marking. Furthermore,
``\texttt{*}" is the Kleene star, ``\texttt{|}" is the alternation operator,
and lastly concatenation is implied. This grammar incorporates three processes
of word-formation in Esperanto: derivation (concatenating elements to form
words), compounding (concatenating elements to words to form more complex
words), and POS category change.  The latter refers to nominalizations and
other possible mappings between Parts-of-Speech.
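
Because the content-word grammar is regular, it can be compiled into a single
regular expression. The following is a minimal Python sketch of such a
recognizer, not the implementation used in this project. The morpheme lists are
toy inventories, all names (\texttt{alternation}, \texttt{is\_word}, etc.) are
hypothetical, and two assumptions are made: the declension is read as an
optional plural \texttt{j} followed by an optional accusative \texttt{n}
(following the ``and/or'' in the description above), and verb endings are left
out since the \texttt{word} rule does not refer to them.

```python
import re

# Toy morpheme inventories; the real closed classes and root list are larger.
PREFIXES = ["mal", "ne"]
SUFFIXES = ["il", "et", "eg", "ul"]
ROOTS = ["akv", "far", "tond", "martel"]


def alternation(morphemes):
    """Build a non-capturing regex group matching any one morpheme."""
    return "(?:" + "|".join(map(re.escape, morphemes)) + ")"


PREFIX, SUFFIX, ROOT = map(alternation, (PREFIXES, SUFFIXES, ROOTS))
ENDING = "[oae]"
DECLENSION = "j?n?"  # assumption: plural j and accusative n may co-occur

# right := prefix* root suffix*
RIGHT = f"{PREFIX}*{ROOT}{SUFFIX}*"
# left := right (epsilon | ending)
LEFT = f"(?:{RIGHT}{ENDING}?)"
# word := prefix* left* right ending (epsilon | declension)
WORD = re.compile(f"{PREFIX}*{LEFT}*{RIGHT}{ENDING}{DECLENSION}")


def is_word(form):
    """True if the form is generated by the content-word grammar."""
    return WORD.fullmatch(form) is not None


for form in ["akvo", "tondilojn", "akvofaro", "malfaro", "akv", "oakvo"]:
    print(form, is_word(form))
```

With these toy inventories the first four forms are accepted, while ``akv''
(no ending) and ``oakvo'' (an ending in initial position) are rejected. Note
that the expression only recognizes word forms; it says nothing about which of
the possible segmentations or bracketings is the intended one.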

While this grammar should in all likelihood exhaust Esperanto's morphology, 
it is of little use for computational linguistics because of its ambiguity
and flat structure.  Whereas POS-tagging can be done practically error-free
using a rule-based algorithm (save for proper names and foreign words), deeper
morphological structure will depend on the morphemes in question, and possibly
their semantics as well. However, in this project it is assumed that the latter
does not play a major role as doing semantics is infeasible (it is my
contention that semantics relies on extensive extra-linguistic world
knowledge). We will assume that derivations and compound words are constructed
in a stochastic process that can be learned from examples (words with their
appropriate structure, that is).

Another way in which the grammar falls short is that it does not consider the
grammatical character of roots in Esperanto (Schubert 1993). Although initially
controversial, the thesis that bare roots (without their grammatical endings)
have a grammatical category to which they belong has by now been almost
universally accepted in Esperantology. In effect this entails that roots in
Esperanto belong to a prototypical semantic class (sometimes several).  These
classes are verbal, adjectival and noun-like (adverbial roots are part of the
adjectival roots; arguably they are part of the ``qualities'' class). The
typical examples are ``MARTEL'' and ``TOND'', the roots for hammer and cutting,
respectively. The category of the former is a noun and thus ``martelo'' means a
hammer, and the derived ``marteli'' means to hammer. The latter is a verb
root, with ``tondi'' meaning to cut, and the derivation ``tondilo'' meaning a
tool to cut or a pair of scissors, requiring an affix to denote a tool derived
from a verb (directly affixing a noun ending to the root would mean ``a cut'').
Without
recording the grammatical category of roots, a model of Esperanto morphology
would not be able to predict the correct derivations.

The present work glosses over a related feature of Esperanto roots, the fact
that verbs are transitive or intransitive (valency), requiring an affix to
change from the one to the other meaning. The reason for glossing over this
aspect is that this information should become part of a more general account of
argument structure (i.e., including prepositional arguments) that is beyond the
scope of this project. Take these examples:

(1) ``La akvo bolas" (the water boils)

(2) ``Mi boligas la akvon" (I boil the water)

(3) ``Mi finis la libron" (I finished the book)

(4) ``La libro fini\^gis" (the book finished)

Sentences (1) and (3) contain the original verb, while (2) and (4) contain
affixed verbs with a different subcategorization frame. This feature of
Esperanto has been criticized as being a needless distinction (common sense
usually yields the correct meaning, as for example English demonstrates),
and for the rather arbitrary choices that have been made as to
transitivity, which require a language user to memorize them by
rote. It has also resulted in confusing paronyms such as ``pesi'' (to weigh
something) and ``pezi'' (to weigh $X$ kilos, to be heavy). It is however an
unchangeable part of the language.

\subsection{About Data-Oriented Parsing}

Data-Oriented Parsing (Scha 1990; Bod \& Scha 1996, henceforth DOP) is a
computational framework for modeling natural language processing (NLP) and
other hierarchical cognitive phenomena. Its basic assumptions are:

\begin{itemize}
\item knowledge of language is made up of a corpus of concrete experiences
rather than abstract rules; this concrete experience is stored in
exemplars, pairings of surface forms and their structure.
\item when faced with a new sentence, all fragments of past experiences can be
consulted to analyze the given sentence
\item fragments can be combined using one or more operations which obtain with a
certain (estimated) probability
\end{itemize}
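
To make the notion of ``all fragments'' concrete, the following Python sketch
enumerates the fragments of a single tree in the DOP1 sense: a fragment is a
connected part of a tree in which every node keeps either all of its children
or none of them. Trees are assumed to be nested tuples
\texttt{(label, child, ...)} with words as plain strings; a cut-off node is
represented as a one-element tuple \texttt{(label,)}. The function names are
hypothetical and the sketch favors clarity over efficiency, since the number of
fragments grows exponentially with tree size.

```python
from itertools import product


def fragments_rooted_at(node):
    """All fragments whose root is this node.

    Below the root, each nonterminal child is either cut off, leaving a
    frontier node (label,), or expanded by one of its own fragments.
    """
    options = []
    for child in node[1:]:
        if isinstance(child, str):          # a word: always kept
            options.append([child])
        else:
            options.append([(child[0],)] + fragments_rooted_at(child))
    return [(node[0],) + combo for combo in product(*options)]


def all_fragments(tree):
    """Fragments rooted at every nonterminal node of the tree."""
    result = fragments_rooted_at(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            result.extend(all_fragments(child))
    return result


tree = ("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))
fragments = all_fragments(tree)
print(len(fragments))
for fragment in fragments:
    print(fragment)
```

For this two-word sentence the sketch yields 15 fragments: 9 rooted at S, 2
each at NP and VP, and 1 each at NN and VB.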

Two crucial aspects are the representation used to describe the concrete
experiences and the method for ranking the possible analyses. Most research in
Computational Linguistics currently focuses on isolated sentences annotated
with phrase-structure trees; this project will follow the same approach with
the addition of morphological structure. Various methods for selecting the best
parse tree exist for DOP; the best performing methods combine a notion of
simplicity (the derivation requiring the fewest fragments) with
likelihood (estimated probability); e.g., the most likely of the $n$ shortest
derivations.
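
The combination of simplicity and likelihood can be sketched in a few lines of
Python. This is a simplified illustration under stated assumptions: each
candidate is a single derivation given as a hypothetical triple of parse,
number of fragments and estimated probability, whereas an actual DOP model
would sum probabilities over all derivations of the same parse. The function
name \texttt{best\_parse} and the candidate values are made up for the example.

```python
import heapq


def best_parse(derivations, n=2):
    """Pick the most probable derivation among the n shortest ones.

    Each derivation is (parse, number_of_fragments, probability).
    """
    shortest = heapq.nsmallest(n, derivations, key=lambda d: d[1])
    return max(shortest, key=lambda d: d[2])[0]


# Hypothetical candidate derivations for one sentence.
candidates = [
    ("parse A", 2, 0.010),
    ("parse B", 3, 0.040),
    ("parse C", 5, 0.090),
]
print(best_parse(candidates, n=2))
```

With $n = 1$ the criterion reduces to the shortest derivation; as $n$ grows it
approaches the most probable derivation.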

It should be noted that Esperanto, as a free word-order language, is more
suitably described using dependency structures. However, given the extent of
previous work on DOP with phrase-structure trees, I have opted to assume such
hierarchical representations instead. This is merely a pragmatically motivated
assumption. 

What makes DOP so promising is that if any computational approach to language
can be said to successfully learn a language given enough data (i.e., without
recourse to innate knowledge), DOP is bound to be one of them.  This is because
the Data in Data-Oriented Parsing refers to exploiting all of the available
data. Whereas more traditional methods in Computational Linguistics such as
Probabilistic Context-Free Grammars (PCFG) derive abstract rules from a
treebank, throwing away valuable contextual information, DOP retains all
exemplars and their fragments (modulo some potential pruning method
corresponding to memory decay depending on usage and age etc.). This allows for
the recognition of long-range dependencies such as in the construction ``more X
than Y.'' Also, compared to a PCFG, the statistical independence assumptions of
DOP are less strong, because they can be spread over different derivations
resulting in the same parse tree (i.e., the assumptions made by each of the
derivations of the most probable parse are corroborated by its other
derivations). Prescher et al. (2004) observe that DOP combines the memory-based
aspects of non-probabilistic machine learning techniques such as k-nearest
neighbor with a probabilistic approach to deal with unseen (novel) exemplars;
thus DOP provides a way to deal with the spectrum ranging from stock phrases
that can be memorized by rote to completely novel sentences. The larger the
fragments used in a derivation, the fewer independence assumptions need to be
made; however, novel sentences can be parsed by backing off to smaller
fragments. Thus, in the limit (as the corpus size approaches infinity) DOP does
not make any independence assumptions at all. 

A fascinating parallel could be said to exist between DOP and the human immune
system:

\begin{quote}
	``Edelman received the Nobel prize in 1972 for his model of the
	recognition processes of the immune system. Recognition of bacteria is
	based on competitive selection in a population of antibodies. This
	process has several intriguing properties (p. 78): 
	
	\begin{enumerate} % enumerate with 1) style
	\item There is more than one way to recognize successfully any particular shape;
	\item No two people have identical antibodies;
	\item The system exhibits a form of memory at the cellular level (prior to
	antibody reproduction).
	\end{enumerate}

	Edelman extends this theory to a more general ``science of recognition": 

	By ``recognition," I mean the continual adaptive matching or fitting of
	elements in one physical domain to novelty occurring in elements of
	another, more or less independent physical domain, a matching that
	occurs without prior instruction. [T]here is no explicit information
	transfer between the environment and organisms that causes the
	population to change and increase its fitness. (p. 74)" 
	-- Clancey (1991)
\end{quote}

The general theory hinted at here is Edelman's Neural Darwinism,
a theory of competition describing the development of the human brain and
the development of consciousness. The ``species'' selected for might be mental
categories, conceptualizations, linguistic exemplars, etc. 

DOP's notion of {\em spurious ambiguities} (different ways of deriving the same
parse tree) accords perfectly with 1).  While DOP does not explicitly claim
that ``no two people have identical [exemplars]'', this may very well be the case (which
dramatically changes the scope of DOP from a potentially purely linguistic
account modeling a language to a necessarily psychological one modeling an
idiolect); certainly no two individuals will have the exact same corpus. I am
unsure exactly how to interpret 3), but reliance on memory is certainly the
defining trait of DOP (as opposed to other formalisms which are typically
biased to computation over memory).

\subsection{DOP and Esperanto}

I attribute the idea of applying Data-Oriented Parsing to Esperanto to
Ken Miner (2006a):

\begin{quote}
	``E\^c se ni disvolvus stokastajn alirojn bazitajn sur Datum-Orientita
	Pritraktado (DOP), ankora\u{u} necesus denaskaj parolantoj por validumi
	tiajn modelojn. Kiam temas pri la normala lingvistiko, ne eblas eskapi
	la neceson de denaskaj parolantoj kiel fina kontrolo.''

	Even if we develop stochastic approaches based on Data-Oriented
	Parsing (DOP), native speakers would still be necessary for evaluating
	such models. When we speak of normal linguistics, it is impossible
	to escape the necessity of native speakers as the ultimate arbiters.
\end{quote}

The quote is from a rather gloomy article on the lack of negative evidence
for Esperanto, and the resulting impossibility of doing real linguistics (as
opposed to the parochial ``Esperantology''). Note that ``native speakers''
refers here specifically to speakers who use Esperanto in their day-to-day
life with their peers, not in the more narrow sense of a language taught by
parents. Native speakers in the latter sense exist but play a marginal role in
the Esperanto movement; native speakers in the former sense do not exist and
would violate the relative neutrality of Esperanto as an international
language. I personally do not think this lack of evidence makes linguistics on
Esperanto problematic, because grammaticality judgements and semantic
intuitions are philosophically problematic no matter how many native speakers
are available to supply them. While it is correct that there can be no
negative evidence about the grammaticality or felicity of an Esperanto
construction, the same goes for writing a poem in English: as long as the poem
is required to be novel and original it needs to be composed with recourse to
some creative ``estimator'' which judges whether a novel combination of words
makes sense; positive evidence from corpora is of little value to this task,
because it is biased to rehashing previously learned constructions, although
it is undoubtedly a precondition.  Such a creative estimator must prune the
potentially infinite space of low probability events according to a subjective
aesthetic ranking and threshold. Incidentally, it should be noted that
Esperanto has a startlingly rich tradition of translated and original poetry,
ranging from its very inception up to the present day.  Poetry has been one
of the driving forces in coining neologisms, because of their affective
connotations. It may be desired to express an antonymic meaning without
evoking its opposite through the presence of its morpheme, compare ``malgaja''
(un-cheerful, sad) and ``trista''; similarly it may be dangerous to refer to
``maldekstra'' (left) in a noisy environment, as opposed to its proposed
neologism ``live.''

The appropriateness of DOP for Esperanto should be noted. In contrast with the
earlier a priori, philosophical languages published as completed projects
(Maat 1999), Esperanto was presented in a modest brochure (Zamenhof 1887)
purporting to fully describe its grammar in 16 rules, along with examples of
original and translated prose and poetry, inviting the reader to start
building and using the language by following its examples. That Zamenhof
summarized his language in 16 rules may well have been a nod to the rival
constructed language Volap\"uk (Schleyer 1884), a popular but highly complex
language of bygone days purportedly communicated to its author by God.  The
complexity of Volap\"uk is demonstrated by the fact that its verb paradigm
contains 1584 conjugations, by combining tense, aspect, voice, person, number
and gender, among others. Such features made Volap\"uk difficult to learn and
use, just like the philosophical languages. During the first Esperanto congress
the {\em Fundamento} (Zamenhof 1905) was ratified as the untouchable foundation of
the language, containing the 16 grammar rules, a dictionary with 2600 words
and translations in six languages, and a collection of exercises; all of these
had been published at least a decade earlier and were already sanctioned
through practice. In effect, the {\em Fundamento} can be considered as the
authoritative corpus on Esperanto, to which only new vocabulary is to be added
as needed, provided that it follows its orthography. Concerning morphology in
particular, Schubert (1993) notes, after referring to Zamenhof's instruction to
consult the supplied dictionary of roots and affixes:

\begin{quote}
	``Apart from this recipe for deciphering Esperanto texts,
	Zamenhof did not tell the users of his language exactly HOW to
	build complex words. He relied on providing a vast number of models and
	examples" (emphasis in the original)
\end{quote}

Further on, Schubert notes:

\begin{quote}
	``Zamenhof may have intuitively felt the impossibility of describing a
	language exhaustively by means of rules. Such an insight would make his
	thinking very modern indeed. In any case he preferred to give examples
	rather than working out a detailed word grammar."
\end{quote}

This clearly justifies our intention of analyzing Esperanto using an
exemplar-based model, not only pragmatically because of DOP's success, but
historically as well, since it accords with Esperanto's emergence. An
interesting sidenote is that in the years after its publication, Esperanto's
word formation processes appear to have regularized (Schubert 1989), favoring
new coinings such as ``aspekti'' (to appear) over Germanisms such as
``elrigardi'' (Wennergren 2005; literally to look out) in the sense of to
appear (Dutch ``er uitzien'', German ``aussehen''); naturally the literal sense
of looking out, e.g., of a window, remains.  Interestingly, this is the
opposite of creolization, where a
pidgin acquires a relatively complex rule system; a more important argument
against the creolization of Esperanto is that creolization is by definition
driven by a newly formed, geographically homogeneous community of native
speakers, which Esperanto certainly does not have. Furthermore, if Esperanto
were to be a pidgin (it is not; cf.  Haitao 2001), it would be one of an
extremely curious sort: a pidgin with an authoritative corpus and a language
academy. As Miner (2008) remarks, the latter is something which Chomsky could
have facetiously remarked; instead he has claimed (quite incorrectly) that
Esperanto is not a language for it lacks a generative grammar, putatively
because it ``parasitizes'' on other languages\footnote{paraphrased from an
interview transcript available at \url{http://www3.sympatico.ca/mlgr/chomsky.pdf}};
this clearly betrays his ignorance of Esperanto, as well as being an obvious
non-sequitur (perhaps Chomsky implicitly believes that {\em real} languages
develop {\em de novo} without any interlinguistic interaction to speak of).

\section{Practice}
\subsection{Tag set}

The tag set for the hand-annotated corpora, inspired by the Penn Treebank, is as follows:

\begin{itemize}
\item Constituents: VP, PP, NP, N' (constituents that behave like a noun), 
NC (conjunction + NN/N'), NPC (conjunction + NP), VPC (conjunction + VP), 
SC (conjunction + S), S' (if/that + S). 
\item Part-of-speech (simplified version of Penn tagset): NN, VB, PR, JJ, DT, RB, PRP, CC
\item Morphology, open class: N (noun), V (verb), J (adjectival), 
closed class: P (prepositional), A (affix), and auto-generated unique tags for
all grammatical endings and declensions (o, j, n, etc.).
\end{itemize}

The Monato treebank uses a different tag set, based on the 
\href{http://beta.visl.sdu.dk/visl/eo/index.php}{EspGram} constraint 
grammar. The POS tags of the morphology corpus should be adapted 
to fit those of the Monato treebank.

Annotated example sentences:

\begin{verbatim}
(S (NP (NN amiko)) (VP (VB venis)))
\end{verbatim}

\synttree [S [NP [NN amiko]] [VP [VB venis]]]
%\includegraphics[scale=0.5]{eoimg/tree1}


\begin{verbatim}
(S (S (NP (DT la) (N' (JJ venontajn) (N' (JJ apartajn) (NN pecojn)))) 
(VP (NP (PRP mi)) (VP (VBP donas)))) (S' (IN ke) (S (NP (DT la) (NN lernantoj)) 
(VP (VB povu) (VP (VP (VP (VB ripeti) (RB praktike)) (NP (NP (DT la) (NN regulojn)) 
(PP (IN de) (NP (DT l') (N' (NN gramatiko) (JJ internacia)))))) 
(VPC (CC kaj) (VP (VP (VB kompreni) (RB bone)) (NP (NP (NP (DT la) (NN signifon)) 
(NPC (CC kaj) (NP (DT la) (NN uzon)))) (PP (IN de) 
(NP (DT l') (N' (NN sufiksoj) (NC (CC kaj) (NN prefiksoj)))))))))))))
\end{verbatim}

\synttree [S [S [NP [DT la] [N' [JJ venontajn] [N' [JJ apartajn] [NN pecojn]]]] [VP [NP [PRP mi]] [VP [VBP donas]]]] [S' [IN ke] [S [NP [DT la] [NN lernantoj]] [VP [VB povu] [VP [VP [VP [VB ripeti] [RB praktike]] [NP [NP [DT la] [NN regulojn]] [PP [IN de] [NP [DT l'] [N' [NN gramatiko] [JJ internacia]]]]]] [VPC [CC kaj] [VP [VP [VB kompreni] [RB bone]] [NP [NP [NP [DT la] [NN signifon]] [NPC [CC kaj] [NP [DT la] [NN uzon]]]] [PP [IN de] [NP [DT l'] [N' [NN sufiksoj] [NC [CC kaj] [NN prefiksoj]]]]]]]]]]]]]
%\includegraphics[scale=0.4, angle=90]{eoimg/tree2}
%\includegraphics[width=17cm]{eoimg/tree2}

Some annotated example words:

\begin{verbatim}
(JJ (JJ (V (V (P en) (V konduk)) (V it)) a) j)

(NN (N (J (J (A mal) (J riĉ)) (A eg)) (A ul)) o)

(VB (P al) (VB (V glu) i))
\end{verbatim}

%{i(/phpsyntaxtree/pngtree.php?data=[JJ [JJ [V [V [P en] [V konduk]] [V it]] a] j])} {i(/phpsyntaxtree/pngtree.php?data=[NN [N [J [J [A mal] [J rich]] [A eg]] [A ul]] o])} {i(/phpsyntaxtree/pngtree.php?data=[VB [P al] [VB [V glu] i]])}
\synttree [JJ [JJ [V [V [P en] [V konduk]] [V it]] a] j]
\synttree [NN [N [J [J [A mal] [J rich]] [A eg]] [A ul]] o]
\synttree [VB [P al] [VB [V glu] i]]
%\includegraphics[scale=0.5]{eoimg/tree3}
%\includegraphics[scale=0.5]{eoimg/tree4}
%\includegraphics[scale=0.5]{eoimg/tree5}

\subsection{Implementation}

\begin{itemize}
\item Goodman reduction: \href{http://www.github.com/andreasvc/eodop}{own implementation}, using 
\href{http://groups.google.com/group/nltk-dev/browse_thread/thread/86ca038723195978/c112b8d171b33d25}{NLTK}. 
Backoff DOP or DOP* may be added later. Fast PCFG parsing is done using 
\href{http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html}{bitpar} 
(Schmid 2004), a bit-vector-based chart parser.
\end{itemize}

In order to apply the Goodman reduction to an arbitrary treebank, the reduction
has been generalized to deal with arbitrary trees (not just trees in Chomsky
normal form). This is done by translating subtrees of the form 
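($A$ $B_1$ \ldots $B_n$) to rules of the form $A \rightarrow B_1 \ldots B_n$ with
relative frequency:

\[
\frac{\displaystyle\prod_{m = 1}^{n} w(B_m)}{\mathit{freq}(A)},
\qquad
w(B_m) =
\begin{cases}
\mathit{freq}(B_m) & \text{if $B_m$ has an id,} \\
1 & \text{otherwise,}
\end{cases}
\] 

where $\mathit{freq}(X)$ is the number of subtrees headed by $X$ (Goodman 1996).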

\vspace{2em}
In order to fully separate terminals from non-terminals, all terminals are
assigned a unique tag if they do not have one yet.
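
As a minimal sketch of these weights (not the linked implementation), the
following Python code derives the rules for a small treebank. The tuple
encoding of trees, the \texttt{label@id} notation for nodes with an id, and the
name \texttt{goodman\_rules} are assumptions made for illustration; terminals
are assumed to be dominated by their own tag already. As in Goodman (1996),
every node also yields rules with a bare left-hand side, normalized by the
summed count of all nodes carrying that label.

\begin{verbatim}
```python
from collections import defaultdict
from fractions import Fraction
from itertools import count, product


def goodman_rules(trees):
    """Map (lhs, rhs) to a weight for trees given as nested tuples."""
    ids = count(1)
    nodes = []                     # (label, id, freq, children)
    label_freq = defaultdict(int)  # summed freq of all nodes with a label

    def visit(tree):
        """Record an internal node; return (label, id, freq)."""
        label, children = tree[0], tree[1:]
        node_id = next(ids)
        kids = [c if isinstance(c, str) else visit(c) for c in children]
        freq = 1
        for kid in kids:
            if not isinstance(kid, str):
                freq *= kid[2] + 1  # child is either cut off or expanded
        nodes.append((label, node_id, freq, kids))
        label_freq[label] += freq
        return (label, node_id, freq)

    for tree in trees:
        visit(tree)

    rules = defaultdict(Fraction)
    for label, node_id, freq, kids in nodes:
        # A terminal has one option; an internal child is either bare
        # (weight 1) or carries its id (weight freq of that child).
        options = [
            [(kid, 1)] if isinstance(kid, str)
            else [(kid[0], 1), ("%s@%d" % (kid[0], kid[1]), kid[2])]
            for kid in kids
        ]
        addressed = "%s@%d" % (label, node_id)
        for combination in product(*options):
            rhs = tuple(symbol for symbol, _ in combination)
            numerator = 1
            for _, weight in combination:
                numerator *= weight
            rules[(addressed, rhs)] += Fraction(numerator, freq)
            rules[(label, rhs)] += Fraction(numerator, label_freq[label])
    return dict(rules)


treebank = [("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))]
rules = goodman_rules(treebank)
print(rules[("S@1", ("NP@2", "VP@4"))])
```
\end{verbatim}

For this one-sentence treebank the rule \texttt{S@1} $\rightarrow$
\texttt{NP@2 VP@4} receives weight $4/9$, and the weights of all rules sharing
a left-hand side sum to one.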

Previously considered possibilities:

\begin{itemize}
\item \href{http://staff.science.uva.nl/~simaan/dopdis/}{dopdis} (C): already
has Goodman reduction and DOP*;
\item \href{http://sourceforge.net/projects/lilian/}{lilian} (Java): has
Goodman reduction, no DOP*; also has U-DOP;
\item Gideon Borensztajn's \href{http://staff.science.uva.nl/~gideon/sourcecode/DOPParser.tar.gz}{DOPParser} (Java): has Goodman reduction.
\end{itemize}

\subsection{Segmentation}

Before a morphological structure can be assigned to a word, it must be
segmented into morphemes (similar to tokenization before parsing syntax).
Although it has been claimed that in agglutinative languages in general, and in
Esperanto in particular, it is ``trivial'' to recover the segments that make up
a word (e.g., Schubert 1993), this is a rather informal remark that is not
borne out in practice.  Morpheme boundaries are not marked, and ambiguities may
arise due to overlapping roots.

I have devised a form of ``Data-Oriented Segmentation'' to expand the
coverage of segmentation beyond that of the words in the morphology corpus. The
algorithm works as follows:

\begin{itemize}
\item take the set of segmented words in the corpus by reading off the leaves of their trees;
\item construct a dictionary from positions to the set of morphemes occurring at that position;
\item generate possible words by taking the Cartesian product of all morphemes occurring
 at positions 0 and 1, corresponding to all possible 2-morpheme words using the available
 vocabulary of roots;
\item repeat up to position $n$, where $n$ is the highest morpheme position in the treebank, to
 generate all possible words with up to $n+1$ morphemes.
\end{itemize}
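
The steps above can be sketched in Python as follows. This is only an
illustration under stated assumptions: segmented words are given as tuples of
morphemes, and the function name \texttt{positional\_candidates} and the
three-word corpus are hypothetical.

\begin{verbatim}
```python
from collections import defaultdict
from itertools import product


def positional_candidates(segmented_words):
    """Overgenerate segmentations from morphemes grouped by position."""
    by_position = defaultdict(set)
    for morphemes in segmented_words:
        for position, morpheme in enumerate(morphemes):
            by_position[position].add(morpheme)
    longest = max(len(morphemes) for morphemes in segmented_words)
    candidates = defaultdict(set)  # surface form -> segmentations
    for length in range(2, longest + 1):
        slots = [sorted(by_position[i]) for i in range(length)]
        for combination in product(*slots):
            candidates["".join(combination)].add(combination)
    return candidates


corpus = [("amik", "o"), ("ven", "is"), ("don", "as")]
candidates = positional_candidates(corpus)
print(sorted(candidates["venas"]))
print(len(candidates))
```
\end{verbatim}

The unseen word \texttt{venas} is now covered, but so are unwanted forms such
as \texttt{amikis}: nine candidate words are generated from three attested
ones.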

Unfortunately this algorithm suffers from overgeneration. This should be
remedied by discarding any segmentations contradicting the initial set of
(supervised) segmentations. 

An alternative method of generating segmentations:

\begin{itemize}
\item take the set of segmented words in the corpus by reading off the leaves of their trees;
\item construct a dictionary from number of morphemes to the words with that number of morphemes;
\item generate possible words with $n$ morphemes by taking the pointwise Cartesian product of
 all words with $n$ morphemes (i.e., \texttt{cartpi(zip(*words[n]))}).
\end{itemize}
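
A minimal Python sketch of this alternative, under the same assumptions as
before (tuples of morphemes as input; the name \texttt{lengthwise\_candidates}
and the corpus are hypothetical), with \texttt{itertools.product} standing in
for the Cartesian product:

\begin{verbatim}
```python
from collections import defaultdict
from itertools import product


def lengthwise_candidates(segmented_words):
    """Recombine morphemes only among words of the same length."""
    by_length = defaultdict(list)
    for morphemes in segmented_words:
        by_length[len(morphemes)].append(tuple(morphemes))
    candidates = set()
    for words in by_length.values():
        # zip(*words) transposes the words: one column per position
        columns = [sorted(set(column)) for column in zip(*words)]
        candidates.update(product(*columns))
    return candidates


corpus = [("amik", "o"), ("ven", "is"), ("prefiks", "o", "j")]
candidates = lengthwise_candidates(corpus)
print(len(candidates))
```
\end{verbatim}

Because morphemes are only exchanged between words of equal length, the root
\texttt{prefiks} is not combined with the endings of the two-morpheme words
here.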

This still overgenerates, though less so (e.g., word class, plural and
accusative endings may appear in the wrong order; it may be necessary to treat
endings separately). A third way would be to use a bigram model and produce
every possible sequence up to a certain length, which avoids these issues. A
fourth way would be to employ the context-free grammar described above to
generate all valid words up to a certain length given a collection of roots
along with their categories.

\subsection{DOP model composition}

In order to produce a combined morphology-syntax model, it is necessary to be
able to compose a DOP model and a treebank. This is defined in the following
manner:

\begin{itemize}
\item let $M$ be a DOP model and $S$ a treebank, where for example $M$ contains
      morphology and $S$ contains phrase structure trees.
\item the composition $S \circ M$ (written \texttt{S o M} in the example) yields
      a new DOP model by generating a new treebank $S'$ based on the trees in
      the treebank $S$ annotated with analyses of words parsed with $M$
      (assuming correct segmentation).
\item treebank $S'$ is generated by iterating over the POS tags of the trees in
      $S$ and substituting each POS tag, together with its word, with the tree
      that $M$ assigns to that word.
\item the morphology-syntax model is obtained by instantiating a DOP model from
      $S'$.
\end{itemize}

Note that this procedure assumes that disambiguation of morphology is
context-free and perfect: the most probable parse is used for decorating the
syntax treebank.  This assumption should be empirically verified. 

Example:

\begin{verbatim}
S := { (S (NP (NN amiko)) (VP (VB venis))) }
M := { (NN (N amik) o) (VB (V ven) is) }

S o M = { (S (NP (NN (N amik) o)) (VP (VB (V ven) is))) }
\end{verbatim}

S := %{i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN amiko]] [VP [VB venis]]])}
\synttree[S [NP [NN amiko]] [VP [VB venis]]]
%\includegraphics[scale=0.5]{eoimg/tree6}
M := %{i(/phpsyntaxtree/pngtree.php?data=[NN [N amik] o])}  {i(/phpsyntaxtree/pngtree.php?data=[VB [V ven] is])}
\synttree [NN [N amik] o]
\synttree [VB [V ven] is]
%\includegraphics[scale=0.5]{eoimg/tree7}
%\includegraphics[scale=0.5]{eoimg/tree8}
S o M = %{i(/phpsyntaxtree/pngtree.php?data=[S [NP [NN [N amik] o]] [VP [VB [V ven] is]]])}
\synttree [S [NP [NN [N amik] o]] [VP [VB [V ven] is]]]
%\includegraphics[scale=0.5]{eoimg/tree9}
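
The substitution step can be sketched in Python as follows. This is an
illustration only: trees are encoded as nested tuples, and the dictionary
\texttt{M} of ready-made analyses stands in for parsing each word with the
morphology model and keeping its most probable parse. The function name
\texttt{compose} is hypothetical.

\begin{verbatim}
```python
def compose(tree, analyses):
    """Replace each POS-tagged word by its morphological tree."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        # Preterminal: substitute the analysis of the word, or keep
        # the node unchanged if the word is not covered.
        return analyses.get(children[0], tree)
    return (label,) + tuple(compose(child, analyses) for child in children)


S = [("S", ("NP", ("NN", "amiko")), ("VP", ("VB", "venis")))]
M = {
    "amiko": ("NN", ("N", "amik"), "o"),
    "venis": ("VB", ("V", "ven"), "is"),
}
composed = [compose(tree, M) for tree in S]
print(composed[0])
```
\end{verbatim}

The resulting trees form the treebank $S'$ from which the combined model is
instantiated.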

\subsection{Corpora}

Toy corpora:

\begin{itemize}
\item Morphology: hand-annotated list of 290 words, containing all closed class
      words and affixes, and various open class roots and derivations. Compiled
      from various more or less naturalistic sources (e.g., Wennergren 2005,
      Miner 2006b).
\item Syntax: hand-annotated list of 14 sentences (first paragraph of
      Zamenhof's Dua Libro). Coverage of morphology is 100\% with respect to
      this corpus.
\end{itemize}

Treebanks:

\begin{itemize}
\item morphology: semi-supervised corpus generated from dictionaries (TBD)
\item syntax: Monato treebank (Bick, personal communication), a corpus parsed
      with EspGram (Bick 2007).  Number of sentences: 1,995; tokens: 30,397;
      types: 9,247. Average sentence length: 15.338. The resulting grammar is 5 GB.
      The treebank requires preprocessing: a basic filter was applied to prune
      parse trees whose leaves do not agree with the original input sentence,
      and unique POS tags were inserted for punctuation.
\end{itemize}

\section{Results}
\subsection{Results on toy corpora}

The following example uses a syntax corpus and a morphology corpus that do not
contain the word ``ven'as'', together with a morphology model that can derive
it from ``don'as'' and the past tense ``ven'is'':

\begin{verbatim}
sentence: amiko venas
morphology:
(NN (N amik) o) (p=0.00417101147028)
(VB (V ven) as) (p=0.000334168755221)
syntax:
error Grammar does not cover some of the input words: "'venas'".
morphology + syntax combined:
['amik', 'o', 'ven', 'as']
(S (NP (NN (N amik) o)) (VP (VB (V ven) as))) (p=1.12188584593e-28)
\end{verbatim}

The corpus contains the plural ``prefiksoj,'' which is inflected to an accusative here:

\begin{verbatim}
sentence: mi donas prefikson
morphology:
(PRP mi) (p=1.0)
(VB (V don) as) (p=0.0350877192982)
(NN (NN (N prefiks) o) n) (p=6.08906783983e-05)
syntax:
error Grammar does not cover some of the input words: "'prefikson'".
morphology + syntax combined:
['mi', 'don', 'as', 'prefiks', 'o', 'n']
(S
  (NP (PRP mi))
  (VP
    (VB (V don) as)
    (NP (NN (NN (N prefiks) o) n)))) (p=9.85999896556e-46)
\end{verbatim}

However, it is perhaps unfair not to assign categories to unknown words. 
In the following results I let a deterministic finite state automaton
assign the right POS tags to unknown words, and use a list of possible
morpheme tags with uniform probabilities to tag unknown morphemes (for words
with a single root the morpheme tagging will default to the POS tag marked
by the ending, which will usually be correct).
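
The exact automaton is not reproduced here. As a rough sketch of the idea, the
following Python function guesses the POS tag of an unknown open-class word
from its grammatical ending (\texttt{-o} for nouns, \texttt{-a} for adjectives,
\texttt{-e} for adverbs, and the verbal endings), optionally followed by the
plural \texttt{-j} and accusative \texttt{-n}. Such suffix rules describe a
regular language and could therefore be compiled into a deterministic
finite-state automaton; the function name \texttt{guess\_pos} and the reduced
tag set are assumptions.

\begin{verbatim}
```python
VERB_ENDINGS = ("as", "is", "os", "us", "i", "u")


def guess_pos(word):
    """Guess the POS tag of an unknown open-class word from its ending."""
    word = word.lower()
    # Nouns and adjectives may carry plural -j, then accusative -n.
    stem = word[:-1] if word.endswith("n") else word
    stem = stem[:-1] if stem.endswith("j") else stem
    if stem.endswith("o"):
        return "NN"
    if stem.endswith("a"):
        return "JJ"
    if word.endswith("e"):
        return "RB"
    if word.endswith(VERB_ENDINGS):
        return "VB"
    return None  # no recognizable ending


tags = [guess_pos(w) for w in "tiadelaradon teluro didelas".split()]
print(tags)
```
\end{verbatim}

Closed-class words such as ``la'' or ``mi'' would be mistagged by these rules,
so the sketch only makes sense for words not covered by the corpus.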

Here is a longer sentence from later in the ``Dua Libro'' (which is, fittingly,
about word formation in Esperanto):

\begin{verbatim}
sentence: Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj
morphology:
Vortoj (NN (NN (N Vort) (NN_o o)) (NN_j j))
kunmetitaj (JJ (J (V kunmetit) (J_a a)) (JJ_j j))
estas (VB (V est) (VB_as as))
kreataj (JJ (J (V kreat) (J_a a)) (JJ_j j))
per (IN per)
simpla (JJ (J simpl) (JJ_a a))
kunligado (NN (J kunligad) (NN_o o))
de (IN de)
simplaj (JJ (J (V simpl) (J_a a)) (JJ_j j))
vortoj (NN (NN (N vort) (NN_o o)) (NN_j j))
morphology + syntax combined:
['Vort', 'o', 'j', 'kunmetit', 'a', 'j', 'est', 'as', 'kreat', 'a', 'j', 
 'per', 'simpl', 'a', 'kunligad', 'o', 'de', 'simpl', 'a', 'j', 'vort', 'o', 'j']
(S
  (NP (NN (NN (N Vort) (NN_o o)) (NN_j j)))
  (VP
    (NP (JJ (J (V kunmetit) (J_a a)) (JJ_j j)))
    (VP
      (VB (V est) (VB_as as))
      (VP
        (JJ (J (V kreat) (J_a a)) (JJ_j j))
        (NP
          (NP
            (JJ (V (N per) (V simpl)) (JJ_a a))
            (NN (N kunligad) (NN_o o)))
          (PP
            (IN de)
            (N'
              (JJ (J (V simpl) (J_a a)) (JJ_j j))
              (NN (NN (N vort) (NN_o o)) (NN_j j)))))))))
\end{verbatim}

%{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=[S [NP [NN [NN [N Vort] [NN_o o]] [NN_j j]]] [VP[NP [JJ [J [V kunmetit]  [J_a a]] [JJ_j j]]][VP [VB [V est] [VB_as as]] [VP[JJ [J [V kreat] [J_a a]] [JJ_j j]][NP [NP[JJ [V [N per] [V simpl]] [JJ_a a]][NN [N kunligad] [NN_o o]]] [PP[IN de][N' [JJ [J [V simpl] [J_a a]] [JJ_j j]] [NN [NN [N vort] [NN_o o]] [NN_j j]]]]]]]]])}
%{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=
\synttree [S [NP [NN [NN [N Vort] [NN\_o o]] [NN\_j j]]] [VP[NP [JJ [J [V kunmetit]  [J\_a a]] [JJ\_j j]]][VP [VB [V est] [VB\_as as]] [VP[JJ [J [V kreat] [J\_a a]] [JJ\_j j]][NP [NP[JJ [V [N per] [V simpl]] [JJ\_a a]][NN [N kunligad] [NN\_o o]]] [PP[IN de][N' [JJ [J [V simpl] [J\_a a]] [JJ\_j j]] [NN [NN [N vort] [NN\_o o]] [NN\_j j]]]]]]]]]
%\includegraphics[scale=0.5,angle=90]{eoimg/tree10}

(Translation: Derived words are created using simple concatenation of simple
words [NB: words means roots here])

There are some mistakes in segmenting (the expected segmentations are
kun-met-it, kre-at, per simpl-a, kun-lig-ad).  The phrase structure has
mistakes as well, e.g., ``vortoj kunmetitaj'' should be a constituent, and
``per simpla\ldots'' should be a PP, but this is overlooked because it got an
incorrect POS tag. But given that the syntax corpus contains only 14 sentences
it is perhaps striking that a parse was produced at all.

The modularist approach yields the following parse tree:

\begin{verbatim}
syntax & morphology separate:
Vortoj kunmetitaj estas kreataj per simpla kunligado de simplaj vortoj 
(S
  (NP
    (NN (NN (N Vort) (NN_o o)) (NN_j j))
    (JJ (J (V kunmetit) (J_a a)) (JJ_j j)))
  (VP
    (VP
      (VB (V est) (VB_as as))
      (NP (JJ (J (V kreat) (J_a a)) (JJ_j j)) (IN per)))
    (NP
      (NP (JJ (J simpl) (JJ_a a)) (NN (J kunligad) (NN_o o)))
      (PP
        (IN de)
        (N'
          (JJ (J (V simpl) (J_a a)) (JJ_j j))
          (NN (NN (N vort) (NN_o o)) (NN_j j)))))))
\end{verbatim}

%{i(https://unstable.nl/phpsyntaxtree/pngtree.php?data=[S [NP[NN [NN [N Vort] [NN_o o]] [NN_j j]][JJ [J [V kunmetit] [J_a a]] [JJ_j j]]] [VP[VP [VB [V est] [VB_as as]] [NP [JJ [J [V kreat] [J_a a]] [JJ_j j]] [IN per]]][NP [NP [JJ [J simpl] [JJ_a a]] [NN [J kunligad] [NN_o o]]] [PP[IN de][N' [JJ [J [V simpl] [J_a a]] [JJ_j j]] [NN [NN [N vort] [NN_o o]] [NN_j j]]]]]]])}
\synttree [S [NP[NN [NN [N Vort] [NN\_o o]] [NN\_j j]][JJ [J [V kunmetit] [J\_a a]] [JJ\_j j]]] [VP[VP [VB [V est] [VB\_as as]] [NP [JJ [J [V kreat] [J\_a a]] [JJ\_j j]] [IN per]]][NP [NP [JJ [J simpl] [JJ\_a a]] [NN [J kunligad] [NN\_o o]]] [PP[IN de][N' [JJ [J [V simpl] [J\_a a]] [JJ\_j j]] [NN [NN [N vort] [NN\_o o]] [NN\_j j]]]]]]]
%\includegraphics[scale=0.5,angle=90]{eoimg/tree11}

The morphology is identical, but syntactically the results are a little
different, e.g., the first noun and adjective are together in an NP. However,
the preposition ``per'' appears oddly at the end of an NP, instead of
introducing a PP (in the previous tree it ended up as a prefix inside an NP
because the model cannot distinguish between word and morpheme boundaries).

That the deterministic finite-state automaton is working can be seen from
the following nonsense input:

\begin{verbatim}
sentence: tiadelaradon teluro didelas
morphology:
tiadelaradon (NN (NN (N tiadelarad) (NN_o o)) (NN_n n))
teluro (NN (N telur) (NN_o o))
didelas (VB (V didel) (VB_as as))
morphology + syntax combined:
['tiadelarad', 'o', 'n', 'telur', 'o', 'didel', 'as']
(S
  (NP (NN (NN (N tiadelarad) (NN_o o)) (NN_n n)))
  (VP (NP (NN (N telur) (NN_o o))) (VP (VB (V didel) (VB_as as)))))
syntax & morphology separate:
(S
  (NP
    (NN (NN (N tiadelarad) (NN_o o)) (NN_n n))
    (NN (N telur) (NN_o o)))
  (VP (VB (V didel) (VB_as as))))
\end{verbatim}

As can be seen, the words and roots receive the correct POS tags, and the
analysis does not simply follow the default SVO order.
The DOP model where morphology is opaque to syntax considers the two nouns
to be a single noun phrase, which could have been trivially excluded by
attending to the morphology (but perhaps an accusative category label would
have been enough).

\begin{comment}
\subsection{Todo}

\item parse bitpar chart output into NLTK (currently only most probable derivation; 
  we need $n$ most probable parses and maybe shortest derivation, SL-DOP etc.)
\item use Reta Vortaro / ergane Esperanto dictionary and root lists 
  to induce segmentation / morphology model in a semi-supervised fashion.
\item check morphology coverage against vocabulary of Monato treebank
\item distinguish between morpheme and word boundaries (how?).
  possibly by having a trailing space as part of a morphological analysis 
  (but: this should not block inflection for plurality and accusative (+j and +n respectively).
\item finish report (convert wiki to latex?). evaluation \& conclusion.
  write about Dasgupta (2008) \& is there work on DOP + dependencies? 
  mention DLT as older exemplar model.
\item look at DOP* / U-DOP
\end{comment}


\subsection{Evaluation}

TBD. 

\begin{enumerate}
\item Construct all possible test sets of 2 sentences from the toy corpus of 14
sentences, evaluate with evalb (or extend toy corpus to the complete
Fundamento).
\item Tenfold cross-validation on the Monato treebank, with and without regard for morphology.
\end{enumerate}
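
The first option amounts to $\binom{14}{2} = 91$ train/test splits. A minimal
Python sketch of how they could be enumerated is given below; the function name
\texttt{held\_out\_splits} and the placeholder strings are hypothetical, and
scoring each split with evalb is left out.

\begin{verbatim}
```python
from itertools import combinations


def held_out_splits(sentences, test_size=2):
    """Yield (train, test) for every possible test set of a given size."""
    for test_indices in combinations(range(len(sentences)), test_size):
        held_out = set(test_indices)
        test = [sentences[i] for i in test_indices]
        train = [s for i, s in enumerate(sentences) if i not in held_out]
        yield train, test


# Placeholders standing in for the 14 annotated trees.
toy_corpus = ["sentence %d" % i for i in range(14)]
splits = list(held_out_splits(toy_corpus))
print(len(splits))
```
\end{verbatim}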

\subsection{Conclusion}

We have described a regular grammar that enumerates the word forms of
Esperanto's lexicon, which can be used to automatically segment word strings.
Using a DOP model the resulting sequence of morphemes and tags can be analysed
and assigned a hierarchical structure. The resulting DOP model can either be
merged with a syntactic treebank into a combined DOP model, or mapped to the
leaves of the parse trees produced by a syntactic model, to obtain tree
structures with both phrasal and morphological constituents.

We described an implementation of the Goodman reduction using NLTK,
generalized to arbitrary trees, which outputs a grammar that can be used by
the efficient chart parser bitpar. Using a list of open-class tags and a
deterministic finite-state automaton, we can assign tags to unknown words and
morphemes.

The resulting system has been applied to toy corpora of morphology and
syntax, illustrating the potential advantage of merging morphology and syntax
treebanks before constructing a DOP model. Evaluation with a larger syntactic
treebank, as well as the induction of morphology tags from dictionaries,
remains to be done.


\section{References}

\begin{description}
\item[Bick], Eckhard (2007), ``Tagging and Parsing an Artificial Language: an
annotated web-corpus of Esperanto,'' in: {\em Proceedings of Corpus
Linguistics}, Birmingham, UK.
\url{http://beta.visl.sdu.dk/pdf/CorpusLinguistics2007_esp.pdf}

\item[Bod], Rens \& Scha, Remko (1996) ``Data-Oriented Language Processing: an
overview.'' Research reports, Institute for Logic, Language and Computation,
University of Amsterdam. \url{http://dare.uva.nl/document/1144}

\item[Clancey], W.J. (1991), ``The biology of consciousness: Comparative review
of Israel Rosenfield, The Strange, Familiar, and Forgotten: An anatomy of
Consciousness and Gerald M. Edelman, Bright Air, Brilliant Fire: On the Matter
of the Mind,'' {\em Artificial Intelligence} vol. 60, pp. 313--356

\item[Gobbo], Federico (2009), ``Adpositional Grammars: a multilingual grammar
formalism for NLP,'' PhD dissertation, Universita degli Studi dell'Insubria.

\item[Goodman], Joshua (1996), ``Efficient Algorithms for Parsing the DOP Model,'' {\em Proceedings of Empirical Methods in Natural Language Processing}, pp. 143--152.
\url{http://acl.ldc.upenn.edu/W/W96/W96-0214.pdf}

\item[Jackendoff], Ray (2003), ``Précis of Foundations of Language: Brain, Meaning,
Grammar, Evolution,'' Behavioral and Brain Sciences (2003), 26:6:651-665
Cambridge University Press.

\item[Jansen], W. (2007), ``Woordvolgorde in het Esperanto: normen, taalgebruik en
universalia'' (Word order in Esperanto: norms, usage and universals). PhD
thesis, LOT Utrecht.

\item[Jurafsky], D. \& Martin, J.H. (2000), ``Speech \& Language Processing: An
introduction to natural language processing, computational linguistics, and
speech recognition,'' Pearson Education.

\item[Haitao], Liu (2001), ``Creoles, Pidgins, and Planned Languages.'' Interface.
Journal of Applied Linguistics / Tijdschrift voor Toegepaste Linguïstiek 15 [2]. pp. 121--177.

\item[Kalocsay], Kálmán \& Waringhien, Gaston (1980), Plena Analiza Gramatiko de
Esperanto (Complete, analyzed Grammar of Esperanto), Rotterdam, Universala
Esperanto-Asocio.

\item[Maat], Jaap (1999), ``Philosophical Languages in the Seventeenth Century:
Dalgarno, Wilkins, Leibniz,'' Amsterdam, Institute for Logic, Language and
Computation.

\item[MacWhinney], B. (1987), ``Mechanisms of Language Acquisition,'' Lawrence Erlbaum Associates, NJ.

\item[Miner], Ken (2006a), ``Tranchitaj frazoj kaj la probleme pri negative
evidento'' (Distituents and the problem of negative evidence). March 2006.
\url{http://www.sunflower.com/~miner/NEGATIVA_package/negativa.html}

\item[Miner], Ken (2006b), ``Rimarkoj pri `En la komenco estas la vorto' de Geraldo
Mattos (fina versio),'' (Comments on `In the beginning was the word' by Geraldo
Mattos (final version)). \url{http://www.sunflower.com/~miner/EKVO_package/ekvo.html}

\item[Miner], Ken (2008), ``La neebleco de priesperanto lingvoscienco,'' (The
impossibility of Esperanto linguistics). October 2008.
\url{http://www.sunflower.com/~miner/LINGVISTIKO_package/lingvistiko.html}
Also published in ``La arto labori kune : festlibro por Humphrey Tonkin'' (The
art of working together: Festschrift for Humphrey Tonkin). Rotterdam, Universala
Esperanto Asocio, January 2010

\item[Pinker], S. (1994). The language instinct: How the mind creates language. New York: W. Morrow.

\item[Prescher], D., Scha, R., Sima'an, K., Zollmann, A. (2004), ``On the statistical
consistency of DOP estimators.'' In {\em Proceedings of the 14th Meeting of
Computational Linguistics in the Netherlands}, Antwerp, Belgium.

\item[Scha], Remko (1990), ``Taaltheorie en Taaltechnologie; Competence en
Performance'' (Language theory and language technology: Competence and
Performance), in Q.A.M. de Kort and G.L.J. Leerdam (eds.), {\em Computertoepassingen in de Neerlandistiek}, pp. 7--22, Almere: Landelijke
Vereniging van Neerlandici (LVVN-jaarboek). English translation
\url{http://www.hum.uva.nl/computerlinguistiek/scha/IAAA/rs/cv.html}

\item[Schleyer], Johan Martin (1884), ``Volapük. Grammatik der Universalsprache für
alle gebildete Erdbewohner,'' Überlingen am Bodensee: Buchdruckerei August
Feyel, Buchhandlung Aug. Schoy. Third edition.

\item[Schmid], Helmut (2004), ``Efficient Parsing of Highly Ambiguous Context-Free
Grammars with Bit Vectors,'' {\em Proceedings of the 20th International Conference
on Computational Linguistics} (COLING 2004), Geneva, Switzerland.
\url{http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/COLING04/BitPar.pdf}

\item[Schmid], Helmut, Arne Fitschen and Ulrich Heid (2004), ``SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection,'' {\em Proceedings of the IVth International Conference on Language Resources and Evaluation} (LREC 2004), pp. 1263--1266, Lisbon, Portugal. \url{http://www.ims.uni-stuttgart.de/www/projekte/gramotron/PAPERS/LREC04/smor.pdf}

\item[Schubert], Klaus (1989), ``An unplanned development in planned languages,'' in Klaus Schubert (ed.), {\em Interlinguistics: Aspects of the Science of Planned Languages} [= Trends in Linguistics: Studies and Monographs 42], Mouton de Gruyter.

\item[Schubert], Klaus (1993), ``Semantic compositionality: Esperanto word-formation
for language technology.'' {\em Linguistics} 31: 311-365.

\item[Wells], John (1989), ``Lingvistikaj aspektoj de Esperanto,'' Universala
Esperanto Asocio, Rotterdam. Second edition.

\item[Wennergren], Bertilo (2005), ``Plena Manlibro de Esperanta Gramatiko,''
(Complete handbook of Esperanto Grammar), version 13.0, 14th of April 2005.
Available online at \url{http://bertilow.com/pmeg/}.

\item[Zamenhof], Dr. L. L. (1887/1968), ``Internationale Sprache. Vorrede und
Vollständiges Lehrbuch,'' Warschau, photographic reprint from 1968
(Saarbrücken: Artur E. Illtis). German translation of the original Russian
brochure.

\item[Zamenhof], Dr. L. L. (1905/1963), ``Fundamento de Esperanto.'' Ninth edition
with Introduction, Notes and Linguistics comments, edited by Dr. A. Albault
(Esperantaj Francaj Eldonoj: Marmande, 1963).

\item[Zollmann], Andreas \& Sima'an, Khalil (2005), ``A Consistent and Efficient
Estimator for DOP.''  {\em Journal of Automata, Languages and Combinatorics} vol.
10, pp. 367.  \url{http://staff.science.uva.nl/~simaan/D-Papers/JALCsubmit.pdf}

\end{description}

\end{document}

\subsection{Needed references}

everything seems to be there.

\subsection{Possible references }

Dasgupta, Probal (2008), ``Interlexical studies: a cognitive approach,'' talk
delivered on 18th of April 2008, Amsterdam Centre for Language and
Communication.

DLT: Distributed Language Translation project.

(from Miner 2006a)

Sakaguchi, Alicja, 1996. Die Dichotomie "künstlich" vs. "natürlich" und das historische Phänomen einer funktionierenden Plansprache. Language Problems and Language Planning 20:1.

Gledhill, Christopher, 2000. The Grammar of Esperanto: A Corpus-Based Description. Lincom Europa.

Grimley-Evans, Edmundo, 1997. "Vortfarado", (Word derivation) La Brita Esperantisto, marto-aprilo 1997, pp. 57-59.

