MNR Esperanto Stemmer
The MNR Esperanto Stemmer is a stemming algorithm useful for searching
databases of Esperanto text. It conflates words along three dimensions
only: mood, number, and role (hence “MNR”). Notably, affixed words are
not conflated with their unaffixed forms. This allows one to search for
precise terms while ignoring grammatical variation that does not change
the meaning.
The conflation criteria and algorithm are described in detail below.
Conflation criteria
The MNR stemmer conflates words along the following three dimensions:
- Mood:
  - All verb inflections are considered equivalent: ‑i, ‑u,
    ‑as, ‑is, ‑os, ‑us.
  - All active participle forms are considered equivalent, when in the
    final position, regardless of the part of speech: ‑ant‑,
    ‑int‑, ‑ont‑, ‑unt‑.
  - All passive participle forms are considered equivalent, when in
    the final position, regardless of the part of speech: ‑at‑,
    ‑it‑, ‑ot‑, ‑ut‑.
  - (Note that these three classes of forms are not
    considered equivalent to each other.)
- Number:
  - Singular and plural forms are considered equivalent: ‑a(n),
    ‑aj(n); ‑o(n), ‑oj(n); ‑iu(n), ‑iuj(n).
- Role:
  - Accusative and non‐accusative forms are considered equivalent:
    ‑a(j), ‑a(j)n; ‑e, ‑en; ‑o(j), ‑o(j)n;
    ‑iu(j), ‑iu(j)n.
Additionally, abbreviated forms (la, l’; ‑o, ‑’) are considered
equivalent to the words they abbreviate.
Notably, parts of speech are not conflated: forms that differ only in
their part‐of‐speech ending (‑a, ‑e, ‑i, ‑o) remain distinct. Neither
are roots without endings conflated with roots with endings, even if the
part of speech remains the same (so, for example, plu and plue are not
conflated).
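As a rough illustration of the criteria above, the following Python sketch enumerates, for a regular root, the groups of surface forms that would be treated as equivalent. The function name conflation_classes and the example root am are assumptions made for this illustration; the sketch ignores participles, the ‑iu words, and every exception the algorithm handles.

```python
# Endings taken from the criteria above; "’" is the abbreviated form of -o.
VERB_ENDINGS = ("i", "u", "as", "is", "os", "us")
NOUN_ENDINGS = ("o", "on", "oj", "ojn", "’")
ADJECTIVE_ENDINGS = ("a", "an", "aj", "ajn")
ADVERB_ENDINGS = ("e", "en")


def conflation_classes(root):
    """Return four sets of forms (verb, noun, adjective, adverb).

    Forms inside one set are conflated with each other; forms in
    different sets are not, because parts of speech stay separate.
    Hypothetical helper, valid only for regular roots.
    """
    groups = (VERB_ENDINGS, NOUN_ENDINGS, ADJECTIVE_ENDINGS, ADVERB_ENDINGS)
    return [{root + ending for ending in endings} for endings in groups]


verbs, nouns, adjectives, adverbs = conflation_classes("am")
print(sorted(verbs))
```

Here amas and amis land in the same set, while amo and ama fall into different sets, mirroring the rule that mood, number, and role are conflated but parts of speech are not.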
Finally, roots without endings – which do not follow standard
word‐building rules – are never conflated with invalid derivations
thereof. For example, the invalid derivation den and the valid
root de are not conflated. This permits the MNR stemmer to be
used to check spelling.
Algorithm
The input to the MNR stemmer algorithm is a single lowercase
Esperanto word. The algorithm outputs a string which is
representative of the set of words which conflate with the input word.
Several steps of the algorithm refer to vowels. A vowel is exactly
a, e, i, o, or u, not j or ŭ.
Several steps of the algorithm refer to ‑io and ‑iu
words. The ‑io words are exactly io, ĉio,
kio, tio, alio, nenio, and kelkio,
and the ‑iu words are exactly iu, ĉiu,
kiu, tiu, aliu, neniu, and
kelkiu.
The main algorithm proceeds as follows.
- If the input word is l’ or la, output la and terminate.
- At most one of the following conditions will be true of the input word.
  If one is, modify the input word as indicated. (Here and below, “the
  prefix” is the part of the input word that precedes the named suffix.)
  - The input word ends with ‑’, the prefix contains a
    vowel, the input word is not an ‑io word, and the prefix
    is not maltr‑. Replace this suffix with ‑o.
  - The input word ends with ‑as, ‑is, ‑os,
    ‑us, or ‑u, the prefix contains a vowel, the input
    word is not minus, unu, or an ‑iu word, and
    the prefix is not il‑, malpl‑, or on‑.
    Replace this suffix with ‑i.
  - The input word ends with ‑aj, ‑oj, ‑an,
    ‑en, or ‑on, the prefix contains a vowel, and the
    input word is not amen, tamen, disden,
    ekden, maltroj, or maltron. Remove the
    last letter of this suffix.
  - The input word ends with ‑ajn or ‑ojn, the
    prefix contains a vowel, and the input word is not
    maltrojn. Remove the last two letters of this
    suffix.
  - The input word ends with ‑j, ‑n, or ‑jn, and the prefix
    is an ‑iu word. Remove this suffix.
- If the (possibly modified) input word is a known root with a participle
  ending, output the input word and terminate.
- If the input word ends with ‑(n)ta, ‑(n)te, ‑(n)ti, or
  ‑(n)to, and all of the following are true:
  - the prefix is at least two letters long
  - the last letter of the prefix is i, o, or u
  - a vowel occurs prior to the last letter of the prefix

  then replace the last letter of the prefix with a.
- Output the (possibly modified) input word and terminate.
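The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation. The names stem, apply_ending_rule, normalize_participle, and EXAMPLE_ROOTS are invented here. The list of known roots with participle‐like endings is not given in this description, so EXAMPLE_ROOTS is a tiny hypothetical stand‐in, matched against the word minus its final letter. The sketch also assumes that the ‑io check for an abbreviated word applies to the restored form, and it accepts the ASCII apostrophe alongside ’.

```python
VOWELS = set("aeiou")  # j and ŭ are deliberately not vowels
IO_WORDS = {"io", "ĉio", "kio", "tio", "alio", "nenio", "kelkio"}
IU_WORDS = {"iu", "ĉiu", "kiu", "tiu", "aliu", "neniu", "kelkiu"}
# Hypothetical stand-in: the real list of known roots is not given in the text.
EXAMPLE_ROOTS = frozenset({"merit", "invit", "horizont"})


def has_vowel(text):
    return any(ch in VOWELS for ch in text)


def prefix_before(word, suffixes):
    """Return the part of word before the first matching suffix, else None."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return None


def apply_ending_rule(word):
    """Apply whichever ending rule matches; at most one is expected to."""
    p = prefix_before(word, ("’", "'"))
    # Assumption: the -io exclusion is tested on the restored form p + "o".
    if (p is not None and has_vowel(p)
            and p + "o" not in IO_WORDS and p != "maltr"):
        return p + "o"
    p = prefix_before(word, ("as", "is", "os", "us", "u"))
    if (p is not None and has_vowel(p)
            and word not in ({"minus", "unu"} | IU_WORDS)
            and p not in {"il", "malpl", "on"}):
        return p + "i"
    p = prefix_before(word, ("aj", "oj", "an", "en", "on"))
    if (p is not None and has_vowel(p) and word not in
            {"amen", "tamen", "disden", "ekden", "maltroj", "maltron"}):
        return word[:-1]
    p = prefix_before(word, ("ajn", "ojn"))
    if p is not None and has_vowel(p) and word != "maltrojn":
        return word[:-2]
    p = prefix_before(word, ("jn", "j", "n"))
    if p in IU_WORDS:
        return p
    return word


def normalize_participle(word):
    """Map -int-, -ont-, -unt- to -ant- and -it-, -ot-, -ut- to -at-."""
    for suffix in ("nta", "nte", "nti", "nto", "ta", "te", "ti", "to"):
        if word.endswith(suffix):
            p = word[: -len(suffix)]
            if len(p) >= 2 and p[-1] in "iou" and has_vowel(p[:-1]):
                return p[:-1] + "a" + suffix
    return word


def stem(word, known_roots=EXAMPLE_ROOTS):
    if word in ("la", "l’", "l'"):
        return "la"
    word = apply_ending_rule(word)
    # Assumption: a known root is matched against the word minus its ending.
    if word[:-1] in known_roots:
        return word
    return normalize_participle(word)


for example in ("hundojn", "iras", "amontajn", "kiujn", "tamen"):
    print(example, "->", stem(example))
```

With these assumptions, hundojn reduces to hundo, iras to iri, and amontajn to amanta, while tamen and minus pass through unchanged. A production version would need the full list of known roots in place of EXAMPLE_ROOTS.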