MNR Esperanto Stemmer

The MNR Esperanto Stemmer is a stemming algorithm useful for searching databases of Esperanto text. It conflates words along three dimensions only: mood, number, and role (hence “MNR”). Notably, affixed words are not conflated with their unaffixed forms. This allows one to search for precise terms while ignoring grammatical variation that does not change the meaning.

The conflation criteria and algorithm are described in detail below.

Conflation criteria

The MNR stemmer conflates words along the following three dimensions:

Mood:
  All verb inflections are considered equivalent: ‑i, ‑u, ‑as, ‑is, ‑os, ‑us.
  All active participle forms are considered equivalent, when in the final position, regardless of the part of speech: ‑ant‑, ‑int‑, ‑ont‑, ‑unt‑.
  All passive participle forms are considered equivalent, when in the final position, regardless of the part of speech: ‑at‑, ‑it‑, ‑ot‑, ‑ut‑.
  (Note that these three classes of forms are not considered equivalent to each other.)
Number:
  Singular and plural forms are considered equivalent: ‑a(n), ‑aj(n); ‑o(n), ‑oj(n); ‑iu(n), ‑iuj(n).
Role:
  Accusative and non‐accusative forms are considered equivalent: ‑a(j), ‑a(j)n; ‑e, ‑en; ‑o(j), ‑o(j)n; ‑iu(j), ‑iu(j)n.
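
The number and role criteria can be illustrated with a short Python sketch. This is not the stemmer's actual procedure: the function name fold_number_and_role, the choice of the singular non‐accusative form as the representative, and the tiny BARE_ROOTS sample are all assumptions made for illustration. The ‑iu words are those listed in the Algorithm section.

```python
import re

# The -iu words, as listed in the Algorithm section of this document.
IU_WORDS = frozenset({"iu", "ĉiu", "kiu", "tiu", "aliu", "neniu", "kelkiu"})

# ASSUMPTION: a deliberately tiny, hypothetical sample of roots without
# endings. A real implementation would need a complete list.
BARE_ROOTS = frozenset({"de", "je", "se", "la", "kaj", "jen", "sen", "tamen"})


def fold_number_and_role(word: str) -> str:
    """Map a lowercase word to its singular, non-accusative form (a sketch)."""
    # Roots without endings are left alone (e.g. "kaj" is not "ka" + "j").
    if word in BARE_ROOTS:
        return word

    folded = word
    iu = re.fullmatch(r"(.*iu)j?n?", word)
    if iu and iu.group(1) in IU_WORDS:
        # -iu(j)(n) applies only to the listed -iu words.
        folded = iu.group(1)
    else:
        # -a(j)(n) and -o(j)(n), then -e(n).
        m = re.fullmatch(r"(.+[ao])j?n?", word) or re.fullmatch(r"(.+e)n", word)
        if m:
            folded = m.group(1)

    # Never conflate an invalid derivation (e.g. "den") with a bare root ("de").
    return word if folded in BARE_ROOTS else folded
```

Under these assumptions, hundo, hundoj, hundon, and hundojn all fold to hundo, while hunda stays distinct because parts of speech are not conflated.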

Additionally, abbreviated forms (la, l’; ‑o, ‑’) are considered equivalent to the words they abbreviate.

Notably, the part‐of‐speech endings ‑a, ‑e, ‑i, and ‑o are not conflated with one another. Nor are roots without endings conflated with the same roots carrying an ending, even when the part of speech remains the same; for example, plu and plue are not conflated.

Finally, roots without endings – which do not follow standard word‐building rules – are never conflated with invalid derivations thereof. For example, the invalid derivation den and the valid root de are not conflated. This permits the MNR stemmer to be used to check spelling.

Algorithm

The input to the MNR stemmer algorithm is a single lowercase Esperanto word. The algorithm outputs a string that represents the set of words conflated with the input word.

Several steps of the algorithm refer to vowels. The vowels are exactly a, e, i, o, and u; j and ŭ are not vowels.

Several steps of the algorithm refer to ‑io and ‑iu words. The ‑io words are exactly io, ĉio, kio, tio, alio, nenio, and kelkio, and the ‑iu words are exactly iu, ĉiu, kiu, tiu, aliu, neniu, and kelkiu.
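
These definitions translate directly into constants. The following Python sketch shows one way to encode them; the names VOWELS, IO_WORDS, IU_WORDS, and is_vowel are hypothetical, and the code assumes the input uses precomposed (NFC) characters such as ĉ and ŭ.

```python
# Vowels as defined above: j and ŭ are deliberately excluded.
VOWELS = frozenset("aeiou")

# The closed sets of -io and -iu words, copied from the lists above.
IO_WORDS = frozenset({"io", "ĉio", "kio", "tio", "alio", "nenio", "kelkio"})
IU_WORDS = frozenset({"iu", "ĉiu", "kiu", "tiu", "aliu", "neniu", "kelkiu"})


def is_vowel(ch: str) -> bool:
    """Return True if ch is one of the five vowels a, e, i, o, u."""
    return ch in VOWELS
```

Because both word sets are closed, membership is an exact lookup rather than a suffix test: radio ends in ‑io but is not an ‑io word.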

The main algorithm proceeds as follows.

  1. If the input word is l’ or la, output la and terminate.
  2. At most one of the following conditions will be true of the input word. If one is, modify the input word as indicated.
  3. If the (possibly modified) input word is a known root with a participle ending, output the input word and terminate.
  4. If the input word ends with ‑(n)ta, ‑(n)te, ‑(n)ti, or ‑(n)to, and all of the following are true: then replace the last letter of the prefix with a.
  5. Output the (possibly modified) input word and terminate.
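
The control flow of the five steps can be sketched in Python as follows. Only the ordering and early exits are taken from the list above. The conditions for steps 2 and 4 and the known‐root lookup for step 3 are not reproduced here, so this sketch takes them as caller‐supplied functions; mnr_stem_outline and its three parameters are hypothetical names.

```python
from typing import Callable


def mnr_stem_outline(
    word: str,
    modify_word: Callable[[str], str],                # step 2 (assumed hook)
    is_known_participle_root: Callable[[str], bool],  # step 3 (assumed hook)
    adjust_participle: Callable[[str], str],          # step 4 (assumed hook)
) -> str:
    """Outline of the main algorithm's control flow; details are delegated."""
    # Step 1: the article and its abbreviated form both yield "la".
    if word in ("l’", "la"):
        return "la"

    # Step 2: apply at most one modification to the input word.
    word = modify_word(word)

    # Step 3: known roots with a participle ending are output unchanged.
    if is_known_participle_root(word):
        return word

    # Step 4: adjust a participle ending where its conditions hold.
    word = adjust_participle(word)

    # Step 5: output the (possibly modified) word.
    return word
```

Passing the steps in as functions keeps the outline honest about what it does not specify, and makes each step testable in isolation.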