Describing Linde's dictionary of Polish for retro-digitisation purposes
Joanna Bilińska
University of Warsaw
Dictionary
• author: Samuel Bogumił Linde
• 6 volumes
• 1807-1814 (1st ed.), 1854-1861 (2nd ed.)
• the first monolingual dictionary of Polish
• both descriptive and normative
• multilingual: a dictionary of Polish with translations into German, Slavic languages and other languages
• excellent international reception
• impact on the lexicography of other languages
• used by historians, librarians, lexicographers, linguists
Figure 1: Sample page
• mainly alphabetical with nested entries: derivatives, diminutives, etc. can also be found within an entry → the alphabetical order of words is not strict
• alphabetical order differs from the contemporary one
• Polish diacritical marks ignored when ordering lemmas
• almost no explanation of the abbreviations
• entry words are often split across two lines because of hyphenation → the retro-digitised version requires an index (cf. [2])
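The remark that Polish diacritical marks are ignored when ordering lemmas can be illustrated with a sort key. This is a minimal sketch, not the method used in the project: the folding table `POLISH_FOLD`, the function `lemma_sort_key` and the sample words are assumptions made for illustration, and the other differences from the contemporary alphabetical order are not modelled.

```python
# Hypothetical folding table: each Polish letter with a diacritic maps to its base letter.
POLISH_FOLD = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def lemma_sort_key(lemma: str) -> str:
    """Return a sort key that ignores Polish diacritics and letter case."""
    return lemma.translate(POLISH_FOLD).casefold()


# Sample words chosen for illustration, not transcribed from the dictionary.
lemmas = ["żaba", "zamek", "łapa", "lato", "ćma", "cud"]
print(sorted(lemmas, key=lemma_sort_key))
# ['ćma', 'cud', 'łapa', 'lato', 'żaba', 'zamek']
```

In the contemporary Polish alphabet the same words would be ordered cud, ćma, lato, łapa, zamek, żaba, so an index built for present-day users needs its own key.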
Structure
• ca. 5000 two-column pages
• ca. 5400 characters per page
• > 90 languages/dialects (cf. [4], [3])
• as a corpus: about 7 million tokens (from the corpus retro-digitisation, cf. [7])
• many scripts: Latin, Cyrillic (two kinds), Fraktur, Hebrew, Greek
Microstructure
Figure 2: Beginning of a sample entry
• lemma,
• (orthographic variants),
• grammar information,
• foreign-language translations,
• senses,
• quotations,
• other senses, including phraseology,
• metaphoric senses,
• derivatives with a description (subentries),
• other derivatives (mainly prefixed) as cross-references without description.
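The microstructure listed above can be sketched as a data structure. The class and field names below are hypothetical and only mirror the listed elements; they are not the schema of any existing digitised version, and the sample values are illustrative rather than transcribed from the dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    definition: str
    quotations: list[str] = field(default_factory=list)
    metaphorical: bool = False  # metaphoric senses are listed separately in the entry


@dataclass
class Entry:
    lemma: str
    variants: list[str] = field(default_factory=list)  # orthographic variants
    grammar: str = ""  # grammar information
    # language tag -> foreign-language equivalents
    translations: dict[str, list[str]] = field(default_factory=dict)
    senses: list[Sense] = field(default_factory=list)
    subentries: list["Entry"] = field(default_factory=list)  # described derivatives
    cross_references: list[str] = field(default_factory=list)  # undescribed derivatives


# Illustrative values only.
entry = Entry(lemma="dom", subentries=[Entry(lemma="domek")])
print(entry.subentries[0].lemma)
```

Nesting `Entry` inside `subentries` reflects the nested entries that make the alphabetical order non-strict.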
Punctuation and typography analysis — an excerpt (cf. [4])
• dot (.): ends larger parts of the entry (e.g. grammar information, foreign-language translations) and abbreviations;
• semicolon (;): ends smaller parts of the entry (e.g. the translations into one language);
• brackets: ( ) for the author's comments, [ ] for comments by the editors of the 2nd ed.;
• dash (—): introduces morphological information and separates senses, sometimes metaphorical senses;
• section mark (§): marks metaphorical senses;
• asterisk (*): lemmas that are unused, outdated, or neologisms;
• two asterisks (**): poetical lemmas;
• double oblique hyphen: precedes some comments.
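Two of the conventions above, the asterisk markers and the two kinds of brackets, are simple enough to sketch in code. The function names are hypothetical, the sketch assumes that the asterisks directly precede the headword, and it does not handle nested brackets.

```python
import re


def lemma_status(raw: str) -> tuple[str, str]:
    """Split a leading asterisk marker from a headword (assumed to precede it)."""
    if raw.startswith("**"):
        return raw[2:].strip(), "poetical"
    if raw.startswith("*"):
        return raw[1:].strip(), "unused, outdated or neologism"
    return raw.strip(), "unmarked"


AUTHOR_COMMENT = re.compile(r"\(([^()]*)\)")     # ( ) = author's comments
EDITOR_COMMENT = re.compile(r"\[([^\[\]]*)\]")   # [ ] = 2nd ed. editors' comments


def comments(text: str) -> dict[str, list[str]]:
    """Collect bracketed comments by their source."""
    return {
        "author": AUTHOR_COMMENT.findall(text),
        "editors": EDITOR_COMMENT.findall(text),
    }


print(lemma_status("**LEMMA"))
print(comments("a (note one) b [note two] c"))
```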
Foreign language parts detection algorithm
For Latin script, the algorithm looks as follows (cf. [4]):
• text in italics (= abbreviation),
  – capital letter at the beginning,
  – dot at the end,
• foreign example (normal font),
• semicolon,
• dot after the whole run of foreign-language examples
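The steps above can be sketched as follows, assuming the OCR output is available as a sequence of text spans that carry font-style information. The `Span` type, the function names and the sample data are hypothetical; this is an illustration of the listed rules, not the implementation described in [4].

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    italic: bool


def is_language_tag(span: Span) -> bool:
    """Italic text that starts with a capital letter and ends with a dot."""
    t = span.text.strip()
    return span.italic and len(t) >= 2 and t[0].isupper() and t.endswith(".")


def extract_translations(spans: list[Span]) -> list[tuple[str, str]]:
    """Pair each italic language tag with the normal-font example after it."""
    pairs = []
    i = 0
    while i < len(spans) - 1:
        tag, example = spans[i], spans[i + 1]
        if is_language_tag(tag) and not example.italic:
            text = example.text.strip()
            ends_whole_part = text.endswith(".")  # a dot closes the foreign part
            pairs.append((tag.text.strip(), text.rstrip(";. ")))
            i += 2
            if ends_whole_part:
                break
        else:
            i += 1
    return pairs


# Made-up sample: the tags come from the poster, the examples are placeholders.
sample = [
    Span("Hung.", True), Span("example one;", False),
    Span("Ung.", True), Span("example two.", False),
    Span("rest of the entry", False),
]
print(extract_translations(sample))
# [('Hung.', 'example one'), ('Ung.', 'example two')]
```

An example that itself ends in an abbreviation dot would be misread as the end of the whole part, so real data would need the abbreviation list as an additional check.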
However, there are also foreign language parts in other scripts:
Figure 3: Various scripts
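For the non-Latin parts, a first approximation is to classify words by the Unicode names of their characters. This is a sketch under the assumption that the text is already encoded in Unicode; the function names are hypothetical. Fraktur is a typeface of the Latin script and the two kinds of Cyrillic share code points, so neither distinction can be recovered this way without font information.

```python
import unicodedata
from collections import Counter

SCRIPTS = ("LATIN", "CYRILLIC", "GREEK", "HEBREW")


def char_script(ch: str):
    """Return the script prefix of a character's Unicode name, or None."""
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return None


def dominant_script(word: str):
    """Return the most frequent script among the letters of a word, or None."""
    counts = Counter(s for s in map(char_script, word) if s)
    return counts.most_common(1)[0][0] if counts else None


for word in ["słowo", "слово", "λόγος", "דבר"]:
    print(word, dominant_script(word))
```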
Abbreviations analysis
Figure 4: Expanding abbreviations
One language or dialect may have many tags, e.g. Weg., Węg., Hung., Hungar., Hng., Hg., Ung., Ungar., węgiersk. for Hungarian (cf. [3]).
Figure 5: Expanding language tags
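Expanding language tags amounts to a many-to-one lookup. The sketch below uses only the Hungarian variants listed above; the table and function names are hypothetical, and a full table would have to be built from the list in [3].

```python
import unicodedata

# Variants attested for Hungarian, as listed above.
HUNGARIAN_TAGS = ["Weg.", "Węg.", "Hung.", "Hungar.", "Hng.", "Hg.",
                  "Ung.", "Ungar.", "węgiersk."]

# Keys are NFC-normalised so that composed and decomposed "ę" match.
TAG_TO_LANGUAGE = {
    unicodedata.normalize("NFC", tag): "Hungarian" for tag in HUNGARIAN_TAGS
}


def expand_language_tag(tag: str):
    """Return the language name for a tag, or None if the tag is unknown."""
    key = unicodedata.normalize("NFC", tag.strip())
    return TAG_TO_LANGUAGE.get(key)


print(expand_language_tag("Ung."))
print(expand_language_tag("Xyz."))
```

Returning `None` for unknown tags keeps unrecognised abbreviations visible for manual review instead of guessing.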
References
[1] Bień, J.S. (2011), Efficient search in hidden text of large DjVu documents. In: Advanced Language Technologies for Digital Libraries. Lecture Notes in Computer Science (Theoretical Computer Science and General Issues) (6699). Springer, pp. 1-14. http://bc.klf.uw.edu.pl/177/
[2] Bień, J.S., Elektroniczny indeks haseł do słownika Lindego (druga wersja wstępna), https://bitbucket.org/jsbien/ilindecsv/wiki/Home.md
[3] Bilińska, J. (2016), Dialekty i języki obce w Słowniku języka polskiego Samuela Bogumiła Lindego – zestawienie na podstawie wydania drugiego, "Prace Filologiczne", LXVIII, pp. 27-42.
[4] Bilińska, J., Analiza i leksykograficzny opis struktury słownika Lindego na potrzeby dygitalizacji, unpublished Ph.D. thesis, Warszawa 2013, http://bc.klf.uw.edu.pl/347/.
[5] Bilińska, J., Sostavlenie perečnâ sokraščenij nazvanij âzykov v ramkah proekta digitalizacii «Slovarâ pol'skogo âzyka» S. B. Linde [in:] Informacionnye tehnologii i pismennoe nasledie. El'Manuscript-2012. IV naučnaâ konferenciâ, Petrozavodsk, Iżewsk 2012, pp. 34-43, http://bc.klf.uw.edu.pl/301/.
[6] Linde, S.B., Słownik języka polskiego, 2nd ed., Lwów 1854-1860, http://kpbc.umk.pl/publication/8173.
[7] Linde, S.B., Słownik języka polskiego, corpus digitisation with search engine, https://szukajwslownikach.uw.edu.pl/en/slownik-lindego-nowy/.