Semi-Automatic Construction of Word-Formation Networks [1]
Mateusz Lango
Instytut Informatyki, Politechnika Poznańska
31 October 2017
[1] Lango M., Ševčíková M., Žabokrtský Z.: Semi-Automatic Construction of Word-Formation Networks (for Polish and Spanish), International Conference on Language Resources and Evaluation, Miyazaki, Japan (submitted)
Endangered species / Endangered languages
"By the end of the 21st century, 90% of the world's languages could have been wiped out"
Wymysiöeryś
the native language of Wilamowice inhabitants until 1945
currently 25 native speakers (+ 60 L2 speakers)
in 2007 the Library of Congress added the Wymysiöeryś language to the register of languages (ISO: wym)
Polish    Wymysiöeryś   German   Dutch
zima      wynter        Winter   winter
siedem    zyjwa         sieben   zeven
obraz     üobroz        Bild     beeld
Endangered languages and technology
80% of the world’s languages have fewer than 100 000 speakers
small language communities ⇒ small resources
lack of technologies for a language ⇒ lower "status"
Processing a language
collection of corpora
creation of morphological rules ...
Processing a language: morphology
"the crucial first step in NLP"
fighting the curse of dimensionality
discovering the meaning of new words
distinctness 141
distinct 35323
unconcerned 340
concern 26080
applications: information retrieval, distributed representation, ...
Goals of the research
language description and documentation
development of NLP technologies
Agenda
1 Introduction to derivational morphology
2 Learning of the morphology of a natural language (selected methods)
3 Semi-automatic construction of word-formation networks
Levels of Language Analysis
Phonology, Morphology, Syntax, Semantics, Pragmatics
Morphology
Morphology is concerned with the elements that compose words and the organization of these elements into hierarchical structures.
Morpheme
A morpheme is the smallest unit of language that carries meaning.
Two main fields are traditionally recognized within morphology: inflectional and derivational morphology.
inflection: ręka, ręce, ręki
derivation: ręczny, odręczny, leworęczny
Inflectional vs Derivational Morphology
possibility of constructing a given form
Inflection                    Derivation
kot → kota                    nauczać → nauczyciel
telefon → telefonu            śpiewać → śpiewak
prezentacja → prezentacji     stać, zastanawiać się → ...
level of regularity
Linguistic typology
isolating languages
[I][be][middle][country][person]   [one][day]   [four][day]
agglutination
gel-m-iyor-muş-um
öğret-m-iyor-muş-um
uyu-m-uyor-muş-um
alternation
"*k*t*b*": kataba (he wrote), kitab (book), katib (writer), aktub (I write), naktub (we write), yaktub (he writes), taktub (she writes)
fusional languages
kobieci-e, kobiet-om, kot-u: dative case
pisz-ę: singular, present tense, ...
polysynthetic languages
Tuntussuqatarniksaitengqiggtuq: He did not tell us again that he is going to hunt reindeer.
An-e-raj-ki-śći: They will kill you.
Language Universals
All languages have verbs and nouns
All spoken languages have consonants and vowels
If a language has inflection, it always has derivation
...
Word-formation mechanics
2 main mechanisms of word formation:
derivation
compounding
Many ways of deriving a word:
prefix: interesowny → bezinteresowny, zespolony → hiperzespolony
suffix: urzekać → urzekający, fryzjer → fryzjerstwo
circumfix: machen → ge-mach-t, rojo → a-roj-ar
infix: kup-i-ć → kup-owa-ć, dostać → dosta-wa-ć
transfix: kitab → katib
Why is it difficult?
diversity in languages
false friends: undoable (undo-able vs un-doable), mord vs mord-ka
recursivity: unconcernednesses = un-concern-ed-ness-es
alternations: enjoy → enjoy-ment, manage → manage-ment, argue → argu-(e)-ment; gazete → gazeteci, gitar → gitarcı, sanat → sanatçı; dolecieć → dolatywać
exceptions: cat → cats, ox → oxen
Learning of morphology
Border & Frequency
Group & Abstract
Features & Classes
Rule-based methods
Border & Frequency methods
Letter Successor Varieties (Hafer and Weiss, 1974)
Given a set of words D, the letter successor variety of a string x of length j is defined as the number of distinct letters that occupy the (j+1)-st position in the words of D that begin with x.
LSV - example of calculation
$\mathrm{LSV}(x) = |\{\, i : \text{"}xi*\text{"} \in D,\ |i| = 1 \,\}|$
LSV("nieodpowied"): nieodpowiedzialny, nieodpowiedzialnie, nieodpowiedniość, nieodpowiedność, nieodpowiedzialność, nieodpowiedny, nieodpowiedni, nieodpowiednio
⇒ {z, n} ⇒ LSV("nieodpowied") = 2
LSV("nie"): niejunacki, niegermanizatorsko, nieepistolarność, niemateriałowo, niewęzłowatość, niezawadność, niepulsująco, niesiemieniatość, ...
⇒ {a, b, c, d, e, f, ...} ⇒ LSV("nie") = 27
Other measures
Letter Predecessor Variety (LPV)
Letter Successor Entropy
$\mathrm{LSE}(x) = -\sum_{i \in \text{alphabet}} \frac{|\text{"}xi*\text{"} \in D|}{|\text{"}x*\text{"} \in D|} \log \frac{|\text{"}xi*\text{"} \in D|}{|\text{"}x*\text{"} \in D|}$
heuristics: cutoff
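A minimal Python sketch of the two measures, computed on the "nieodpowied" word list from the example. The function names are illustrative, and the base-2 logarithm is an assumption, since the slide does not fix the base of the logarithm.

```python
import math
from collections import Counter

def successor_counts(prefix, words):
    """Count the letters that directly follow `prefix` in words starting with it."""
    return Counter(w[len(prefix)] for w in words
                   if w.startswith(prefix) and len(w) > len(prefix))

def lsv(prefix, words):
    """Letter successor variety: number of distinct successor letters."""
    return len(successor_counts(prefix, words))

def lse(prefix, words):
    """Letter successor entropy; log base 2 is an assumption of this sketch."""
    counts = successor_counts(prefix, words)
    total = sum(1 for w in words if w.startswith(prefix))  # |"x*" in D|
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# D is a set of distinct words (here: the example list from the slide).
D = [
    "nieodpowiedzialny", "nieodpowiedzialnie", "nieodpowiedniość",
    "nieodpowiedność", "nieodpowiedzialność", "nieodpowiedny",
    "nieodpowiedni", "nieodpowiednio",
]

print(lsv("nieodpowied", D))            # 2, the successors are {z, n}
print(round(lse("nieodpowied", D), 3))  # 0.954
```

A morpheme boundary would then be hypothesised wherever the measure exceeds a chosen cutoff; the cutoff value itself is a heuristic and is not given on the slides.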
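Border & Frequency methods: MAP (Creutz & Lagus, 2007)
$\arg\max_{L} P(L \mid \text{corpus}) = \arg\max_{L} P(\text{corpus} \mid L)\, P(L)$
$P(\text{corpus} \mid L) = \prod_{i=1}^{N} P(w_i \mid L) = \prod_{i=1}^{N} \sum_{s \in \text{segmentations}(w_i)} P(s \mid L) = \prod_{i=1}^{N} \sum_{s \in \text{segmentations}(w_i)} P(C_{i1} \mid C_{i0}) \prod_{j=1}^{|s|} P(m_{ij} \mid C_{ij})\, P(C_{i(j+1)} \mid C_{ij})$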
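The likelihood term for a single word can be illustrated with a small Python sketch. Everything below is a toy assumption made for illustration: the category names (PRE, STM, SUF), the boundary symbol "#" standing for C_{i0} and C_{i(|s|+1)}, and all probability values are hypothetical and deliberately incomplete. The prior P(L) is not modelled.

```python
from itertools import product

BOUNDARY = "#"  # plays the role of C_i0 and C_i(|s|+1)

# Hypothetical transition probabilities P(C_next | C_current).
TRANSITION = {
    ("#", "PRE"): 0.4, ("#", "STM"): 0.6,
    ("PRE", "STM"): 1.0,
    ("STM", "SUF"): 0.5, ("STM", "#"): 0.5,
    ("SUF", "#"): 1.0,
}
# Hypothetical emission probabilities P(morph | category), toy values only.
EMISSION = {
    "PRE": {"un": 0.5},
    "STM": {"do": 0.2, "undo": 0.1, "doable": 0.1},
    "SUF": {"able": 0.25},
}
MORPHS = {m for table in EMISSION.values() for m in table}

def segmentations(word):
    """Yield every split of `word` into morphs known to the toy lexicon."""
    if not word:
        yield []
        return
    for k in range(1, len(word) + 1):
        if word[:k] in MORPHS:
            for rest in segmentations(word[k:]):
                yield [word[:k]] + rest

def word_probability(word):
    """Sum over segmentations and category sequences, as in P(w_i | L)."""
    total = 0.0
    for morphs in segmentations(word):
        for cats in product(EMISSION, repeat=len(morphs)):
            path = (BOUNDARY, *cats, BOUNDARY)
            p = TRANSITION.get((path[0], path[1]), 0.0)      # P(C_1 | C_0)
            for j, morph in enumerate(morphs, start=1):
                p *= EMISSION[path[j]].get(morph, 0.0)       # P(m_j | C_j)
                p *= TRANSITION.get((path[j], path[j + 1]), 0.0)  # P(C_j+1 | C_j)
            total += p
    return total

print(list(segmentations("undoable")))
print(word_probability("undoable"))
```

With these toy numbers the three analyses un+do+able, undo+able and un+doable contribute 0.005, 0.0075 and 0.01 respectively, which shows how ambiguous words spread their probability mass over competing segmentations.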
Classifier as a measure of similarity (De Pauw et al., 2007)
generate all possible substrings of each word
#słowo#: #s, #sł, #sło, #słow, #słowo, s, sł, sło, słow, słowo, słowo#, ł, ...
use substrings as features and whole words as classes for a probabilistic classifier
reclassify the training set and look at the probable classes (words)
prefixes are extracted by a special clustering algorithm
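The feature-generation step can be sketched in a few lines of Python. The function name is illustrative, and two details are assumptions of this sketch rather than statements about the original method: the bare boundary marker is dropped, and the full marked word is kept as a feature.

```python
def boundary_substrings(word, marker="#"):
    """All contiguous substrings of the boundary-marked word."""
    marked = f"{marker}{word}{marker}"
    subs = {marked[i:j]
            for i in range(len(marked))
            for j in range(i + 1, len(marked) + 1)}
    subs.discard(marker)  # assumption: a lone "#" carries no information
    return subs

features = boundary_substrings("słowo")
print(sorted(features, key=lambda s: (len(s), s)))
```

Each word of the lexicon would then become one training instance whose class label is the word itself; the classifier and the clustering step are not shown here.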
Rule-based approaches
methods which create a set of rules relating a base word to its derivatives
Language-Independent Approach of Baranes and Sagot (2014)
1 extraction of preliminary rules
2 generation of (base word, derivative) pairs
3 extraction of final rules which include e.g. POS
4 final generation of derivational links
Language-Independent Approach: Step 1
rules of the form „prefix1 → prefix2 followed by letter c”
rules are generated by analysing all possible pairs of words that share a significant common part
word 1     word 2     rule
subtitle   title      (sub → )t
laughs     laughing   h(s → ing)
subset     set        (sub → )s
subset     laughs     -
Language-Independent Approach: Step 2
apply all applicable prefixal rules
apply all applicable suffixal rules
apply all applicable pairs of prefixal rule and suffixal rule
Pair                        Prefix        Suffix
appreciable → depreciate    (ap → de)p    i(able → ate)
appreciable → precis        (ap → )p      i(able → s)
appreciable → appreciably   -             l(e → y)
demoded → modest            (de → )m      d(ed → est)
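The rule notation used in Steps 1 and 2 can be reproduced with a short Python sketch that extracts a single prefixal or suffixal rule from a word pair. The function names and the `min_common` threshold of 3 letters are assumptions, since the slides only say that the words must share a "significant common part". Combined prefix-plus-suffix rules are not covered.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_rule(w1, w2, min_common=3):
    """Return (prefix1, prefix2, c) for a rule written (prefix1 → prefix2)c."""
    k = common_prefix_len(w1[::-1], w2[::-1])  # shared ending
    if w1 == w2 or k < min_common:
        return None
    return w1[:len(w1) - k], w2[:len(w2) - k], w1[len(w1) - k]

def suffix_rule(w1, w2, min_common=3):
    """Return (suffix1, suffix2, c) for a rule written c(suffix1 → suffix2)."""
    k = common_prefix_len(w1, w2)              # shared beginning
    if w1 == w2 or k < min_common:
        return None
    return w1[k:], w2[k:], w1[k - 1]

print(prefix_rule("subtitle", "title"))            # ('sub', '', 't')
print(suffix_rule("laughs", "laughing"))           # ('s', 'ing', 'h')
print(suffix_rule("appreciable", "appreciably"))   # ('e', 'y', 'l')
print(prefix_rule("subset", "laughs"))             # None
```

Counting how often each extracted rule occurs over all word pairs would give the preliminary rule set of Step 1.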
Language-Independent Approach: Step 3 & 4
inflectional rules are discarded
new rules of the form (prefix, suffix, POS, morph. feat.) → (prefix, suffix, POS, infl. class)
only rules occurring more than 80 times are preserved
( , , ADJ) → (un, ly, ADV)     fortunate → unfortunately
Language-Independent Approach: Evaluation
language correct pairs
English 98%
German 98%
French 89%
Motivation
lack of word-formation resources for Polish [2] and many other languages, e.g. Spanish
insufficient accuracy of the proposed approaches for languages with highly productive derivational morphology
idea: use supervised approach
[2] some information about derivation can be extracted from the Polish
How to construct a WFN?
Context is needed
Learning to rank
originally proposed for ranking query results in information retrieval systems
many approaches:
pointwise, pairwise, listwise
WFN construction
perform k-NN search for each word
zarobkowy: zarabiać, zarobek, zarobić, ...
apply a method of machine-learned ranking on the candidate set
zarobkowy: zarobek (1.2), zarabiać (0.8), zarobić (0.5), ...
pick the best candidate
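The three steps above can be sketched in Python as follows. This is an illustration under explicit assumptions: the similarity used for the k-NN search (Jaccard overlap of character trigrams) is a stand-in chosen for this sketch, the lexicon is a toy one, and the ranking model is replaced by a lookup of the example scores shown above.

```python
def char_ngrams(word, n=3):
    """Character n-grams of the boundary-marked word."""
    marked = f"#{word}#"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

def knn_candidates(word, lexicon, k=3):
    """k most similar lexicon entries (Jaccard on trigrams is an assumption)."""
    target = char_ngrams(word)

    def similarity(other):
        grams = char_ngrams(other)
        return len(target & grams) / len(target | grams)

    others = [w for w in lexicon if w != word]
    return sorted(others, key=similarity, reverse=True)[:k]

def best_parent(candidates, score):
    """Pick the candidate with the highest ranking score."""
    return max(candidates, key=score)

# Toy lexicon and stand-in scores (the scores are the ones from the example).
LEXICON = ["zarobkowy", "zarobek", "zarabiać", "zarobić", "kot", "telefon"]
EXAMPLE_SCORES = {"zarobek": 1.2, "zarabiać": 0.8, "zarobić": 0.5}

candidates = knn_candidates("zarobkowy", LEXICON)
parent = best_parent(candidates, lambda c: EXAMPLE_SCORES.get(c, 0.0))
print(candidates, "->", parent)
```

In a real setting the `score` callable would be a trained learning-to-rank model evaluated on features of the (word, candidate) pair.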
Overview of the pipeline
1 Generation of frequent subsequences
2 Merging frequent subsequences into regular expressions
3 Generation of possible parents for each lemma
Sequential pattern mining
one of the most important topics in frequent pattern mining
the task is to extract all frequent subsequences with support greater than a specified threshold
in our case we treat the lexicon as a database of sequences (words)
we used the SPADE algorithm with min. support 1% ⇒ 27K frequent patterns
Pattern        Support
n,i,e          87053
o,w,y          27099
c,z,n,o,ś,ć    7570
d,z,o,ś,ć      4792
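The notion of support used here can be made concrete with a small Python sketch: a pattern is supported by a word if its letters occur in the word in the same order, not necessarily next to each other. This is only the support computation on a toy lexicon; it is not the SPADE algorithm, which enumerates all frequent patterns efficiently.

```python
def is_subsequence(pattern, word):
    """True if the letters of `pattern` occur in `word` in the same order."""
    letters = iter(word)
    # `ch in letters` consumes the iterator up to the first match,
    # so later letters can only match further to the right.
    return all(ch in letters for ch in pattern)

def support(pattern, lexicon):
    """Number of words in the lexicon that contain the pattern."""
    return sum(1 for w in lexicon if is_subsequence(pattern, w))

def frequent(patterns, lexicon, min_support=0.01):
    """Keep patterns whose relative support exceeds `min_support`."""
    threshold = min_support * len(lexicon)
    result = {}
    for p in patterns:
        s = support(p, lexicon)
        if s > threshold:
            result[p] = s
    return result

TOY_LEXICON = ["niemy", "zarobkowy", "nowy", "kot"]
print(support("nie", TOY_LEXICON))  # 1
print(support("owy", TOY_LEXICON))  # 2
print(frequent(["nie", "owy", "xyz"], TOY_LEXICON, min_support=0.3))
```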
Converting frequent patterns into regular expressions
frequent pattern "n,i,e" ⇒ ^*n*i*e*$
making expressions more specific:
delete one of the * from the expression
recalculate support
accept the new expression if its support is higher than 95% of the original support
^*n*i*e*$ ⇒ ^nie*$
Pattern        RegExp
n,i,e          ^nie*$
o,w,y          ^*owy$
c,z,n,o,ś,ć    ^*cz*ność$
d,z,o,ś,ć      ^*d*z*ość$
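The specialisation loop can be sketched in Python as below. Two points are assumptions of this sketch: the `*` in the slide notation is read as "any sequence of characters" (so it is translated to `.*` for Python's `re` module), and wildcards are tried from left to right, restarting after each accepted deletion. The lexicon is a toy one built from words shown on earlier slides plus a few extra entries.

```python
import re

def compile_pattern(expr):
    """Translate slide notation such as '^*n*i*e*$' into a Python regex."""
    body = expr[1:-1]  # strip the leading ^ and trailing $
    parts = (".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + "".join(parts) + "$")

def support(expr, lexicon):
    rx = compile_pattern(expr)
    return sum(1 for w in lexicon if rx.match(w))

def specialize(expr, lexicon, keep=0.95):
    """Greedily delete wildcards while support stays above keep * original."""
    original = support(expr, lexicon)
    changed = True
    while changed:
        changed = False
        for pos, ch in enumerate(expr):
            if ch != "*":
                continue
            candidate = expr[:pos] + expr[pos + 1:]
            if support(candidate, lexicon) > keep * original:
                expr = candidate
                changed = True
                break
    return expr

TOY_LEXICON = [
    "niejunacki", "niegermanizatorsko", "nieepistolarność", "niemateriałowo",
    "niewęzłowatość", "niezawadność", "niepulsująco", "niesiemieniatość",
    "nieodpowiedzialny", "nieodpowiedzialnie", "nieodpowiedniość",
    "nieodpowiedność", "nieodpowiedzialność", "nieodpowiedny",
    "nieodpowiedni", "nieodpowiednio", "niebieski", "niedobry", "niemy",
    "nieduży", "zanieść",  # the last word matches n,i,e but not the prefix
]

print(specialize("^*n*i*e*$", TOY_LEXICON))  # ^nie*$
```

Here the general expression covers 21 words and the specialised one 20, which is still above 95% of the original support, so the more specific form is kept.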
Filtering frequent patterns
Problem: some regular expressions are redundant (they cover almost the same set of words)
RegExp Support
^*z*ność$ 7547
^*cz*ność$ 7543
Solution:
convert each regular expression to a binary feature
calculate the phi coefficient between the corresponding features
if φ is greater than 0.95, drop the less specific regular expression
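A minimal Python sketch of the redundancy check: each expression becomes a 0/1 feature over the lexicon, and the phi coefficient is computed from the 2x2 contingency table of the two features. The toy lexicon and the translation of `*` to `.*` are assumptions of this sketch.

```python
import math
import re

def binary_feature(regex, lexicon):
    """1 if the word matches the expression, 0 otherwise."""
    rx = re.compile(regex)
    return [1 if rx.search(w) else 0 for w in lexicon]

def phi(a, b):
    """Phi coefficient of two equally long 0/1 vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one feature is constant, correlation is undefined
    return (n11 * n00 - n10 * n01) / denom

TOY_LEXICON = ["konieczność", "skuteczność", "wdzięczność",
               "niezawadność", "kot", "dom"]
f_z = binary_feature(r"^.*z.*ność$", TOY_LEXICON)    # ^*z*ność$
f_cz = binary_feature(r"^.*cz.*ność$", TOY_LEXICON)  # ^*cz*ność$

print(f_z, f_cz)
print(round(phi(f_z, f_cz), 3))  # 0.707
```

On the real lexicon the two expressions cover 7547 and 7543 words, so their features are almost identical and φ would be close to 1, which is why one of them can be dropped.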
Proxinette measure
1 add a special character at the beginning and at the end of each word, e.g. #wiek#
2 split the word into all possible substrings of length ≥ 3 (#wiek, #wie, #wi, wiek#, wiek, wie, ...)
3 create a bipartite graph in which the words are connected to
4 a weight equal to 1/d is added to each edge, where d is
Experimental setup
language resources
Morfeusz SGJP - Polish lexicon
Leffe - Spanish lexicon
software
SPMF - data mining library for frequent sequence mining
xgboost - implementation of Gradient Boosting Trees (supports learning-to-rank)
Results for Polish
            Classification   Ranking
Precision   88.3%            98.8%
Recall      37.9%            38.2%
Word-Formation Network evaluation
                Polish     Spanish
Precision       97%        85% (94.9%)
Recall          29%        44%
# Connections   53 487     18 491
Vocabulary      261 822    159 035
Merging with WordNet
we extracted 53 types of relations related to derivation from Słowosieć (the Polish WordNet), e.g. diminutives, femininity, inhabitant, derivationality
by applying these connections to our lexicon, 52K connections can be created
finally, there are over 91.5K connections in the network
94.5% precision, 47% recall
WordNet: atom → poatomowy
Polish Word-Formation Network
the biggest resource on Polish derivational morphology
free & publicly available (soon :)
Drawbacks of our approach
not fully automatic
root words?
lack of negative examples in the training set
Current work
analysis of the inconsistencies between WordNet connections and our connections
translation of the Czech DeriNet to Polish
use of discovered connections as training examples
comparison of the structures of word-formation networks for Czech and Latin