
(1)

Semi-automatic construction of word-formation networks

Mateusz Lango¹

Instytut Informatyki, Politechnika Poznańska

31 October 2017

¹ Lango M., Ševčíková M., Žabokrtský Z.: Semi-Automatic Construction of Word-Formation Networks (for Polish and Spanish), International Conference on Language Resources and Evaluation, Miyazaki, Japan (submitted)

(2)

(3)

(4)

Endangered species / Endangered languages

”By the end of the 21st century, 90% of the world’s languages could have been wiped out”

(5)

(6)

Wymysiöeryś

the native language of Wilamowice inhabitants until 1945
currently 25 native speakers (+ 60 L2 speakers)
in 2007 the Library of Congress added the Wymysiöeryś language to the register of languages (ISO: wym)

Polish    Wymysiöeryś   German   Dutch
zima      wynter        Winter   winter
siedem    zyjwa         sieben   zeven
obraz     uöbroz        Bild     beeld

(7)

Endangered languages and technology

80% of the world’s languages have fewer than 100 000 speakers

small language communities ⇒ small resources
lack of technologies for a language ⇒ lower "status"

(8)

Processing a language

collection of corpora

creation of morphological rules ...

(9)

Processing a language: morphology

"the crucial first step in NLP"
fighting the curse of dimensionality
discovering the meaning of new words

distinctness   141
distinct       35323
unconcerned    340
concern        26080

applications: information retrieval, distributed representation, ...

(10)

(11)

Goals of the research

language description and documentation
development of NLP technologies

(12)

Agenda

1 Introduction to derivational morphology

2 Learning of the morphology of a natural language (selected methods)

3 Semi-automatic construction of word-formation networks

(13)

Levels of Language Analysis

Phonology Morphology Syntax Semantics Pragmatics

(14)

Morphology

Morphology is concerned with the elements that compose words and the organization of these elements into hierarchical structures.

Morpheme

A morpheme is the smallest unit of language that carries meaning.

Two main fields are traditionally recognized within morphology: inflectional and derivational morphology.

ręka (hand), ręce (hands), ręki (of a hand), ręczny (manual), odręczny (handwritten), leworęczny (left-handed)

(15)

Inflectional vs Derivational Morphology

the possibility of constructing a given form

Inflection                   Derivation
kot → kota                   nauczać → nauczyciel
telefon → telefonu           śpiewać → śpiewak
prezentacja → prezentacji    stać, zastanawiać się → ...

level of regularity

(16)

Linguistic typology

isolating languages
[I][be][middle][country][person] [one][day] [four][day]
agglutination
alternation
fusional language
polysynthetic languages

(17)

Linguistic typology

isolating languages
agglutination
gel m iyor muş um
öğret m iyor muş um
uyu m uyor muş um
alternation
fusional language
polysynthetic languages

(18)

Linguistic typology

isolating languages
agglutination
alternation: "*k*t*b*"
kataba    he wrote
kitab     book
katib     writer
aktub     I write
naktub    we write
yaktub    he writes
taktub    she writes
fusional language
polysynthetic languages

(19)

Linguistic typology

isolating languages
agglutination
alternation
fusional language
kobieci-e, kobiet-om, kot-u: dative case
pisz-ę: singular, present tense, ...

(20)

Linguistic typology

isolating languages
agglutination
alternation
fusional language
polysynthetic languages
Tuntussuqatarniksaitengqiggtuq: "He did not tell us again that he was going to hunt reindeer."
An-e-raj-ki-śći: "They will kill you."

(21)

Language Universals

All languages have verbs and nouns

All spoken languages have consonants and vowels
If a language has inflection, it always has derivation
...

(22)

(23)

Word-formation mechanics

2 main mechanisms of word formation:
derivation
compounding

Many ways of deriving a word:
prefix: interesowny → bezinteresowny, zespolony → hiperzespolony
suffix: urzekać → urzekający, fryzjer → fryzjerstwo
circumfix: machen → ge-mach-t, rojo → a-roj-ar
infix: kup-i-ć → kup-owa-ć, dostać → dosta-wa-ć
transfix: kitab → katib

(24)

Why is it difficult?

diversity in languages
false friends: undoable (undo-able vs un-doable), mord vs mord-ka
recursivity: unconcernednesses = un-concern-ed-ness-es
alternations: enjoy → enjoy-ment, manage → manage-ment, argue → argu-(e)-ment; gazete → gazeteci, gitar → gitarcı, sanat → sanatçı; dolecieć → dolatywać
exceptions: cat → cats, ox → oxen

(25)

Learning of morphology

Border & Frequency
Group & Abstract
Features & Classes
Rule-based methods

(26)

Border & Frequency methods

Letter Successor Varieties (Hafer and Weiss, 1974)

Given a set of words D, the letter successor variety of a string x of length j is defined as the number of distinct letters that occupy position j + 1 in the words of D that begin with x:

(27)

LSV - example of calculation

$$\mathrm{LSV}(x) = \bigl|\{\, i : xi{*} \in D,\ |i| = 1 \,\}\bigr|$$

where xi* stands for any word that begins with x followed by the single letter i.

LSV("nieodpowied")

nieodpowiedzialny
nieodpowiedzialnie
nieodpowiedniość
nieodpowiedność
nieodpowiedzialność
nieodpowiedny
nieodpowiedni
nieodpowiednio

⇒ {z, n} ⇒ LSV("nieodpowied") = 2

(31)

LSV - example of calculation

$$\mathrm{LSV}(x) = \bigl|\{\, i : xi{*} \in D,\ |i| = 1 \,\}\bigr|$$

LSV("nie")

niejunacki
niegermanizatorsko
nieepistolarność
niemateriałowo
niewęzłowatość
niezawadność
niepulsująco
niesiemieniatość
...

⇒ {a, b, c, d, e, f, ...} ⇒ LSV("nie") = 27
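The first calculation above can be reproduced in a few lines of Python. This is only a minimal sketch of the definition, not Hafer and Weiss's original procedure; the function name `letter_successor_variety` is chosen here for illustration, and `D` is just the eight example words from the slide rather than a full lexicon.

```python
def letter_successor_variety(prefix, words):
    """Count the distinct letters that directly follow `prefix` in `words`."""
    successors = {
        word[len(prefix)]
        for word in words
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(successors)


# The eight example words from the slide (a toy stand-in for the lexicon D).
D = [
    "nieodpowiedzialny", "nieodpowiedzialnie", "nieodpowiedniość",
    "nieodpowiedność", "nieodpowiedzialność", "nieodpowiedny",
    "nieodpowiedni", "nieodpowiednio",
]

print(letter_successor_variety("nieodpowied", D))   # 2, i.e. {z, n}
print(letter_successor_variety("nieodpowiedn", D))  # 3, i.e. {i, o, y}
```

A low value in the middle of a word and a high value right after it (as with "nie") is the signal such methods use to hypothesise a morpheme border.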

(35)

Other measures

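Letter Predecessor Variety (LPV)
Letter Successor Entropy

$$\mathrm{LSE}(x) = -\sum_{i \in \text{alphabet}} \frac{|xi{*} \in D|}{|x{*} \in D|} \log \frac{|xi{*} \in D|}{|x{*} \in D|}$$

where |xi* ∈ D| is the number of words in D that begin with xi, and |x* ∈ D| the number that begin with x.

heuristics: cutoff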
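A small Python sketch of the letter successor entropy may make the formula more concrete. It rests on two assumptions not stated on the slide: the natural logarithm is used, and the denominator counts only words strictly longer than x (so that the successor probabilities sum to one). The function name is illustrative.

```python
import math
from collections import Counter


def letter_successor_entropy(prefix, words):
    """Entropy of the distribution of letters that directly follow `prefix`."""
    # Count how many words continue `prefix` with each letter.
    counts = Counter(
        word[len(prefix)]
        for word in words
        if word.startswith(prefix) and len(word) > len(prefix)
    )
    total = sum(counts.values())
    # Sum of -p * log(p) over the observed successor letters
    # (natural logarithm: an assumption of this sketch).
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


print(letter_successor_entropy("a", ["ab", "ac"]))    # two equally likely successors
print(letter_successor_entropy("a", ["ab", "abc"]))   # a single successor letter
```

Unlike LSV, the entropy also takes into account how evenly the successor letters are distributed, so one rare continuation does not count as much as a frequent one.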

(36)

Border & Frequency methods: MAP (Creutz & Lagus, 07)

$$\arg\max_{L} P(L \mid \text{corpus}) = \arg\max_{L} P(\text{corpus} \mid L)\,P(L)$$
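The equality is a direct consequence of Bayes' rule. As a short worked step (assuming, as is usual in MAP estimation, that L ranges over candidate models and that P(corpus) does not depend on L):

```latex
\begin{aligned}
\arg\max_{L} P(L \mid \text{corpus})
  &= \arg\max_{L} \frac{P(\text{corpus} \mid L)\,P(L)}{P(\text{corpus})} \\
  &= \arg\max_{L} P(\text{corpus} \mid L)\,P(L)
  \qquad \text{since } P(\text{corpus}) \text{ is constant in } L .
\end{aligned}
```

The prior P(L) penalises overly complex models, while the likelihood P(corpus | L) rewards models that explain the observed words well.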

(37)


(39)

Classifier as a measure of similarity (De Pauw et al., 2007)

generate all possible substrings of each word

#słowo# ⇒ #s, #sł, #sło, #słow, #słowo, s, sł, sło, słow, słowo, słowo#, ł, ...

use substrings as features and whole words as classes for a probabilistic classifier

reclassify the training set and look at probable classes (words)

prefixes are extracted by a special clustering algorithm
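The feature-generation step can be sketched as follows. This is an illustrative reading of the slide, not De Pauw et al.'s code: the helper name `boundary_substrings` is hypothetical, and whether the bare boundary marker "#" itself counts as a feature is an assumption of this sketch.

```python
def boundary_substrings(word, marker="#"):
    """Return every contiguous substring of the word padded with boundary markers."""
    padded = f"{marker}{word}{marker}"
    return {
        padded[start:end]
        for start in range(len(padded))
        for end in range(start + 1, len(padded) + 1)
    }


features = boundary_substrings("słowo")
print(sorted(features, key=len)[:5])  # a few of the shortest features
print("#sł" in features, "słowo#" in features)
```

Each word then becomes a class, and its substrings become the features of a probabilistic classifier, so that words sharing many substrings end up with similar class probabilities.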

(40)

Rule-based approaches

methods that create a set of rules between the base word and its derivatives

Language-Independent Approach of Baranes and Sagot (2014)

1 extraction of preliminary rules
2 generation of (base word, derivative) pairs
3 extraction of final rules which include e.g. POS
4 final generation of derivational links

(41)

Language-Independent Approach: Step 1

rules of the form "prefix1 → prefix2 followed by letter c"

rules are generated by analysing all possible pairs of words which share a significant common part

word 1     word 2     rule
subtitle   title      (sub → )t
laughs     laughing   h(s → ing)
subset     set        (sub → )s
subset     laughs     -
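The rules in the table can be reproduced with a simplified sketch. It is not Baranes and Sagot's implementation: the function names are hypothetical, and the threshold `min_shared=3` is an assumed stand-in for what the slide calls a "significant common part".

```python
def shared_prefix_length(a, b):
    """Length of the longest common prefix of two strings."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def suffix_rule(word1, word2, min_shared=3):
    """Rule 'c(suffix1 → suffix2)' when the words share a long enough beginning."""
    k = shared_prefix_length(word1, word2)
    if k < min_shared:
        return None
    # Context letter c is the last letter of the shared part.
    return f"{word1[k - 1]}({word1[k:]} → {word2[k:]})"


def prefix_rule(word1, word2, min_shared=3):
    """Rule '(prefix1 → prefix2)c' when the words share a long enough ending."""
    k = shared_prefix_length(word1[::-1], word2[::-1])
    if k < min_shared:
        return None
    cut1, cut2 = len(word1) - k, len(word2) - k
    # Context letter c is the first letter of the shared part.
    return f"({word1[:cut1]} → {word2[:cut2]}){word1[cut1]}"


print(prefix_rule("subtitle", "title"))    # (sub → )t
print(suffix_rule("laughs", "laughing"))   # h(s → ing)
print(prefix_rule("subset", "laughs"))     # None: no significant common part
```

The context letter keeps the rules slightly more specific than a bare affix substitution, which helps to filter accidental matches later in the pipeline.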

(42)

Language-Independent Approach: Step 2

apply all applicable prefixal rules
apply all applicable suffixal rules
apply all applicable pairs of prefixal rule and suffixal rule

Pair                        Prefix        Suffix
appreciable → depreciate    (ap → de)p    i(able → ate)
appreciable → precis        (ap → )p      i(able → s)
appreciable → appreciably   -             l(e → y)
demoded → modest            (de → )m      d(ed → est)

(43)

Language-Independent Approach: Step 3 & 4

inflectional rules are discarded

new rules of the form (prefix, suffix, POS, morph. feat.) → (prefix, suffix, POS, infl. class)

only rules occurring more than 80 times are preserved

( , , ADJ) → (un, ly, ADV)    fortunate → unfortunately

(44)

Language-Independent Approach: Evaluation

language correct pairs

English 98%

German 98%

French 89%

(45)

Motivation

lack of word-formation resources for Polish² and many other languages, e.g. Spanish

insufficient accuracy of proposed approaches for languages with highly productive derivational morphology

idea: use a supervised approach

² some information about derivation can be extracted from the Polish

(46)

(47)

(48)

How to construct a WFN?

(49)

(50)

(51)

(52)

Context is needed

(53)

Learning to rank

originally proposed for ranking query results in information retrieval systems

many approaches:
pointwise
pairwise
listwise

(54)

WFN construction

perform k-NN search for each word
zarobkowy: zarabiać, zarobek, zarobić, ...

apply a method of machine-learned ranking on the candidate set
zarobkowy: zarobek (1.2), zarabiać (0.8), zarobić (0.5), ...

pick the best candidate
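The three steps above can be sketched in Python. Everything here is a stand-in chosen for illustration: `difflib.get_close_matches` replaces the k-NN search (the slides do not say which similarity was used at this point), and the `score` function simply looks up the toy scores shown on the slide instead of calling a trained ranking model.

```python
import difflib


def candidate_parents(word, lexicon, k=3):
    """Stand-in for the k-NN search: the k most string-similar other words."""
    others = [w for w in lexicon if w != word]
    return difflib.get_close_matches(word, others, n=k, cutoff=0.0)


def pick_parent(word, lexicon, score, k=3):
    """Rank the candidates with `score` and return the best one (or None)."""
    candidates = candidate_parents(word, lexicon, k)
    ranked = sorted(candidates, key=lambda c: score(word, c), reverse=True)
    return ranked[0] if ranked else None


# Toy scores copied from the slide; a real system would use a learned ranker.
toy_scores = {"zarobek": 1.2, "zarabiać": 0.8, "zarobić": 0.5}


def score(child, parent):
    return toy_scores.get(parent, 0.0)


lexicon = ["zarobkowy", "zarobek", "zarabiać", "zarobić"]
print(pick_parent("zarobkowy", lexicon, score))  # zarobek
```

Separating candidate generation from ranking keeps the expensive model away from the vast majority of word pairs, which could never be in a derivational relation.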

(55)

Overview of the pipeline

1 Generation of frequent subsequences

2 Merging frequent subsequences into regular expressions

3 Generation of possible parents for each lemma

(56)

Sequential pattern mining

one of the most important topics in frequent pattern mining

the task is to extract all frequent subsequences with support greater than a specified threshold

in our case we treat the lexicon as a database of sequences (words)

we used the SPADE algorithm with min. support 1% ⇒ 27K frequent patterns

Pattern             Support
n,i,e               87053
o,w,y               27099
c, z, n, o, ś, ć    7570
d, z, o, ś, ć       4792
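To make the notion of support concrete, here is a minimal sketch that counts in how many words a letter pattern occurs as a (not necessarily contiguous) subsequence. It is a brute-force illustration on a toy lexicon, not the SPADE algorithm, which enumerates all frequent patterns far more efficiently.

```python
def is_subsequence(pattern, word):
    """True if the letters of `pattern` appear in `word` in order, gaps allowed."""
    remaining = iter(word)
    # `letter in remaining` consumes the iterator up to the first match,
    # so later letters can only match further to the right.
    return all(letter in remaining for letter in pattern)


def support(pattern, lexicon):
    """Number of words in the lexicon that contain `pattern` as a subsequence."""
    return sum(1 for word in lexicon if is_subsequence(pattern, word))


toy_lexicon = ["niebo", "nowy", "kolorowy", "nie"]
print(support("nie", toy_lexicon))  # 2: niebo, nie
print(support("owy", toy_lexicon))  # 2: nowy, kolorowy
```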

(57)

Converting frequent patterns into regular expressions

frequent pattern "n,i,e" ⇒ ^*n*i*e*$

making expressions more specific:
delete one of the * from the expression
recalculate support
accept the new expression if its support is higher than 95% of the original support

^*n*i*e*$ ⇒ ^nie*$

Pattern             RegExp
n,i,e               ^nie*$
o,w,y               ^*owy$
c, z, n, o, ś, ć    ^*cz*ność$
d, z, o, ś, ć       ^*d*z*ość$
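The specialisation loop can be sketched as below. Several points are assumptions of this sketch rather than facts from the slides: `*` is read as "any, possibly empty, string" (as in shell globs), "original support" is taken to mean the support of the fully starred pattern, and the stars are tried greedily from left to right. The function names are illustrative.

```python
import re


def glob_to_regex(glob):
    """Translate the slides' notation, where '*' matches any string, into a regex."""
    return re.compile("".join(".*" if ch == "*" else re.escape(ch) for ch in glob))


def support(glob, lexicon):
    """Number of words matched in full by the expression."""
    regex = glob_to_regex(glob)
    return sum(1 for word in lexicon if regex.fullmatch(word))


def specialise(pattern, lexicon, keep=0.95):
    """Greedily delete '*' while support stays above `keep` times the original."""
    glob = "*" + "*".join(pattern) + "*"          # n,i,e -> *n*i*e*
    original = support(glob, lexicon)
    changed = True
    while changed:
        changed = False
        for pos, ch in enumerate(glob):
            if ch != "*":
                continue
            trial = glob[:pos] + glob[pos + 1:]
            if support(trial, lexicon) > keep * original:
                glob, changed = trial, True
                break
    return f"^{glob}$"


print(specialise("nie", ["niebo", "niemy", "nieduży", "niestały"]))  # ^nie*$
print(specialise("owy", ["nowy", "kolorowy", "zarobkowy", "domowy"]))  # ^*owy$
```

Because the procedure is greedy, a different order of trying the stars could in principle yield a different final expression on a real lexicon.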

(58)

Filtering frequent patterns

Problem: some regular expressions are redundant (they cover almost the same set of words)

RegExp        Support
^*z*ność$     7547
^*cz*ność$    7543

Solution:
convert each regular expression to a binary feature
calculate the phi coefficient between corresponding features
if φ is greater than 0.95, drop the less specific regular expression
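For two binary features the phi coefficient is computed from the 2×2 contingency table as φ = (n11·n00 − n10·n01) / √(n1•·n0•·n•1·n•0). A minimal sketch, with an illustrative function name and toy feature vectors (one entry per word, 1 if the expression matches it):

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient between two equally long binary feature vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A constant feature makes the denominator zero; treat it as uncorrelated.
    return (n11 * n00 - n10 * n01) / denominator if denominator else 0.0


print(phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0: identical coverage
print(phi_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0: unrelated coverage
```

Two expressions such as ^*z*ność$ and ^*cz*ność$, which match almost the same words, would give a φ close to 1 and therefore trigger the filter.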

(59)

Proxinette measure

1 add a special character at the beginning and at the end of each word, e.g. #wiek#

2 split the word into all possible substrings of length ≥ 3 (#wiek, #wie, #wi, wiek#, wiek, wie, ...)

3 create a bipartite graph in which the words are connected to

(60)

Proxinette measure

1 a weight is added to each edge which is equal to 1/d, where d is

(61)

(62)

(63)

(64)

(65)


(66)

Experimental setup

language resources

Morfeusz SGJP - Polish lexicon
Leffe - Spanish lexicon

software

SPMF - data mining library for frequent sequence mining
xgboost - implementation of Gradient Boosting Trees (supports learning-to-rank)

(67)

Results for Polish

            Classification   Ranking
Precision   88.3%            98.8%
Recall      37.9%            38.2%

(68)

Word-Formation Network evaluation

                Polish     Spanish
Precision       97%        85% (94.9%)
Recall          29%        44%
# Connections   53 487     18 491
Vocabulary      261 822    159 035

(69)

Merging with WordNet

we extracted 53 types of relations related to derivation from Słowosieć (the Polish WordNet), e.g. diminutives, femininity, inhabitant, derivationality

by applying these connections to our lexicon, 52K connections can be created

finally, there are over 91.5K connections in the network (94.5% precision, 47% recall)

WordNet: atom → poatomowy

(70)

Polish Word-Formation Network

the biggest resource about Polish derivational morphology
free & publicly available (soon :)

(71)

Drawbacks of our approach

not fully automatic
root words?

lack of negative examples in the training set

(72)

Current work

analysis of the inconsistencies between WordNet connections and our connections

translation of the Czech DeriNet to Polish

use of discovered connections as training examples

comparison of the structures of word-formation networks for Czech and Latin

(73)