Arabic Morphosyntactic Raw Text Part of Speech Tagging System


Warsaw University

Faculty of Mathematics, Informatics and Mechanics

Ahmed Hussein Aliwy

summary of PhD dissertation

Supervisor: Prof. dr hab. Jerzy Tyszkiewicz, Institute of Informatics, University of Warsaw

January 2013


Introduction and Overview

The general topic of this dissertation is morphosyntactic part-of-speech tagging (abbreviated POS tagging) for Arabic. This topic has a long and rich history for other languages, mainly for English.

POS tagging provides fundamental information about the word forms used in sentences of natural language. How this information is used depends on the particular NLP application (information retrieval, machine translation, etc.).

Tagging is a source of many research challenges. These challenges depend very much on the language under consideration. In this dissertation we consider Arabic, a highly inflected language.

Although Arabic is generally quite regular, with very few irregular forms, its very rich and complicated inflectional structure, which in many cases changes the structure of words, makes tagging highly complex. Another hard problem is the lack of Arabic language resources, corpora and other tools. Tokenization schemes are also a source of problems in tagging.

The main contribution of the dissertation is a new, complete tagging system, which starts with raw text (assumed to be error-free) and produces, for each word, a part-of-speech (POS) tag, grammatical features and a lemma.

Apart from the complete system, several parts of it are of independent interest:

1. A new, comprehensive tagset for Arabic has been designed for the system, based on a careful analysis of the existing ones and remedying their shortcomings. It can be used independently of the whole system.

2. A tokenization application, which is independent of the remaining parts of the system and can be used as a stand-alone tool in other contexts.

3. The morphological analyser produces an Arabic lemma rather than a stem or root, as most other systems do. A lemma is a legal word in Arabic, unlike the latter two. However, the analyser is not independent of the whole system and cannot be separated from it.

4. While building our tagger, we have created a new method of combining several taggers into a single hybrid one. We call it master-slaves. The master tagger is always a hidden Markov model tagger, and we can attach any number of slaves to it. While tagging a sentence, the outputs of slave taggers modify the probabilities of the master tagger, which finally tags that sentence.

Introduction to Arabic language

Arabic is a Semitic language with rich templatic morphology. An Arabic word may be composed of a stem (consisting of a consonantal root and a template), plus affixes and clitics. The affixes include inflectional markers for tense, gender, and/or number. The clitics can include some (but not all) prepositions, conjunctions, determiners, possessive pronouns and pronouns. Some of them are proclitics (attached at the beginning of the stem) and some are enclitics (attached at the end of the stem).

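For example, the word وبحسناتهم (wa-bi-hasan-at-hum, "and by their virtues") consists of the following morphological segments: the conjunction wa- ("and"), the preposition bi- ("by"), the stem hasan, the feminine plural suffix -at, and the enclitic pronoun -hum ("their").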

Arabic text can be either vowelled, as in the Qur’an or children’s books, or unvowelled, as in newspapers, books, and media. For unvowelled Arabic, a word often has many possible morphological analyses corresponding to alternative vowelings, so tagging comes closer to full-blown “understanding” of the text.

The difficulty of word formation depends on the type of language. In English, word formation is relatively easy, whereas in Arabic it is more difficult because words are inflected according to number, type, gender, and other features.

Morphology, which plays an important role in this dissertation, is the branch of grammar that examines the forms of words as well as the principles of word formation and inflection. Here, word formation is the creation of a new word, and it can be classified according to the method of construction:

• Derivation: derivational rules relate one lexeme to another, changing a word from one syntactic category to another or from one meaning to another.

• Compounding: two or more words are joined together into a single word.

The complete system

The complete system consists of the following main components, which will be described in the following sections (see figure 1):

1. Tagset

2. Tokenizer

3. Morphological analyser

4. Tagger

Figure 1. The complete tagging system.

Tagset

A tagset is a set of tags (symbols) representing information about the part of speech and the values of grammatical categories (case, gender, etc.) of word forms.

In the dissertation, 10 important Arabic tagsets are compared and their limitations identified. We then construct a new Arabic tagset that avoids those limitations. The design is intended for Arabic only and is not based on solutions for other languages. It is a multilevel tagset compatible with both Classical and Modern Standard Arabic. The noun class has three levels (fixed POS types, grammatical features and changed POS types), the verb class has two levels (POS types and grammatical features) and the particle class has two levels (working and meaning). We also introduce the notion of tagset interleaving. Figures 2 to 7 present the main classes of this tagset.

[Figure 1, diagram labels. Main pipeline: running text → word and sentence segmentation → word and sentence boundaries → text normalization → normalized text → tokenization (getting tokens) → consecutive morphemes (inflected word and clitics) → analyzing and extracting lemmas → lemma, features and POS → tagging → one tag for each word (POS + features) and lemma. Supporting components: Arabic language resources → designing a new Arabic tagset → tagset; building dictionary → dictionary.]

Figure 2: Main POS

Word: Noun (N), Verb (V), Particle (P), Residual (R), Punctuation (Pnc)

Figure 3a: Arabic noun classes

• Pronoun (الضمير): NPrn
• Interrogative (اسم استفهام): NInt
• Demonstrative (الإشارة): NDem
• Relative (موصول): NRel
• Reduced (التصغير): NRed
• Allusive (الكناية): NAlv
• Common (اسم الجنس): NNou
• Proper (اسم العلم): NPrp
• Five nouns (الاسماء الخمسة): NFiv
• Numeral nouns (اسم العدد): Cardinal (الاصلي) NNmc, Ordinal (الترتيبي) NNmo
• Verbal (اسم فعل): NVrb
• Adjective (الصفة): Other (اخرى) NAdo, Genealogical (النسبة) NAdg
• Constant adverb (الظرف غير المتصرف): NAdv

Figure 3b: Features of noun

Gender: Masculine, Feminine, Common
Number: Singular, Plural, Dual
Case: Nominative, Accusative, Genitive
Structured: Yes, No

Figure 4a: Verb classes

Verb: Past (Pst), Present (Prt), Imperative (Imv)

Figure 4b: Verbal attributes

Gender: Masculine, Feminine, Common (مشترك)
Number: Singular, Plural, Dual
Person: First, Second, Third
Mood: Nominative, Accusative, Jussive, Non
Certainty: Yes, No
Structured: Yes, No
Voice: Passive, Active

Figure 5a: The classes of particles (working)

• For Jussive (للجزم): Jus
• For Reduction, i.e. preposition (للجر): Red
• For Conjunction (للعطف): Cnj
• For Accusative (للنصب): Acu
• Not working or Preventive (غير عاملة او مكفوفة عن العمل): Non
• Copier (النسخ: نصب ورفع): Cop
• Prevent (كافة عن العمل): Prv

Figure 5b: Particles meaning

• Without meaning (ليس لها معنى)
• Exceptive (استثناء)
• Linking (ربط)
• Interrogative (استفهام)
• Future (استقبال)
• Exclamation (تعجب)
• Definition (ال التعريف)
• Simile (التشبيه)
• Request (طلب)
• Realization (تحقيق)
• Explanation & details (تفسير و تفصيل)
• Caution
• Certainty (توكيد)
• Answer (جواب)
• Increasing & decreasing (تقليل وتكثير)
• Vocative (نداء)
• Negative (نفي ونهي)
• Adverbial (ظرفية)
• Conditional (شرط)
• Surprise (مفاجئة)
• Subordinating (مصدري)

Figure 6: Residual classes

Figure 7: Syntactic classes of noun (level three)

Tokenization

Tokenization is the task of separating out words (morphemes) from running text. One of the morphemes typically corresponds to the word stem, and there are also inflectional morphemes.

This task is accomplished in the following steps:

• Orthographic normalization, to reduce noise in the data.

• Sentence segmentation: the process of splitting running text into sentences.

• Word segmentation: the process of splitting sentences into words.

• Word tokenization: the process of splitting words into morphemes.

We propose a hybrid of unsupervised methods for Arabic tokenization. After words are obtained from sentences by a segmentation procedure based on punctuation and white space, we use our own tool to produce all possible tokenizations of each word. Then manually written rules and statistical methods are applied to resolve any ambiguities, i.e., to choose the correct tokenization among the multiple candidates for a word. The output is one tokenization for each word. The statistical method was trained on 29k manually tokenized words (the raw text was taken from the Al-Watan 2004 corpus). The final accuracy of tokenization is 98.83%.
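The two stages described above (generate every candidate tokenization, then pick one) can be sketched roughly as follows. This is only an illustration under stated assumptions: the clitic inventories are tiny and hypothetical, words are written in a Latin transliteration, and the frequency-based `choose` function is a stand-in for the dissertation's rules and statistical method, which are not reproduced here.

```python
# Hypothetical, heavily reduced clitic inventories in Latin transliteration;
# a real inventory for Arabic is much larger.
PROCLITICS = ["w", "b", "l", "Al"]   # roughly: and, by, for, the
ENCLITICS = ["hm", "h", "hA"]        # roughly: their, his, her


def proclitic_splits(word, max_clitics=3):
    """Yield (proclitics, rest) for every way of peeling proclitics off the front."""
    yield (), word
    if max_clitics == 0:
        return
    for clitic in PROCLITICS:
        if word.startswith(clitic) and len(word) > len(clitic):
            for more, rest in proclitic_splits(word[len(clitic):], max_clitics - 1):
                yield (clitic,) + more, rest


def candidate_tokenizations(word):
    """Return every candidate as a (proclitics, stem, enclitic_or_None) triple."""
    candidates = []
    for proclitics, rest in proclitic_splits(word):
        # Candidate without an enclitic.
        candidates.append((proclitics, rest, None))
        # Candidates with one enclitic, leaving a non-empty stem.
        for enclitic in ENCLITICS:
            if rest.endswith(enclitic) and len(rest) > len(enclitic):
                candidates.append((proclitics, rest[: -len(enclitic)], enclitic))
    return candidates


def choose(candidates, stem_counts):
    """Stand-in disambiguation: prefer the stem seen most often in training data."""
    return max(candidates, key=lambda c: stem_counts.get(c[1], 0))


# "wbHsnAthm" transliterates the word glossed earlier as "and by their virtues".
candidates = candidate_tokenizations("wbHsnAthm")
best = choose(candidates, {"HsnAt": 5})  # hypothetical training counts
```

With these toy inventories the word yields six candidates, and the hypothetical stem counts select the split into two proclitics, a stem and one enclitic.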

This part of the system does not depend on the other stages and can be used as a stand-alone application, and as such incorporated into other systems for natural language processing.

Syntactic classes of noun, level three (content of Figure 7):

• Subject of a verb (فاعل)
• Object of a verb (مفعول به)
• Adverb (ظرف متصرف)
• Vocative (منادى)
• Passive subject representative (نائب فاعل)
• Cognate (مفعول مطلق)
• Circumstantial accusative (حال)
• Possessive construction (مضاف اليه)
• Subject (مبتدأ)
• Accusative of purpose (مفعول لأجله)
• Specification (تمييز)
• Apposition (نعت, بدل)
• Predicate of a subject (خبر)
• Comitative object (مفعول معه)
• Excepted (مستثنى)
• Not used: X

Residual classes (content of Figure 6):

• Symbol: RSym
• Abbreviations and acronyms: RAbc
• Not classified: RNcl

Lemmatization and Analyzing

Lemmatization and analyzing is the second preprocessing step of the whole tagging system which we propose. The output of this stage is a lemma (the canonical form, dictionary form, or citation form), POS and features in case of nouns and verbs, meaning and working in case of particles. Each word may have more than one analysis. At this stage ambiguities are not resolved and therefore for each item several possible analyses are generated, covering all possible combinations of parts of speech and features for each word.

We assume that the input to the lemmatization and analysis process is either a word without clitics (the inflected word alone) or clitics alone; however, the analyzer has access to the context of the item being analyzed and makes use of it. Lemmatization and analysis deal with known and unknown words in different ways:

1. Known words are in the dictionary, which provides the lemma and features. The dictionary is described separately below.

2. Unknown words are processed using a rule-based approach. The rules depend on clitics, affixes, context, word pattern and word structure.

Our analyzer cannot be used independently, because it is specialized for the needs of the complete system. Because we use it for subsequent tagging, we evaluate its accuracy by measuring how often the true analysis is among the analyses it produces. For this evaluation we used a small corpus of 16k words, each manually annotated with the single analysis that is correct for that particular use of the word. In the test, the correct analysis was among those produced by the analyzer for 99.67% of words. On the other hand, in a manual verification of the analyzer's output, only 0.1% of all produced analyses were grammatically incorrect.
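The evaluation measure described above, the share of words whose correct analysis appears among the produced analyses, can be written down in a few lines. The function name and the toy data are illustrative assumptions, not part of the dissertation.

```python
def analysis_coverage(gold, produced):
    """Fraction of words whose gold analysis is among the analyses produced for them.

    gold:     list with one correct analysis per word
    produced: list with one collection of candidate analyses per word
    """
    if len(gold) != len(produced):
        raise ValueError("gold and produced must have the same length")
    hits = sum(1 for correct, candidates in zip(gold, produced) if correct in candidates)
    return hits / len(gold)


# Toy data: analyses are reduced to bare POS labels for brevity.
toy_gold = ["N", "V", "P", "N"]
toy_produced = [{"N", "V"}, {"V"}, {"N"}, {"N"}]
coverage = analysis_coverage(toy_gold, toy_produced)
```

Because the analyzer deliberately keeps all ambiguities, this measure rewards producing the right analysis somewhere in the set; the separate manual check of grammatically incorrect analyses guards against achieving high coverage by overgenerating.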

Tagging (disambiguation)

We use two techniques for combining taggers: stacking and master-slaves. The latter is our own contribution, described separately below.

The taggers used by these techniques are a manually created rule-based tagger, a Hidden Markov Model (HMM) tagger, the Brill tagger and a Maximum Match (MM) tagger. The HMM, Brill and MM taggers are combined into a master-slaves hybrid tagger, with HMM as the master and the other two as slaves. The rule-based tagger is combined with the master-slaves tagger by stacking or, equivalently, as a special slave.

We also use a rule-based tagger to increase accuracy and eliminate unwanted tags. It does not produce a unique tag for each word, but rather a set of tags that it considers possible. Relatively few rules are used in it, and not all features of Arabic are taken into account.

Constructing the rules for this tagger was one of the most time-consuming tasks, and implementing them was harder than implementing the statistical methods. The rule-based tagger was attached to the master-slaves tagger as a third, special slave with factor 0, which means that the tags it eliminated were completely removed from consideration by the HMM tagger.

The final accuracy of the hybrid tagger was measured at 92.86% in 10-fold cross-validation on a corpus of 45 files with a total size of 29k words. Note that our tagset contains several thousand tags and the training corpus is relatively small, so the accuracy cannot be as high as for taggers that use small tagsets and large corpora.

Dictionary

The dictionary is used in the morphological analyzer; it plays a role similar to that of the dictionaries used in the Buckwalter analyzer. The dictionary provides the lemma, POS and features for each word stored in it.

Verbs: The dictionary contains 6000 verbs inflected in all possible forms according to the templates used by Al-Dahdah, with certainty and the jussive mood added. All these inflections are then sorted and encoded in a way that allows them to be found efficiently.

Particles: We have a complete list of all Arabic particles with their workings and meanings, and therefore the analyzing process is a search problem in cases of verbs and particles.

Nouns: In Arabic, the noun class includes adjectives. Noun forms were collected from many resources. We added inflections and derivations such as the feminine (where applicable), number, genealogical forms (Yaa Alnasabi) and reduced nouns. Many classes of nouns are closed sets, for example question nouns, numeral nouns and so on. The dictionary is updatable.
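The idea that analysis of stored forms reduces to a search over sorted, pre-inflected entries can be illustrated with a binary search. This is a minimal sketch under assumptions: the entries, their transliteration and the tuple layout are hypothetical, and the dissertation's actual encoding of the dictionary is not described in this summary.

```python
from bisect import bisect_left, bisect_right

# Hypothetical entries: (unvowelled surface form, (lemma, POS, tense, voice)).
# The same surface form may carry several analyses.
ENTRIES = sorted([
    ("ktb", ("kataba", "V", "Pst", "Active")),
    ("ktb", ("kataba", "V", "Pst", "Passive")),
    ("yktb", ("kataba", "V", "Prt", "Active")),
    ("drs", ("darasa", "V", "Pst", "Active")),
])
FORMS = [form for form, _ in ENTRIES]  # sorted keys for binary search


def lookup(form):
    """Return all stored analyses of a surface form, or an empty list if unknown."""
    lo = bisect_left(FORMS, form)
    hi = bisect_right(FORMS, form)
    return [analysis for _, analysis in ENTRIES[lo:hi]]
```

A form that is not found would be treated as an unknown word and passed to the rule-based analysis described earlier.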

Master-slaves hybrid taggers

Independently of the construction of the whole system, we have devised a new method for combining taggers, the master-slaves technique. The HMM tagger is always the master and does the final tagging. However, its internal probabilities, learned from the training corpus, are modified according to the results produced by the slave taggers on the actual sentence to be tagged.

The modification is quite simple: the probabilities of tags not used by the slave taggers are multiplied by a constant factor smaller than 1, thereby privileging the tags which were used.

Therefore, in fact, each sentence is tagged by a slightly different HMM tagger.
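A minimal sketch of this idea is given below. It assumes that the modification is applied to the per-word emission scores inside Viterbi decoding; the summary does not say which of the HMM's probabilities are modified, so this choice, the function name and the toy model are all assumptions. A slave with factor 0 behaves like the rule-based tagger described earlier, removing the tags it rejects.

```python
def viterbi_master_slaves(words, tags, start_p, trans_p, emit_p, slaves):
    """Viterbi decoding in which slave taggers rescale the master's scores.

    slaves: list of (proposals, factor) pairs; proposals[i] is the set of tags
            the slave accepts for words[i]. Tags outside that set have their
            score at position i multiplied by factor (0 <= factor < 1).
    """
    def score(i, tag):
        p = emit_p[tag].get(words[i], 0.0)
        for proposals, factor in slaves:
            if tag not in proposals[i]:
                p *= factor
        return p

    # best[i][t] = (probability of the best path ending in t at i, previous tag)
    best = [{t: (start_p[t] * score(0, t), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, p = max(
                ((s, best[i - 1][s][0] * trans_p[s][t]) for s in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (p * score(i, t), prev)
        best.append(column)

    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (hypothetical numbers): the unmodified HMM prefers N everywhere.
TAGS = ["N", "V"]
START = {"N": 0.5, "V": 0.5}
TRANS = {"N": {"N": 0.5, "V": 0.5}, "V": {"N": 0.5, "V": 0.5}}
EMIT = {"N": {"a": 0.6, "b": 0.6}, "V": {"a": 0.4, "b": 0.4}}
WORDS = ["a", "b"]

plain = viterbi_master_slaves(WORDS, TAGS, START, TRANS, EMIT, slaves=[])
# One slave that tagged the sentence as N V, applied with factor 0.5.
soft = viterbi_master_slaves(WORDS, TAGS, START, TRANS, EMIT,
                             slaves=[([{"N"}, {"V"}], 0.5)])
# A rule-like slave with factor 0 that rules out N for the second word.
hard = viterbi_master_slaves(WORDS, TAGS, START, TRANS, EMIT,
                             slaves=[([{"N", "V"}, {"V"}], 0.0)])
```

In this toy setting the slave's vote is enough to change the tag of the second word, while the master's own probabilities still decide the final path.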

We experimented with master-slaves hybrid taggers not only on our Arabic tagset, but also on English, to see whether the technique works for other languages. The Brown corpus was used, with the Brill tagger and the Maximum Match tagger as slaves of the HMM tagger. The accuracies obtained with this hybrid tagger are shown in Figure 8. They demonstrate that this very simple method of combining taggers increases the overall accuracy above that of the unmodified HMM tagger, and even above the best accuracy achieved by any of the component taggers alone.

Figure 8: Accuracy of using HMM, Brill and MM in master-slaves combination.

The second tagging technique was “written rule-based tagging”. A few hundred written rules were used in this tagger, most of them taken from traditional Arabic books. It was used to eliminate unwanted tags for specific words, making the next tagger more accurate.

The two taggers were combined as a stacked multi-tagger system, in which the output of the first tagger is fed to the second. Adding the rule-based tagger increased the accuracy from 90.05% to 92.86%.

Published papers:

Ahmed H. Aliwy (2013): Comparing Arabic Tagsets and Designing a New One. To appear in: Lingwistyka Stosowana [Applied Linguistics], no. 7, 2013, University of Warsaw, Poland.

Ahmed H. Aliwy (2012): Tokenization as Preprocessing for Arabic Tagging System. In Proceedings of the International Conference on Knowledge and Education Technology (ICKET 2012), Paris, 2012. Published in International Journal of Information and Education Technology, vol. 2(4), pp. 348-353.

Ahmed H. Aliwy (2012): Arabic Language Analyzer with Lemma Extraction and Rich Tagset. In Proceedings of JapTAL 2012, Japan. Lecture Notes in Computer Science, vol. 7614, pp. 168-179.