Warsaw University
Faculty of Mathematics, Informatics and Mechanics
Ahmed Hussein Aliwy
Arabic Morphosyntactic Raw Text Part of Speech Tagging System
summary of PhD dissertation
Supervisor: Prof. dr hab. Jerzy Tyszkiewicz, Institute of Informatics, University of Warsaw
January 2013
Introduction and Overview
The general topic of this dissertation is morphosyntactic part of speech tagging (abbreviated POS tagging) for Arabic. This topic has a long and rich history for other languages, mainly English.
POS tagging provides fundamental information about word forms used in sentences of natural language. The way this information is used varies with the particular NLP application (information retrieval, machine translation, etc.).
Tagging is a source of many research challenges. These challenges depend very much on the language under consideration. In this dissertation we consider Arabic, a highly inflected language.
Although Arabic is generally quite regular and has very few irregular forms, its very rich and complicated inflectional structure, which in many cases changes the structure of the words, makes tagging highly complex. Another hard problem is the lack of Arabic language resources, corpora and tools. Tokenization schemes are also a source of problems in tagging.
The main contribution of the dissertation is a new, complete tagging system, which starts with raw text (assumed to be error-free) and produces, for each word, a part-of-speech (POS) tag, grammatical features and a lemma.
Apart from the complete system, several parts of it are of independent interest:
1. A new, comprehensive tagset for Arabic has been designed for the system, based on a careful analysis of the existing ones and remedying their shortcomings. It can be used independently of the whole system.
2. A tokenization application, which is independent of the remaining parts of the system and can be used as a stand-alone tool in other contexts.
3. The morphological analyser produces an Arabic lemma rather than a stem or root, as most other systems do. A lemma is a legal word in Arabic, unlike the latter two. However, the analyser is not independent of the whole system and cannot be separated from it.
4. While building our tagger, we have created a new method of combining several taggers into a single hybrid one. We call it master-slaves. The master tagger is always a hidden Markov model tagger, and we can attach any number of slaves to it. While tagging a sentence, the outputs of slave taggers modify the probabilities of the master tagger, which finally tags that sentence.
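The master-slaves idea can be illustrated with a minimal sketch. The summary does not say how slave outputs modify the master's probabilities, so the rule used here is purely an assumption: each slave proposes one tag per word, and the master's emission score for a tag is multiplied by a hypothetical factor `boost` for every slave that proposes the same tag. The function name `master_slaves_tag` and the parameters `boost` and `floor` are illustrative, not taken from the dissertation.

```python
from typing import Callable, Dict, List, Sequence

# A slave tagger maps a sentence (list of words) to one tag per word.
Tagger = Callable[[Sequence[str]], List[str]]


def master_slaves_tag(
    words: Sequence[str],
    tags: Sequence[str],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    emit: Dict[str, Dict[str, float]],
    slaves: Sequence[Tagger] = (),
    boost: float = 2.0,   # assumed weight of one agreeing slave
    floor: float = 1e-6,  # assumed smoothing value for unseen events
) -> List[str]:
    """Viterbi decoding in which slave votes rescale the master's emission scores."""
    if not words:
        return []
    votes = [slave(words) for slave in slaves]

    def score(i: int, tag: str) -> float:
        # Master emission probability, boosted once per agreeing slave.
        p = emit.get(tag, {}).get(words[i], floor)
        agreeing = sum(1 for v in votes if v[i] == tag)
        return p * (boost ** agreeing)

    # best[i][t]: score of the best tag path ending in tag t at position i.
    best = [{t: start.get(t, floor) * score(0, t) for t in tags}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(words)):
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] * trans.get(s, {}).get(t, floor))
            row[t] = best[-1][prev] * trans.get(prev, {}).get(t, floor) * score(i, t)
            ptr[t] = prev
        best.append(row)
        back.append(ptr)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With no slaves attached, the function reduces to plain HMM Viterbi decoding. For long sentences a real implementation would work with log-probabilities to avoid numerical underflow.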
Introduction to Arabic language
Arabic is a Semitic language with rich templatic morphology. An Arabic word may be composed of a stem (consisting of a consonantal root and a template), plus affixes and clitics. The affixes include inflectional markers for tense, gender, and/or number. The clitics can include some (but not all) prepositions, conjunctions, determiners, possessive pronouns and pronouns. Some of them are proclitics (attached at the beginning of the stem) and some are enclitics (attached at the end of the stem).
For example, the word وبحسناتهم, which means "and by their virtues", consists of the proclitics و ("and") and ب ("by"), the inflected stem حسنات ("virtues") and the enclitic هم ("their").
Arabic text can be either vowelled, as in the Qur’an or children’s books, or unvowelled, as in newspapers, books and other media. For unvowelled Arabic, a word may have many possible morphological analyses corresponding to alternative vowellings, so tagging comes close to full-blown “understanding” of the text.
The difficulty of word formation depends on the type of language. In English, word formation is relatively simple, but in Arabic it is more difficult because words are inflected according to number, type, gender, etc.
Morphology, which is important in my dissertation, is a branch of grammar which examines the forms of words as well as the principles of word-formation and inflection. Here, word formation is the creation of a new word, which can be classified according to the method of construction:
Derivation: derivational rules relate one lexeme to another (changing a word from one syntactic category to another, or from one meaning to another).
Compounding: two or more words are joined together to form one word.
The complete system
The complete system consists of the following main components, which are described in the sections below (see Figure 1):
1. Tagset
2. Tokenizer
3. Morphological analyser
4. Tagger
Figure 1. The complete tagging system.
Tagset
A tagset is a set of tags (symbols) representing information about the part of speech and about the values of grammatical categories (case, gender, etc.) of word forms.
In the dissertation, 10 important Arabic tagsets are compared and their limitations indicated. We then construct a new Arabic tagset that avoids those limitations. The design is intended for the Arabic language only and is not based on solutions for other languages. It is a multilevel tagset compatible with Classical and Modern Standard Arabic. The noun class has three levels (fixed POS types, grammatical features and changed POS types), the verb class has two levels (POS types and grammatical features) and particles have two levels (working and meaning). We also introduce the notion of tagset interleaving. Figures 2 to 7 present the main classes of this tagset.
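As an illustration of the three-level noun tag described above, the following is a minimal sketch of one possible in-memory representation. The class name `NounTag`, the helper `make_noun_tag` and the use of "X" as the default for an unused third level are assumptions made for this example; the summary does not specify how tags are encoded.

```python
from dataclasses import dataclass
from typing import Tuple

# Feature inventories as listed for nouns in the tagset figures.
NOUN_FEATURES = {
    "gender": {"Masculine", "Feminine", "Common"},
    "number": {"Singular", "Plural", "Dual"},
    "case": {"Nominative", "Accusative", "Genitive"},
}


@dataclass(frozen=True)
class NounTag:
    pos: str                               # level 1: fixed POS type, e.g. "NPrp"
    features: Tuple[Tuple[str, str], ...]  # level 2: sorted (name, value) pairs
    syntactic: str = "X"                   # level 3: changed POS type; "X" assumed to mean "not used"


def make_noun_tag(pos: str, syntactic: str = "X", **features: str) -> NounTag:
    """Build a noun tag, rejecting feature names or values outside the inventory."""
    for name, value in features.items():
        if value not in NOUN_FEATURES.get(name, ()):
            raise ValueError(f"invalid noun feature {name}={value}")
    return NounTag(pos, tuple(sorted(features.items())), syntactic)
```

A proper noun in the masculine singular nominative could then be built with `make_noun_tag("NPrp", gender="Masculine", number="Singular", case="Nominative")`.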
[Figure 1 diagram labels. Processing stages: word and sentence segmentation; text normalization; tokenization (getting tokens); analysing and extracting lemmas, features and POS; tagging. Data passed between stages: running text; word and sentence boundaries; normalized text; consecutive morphemes (inflected word and clitics); one tag for each word (POS + features) and lemma. Supporting resources: Arabic language resources; building the dictionary; dictionary; designing a new Arabic tagset; tagset.]
Figure 2: Main POS
Word: Noun (N), Verb (V), Particle (P), Residual (R), Punctuation (Pnc)

Figure 3a: Arabic noun classes
Pronoun (الضمير): NPrn
Interrogative (اسم استفهام): NInt
Demonstrative (الإشارة): NDem
Relative (موصول): NRel
Reduced (التصغير): NRed
Allusive (الكناية): NAlv
Common (اسم الجنس): NNou
Proper (اسم العلم): NPrp
Five nouns (الاسماء الخمسة): NFiv
Numeral (اسم العدد): Cardinal (الاصلي) NNmc, Ordinal (الترتيبي) NNmo
Verbal (اسم فعل): NVrb
Adjective (الصفة): Genealogical (النسبة) NAdg, Other (اخرى) NAdo
Constant Adverb (الظرف غير المتصرف): NAdv

Figure 3b: Features of noun
Gender: Masculine, Feminine, Common
Number: Singular, Plural, Dual
Case: Nominative, Accusative, Genitive
Structured: Yes, No

Figure 4a: Verb classes
Past (Pst), Present (Prt), Imperative (Imv)

Figure 4b: Verbal attributes
Gender: Masculine, Feminine, Common (مشترك)
Number: Singular, Plural, Dual
Person: First, Second, Third
Mood: Nominative, Accusative, Jussive, Non
Certainty: Yes, No
Structured: Yes, No
Voice: Passive, Active

Figure 5a: The classes of particles (working)
For Jussive (للجزم): Jus
For Reduction (preposition) (للجر): Red
For Conjunction (للعطف): Cnj
For Accusative (للنصب): Acu
Not working or Preventive (غير عاملة او مكفوفة عن العمل): Non
Copier (النسخ: نصب ورفع): Cop
Prevent (كافة عن العمل): Prv

Figure 5b: Particles meaning
Without meaning (ليس لها معنى)
Exceptive (استثناء)
Linking (ربط)
Interrogative (استفهام)
Future (استقبال)
Exclamation (تعجب)
Definition (ال التعريف)
Simile (التشبيه)
Request (طلب)
Realization (تحقيق)
Explanation & details (تفسير و تفصيل)
Caution (سبب)
Certainty (توكيد)
Answer (جواب)
Increasing & decreasing (تقليل وتكثير)
Vocative (نداء)
Negative (نفي ونهي)
Adverbial (ظرفية)
Conditional (شرط)
Surprise (مفاجئة)
Subordinating (مصدري)

Figure 6: Residuals classes
Figure 7: Syntactic classes of noun (level three)
Tokenization
Tokenization is the task of separating out words (morphemes) from running text. One of the morphemes typically corresponds to the word stem, and there are also inflectional morphemes.
This task is accomplished by the following steps:
Orthographic normalization to reduce noise in the data.
Sentence segmentation: the process of splitting running text into sentences.
Word segmentation: the process of splitting sentences into words.
Word tokenization: the process of splitting words into morphemes.
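The first three steps above can be sketched as follows. The concrete rules are assumptions chosen for illustration: normalization here only removes the elongation character (tatweel) and collapses whitespace, and sentence boundaries are taken to be the full stop, exclamation mark, and the Latin and Arabic question marks. The summary does not list the actual normalization or segmentation rules.

```python
import re
from typing import List

TATWEEL = "\u0640"  # Arabic elongation character (kashida)
# Assumed punctuation set: . ! ? plus Arabic question mark, comma and semicolon, and , ; :
PUNCT = r".!?\u061F,\u060C;\u061B:"


def normalize(text: str) -> str:
    """Orthographic normalization (assumed rules): drop tatweel, collapse whitespace."""
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Split running text after sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?\u061F])\s+", text)
    return [p for p in parts if p]


def split_words(sentence: str) -> List[str]:
    """Split a sentence on whitespace, keeping each punctuation mark as its own item."""
    return re.findall(rf"[^\s{PUNCT}]+|[{PUNCT}]", sentence)
```

The fourth step, splitting words into morphemes, is where ambiguity arises and needs a separate disambiguation stage.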
We propose a hybrid of unsupervised methods for Arabic tokenization. After obtaining words from sentences by a segmentation procedure based on punctuation and white space, we use our own tool to produce all possible tokenizations for each word. Then, manually written rules and statistical methods are applied to resolve the ambiguities that may arise, i.e., to choose the correct one among multiple possible tokenizations of a word. The output is one tokenization for each word. The statistical method was trained using 29k manually tokenized words (the raw text was taken from the Al-Watan 2004 corpus). The final accuracy of tokenization is 98.83%.
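The generate-then-disambiguate scheme described above can be sketched as follows. This is not the dissertation's tool: the clitic lists are small illustrative subsets, at most one enclitic is considered, and the disambiguation step is replaced by a stand-in that prefers the candidate whose stem has the highest count in a hypothetical `stem_counts` table.

```python
from typing import Dict, List, Tuple

PROCLITICS = ("و", "ف", "ب", "ل", "ك")    # illustrative subset (assumption)
ENCLITICS = ("هم", "ها", "ه", "ك", "نا")   # illustrative subset (assumption)

# A candidate is (proclitics, stem, enclitic); enclitic is "" when absent.
Candidate = Tuple[Tuple[str, ...], str, str]


def candidate_tokenizations(word: str) -> List[Candidate]:
    """Enumerate every way to peel listed proclitics and at most one enclitic."""
    results: List[Candidate] = []

    def peel(pros: Tuple[str, ...], rest: str) -> None:
        # Stop peeling proclitics here: keep the rest whole, or split one enclitic.
        results.append((pros, rest, ""))
        for enc in ENCLITICS:
            if rest.endswith(enc) and len(rest) > len(enc):
                results.append((pros, rest[: -len(enc)], enc))
        # Or peel one more proclitic, as long as something remains.
        for pro in PROCLITICS:
            if rest.startswith(pro) and len(rest) > len(pro):
                peel(pros + (pro,), rest[len(pro):])

    peel((), word)
    return results


def choose_tokenization(word: str, stem_counts: Dict[str, int]) -> Candidate:
    """Stand-in for rules plus statistics: pick the candidate with the most frequent stem."""
    return max(candidate_tokenizations(word), key=lambda c: stem_counts.get(c[1], 0))
```

When no candidate stem is known, `max` returns the first candidate, which is the unsegmented word, so the sketch falls back to leaving the word intact.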
This part of the system does not depend on the other stages and can be used as a stand-alone application, and as such incorporated into other systems for natural language processing.
Syntactic classes of noun, level three (labels of Figure 7):
Subject of a verb (فاعل)
Object of a verb (مفعول به)
Adverb (ظرف متصرف)
Vocative (منادى)
Passive subject representative (نائب فاعل)
Cognate (مفعول مطلق)
Circumstantial accusative (حال)
Possessive construction (مضاف اليه)
Subject (مبتدأ)
Accusative of purpose (مفعول لأجله)
Specification (تمييز)
Apposition (نعت, بدل)
Predicate of a subject (خبر)
Comitative object (مفعول معه)
Excepted (مستثنى)
X: not used
Residual classes (labels of Figure 6): Symbol (RSym), Abbreviations and Acronym (RAbc), Not Classified (RNcl)