<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Arabic Search Query Object [ASQO]</title>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</head>
<body bgcolor="#f8f8f8" leftmargin="0" topmargin="0" dir="ltr">
<table width="80%" align="center" cellpadding="5" cellspacing="2">
<tr>
<td align="justify">
<DIV align=center><FONT face="Arial, Helvetica, sans-serif" color="#990000"><I><B>Issues in Arabic Search and
Retrieval</B></I></FONT></DIV>
<FONT face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><BR>With the
exception of the Qur'an and pedagogical texts, Arabic is generally written
without vowels or other graphic symbols that indicate how a word is
pronounced. The reader is expected to fill these in from context. Some of
the graphic symbols include <I>sukuun</I>, which is placed over a
consonant to indicate that it is not followed by a vowel; <I>shadda</I>,
written over a consonant to indicate it is doubled; and <I>hamza</I>, the
sign of the glottal stop, which can be written above or below <SPAN
lang=ar-sa>ا</SPAN><FONT face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular
size=-1> </FONT><FONT face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular
size=-1>(<I>alif</I>) at the beginning of a word, or on<SPAN lang=ar-sa> ا
</SPAN>(<I>alif</I>),<SPAN lang=ar-sa> و </SPAN>(<I>waaw</I>),<SPAN
lang=ar-sa> ي </SPAN>(<I>yaa'</I>), or by itself on the line elsewhere.
Also, common spelling differences regularly appear, including the use
of<SPAN lang=ar-sa> ه </SPAN>(<I>haa'</I>) for<SPAN lang=ar-sa> ة
</SPAN>(<I>taa' marbuuta</I>) and<SPAN lang=ar-sa> ى </SPAN>(<I>alif
maqsuura</I>) for<SPAN lang=ar-sa> ي </SPAN>(<I>yaa'</I>). These features
of written Arabic, which are also seen in Hebrew and in other
languages written with Arabic script (such as Farsi, Pashto, and Urdu),
make analyzing and searching texts quite challenging. In addition, Arabic
morphology and grammar are quite rich and present some unique issues for
information retrieval applications.<BR><BR>There are essentially three
ways to search an Arabic text with Arabic queries: literal, stem-based or
root-based.<BR><BR>A literal search, the simplest search and retrieval
method, matches documents based on the search terms exactly as the user
entered them. The advantage of this technique is that the documents
returned will without a doubt contain the exact term for which the user is
looking. But this advantage is also the biggest disadvantage: many, if not
most, of the documents containing the terms in different forms will be
missed. Given the many ambiguities of written Arabic, the success rate of
this method is quite low. For example, if the user searches for<SPAN
lang=ar-ea> كتاب </SPAN>(<I>kitaab</I>, book), he or she will not find
documents that only contain<SPAN lang=ar-ea> ألكتاب
</SPAN>(<I>`al-kitaabu</I>, <I>the</I> book).<BR><BR>Stem-based searching,
a more complicated method, requires some normalization of the original
texts and the queries. This is done by removing the vowel signs, unifying
the <I>hamza</I> forms and removing or standardizing the other signs.
Additionally, grammatical affixes and other constructions which attach
directly to words, such as conjunctions, prepositions, and the definite
article, should be identified and removed. Finally, regular and irregular
plural forms need to be identified and reduced to their singular forms.
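<BR><BR>The normalization and affix-removal steps described above can be
sketched in Python roughly as follows. This is only a minimal illustration
under stated assumptions, not a complete stemmer: the set of marks removed,
the direction in which spelling variants are unified, and the short prefix
list are all choices made for this example, and plural reduction is not
attempted. The names <I>normalize</I>, <I>light_stem</I>, <I>LETTER_MAP</I>
and <I>PREFIXES</I> are hypothetical.<BR>
<PRE>
```python
import re

# Assumption: the marks to strip are the short-vowel signs, tanwiin,
# shadda and sukuun (U+064B..U+0652) plus the tatweel stretch character.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

# Assumption: hamza carriers are folded to their bare letters, and the two
# spelling variants named in the text are unified in one direction.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> alif
    "\u0625": "\u0627",  # alif with hamza below -> alif
    "\u0622": "\u0627",  # alif with madda -> alif
    "\u0624": "\u0648",  # waaw with hamza -> waaw
    "\u0626": "\u064A",  # yaa' with hamza -> yaa'
    "\u0629": "\u0647",  # taa' marbuuta -> haa'
    "\u0649": "\u064A",  # alif maqsuura -> yaa'
})

# Hypothetical, deliberately short prefix list: conjunction or preposition
# plus article first, then the bare definite article.
PREFIXES = ("\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0627\u0644")


def normalize(word):
    """Strip vowel signs and unify hamza and spelling variants."""
    return DIACRITICS.sub("", word).translate(LETTER_MAP)


def light_stem(word):
    """Normalize, then drop one known prefix if at least three letters remain."""
    word = normalize(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word


query = "\u0643\u062A\u0627\u0628"                          # kitaab
token = "\u0623\u0644\u0643\u0650\u062A\u0627\u0628\u064F"  # 'al-kitaabu, vowelled
print(query == token)                          # literal comparison
print(light_stem(query) == light_stem(token))  # stem comparison
```
</PRE>
In this sketch the literal comparison of the query with the vowelled,
article-bearing token fails, while the comparison of their stems succeeds.
The three-letter minimum is a crude guard against stripping letters that
belong to the word itself.<BR><BR>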
Performing this type of stemming leads to more successful searches, but
can be problematic due to over-generation or incorrect generation of
stems.<BR><BR>A third method for searching Arabic texts is to index and
search for the root forms of each word. Since most verbs and nouns in
Arabic are derived from triliteral (or, rarely, quadriliteral) roots,
identifying the underlying root of each word theoretically retrieves most
of the documents containing a given search term regardless of form.
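<BR><BR>As a rough illustration of root-based indexing, the Python sketch
below groups documents by the root of each word. It assumes a tiny
hand-written lookup table (<I>ROOT_OF</I>) in place of a real morphological
analyzer, and the document ids and one-word documents are hypothetical,
chosen only to show the mechanics.<BR>
<PRE>
```python
from collections import defaultdict

# Hypothetical lookup table standing in for a morphological analyzer.
# Keys are unvowelled surface forms; values are their roots.
ROOT_OF = {
    "\u0643\u062A\u0628": "\u0643\u062A\u0628",              # kataba -> ktb
    "\u0643\u062A\u0627\u0628": "\u0643\u062A\u0628",        # kitaab -> ktb
    "\u0645\u0643\u062A\u0628": "\u0643\u062A\u0628",        # maktab -> ktb
    "\u0643\u0627\u062A\u0628": "\u0643\u062A\u0628",        # kaatib -> ktb
    "\u0643\u062A\u064A\u0628\u0629": "\u0643\u062A\u0628",  # katiiba -> ktb
}


def root_of(word):
    """Return the root from the table, or the word itself if unknown."""
    return ROOT_OF.get(word, word)


def build_root_index(docs):
    """Map each root to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[root_of(word)].add(doc_id)
    return index


def search(index, term):
    """Return sorted ids of documents sharing the query term's root."""
    return sorted(index.get(root_of(term), set()))


# Hypothetical one-word documents, for illustration only.
docs = {
    "d1": "\u0643\u062A\u0627\u0628",        # kitaab, book
    "d2": "\u0645\u0643\u062A\u0628",        # maktab, office
    "d3": "\u0643\u062A\u064A\u0628\u0629",  # katiiba, battalion
}
index = build_root_index(docs)
print(search(index, "\u0643\u062A\u0627\u0628"))
```
</PRE>
With this toy data, a query for <I>kitaab</I> matches all three documents,
because every word in them maps to the same root.<BR><BR>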
However, there are some significant challenges with this approach.
Determining the root for a given word is extremely difficult, since it
requires a detailed morphological, syntactic and semantic analysis of the
text to fully disambiguate the root forms. The issue is complicated
further by the fact that not all words are derived from roots. Loan words
(words borrowed from another language) are generally not based on root
forms, although there are exceptions: some loans whose structure resembles
a triliteral root, such as the English word <I>film</I>, are handled
grammatically as if they were root-based, adding to the complexity of
this type of search. Finally,
the root can serve as the foundation for a wide variety of words with
related meanings. The root<SPAN lang=ar-ea> ك ت ب </SPAN>(<I>ktb</I>) is
used for many words related to writing, including<SPAN lang=ar-ea> كتب
</SPAN>(<I>kataba</I>, to write),<SPAN lang=ar-ea> كتاب
</SPAN>(<I>kitaab</I>, book),<SPAN lang=ar-ea> مكتب </SPAN>(<I>maktab</I>,
office), and<SPAN lang=ar-ea> كاتب </SPAN>(<I>kaatib</I>, author). But the
same root is also used for regiment/battalion,<SPAN lang=ar-ea> كتيبة
</SPAN>(<I>katiiba</I>). As a result, searching based on root forms
results in very high recall, but precision is usually quite
low.<BR><BR>While search and retrieval of Arabic text will never be an
easy task, relying on linguistic analysis tools and methods can help make
the process more successful. Ultimately, the search method you choose
should depend on how critical it is to retrieve every conceivable instance
of a word or phrase and the resources you have to process search returns
in order to determine their true relevance.<BR><BR><I>This sidebar
reprinted from #51 Volume 13 Issue 7 of</I> <A
href="http://www.multilingual.com/FMPro?-db=back%20issues&-lay=CGI&-token=now&-format=ourPublication/currentIssue.htm&-sortfield=Magazine&-sortorder=descending&-max=1&-find">MultiLingual
Computing & Technology</A> <I>published by <A
href="http://www.multilingual.com/">MultiLingual Computing, Inc.</A>, 319
North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax:
208-263-6310.</I> <BR><BR></FONT> </td>
</tr>
</table>
</body>
</html>