<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
	<title>Arabic Search Query Object [ASQO]</title>
	<META http-equiv=Content-Type content="text/html; charset=UTF-8">
<body bgcolor="#f8f8f8" leftmargin="0" topmargin="0" dir="ltr">
<table width="80%" align="center" cellpadding="5" cellspacing="2">
		<td align="justify">
	   <DIV align=center><FONT face="Arial, Helvetica, sans-serif" color="#990000"><I><B>Issues in Arabic Search and
		<FONT face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular size=-1><BR>With the 
      exception of the Qur'an and pedagogical texts, Arabic is generally written 
      without vowels or other graphic symbols that indicate how a word is 
      pronounced. The reader is expected to fill these in from context. These 
      symbols include <I>sukuun</I>, which is placed over a 
      consonant to indicate that it is not followed by a vowel; <I>shadda</I>, 
      written over a consonant to indicate it is doubled; and <I>hamza</I>, the 
      sign of the glottal stop, which can be written above or below <SPAN 
      lang=ar-sa>ا</SPAN><FONT face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular 
      size=-1>&nbsp; </FONT><FONT face=Arial,Helvetica,Geneva,Swiss,SunSans-Regular 
      size=-1>(<I>alif</I>) at the beginning of a word, or on<SPAN lang=ar-sa> ا 
      </SPAN>(<I>alif</I>),<SPAN lang=ar-sa> و </SPAN>(<I>waaw</I>),<SPAN 
      lang=ar-sa> ي </SPAN>(<I>yaa'</I>), or by itself on the line elsewhere. 
      Also, common spelling differences regularly appear, including the use 
      of<SPAN lang=ar-sa> ه </SPAN>(<I>haa'</I>) for<SPAN lang=ar-sa> ة 
      </SPAN>(<I>taa' marbuuta</I>) and<SPAN lang=ar-sa> ى </SPAN>(<I>alif 
      maqsuura</I>) for<SPAN lang=ar-sa> ي </SPAN>(<I>yaa'</I>). These features 
      of written Arabic, which are also seen in Hebrew and in other 
      languages written in Arabic script (such as Farsi, Pashto, and Urdu), 
      make analyzing and searching texts quite challenging. In addition, Arabic 
      morphology and grammar are quite rich and present some unique issues for
      information retrieval applications.<BR><BR>There are essentially three 
      ways to search an Arabic text with Arabic queries: literal, stem-based or 
      root-based.<BR><BR>A literal search, the simplest search and retrieval 
      method, matches documents based on the search terms exactly as the user 
      entered them. The advantage of this technique is that the documents 
      returned will without a doubt contain the exact term for which the user is 
      looking. But this advantage is also the biggest disadvantage: many, if not 
      most, of the documents containing the terms in different forms will be 
      missed. Given the many ambiguities of written Arabic, the success rate of 
      this method is quite low. For example, if the user searches for<SPAN 
      lang=ar-ea> كتاب </SPAN>(<I>kitaab</I>, book), he or she will not find 
      documents that only contain<SPAN lang=ar-ea> ألكتاب 
      </SPAN>(<I>`al-kitaabu</I>, <I>the</I> book).<BR><BR>Stem-based searching, 
      a more complicated method, requires some normalization of the original 
      texts and the queries. This is done by removing the vowel signs, unifying 
      the <I>hamza</I> forms and removing or standardizing the other signs. 
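      <BR><BR>A minimal sketch of this normalization in Python, assuming the 
      text is already decoded to Unicode: it deletes the short-vowel signs, 
      <I>sukuun</I> and <I>shadda</I>, then folds the <I>hamza</I> carriers, 
      <I>taa' marbuuta</I> and <I>alif maqsuura</I> onto their plain letters. 
      The name <I>normalize_arabic</I> and the particular folding table are 
      assumptions made for illustration, not a standard; whatever rules are 
      chosen must be applied to both the indexed text and the query.

```python
import re

# Marks to delete (an assumed set, for illustration only):
#   U+064B..U+0652  fathatan .. sukuun, which includes shadda (U+0651)
#   U+0670          superscript (dagger) alif
#   U+0640          tatweel, the purely decorative elongation stroke
MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Letter folding (also an assumed rule set; real systems tune this).
FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0624": "\u0648",  # waaw with hamza       -> waaw
    "\u0626": "\u064A",  # yaa' with hamza       -> yaa'
    "\u0629": "\u0647",  # taa' marbuuta         -> haa'
    "\u0649": "\u064A",  # alif maqsuura         -> yaa'
})


def normalize_arabic(text: str) -> str:
    """Strip vowel signs and fold variant letter forms (illustrative rules)."""
    return MARKS.sub("", text).translate(FOLD)
```

      Under these assumed rules, a spelling with <I>hamza</I> on the initial 
      <I>alif</I> and a spelling without it normalize to the same string, so a 
      plain comparison after normalization matches both.<BR><BR>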
      Additionally, grammatical affixes and other constructions which attach 
      directly to words, such as conjunctions, prepositions, and the definite 
      article, should be identified and removed. Finally, regular and irregular 
      plural forms need to be identified and reduced to their singular forms. 
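      <BR><BR>The affix removal described above, restricted to prefixes, might 
      look like the Python sketch below. The prefix list (the definite article 
      alone or fused with a one-letter conjunction or preposition), the 
      minimum stem length of three letters, and the name <I>strip_prefix</I> 
      are all illustrative assumptions; suffix removal and plural handling are 
      left out, and the input is assumed to have its vowel signs and 
      <I>hamza</I> forms already normalized.

```python
# Hypothetical prefix list for illustration. Longer prefixes come first so
# that "wa + al-" is tried before the bare article "al-".
PREFIXES = (
    "\u0648\u0627\u0644",  # wa + al-  (and the)
    "\u0628\u0627\u0644",  # bi + al-  (with/by the)
    "\u0643\u0627\u0644",  # ka + al-  (like the)
    "\u0641\u0627\u0644",  # fa + al-  (so the)
    "\u0627\u0644",        # al-       (the)
    "\u0644\u0644",        # li + al-, written with two laams (for the)
)

# Assumed lower bound on what must remain after stripping.
MIN_STEM = 3


def strip_prefix(word: str) -> str:
    """Remove at most one leading prefix, keeping at least MIN_STEM letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            return word[len(prefix):]
    return word
```

      A purely orthographic rule like this cannot tell an article from a 
      look-alike: the verb <I>iltaqaa</I> (he met) happens to begin with the 
      same two letters as the definite article, so the sketch strips them and 
      returns a stem that is not a form of that word.<BR><BR>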
      Performing this type of stemming leads to more successful searches, but 
      can be problematic due to over-generation or incorrect generation of 
      stems.<BR><BR>A third method for searching Arabic texts is to index and 
      search for the root forms of each word. Since most verbs and nouns in 
      Arabic are derived from triliteral (or, rarely, quadriliteral) roots, 
      identifying the underlying root of each word theoretically retrieves most 
      of the documents containing a given search term regardless of form. 
      However, there are some significant challenges with this approach. 
      Determining the root for a given word is extremely difficult, since it 
      requires a detailed morphological, syntactic and semantic analysis of the 
      text to fully disambiguate the root forms. The issue is complicated 
      further by the fact that not all words are derived from roots. Loan 
      words (words borrowed from another language), for instance, are 
      generally not based on root forms, although even this rule has 
      exceptions: some loans whose structure resembles a triliteral root, 
      such as the English word <I>film</I>, are handled grammatically as if they 
      were root-based, adding to the complexity of this type of search. Finally, 
      the root can serve as the foundation for a wide variety of words with 
      related meanings. The root<SPAN lang=ar-ea> ك ت ب </SPAN>(<I>ktb</I>) is 
      used for many words related to writing, including<SPAN lang=ar-ea> كتب 
      </SPAN>(<I>kataba</I>, to write),<SPAN lang=ar-ea> كتاب 
      </SPAN>(<I>kitaab</I>, book),<SPAN lang=ar-ea> مكتب </SPAN>(<I>maktab</I>, 
      office), and<SPAN lang=ar-ea> كاتب </SPAN>(<I>kaatib</I>, author). But the 
      same root is also used for regiment/battalion,<SPAN lang=ar-ea> كتيبة 
      </SPAN>(<I>katiiba</I>). As a result, searching on root forms 
      yields very high recall, but precision is usually quite 
      low.<BR><BR>While search and retrieval of Arabic text will never be an 
      easy task, relying on linguistic analysis tools and methods can help make 
      the process more successful. Ultimately, the search method you choose 
      should depend on how critical it is to retrieve every conceivable instance 
      of a word or phrase and the resources you have to process search returns 
      in order to determine their true relevance.<BR><BR><I>This sidebar is 
      reprinted from #51, Volume 13, Issue 7 of</I> <A 
      Computing &amp; Technology</A> <I>published by <A 
      href="http://www.multilingual.com/">MultiLingual Computing, Inc.</A>, 319 
      North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 
      208-263-6310.</I> <BR><BR></FONT> 	      </td>