<title>PhpDig.net - PhpDig Documentation</title>
<style type="text/css">
body, td {
	background-color: #EEEEEE;
	color: #000000;
	font: 10pt verdana, geneva, lucida, 'lucida grande', arial, helvetica, sans-serif;
	margin: 10px 10px 10px 10px;
	padding: 0px;
}
a:link {
	color: #006699;
}
a:visited {
	color: #336699;
}
a:hover, a:active {
	color: #C00000;
}
.cdtext {
	color: #008000;
	font: 10pt courier, 'courier new', monospace;
}
</style>

<strong><a href="http://www.phpdig.net/index.php">PhpDig.net</a> - Documentation</strong>

<p><i>Last update : 2005-01-16 - Read, read, read!</i></p>
<a name="toc"></a>

<b><u> Table of contents</u></b>
<li><a href="#toc1"><u>1. Where to find the latest version</u></a></li>
<li><a href="#toc2"><u>2. PhpDig features</u></a></li>
<li><a href="#toc3"><u>3. Installation</u></a></li>
<li><a href="#toc4"><u>4. Configuration</u></a></li>
<li><a href="#toc5"><u>5. Update PhpDig</u></a></li>
<li><a href="#toc6"><u>6. Indexing with web interface</u></a></li>
<li><a href="#toc7"><u>7. Indexing by command line interface</u></a></li>
<li><a href="#toc8"><u>8. Templates</u></a></li>
<li><a href="#toc9"><u>9. Insert PhpDig into a website</u></a></li>
<li><a href="#toc10"><u>10. Getting help with PhpDig</u></a></li>


<a name="toc1"></a><b>1. Where to find the latest version</b>
<p>At this link: <a href="http://www.phpdig.net/">http://www.phpdig.net/</a></p>

<a name="toc2"></a><b>2. PhpDig features</b>
<p><b>2.1. HTTP spidering</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

PhpDig follows HREF links, as any web browser would, to find the pages to index. 
Links can also appear in image maps (AreaMap), frames, or simple JavaScript such as window.open() or window.location(). PhpDig 
supports redirections and indexes by following links. PhpDig does not traverse directories or database 
tables to index content.
By default, PhpDig does not go outside of the domain you define for the indexing. Various index options 
are chosen by the user, including a parameter to extend indexing to subdomains and a parameter to 
limit the indexing to a specific directory.
You can limit indexing so that the maximum number of links found is ((X * Y) + 1), where X is links per level and Y is depth. 
Alternatively, you can index just one page, or you can set options to index a greater number of pages.
Any HTML content is indexed, from static HTML pages to dynamic HTML pages produced by, say, PHP 
scripts. PhpDig checks the MIME type of each document, and can be set to auto-index via a cron job.
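<p>As a quick illustration of the ((X * Y) + 1) limit, here is a minimal sketch. The function name <i>max_links_found</i> and the sample values are made up for this example and are not part of PhpDig.</p>

```php
<?php
// Hypothetical helper, not part of PhpDig: the ((X * Y) + 1) bound from the text,
// where X is the number of links per level and Y is the depth.
function max_links_found($linksPerLevel, $depth)
{
    return ($linksPerLevel * $depth) + 1;
}

// Example values chosen for illustration only.
echo max_links_found(5, 3), "\n"; // (5 * 3) + 1 = 16
```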

<p><b>2.2. Full-text indexing</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
PhpDig indexes all words of a document, but you can avoid common words by defining such words in a text file. 
Underscores and other characters can be part of a word. Words in the title can be given greater weight when 
ranking results.

Note that the MySQL FULLTEXT index is different from the PhpDig full-text indexing. The MySQL FULLTEXT index 
is a table index used with MyISAM tables. PhpDig does full-text indexing of page content but does not use the 
MySQL FULLTEXT index for searches.

<p><b>2.3. Indexed file types</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
PhpDig indexes HTML and text files by itself. PhpDig can also index PDF, MS-Word, MS-Excel, and MS-PowerPoint files 
if you install external binaries on the server for this purpose. PhpDig is configured to use <b>catdoc</b>, 
<b>xls2csv</b>, <b>pstotext</b> or <b>pdftotext</b>, and <b>ppt2text</b> programs.

- You can find catdoc and xls2csv at this link: 
<a href="http://www.45.free.net/~vitus/ice/catdoc/">http://www.45.free.net/~vitus/ice/catdoc/</a>
- You can find pstotext at this link: 
<a href="http://research.compaq.com/SRC/virtualpaper/pstotext.html">http://research.compaq.com/SRC/virtualpaper/pstotext.html</a>
- You can find pdftotext at this link: 
<a href="http://public.planetmirror.com/pub/xpdf/">http://public.planetmirror.com/pub/xpdf/</a>
- You can query for ppt2text at this link: 
<a href="http://www.google.com/search?q=ppt2text">http://www.google.com/search?q=ppt2text</a>

The author of PhpDig does not offer support for the binary programs. Contact the authors of those programs 
if you have trouble with compiling and/or installing them.
Of course, you can use other binary programs to extract text from PDF, MS-Word, MS-Excel, and MS-PowerPoint files.
To demonstrate the external binaries feature, you can search 
<a href="http://www.phpdig.net/demo/search.php?query_string=hamlet" target="_blank">Hamlet</a> 
(tragedy, Shakespeare, from MS-Word format) or 
<a href="http://www.phpdig.net/demo/search.php?query_string=l%27avare" target="_blank">L'Avare</a> 
(comedy, Molière, from PDF format).

<p><b>2.4. Other features</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

PhpDig tries to read a <i>robots.txt</i> file at the server web root, and considers <i>META robots</i> tags too. 
The <i>last-modified</i> header value is stored in the database to avoid redundant indexing. Also, the META 
<i>revisit-after</i> tag is considered.
PhpDig can spider sites served on a port other than the default 80, but spidering https:// sites on port 443 may be met with 
limited success. Sites that are password protected with a .htaccess file can be indexed if you give the robot a valid 
username and password such as http://username:password@www.domain.com/ but <b>be careful</b>!

This .htaccess related feature could let an unauthorized user read protected information, and the username and 
password are sent in plain text. It is recommended that you create a specific instance of PhpDig, protected by the same 
credentials as the restricted site, and index within the protected area.
If desired, PhpDig can store textual content of indexed documents in files. In this case, relevant extracts from found 
pages are displayed in the search results with highlighted search keys. Otherwise, a chunk of text as specified in the 
config file is stored in a database table.

<p><b>2.5. Display templates</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
PhpDig comes with a template system that lets the search page fit into the look of an existing site. Making a 
template consists only of inserting a few XML-like tags into an HTML page. See the templates that came with PhpDig 
for examples. Also, see section 8 for further information about different templating options, and see section 9 for 
how to insert PhpDig into a website.

<p><b>2.6. Limits</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

Because indexing is time consuming, PHP must not run in safe_mode and the server that performs the 
indexing must not time out. Also, the PHP allow_url_fopen option must be enabled. These limits do not apply to search queries.
You can <i>try</i> to circumvent safe_mode, should it be enabled, by a) using distant indexing with a MySQL TCP 
connection and an FTP connection, or b) launching the indexing process from a shell command such as a <i>cron</i> job.
Spidering and indexing are a bit slow, as a decent amount of processing is needed to index pages. 
On the other hand, search queries are fast enough, even in a somewhat extended context. 
You may find that indexing via shell, say with cron, is somewhat faster.
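<p>The two PHP settings named above can be checked from a small script. This is only a sketch: the function name <i>indexing_blockers</i> is made up for this example, and the check covers safe_mode and allow_url_fopen only, not server timeouts.</p>

```php
<?php
// Hypothetical helper, not part of PhpDig: lists settings that would block indexing.
function indexing_blockers($safeMode, $allowUrlFopen)
{
    $problems = array();
    if (filter_var($safeMode, FILTER_VALIDATE_BOOLEAN)) {
        $problems[] = 'safe_mode is on';
    }
    if (!filter_var($allowUrlFopen, FILTER_VALIDATE_BOOLEAN)) {
        $problems[] = 'allow_url_fopen is off';
    }
    return $problems;
}

// ini_get() returns false for a directive the running PHP does not know
// (safe_mode was removed in PHP 5.4); that is treated as "off" here.
$problems = indexing_blockers(ini_get('safe_mode'), ini_get('allow_url_fopen'));
echo $problems ? implode("\n", $problems) . "\n" : "Settings look fine for indexing\n";
```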

<a name="toc3"></a><b>3. Installation</b>

<p><b>3.1. Prerequisites</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
PhpDig requires a web server (preferably Apache) with PHP (module or CGI) and a MySQL database. PHP must 
have safe_mode set to off and allow_url_fopen set to on. Make sure your web server is not set to time out 
quickly, as indexing can take some time. PhpDig can work in a shared hosting environment, but note that it 
can take a fair amount of CPU time so your host may kill the process or become unhappy with you.
PhpDig has reportedly worked with the following OS/setups:
Gentoo Linux, kernel/2.4.20, Apache/2.0.48, mod_php/4.3.3, MySQL/4.0.16<br>
Linux, kernel/2.4.18, Apache/2.0.44, OpenSSL/0.9.6g, PHP/4.3.0<br>
Linux, kernel/2.4.22, Apache/1.3.29, mod_ssl/2.8.16, OpenSSL/0.9.7b, PHP/4.3.4<br>

Linux, kernel/2.4.3, Apache/1.3.23, mod_ssl/2.8.7, PHP/4.1.2<br>
Linux Red Hat/9.0, Apache/2.0.48, PHP/4.3.4, MySQL/4.0.17<br>
Mac OS X/10.3, Apache/1.3.28, PHP/4.3.2, MySQL/4.0.12<br>
OpenBSD/3.4/Sparc64, Apache/1.3.29, mod_ssl/2.8.16-1.3.29, mod_perl/1.28, OpenSSL/0.9.7c, PHP/4.3.4<br>
Windows 2000 Server, Apache/1.3.20, PHP/4.1.1<br>
Windows 2000 Server, Apache/2.0.44, PHP/4.3.1<br>
Windows 2003 Server, IIS/6, PHP/4.3.2, MySQL/4.0.15
Note that if your OS/setup is, for example, a CGI load-balanced cluster of servers, it may not be possible to index sites 
on the cluster as there cannot be a connection back to the load-balanced address. Also note that PhpDig is a web spider 
and search engine, meaning that you may have to edit your hosts file with something like "127.0.0.1 www.domain.com" in 
order to get PhpDig to crawl on localhost.

<p><b>3.2. Fresh install</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

1) Check your phpinfo page to make sure safe_mode is off and allow_url_fopen is on.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; What is a phpinfo page? Run this script: &lt;?php phpinfo(); ?&gt;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; PhpDig will not fully function if safe_mode is on or allow_url_fopen is off.<br><br>

2) Unzip the archive and make a new directory on your web account to hold PhpDig.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; This new directory, whatever you called it, will be called [PHPDIG_DIR] here.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; If an open_basedir restriction is in place, make sure to create the directory in the correct place.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; An open_basedir restriction? Some PHP files are limited to within a directory. Check your phpinfo page.<br><br>

3) Open config.php with a TEXT editor and set your username and password.<br><br>

4) Upload the PhpDig folders and files to your [PHPDIG_DIR] directory.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; FTP the HTML, TXT, PHP, &amp; SQL files in ASCII format.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; FTP the GIF, PNG, &amp; JPG files in binary format.<br><br>

5) CHMOD the following directories to 777, or rwxrwxrwx, permission if on a *nix server.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; [PHPDIG_DIR]/text_content &nbsp;&nbsp;(this folder holds text files from index)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; [PHPDIG_DIR]/includes &nbsp;&nbsp;(can be set to 755 after install completed)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; [PHPDIG_DIR]/admin/temp &nbsp;&nbsp;(temp directory inside the admin directory)<br><br>

6) Now access http://www.yourdomain.com/[PHPDIG_DIR]/admin/install.php<br><br>

7) Fill in the form, and select Create Database or Create Tables Only for fresh install.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; <i><u>Some hosts do not allow users to create databases, so then create tables instead.</u></i><br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; Tip: use &quot;a_unique_prefix_&quot; (without the quotes) as the prefix for your PhpDig tables.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; After installation is successfully completed, the PhpDig admin panel will be shown.<br><br>

8) From the PhpDig admin panel, enter a URL in the text box to begin spidering a website.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; Please, <b>DO NOT</b> use PhpDig on phpdig.net (it may cost me money for bandwidth).<br><br>

9) Go to http://www.yourdomain.com/[PHPDIG_DIR]/search.php to search.<br><br>

10) Edit the config file as desired. Read on for further details.<br><br>
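<p>If the installer complains about permissions, the directories from step 5 can be checked with a short script placed in [PHPDIG_DIR]. This is a rough sketch; the function name <i>unwritable_dirs</i> is made up for this example and no such script ships with PhpDig.</p>

```php
<?php
// Hypothetical check, not shipped with PhpDig: returns the directories that are
// missing or not writable by the user PHP runs as.
function unwritable_dirs($phpdigDir, array $relativeDirs)
{
    $bad = array();
    foreach ($relativeDirs as $dir) {
        $path = rtrim($phpdigDir, '/') . '/' . $dir;
        if (!is_dir($path) || !is_writable($path)) {
            $bad[] = $path;
        }
    }
    return $bad;
}

// The three directories listed in step 5, relative to [PHPDIG_DIR].
$dirs = array('text_content', 'includes', 'admin/temp');
foreach (unwritable_dirs(__DIR__, $dirs) as $path) {
    echo "Not writable: $path\n";
}
```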


<p><b>3.3. Update version</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>


1) Check to see if there were any database table changes and if so update the tables.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; You can find any table changes in the sql directory: no sql file, no changes.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; If you keep current on PhpDig, you can update the tables using the install script.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; <i><u>Note that the install script will only update tables for the most recent changes.</u></i><br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; If you need to update further, you need to manually update the tables: read on.<br><br>

2) Unzip the archive and FTP the new files to your web account, overwriting the old files.<br><br>

&nbsp;&nbsp;&nbsp;&nbsp;&gt; If you made any config file changes, edit the new config file before doing the FTP.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; Compare the new _connect file with the old connect file to look for possible changes.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; If _connect.php is different than connect.php, then copy _connect.php to connect.php.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&gt; Make sure connect.php contains your database info, and then FTP the files to the server.<br><br>

3) Go to http://www.yourdomain.com/[PHPDIG_DIR]/search.php to search.<br><br>


<p><b>3.4. MySQL tables</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
There are three ways to install or update the database tables.
<i>- PhpDig Install Script</i>: In your favorite browser, request the page 
http://www.yourdomain.com/[PHPDIG_DIR]/admin/install.php and choose if you want to create the entire database, 
create the tables only, update the existing database, or write the connection parameters to the connect.php file. 
Note that some hosts do not allow users to create databases, so then create tables instead. 
See sections 3.2, 3.3, and 5 for further information: go on, read them.

<i>- Manual Installation</i>: Decide whether you want to create a new database for the tables, create the tables 
in a current database, or update the tables. Note that some hosts do not allow users to create databases, so then 
create tables instead. Also decide if you want a table prefix. If so, then open the sql file and add your prefix to 
the beginning of every table name in the file. If you wish, and have permission, to create a new database for the 
tables, go to the shell prompt and log in to MySQL. Otherwise, you can create or update the tables in an existing 
database from shell. The capitalized words below stand for your own information.
* To manually create a database<br>
<pre class="cdtext">
shell&gt; mysql -h HOSTNAME -u USERNAME -p
mysql&gt; create database DBNAME;
mysql&gt; exit
</pre>
* To manually install or update tables<br>
<pre class="cdtext">
</pre>
* To manually verify presence of tables<br>
<pre class="cdtext">
shell&gt; mysql -h HOSTNAME -u USERNAME -p
mysql&gt; use DBNAME;
mysql&gt; show tables;
+-------------------+
| Tables_in_DBNAME  |
+-------------------+
| PREFIX_clicks     |
| PREFIX_engine     |
| PREFIX_excludes   |
| PREFIX_includes   |
| PREFIX_keywords   |
| PREFIX_logs       |
| PREFIX_site_page  |
| PREFIX_sites      |
| PREFIX_spider     |
| PREFIX_tempspider |
+-------------------+
10 rows in set (0.00 sec)

mysql&gt; exit
</pre>
<i>- Utilize phpMyAdmin</i>:  If this manual stuff freaks you out, you can use phpMyAdmin to install or update 
the tables and run queries. The author of PhpDig does not offer phpMyAdmin support. See the phpMyAdmin 
documentation and help files for further information.
As of PhpDig v.1.8.6 there are ten tables, but this may change in the future. You can count the number of tables 
in the current version by viewing the init_db.sql file. You should consider using a table prefix, because if there 
are two tables with the same name, you will get an error.

<p><b>3.5. _connect.php versus connect.php</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
The _connect.php file comes with the PhpDig package and is used by the PhpDig install script to create or update 
the connect.php file. The connect.php file is the file PhpDig will use to connect to your database. Depending on 
what method of install or upgrade you choose, you may need to copy the "[PHPDIG_DIR]/includes/_connect.php" 
file to "[PHPDIG_DIR]/includes/connect.php" and <i>edit the connect.php file</i> by replacing the &lt;host&gt;, 
&lt;user&gt;, &lt;pass&gt;, and &lt;database&gt; values with your database information. Further, if you use 
a table prefix for the PhpDig tables, then replace &lt;dbprefix&gt; with the actual prefix. Otherwise, set 
&lt;dbprefix&gt; to nothing: two single quotes, no space in between. Note that &lt; and &gt; should not be present 
in these values when you are done. Do not put your database information in the _connect.php file. 
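<p>The placeholder replacement described above can be pictured with the sketch below. The function name <i>fill_connect_placeholders</i>, the sample values, and the one-line template are all made up for this example; the real _connect.php layout may differ, so treat it as an illustration of the idea rather than a tool to run on your files.</p>

```php
<?php
// Hypothetical helper, not part of PhpDig: swaps <host>, <user>, <pass>,
// <database>, and <dbprefix> style placeholders for real values.
function fill_connect_placeholders($template, array $values)
{
    $map = array();
    foreach ($values as $name => $value) {
        $map['<' . $name . '>'] = $value;
    }
    return strtr($template, $map);
}

// Stand-in text only; the real _connect.php layout may differ.
$template = "host=<host> user=<user> pass=<pass> db=<database> prefix='<dbprefix>'";
echo fill_connect_placeholders($template, array(
    'host'     => 'localhost',
    'user'     => 'phpdig',
    'pass'     => 'secret',
    'database' => 'phpdig_db',
    'dbprefix' => '', // no prefix: leaves two single quotes with nothing between
)), "\n";
```

<p>Note how an empty prefix leaves two single quotes with no angle brackets left over, which is the end state the text asks for.</p>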


<a name="toc4"></a><b>4. Configuration</b>
After installation, PhpDig will work without modification to the config file. However, you may wish to make 
modifications to the config file depending on your needs. Do not say that PhpDig does not work until after 
you fully explore the config file and its options. In any case, remember to change your PhpDig admin 
panel <b>username</b> and <b>password</b> for login.
Note that the authentication method used in the auth.php file is cookie based. If you happen to use authold.php 
for authentication, it is http based. As such, authold.php does not work when PHP is CGI configured. If you use 
authold.php and CGI based PHP, then use a .htaccess file in order to protect the [PHPDIG_DIR]/admin directory.

Regardless of whether you use auth.php (the default) or authold.php for authentication, it would be a good idea 
to add a .htaccess file to the admin directory for an extra layer of protection. If you don't want to pass the 
username and password in plain text, use authold.php (file modifications needed) or use SSL.
All of the configuration parameters are in the "[PHPDIG_DIR]/includes/config.php" file. Here is a list of each 
of them, followed by a comment explaining its purpose, taken from the config file where default values have 
already been set.

<p><b>4.1. Configuring admin access</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
<pre class="cdtext">
define('PHPDIG_ADM_AUTH','1');     //Activates/deactivates the authentication functions
define('PHPDIG_ADM_USER','admin'); //Username
define('PHPDIG_ADM_PASS','admin'); //Password
</pre>

<p><b>4.2. Configuring paths</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

The $_SERVER array elements and CONFIG_CHECK constant in the config file are used to prevent full, direct 
access to the config file. Further, an ABSOLUTE_SCRIPT_PATH constant is set, and the $relative_script_path 
variable is checked, in the config file.
For users who want to run PhpDig from shell, set a cron job, or call the search page from other than the 
default location, it is <b>important</b> to read the top of the config file, as it gives details on how to add 
a path (relative or full) to the first IF statement in the config file.
<i>If you try to access the spider via shell or cron from a directory not permitted, or if you move the search page 
out of its default location without any other modification, you will find yourself with nothing happening.</i>
The top of the config file gives details on how to move the search page. Should you wish to run PhpDig from shell 
or set a cron job, define the ABSOLUTE_SCRIPT_PATH as your full path up to but not including admin directory, no 
end slash, and then read section 7 for further details.

<p><b>4.3. Configuring robot and engine</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

<pre class="cdtext">
define('SPIDER_MAX_LIMIT',20);          //max recurse levels in spider
define('RESPIDER_LIMIT',5);             //recurse respider limit for update
define('LINKS_MAX_LIMIT',20);           //max links per each level
define('RELINKS_LIMIT',5);              //recurse links limit for an update

define('LIMIT_DAYS',0);                 //default days before reindex a page
define('SMALL_WORDS_SIZE',2);           //words to not index - must be 2 or more
define('MAX_WORDS_SIZE',30);            //max word size

define('PHPDIG_EXCLUDE_COMMENT','&lt;!-- phpdigExclude --&gt;');  //comment to exclude a page part
define('PHPDIG_INCLUDE_COMMENT','&lt;!-- phpdigInclude --&gt;');  //comment to include a page part
                                                            // must be on own lines in HTML source
                                                            // text within comments not indexed
                                                            // links within comments still indexed

define('PHPDIG_DEFAULT_INDEX',false);    //phpDig considers /index or /default
                                         //html, htm, php, asp, phtml as the
                                         //same as '/'

define('PHPDIG_SESSID_REMOVE',true);        // remove SIDS or vars from indexed URLS
define('PHPDIG_SESSID_VAR','PHPSESSID,s');  // name of SID or variable to remove
                                            // can be 's' or comma delimited 's,id,var,foo,etc'

define('APPEND_TITLE_META',false);       //append title and meta information to results
define('TITLE_WEIGHT',3);                //relative title weight: APPEND_TITLE_META needs to be true

define('CHUNK_SIZE',1024);               //chunk size for regex processing

define('SUMMARY_LENGTH',500);            //length of results summary

define('TEXT_CONTENT_PATH','text_content/'); //Text content files path
define('CONTENT_TEXT',1);                    //Activates/deactivates the
                                             //storage of text content.
define('PHPDIG_IN_DOMAIN',false);            //allows phpdig jump hosts in the same
                                             //domain. If the host is "www.mydomain.tld",
                                             //domain is "mydomain.tld"

//for limit to directory, URL format must either have file at end or ending slash at end
//e.g., http://www.domain.com/dirs/ (WITH ending slash) or http://www.domain.com/dirs/dirs/index.php
define('LIMIT_TO_DIRECTORY',true);      //limit index to given (sub)directory, no sub dirs of dirs 
                                        //are indexed

define("END_OF_LINE_MARKER","\r\n");             // End of line marker - keep double quotes

define('PHPDIG_LOGS',true);               //write logs
define('LOG_CLICKS',true);                //log clicks
define('SILENCE_404S',true);              //silence 404 output

define('TEMP_FILENAME_LENGTH',8);         //filename length of temp files
// if using external tools with extension, use 4 for filename of length 8

define('USE_RENICE_COMMAND','1');         //use renice for process priority

// regular expression to ban useless external links in index

// regexp forbidden extensions - return sometimes text/html mime-type !!!
</pre>
The BANNED constant bans external links in the index, meaning that those links do not show up as keys in 
search results. The FORBIDDEN_EXTENSIONS constant bans certain links from being indexed. Don't let the 
name fool you: a regex can be set in the FORBIDDEN_EXTENSIONS constant to ban various types of links from even 
being indexed. Again, BANNED is to ban keys from search results, and FORBIDDEN_EXTENSIONS is to ban the indexing 
of links.
There is also an $allowed_link_chars variable for creating a class of characters to allow in links, a $spec 
array for translating from HTML entity to character, a $month_names array for month names, and an 
$apache_indexes array that helps prevent the spider from crawling different orderings of directory 
listings. You may modify the $allowed_link_chars, $spec, $month_names, and 
$apache_indexes settings to suit your fancy.
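<p>To picture what the PHPDIG_EXCLUDE_COMMENT and PHPDIG_INCLUDE_COMMENT markers do to page text, here is a simplified sketch. It is not PhpDig's own parser, and the function name <i>strip_excluded_text</i> is made up for this example. It only shows that text between the two comments, each on its own line, is left out of the indexed text; it does not model how links inside that region are still followed.</p>

```php
<?php
// Simplified illustration only; PhpDig's real parser is not shown in this document.
function strip_excluded_text($html)
{
    $kept = array();
    $indexing = true;
    foreach (preg_split('/\r\n|\r|\n/', $html) as $line) {
        $trimmed = trim($line);
        if ($trimmed === '<!-- phpdigExclude -->') {
            $indexing = false; // stop keeping text after this marker
            continue;
        }
        if ($trimmed === '<!-- phpdigInclude -->') {
            $indexing = true;  // resume keeping text after this marker
            continue;
        }
        if ($indexing) {
            $kept[] = $line;
        }
    }
    return implode("\n", $kept);
}

$page = "<p>Indexed intro</p>\n"
      . "<!-- phpdigExclude -->\n"
      . "<p>Navigation menu</p>\n"
      . "<!-- phpdigInclude -->\n"
      . "<p>Indexed body</p>";
echo strip_excluded_text($page), "\n";
```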

<p><b>4.4. Configure PhpDig encoding</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

<p><i>PhpDig does not support multiple or multi-byte encodings. The chosen encoding applies to all indexed 
documents and the admin interface. Choose one encoding per installation and stick with it.</i></p>
<pre class="cdtext">
define('PHPDIG_ENCODING','iso-8859-1');  // encoding for interface, search and indexing.
                                         // iso-8859-1, iso-8859-2, iso-8859-7, tis-620,
                                         // and windows-1251 supported in this version.
</pre>
<p>If you want PhpDig to support another encoding, you have to add array indexes to the following variables, 
taking examples from existing ones. See the config file for examples.</p>
<pre class="cdtext">
</pre>

<p>If your encoding is not available, you can <i>try</i> the following, where NAME represents your encoding name, 
so change NAME to that encoding. Also, remove the line breaks. Note, this is a generic encoding so it may or may 
not work.</p>
<pre class="cdtext">

$phpdig_string_subst['NAME'] = 

$phpdig_words_chars['NAME'] = '[:alnum:]ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
</pre>

<p>You may need to run the query "alter table keywords modify keyword varchar(64) binary;" for certain encodings.</p>

<p><b>4.5. Configuring external binaries</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
External binaries are programs that you may use in conjunction with PhpDig to extract text from 
Word, PDF, Excel, or PowerPoint files. If you do not want to index Word, PDF, Excel, or PowerPoint 
files, then you do not have to edit these constants.
Each external tool is defined by four constants<br>
- INDEX (true or false) : Activate this file type indexing<br>
- PARSE (path) : Executable path location<br>
- OPTION (options) : Options of the program<br>
- EXTENSION : File extension if not STDOUT<br>
<pre class="cdtext">
define('PHPDIG_OPTION_MSWORD','-s 8859-1');




// if external binary is not STDOUT or different extension is needed
// for example, use '.txt' if external binary writes to filename.txt

// if set to true is_executable used - set to '0' if is_executable is undefined
define('USE_IS_EXECUTABLE_COMMAND','1'); //use is_executable for external binaries
</pre>
If you have problems configuring PhpDig to work with external binaries, make sure to read this link: 
<a href="http://www.phpdig.net/forum/showthread.php?t=799">http://www.phpdig.net/forum/showthread.php?t=799</a>
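<p>As a rough picture of how a PARSE path and an OPTION string combine into one command, consider the sketch below. It is not PhpDig's own code: the function name <i>build_extract_command</i>, the install path /usr/local/bin/catdoc, and the file name are assumptions made for this example. Only the '-s 8859-1' option comes from the config excerpt above.</p>

```php
<?php
// Hypothetical helper, not PhpDig's own code: joins a parser path, its options,
// and a file name into one shell command line.
function build_extract_command($parserPath, $options, $file)
{
    // escapeshellarg() quotes the file name so spaces or quotes in it are safe.
    return trim($parserPath . ' ' . $options) . ' ' . escapeshellarg($file);
}

// '/usr/local/bin/catdoc' is an assumed install path; '-s 8859-1' is the
// PHPDIG_OPTION_MSWORD value shown above.
echo build_extract_command('/usr/local/bin/catdoc', '-s 8859-1', 'report.doc'), "\n";
```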

<p><b>4.6. Configuring templates</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

<pre class="cdtext">
$phpdig_language = "en"; // ca, cs, da, de, en, es, fr, gr, it, nl, no, pt

// $template = "$relative_script_path/templates/phpdig.html";
</pre>
<p>See the config file for further details about the $template variable and its relation to the 
$template_demo variable.</p>
<pre class="cdtext">
define('HIGHLIGHT_BACKGROUND','#FFBB00');        //Highlighting background color
                                                 //Only for classic mode
define('HIGHLIGHT_COLOR','#000000');             //Highlighting text color
                                                 //Only for classic mode

define('LINK_TARGET','_blank');                  //Target for result links
define('WEIGHT_IMGSRC','./tpl_img/weight.gif');  //Bar graph image path
define('WEIGHT_HEIGHT','5');                     //Bar graph height
define('WEIGHT_WIDTH','50');                     //Max bar graph width

define('SEARCH_PAGE','search.php');              //The name of the search page
define('DISPLAY_DROPDOWN',true);                 //Display dropdown on search page
define('DROPDOWN_URLS',true);                    //Always URLs in dropdown: DISPLAY_DROPDOWN 
                                                 //needs to be true

define('SUMMARY_DISPLAY_LENGTH',150);            //Max chars displayed in summary
define('SNIPPET_DISPLAY_LENGTH',150);            //Max chars displayed in each snippet

define('DISPLAY_SNIPPETS',true);                 //Display text snippets
define('DISPLAY_SNIPPETS_NUM',4);                //Max snippets to display
define('DISPLAY_SUMMARY',false);                 //Display description

define('PHPDIG_DATE_FORMAT','\1-\2-\3');         // Date format for last update
                                                 // \1 is year, \2 month and \3 day
                                                 // if using rss, use date format \1-\2-\3

define('SEARCH_BOX_SIZE',15);                    // Search box size
define('SEARCH_BOX_MAXLENGTH',50);               // Search box maxlength

define('SEARCH_DEFAULT_LIMIT',10);      //results per page

// start is AND OPERATOR, exact is EXACT PHRASE, and any is OR OPERATOR
define('SEARCH_DEFAULT_MODE','start');  // default search mode (start|exact|any)
// in language pack make the appropriate changes to 'w_begin', 'w_whole', and 'w_part'
// 'w_begin' => 'and operator', 'w_whole' => 'exact phrase', 'w_part' => 'or operator'

define('NUMBER_OF_RESULTS_PER_SITE',-1);  //max number of results per site
                                          // use -1 to display all results
</pre>
Last but not least, the PhpDig version number is defined in the config file.
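<p>The \1, \2, and \3 in PHPDIG_DATE_FORMAT behave like regular expression backreferences for year, month, and day. The sketch below shows the idea with preg_replace. The function name <i>format_last_update</i> is made up for this example, and the assumption that the stored date begins with YYYYMMDD digits is mine, not something this document states.</p>

```php
<?php
// Hypothetical helper, not part of PhpDig: applies a PHPDIG_DATE_FORMAT-style
// pattern to a date. Assumes the date starts with YYYYMMDD digits; the format
// PhpDig actually stores may differ.
function format_last_update($stamp, $format)
{
    return preg_replace('/^(\d{4})(\d{2})(\d{2}).*$/', $format, $stamp);
}

echo format_last_update('20050116', '\1-\2-\3'), "\n"; // 2005-01-16
echo format_last_update('20050116', '\3/\2/\1'), "\n"; // 16/01/2005
```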

<p><b>4.7. RSS and List configuration</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
<p>You can configure PhpDig to create an RSS feed of results with a link to the output from the search page.</p>
<pre class="cdtext">
define('ALLOW_RSS_FEED',false);                       // Do RSS and display link - if true, 
                                                      // set rss dir to 777
$theenc = PHPDIG_ENCODING;                            // needs to be same encoding used in index
$theurl = "http://www.phpdig.net/";                   // site offering the RSS feed
$thetitle = "PhpDig.net";                             // title for site offering the RSS feed
$thedesc = "PhpDig :: Web Spider and Search Engine";  // description of site offering the RSS feed
$thedir = "./rss";                                    // the rss directory name, no ending slash
$thefile = "search.rss";                              // used in rss filenames
</pre>
<p>You can configure PhpDig to create a list of previous search queries with links to the search page.</p>
<pre class="cdtext">
define('LIST_ENABLE',true);             //do listing of past queries
define('LIST_PAGE','list.php');         //the name of the list page
define('LIST_NEW_WINDOW',1);            //open queries in new window
define('LIST_SHOW_ZEROS',0);            //show queries with zero results
define('LIST_DEFAULT_LIMIT',20);        //listings per page - positive multiple of ten - 10,20,30,...
define('LIST_META_TAG','&lt;meta name="robots" content="noindex,nofollow"&gt;'); //meta tag for list page
</pre>

<p><b>4.8. FTP configuration (if necessary)</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
PhpDig does not index FTP sites. It follows links as seen in a web browser, which can be an 
intensive process. Many PhpDig users install PhpDig on shared hosting accounts, and on such accounts, 
PHP is often configured with safe_mode activated. On these accounts, access to the crontab is usually 
not allowed.
If this is the case for you, and should your host permit you to connect to your MySQL database through 
TCP/IP, you may wish to try the FTP indexing option, as it sends textual content of indexed documents to 
the proper directory on a remote server. For instance, you could run an instance of PHP on a cable 
connection "server" to perform the update process.

If you deactivate the FTP function (in case of low bandwidth connections), then only the summary stored in 
the database, and not the exact document, is displayed on the results page. The FTP parameters are as follows.
<pre class="cdtext">
define('FTP_ENABLE',0);//enable ftp content for distant PhpDig
define('FTP_HOST','&lt;ftp host&gt;'); //if distant PhpDig, ftp host;
define('FTP_PORT',21); //ftp port
define('FTP_PASV',1); //passive mode
define('FTP_PATH','&lt;path to phpdig directory&gt;'); //distant path from the ftp root
define('FTP_TEXT_PATH','text_content');//ftp path to text-content directory
define('FTP_USER','&lt;ftp username&gt;');
define('FTP_PASS','&lt;ftp password&gt;');
</pre>

<p><b>4.9. Configuring cron</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

<p>You can configure PhpDig to automatically create a cron file that you can use to run a cron job.</p>
<pre class="cdtext">
// NOTE: make sure ABSOLUTE_SCRIPT_PATH is the full path up to but not including the 
	//admin dir, no ending slash
// NOTE: CRON_ENABLE set to true writes a file at CRON_CONFIG_FILE containing the cron 
	//job information
// The CRON_CONFIG_FILE must have 777 permissions if applicable to your OS/setup
// You still need to call the CRON_CONFIG_FILE to run the cron job!!!
// From shell: crontab CRON_CONFIG_FILE to set the cron job: replace CRON_CONFIG_FILE 
	//with actual file
// From shell: crontab -l to list and crontab -r to remove (crontab -d on some systems)
</pre>
If you need a cron job tutorial or would rather manually create a cron file, see this link: 
<a href="http://www.phpdig.net/forum/showthread.php?t=323">http://www.phpdig.net/forum/showthread.php?t=323</a>
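<p>For readers who would rather write the cron file by hand, the following shell sketch shows one possible shape. It assumes a daily run at 03:00 and reuses the /full/path/to/... placeholder paths; the function name write_cron_file and the file name phpdig.cron are hypothetical, and the file PhpDig writes at CRON_CONFIG_FILE may look different.</p>

```shell
#!/bin/sh
# Placeholder paths (assumptions): replace each with the real location on your server.
PHP_BIN=/full/path/to/php
SPIDER=/full/path/to/admin/spider.php
SPIDER_LOG=/full/path/to/spider.log

# write_cron_file FILE
# Writes a one-line cron file: minute 0, hour 3, every day, run the spider
# with the "all" option and append its output to the log.
write_cron_file() {
    printf '0 3 * * * %s -f %s all >> %s\n' "$PHP_BIN" "$SPIDER" "$SPIDER_LOG" > "$1"
}

# Example: write the file to a temporary location and show it.
CRON_FILE="${TMPDIR:-/tmp}/phpdig.cron"
write_cron_file "$CRON_FILE"
cat "$CRON_FILE"
# To install it, you would then run: crontab "$CRON_FILE"
```

<p>Keep in mind that loading a file with crontab replaces the current user's existing crontab, so check crontab -l first if other jobs are already scheduled.</p>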

<a name="toc5"></a><b>5. Update PhpDig</b>
<p><b>5.1. Update tables</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

The [PHPDIG_DIR]/sql/update_db_to_[VERSION].sql files contain all required SQL instructions to update your 
existing install of PhpDig. The [PHPDIG_DIR]/sql/update_db.sql file contains the most recent changes to the 
database tables for use with the install.php script, and the [PHPDIG_DIR]/sql/init_db.sql is for a fresh 
install of the tables. If there is no SQL update file for a particular version, then there were no updates to 
the database tables for that version.
If you keep current on PhpDig, you can update the tables using the install script by choosing the update 
existing database option. Note, however, that this option only applies the most recent changes, meaning 
those between the release immediately before the new release and the new release itself. If you need to update PhpDig 
from an older version, you must update the database tables manually. Read sections 3.3 and 3.4 
for further information on how to update database tables.

<p><b>5.2. Update files</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
After the database tables have been updated, if there were updates to be done, it's time to update the files. 
Unzip the archive and, if you made previous changes to the config file, edit the new config file before uploading by FTP. 
Also, compare the new _connect.php file in the archive with the old connect.php file on the server to see if there were 
any changes. If the new _connect.php file is different, copy the new _connect.php file to connect.php and put your 
database information in this connect.php file. Do not put your database information in the _connect.php file. Read 
section 3.5 for more information on the _connect.php and connect.php files.
Once the config.php and connect.php files have been set, FTP the HTML, TXT, PHP, and SQL files in ASCII format, 
overwriting existing files, and FTP the GIF, PNG, and JPG files in binary format. Note that on rare occasions an FTP 
transfer does not overwrite the existing files on the server. To ensure that the new files replace the old files, you 
might want to delete the old files from the server and then FTP the new files to the server. The graphic files have not 
been changed in a while, which is not to say they won't be changed someday, so you can usually get away with just FTPing 
the text files. After any table changes are made and the new files are on the server, the update is complete.


<a name="toc6"></a><b>6. Indexing with web interface</b>
<p><b>6.1. Index a new host</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
Open the PhpDig admin panel with your browser by visiting http://www.yourdomain.com/[PHPDIG_DIR]/admin/index.php 
and enter the username and password you set in the config file. In the admin panel, enter one link per line in the 
text box. Each link can be as simple as a URI (e.g., http://www.domain.com) or can include path and file information 
(e.g., http://www.domain.com/dir/ or http://www.domain.com/dir/file.ext or 
The "update sites" page contains information as to how PhpDig should go about indexing a URI. Pretend that the 
"update sites" page contains the following information (you will <i>not</i> see the "days" column if CRON_ENABLE 
is set to false in the config file):

<pre class="cdtext">
ID  URL              Days   Links  Depth
1   www.domain1.com  [ 0 ]  [ 1 ]  [ 2 ]
2   www.domain2.com  [   ]  [   ]  [   ]
3   www.domain3.com  [   ]  [   ]  [   ]
</pre>
Now in the admin panel, if you select "yes" then PhpDig will use the "update sites" values for www.domain1.com and 
use the "search depth" and "links per" values for both www.domain2.com and www.domain3.com. If you select "no" then PhpDig 
will use the "search depth" and "links per" values for www.domain1.com, www.domain2.com, and www.domain3.com. In short, 
selecting "yes" means to use the "update sites" values if present and "search depth" and "links per" otherwise, 
and selecting "no" means to use "search depth" and "links per" regardless of the "update sites" values.
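<p>The yes/no rule above can be expressed as a small shell sketch. This only illustrates the rule as described here and is not PhpDig's own code; the function name effective_value and the form value 5 are hypothetical.</p>

```shell
#!/bin/sh
# effective_value CHOICE SAVED FORM_VALUE
# CHOICE     : "yes" or "no", as selected in the admin panel
# SAVED      : value stored on the "update sites" page (empty if none)
# FORM_VALUE : "search depth" or "links per" value entered in the admin panel
effective_value() {
    if [ "$1" = "yes" ] && [ -n "$2" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

effective_value yes 2 5    # www.domain1.com: the saved depth 2 is used
effective_value yes "" 5   # www.domain2.com: nothing saved, so the form value 5 is used
effective_value no 2 5     # "no": the form value 5 is used regardless
```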
The "search depth" option tells PhpDig how deep into a URI to index. The URI you enter into the text box has a depth 
of zero (it's the start page for the index), the links from the start page to other pages are of depth one, the links 
found on those pages are of depth two, and so forth. The "links per" option tells PhpDig how many links 
to follow at each depth, meaning if a page has ten links but you set "links per" to five, then at most five of 
the ten links will be followed. In total, the maximum number of links found is ((links_per * search_depth) + 1) when 
"links per" is greater than zero. Setting "search depth" to zero means to index just one page regardless of the 
"links per" value.
Once you set your options, click the "dig this" button, and PhpDig will be off and indexing. PhpDig recognizes whether the 
URI is a new host or an existing host and will index accordingly. You should check the config file for various settings 
<i>prior</i> to indexing, as LIMIT_TO_DIRECTORY, PHPDIG_IN_DOMAIN, *_MAX_LIMIT, and *_LIMIT are the settings that most 
commonly affect the way PhpDig indexes a site. If the config settings are set one way but you want to index another way, 
change the config settings before you start.
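<p>As a quick check when choosing values, the bound on the number of links given above for "search depth" and "links per" can be computed in shell. This is a minimal sketch of that formula only; the function name max_links is hypothetical, and the case where "links per" is zero is left out because it is not described here.</p>

```shell
#!/bin/sh
# max_links LINKS_PER SEARCH_DEPTH
# Prints the upper bound quoted in the text: (links_per * search_depth) + 1.
# Only the two cases described in the text are handled.
max_links() {
    links_per=$1
    search_depth=$2
    if [ "$search_depth" -eq 0 ]; then
        # Depth zero: only the start page is indexed.
        echo 1
    elif [ "$links_per" -gt 0 ]; then
        echo $(( links_per * search_depth + 1 ))
    else
        echo "links_per must be greater than zero for this formula" >&2
        return 1
    fi
}

max_links 5 2   # prints 11
max_links 5 0   # prints 1
```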

Once PhpDig starts to index, it will try to open a new webpage to show you its progress. Note that while PhpDig attempts 
to flush output to the browser window, not all OS/setups abide by this. Assuming you see the new page, you will get a 
listing that shows which link PhpDig is currently working on, as well as "+" symbols indicating that PhpDig found a link 
in the current page for possible indexing at the next level. If you do not see the new page, then you will likely have to 
wait until PhpDig has completed the indexing process to see it.
Note that, if you are going to index a lot of content, some browsers may time out. If you run into 
this problem, you might try using Firefox 
(<a href="http://www.mozilla.org/products/firefox/">http://www.mozilla.org/products/firefox/</a>) 
as your browser. Also, when indexing a lot of content, you are more susceptible to having your MySQL connection drop, 
having some server somewhere time out, having your host kill the process, and so on. None of these things has anything 
to do with the PhpDig code itself; they are not bugs. You need to have a decent server with the right privileges 
to run PhpDig.
That said, you really should plan on how to index. Setting things to levels allowing for the maximum index is not 
always the right or fastest way to go about getting things done. Think about the different options discussed herein 
and plan accordingly. For example, you may find that you can index adequate content in a shorter amount of time if 
you index a site from several locations and use lower levels. <u>Be kind to another person's resources!</u> It really 
isn't necessary to index every single page of another person's site, really!

<p><b>6.2. Update an existing host</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>

From the PhpDig admin panel, you can reach the PhpDig update page by choosing a site and clicking on the update 
button. Once on the update page, a two-part interface appears. On the left side of the screen is the folder 
structure of the site with four options: 1) the red no-way icon (admin/deny.gif) deletes the folder and excludes 
it from future indexing, 2) the red cross icon (admin/no.gif) deletes the folder without excluding it from future 
indexing, 3) the green check icon (admin/yes.gif) reindexes the pages in that directory only, and 4) the blue arrow 
(admin/details.gif) shows you what pages are in that specific directory.
Should you click the blue arrow, content will appear on the right side for that directory with two options: 1) the 
red cross icon (admin/no.gif) deletes that page from the index and 2) the green check icon (admin/yes.gif) 
reindexes that page only. Note that when you click a green check icon, whether on the left or right side of the 
screen, only that particular content is reindexed. If you wish to reindex a site, use the textbox on the index 
page of the PhpDig admin panel. Should you delete any content, you should run the various "clean" options from the 
PhpDig admin panel to shore up the engine.
The PhpDig update page also has an option to unlock a site should it remain locked. Note that you should not unlock 
a site that is currently being indexed. Also on the PhpDig update page, you will find the ability to change the 
username and password for a .htaccess protected site. If you are indexing something that is not .htaccess protected, 
you do not need to enter anything in the username and password fields.

<p><b>6.3. Index maintenance</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
There are four scripts available from the admin panel to delete useless data from PhpDig.

<li> Clean index - deletes content containing invalid values or not linked to a page.<br>
<li> Clean dictionary - deletes words no longer used such as when a site is deleted.<br>
<li> Clean common words - deletes words that appear in the common_words.txt file.<br>
<li> Clean dashes - deletes content that duplicates an index page to prevent copies.<br>

You should run the "clean" scripts periodically to keep the engine up-to-date!

<a name="toc7"></a><b>7. Indexing by command line interface (shell)</b>
<p><b>7.1. Run PhpDig from shell and log results</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
Should you have access to shell and PHP is configured to run from shell, you can run the [PHPDIG_DIR]/admin/spider.php 
script from shell rather than from a web browser. If you are able to run PHP from shell and want to launch the spider 
from shell, <i><u>make sure ABSOLUTE_SCRIPT_PATH is correctly set in the config file</u></i> and then use one of the 
following commands, editing the full paths as needed:

<pre class="cdtext">
/full/path/to/php -f /full/path/to/admin/spider.php http://www.domain.com &gt;&gt; /full/path/to/spider.log
/full/path/to/php -f /full/path/to/admin/spider.php /full/path/to/list.txt &gt;&gt; /full/path/to/spider.log
/full/path/to/php -f /full/path/to/admin/spider.php all &gt;&gt; /full/path/to/spider.log
/full/path/to/php -f /full/path/to/admin/spider.php forceall &gt;&gt; /full/path/to/spider.log
</pre>
Note that you may use one of the following options after "/full/path/to/admin/spider.php" in the command:<br><br>
 - all (default) : update all hosts<br>

 - forceall : force update all hosts<br>
 - http://www.domain.com : add or update the site<br>
 - path/file : add or update all sites listed in the given file<br>
Use "http://www.domain.com" (without quotes) to spider that domain. If you use /full/path/to/list.txt then the 
list.txt file is to contain a list of full URLs (e.g., http://www.domain.com) with one URL per line. The "all" 
option is to reindex all sites currently in the database. Note that the all, domain, and file options index 
according to the LIMIT_DAYS and META tags timeframe, whereas the "forceall" option does a forced reindex. Also, 
an index may be affected by the *_LIMIT values in the config file or by the values in the site_page table. 
If you want to run the spider in the background, then append " &amp;" (a space followed by an ampersand, without quotes) 
to the command. If you wish to watch the spider.log live, then type on one line "tail -f /full/path/to/spider.log" 
(without quotes). Alternatively, you may remove " &gt;&gt; /full/path/to/spider.log" from the command, but then 
the spider output is printed to the shell window, which will be busy until the process is complete.
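<p>Before passing a list file to the spider, it can help to confirm that every line is a full URL. The sketch below is one way to do that in shell; the function name check_url_list is hypothetical, and accepting https:// as well as http:// is an assumption rather than something stated in this documentation.</p>

```shell
#!/bin/sh
# check_url_list FILE
# Prints (with line numbers) every line that is not a single http:// or
# https:// URL, and returns non-zero if any such line exists.
check_url_list() {
    list_file=$1
    if [ ! -r "$list_file" ]; then
        echo "cannot read $list_file" >&2
        return 2
    fi
    # -v selects lines that do NOT match; grep exits 0 when it printed something.
    if grep -n -v -E '^https?://[^[:space:]]+$' "$list_file"; then
        return 1
    fi
    return 0
}

# Example with a temporary list file.
LIST_FILE="${TMPDIR:-/tmp}/phpdig_list.txt"
printf '%s\n' 'http://www.domain.com' 'http://www.domain.com/dir/' > "$LIST_FILE"
check_url_list "$LIST_FILE" && echo "list looks fine"
```

<p>Blank lines and lines ending in a Windows-style carriage return are reported as well, since neither is a clean one-URL line.</p>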


<p><b>7.2. Run PhpDig from cron and log results</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
Should you have access to shell, including the crontab program, you can run the [PHPDIG_DIR]/admin/spider.php 
script via cron job. If you have access to crontab via shell and want to launch the spider from a cron job, 
<i><u>make sure ABSOLUTE_SCRIPT_PATH is correctly set in the config file</u></i> and then read the cron 
tutorial that is provided at this link: 
<a href="http://www.phpdig.net/forum/showthread.php?t=323">http://www.phpdig.net/forum/showthread.php?t=323</a>
Note that a spider.log file gets automatically created at /full/path/to/spider.log so you can view the 
spider info. The "&gt;&gt;" means to append to the file. You could replace "&gt;&gt;" with "&gt;" (without quotes) 
so that the spider.log is overwritten each time. You may wish to check the size of the spider.log file when 
not indexing and delete spider.log as necessary.
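<p>Checking the log size and deleting the file when it grows too large can be scripted. The following is a minimal shell sketch under stated assumptions: the function name trim_log and the 10 MB threshold are hypothetical, and it should only be run while the spider is not indexing.</p>

```shell
#!/bin/sh
# trim_log FILE MAX_BYTES
# Deletes FILE when it is larger than MAX_BYTES; does nothing otherwise.
trim_log() {
    log_file=$1
    max_bytes=$2
    [ -f "$log_file" ] || return 0
    # Arithmetic expansion strips the padding some wc implementations add.
    size=$(( $(wc -c < "$log_file") ))
    if [ "$size" -gt "$max_bytes" ]; then
        rm -f "$log_file"
        echo "deleted $log_file ($size bytes)"
    fi
}

# Example: delete the log once it exceeds roughly 10 MB.
trim_log /full/path/to/spider.log 10485760
```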


<a name="toc8"></a><b>8. Templates</b>
The PhpDig templates are HTML files containing XML-like tags that are automatically replaced with the dynamic PhpDig search 
results. You can set the template that you would like to use in the config file. The XML-like tags take the form of 
&lt;phpdig:[NAME]/&gt; except for the &lt;phpdig:results&gt; (indicates the start of the search results) and 
&lt;/phpdig:results&gt; (indicates the end of the search results) tags. See the template files for examples.
Note that if you do not want the list of templates at the top of the search page, simply remove the 
&lt;phpdig:templates_links/&gt; tag from the template. If you do not like the look of the templates, it is easy 
enough to make your own custom template. PhpDig offers three different templating options, and you set your 
template option via the $template variable in the config file.

<p><b>8.1. Results using $template set to "$relative_script_path/templates/[TEMPLATE].html"</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
You can make your own templates/[TEMPLATE].html file by following the same outline as can be found in the various 
PhpDig template files. You need to realize that all of the templates/[TEMPLATE].html files are static. You cannot 
include dynamic content in the templates/[TEMPLATE].html files by simply editing the HTML files. However, it is 
possible to get dynamic content in the static templates. For a tutorial on how to do this, see this link: 
<a href="http://www.phpdig.net/forum/showthread.php?t=348">http://www.phpdig.net/forum/showthread.php?t=348</a>. Be 
sure to read the <i>entire</i> thread. Alternatively, consider setting $template to "array" for more control.


<p><b>8.2. Results using $template set to "classic"</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
This option sets a generic PhpDig template for search results. Note that the HIGHLIGHT_BACKGROUND and 
HIGHLIGHT_COLOR options in the config file are the only options that can affect the look of the classic template, 
unless you go into the PhpDig code itself and make changes. Other than that, there is no file to edit in order to 
change the look of the output.

<p><b>8.3. Results using $template set to "array"</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
You can have PhpDig return an array with elements from the search results by setting $template = "array"; in 
the config file. <u>Prior to v.1.8.7</u> you also needed to edit search.php, assigning the result of phpdigSearch(...); 
to a variable named, say, $arrayout and then calling print_r($arrayout); to print the array elements, like so:
<pre class="cdtext">
$arrayout = phpdigSearch($id_connect, $query_string, $option, $refine,
              $refine_url, $lim_start, $limite, $browse, $site, $path, 
              $relative_script_path, $template, $adlog_flag, $rssdf, $template_demo);
print_r($arrayout); // print the array elements
</pre>
If using a version <i>prior</i> to v.1.8.7, simply view the HTML source of the search page via a browser to see 
a nice format of the array elements available for use in a custom search page. If using v.1.8.7+ then simply edit 
the custom_search.php file as desired, as the file contains code to output the various array elements.
Note that, if you wish to set $template to "array" to create your own custom page, you should be somewhat familiar 
with PHP itself. Otherwise, don't use this option unless you want to learn. Should you use the "array" option, you 
will need to write PHP code to format the output to your preference. You can find a somewhat <i>dated</i> but still 
applicable outline of how to utilize $template = "array"; at this link: 
<a href="http://www.phpdig.net/forum/showpost.php?p=3379&postcount=6">http://www.phpdig.net/forum/showpost.php?p=3379</a>

<a name="toc9"></a><b>9. Insert PhpDig into a website</b>
<p><b>9.1. Using the search.php script</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
You are able to place the search.php page wherever you'd like by following these instructions:
<pre class="cdtext">
/***** Example
* PhpDig installed at: http://www.domain.com/phpdig/
* Want search page at: http://www.domain.com/search.php
* Copy http://www.domain.com/phpdig/search.php to http://www.domain.com/search.php
* Copy http://www.domain.com/phpdig/clickstats.php to http://www.domain.com/clickstats.php
* Set $relative_script_path = './phpdig'; in search.php, clickstats.php, and function_phpdig_form.php
* Add ($relative_script_path != "./phpdig") && to if statement
</pre>
See section 4.2 and the config file for further details.

<p><b>9.2. Using a simple HTML form</b> &nbsp;&nbsp;&nbsp;&nbsp; <a href="#toc">[top]</a></p>
You can add a search box to any webpage using the following HTML form:
<pre class="cdtext">
&lt;form action="http://www.YOURDOMAIN.COM/DIR/search.php" method="post"&gt;
&lt;input type="text" name="query_string" value=""&gt;
&lt;input type="submit" name="search" value="Go"&gt;
&lt;/form&gt;
</pre>
Note that you have to set the "action" attribute to the URL of your search page. If you want something more elaborate for your 
HTML form, then view the HTML source of a page on this site with a form, copy the HTML, and edit accordingly.

<a name="toc10"></a><b>10. Getting help with PhpDig</b>

A message board dedicated to PhpDig can be found at this link: 
<a href="http://www.phpdig.net/">http://www.phpdig.net/</a>
Post any questions you have about PhpDig there.

<div align="center"><br />Copyright &copy; 2001 - 2005, PhpDig.net, ThinkDing LLC. All Rights Reserved.</div>