<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" dir="ltr">
<head>
<title>PHPCrawl webcrawler library for PHP</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link type="text/css" rel="stylesheet" media="all" href="style.css" />
</head>
<body>
<div id="wrapper">
<div id="page">
<div id="top">
<h1 style="margin: 0px; float: left;">PHPCrawl webcrawler library</h1>
<div style="margin-left: 670px; margin-top: 14px; font-size: 12px;">Docs for version 0.8x</div>
</div>
<div id="container">
<div id="left">
<ul>
<li><a href="index.html">About PHPCrawl</a></li>
<li>
Documentation
<ul id="submenu">
<li><a href="requirements.html">Requirements</a></li>
<li><a href="quickstart.html">Installation &amp; Quickstart</a></li>
<li><a href="example.html">Example</a></li>
<li><a href="multiprocesses.html">Using multi-processes</a></li>
<li><a href="multiprocessing_modes.html">Multiprocessing Modes</a></li>
<li><a href="spidering_huge_websites.html">Spidering huge websites</a></li>
<li><a href="faq.html">FAQ</a></li>
<li><a href="classreferences/index.html" target="blank"><u>Complete Class References</u></a></li>
</ul>
</li>
<li class="fat"><a href="http://sourceforge.net/projects/phpcrawl/files/PHPCrawl/" target="_blank">Download PHPCrawl</a></li>
<li><a href="testinterface.html">Testinterface</a></li>
<li><a href="versionhistory.html">Version history</a></li>
<li><a href="http://sourceforge.net/projects/phpcrawl/forums/forum/307696" target="_blank">Forum</a></li>
<li><a href="http://sourceforge.net/tracker/?group_id=89439&amp;atid=590146" target="_blank">Report a bug</a></li>
</ul>
<div id="sf">
<a href="http://sourceforge.net/projects/phpcrawl"><img src="http://sflogo.sourceforge.net/sflogo.php?group_id=89439&amp;type=14" width="150" height="40" alt="Get PHPCrawl at SourceForge.net. Fast, secure and Free Open Source software downloads" /></a>
</div>
<div id="sf">
<form action="https://www.paypal.com/cgi-bin/webscr" method="post">
<input type="hidden" name="cmd" value="_s-xclick" />
<input type="hidden" name="hosted_button_id" value="M53G4LP6XNHM4" />
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donate_SM.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!" />
<img alt="" border="0" src="https://www.paypalobjects.com/de_DE/i/scr/pixel.gif" width="1" height="1" />
</form>
</div>
</div>
<div id="content">
<h3>Installation &amp; Quickstart</h3><br />
The following steps show how to install and use PHPCrawl:<br />
<ol>
<li>
Unpack the PHPCrawl package to a directory of your choice. No further installation steps are required.
</li>
<li>
Include the PHPCrawl main class in your script or project. It is located in the "libs" directory of the package.
<p id="code">
<span style="color: #000000">
<span style="color: #007700">include(</span><span style="color: #DD0000">"libs/PHPCrawler.class.php"</span><span style="color: #007700">);
<br /></span>
</span>
</p>
No other includes are needed.
</li>
<li>
Extend the PHPCrawler class and override the <a href="classreferences/PHPCrawler/method_detail_tpl_method_handleDocumentInfo.htm" target="blank">handleDocumentInfo</a>-method with your own code to process the information about every document the crawler finds on its way.
<p id="code">
<span style="color: #000000">
<span style="color: #007700">class </span><span style="color: #0000BB">MyCrawler </span><span style="color: #007700">extends </span><span style="color: #0000BB">PHPCrawler
<br /></span><span style="color: #007700">{
<br /> function </span><span style="color: #0000BB">handleDocumentInfo</span><span style="color: #007700">(</span><span style="color: #0000BB">PHPCrawlerDocumentInfo $PageInfo</span><span style="color: #007700">)
<br /> {
<br /> </span><span style="color: #FF8000">// Your code comes here!
<br /> // Do something with the $PageInfo-object that
<br /> // contains all information about the currently
<br /> // received document.
<br />
<br /> // As an example, we just print the URL of the document
<br /> </span><span style="color: #007700">echo </span><span style="color: #0000BB">$PageInfo</span><span style="color: #007700">-></span><span style="color: #0000BB">url</span><span style="color: #007700">.</span><span style="color: #DD0000">"\n"</span><span style="color: #007700">;
<br /> }
<br />}
<br /></span>
</span>
</p>
For a list of all available information about a page or file within the
<a href="classreferences/PHPCrawler/method_detail_tpl_method_handleDocumentInfo.htm" target="blank">handleDocumentInfo</a>-method see the
<a href="classreferences/PHPCrawlerDocumentInfo/overview.html" target="blank">PHPCrawlerDocumentInfo</a>-reference.
<br /><br />
<i>Note to users of PHPCrawl 0.7x or earlier: The old overridable method "<a href="classreferences/PHPCrawler/method_detail_tpl_method_handlePageData.htm" target="blank">handlePageData()</a>", which receives the document information as an array, is still
present and still gets called. PHPCrawl 0.8 is fully compatible with scripts written for earlier versions.</i>
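<br /><br />
As a minimal sketch of what such processing could look like, the following variant prints a few more properties of the $PageInfo-object than just the URL. The class name MyStatusCrawler is only an illustration, and the properties used here (url, http_status_code, referer_url, received, bytes_received) should be checked against the PHPCrawlerDocumentInfo-reference of the version you have installed.
<p id="code">
<span style="color: #000000">
class MyStatusCrawler extends PHPCrawler
<br />{
<br />&nbsp;&nbsp;function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
<br />&nbsp;&nbsp;{
<br />&nbsp;&nbsp;&nbsp;&nbsp;// URL and HTTP status code of the requested document
<br />&nbsp;&nbsp;&nbsp;&nbsp;echo "Page requested: ".$PageInfo-&gt;url." (".$PageInfo-&gt;http_status_code.")\n";
<br />
<br />&nbsp;&nbsp;&nbsp;&nbsp;// URL of the page that linked to this document
<br />&nbsp;&nbsp;&nbsp;&nbsp;echo "Referer-page: ".$PageInfo-&gt;referer_url."\n";
<br />
<br />&nbsp;&nbsp;&nbsp;&nbsp;// "received" tells whether the content was actually received
<br />&nbsp;&nbsp;&nbsp;&nbsp;if ($PageInfo-&gt;received == true)
<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;echo "Content received: ".$PageInfo-&gt;bytes_received." bytes\n";
<br />&nbsp;&nbsp;&nbsp;&nbsp;else
<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;echo "Content not received\n";
<br />&nbsp;&nbsp;}
<br />}
<br />
</span>
</p>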
</li>
<li>
Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling-process.
<p id="code">
<span style="color: #000000">
<span style="color: #0000BB">
$crawler </span><span style="color: #007700">= new </span><span style="color: #0000BB">MyCrawler</span><span style="color: #007700">();
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">setURL</span><span style="color: #007700">(</span><span style="color: #DD0000">"www.foo.com"</span><span style="color: #007700">);
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">addContentTypeReceiveRule</span><span style="color: #007700">(</span><span style="color: #DD0000">"#text/html#"</span><span style="color: #007700">);
<br /></span><span style="color: #FF8000">// ...
<br />
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">go</span><span style="color: #007700">();
<br /></span>
</span>
</p>
For a list of all available setup options and methods of the crawler, take a look at the <a href="classreferences/PHPCrawler/overview.html" target="blank">PHPCrawler</a>-classreference.
</li>
</ol>
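Put together, the steps above could look like the following script. This is only a sketch under a few assumptions: the package was unpacked so that the "libs" directory sits next to the script, "www.foo.com" is a placeholder for the site you want to crawl, and the URL filter and the traffic limit of about 1 MB are example values chosen to keep a first test run small. It needs network access to do anything useful.
<p id="code">
<span style="color: #000000">
&lt;?php
<br />// Assumption: "libs" is located in the same directory as this script
<br />include("libs/PHPCrawler.class.php");
<br />
<br />class MyCrawler extends PHPCrawler
<br />{
<br />&nbsp;&nbsp;function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
<br />&nbsp;&nbsp;{
<br />&nbsp;&nbsp;&nbsp;&nbsp;// Print the URL of every document the crawler requests
<br />&nbsp;&nbsp;&nbsp;&nbsp;echo $PageInfo-&gt;url."\n";
<br />&nbsp;&nbsp;}
<br />}
<br />
<br />$crawler = new MyCrawler();
<br />
<br />// Placeholder start URL
<br />$crawler-&gt;setURL("www.foo.com");
<br />
<br />// Only receive the content of documents with content-type "text/html"
<br />$crawler-&gt;addContentTypeReceiveRule("#text/html#");
<br />
<br />// Example filter: do not follow links to common image files
<br />$crawler-&gt;addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
<br />
<br />// Example limit: stop after roughly 1 MB of received data
<br />$crawler-&gt;setTrafficLimit(1000 * 1024);
<br />
<br />$crawler-&gt;go();
<br />
<br />// Print a short summary once the crawling-process has finished
<br />$report = $crawler-&gt;getProcessReport();
<br />echo "Links followed: ".$report-&gt;links_followed."\n";
<br />echo "Documents received: ".$report-&gt;files_received."\n";
<br />echo "Bytes received: ".$report-&gt;bytes_received."\n";
<br />echo "Process runtime: ".$report-&gt;process_runtime." sec\n";
<br />
</span>
</p>
When the script is run from the command line, it prints one URL per line while crawling, followed by the summary.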
</div>
</div>
</div>
</div>
</body>
</html>