<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" dir="ltr">
<head>
<title>PHPCrawl webcrawler library for PHP - Resume aborted crawling-processes</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link type="text/css" rel="stylesheet" media="all" href="style.css" />
</head>
<body>
<div id="wrapper">
<div id="page">
<div id="top">
<h1 style="margin: 0px; float: left;">PHPCrawl webcrawler library/framework</h1>
</div>
<div id="container">
<iframe id="menuframe" src="menu.html" scrolling="no" frameborder="0"></iframe>
<div id="content">
<h3>Resume aborted crawling-processes</h3><br />
A crawling-process (or a script that uses phpcrawl) may get aborted or terminated, for whatever reason,
before it has finished spidering a website.<br /><br />
As of PHPCrawl 0.81 it is possible to resume such a terminated script/process from the point where it was
halted, so it's not necessary to restart the crawl from the beginning.<br /><br />
In order to be able to resume a crawling-process, you'll have to
<ul>
<li>Call the <a href="classreferences/PHPCrawler/method_detail_tpl_method_enableResumption.htm" target="blank">enableResumption()</a> method in your script right from its first start (this prepares the crawler for a possible
resumption and is required to be able to resume the script later on)<br />
</li>
<li>
Determine the unique crawler-ID by calling <a href="classreferences/PHPCrawler/method_detail_tpl_method_getCrawlerId.htm" target="blank">getCrawlerId()</a> and store it somewhere (this ID is needed to identify the process that should be resumed later on)<br />
<p id="code">
<span style="color: #FF8000">// ...
<br /></span><span style="color: #0000BB">$crawler </span><span style="color: #007700">= new </span><span style="color: #0000BB">MyCrawler</span><span style="color: #007700">();
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">setURL</span><span style="color: #007700">(</span><span style="color: #DD0000">"www.anyurl.com"</span><span style="color: #007700">);
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">enableResumption</span><span style="color: #007700">();
<br />
<br /></span><span style="color: #0000BB">$ID </span><span style="color: #007700">= </span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">getCrawlerId</span><span style="color: #007700">();
<br /></span><span style="color: #FF8000">// ...
<br /></span><span style="color: #0000BB">?></span>
</p>
</li>
</ul>
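One simple way to "store it somewhere" is a small file that survives a restart of the script. The following is only a minimal sketch under that assumption: the helper names storeCrawlerId() and loadCrawlerId() and the file name are hypothetical and not part of PHPCrawl, and any other persistent storage (a database, for example) would work as well.<br /><br />

```php
<?php

// Hypothetical helpers (NOT part of PHPCrawl) for keeping a crawler-ID
// between two runs of the same script.

// Stores the crawler-ID in the given file.
// Returns true on success, false otherwise.
function storeCrawlerId($file, $crawler_id)
{
  // LOCK_EX prevents two processes from writing the file at the same time
  return file_put_contents($file, (string)$crawler_id, LOCK_EX) !== false;
}

// Returns the stored crawler-ID as a string,
// or null if no usable ID was stored.
function loadCrawlerId($file)
{
  if (!is_file($file)) return null;

  $content = file_get_contents($file);
  if ($content === false) return null;

  $crawler_id = trim($content);
  return ($crawler_id === "") ? null : $crawler_id;
}

// Usage with placeholder values (the ID is made up for this sketch)
$id_file = sys_get_temp_dir()."/mycrawlerid_example.tmp";

storeCrawlerId($id_file, "120912912109");
echo "Stored crawler-ID: ".loadCrawlerId($id_file)."\n";

// Remove the file again once it is no longer needed
unlink($id_file);
```

<br />In a real script the value passed to storeCrawlerId() would be the one returned by getCrawlerId(), and a non-null result of loadCrawlerId() would indicate that an earlier run was started and that its ID is available for a later resumption.<br /><br />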
In order to resume an aborted process, you'll have to
<ul>
<li>Call the <a href="classreferences/PHPCrawler/method_detail_tpl_method_resume.htm" target="blank">resume()</a> method before calling the go() or goMultiProcessed() method, and pass it the crawler-ID of the terminated crawling-process
you want to resume (as returned by <a href="classreferences/PHPCrawler/method_detail_tpl_method_getCrawlerId.htm" target="blank">getCrawlerId()</a>)<br />
<p id="code">
<span style="color: #FF8000">// ...
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">resume</span><span style="color: #007700">(</span><span style="color: #0000BB">120912912109</span><span style="color: #007700">); </span><span style="color: #FF8000">//ID of the aborted process
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">goMultiProcessed</span><span style="color: #007700">(</span><span style="color: #0000BB">5</span><span style="color: #007700">);
<br /></span><span style="color: #FF8000">// ...
<br /></span>
</p>
</li>
</ul>
The following listing is a complete example of a resumable crawler-script (it's also included in the phpcrawl package).
You may test it by starting it from the command line (CLI, type "php resumable_example.php"), aborting it (Ctrl+C) and starting it again.
<p id="code" style="width: 570px">
<span style="color: #000000">
<span style="color: #0000BB"><?php
<br />
<br /></span><span style="color: #FF8000">// Include the phpcrawl-mainclass
<br /></span><span style="color: #007700">include(</span><span style="color: #DD0000">"libs/PHPCrawler.class.php"</span><span style="color: #007700">);
<br />
<br /></span><span style="color: #FF8000">// Extend the class and override the handleDocumentInfo()-method
<br /></span><span style="color: #007700">class </span><span style="color: #0000BB">MyCrawler </span><span style="color: #007700">extends </span><span style="color: #0000BB">PHPCrawler
<br /></span><span style="color: #007700">{
<br /> function </span><span style="color: #0000BB">handleDocumentInfo</span><span style="color: #007700">(</span><span style="color: #0000BB">$DocInfo</span><span style="color: #007700">)
<br /> {
<br /> </span><span style="color: #FF8000">// Just detect linebreak for output
<br /> </span><span style="color: #007700">if (</span><span style="color: #0000BB">PHP_SAPI </span><span style="color: #007700">== </span><span style="color: #DD0000">"cli"</span><span style="color: #007700">) </span><span style="color: #0000BB">$lb </span><span style="color: #007700">= </span><span style="color: #DD0000">"\n"</span><span style="color: #007700">;
<br /> else </span><span style="color: #0000BB">$lb </span><span style="color: #007700">= </span><span style="color: #DD0000">"<br />"</span><span style="color: #007700">;
<br />
<br /> </span><span style="color: #FF8000">// Print the URL
<br /> </span><span style="color: #007700">echo </span><span style="color: #DD0000">"Page requested: "</span><span style="color: #007700">.</span><span style="color: #0000BB">$DocInfo</span><span style="color: #007700">-></span><span style="color: #0000BB">url</span><span style="color: #007700">.</span><span style="color: #0000BB">$lb</span><span style="color: #007700">;
<br /> </span><span style="color: #0000BB">flush</span><span style="color: #007700">();
<br /> }
<br />}
<br />
<br /></span><span style="color: #0000BB">$crawler </span><span style="color: #007700">= new </span><span style="color: #0000BB">MyCrawler</span><span style="color: #007700">();
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">setURL</span><span style="color: #007700">(</span><span style="color: #DD0000">"www.php.net"</span><span style="color: #007700">);
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">addContentTypeReceiveRule</span><span style="color: #007700">(</span><span style="color: #DD0000">"#text/html#"</span><span style="color: #007700">);
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">addURLFilterRule</span><span style="color: #007700">(</span><span style="color: #DD0000">"#\.(jpg|jpeg|gif|png)$# i"</span><span style="color: #007700">);
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">setPageLimit</span><span style="color: #007700">(</span><span style="color: #0000BB">50</span><span style="color: #007700">); </span><span style="color: #FF8000">// Set page-limit to 50 for testing
<br />
<br />// Important for resumable scripts/processes!
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">enableResumption</span><span style="color: #007700">();
<br />
<br /></span><span style="color: #FF8000">// At the first start of the script, retrieve the crawler-ID
<br />// and store it
<br />// (in a temporary file in this example)
<br /></span><span style="color: #007700">if (!</span><span style="color: #0000BB">file_exists</span><span style="color: #007700">(</span><span style="color: #DD0000">"/tmp/mycrawlerid_for_php.net.tmp"</span><span style="color: #007700">))
<br />{
<br /> </span><span style="color: #0000BB">$crawler_ID </span><span style="color: #007700">= </span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">getCrawlerId</span><span style="color: #007700">();
<br /> </span><span style="color: #0000BB">file_put_contents</span><span style="color: #007700">(</span><span style="color: #DD0000">"/tmp/mycrawlerid_for_php.net.tmp"</span><span style="color: #007700">, </span><span style="color: #0000BB">$crawler_ID</span><span style="color: #007700">);
<br />}
<br /></span><span style="color: #FF8000">// If the script was restarted again (after it was aborted),
<br />// read the crawler-ID and pass it to the resume() method.
<br /></span><span style="color: #007700">else
<br />{
<br /> </span><span style="color: #0000BB">$crawler_ID </span><span style="color: #007700">= </span><span style="color: #0000BB">file_get_contents</span><span style="color: #007700">(</span><span style="color: #DD0000">"/tmp/mycrawlerid_for_php.net.tmp"</span><span style="color: #007700">);
<br /> </span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">resume</span><span style="color: #007700">(</span><span style="color: #0000BB">$crawler_ID</span><span style="color: #007700">);
<br />}
<br />
<br /></span><span style="color: #FF8000">// Start crawling
<br /></span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">goMultiProcessed</span><span style="color: #007700">(</span><span style="color: #0000BB">5</span><span style="color: #007700">);
<br />
<br /></span><span style="color: #FF8000">// Delete the stored crawler-ID after the process is finished
<br />// completely and successfully.
<br /></span><span style="color: #0000BB">unlink</span><span style="color: #007700">(</span><span style="color: #DD0000">"/tmp/mycrawlerid_for_php.net.tmp"</span><span style="color: #007700">);
<br />
<br /></span><span style="color: #0000BB">$report </span><span style="color: #007700">= </span><span style="color: #0000BB">$crawler</span><span style="color: #007700">-></span><span style="color: #0000BB">getProcessReport</span><span style="color: #007700">();
<br />
<br />if (</span><span style="color: #0000BB">PHP_SAPI </span><span style="color: #007700">== </span><span style="color: #DD0000">"cli"</span><span style="color: #007700">) </span><span style="color: #0000BB">$lb </span><span style="color: #007700">= </span><span style="color: #DD0000">"\n"</span><span style="color: #007700">;
<br />else </span><span style="color: #0000BB">$lb </span><span style="color: #007700">= </span><span style="color: #DD0000">"<br />"</span><span style="color: #007700">;
<br />
<br />echo </span><span style="color: #DD0000">"Summary:"</span><span style="color: #007700">.</span><span style="color: #0000BB">$lb</span><span style="color: #007700">;
<br />echo </span><span style="color: #DD0000">"Links followed: "</span><span style="color: #007700">.</span><span style="color: #0000BB">$report</span><span style="color: #007700">-></span><span style="color: #0000BB">links_followed</span><span style="color: #007700">.</span><span style="color: #0000BB">$lb</span><span style="color: #007700">;
<br />echo </span><span style="color: #DD0000">"Documents received: "</span><span style="color: #007700">.</span><span style="color: #0000BB">$report</span><span style="color: #007700">-></span><span style="color: #0000BB">files_received</span><span style="color: #007700">.</span><span style="color: #0000BB">$lb</span><span style="color: #007700">;
<br />echo </span><span style="color: #DD0000">"Bytes received: "</span><span style="color: #007700">.</span><span style="color: #0000BB">$report</span><span style="color: #007700">-></span><span style="color: #0000BB">bytes_received</span><span style="color: #007700">.</span><span style="color: #DD0000">" bytes"</span><span style="color: #007700">.</span><span style="color: #0000BB">$lb</span><span style="color: #007700">;
<br />echo </span><span style="color: #DD0000">"Process runtime: "</span><span style="color: #007700">.</span><span style="color: #0000BB">$report</span><span style="color: #007700">-></span><span style="color: #0000BB">process_runtime</span><span style="color: #007700">.</span><span style="color: #DD0000">" sec"</span><span style="color: #007700">.</span><span style="color: #0000BB">$lb</span><span style="color: #007700">;
<br /></span><span style="color: #0000BB">?></span> </span>
</p>
</div>
</div>
<div id="footer">Copyright © 2003 - 2012 Uwe Hunfeld hide@address.com</div>
</div>
</div>
</body>
</html>