<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" dir="ltr">
<head>
  <title>PHPCrawl webcrawler library for PHP</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <link type="text/css" rel="stylesheet" media="all" href="style.css" />
</head>

<body>

<div id="wrapper">

  <div id="page">
    
      <div id="top">
        <h1 style="margin: 0px; float: left;">PHPCrawl webcrawler library</h1>
        
        <div style="margin-left: 670px; margin-top: 14px; font-size: 12px;">Docs for version 0.8x</div>
      </div>
      
      <div id="container">
      
        <div id="left">
        
          <ul>
          <li><a href="index.html">About PHPCrawl</a></li>
          <li>
          Documentation
           <ul id="submenu">
           <li><a href="requirements.html">Requirements</a></li>
           <li><a href="quickstart.html">Installation &amp; Quickstart</a></li>
           <li><a href="example.html">Example</a></li>
           <li><a href="multiprocesses.html">Using multi-processes</a></li>
           <li><a href="multiprocessing_modes.html">Multiprocessing Modes</a></li>
           <li><a href="spidering_huge_websites.html">Spidering huge websites</a></li>
           <li><a href="faq.html">FAQ</a></li>
           <li><a href="classreferences/index.html" target="blank"><u>Complete Class References</u></a></li>
           </ul>
          </li>
          
          <li class="fat"><a href="http://sourceforge.net/projects/phpcrawl/files/PHPCrawl/" target="_blank">Download PHPCrawl</a></li>
          <li><a href="testinterface.html">Testinterface</a></li>
          <li><a href="versionhistory.html">Version history</a></li>
          <li><a href="http://sourceforge.net/projects/phpcrawl/forums/forum/307696" target="_blank">Forum</a></li>
          <li><a href="http://sourceforge.net/tracker/?group_id=89439&atid=590146" target="_blank">Report a bug</a></li>
          </ul>
         
         <div id="sf">
         <a href="http://sourceforge.net/projects/phpcrawl"><img src="http://sflogo.sourceforge.net/sflogo.php?group_id=89439&amp;type=14" width="150" height="40" alt="Get PHPCrawl at SourceForge.net. Fast, secure and Free Open Source software downloads" /></a>
         </div>
         
         <div id="sf">
         <form action="https://www.paypal.com/cgi-bin/webscr" method="post">
         <input type="hidden" name="cmd" value="_s-xclick">
         <input type="hidden" name="hosted_button_id" value="M53G4LP6XNHM4">
         <input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donate_SM.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
         <img alt="" border="0" src="https://www.paypalobjects.com/de_DE/i/scr/pixel.gif" width="1" height="1">
         </form>
         </div>

        </div>
        
        <div id="content">
        <h3>Version-History</h3><br />
        
        <b>Version 0.80 beta</b><br>
        2012/04/23<br><br>
        
        Version 0.80 was (almost) completely rewritten and ported to "proper" object-oriented PHP5 code.
        It brings new features like the ability to use multiple processes to spider a website and a new, internal,
        disk-based caching mechanism that makes it possible to spider even very large websites.
        Even though some methods and options were renamed and redone, phpcrawl 0.8 should be fully compatible with older
        versions.<br><br>
        
        The changes in detail:
        <br>
       
        <ul class="changelog">
        <li>
        The code was completely refactored and ported to object-oriented PHP5 code; large parts of it were rewritten.
        </li>
        
        <li>
        Added the ability to use multiple processes to spider a website. Method "goMultiProcessed()" added.
        </li>
        
        <li>
        New overridable method "initChildProcess()" added for initiating child-processes when using the crawler in multi-process-mode.
        </li>
        
        <li>
        Implemented an alternative, internal SQLite caching-mechanism for URLs, making it possible to spider very large websites.<br />
        Method "setUrlCacheType()" added.
        </li>
        
        <li>
        New method "setWorkingDirectory()" added for manually defining the location of the crawler's temporary working-directory.
        Therefore method "setTmpFile()" is marked as deprecated (it has no function anymore).
        </li>
        
        <li>
        New method "addContentTypeReceiveRule()" replaces the old method "addReceiveContentType()".<br>
        The function "addReceiveContentType()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        New method "addURLFollowRule()" replaces the old method "addFollowMatch()".<br>
        The function "addFollowMatch()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        New method "addURLFilterRule()" replaces the old method "addNonFollowMatch()".<br>
        The function "addNonFollowMatch()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        The crawler now is able to parse and obey "nofollow"-tags defined as meta-tags or rel-options (like "&lt;a href="page.html" rel="nofollow"&gt;")<br>
        Method "obeyNoFollowTags()" added.
        </li>
        
        <li>
        Overridable user-method "handleDocumentInfo()" replaces the old method "handlePageData()".<br>
        Method "handlePageData()" still exists, but was marked as deprecated.<br>
        The new method now provides information about a document as a PHPCrawlerDocumentInfo-object instead
        of an array.
        </li>
        
        <li>
        New method "enableAggressiveLinkSearch()" replaces the old method "setAggressiveLinkExtraction()".<br>
        The function "setAggressiveLinkExtraction()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        New method "setLinkExtractionTags()" replaces the old method "addLinkExtractionTags()" (which was named wrong: it didn't add tags, it overwrote the tag-list).<br>
        The function "addLinkExtractionTags()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        New method "addStreamToFileContentType()" replaces the old method "addReceiveToTmpFileMatch()".<br>
        The function "addReceiveToTmpFileMatch()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        New method "enableCookieHandling()" replaces the old method "setCookieHandling()".<br>
        The function "setCookieHandling()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        It's possible now to use a proxy-server for spidering websites. Method "setProxy()" added.
        </li>
        
        <li>
        Method "addReceiveToMemoryMatch()" has no function anymore and is marked as deprecated (since it was redundant).
        Users should use addStreamToFileContentType() now instead.
        </li>
        
        <li>
        Method "disableExtendedLinkInfo()" has no function anymore and is marked as deprecated.
        To reduce the memory-usage of the crawler users now should use the internal SQLite-urlcache
        (by calling "setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE)").
        </li>
        
        <li>
        New method "getProcessReport()" replaces the old method "getReport()".<br>
        The function "getReport()" still is present, but was marked as deprecated.
        </li>
        
        <li>
        Method "setFollowRedirectsTillContent()" has no function anymore and was marked as deprecated
        (since it completely interfered with other settings and rules that can be applied to the crawler).
        </li>
        
        <li>
        New overridable method "handleHeaderInfo()" added. This method is called after the header of a document has been
        received and BEFORE the content is received. It gives the user the opportunity to abort the request (based on
        the http-header the server responded with).
        </li>
        
        <li>
        Added some more information about found documents to the array (respectively PHPCrawlerDocumentInfo-object) that is passed
        to the overridable user-function handleDocumentInfo(): cookies (all cookies the server sent with the document), responseHeader
        (the header the webserver responded with, as a PHPCrawlerResponseHeader-object), error_occured (flag indicating whether an error occurred
        during the request), data_transfer_rate (average data-transfer rate for receiving the document), data_transfer_time (time it
        took to receive the document) and meta_attributes (all meta-tag attributes found in the source of the document).
        </li>
        
        <li>
        Added the ability to send post-data with requests for defined URLs, method "addPostData()" added.
        </li>
        
        <li>
        Fixed bug 3368719: When creating two (or more) instances of the crawler and obeyRobotsTxt() was set to true in both of them,
        the second call of the go()-method caused a fatal error. 
        </li>
        
        <li>
        Fixed bug 3368722: Flag "file_limit_reached" in the array that getReport() returns wasn't set to true when a page-limit (setPageLimit()) was hit.
        </li>
        
        <li>
        Fixed bug 3389965: phpcrawl now shouldn't throw any notices and warnings anymore when error_reporting is set to E_STRICT or E_ALL.
        </li>
        
        <li>
        Fixed bug 3413627: Links like "&lt;a href="?page=3"&gt;" get recognized and rebuilt correctly now.
        </li>
        
        <li>
        Fixed bug 3465964: Robots.txt-documents get parsed correctly now, even if the "User-agent"-directive in robots.txt-files
        is written like "user-agent" or "User-Agent".
        </li>
        
        <li>
        Fixed bug 3485155: Links like "test.htm?redirect=http://www.foo.ie&amp;a=b" get recognized and rebuilt correctly now.
        </li>
        
        <li>
        Fixed bug 3504517: Requests for links containing special characters (like "äöü") sometimes didn't work as expected 
        due to wrong character-encoding (server responded 404 NOT FOUND).
        </li>
        
        <li>
        Fixed bug 3510270: The array "links_found" of the array $page_data passed to the function handlePageData()
        never contained a linktext.
        </li>
        
        </ul>
        
        <b>Version 0.71</b><br>
        2011/06/30<br><br>
        
        Version 0.71 fixes some bugs and issues. One feature was added.<br>
        This will probably be the last version of phpcrawl that is compatible with PHP 4.<br><br>
        
        The changes in detail:
        <br>

        
        <ul class="changelog">
        <li>
        Bugfix: Empty links found in documents (like &lt;a href=""&gt;) are rebuilt correctly now so that
        they lead to the same document they were found in.
        
        
        <li>
        Bugfix: It's possible now to initiate more than one instance of the phpcrawler-class without getting a "Fatal error: Call to undefined method stdClass::receivePage()"
        error.
        
        
        <li>
        A new method "setLinkExtractionTags()" replaces the "addLinkExtractionTags()" method.<br>

        The old method addLinkExtractionTags() is named wrong since it doesn't ADD new tags, it OVERWRITES the tag-list.<br>
        addLinkExtractionTags() still exists for compatibility-reasons, but was marked as deprecated.
        
        
        <li>
        Bugfix: Links containing spaces like &lt;a href="any file.html"&gt; are recognized and processed correctly now.
        
        
        <li>
        Bugfix: Links containing quotes like &lt;a href="any'file.html"&gt; are recognized and processed correctly now.
        

        
        <li>
        The search-patterns used for aggressive link-extraction (if setAggressiveLinkExtraction() is set to TRUE) were redone.
        They should give better results now.
        
        
        <li>
        Bugfix: Phpcrawl doesn't crash anymore with a segmentation fault when parsing links that enclose very long text or html
        (like &lt;a href="foo.htm"&gt; ... very very long text goes here ...&lt;/a&gt;)
        
        
         <li>

        Method addLinkSearchContentType() added.<br>
        It's possible now to manually define what kind of documents should get parsed
        in order to find links in (according to their content/mime-type).<br>
        Before (and still by default) only documents of type "text/html" get checked for links.
        
        
        <li>
        Bugfix: Links containing linebreaks are recognized and processed correctly now.
        
        
        <li>
        Default value for socket-stream-timeout increased to 5 seconds.
        Although this can be set manually by using the setStreamTimeout()-method, the default value of 2 seconds was a little too low. 
        

        
        <li>
        Fixed some minor bugs in the test-interface and updated the included example-setup (since
        the old one didn't work anymore because of site-changes over at php.net)
        
        
        </ul>
      </p>
      
      <p>
        <b>Version 0.7</b><br>
        2007/01/05<br><br>
        

        Version 0.7 brings new features like a link-priority-system, support for SSL (https) and robots.txt-files, support for basic-authentication,
        the capability to stream content directly to a temporary file and some more.<br>
        Also more information about found pages and files
        is passed to the user now, the performance was improved and the link-extraction was redone for some better results.<br><br>
        
        The changes in detail:
        <br>
        
        <ul class="changelog">
        <li>
        New function setLinkPriority() added.
        It's possible now to set different priorities (priority-levels)
        for different links.
        

        <li>
        Found links are cached now in a different way so that the general performance
        was improved, especially when crawling a lot of pages (big sites) in one process.
        

        <li>
        Added support for Secure Sockets Layer (SSL), so links like "https://.." 
        will be followed correctly now.<br>
        Please note that the PHP OpenSSL-extension is required for it to work.
        

        <li>

        The link-extraction and other parts were redone and should give better results now.
        
        
        <li>
        Methods setAggressiveLinkExtraction() and addLinkExtractionTag() added for
        setting up the link extraction manually.
        
        
        <li>
        Added support for robots.txt-files, method obeyRobotsTxt() added.
        

        <li>
        More information about found pages and files will be passed now to the method handlePageData(),
        especially information about all found links in the current page is available now. Method 
        disableExtendedLinkInfo() added.
        

        <li>
        The crawler now is able to handle links like "http://www.foo.com:123"
        correctly and will send requests to the correct port. Method setPort() added.
        

        <li>
        The content of pages and files now can be received/streamed directly to a temporary file
        instead of to memory, so now it shouldn't be a problem anymore to let the crawler receive big files.<br>
        Methods setTmpFile(), addReceiveToTmpFileMatch() and addReceiveToMemoryMatch() added.
        

        <li>

        Added support for basic authentication. Now the crawler is able to
        crawl protected content. Method addBasicAuthentication() added.
        

        <li>
        Now it's possible to abort the crawling-process by letting the overridable function
        handlePageData() return any negative value.
        

        <li>
        Method setUserAgentString() added for setting the user-agent-string in request-headers
        the crawler sends.
        

        <li>

        A html-testinterface for testing the different setups of the crawler
        is included in the package now. 
        

        <li>
        The crawler doesn't do DNS-lookups anymore for every single page-request,
        only if the host is changing a lookup will be done once. This improves performance a little bit.
        

        <li>
        The crawler doesn't look for links anymore in every file it finds,
        only "text/html"-content will be checked for links.
        

        <li>

        Fixed problem with rebuilding links like "#foo".
        

        <li>
        Fixed problem with splitting/rebuilding URLs like "http://foo.bar.com/?http://foo/bar.html".
         

        <li>
        Fixed problem that the crawler handled e.g. "foo.com" and "www.foo.com" as different hosts.
        
        
        </ul>
      </p>

      <p>
        <b>Version 0.65_pl2</b><br>
        2005/08/22<br>

        <ul>
        <li>Phpcrawl now doesn't throw any notices anymore when error-reporting
        is set to E_ALL. Thanks to elitec0der!
        <li>Also there shouldn't be any notices anymore when "allow_call_time_pass_reference"
        is set to OFF. Thanks philipx!
        <li>Fixed a bug that appeared when running the crawler in followmode 3.
        (The crawler never went "path up" anymore)
        <li>Now the crawler sends the referer-header with requests correctly again.
        

        </ul>
      </p>

      <p>
        <b>Version 0.65_pl1 (patchlevel 1)</b><br>
        2004/08/06<br><br>
        
        Just a few bugs were fixed in this version:<br>
        (Yes, it took a while, sorry)
        <br>

        <ul class="changelog">
        <li> The crawler doesn't send an empty "Cookie:" header anymore if there's no cookie to send.<br>
        (A few webservers returned a "bad request" header to this) Thanks Mark!
        <li> The crawler doesn't send one and the same cookie-header several times with a request anymore
        if it was set more than one time by the server.
        (Didn't really matter though)
        <li>Crawler will find links in metatags used with single quotation marks (') like<br>
        &lt;META HTTP-EQUIV='refresh' CONTENT='foo; URL=http://bar/'&gt; now. Thanks Arsa!
        <li>HTTP 1.0 requests will be sent now because of problems with HTTP 1.1 headers
        and chunked content. Thanks Arsa again!
        

        </ul>
      </p>
      
      <p>
       <b>Version 0.65_beta</b><br>
       2003/09/04<br><br>
       
        Version 0.65_beta is the first documented release.<br>
        There are some bugs left, I bet, and
        there are a lot of planned features currently not implemented,
        but it does its work.

        To get an overview of the features the class supports just take a look
        at the classreference and the method-descriptions, you should get a good
        impression.<br>
        <br>
        Here is a list of things that are currently NOT supported or implemented but
        (maybe) will be there in future versions.
        <ul class="changelog">
        <li>
        The crawler currently doesn't support protocols other than HTTP ("http://..."). It can't handle
        links like "ftp://.." or "https://.." or others. 
        <li>
        Currently the crawler uses just one socket-connection at a time. This slows down the
        crawling process especially when crawling sites on a "slow" host.

        <li>
        In this version, content of pages or files the crawler found will be streamed into local memory
        (not ALL of them, just the "current" one of course).
        That's ok if the content/source has a size of a few hundred kbytes or maybe even a little bit more than that,
        but it's really "bad" if the content-size is really big, like a 700 MB ISO-image or something like that.
        
        <li>
        Currently the crawler follows links as they come, there are no options/methods to
        specify the "follow-order" or to set priorities for special links. (Just talking about the ORDER, options
        to filter links and other options like that are implemented)<br>
        </ul>
        
        </div>
        
      </div>
  
  </div>
  
  
  
</div>

</body>
</html>