<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
	<title>PHPCrawl - Webcrawler Class</title>
 <link rel="stylesheet" type="text/css" href="style.css">
</head>

<body>


		<div id="header">
				<h1>PHPCrawl Documentation</h1>
				For PHPCrawl Version 0.7
		</div>

		<div id="menu_container">
		  <div id="menu">
						<ul id="menu">
						<li><a href="index.html">Introduction & Requirements</a></li>
				  <li><a href="quickstart.html">Quickstart</a></li>
		    <li><a href="example.html">Example-Script</a></li>
				  <li><a href="version_info.html">Version-History</a></li>
				  <li><a href="testinterface.html">The Testinterface</a></li>
				  <li><a href="classreference.html">Classreference</a></li>
						</ul>
				</div>
    
		  <div id="download">
						<ul id="menu">
      <li><a href="download.html">Download PHPCrawl<br></a></li>
      <li><a href="http://sourceforge.net/projects/phpcrawl">Sourceforge Projectpage<br></a></li>
						</ul>
				</div>
    
    <div id="sflogo">
      <a href="http://sourceforge.net">
      <!--
      <img src="http://sflogo.sourceforge.net/sflogo.php?group_id=89439&amp;type=7" width="210" height="62" border="0" alt="SourceForge.net Logo"></a></div>
      -->
       <img src="img/sflogo.png" width="210" height="62" border="0" alt="SourceForge.net Logo"></a></div>
       
  </div>

  <div id="main">
  <h2>Version-History</h2>
  
  <p>
    <b>Version 0.7</b><br>
				2007/01/05
				<br><br>
				Version 0.7 brings new features like a link-priority-system, support for SSL (https) and robots.txt files, support for basic authentication,
				the capability to stream content directly to a temporary file, and more.<br>
    More information about found pages and files
				is now passed to the user, the performance was improved, and the link extraction was reworked for better results.
				<br><br>
				The changes in detail (a short usage sketch follows the list):
				<br>
    
				<ul>
				<li>
				New method setLinkPriority() added.
				It is now possible to set different priorities (priority-levels)
				for different links.
				<br><br>

				<li>
				Found links are now cached in a different way, which improves general performance,
				especially when crawling a lot of pages (big sites) in one process.
				<br><br>

				<li>
				Added support for Secure Sockets Layer (SSL), so links like "https://..." 
				are now followed correctly.<br>
				Please note that the PHP OpenSSL extension is required for this to work.
				<br><br>

				<li>
				The link-extraction and other parts were redone and should give better results now.
    <br><br>
    
    <li>
				Methods setAggressiveLinkExtraction() and addLinkExtractionTag() added for
				setting up the link extraction manually.
				<br><br>
    
    <li>
				Added support for robots.txt files; method obeyRobotsTxt() added.
				<br><br>

				<li>
				More information about found pages and files is now passed to the method handlePageData();
				in particular, information about all links found in the current page is available now. Method 
				disableExtendedLinkInfo() added.
				<br><br>

				<li>
				The crawler is now able to handle links like "http://www.foo.com:123"
				correctly and will send requests to the correct port. Method setPort() added.
				<br><br>

				<li>
				The content of pages and files can now be received/streamed directly to a temporary file
				instead of into memory, so it shouldn't be a problem anymore to let the crawler receive big files.<br>
				Methods setTmpFile(), addReceiveToTmpFileMatch() and addReceiveToMemoryMatch() added.
				<br><br>

				<li>
				Added support for basic authentication. Now the crawler is able to
				crawl protected content. Method addBasicAuthentication() added.
				<br><br>

				<li>
				It is now possible to abort the crawling-process by letting the overridable method
				handlePageData() return any negative value.
				<br><br>

				<li>
				Method setUserAgentString() added for setting the user-agent string in the request headers
				the crawler sends.
				<br><br>

				<li>
				An HTML test interface for trying out the different setups of the crawler
				is now included in the package. 
				<br><br>

				<li>
				The crawler doesn't do a DNS lookup for every single page-request anymore;
				a lookup is only done once when the host changes. This improves performance a little bit.
				<br><br>

				<li>
				The crawler doesn't look for links in every file it finds anymore;
				only "text/html" content is checked for links.
				<br><br>

				<li>
				Fixed problem with rebuilding links like "#foo".
				<br><br>

				<li>
				Fixed problem with splitting/rebuilding URLs like "http://foo.bar.com/?http://foo/bar.html".
				<br><br> 

				<li>
				Fixed a problem where the crawler treated e.g. "foo.com" and "www.foo.com" as different hosts.
				<br><br>
    
    </ul>
  </p>
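  <p>
				The sketch below illustrates how several of the new 0.7 methods might be combined in a simple setup script,
				together with the basic setURL() and go() calls. It is only a sketch: the class name, the array keys and the
				exact method signatures shown here are assumptions based on the method names above, so please check the
				<a href="classreference.html">classreference</a> for the authoritative definitions.
  </p>

  <pre>
&lt;?php
// Include the crawler class here (the filename depends on the package layout).
// include("phpcrawler.class.php");

class MyCrawler extends PHPCrawler   // class name assumed
{
  // Overridable callback; as of 0.7, returning a negative value aborts the crawling-process.
  function handlePageData(&amp;$page_data)
  {
    echo $page_data["url"]."\n";                              // "url" array key assumed

    // Example abort condition: stop after a very large document (key name assumed).
    if ($page_data["bytes_received"] &gt; 50000000) return -1;
  }
}

$crawler = new MyCrawler();
$crawler-&gt;setURL("http://www.foo.com:123/");                        // a port given in the URL is handled correctly now
$crawler-&gt;setUserAgentString("MyCrawler/0.1 (PHPCrawl 0.7)");       // custom user-agent string
$crawler-&gt;obeyRobotsTxt(true);                                      // respect robots.txt files
$crawler-&gt;addBasicAuthentication("#/protected/#", "user", "pass");  // assumed signature: URL-regex, user, password
$crawler-&gt;setLinkPriority("#/news/#", 10);                          // assumed signature: URL-regex, priority-level
$crawler-&gt;addReceiveToTmpFileMatch("#\.(zip|iso)$#");               // stream matching (big) files to a temporary file
$crawler-&gt;go();
?&gt;
  </pre>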

  <p>
				<b>Version 0.65_pl2</b><br>
				2005/08/22
				<br>
				<ul>
				<li>PHPCrawl doesn't throw any notices anymore when error-reporting
				is set to E_ALL. Thanks to elitec0der!
				<li>There also shouldn't be any notices anymore when "allow_call_time_pass_reference"
				is set to OFF. Thanks philipx!
				<li>Fixed a bug that appeared when running the crawler in follow-mode 3
				(the crawler never went "path up" anymore).
				<li>The crawler now sends the referer header with requests correctly again.
				<br><br>
			 </ul>
  </p>

  <p>
				<b>Version 0.65_pl1 (patchlevel 1)</b><br>
				2004/08/06
				<br><br>
				Just a few bugs were fixed in this version:<br>
				(Yes, it took some time, sorry.)
				<br>
				<ul>
				<li> The crawler doesn't send an empty "Cookie:" header anymore if there's no cookie to send.<br>
				(A few webservers returned a "bad request" response to this.) Thanks Mark!
				<li> The crawler doesn't send one and the same cookie-header several times with a request anymore
				if it was set more than once by the server.
				(Didn't really matter though.)
				<li>The crawler now finds links in meta tags used with single quotation marks (') like<br>
				&lt;META HTTP-EQUIV='refresh' CONTENT='foo; URL=http://bar/'&gt;. Thanks Arsa!
				<li>HTTP 1.0 requests are sent now because of problems with HTTP 1.1 headers
				and chunked content. Thanks again, Arsa!
    <br><br>
    </ul>
  </p>
  
  <p>
			<b>Version 0.65_beta</b><br>
			2003/09/04
			<br><br>
				Version 0.65_beta is the first documented release.<br>
				There are probably some bugs left,
				and a lot of planned features are currently not implemented,
				but it does its work.<br><br>
				To get an overview of the features the class supports, just take a look
				at the classreference and the method descriptions; you should get a good
				impression.<br>
				<br>
				Here is a list of things that are currently NOT supported or implemented but
				(maybe) will be there in future versions.
				<ul>
				<li>
				The crawler currently doesn't support protocols other than HTTP ("http://..."). It can't handle
				links like "ftp://..." or "https://..." or others.<br><br> 
				<li>
				Currently the crawler uses just one socket-connection at a time. This slows down the
				crawling process especially when crawling sites on a "slow" host.<br><br>
				<li>
				In this version, the content of pages or files the crawler finds is streamed into local memory
				(not ALL of them at once, just the "current" one of course).
				That's OK if the content/source has a size of a few hundred kilobytes or maybe even a little bit more than that,
				but it's really "bad" if the content-size is really big, like a 700 MB ISO-image or something like that.
				<br><br>
				<li>
				Currently the crawler follows links as they come; there are no options/methods to
				specify the "follow-order" or to set priorities for special links. (This is just about the ORDER; options
				to filter links and other options like that are implemented.)<br>
    </ul>
  </p>
  
  </div>
  
</body>
</html>