<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
  <title>HTMLCleaner</title>
  <meta name="author" content="hide@address.com">
  <meta name="generator" content="AceHTML Freeware">
</head>
<body>
<h2>CRIOSWEB HTML Cleaner</h2>
<p>How many of you have needed to clean up messy MS Word files in order to integrate them into W3C-valid pages, or just to fit them into the overall design?<br />
I&rsquo;ve looked for a good HTML Cleaner and didn&rsquo;t find a good free one.</p>
<p>Meanwhile, I&rsquo;ve developed my own HTML Cleaner class in PHP, because I needed to clean up tons of Word-generated code at the time.</p>
<p>I&rsquo;ve combined the strong HTML Tidy library with my own regular expression-based cleaning algorithms. I wanted a simple method to strip all unnecessary tags and styles while keeping the output W3C standards-compliant.</p>
<p>Syntax checking is done only when using <strong>Tidy</strong>.<br />
Note that <strong>this tool is designed to strip/clean useless tags and attributes back to HTML basics and optimize code</strong>, not to sanitize it (as HTMLPurifier does).</p>
<p>Without the <strong>tidy</strong> PHP extension, the class can:<br />
- remove styles, attributes<br />
- strip useless tags<br />
- fill empty table cells with non-breaking spaces<br />
- optimize code (merge inline tags, strip empty inline tags, trim excess new lines)<br />
- drop empty paragraphs<br />
- compress (trim space and new-line breaks).</p>
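<p>Two of the regex-based steps listed above can be illustrated with a minimal sketch. The helper names sketch_drop_empty_paragraphs and sketch_fill_empty_cells and both patterns are assumptions made for illustration; they are not taken from HTMLCleaner.php, and the class may do this differently.</p>
<div>
<div style="">Code (php)</div>
<pre>
&lt;?php
// Hypothetical helpers for illustration only, not part of HTMLCleaner.php.

// Remove paragraphs that contain only whitespace or &amp;nbsp; entities.
function sketch_drop_empty_paragraphs($html) {
    return preg_replace('~&lt;p\b[^&gt;]*&gt;(?:\s|&amp;nbsp;)*&lt;/p&gt;~i', '', $html);
}

// Put a non-breaking space into td/th cells that have no content.
function sketch_fill_empty_cells($html) {
    return preg_replace('~(&lt;t([dh])\b[^&gt;]*&gt;)\s*(&lt;/t\2&gt;)~i', '$1&amp;nbsp;$3', $html);
}

$dirty = '&lt;p&gt; &amp;nbsp; &lt;/p&gt;&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;x&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;';
echo sketch_fill_empty_cells(sketch_drop_empty_paragraphs($dirty));
</pre>
</div>
<p>Patterns like these work on reasonably regular markup such as Word exports; they are not a substitute for a real HTML parser.</p>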
<p>In conjunction with <strong>tidy</strong>, the class can apply all tidy actions (clean-up, fix errors, convert to XHTML, etc) and then optionally perform all actions of the class (remove styles, compress, etc).</p>
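<p>As a rough sketch of what a Tidy pass looks like in plain PHP, the snippet below calls tidy_repair_string() from the tidy extension directly. The wrapper name sketch_tidy_pass and the fallback behaviour are assumptions for illustration, not the class&rsquo;s own code.</p>
<div>
<div style="">Code (php)</div>
<pre>
&lt;?php
// Hypothetical wrapper for illustration only, not part of HTMLCleaner.php.
function sketch_tidy_pass($html, $config, $encoding = 'latin1') {
    if (!extension_loaded('tidy')) {
        return $html; // no tidy extension: hand the input back unchanged
    }
    // tidy_repair_string() parses the markup, repairs it and returns it as a string.
    return tidy_repair_string($html, $config, $encoding);
}

// 'output-xhtml' and 'show-body-only' are standard HTML Tidy options.
echo sketch_tidy_pass('&lt;p&gt;unclosed &lt;b&gt;bold', array('output-xhtml' =&gt; true, 'show-body-only' =&gt; true));
</pre>
</div>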
<p>Currently only one cleaning method is implemented: a tag whitelist combined with an attribute blacklist.</p>
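<p>The whitelist/blacklist idea can be sketched in a few lines of PHP. This is only an illustration under assumptions: the function name sketch_whitelist_clean and the attribute-matching pattern are hypothetical, and the real class may implement the step differently. Only strip_tags() and preg_replace() from core PHP are used.</p>
<div>
<div style="">Code (php)</div>
<pre>
&lt;?php
// Hypothetical sketch for illustration only, not the class's actual implementation.
function sketch_whitelist_clean($html, $allowedTags, $attribBlacklist) {
    // Tag whitelist: strip_tags() removes every tag not listed in $allowedTags.
    $html = strip_tags($html, $allowedTags);
    // Attribute blacklist: drop quoted name="value" pairs whose name matches the pattern.
    $pattern = '~\s+(?:' . $attribBlacklist . ')\s*=\s*(?:"[^"]*"|\'[^\']*\')~i';
    return preg_replace($pattern, '', $html);
}

echo sketch_whitelist_clean(
    '&lt;p id="x" onclick="go()"&gt;Hi &lt;font color="red"&gt;there&lt;/font&gt;&lt;/p&gt;',
    '&lt;p&gt;&lt;b&gt;&lt;i&gt;',
    'id|on[\w]+'
);
// prints &lt;p&gt;Hi there&lt;/p&gt;
</pre>
</div>
<p>Note that this simple pattern also matches attribute-like text outside of tags and ignores unquoted values, so it is suited to cleaning, not to security filtering.</p>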
<p><strong>Properties:</strong></p>
<div>
<div style="">Code (php)</div>
<p>
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$html</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Options</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Tag_whitelist</span>=<span style="color: #ff0000;">'&lt;table&gt;&lt;tbody&gt;&lt;thead&gt;&lt;tfoot&gt;&lt;tr&gt;&lt;th&gt;&lt;td&gt;&lt;colgroup&gt;&lt;col&gt;<br />
&lt;p&gt;<br />
&lt;hr&gt;&lt;blockquote&gt;<br />
&lt;b&gt;&lt;i&gt;&lt;u&gt;&lt;sub&gt;&lt;sup&gt;&lt;strong&gt;&lt;em&gt;&lt;tt&gt;&lt;var&gt;<br />
&lt;code&gt;&lt;xmp&gt;&lt;cite&gt;&lt;pre&gt;&lt;abbr&gt;&lt;acronym&gt;&lt;address&gt;&lt;samp&gt;<br />
&lt;fieldset&gt;&lt;legend&gt;<br />
&lt;a&gt;&lt;img&gt;<br />
&lt;h1&gt;&lt;h2&gt;&lt;h3&gt;&lt;h4&gt;&lt;h4&gt;&lt;h5&gt;&lt;h6&gt;<br />
&lt;ul&gt;&lt;ol&gt;&lt;li&gt;&lt;dl&gt;&lt;dt&gt;<br />
&lt;frame&gt;&lt;frameset&gt;<br />
&lt;form&gt;&lt;input&gt;&lt;select&gt;&lt;option&gt;&lt;optgroup&gt;&lt;button&gt;&lt;textarea&gt;'</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Attrib_blacklist</span>=<span style="color: #ff0000;">'id|on[<span style="color: #000099; font-weight: bold;">\w</span>]+'</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$CleanUpTags</span>=<a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span><span style="color: #ff0000;">'a'</span>,<span style="color: #ff0000;">'span'</span>,<span style="color: #ff0000;">'b'</span>,<span style="color: #ff0000;">'i'</span>,<span style="color: #ff0000;">'u'</span>,<span style="color: #ff0000;">'strong'</span>,<span style="color: #ff0000;">'em'</span>,<span style="color: #ff0000;">'big'</span>,<span style="color: #ff0000;">'small'</span>,<span style="color: #ff0000;">'tt'</span>,<span style="color: #ff0000;">'var'</span>,<span style="color: #ff0000;">'code'</span>,<span style="color: #ff0000;">'xmp'</span>,<span style="color: #ff0000;">'cite'</span>,<span style="color: #ff0000;">'pre'</span>,<span style="color: #ff0000;">'abbr'</span>,<span style="color: #ff0000;">'acronym'</span>,<span style="color: #ff0000;">'address'</span>,<span style="color: #ff0000;">'q'</span>,<span style="color: #ff0000;">'samp'</span>,<span style="color: #ff0000;">'sub'</span>,<span style="color: #ff0000;">'sup'</span><span style="color: #66cc66;">)</span>; <span style="color: #808080; font-style: italic;">//array of inline tags that can be merged</span><br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$TidyConfig</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Encoding</span>=<span style="color: #ff0000;">'latin1'</span>;</p>
<p><span style="color: #0000ff;">$this</span>-&gt;<span style="color: #006600;">Options</span> = <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'RemoveStyles'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//removes styling attributes such as style and class</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'IsWord'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Microsoft Word flag - specific operations may occur</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'UseTidy'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//also uses the tidy engine to clean up the source (recommended)</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'CleaningMethod'</span> =&gt; <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span>TAG_WHITELIST,ATTRIB_BLACKLIST<span style="color: #66cc66;">)</span>, <span style="color: #808080; font-style: italic;">//cleaning methods</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'OutputXHTML'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//converts to XHTML by using TIDY.</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'FillEmptyTableCells'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//fills empty cells with non-breaking spaces</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'DropEmptyParas'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//drops empty paragraphs</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'Optimize'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Optimize code - merge tags</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'Compress'</span> =&gt; <span style="color: #000000; font-weight: bold;">false</span><span style="color: #66cc66;">)</span>; <span style="color: #808080; font-style: italic;">//trims all spaces (line breaks, tabs) between tags and between words.</span></p>
<p><span style="color: #808080; font-style: italic;">// Specify TIDY configuration</span><br />
<span style="color: #0000ff;">$this</span>-&gt;<span style="color: #006600;">TidyConfig</span> = <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'indent'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">/*a bit slow*/</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'output-xhtml'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Outputs the data in XHTML format</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'word-2000'</span> =&gt; <span style="color: #000000; font-weight: bold;">false</span>, <span style="color: #808080; font-style: italic;">//Removes all proprietary data when an MS Word document has been saved as HTML</span><br />
&nbsp; &nbsp; <span style="color: #808080; font-style: italic;">//'clean' =&gt; true, /*too slow*/</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'drop-proprietary-attributes'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Removes all attributes that are not part of a web standard</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'hide-comments'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Strips all comments</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'preserve-entities'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//preserve the well-formed entities as found in the input</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'quote-ampersand'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//output unadorned &amp; characters as &amp;amp;</span><br />
&nbsp; &nbsp; <span style="color: #ff0000;">'wrap'</span> =&gt; <span style="color: #cc66cc;">200</span><span style="color: #66cc66;">)</span>; <span style="color: #808080; font-style: italic;">//Sets the number of characters allowed before a line is soft-wrapped</span></p>
</div>
<p><strong>Methods:</strong></p>
<div>
<div style="">Code (php)</div>
<p>
<span style="color: #000000; font-weight: bold;">function</span> RemoveBlacklistedAttributes<span style="color: #66cc66;">(</span><span style="color: #0000ff;">$attribs</span><span style="color: #66cc66;">)</span> <span style="color: #808080; font-style: italic;">//removes specified attributes</span><br />
<span style="color: #000000; font-weight: bold;">function</span> cleanUp<span style="color: #66cc66;">(</span><span style="color: #0000ff;">$encoding</span>=<span style="color: #ff0000;">'latin1'</span><span style="color: #66cc66;">)</span> <span style="color: #808080; font-style: italic;">//actual cleanup function</span></p>
</div>
<p>See it in action: <a href="http://luci.criosweb.ro/scripts/HTMLCleaner/" target="_blank">http://luci.criosweb.ro/scripts/HTMLCleaner/</a><br />
(Unfortunately there is no tidy support on that server, so only basic cleaning applies.)</p>
<p><strong>Changes:</strong> See <a href="changelog.txt">changelog</a></p>
<p><strong>Usage example:</strong></p>
<div class="ch_code_container" style="font-family: monospace;"><span style="color: #000000">
<span style="color: #0000BB">&lt;?php&nbsp;</span><span style="color: #FF8000">/*&nbsp;Created&nbsp;on:&nbsp;12.07.2007&nbsp;*/&nbsp;</span><span style="color: #0000BB">?&gt;</span><br />
&lt;html&gt;<br />
&lt;head&gt;<br />
&lt;title&gt;HTMLCleaner&nbsp;::&nbsp;Clean&nbsp;MS&nbsp;Word&nbsp;HTML&nbsp;quickly&lt;/title&gt;<br />
&lt;style&nbsp;type="text/css"&gt;<br />
body,table&nbsp;&nbsp;&nbsp;&nbsp;{font-family:tahoma,verdana,arial;font-size:9pt}<br />
&lt;/style&gt;<br />
&lt;/head&gt;<br />
&lt;body&gt;<br />
&lt;h2&gt;CRIOS&lt;i&gt;WEB&lt;/i&gt;'s&nbsp;HTMLCleaner&lt;/h2&gt;<br />
&lt;form&nbsp;enctype="multipart/form-data"&nbsp;method="POST"&nbsp;action="<span style="color: #0000BB">&lt;?php&nbsp;</span><span style="color: #007700">echo&nbsp;</span><span style="color: #0000BB">htmlspecialchars</span><span style="color: #007700">(</span><span style="color: #0000BB">$_SERVER</span><span style="color: #007700">[</span><span style="color: #DD0000">'PHP_SELF'</span><span style="color: #007700">]);&nbsp;</span><span style="color: #0000BB">?&gt;</span>"&gt;<br />
HTML&nbsp;Document&nbsp;to&nbsp;be&nbsp;cleaned:<br />
&lt;input&nbsp;type="file"&nbsp;name="doc"&gt;<br />
&lt;br&gt;&lt;br&gt;<br />
&lt;input&nbsp;type="submit"&nbsp;value="Process"&nbsp;name="process"&gt;<br />
&lt;/form&gt;<br />
<br />
<span style="color: #0000BB">&lt;?php</span><br />
<span style="color: #007700">if(isset(</span><span style="color: #0000BB">$_POST</span><span style="color: #007700">[</span><span style="color: #DD0000">"process"</span><span style="color: #007700">])){</span><br />
<span style="color: #0000BB">$filename</span><span style="color: #007700">=</span><span style="color: #0000BB">$_FILES</span><span style="color: #007700">[</span><span style="color: #DD0000">"doc"</span><span style="color: #007700">][</span><span style="color: #DD0000">"tmp_name"</span><span style="color: #007700">];</span><br />
<br />
<span style="color: #007700">require(</span><span style="color: #DD0000">"HTMLCleaner.php"</span><span style="color: #007700">);</span><br />
<br />
<span style="color: #FF8000">//&nbsp;read&nbsp;the&nbsp;uploaded&nbsp;file,&nbsp;then&nbsp;delete&nbsp;the&nbsp;temporary&nbsp;copy</span><br />
<span style="color: #0000BB">$fp</span><span style="color: #007700">=</span><span style="color: #0000BB">fopen</span><span style="color: #007700">(</span><span style="color: #0000BB">$filename</span><span style="color: #007700">,</span><span style="color: #DD0000">"r"</span><span style="color: #007700">);</span><br />
<span style="color: #0000BB">$word</span><span style="color: #007700">=</span><span style="color: #0000BB">fread</span><span style="color: #007700">(</span><span style="color: #0000BB">$fp</span><span style="color: #007700">,</span><span style="color: #0000BB">filesize</span><span style="color: #007700">(</span><span style="color: #0000BB">$filename</span><span style="color: #007700">));</span><br />
<span style="color: #0000BB">fclose</span><span style="color: #007700">(</span><span style="color: #0000BB">$fp</span><span style="color: #007700">);</span><br />
<span style="color: #0000BB">unlink</span><span style="color: #007700">(</span><span style="color: #0000BB">$filename</span><span style="color: #007700">);</span><br />
<br />
<span style="color: #0000BB">$cleaner</span><span style="color: #007700">=new&nbsp;</span><span style="color: #0000BB">HTMLCleaner</span><span style="color: #007700">();</span><br />
<br />
<span style="color: #FF8000">/*$cleaner-&gt;Options['UseTidy']=false;<br />
$cleaner-&gt;Options['OutputXHTML']=false;*/</span><br />
<span style="color: #0000BB">$cleaner</span><span style="color: #007700">-&gt;</span><span style="color: #0000BB">Options</span><span style="color: #007700">[</span><span style="color: #DD0000">'Optimize'</span><span style="color: #007700">]=</span><span style="color: #0000BB">true</span><span style="color: #007700">;</span><br />
<br />
<span style="color: #0000BB">$cleaner</span><span style="color: #007700">-&gt;</span><span style="color: #0000BB">html</span><span style="color: #007700">=</span><span style="color: #0000BB">$word</span><span style="color: #007700">;</span><br />
<span style="color: #0000BB">$cleanHTML</span><span style="color: #007700">=</span><span style="color: #0000BB">$cleaner</span><span style="color: #007700">-&gt;</span><span style="color: #0000BB">cleanUp</span><span style="color: #007700">(</span><span style="color: #DD0000">'latin1'</span><span style="color: #007700">);</span><br />
<br />
<span style="color: #FF8000">//&nbsp;show&nbsp;the&nbsp;cleaned&nbsp;source&nbsp;in&nbsp;a&nbsp;textarea,&nbsp;then&nbsp;render&nbsp;it</span><br />
<span style="color: #007700">echo&nbsp;</span><span style="color: #DD0000">'&lt;textarea&nbsp;style="width:100%;height:300px"&gt;'</span><span style="color: #007700">.</span><span style="color: #0000BB">htmlspecialchars</span><span style="color: #007700">(</span><span style="color: #0000BB">$cleanHTML</span><span style="color: #007700">,</span><span style="color: #0000BB">ENT_COMPAT</span><span style="color: #007700">,</span><span style="color: #DD0000">"ISO-8859-1"</span><span style="color: #007700">).</span><span style="color: #DD0000">"&lt;/textarea&gt;"</span><span style="color: #007700">;</span><br />
<span style="color: #007700">echo&nbsp;</span><span style="color: #0000BB">$cleanHTML</span><span style="color: #007700">;</span><br />
<br />
<span style="color: #007700">}</span><br />
<span style="color: #0000BB">?&gt;</span><br />
&lt;/body&gt;<br />
&lt;/html&gt;<br />
</span>
</div>
</body>
</html>