<?xml version="1.0" encoding="iso-8859-2"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" />
<title>HTMLCleaner</title>
<meta name="author" content="hide@address.com">
<meta name="generator" content="AceHTML Freeware">
</head>
<body>
<h2>CRIOSWEB HTML Cleaner</h2>
<p>How many of you have needed to clean up messy MS Word files in order to integrate them into valid W3C pages, or just to fit them into an overall design?<br />
I looked for a good HTML cleaner and didn’t find a good free one.</p>
<p>Meanwhile, I’ve developed my own HTML Cleaner class in PHP, because I needed to clean up tons of Word-generated code at the time.</p>
<p>I’ve combined the powerful HTML Tidy library with my own regular expression-based cleaning algorithms. I wanted a simple method to strip all unnecessary tags and styles while keeping the output W3C standards-compliant.</p>
<p>Syntax checking is done only when using <strong>Tidy</strong>.<br />
Note that <strong>this tool is designed to strip/clean useless tags and attributes back to HTML basics and optimize code</strong>, not to sanitize it (as HTMLPurifier does).</p>
<p>Without the <strong>tidy</strong> PHP extension, the class can:<br />
- remove styles, attributes<br />
- strip useless tags<br />
- fill empty table cells with non-breaking spaces<br />
- optimize code (merge inline tags, strip empty inline tags, trim excess new lines)<br />
- drop empty paragraphs<br />
- compress (trim space and new-line breaks).</p>
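<p>The tidy-free steps listed above can be approximated with a few regular expressions. The sketch below is not the class's actual code: the function names are hypothetical, and the patterns assume reasonably well-formed input with no <code>&gt;</code> characters inside attribute values.</p>

```php
<?php
// Hypothetical helpers illustrating the regex-based steps; not HTMLCleaner's own code.

/** Put a non-breaking space into table cells that contain only whitespace. */
function fillEmptyTableCells(string $html): string
{
    return preg_replace('~(<td\b[^>]*>)\s*(</td>)~i', '$1&nbsp;$2', $html);
}

/** Remove paragraphs that contain only whitespace or &nbsp; entities. */
function dropEmptyParagraphs(string $html): string
{
    return preg_replace('~<p\b[^>]*>(?:\s|&nbsp;)*</p>~i', '', $html);
}

/** Remove empty inline elements such as <span></span> or <b> </b>. */
function stripEmptyInlineTags(string $html, array $tags): string
{
    $names = implode('|', array_map('preg_quote', $tags));
    $pattern = '~<(' . $names . ')\b[^>]*>\s*</\1>~i';
    // Repeat until nothing changes so nested empties like <b><i></i></b> go too.
    do {
        $html = preg_replace($pattern, '', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

/** Drop whitespace between tags and collapse whitespace runs between words. */
function compress(string $html): string
{
    $html = preg_replace('~>\s+<~', '><', $html);
    return trim(preg_replace('~\s+~', ' ', $html));
}
```

<p>Each helper takes and returns a string, so the steps can be chained in whatever order the options enable them. Note that this <code>compress()</code> would also collapse whitespace inside <code>&lt;pre&gt;</code> blocks.</p>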
<p>In conjunction with <strong>tidy</strong>, the class can apply all tidy actions (clean-up, error fixing, conversion to XHTML, etc.) and then optionally perform all of its own actions (remove styles, compress, etc.).</p>
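<p>As a rough illustration of the tidy pass, the sketch below calls PHP's <code>tidy_repair_string()</code> when the extension is loaded and otherwise returns the markup untouched. The wrapper name <code>repairWithTidy</code> and the fallback behaviour are assumptions made for this example, not the class's actual implementation.</p>

```php
<?php
/**
 * Repair markup with tidy when the extension is available.
 * Hypothetical wrapper: without tidy, the input is returned unchanged so that
 * regex-based cleaning can still run afterwards.
 */
function repairWithTidy(string $html, array $config, string $encoding = 'latin1'): string
{
    if (!extension_loaded('tidy')) {
        return $html;
    }
    // tidy_repair_string() returns false on failure; keep the input in that case.
    $repaired = tidy_repair_string($html, $config, $encoding);
    return is_string($repaired) ? $repaired : $html;
}

// A few of the tidy options this page lists for $TidyConfig.
$config = [
    'output-xhtml'                => true,
    'drop-proprietary-attributes' => true,
    'hide-comments'               => true,
    'wrap'                        => 200,
];

echo repairWithTidy('<p>Unclosed <b>bold', $config);
```

<p>With tidy loaded, the call prints a complete, repaired XHTML document; without it, the fragment comes back as it went in.</p>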
<p>Currently one cleaning method is implemented: a tag whitelist combined with an attribute blacklist.</p>
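<p>One way to picture the whitelist/blacklist method is shown below. It is a minimal sketch under stated assumptions: <code>applyTagWhitelist</code> and <code>stripBlacklistedAttributes</code> are hypothetical names, the whitelist uses the same <code>&lt;tag&gt;&lt;tag&gt;</code> string format that PHP's <code>strip_tags()</code> accepts, and the blacklist is a regex alternation like the <code>$Attrib_blacklist</code> default.</p>

```php
<?php
/** Keep only whitelisted tags; other tags are removed but their text stays. */
function applyTagWhitelist(string $html, string $whitelist): string
{
    return strip_tags($html, $whitelist);
}

/**
 * Remove attributes whose names match the blacklist alternation.
 * Simplification: assumes attribute values contain no '>' and no text that
 * itself looks like a blacklisted attribute.
 */
function stripBlacklistedAttributes(string $html, string $blacklist): string
{
    $attr = '~\s+(?:' . $blacklist . ')\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)~i';
    return preg_replace_callback(
        '~<[a-z][^>]*>~i', // opening tags only
        function (array $m) use ($attr): string {
            return preg_replace($attr, '', $m[0]);
        },
        $html
    );
}

$html = '<div><p id="x" onclick="go()" class="c">Hi <font>there</font></p></div>';
echo stripBlacklistedAttributes(applyTagWhitelist($html, '<p><b>'), 'id|on[\w]+');
```

<p>The whitelist decides which elements survive at all, and the blacklist then trims what those elements may carry, so <code>class</code> is kept here while <code>id</code> and <code>onclick</code> are dropped.</p>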
<p><strong>Properties:</strong></p>
<div>
<div style="">Code (php)</div>
<p>
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$html</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Options</span>;<br />
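<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Tag_whitelist</span>=<span style="color: #ff0000;">'&lt;table&gt;&lt;tbody&gt;&lt;thead&gt;&lt;tfoot&gt;&lt;tr&gt;&lt;th&gt;&lt;td&gt;&lt;colgroup&gt;&lt;col&gt;<br />
&lt;p&gt;<br />
&lt;hr&gt;&lt;blockquote&gt;<br />
&lt;b&gt;&lt;i&gt;&lt;u&gt;&lt;sub&gt;&lt;sup&gt;&lt;strong&gt;&lt;em&gt;&lt;tt&gt;&lt;var&gt;<br />
&lt;code&gt;&lt;xmp&gt;&lt;cite&gt;&lt;pre&gt;&lt;abbr&gt;&lt;acronym&gt;&lt;address&gt;&lt;samp&gt;<br />
&lt;fieldset&gt;&lt;legend&gt;<br />
&lt;a&gt;&lt;img&gt;<br />
&lt;h1&gt;&lt;h2&gt;&lt;h3&gt;&lt;h4&gt;&lt;h5&gt;&lt;h6&gt;<br />
&lt;ul&gt;&lt;ol&gt;&lt;li&gt;&lt;dl&gt;&lt;dt&gt;<br />
&lt;frame&gt;&lt;frameset&gt;<br />
&lt;form&gt;&lt;input&gt;&lt;select&gt;&lt;option&gt;&lt;optgroup&gt;&lt;button&gt;&lt;textarea&gt;'</span>;<br />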
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Attrib_blacklist</span>=<span style="color: #ff0000;">'id|on[<span style="color: #000099; font-weight: bold;">\w</span>]+'</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$CleanUpTags</span>=<a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span><span style="color: #ff0000;">'a'</span>,<span style="color: #ff0000;">'span'</span>,<span style="color: #ff0000;">'b'</span>,<span style="color: #ff0000;">'i'</span>,<span style="color: #ff0000;">'u'</span>,<span style="color: #ff0000;">'strong'</span>,<span style="color: #ff0000;">'em'</span>,<span style="color: #ff0000;">'big'</span>,<span style="color: #ff0000;">'small'</span>,<span style="color: #ff0000;">'tt'</span>,<span style="color: #ff0000;">'var'</span>,<span style="color: #ff0000;">'code'</span>,<span style="color: #ff0000;">'xmp'</span>,<span style="color: #ff0000;">'cite'</span>,<span style="color: #ff0000;">'pre'</span>,<span style="color: #ff0000;">'abbr'</span>,<span style="color: #ff0000;">'acronym'</span>,<span style="color: #ff0000;">'address'</span>,<span style="color: #ff0000;">'q'</span>,<span style="color: #ff0000;">'samp'</span>,<span style="color: #ff0000;">'sub'</span>,<span style="color: #ff0000;">'sup'</span><span style="color: #66cc66;">)</span>;<span style="color: #808080; font-style: italic;">//array of inline tags that can be merged</span><br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$TidyConfig</span>;<br />
<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #0000ff;">$Encoding</span>=<span style="color: #ff0000;">'latin1'</span>;</p>
<p><span style="color: #0000ff;">$this</span>-&gt;<span style="color: #006600;">Options</span> = <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span><br />
<span style="color: #ff0000;">'RemoveStyles'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//removes style definitions like style and class</span><br />
<span style="color: #ff0000;">'IsWord'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Microsoft Word flag - specific operations may occur</span><br />
<span style="color: #ff0000;">'UseTidy'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//also uses the tidy engine to clean up the source (recommended)</span><br />
<span style="color: #ff0000;">'CleaningMethod'</span> =&gt; <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span>TAG_WHITELIST,ATTRIB_BLACKLIST<span style="color: #66cc66;">)</span>, <span style="color: #808080; font-style: italic;">//cleaning methods</span><br />
<span style="color: #ff0000;">'OutputXHTML'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//converts to XHTML by using TIDY.</span><br />
<span style="color: #ff0000;">'FillEmptyTableCells'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//fills empty cells with non-breaking spaces</span><br />
<span style="color: #ff0000;">'DropEmptyParas'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//drops empty paragraphs</span><br />
<span style="color: #ff0000;">'Optimize'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Optimize code - merge tags</span><br />
<span style="color: #ff0000;">'Compress'</span> =&gt; <span style="color: #000000; font-weight: bold;">false</span><span style="color: #66cc66;">)</span>; <span style="color: #808080; font-style: italic;">//trims all spaces (line breaks, tabs) between tags and between words.</span></p>
<p><span style="color: #808080; font-style: italic;">// Specify TIDY configuration</span><br />
<span style="color: #0000ff;">$this</span>-&gt;<span style="color: #006600;">TidyConfig</span> = <a href="http://www.php.net/array"><span style="color: #000066;">array</span></a><span style="color: #66cc66;">(</span><br />
<span style="color: #ff0000;">'indent'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">/*a bit slow*/</span><br />
<span style="color: #ff0000;">'output-xhtml'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Outputs the data in XHTML format</span><br />
<span style="color: #ff0000;">'word-2000'</span> =&gt; <span style="color: #000000; font-weight: bold;">false</span>, <span style="color: #808080; font-style: italic;">//Removes all proprietary data when an MS Word document has been saved as HTML</span><br />
<span style="color: #808080; font-style: italic;">//'clean' =&gt; true, /*too slow*/</span><br />
<span style="color: #ff0000;">'drop-proprietary-attributes'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Removes all attributes that are not part of a web standard</span><br />
<span style="color: #ff0000;">'hide-comments'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//Strips all comments</span><br />
<span style="color: #ff0000;">'preserve-entities'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">// preserve the well-formed entities as found in the input</span><br />
<span style="color: #ff0000;">'quote-ampersand'</span> =&gt; <span style="color: #000000; font-weight: bold;">true</span>, <span style="color: #808080; font-style: italic;">//output unadorned &amp; characters as &amp;amp;.</span><br />
<span style="color: #ff0000;">'wrap'</span> =&gt; <span style="color: #cc66cc;">200</span><span style="color: #66cc66;">)</span>; <span style="color: #808080; font-style: italic;">//Sets the number of characters allowed before a line is soft-wrapped</span></p>
</div>
<p><strong>Methods:</strong></p>
<div>
<div style="">Code (php)</div>
<p>
<span style="color: #000000; font-weight: bold;">function</span> RemoveBlacklistedAttributes<span style="color: #66cc66;">(</span><span style="color: #0000ff;">$attribs</span><span style="color: #66cc66;">)</span> <span style="color: #808080; font-style: italic;">//removes specified attributes</span><br />
<span style="color: #000000; font-weight: bold;">function</span> cleanUp<span style="color: #66cc66;">(</span><span style="color: #0000ff;">$encoding</span>=<span style="color: #ff0000;">'latin1'</span><span style="color: #66cc66;">)</span> <span style="color: #808080; font-style: italic;">//actual cleanup function</span></p>
<p>See it in action: <a href="http://luci.criosweb.ro/scripts/HTMLCleaner/" target="_blank">http://luci.criosweb.ro/scripts/HTMLCleaner/</a><br />
(Unfortunately the server has no tidy support, so only basic cleaning applies.)</p>
</div>
<p><strong>Changes:</strong> See <a href="changelog.txt">changelog</a></p>
<p><strong>Usage example:</strong></p>
<div class="ch_code_container" style="font-family: monospace;"><span style="color: #000000">
<span style="color: #0000BB">&lt;?php </span><span style="color: #FF8000">/* Created on: 12.07.2007 */ </span><span style="color: #0000BB">?&gt;</span><br />
&lt;html&gt;<br />
&lt;head&gt;<br />
&lt;title&gt;HTMLCleaner :: Clean MS Word HTML quickly&lt;/title&gt;<br />
&lt;style type="text/css"&gt;<br />
body,table {font-family:tahoma,verdana,arial;font-size:9pt}<br />
&lt;/style&gt;<br />
&lt;/head&gt;<br />
&lt;body&gt;<br />
&lt;h2&gt;CRIOS&lt;i&gt;WEB&lt;/i&gt;'s HTMLCleaner&lt;/h2&gt;<br />
&lt;form enctype="multipart/form-data" method="POST" action="<span style="color: #0000BB">&lt;?php </span><span style="color: #007700">echo </span><span style="color: #0000BB">htmlspecialchars</span><span style="color: #007700">(</span><span style="color: #0000BB">$_SERVER</span><span style="color: #007700">[</span><span style="color: #DD0000">'PHP_SELF'</span><span style="color: #007700">]); </span><span style="color: #0000BB">?&gt;</span>"&gt;<br />
HTML Document to be cleaned:<br />
&lt;input type="file" name="doc"&gt;<br />
&lt;br&gt;&lt;br&gt;<br />
&lt;input type="submit" value="Process" name="process"&gt;<br />
&lt;/form&gt;<br />
<br />
<span style="color: #0000BB">&lt;?php</span><br />
<span style="color: #007700">if(isset(</span><span style="color: #0000BB">$_POST</span><span style="color: #007700">[</span><span style="color: #DD0000">"process"</span><span style="color: #007700">])){</span><br />
<span style="color: #0000BB">$filename</span><span style="color: #007700">=</span><span style="color: #0000BB">$_FILES</span><span style="color: #007700">[</span><span style="color: #DD0000">"doc"</span><span style="color: #007700">][</span><span style="color: #DD0000">"tmp_name"</span><span style="color: #007700">];</span><br />
<br />
<span style="color: #007700">require(</span><span style="color: #DD0000">"HTMLCleaner.php"</span><span style="color: #007700">);</span><br />
<br />
<span style="color: #0000BB">$fp</span><span style="color: #007700">=</span><span style="color: #0000BB">fopen</span><span style="color: #007700">(</span><span style="color: #0000BB">$filename</span><span style="color: #007700">,</span><span style="color: #DD0000">"r"</span><span style="color: #007700">);</span><br />
<span style="color: #0000BB">$word</span><span style="color: #007700">=</span><span style="color: #0000BB">fread</span><span style="color: #007700">(</span><span style="color: #0000BB">$fp</span><span style="color: #007700">,</span><span style="color: #0000BB">filesize</span><span style="color: #007700">(</span><span style="color: #0000BB">$filename</span><span style="color: #007700">));</span><br />
<span style="color: #0000BB">fclose</span><span style="color: #007700">(</span><span style="color: #0000BB">$fp</span><span style="color: #007700">);</span><br />
<span style="color: #0000BB">unlink</span><span style="color: #007700">(</span><span style="color: #0000BB">$filename</span><span style="color: #007700">);</span><br />
<br />
<span style="color: #0000BB">$cleaner</span><span style="color: #007700">=new </span><span style="color: #0000BB">HTMLCleaner</span><span style="color: #007700">();</span><br />
<br />
<span style="color: #FF8000">/*$cleaner-&gt;Options['UseTidy']=false;<br />
$cleaner-&gt;Options['OutputXHTML']=false;*/</span><br />
<br />
<span style="color: #0000BB">$cleaner</span><span style="color: #007700">-&gt;</span><span style="color: #0000BB">Options</span><span style="color: #007700">[</span><span style="color: #DD0000">'Optimize'</span><span style="color: #007700">]=</span><span style="color: #0000BB">true</span><span style="color: #007700">;</span><br />
<br />
<span style="color: #0000BB">$cleaner</span><span style="color: #007700">-&gt;</span><span style="color: #0000BB">html</span><span style="color: #007700">=</span><span style="color: #0000BB">$word</span><span style="color: #007700">;</span><br />
<span style="color: #0000BB">$cleanHTML</span><span style="color: #007700">=</span><span style="color: #0000BB">$cleaner</span><span style="color: #007700">-&gt;</span><span style="color: #0000BB">cleanUp</span><span style="color: #007700">(</span><span style="color: #DD0000">'latin1'</span><span style="color: #007700">);</span><br />
<br />
<span style="color: #007700">echo </span><span style="color: #DD0000">'&lt;textarea style="width:100%;height:300px"&gt;'</span><span style="color: #007700">.</span><span style="color: #0000BB">htmlspecialchars</span><span style="color: #007700">(</span><span style="color: #0000BB">$cleanHTML</span><span style="color: #007700">,</span><span style="color: #0000BB">ENT_COMPAT</span><span style="color: #007700">,</span><span style="color: #DD0000">"ISO-8859-1"</span><span style="color: #007700">).</span><span style="color: #DD0000">"&lt;/textarea&gt;"</span><span style="color: #007700">;</span><br />
<span style="color: #007700">echo </span><span style="color: #0000BB">$cleanHTML</span><span style="color: #007700">;</span><br />
<br />
<span style="color: #007700">}</span><br />
<span style="color: #0000BB">?&gt;</span><br />
&lt;/body&gt;<br />
&lt;/html&gt;<br />
</span>
</div>
</body>
</html>