<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Class: Markup parser</title>
</head>
<body>
<center><h1>Class: Markup parser</h1></center>
<hr />
<ul>
<p><b>Version:</b> <tt>@(#) $Id: markup_parser.php,v 1.52 2009/08/19 09:09:49 mlemos Exp $</tt></p>
<h2><a name="table_of_contents">Contents</a></h2>
<ul>
<li><a href="#2.1.1">Summary</a></li>
<ul>
<li><a href="#3.2.0">Name</a></li>
<li><a href="#3.2.0.0">Author</a></li>
<li><a href="#3.2.0.1">Copyright</a></li>
<li><a href="#3.2.0.2">Version</a></li>
<li><a href="#3.2.0.3">Purpose</a></li>
<li><a href="#3.2.0.4">Usage</a></li>
</ul>
<li><a href="#4.1.1">Variables</a></li>
<ul>
<li><a href="#5.2.13">error</a></li>
<li><a href="#5.2.14">error_code</a></li>
<li><a href="#5.2.15">error_position</a></li>
<li><a href="#5.2.16">buffer_length</a></li>
<li><a href="#5.2.17">ignore_syntax_errors</a></li>
<li><a href="#5.2.18">warnings</a></li>
<li><a href="#5.2.19">store_positions</a></li>
<li><a href="#5.2.20">track_lines</a></li>
<li><a href="#5.2.21">tag_lower_case</a></li>
<li><a href="#5.2.22">quote_attribute_values</a></li>
<li><a href="#5.2.23">decode_entities</a></li>
<li><a href="#5.2.24">allow_grave_accent_quoting</a></li>
</ul>
<li><a href="#6.1.1">Functions</a></li>
<ul>
<li><a href="#7.2.8">GetPositionLine</a></li>
<li><a href="#9.2.9">ParseDTDExpressionValue</a></li>
<li><a href="#11.2.10">ParseAttributeList</a></li>
<li><a href="#13.2.11">StartParsing</a></li>
<li><a href="#15.2.12">Parse</a></li>
<li><a href="#17.2.13">FinishParsing</a></li>
<li><a href="#17.2.14">RewriteElement</a></li>
</ul>
</ul>
<p><a href="#table_of_contents">Top of the table of contents</a></p>
</ul>
<hr />
<ul>
<h2><li><a name="2.1.1">Summary</a></li></h2>
<ul>
<h3><a name="3.2.0">Name</a></h3>
<p>Markup parser</p>
<h3><a name="3.2.0.0">Author</a></h3>
<p>Manuel Lemos (<a href="mailto:mlemos-at-acm.org">mlemos-at-acm.org</a>)</p>
<h3><a name="3.2.0.1">Copyright</a></h3>
<p>Copyright &copy; Manuel Lemos 2009</p>
<h3><a name="3.2.0.2">Version</a></h3>
<p>@(#) $Id: markup_parser.php,v 1.52 2009/08/19 09:09:49 mlemos Exp $</p>
<h3><a name="3.2.0.3">Purpose</a></h3>
<p>Parse HTML and other markup based documents.</p>
<h3><a name="3.2.0.4">Usage</a></h3>
<p>Use the <tt><a href="#function_StartParsing">StartParsing</a></tt> function to initialize the parser. Then use the <tt><a href="#function_Parse">Parse</a></tt> function to make the class parse markup data, which may be read from files. When you are done feeding the whole document data, call the <tt><a href="#function_FinishParsing">FinishParsing</a></tt> function. </p>
<p> The <tt><a href="#function_Parse">Parse</a></tt> function returns arrays of tokens that describe each document element. The <tt><a href="#function_RewriteElement">RewriteElement</a></tt> function can be used to convert the tokens back to markup document strings. </p>
<p> Element tokens are associated with their respective positions in the document. Positions are numbers that represent offsets relative to the beginning of the document. The <tt><a href="#function_GetPositionLine">GetPositionLine</a></tt> function can return the line and column number associated with a given document position if the <tt><a href="#variable_track_lines">track_lines</a></tt> variable is set to 1. </p>
<p> The <tt><a href="#function_ParseDTDExpressionValue">ParseDTDExpressionValue</a></tt> and <tt><a href="#function_ParseAttributeList">ParseAttributeList</a></tt> functions can be used to parse expressions that may appear in DTD markup elements.</p>
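<p> The calls described above can be combined as in the following minimal sketch. It assumes that the class is named <tt>markup_parser_class</tt> and that it is defined in the <tt>markup_parser.php</tt> file named in the version string. The sample markup and the variable names are only illustrative.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed location of the class file

$parser = new markup_parser_class; // assumed class name

// The parameters are documented as input and output, so pass a variable.
$parameters = array('Data' =&gt; '&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;');

$tokens = array();
if ($parser-&gt;StartParsing($parameters)) {
    do {
        // Stop on fatal errors, otherwise $end might never be set.
        if (!$parser-&gt;Parse($end, $elements)) {
            break;
        }
        // Collect the element tokens returned by this call.
        foreach ($elements as $element) {
            $tokens[] = $element;
        }
    } while (!$end);
    $parser-&gt;FinishParsing();
}

if (strlen($parser-&gt;error)) {
    echo 'Parsing failed: ', $parser-&gt;error, "\n";
} else {
    echo count($tokens), " element tokens were parsed\n";
}
?&gt;
</pre>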
<p><a href="#table_of_contents">Table of contents</a></p>
</ul>
</ul>
<hr />
<ul>
<h2><li><a name="variables"></a><a name="4.1.1">Variables</a></li></h2>
<ul>
<li><tt><a href="#variable_error">error</a></tt></li><br />
<li><tt><a href="#variable_error_code">error_code</a></tt></li><br />
<li><tt><a href="#variable_error_position">error_position</a></tt></li><br />
<li><tt><a href="#variable_buffer_length">buffer_length</a></tt></li><br />
<li><tt><a href="#variable_ignore_syntax_errors">ignore_syntax_errors</a></tt></li><br />
<li><tt><a href="#variable_warnings">warnings</a></tt></li><br />
<li><tt><a href="#variable_store_positions">store_positions</a></tt></li><br />
<li><tt><a href="#variable_track_lines">track_lines</a></tt></li><br />
<li><tt><a href="#variable_tag_lower_case">tag_lower_case</a></tt></li><br />
<li><tt><a href="#variable_quote_attribute_values">quote_attribute_values</a></tt></li><br />
<li><tt><a href="#variable_decode_entities">decode_entities</a></tt></li><br />
<li><tt><a href="#variable_allow_grave_accent_quoting">allow_grave_accent_quoting</a></tt></li><br />
<p><a href="#table_of_contents">Table of contents</a></p>
<h3><a name="variable_error"></a><li><a name="5.2.13">error</a></li></h3>
<h3>Type</h3>
<p><tt><i>string</i></tt></p>
<h3>Default value</h3>
<p><tt>''</tt></p>
<h3>Purpose</h3>
<p>Store the message that is returned when an error occurs.</p>
<h3>Usage</h3>
<p>Check this variable to understand what happened when a call to any of the class functions has failed.</p>
<p> This class uses cumulative error handling. This means that if a class function that may fail is called while this variable is already set to an error message due to a failure in a previous call to the same or another function, the function also fails and does nothing.</p>
<p> This allows programs using this class to safely call several functions that may fail and only check the failure condition after the last function call.</p>
<p> Just set this variable to an empty string to clear the error condition.</p>
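<p> As an illustration of cumulative error handling, the following sketch calls several functions in sequence and checks the error only once at the end. The class name <tt>markup_parser_class</tt> and the file name <tt>markup_parser.php</tt> are assumptions. A single <tt>Parse</tt> call is used only to keep the example short; it may not consume a whole document.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed location of the class file

$parser = new markup_parser_class; // assumed class name
$parameters = array('Data' =&gt; '&lt;p&gt;Short document&lt;/p&gt;');

// None of the return values is checked here. If one call fails,
// the following calls are documented to fail too and do nothing.
$parser-&gt;StartParsing($parameters);
$parser-&gt;Parse($end, $elements);
$parser-&gt;FinishParsing();

// A single check after the last call covers all the calls above.
if (strlen($parser-&gt;error)) {
    echo 'Parsing failed: ', $parser-&gt;error, "\n";
    // Clear the error condition before using the object again.
    $parser-&gt;error = '';
}
?&gt;
</pre>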
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_error_code"></a><li><a name="5.2.14">error_code</a></li></h3>
<h3>Type</h3>
<p><tt><i>int</i></tt></p>
<h3>Default value</h3>
<p><tt>0</tt></p>
<h3>Purpose</h3>
<p>Store the code that is returned when an error occurs.</p>
<h3>Usage</h3>
<p>Check this variable to understand what happened when a call to any of the class functions has failed. It may be set to several possible error codes defined as constants:</p>
<p> <tt>MARKUP_PARSER_ERROR_NONE</tt> - No error happened </p>
<p> <tt>MARKUP_PARSER_ERROR_UNEXPECTED</tt> - A condition was found that the class is not yet ready to handle </p>
<p> <tt>MARKUP_PARSER_ERROR_INVALID_SYNTAX</tt> - A syntax error was found </p>
<p> <tt>MARKUP_PARSER_ERROR_INVALID_USAGE</tt> - An invalid value was passed to the class function parameters or set to the class variables</p>
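<p> A program may branch on these constants to report errors differently, as in the sketch below. The <tt>describe_parser_error</tt> helper is hypothetical and not part of the class. The class name <tt>markup_parser_class</tt> and the file name <tt>markup_parser.php</tt> are assumptions.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed to define the class and the constants

// Hypothetical helper that turns the error state into a readable message.
function describe_parser_error($parser)
{
    switch ($parser-&gt;error_code) {
        case MARKUP_PARSER_ERROR_NONE:
            return 'no error';
        case MARKUP_PARSER_ERROR_INVALID_SYNTAX:
            return 'syntax error at position ' . $parser-&gt;error_position
                . ': ' . $parser-&gt;error;
        case MARKUP_PARSER_ERROR_INVALID_USAGE:
            return 'invalid usage: ' . $parser-&gt;error;
        case MARKUP_PARSER_ERROR_UNEXPECTED:
        default:
            return 'unexpected condition: ' . $parser-&gt;error;
    }
}

$parser = new markup_parser_class; // assumed class name
$parameters = array('Data' =&gt; '&lt;p&gt;Some text&lt;/p&gt;');
if ($parser-&gt;StartParsing($parameters)) {
    $parser-&gt;Parse($end, $elements);
    $parser-&gt;FinishParsing();
}
echo describe_parser_error($parser), "\n";
?&gt;
</pre>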
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_error_position"></a><li><a name="5.2.15">error_position</a></li></h3>
<h3>Type</h3>
<p><tt><i>int</i></tt></p>
<h3>Default value</h3>
<p><tt>-1</tt></p>
<h3>Purpose</h3>
<p>Point to the position in the markup data or file at which the last error occurred.</p>
<h3>Usage</h3>
<p>Check this variable to determine the relevant position of the document when a parsing error occurs. A negative value indicates that there was no error or that the last error is not associated with a specific document position.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_buffer_length"></a><li><a name="5.2.16">buffer_length</a></li></h3>
<h3>Type</h3>
<p><tt><i>int</i></tt></p>
<h3>Default value</h3>
<p><tt>8000</tt></p>
<h3>Purpose</h3>
<p>Maximum length of the chunks of markup data read from files that the class parses at one time.</p>
<h3>Usage</h3>
<p>Adjust this value according to the available memory.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_ignore_syntax_errors"></a><li><a name="5.2.17">ignore_syntax_errors</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>1</tt></p>
<h3>Purpose</h3>
<p>Specify whether the class should ignore syntax errors in malformed documents.</p>
<h3>Usage</h3>
<p>Set this variable to 0 if it is necessary to verify whether markup data may be corrupted due to possible bugs in the program that generated the document.</p>
<p> Currently the class only ignores some types of syntax errors. Other syntax errors may still cause the <tt><a href="#function_Parse">Parse</a></tt> function to fail.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_warnings"></a><li><a name="5.2.18">warnings</a></li></h3>
<h3>Type</h3>
<p><tt><i>array</i></tt></p>
<h3>Default value</h3>
<p><tt>array()</tt></p>
<h3>Purpose</h3>
<p>Return a list of positions of the original document that contain syntax errors.</p>
<h3>Usage</h3>
<p>Check this variable to retrieve any document syntax errors that were ignored because the <tt><a href="#variable_ignore_syntax_errors">ignore_syntax_errors</a></tt> variable is set to 1.</p>
<p> The indexes of this array are the positions of the errors. The array values are the corresponding syntax error messages.</p>
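<p> The following sketch lists the ignored syntax errors after parsing, converting each position to a line and column when possible. The class name <tt>markup_parser_class</tt> and the file name <tt>markup_parser.php</tt> are assumptions. Whether the sample markup actually produces warnings depends on which syntax errors the class ignores.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed location of the class file

$parser = new markup_parser_class; // assumed class name
$parser-&gt;ignore_syntax_errors = 1; // record syntax errors as warnings
$parser-&gt;track_lines = 1;          // needed by GetPositionLine

$parameters = array('Data' =&gt; "&lt;p&gt;First line&lt;/p&gt;\n&lt;p &lt;b&gt;Second line&lt;/p&gt;\n");
if ($parser-&gt;StartParsing($parameters)) {
    do {
        if (!$parser-&gt;Parse($end, $elements)) {
            break;
        }
    } while (!$end);
    $parser-&gt;FinishParsing();
}

// The array indexes are positions and the values are messages.
foreach ($parser-&gt;warnings as $position =&gt; $message) {
    if ($parser-&gt;GetPositionLine($position, $line, $column)) {
        echo "Line $line, column $column: $message\n";
    } else {
        echo "Position $position: $message\n";
    }
}
?&gt;
</pre>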
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_store_positions"></a><li><a name="5.2.19">store_positions</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>1</tt></p>
<h3>Purpose</h3>
<p>Tell the class to return the position of each document element token.</p>
<h3>Usage</h3>
<p>Set this variable to 0 if you do not need to know the position of each parsed markup element.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_track_lines"></a><li><a name="5.2.20">track_lines</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>0</tt></p>
<h3>Purpose</h3>
<p>Tell the class to keep track of the position of each document line.</p>
<h3>Usage</h3>
<p>Set this variable to 1 if you need to determine the line and column number associated to a given position of the parsed document.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_tag_lower_case"></a><li><a name="5.2.21">tag_lower_case</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>1</tt></p>
<h3>Purpose</h3>
<p>Tell the class to lower the case of tag and attribute names in the <tt><a href="#function_RewriteElement">RewriteElement</a></tt> function.</p>
<h3>Usage</h3>
<p>Set this variable to 0 when you want to preserve the original case of the tags and attributes being rewritten.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_quote_attribute_values"></a><li><a name="5.2.22">quote_attribute_values</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>1</tt></p>
<h3>Purpose</h3>
<p>Tell the class to always quote attribute values in the <tt><a href="#function_RewriteElement">RewriteElement</a></tt> function.</p>
<h3>Usage</h3>
<p>Set this variable to 0 when you want attribute values to be quoted only when they contain spaces, tabs or line break characters.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_decode_entities"></a><li><a name="5.2.23">decode_entities</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>0</tt></p>
<h3>Purpose</h3>
<p>Tell the class to decode all the character entities in character data or tag attributes.</p>
<h3>Usage</h3>
<p>Set this variable to 1 if you need to get all the character data or tag attributes with character entities already decoded.</p>
<p><a href="#variables">Variables</a></p>
<h3><a name="variable_allow_grave_accent_quoting"></a><li><a name="5.2.24">allow_grave_accent_quoting</a></li></h3>
<h3>Type</h3>
<p><tt><i>bool</i></tt></p>
<h3>Default value</h3>
<p><tt>1</tt></p>
<h3>Purpose</h3>
<p>Tell the class to allow grave accent characters as delimiters for quoted tag attributes.</p>
<h3>Usage</h3>
<p>Set this variable to 0 if you want the class to be strict and not accept grave accent characters to quote tag attribute values.</p>
<p><a href="#variables">Variables</a></p>
<p><a href="#table_of_contents">Table of contents</a></p>
</ul>
</ul>
<hr />
<ul>
<h2><li><a name="functions"></a><a name="6.1.1">Functions</a></li></h2>
<ul>
<li><tt><a href="#function_GetPositionLine">GetPositionLine</a></tt></li><br />
<li><tt><a href="#function_ParseDTDExpressionValue">ParseDTDExpressionValue</a></tt></li><br />
<li><tt><a href="#function_ParseAttributeList">ParseAttributeList</a></tt></li><br />
<li><tt><a href="#function_StartParsing">StartParsing</a></tt></li><br />
<li><tt><a href="#function_Parse">Parse</a></tt></li><br />
<li><tt><a href="#function_FinishParsing">FinishParsing</a></tt></li><br />
<li><tt><a href="#function_RewriteElement">RewriteElement</a></tt></li><br />
<p><a href="#table_of_contents">Table of contents</a></p>
<h3><a name="function_GetPositionLine"></a><li><a name="7.2.8">GetPositionLine</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> GetPositionLine(</tt><ul>
<tt><i>int</i> </tt><tt><a href="#argument_GetPositionLine_position">position</a></tt><tt>,</tt><br />
<tt>(output) <i>int &amp;</i> </tt><tt><a href="#argument_GetPositionLine_line">line</a></tt><tt>,</tt><br />
<tt>(output) <i>int &amp;</i> </tt><tt><a href="#argument_GetPositionLine_column">column</a></tt></ul>
<tt>)</tt></p>
<h3>Purpose</h3>
<p>Get the line and column numbers of the document that correspond to a given position.</p>
<h3>Usage</h3>
<p>Pass the document offset number as the position to be located. Make sure the <tt><a href="#variable_track_lines">track_lines</a></tt> variable is set to 1 before parsing the document.</p>
<h3>Arguments</h3>
<ul>
<p><tt><b><a name="argument_GetPositionLine_position">position</a></b></tt> - Document position to be located.</p>
<p><tt><b><a name="argument_GetPositionLine_line">line</a></b></tt> - Returns the number of the line that corresponds to the given document position.</p>
<p><tt><b><a name="argument_GetPositionLine_column">column</a></b></tt> - Returns the number of the column of the line that corresponds to the given document position.</p>
</ul>
<h3>Return value</h3>
<p>This function returns 1 if the <tt><a href="#variable_track_lines">track_lines</a></tt> variable is set to 1 and it was given a valid positive position number that does not exceed the position of the last parsed document line.</p>
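<p> As a minimal sketch, the code below parses a small multi-line document with line tracking enabled and then looks up a few arbitrary positions. The class name <tt>markup_parser_class</tt>, the file name <tt>markup_parser.php</tt> and the sample positions are assumptions chosen for illustration.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed location of the class file

$parser = new markup_parser_class; // assumed class name
$parser-&gt;track_lines = 1; // must be set before parsing the document

$parameters = array('Data' =&gt; "&lt;html&gt;\n&lt;body&gt;\n&lt;p&gt;Hello&lt;/p&gt;\n&lt;/body&gt;\n&lt;/html&gt;\n");
if ($parser-&gt;StartParsing($parameters)) {
    do {
        if (!$parser-&gt;Parse($end, $elements)) {
            break;
        }
    } while (!$end);
    $parser-&gt;FinishParsing();
}

// Look up a few document offsets chosen only for illustration.
foreach (array(1, 8, 16) as $position) {
    if ($parser-&gt;GetPositionLine($position, $line, $column)) {
        echo "Position $position is at line $line, column $column\n";
    } else {
        echo "Position $position could not be located\n";
    }
}
?&gt;
</pre>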
<p><a href="#functions">Functions</a></p>
<h3><a name="function_ParseDTDExpressionValue"></a><li><a name="9.2.9">ParseDTDExpressionValue</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> ParseDTDExpressionValue(</tt><ul>
<tt><i>string</i> </tt><tt><a href="#argument_ParseDTDExpressionValue_value">value</a></tt><tt>,</tt><br />
<tt>(output) <i>array</i> </tt><tt><a href="#argument_ParseDTDExpressionValue_expression">expression</a></tt></ul>
<tt>)</tt></p>
<h3>Purpose</h3>
<p>Parse the value of an element expression used in a DTD.</p>
<h3>Usage</h3>
<p>Use only if you need to expand entity values when parsing DTDs.</p>
<h3>Arguments</h3>
<ul>
<p><tt><b><a name="argument_ParseDTDExpressionValue_value">value</a></b></tt> - DTD expression value to be parsed.</p>
<p><tt><b><a name="argument_ParseDTDExpressionValue_expression">expression</a></b></tt> - Array that defines the types and values of the parsed DTD expression.</p>
</ul>
<h3>Return value</h3>
<p>Returns 1 if it is given a valid DTD expression value.</p>
<p><a href="#functions">Functions</a></p>
<h3><a name="function_ParseAttributeList"></a><li><a name="11.2.10">ParseAttributeList</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> ParseAttributeList(</tt><ul>
<tt><i>string</i> </tt><tt><a href="#argument_ParseAttributeList_value">value</a></tt><tt>,</tt><br />
<tt>(output) <i>array</i> </tt><tt><a href="#argument_ParseAttributeList_attlist">attlist</a></tt></ul>
<tt>)</tt></p>
<h3>Purpose</h3>
<p>Parse the value of an attribute list expression used in a DTD.</p>
<h3>Usage</h3>
<p>Use only if you need to expand attribute list values when parsing DTDs.</p>
<h3>Arguments</h3>
<ul>
<p><tt><b><a name="argument_ParseAttributeList_value">value</a></b></tt> - Attribute list expression value to be parsed.</p>
<p><tt><b><a name="argument_ParseAttributeList_attlist">attlist</a></b></tt> - Array that defines the types and values of the parsed DTD attribute list expression.</p>
</ul>
<h3>Return value</h3>
<p>Returns 1 if it is given a valid DTD attribute list expression value.</p>
<p><a href="#functions">Functions</a></p>
<h3><a name="function_StartParsing"></a><li><a name="13.2.11">StartParsing</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> StartParsing(</tt><ul>
<tt>(input and output) <i>array</i> </tt><tt><a href="#argument_StartParsing_parameters">parameters</a></tt></ul>
<tt>)</tt></p>
<h3>Purpose</h3>
<p>Initialize the state of the markup parser.</p>
<h3>Usage</h3>
<p>Call this function before you start parsing the markup document, passing the file name or the data to be parsed and, optionally, other parsing option parameters.</p>
<h3>Arguments</h3>
<ul>
<p><tt><b><a name="argument_StartParsing_parameters">parameters</a></b></tt> - Specifies a list of options that define how to parse the given document. Currently it has the following options: </p>
<p> <tt>Data</tt> - String with the markup data to be parsed </p>
<p> <tt>File</tt> - Name of the file from which the data to be parsed should be read instead of a static string. </p>
<p> <tt>DecodeEntities</tt> - Alternative way to set the option for determining whether the class should decode character entities, as described for the <tt><a href="#variable_decode_entities">decode_entities</a></tt> variable. </p>
</ul>
<h3>Return value</h3>
<p>Returns 1 if all parameters are correctly defined.</p>
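<p> The sketch below initializes the parser to read from a file with entity decoding enabled. The file name <tt>document.html</tt> is hypothetical, and the class name <tt>markup_parser_class</tt> and the file name <tt>markup_parser.php</tt> are assumptions.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed location of the class file

$parser = new markup_parser_class; // assumed class name
$parser-&gt;buffer_length = 4000;     // read the file in smaller chunks

// The parameters are documented as input and output, so pass a variable.
$parameters = array(
    'File' =&gt; 'document.html',     // hypothetical file name
    'DecodeEntities' =&gt; 1          // same effect as setting decode_entities
);

if ($parser-&gt;StartParsing($parameters)) {
    echo "The parser is ready\n";
    // Calls to Parse would go here.
    $parser-&gt;FinishParsing();
} else {
    echo 'Could not start parsing: ', $parser-&gt;error, "\n";
}
?&gt;
</pre>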
<p><a href="#functions">Functions</a></p>
<h3><a name="function_Parse"></a><li><a name="15.2.12">Parse</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> Parse(</tt><ul>
<tt>(output) <i>bool &amp;</i> </tt><tt><a href="#argument_Parse_end">end</a></tt><tt>,</tt><br />
<tt>(output) <i>array</i> </tt><tt><a href="#argument_Parse_elements">elements</a></tt></ul>
<tt>)</tt></p>
<h3>Purpose</h3>
<p>Parse the markup document.</p>
<h3>Usage</h3>
<p>Call this function iteratively until the <tt><a href="#argument_Parse_end">end</a></tt> argument is returned set to 1.</p>
<h3>Arguments</h3>
<ul>
<p><tt><b><a name="argument_Parse_end">end</a></b></tt> - Determine when the parser reached the end of the document.</p>
<p><tt><b><a name="argument_Parse_elements">elements</a></b></tt> - Return a sequence of associative arrays with entries that describe each document element that was parsed.</p>
</ul>
<h3>Return value</h3>
<p>Returns 1 if there were no fatal parsing errors.</p>
<p><a href="#functions">Functions</a></p>
<h3><a name="function_FinishParsing"></a><li><a name="17.2.13">FinishParsing</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> FinishParsing(</tt><tt>)</tt></p>
<h3>Purpose</h3>
<p>Close any files and release any resources allocated while the document was being parsed.</p>
<h3>Usage</h3>
<p>Call this function after you are done with parsing the markup document.</p>
<h3>Return value</h3>
<p>Returns 1 if all resources were successfully released.</p>
<p><a href="#functions">Functions</a></p>
<h3><a name="function_RewriteElement"></a><li><a name="17.2.14">RewriteElement</a></li></h3>
<h3>Synopsis</h3>
<p><tt><i>bool</i> RewriteElement(</tt><ul>
<tt>(input and output) <i>array</i> </tt><tt><a href="#argument_RewriteElement_element">element</a></tt><tt>,</tt><br />
<tt>(output) <i>string &amp;</i> </tt><tt><a href="#argument_RewriteElement_markup">markup</a></tt></ul>
<tt>)</tt></p>
<h3>Purpose</h3>
<p>Generate a string for a previously parsed document markup element.</p>
<h3>Usage</h3>
<p>Call this function for each markup element when you want to regenerate an element that was just parsed and possibly filtered.</p>
<h3>Arguments</h3>
<ul>
<p><tt><b><a name="argument_RewriteElement_element">element</a></b></tt> - Associative array that defines the type and the values of the document element to be rewritten.</p>
<p><tt><b><a name="argument_RewriteElement_markup">markup</a></b></tt> - Return the string of the rewritten document element.</p>
</ul>
<h3>Return value</h3>
<p>Returns 0 if it is passed an invalid element definition.</p>
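<p> For example, a document can be parsed and rebuilt from its element tokens as sketched below. The <tt>rewrite_markup</tt> helper is hypothetical and not part of the class. The class name <tt>markup_parser_class</tt> and the file name <tt>markup_parser.php</tt> are assumptions. A filter could inspect or skip each element before it is rewritten.</p>
<pre>
&lt;?php
require('markup_parser.php'); // assumed location of the class file

// Hypothetical helper: parse $data and rebuild it from the parsed tokens.
function rewrite_markup($parser, $data)
{
    $parameters = array('Data' =&gt; $data);
    $output = '';
    if (!$parser-&gt;StartParsing($parameters)) {
        return $output;
    }
    do {
        if (!$parser-&gt;Parse($end, $elements)) {
            break;
        }
        foreach ($elements as $element) {
            // RewriteElement returns 0 for invalid element definitions.
            if (!$parser-&gt;RewriteElement($element, $markup)) {
                break 2;
            }
            $output .= $markup;
        }
    } while (!$end);
    $parser-&gt;FinishParsing();
    return $output;
}

$parser = new markup_parser_class;    // assumed class name
$parser-&gt;tag_lower_case = 1;          // lower the case of tag and attribute names
$parser-&gt;quote_attribute_values = 1;  // always quote attribute values

$markup = rewrite_markup($parser, '&lt;P CLASS=intro&gt;Hello&lt;/P&gt;');
if (strlen($parser-&gt;error)) {
    echo 'Rewriting failed: ', $parser-&gt;error, "\n";
} else {
    echo $markup, "\n";
}
?&gt;
</pre>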
<p><a href="#functions">Functions</a></p>
<p><a href="#table_of_contents">Table of contents</a></p>
</ul>
</ul>

<hr />
<address>Manuel Lemos (<a href="mailto:mlemos-at-acm.org">mlemos-at-acm.org</a>)</address>
</body>
</html>