Location: PHPKode > projects > Obsessive Website Statistics > ows/docs/dimensions.txt
OWS Data Storage Documentation
==========================

	1. Introduction
	2. Theory
	3. Implementation

1. Introduction

OWS implements a star schema for the primary data storage. It is critical
that plugins adhere to the structure of the schema to ensure accurate
data analysis by the user. This discusses naming conventions in the star
schema that has been implemented. Try to follow it unless it makes sense
to do otherwise. :) 

Ideally, OWS has wrapped all of this stuff in a nice easy to use abstraction
layer so that you won't need to understand any of this. However, it is 
documented here just in case that isn't the case.

2. Theory

OWS needs to provide useful analysis possibilities to its users, quickly.
Originally, OWS used a flat table to store all logfile data so that it was 
simple to upload and (relatively) simple to analyze. However, it soon became 
extremely slow when table sizes got into the millions of rows. So, data storage
now is in a star schema, and retrieval performance has increased significantly.
Of course, this is at the expense of insertion performance. However, 
retrieval speed is really what matters here. Using the initial version of the
star schema over the old flat scheme yields around a 75% speedup on tables of
100,000 rows, and possibly more for bigger tables. This goes along with a 50%
reduction in table sizes (though, index size has gone way up). In any case, 
I haven't done any formal benchmarking at this time. 

Without getting too much into the theory of star schemas (there are a number
of excellent books and websites that you can reference), the way OWS is organized
on a per-domain basis is as follows.

At the base is a fact table. It contains a lot of keys, which are references
to rows in dimension tables. By using joins to the other dimensional tables,
it should be possible to recreate the stored logfile data. (in fact, there is a
function inside of dimensions.inc.php which can generate an SQL query to do
just that)
	
Every useful type of data gleaned from the logfile data is stored as a 'dimension'
of the star schema. For example: date, referrer, agent... each dimension is
stored in its table.

Each dimension table contains columns which are attributes of that dimension. For
example, the referrer dimension may contain attributes such as 'is external'. 
Attributes try to define the primary data of the dimension in useful ways. 
Generally, the data can be gleaned from the primary data field, but storing it
as an attribute speeds retrieval and analysis later.

In the future, aggregate tables and data pruning should be implemented.


3. Implementation

Each dimension is implemented in its own table. Each dimension is named 
something like

	websitename_dimensionname
	
Please note that the dimension 'config' is reserved. The primary key of 
each dimension is always named like
	
	websitename_id
	
and, there always exists a field 'dimensionname' in the table in question,
which should contain the main data of the field. Each dimension table
will contain attributes for the piece of data. For example, a 'date' 
dimension could hold information such as

hostname_date
	
	date_id			(primary key)
	date			(primary data field)
	day_of_month	(attribute)
	day_of_week		(attribute)
	is_weekend		(attribute)
	... 
	
and so on. Initially, each dimension will be a different element of the
logfile line (host, bytes, agent.. etc). However, plugins can either extend 
the dimension that it uses by adding more attributes to a dimension, or
they can add new dimensions to the data. 

Attributes do not necessarily have to be unique, and may be other keys
pointing to snowflakes of the star schema. However, at this time there is
not builtin support to query and/or create these.

It should be noted that each dimension 'belongs' to a plugin that creates
it. This means the plugin MUST define the primary data field when
dimensions are defined. Plugins that extend a dimension do not need to 
define other fields. The ID field will always be defined by the internal
dimension creation routines. 

Each website will have a 'fact table' that joins all the dimensions
together in a way that makes sense. It will always be called

	websitename

and will have keys pointing to each dimension of the schema. For example:

	websitename

		id					(primary key)
		dimension1_id
		dimension2_id
		dimension3_id

In theory, there should probably be useful facts stored in the fact table, 
instead of just links to each table. However, I don't see this being 
particularly useful atm... let me know if you see a good use. (Note to self:
what about storing visit data there??)
		
In the future, we will introduce aggregate tables. Support does not exist
for these at the moment, but will probably have the same type of naming 
convention as described above.
Return current item: Obsessive Website Statistics