OWS Data Storage Documentation ========================== 1. Introduction 2. Theory 3. Implementation 1. Introduction OWS implements a star schema for the primary data storage. It is critical that plugins adhere to the structure of the schema to ensure accurate data analysis by the user. This discusses naming conventions in the star schema that has been implemented. Try to follow it unless it makes sense to do otherwise. :) Ideally, OWS has wrapped all of this stuff in a nice easy to use abstraction layer so that you won't need to understand any of this. However, it is documented here just in case that isn't the case. 2. Theory OWS needs to provide useful analysis possibilities to its users, quickly. Originally, OWS used a flat table to store all logfile data so that it was simple to upload and (relatively) simple to analyze. However, it soon became extremely slow when table sizes got into the millions of rows. So, data storage now is in a star schema, and retrieval performance has increased significantly. Of course, this is at the expense of insertion performance. However, retrieval speed is really what matters here. Using the initial version of the star schema over the old flat scheme yields around a 75% speedup on tables of 100,000 rows, and possibly more for bigger tables. This goes along with a 50% reduction in table sizes (though, index size has gone way up). In any case, I haven't done any formal benchmarking at this time. Without getting too much into the theory of star schemas (there are a number of excellent books and websites that you can reference), the way OWS is organized on a per-domain basis is as follows. At the base is a fact table. It contains a lot of keys, which are references to rows in dimension tables. By using joins to the other dimensional tables, it should be possible to recreate the stored logfile data. (in fact, there is a function inside of dimensions.inc.php which can generate an SQL query to do just that) Every useful type of data gleaned from the logfile data is stored as a 'dimension' of the star schema. For example: date, referrer, agent... each dimension is stored in its table. Each dimension table contains columns which are attributes of that dimension. For example, the referrer dimension may contain attributes such as 'is external'. Attributes try to define the primary data of the dimension in useful ways. Generally, the data can be gleaned from the primary data field, but storing it as an attribute speeds retrieval and analysis later. In the future, aggregate tables and data pruning should be implemented. 3. Implementation Each dimension is implemented in its own table. Each dimension is named something like websitename_dimensionname Please note that the dimension 'config' is reserved. The primary key of each dimension is always named like websitename_id and, there always exists a field 'dimensionname' in the table in question, which should contain the main data of the field. Each dimension table will contain attributes for the piece of data. For example, a 'date' dimension could hold information such as hostname_date date_id (primary key) date (primary data field) day_of_month (attribute) day_of_week (attribute) is_weekend (attribute) ... and so on. Initially, each dimension will be a different element of the logfile line (host, bytes, agent.. etc). However, plugins can either extend the dimension that it uses by adding more attributes to a dimension, or they can add new dimensions to the data. Attributes do not necessarily have to be unique, and may be other keys pointing to snowflakes of the star schema. However, at this time there is not builtin support to query and/or create these. It should be noted that each dimension 'belongs' to a plugin that creates it. This means the plugin MUST define the primary data field when dimensions are defined. Plugins that extend a dimension do not need to define other fields. The ID field will always be defined by the internal dimension creation routines. Each website will have a 'fact table' that joins all the dimensions together in a way that makes sense. It will always be called websitename and will have keys pointing to each dimension of the schema. For example: websitename id (primary key) dimension1_id dimension2_id dimension3_id In theory, there should probably be useful facts stored in the fact table, instead of just links to each table. However, I don't see this being particularly useful atm... let me know if you see a good use. (Note to self: what about storing visit data there??) In the future, we will introduce aggregate tables. Support does not exist for these at the moment, but will probably have the same type of naming convention as described above.