Wednesday, June 4, 2008

A Real Time Web Analytics System

Table of Contents
1.0 Introduction
2.0 The Requirements
3.0 The Architecture
4.0 The Reports
5.0 The Implementation
6.0 Conclusion

1.0 Introduction
This document describes an implementation of a real-time web log capture and reporting system. The system was developed to provide real-time reports measuring traffic parameters such as page views, visits, and unique visitors, and was designed and built to replace a batch system that generated reports in a deferred mode and did not allow real-time monitoring of, and action on, the various online services.

2.0 The Requirements
The existing system generated reports on the previous day's logs rather than in real time; it could not be scaled up, was not equipped to handle heavy traffic, offered no scope for adding new services, and did not allow logs to be added or edited.
The real-time web log capture and reporting system was to provide:
o Real-time web log capture from web servers at geographically dispersed locations
o A robust web logs data warehouse
o Extensive real-time reports from the web logs
The advantages of this system would be:
• Data can be accessed in real time
• The process can be scaled up to handle more traffic
• A new service can be added, or an existing service removed, and its reports are available from the very next day
• Logs can be added and modified
The system was required to capture, collate, and aggregate the web logs that accumulate on the web/app servers, with the aggregates produced in near real time. Hence a multi-layer architecture was needed, comprising a layer of capture agents deployed on every web/app server, a layer of collation servers that collate data from the capture agents, and a layer of computation servers that aggregate data at high speed. This architecture writes the aggregate data to industry-standard RDBMS tables, which are then queried to populate the reporting screens; the aggregate tables themselves are updated in near real time.
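
As an illustration of the capture layer, here is a minimal sketch of a capture agent that tails a web server access log and parses each entry into a structured record. The log format (NCSA Common Log Format), field names, and file path are assumptions made for illustration; the document does not specify the actual format the agents parsed.

    import re
    import time

    # Assumed: web servers write NCSA Common Log Format; the real system may differ.
    CLF_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\S+)'
    )

    def parse_line(line):
        """Turn one raw access-log line into a dict, or None if it does not match."""
        match = CLF_PATTERN.match(line)
        return match.groupdict() if match else None

    def follow(log_path):
        """Yield new lines appended to the access log (a simple 'tail -f')."""
        with open(log_path) as log:
            log.seek(0, 2)                      # start at the end of the file
            while True:
                line = log.readline()
                if not line:
                    time.sleep(0.5)             # wait for the web server to write more
                    continue
                yield line

    if __name__ == "__main__":
        # Hypothetical path; each web/app server would point at its own log.
        for raw in follow("/var/log/httpd/access_log"):
            record = parse_line(raw)
            if record:
                print(record)                   # a real agent would forward this to L2
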

3.0 The Architecture
The overall architecture thus comprises four layers, from capture through collation and computation (which updates the database tables) to reporting.


Fig 1. The LogWatch Architecture


The architecture has four layers: Collation clients (L1), Collation servers (L2), Computation servers (L3), and a Reporting server (L4), together with a database server that stores the aggregated results. By design the architecture is completely scalable in the first three layers (L1, L2, L3). All layers communicate with each other over TCP/IP.
Each collation client in L1 connects to one Collation server in L2, with a maximum of 30 collation clients per Collation server. Primary/back-up fail-over is provided: if one of the collation servers fails, the clients connected to it automatically shift to other servers in the cluster.
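A minimal sketch of this client-side fail-over is given below. The server host names, port, and retry behaviour are illustrative assumptions, not the system's actual configuration.

    import socket

    # Hypothetical addresses of the collation servers in the L2 cluster.
    COLLATION_SERVERS = [("collator-1.example.com", 9000),
                         ("collator-2.example.com", 9000)]

    def connect_with_failover(servers):
        """Try each collation server in turn and return the first live connection."""
        for host, port in servers:
            try:
                return socket.create_connection((host, port), timeout=5)
            except OSError:
                continue                       # this server is down, try the next one
        raise RuntimeError("no collation server reachable")

    def send_record(conn, record_line):
        """Ship one parsed log record (newline-delimited text) to the collation server."""
        conn.sendall(record_line.encode("utf-8") + b"\n")
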
The computation is distributed across the computation servers (L3) by service: the computation required for a service is handled by that service's Computation server. Primary/back-up fail-over is not possible in this layer. If required, the architecture allows a single service's computation to be spread over more than one server (for example, two servers can perform the computations for a service such as e-mail).
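The following sketch illustrates how records might be routed to computation servers keyed by service; the service names and the address table are assumptions made for illustration.

    # Hypothetical mapping from service name to its computation server (L3).
    COMPUTATION_SERVERS = {
        "mail":   ("compute-mail.example.com", 9100),
        "news":   ("compute-news.example.com", 9100),
        "search": ("compute-search.example.com", 9100),
    }

    def route(record):
        """Pick the computation server that owns this record's service."""
        try:
            return COMPUTATION_SERVERS[record["service"]]
        except KeyError:
            raise ValueError("no computation server configured for service %r"
                             % record.get("service"))
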
The computed (aggregated) information is stored in a database, which is used by the L4 (Reporting) layer.
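As an example of how these aggregates could be maintained, the sketch below keeps a per-minute hit count in a database table that the reporting layer can query. SQLite stands in here for whatever industry-standard RDBMS the system actually used, and the table name and columns are assumptions.

    import sqlite3

    db = sqlite3.connect("logwatch.db")        # stand-in for the production RDBMS
    db.execute("""CREATE TABLE IF NOT EXISTS hits_by_minute (
                      service TEXT,
                      minute  TEXT,            -- e.g. '2008-06-04 10:15'
                      hits    INTEGER,
                      PRIMARY KEY (service, minute))""")

    def add_hit(service, minute):
        """Increment the aggregate row for this (service, minute) pair."""
        cur = db.execute("UPDATE hits_by_minute SET hits = hits + 1 "
                         "WHERE service = ? AND minute = ?", (service, minute))
        if cur.rowcount == 0:                  # first hit in this minute: insert the row
            db.execute("INSERT INTO hits_by_minute (service, minute, hits) "
                       "VALUES (?, ?, 1)", (service, minute))
        db.commit()

The reporting layer can then answer a "hits by time" request with a simple SELECT over this table.
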

4.0 The Reports
The system provides reports on the following (an example query for one of these reports is sketched after the list):
4.1 Hits by time
4.2 Page Views by time, by pages
4.3 Visits by time, by page
4.4 Unique visitor by time, by page
4.5 Return frequency
4.6 Return visit
4.7 Visiting frequency by visitor
4.8 Average time spent
4.9 By page average time spent
4.10 Referrer by domains, URL
4.11 Search engines
4.12 Search engine keywords
4.13 By search engine by keyword
4.14 Browser type, version, OS
4.15 Parameter analysis
4.16 Country, city, state wise reports
4.17 By country top pages
4.18 By ISP
4.19 Top entry pages
4.20 Top exit pages
4.21 Path reporting (across service)
4.22 Directory filter based reporting
4.23 Fall-out reports
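
To illustrate how the reporting layer might serve one of these reports, the sketch below answers report 4.2 (page views by time, by page) from an assumed aggregate table; the table and column names are hypothetical.

    import sqlite3

    db = sqlite3.connect("logwatch.db")        # the aggregate store written by L3

    def pageviews_by_hour(page, day):
        """Report 4.2: page views for one page, broken down by hour of the given day."""
        return db.execute(
            """SELECT hour, SUM(views)
                 FROM pageviews_by_hour        -- hypothetical aggregate table
                WHERE page = ? AND day = ?
                GROUP BY hour
                ORDER BY hour""",
            (page, day)).fetchall()
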

5.0 The Implementation
The solution was implemented incrementally, with deliverables planned for each increment based on the specified requirements. There were five development cycles, the details of which are given below.
• Incremental cycle 1
o Setting up the framework for real-time log capture
o Health monitoring system
o Hits by time
o Page Views by time, by pages
• Incremental cycle 2 – Computation for the following reports was developed
o Visits by time, by page
o Unique visitor by time, by page
o Return frequency
o Return visit
o Visiting frequency by visitor
o Average time spent
o By page average time spent
• Incremental cycle 3 – Computation for the following reports was developed
o Referrer by domains, URL
o Search engines
o Search engine keywords
o By search engine by keyword
o Browser type, version, OS
o Parameter analysis
• Incremental cycle 4 – Computation for the following reports was developed
o Country, city, state wise reports
o By country top pages
o By ISP
o Top entry pages
o Top exit pages
o Path reporting (across service)
• Incremental cycle 5 – Computation for the following reports was developed
o Directory filter based reporting
o Fall-out reports

The deliverables in each phase required elements of every layer to be developed, implemented, tested, and deployed. For instance, a few tables of the final aggregate schema had to be designed in the very first cycle, along with the corresponding reports.

6.0 Conclusion
This document has described an implementation of a real-time web log capture and reporting system, developed to provide real-time reports measuring traffic parameters such as page views, visits, and unique visitors. The system was designed and built to replace a batch system that generated reports in a deferred mode and did not allow real-time monitoring of, and action on, the various online services.
The architecture of the system consists of four layers. Layer 1 is the Collation client agent, which parses the logs written by the web servers and transfers them to the collation servers. Layer 2 is the Collation layer, whose server utilities collate the information for the computation servers. Layer 3 is the Computation layer, whose servers update the database tables that hold the data for reporting. Layer 4 is the Reporting layer, whose servers generate the reports from the collated and computed data.
The new system overcomes the shortcomings of the existing one, which was not scalable and provided reports only in a deferred mode: its architecture is highly scalable and its reports are available in real time.