::

  Copyright 2020 OpenStack Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

======================
Website Activity Stats
======================

https://storyboard.openstack.org/#!/story/2007387

Basic website activity stats, such as which pages are hit most often, which
requests result in 404s, and the total number of visitors, aid in properly
running a site. With this info you can correct broken links or redirect users
to appropriate locations. Popular pages can be given more attention as they
are read most often. Visitor numbers help you learn whether the changes being
made are effective. Unfortunately, we have not published any of this useful
data for a long time.

Problem Description
===================

One of the major reasons we have not published this data historically is that
many tools that work with this data overshare. We are particularly concerned
about publishing information that might be attributed to specific users.

The ideal here is that we could publish a bare minimum of information that
allows web admins to properly manage sites without leaking personal
information.

In particular, we don't want to leak IP addresses or subnets: IPs are
considered PII and, without significant traffic, subnets typically identify
specific users. We also want to avoid publishing referrer information, as it
can be used to infer who users are. This can happen if users follow links
from internal company wikis, bug trackers, or code hosting systems.

Out of an abundance of caution we will also avoid publishing operating
system, web browser, and Google search term data. This data is likely safe to
share, particularly if we avoid making it cross-referenceable with other
fields. For this reason we may add these stats in the future.

Proposed Change
===============

We can use goaccess, a GPL tool, to produce conservative website stats reports
from Apache access logs.
The key here is that newer versions of goaccess (those available since Ubuntu
Bionic) allow you to remove data from the resulting report files. This allows
us to tell goaccess to produce reports containing only the data we feel is
safe for public consumption.

We would run periodic Zuul jobs that connect to static.opendev.org, uncompress
Apache log files as necessary, then feed them through goaccess. The resulting
report.html output file could then be written into AFS as well as hosted
directly from the Zuul logs system. This would give us reports that update
roughly daily, covering the period of time for which logs are available.

To make this possible we will use Zuul's per-project SSH keys. This will allow
the jobs to add static.opendev.org to the running Ansible inventory, then run
Ansible to perform the above steps.

If publishing into AFS, we would write the reports to a known location for
each site::

  https://example.website.org/goaccess.html

To do this we need a configuration file that excludes the panels we do not
want::

  log-format COMBINED

  ignore-panel VISITORS
  ignore-panel REQUESTS
  ignore-panel REQUESTS_STATIC
  ignore-panel NOT_FOUND
  ignore-panel HOSTS
  ignore-panel OS
  ignore-panel BROWSERS
  ignore-panel VISIT_TIMES
  ignore-panel VIRTUAL_HOSTS
  ignore-panel REFERRERS
  ignore-panel REFERRING_SITES
  ignore-panel KEYPHRASES
  ignore-panel STATUS_CODES
  ignore-panel REMOTE_USER
  ignore-panel GEO_LOCATION

  enable-panel VISITORS
  enable-panel REQUESTS
  enable-panel REQUESTS_STATIC
  enable-panel NOT_FOUND
  enable-panel STATUS_CODES

Then we can run (roughly) this command in the Zuul jobs::

  goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf

Alternatives
------------

We could use a tracker that runs in the browser, such as goatcounter. One
downside to this approach is that we would need to run custom 404 pages in
order to collect data on 404s. This is more complicated than the web server
logs approach.
One upside to this approach is that we could track referrers to 404s,
enabling us to more easily fix our own broken links. If we were collecting a
rich set of data, browser-based trackers would provide much more info, but
because we have decided not to collect that information, the server logs
should be sufficient.

Implementation
==============

Assignee(s)
-----------

Primary assignee: Clark Boylan (clarkb)

Gerrit Topic
------------

Use Gerrit topic "website-stats" for all patches related to this spec.

.. code-block:: bash

    git-review -t website-stats

Work Items
----------

* Write Zuul jobs to produce and publish the goaccess reports.
* Document goaccess tooling for web admins.

Repositories
------------

None

Servers
-------

static.opendev.org would be updated to implement this for the sites it hosts.

DNS Entries
-----------

None

Documentation
-------------

We will need to document where the stats can be retrieved once available. We
should also document the choices we made around which data is collected.

Security
--------

We could unintentionally leak sensitive client information. The example
config file above is intended to avoid that as far as possible by explicitly
disabling all available goaccess panels, then enabling only the few we know
are safe.

Testing
-------

We can run the new job against test data to ensure it works as expected
without disclosing unwanted info.

Dependencies
============

None