About Matomo and Log Analytics
This is a guide on setting up Log Analytics with Matomo. Matomo is an open source, self-hosted, privacy-friendly analytics platform that is available as a Cloudron app. A standard Matomo installation ingests data through a JavaScript tracker that you must embed in every website you wish to enable analytics on.
Matomo also offers Log Analytics: instead of using a client-side JavaScript tracker, it ingests data directly from your Nginx log files (access.log). Compared to JavaScript tracking, server-side Log Analytics has the following benefits:
- It's more privacy-friendly: Instead of injecting tracking code into your website, you will do passive analysis from Nginx log data only.
- It offers better performance for visitors: If your website is optimised for speed, you don't want to make another request for the analytics library, which adds to the loading time.
- It's more durable: With the popularity of ad-blockers, a lot of analytics scripts don't load at all. Log Analytics offers more accurate data.
I am using Log Analytics primarily out of privacy considerations for my website's visitors. I want to understand where my visitors come from, but in the most respectful, privacy-friendly way possible. Server-side log analytics means I won't inject any code at all, which is much friendlier in my opinion.
Overview of how log analytics works
Broadly speaking, the process for sending logs to your Matomo installation looks like this. We will be automating it using a cronjob.
- Cloudron's Nginx webserver creates logs called access.log in /var/log/nginx.
- Using Matomo's import_logs.py script (GitHub), we send the log files to your Matomo installation URL (e.g. https://matomo.example.com).
- On your Matomo docker container, an archive job is run, and the data becomes available on the dashboard.
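In shell terms, the whole pipeline boils down to something like this sketch (the hostname, token, and paths are placeholders; each step is covered in detail below):

# 1. Nginx on the Cloudron host has already written the logs:
ls /var/log/nginx/access.log*

# 2. Replay yesterday's log into Matomo over its HTTP API:
python3 import_logs.py --url=https://matomo.example.com \
--token-auth=KEEP_THIS_SECRET /var/log/nginx/access.log.1

# 3. Inside the Matomo container, archive the freshly imported data:
./console core:archive --force-all-websites --url='https://matomo.example.com'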
The biggest difficulty in this setup is step 2. As of Cloudron box version 7.0.1, Cloudron's Nginx is configured to use a niche log format called combined2. This log format seems to be used only by collectd and nobody else, so Matomo's import_logs.py script cannot parse it. We will have to use a custom regex pattern to allow Matomo's import script to work.
Note: According to @girish, Cloudron will revert to the default Nginx combined log format for the 7.1 release (source). Hence, if you are following this guide from the future, feel free to omit the custom regex pattern.
Differences between Nginx's default log format and Cloudron's combined2
The combined2 log format that Cloudron uses is slightly different from Nginx's default combined log format. Here's a comparison of their structure.

The combined2 format:

$remote_addr - [$time_local] "$request" $status $body_bytes_sent $request_time "$http_referer" "$host" "$http_user_agent"

The combined format:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
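For illustration, here is the same made-up request as it would appear in each format (the IP address is fake, from a reserved documentation range). First combined2, then combined:

192.0.2.10 - [10/Feb/2022:06:25:01 +0000] "GET /index.html HTTP/2.0" 200 5124 0.002 "https://duckduckgo.com/" "www.example.com" "Mozilla/5.0 (X11; Linux x86_64)"
192.0.2.10 - - [10/Feb/2022:06:25:01 +0000] "GET /index.html HTTP/2.0" 200 5124 "https://duckduckgo.com/" "Mozilla/5.0 (X11; Linux x86_64)"

Note the extra $request_time and "$host" fields in combined2, and its missing $remote_user field.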
As you can see, the fields are just different enough that the import_logs.py script cannot parse them. Thankfully, we can specify a custom regex pattern using the --log-format-regex option.
Regex pattern for combined2 logs

This is the regex pattern that you need to use to parse the logs successfully:
(?P<ip>[\w*.:-]+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"
Essentially, it defines a set of named capture groups, such as <date> or <path>, which import_logs.py can understand.
Here's a hands-on way to see how the regex maps onto example log data (IP addresses are fake, sourced from reserved ranges).
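The following is a minimal Python 3 sketch (my own, not part of Matomo; the log line and its values are made up) that runs the pattern against a sample combined2 line and prints every named capture group:

import re

# The exact pattern passed to --log-format-regex, split across lines for readability.
PATTERN = re.compile(
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+'
    r'(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+'
    r'"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"'
)

# A made-up combined2 line; 192.0.2.x is a reserved documentation range.
line = ('192.0.2.10 - [10/Feb/2022:06:25:01 +0000] "GET /index.html HTTP/2.0" '
        '200 5124 0.002 "https://duckduckgo.com/" "www.example.com" '
        '"Mozilla/5.0 (X11; Linux x86_64)"')

match = PATTERN.match(line)
if match:
    for name, value in match.groupdict().items():
        print(f'{name}: {value}')
else:
    print('No match -- the line is not in combined2 format.')

Each printed group (ip, date, path, host, and so on) corresponds to a field that import_logs.py extracts from the line.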
Using the import_logs.py script
In order to import our server logs into Matomo, we must use their provided Python 3 import script. We can get the script from their official GitHub repository:
https://github.com/matomo-org/matomo-log-analytics
I will show you where to download it in a moment.
The import_logs.py script requires three parameters:
- --url: The URL of your Matomo installation. It must include the https:// prefix!
- --token-auth: An API authentication token from Matomo. You must generate it from the dashboard.
- --log-format-regex: Tells the script to use your custom regex pattern, so it can understand Cloudron's combined2 format.

Once again, if you are following this guide from the future (e.g. version 7.1 and above), you do not need to specify --log-format-regex.
This is how the command should look:
python3 import_logs.py \
--url=https://matomo.example.com \
--token-auth=KEEP_THIS_SECRET \
--log-format-regex='(?P<ip>[\w*.:-]+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"' \
/var/log/nginx/access.log.1
Run the import_logs.py script
Now we are ready to get it working. We must first log in to the base Cloudron server. Cloudron does not run Nginx on a per-application basis (i.e. in every docker container); rather, it runs Nginx on the base server itself. Hence all the logs live there, and that is where we need to execute the script.
First, log in to your server using SSH:
ssh root@my.example.com
Next, we will download the script from Matomo and change into the directory that contains it.
cd ~
git clone https://github.com/matomo-org/matomo-log-analytics.git
cd matomo-log-analytics
Now we run the above command. Make sure to have the correct --url and --token-auth parameters, as well as the right log file, which should be /var/log/nginx/access.log.1.
python3 import_logs.py \
--url=https://matomo.example.com \
--token-auth=KEEP_THIS_SECRET \
--log-format-regex='(?P<ip>[\w*.:-]+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"' \
/var/log/nginx/access.log.1
If the import is successful, you should see output that looks like this:

Parsing log /var/log/nginx/access.log.1...
x lines parsed, x lines recorded, x records/sec (avg), x records/sec (current)
...
Processing your log data
------------------------
In order for your logs to be processed by Matomo, you may need to run the following command:
./console core:archive --force-all-websites --url='https://matomo.example.com'
Now your logs should have been ingested by Matomo. If you have any additional logs, such as access.log.2, access.log.3, et cetera, this is the time to import them as well.
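The import script accepts several log files in a single invocation and, as far as I know, can also read gzip-compressed logs directly, so something like this should work:

# Same --url, --token-auth and --log-format-regex as before; just point the
# script at the older (optionally gzip-compressed) files:
python3 import_logs.py \
--url=https://matomo.example.com \
--token-auth=KEEP_THIS_SECRET \
--log-format-regex='(?P<ip>[\w*.:-]+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"' \
/var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz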
For Matomo's dashboard to update, we have to tell it to archive. Thankfully, the default Matomo Cloudron installation is already configured to archive automatically every 15 minutes. If you wish to perform a manual archive, simply open a terminal in the Matomo docker container (you can do this from the browser) and run the archive cronjob.
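For reference, a manual archive run from a terminal inside the Matomo container looks something like this. The /app/code path is my assumption about the Cloudron package layout; the core:archive command itself comes from the import script's output above:

# inside the Matomo container's terminal (path is an assumption)
cd /app/code
php console core:archive --force-all-websites --url='https://matomo.example.com'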
Check your Matomo dashboard now
Do you see any data? If you do not, it may be because you have not set up a website in the Matomo dashboard. By default, Matomo rejects log entries that do not correspond to a website in the dashboard. If your Matomo install is brand new, this is the time to add your websites. Then run the import commands again.
Now your dashboard should be updated with the log analytics.
Automating Log Imports using Cronjobs
Now, we must import the files every day. The best way to automate this is to put the command into a bash script and set up a cronjob to run it. This is the script that you can use:
#!/usr/bin/env bash

# Cron does not start us inside ~/matomo-log-analytics, so use absolute
# paths for both the import script and the log file.
python3 /root/matomo-log-analytics/import_logs.py \
--url=https://matomo.example.com \
--token-auth=KEEP_THIS_SECRET \
--log-format-regex='(?P<ip>[\w*.:-]+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"' \
/var/log/nginx/access.log.1
Make sure that the log file is /var/log/nginx/access.log.1. Nginx automatically rotates the log files once a day at midnight, shifting them like this:

access.log -> access.log.1 -> access.log.2.gz -> access.log.3.gz

Since we are running our cronjob once a day at 1:00am, we always want access.log.1, which represents "yesterday's" logs, fresh right after the rotation. If you ran the import on access.log instead, you would be reading an almost-empty file, since the logs were just rotated.
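If you want to confirm when and how rotation happens on your server, you can peek at the Nginx logrotate configuration (this path is standard on Ubuntu, which Cloudron runs on):

cat /etc/logrotate.d/nginx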
Save it somewhere like /root/import-cronjob.sh.
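Before wiring it into cron, make the script executable and do a test run by hand:

chmod +x /root/import-cronjob.sh
/root/import-cronjob.sh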
Now, all you have to do is add the cronjob to the root crontab. To do so, run:
crontab -e
Follow the on-screen instructions to choose an editor, and then add the following cronjob:
0 1 * * * /root/import-cronjob.sh >/dev/null 2>&1
This tells the server to run the script once a day at 1:00am, and to silence all output.
Save the crontab, and now your setup should be complete. Congratulations, your Matomo instance on Cloudron is now using server-side Log Analytics!
I hope this tutorial has been helpful. If you need any help, feel free to ask questions.
Keywords to aid search
Matomo, log analytics, nginx, logs, combined2, log format.