First of all, thanks to @Sydney for your great tutorial on log analysis.
Unfortunately, I still have problems with the implementation.
I installed Matomo in Cloudron, set up the site in Matomo, and now I want to import the logs. This is the command I use:
python3 import_logs.py \
  --url=https://analytics.my-site.de \
  --token-auth=my-token \
  --log-format-regex='(?P<ip>[\w*.:-]+)\s+\S+\s+[(?P<date>.*?)\s+(?P<timezone>.*?)]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<generation_time_milli>\d*\.?\d+)\s+"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"' \
  /var/log/nginx/access.log.1
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log /var/log/nginx/access.log.1...

Logs import summary
-------------------
    0 requests imported successfully
    0 requests were downloads
    23233 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        23233 invalid log lines
        0 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------
    0 requests imported to 0 sites
        0 sites already existed
        0 sites were created:
    0 distinct hostnames did not match any existing site:

Performance summary
-------------------
    Total time: 0 seconds
    Requests imported per second: 0.0 requests per second

Processing your log data
------------------------
    In order for your logs to be processed by Matomo, you may need to run the following command:
    ./console core:archive --force-all-websites --url='https://analytics.my-site.de'
Invalid line detected (line did not match): 66.249.*.* - [31/Jan/2022:21:59:34 +0000] "GET my-site.com/blog/*/*/* HTTP/1.1" 200 14007 0.438 "-" "my-site.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.g**gle.com/bot.html)"
I only want to track a WordPress site, no other apps running in Cloudron.
Cloudron version: v7.0.4 (Ubuntu 20.04.3 LTS)
Matomo version: 4.6.2
Could anyone here help me?
Thanks in advance
@feelniceinc I think this is because the regexp to parse the log lines is not correct. Cloudron uses a format called "combined2" like below, so you might have to adjust that regexp accordingly:
log_format combined2 '$remote_addr - [$time_local] '
    '"$request" $status $body_bytes_sent $request_time '
    '"$http_referer" "$host" "$http_user_agent"';
That said, in the next release, we have removed the above custom format since it was causing problems when integrating with other tools (like crowdsec, iirc). As a temporary workaround, you can edit the nginx configs to say
access_log /var/log/nginx/access.log combined;
instead of combined2 and restart nginx to see if it parses correctly.
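For anyone who does need to adjust the regexp for combined2: import_logs.py is a Python script, so a candidate pattern can be checked with Python's re module before running the import. The following is a minimal sketch, not a confirmed fix. It uses the regex from the command above with the square brackets around the timestamp escaped as \[ and \], and the sample line from the "Invalid line detected" message. In the command as posted, those brackets appear unescaped, which would make them a character class instead of literal brackets. Whether that is what the script actually received or just a forum rendering artifact is an assumption. The names COMBINED2 and SAMPLE are made up for this example.

```python
import re

# Candidate pattern for Cloudron's combined2 log format.
# Assumption: the literal [ and ] around the timestamp must be escaped.
COMBINED2 = re.compile(
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+'
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+'
    r'(?P<status>\d+)\s+(?P<length>\S+)\s+'
    r'(?P<generation_time_milli>\d*\.?\d+)\s+'
    r'"(?P<referrer>.*?)"\s"(?P<host>[\w\-\.]*)"\s"(?P<user_agent>.*?)"'
)

# Sample line copied from the "Invalid line detected" message in this thread.
SAMPLE = (
    '66.249.*.* - [31/Jan/2022:21:59:34 +0000] '
    '"GET my-site.com/blog/*/*/* HTTP/1.1" 200 14007 0.438 '
    '"-" "my-site.com" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.g**gle.com/bot.html)"'
)

match = COMBINED2.match(SAMPLE)
fields = match.groupdict() if match else {}

if not fields:
    print("line did not match")
for name, value in fields.items():
    print(f"{name}: {value}")
```

If the line matches here but the import still reports invalid lines, the difference is more likely in how the shell passes the regex to the script than in the pattern itself.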
@feelniceinc Can you post your final import command so the solution is available here?
Hey, my bash script for the cronjob is now:
#!/usr/bin/env bash
sudo python3 /path/to/matomo-log-analytics/import_logs.py \
  --url=https://mysite/matomo \
  --token-auth=token \
  --idsite=site_id \
  /var/log/nginx/access.log.1
Hey there, @FeelNiceInc. I'm glad to hear that my tutorial was helpful for you, and I'm sorry that my provided regex did not work.
I think @girish's solution is the best: by changing Cloudron's Nginx webserver to use the default combined log format, Matomo's log import script will automatically recognise and import the logs without needing to specify a special regex.
The regex that I provided in my tutorial was specifically meant to accommodate Cloudron's idiosyncratic combined2 log format; otherwise it provides little benefit.
I'm not sure why the regex didn't work for you, as it is working for me. For future readers who stumble upon this thread, I would recommend going with @girish's advice and simply changing Cloudron to use the combined log format.
However, if you already have an archive of logs in the combined2 format which you need to import, I recommend trying to figure out the correct regex by hand. I use a regex visualiser called RegExr, which makes it easier to craft custom regular expressions.
The RegExr link to the combined2 log format is here:
I recommend taking a few lines of your server logs and pasting them into RegExr to see what matches and what doesn't. The regex defines several named capture groups, such as:
(?P<method>\S+): HTTP request method (e.g. POST, GET)
(?P<path>.*?): HTTP request path (e.g. /homepage.html)
(?P<status>\d+): HTTP response status code
(?P<generation_time_milli>\d*\.?\d+): amount of time the server took to respond
(?P<user_agent>.*?): user agent (browser, device, etc.)
All the weird things like \s+ in between simply account for things like spaces in the log lines. Try playing around with the regex until it matches everything in your logs. The RegExr website makes it all very visual and easy to understand.
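The same trial and error can also be done locally in Python, which is what import_logs.py itself is written in. The sketch below is only an illustration under my own assumptions: it splits the combined2 regex into one piece per field (with the timestamp brackets escaped) and reports the first piece at which a line stops matching. The helper name first_failing_piece, the split points, and both sample lines are hypothetical and not part of Matomo.

```python
import re

# The combined2 regex split into pieces, roughly one per log field.
# Assumption: the brackets around the timestamp are escaped as \[ and \].
PIECES = [
    r'(?P<ip>[\w*.:-]+)',
    r'\s+\S+\s+',
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]',
    r'\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"',
    r'\s+(?P<status>\d+)',
    r'\s+(?P<length>\S+)',
    r'\s+(?P<generation_time_milli>\d*\.?\d+)',
    r'\s+"(?P<referrer>.*?)"',
    r'\s"(?P<host>[\w\-\.]*)"',
    r'\s"(?P<user_agent>.*?)"',
]


def first_failing_piece(pieces, line):
    """Return the index of the first piece at which `line` stops matching,
    or None if the line matches all pieces joined together."""
    for count in range(1, len(pieces) + 1):
        if re.match(''.join(pieces[:count]), line) is None:
            return count - 1
    return None


# Made-up line in the combined2 layout.
combined2_line = (
    '66.249.*.* - [31/Jan/2022:21:59:34 +0000] '
    '"GET my-site.com/blog/ HTTP/1.1" 200 14007 0.438 '
    '"-" "my-site.com" "Mozilla/5.0"'
)
# Made-up line in nginx's standard combined layout, which has an extra
# remote-user field before the timestamp and no request time or host.
combined_line = (
    '203.0.113.7 - - [31/Jan/2022:21:59:34 +0000] '
    '"GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0"'
)

for line in (combined2_line, combined_line):
    index = first_failing_piece(PIECES, line)
    if index is None:
        print("full match")
    else:
        print(f"stops matching at piece {index}: {PIECES[index]}")
```

Knowing which piece fails narrows down which part of the regex has to change for your particular log lines.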
I'm glad that you were able to get log analytics working. I hope this helps!