App logs filling disk
-
Last Friday, we found that our Cloudron apps had stopped working because one app (Etherpad-Lite) filled 100% of /dev/root with >100GB of new app logs. We deleted the logs, rebooted, and chalked it up as a one-time issue. Earlier today, the same thing happened a second time: the disk filled with another 100GB of platformdata. Running `cloudron-support --troubleshoot` and rebooting Cloudron resolved the issue, but without a clear understanding of why this happened, I'm concerned it could happen again. This leaves me with a few follow-up questions:
- After having already deleted the excess log files, is there anything we can look at to track down the root cause, or is the "evidence" now all gone?
- Do we need to be actively managing log rotation for apps? Should I be setting up app-specific logrotate config files? Recommendations on how to do this?
- Any recommendations on how to monitor or configure alerts on file size or disk usage on Cloudron? (Server is running on AWS EC2, so perhaps I just do this with AWS Cloudwatch tools.)
- I'm seeing a number of recent disk-full threads on the Cloudron forum, and they don't seem to be specific to Etherpad-Lite, so I'm wondering if there are any related platform issues.
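On question 2, in case it helps frame the discussion: if the platform's own rotation isn't keeping up, one stopgap is a host-level logrotate rule with a size cap. This is only a sketch; the log path below is my assumption based on the "platformdata" usage shown in the dashboard, and I'd want confirmation from the Cloudron team before relying on it:

```
# /etc/logrotate.d/cloudron-apps -- hypothetical; path is an assumption
/home/yellowtent/platformdata/logs/*/*.log {
    daily
    rotate 7
    maxsize 500M      # rotate early if a single file exceeds 500MB
    compress
    missingok
    copytruncate      # don't disturb the app's open file handle
}
```

`copytruncate` seems safer here than moving files, since the apps presumably keep their log files open.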
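On question 3, while I evaluate CloudWatch, I'm considering a minimal cron-driven check on the host as a first line of defense. A sketch of the idea (the threshold and the alert delivery mechanism are placeholders; wire the `echo` to mail, a webhook, or whatever you use):

```shell
#!/usr/bin/env bash
# Warn when root filesystem usage crosses a threshold.
THRESHOLD=80
# df --output=pcent prints e.g. " 42%"; strip everything but digits.
USAGE=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: / is at ${USAGE}% (threshold ${THRESHOLD}%)"
fi
```

Run from cron every few minutes; it's silent until the threshold is crossed.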
Some specifics on our situation:
- saw that our externally-hosted uptime server reported that multiple Cloudron-hosted apps were unresponsive.
- visited the Cloudron dashboard and found `/dev/root` at 100% capacity, with "platformdata" filling the disk usage chart (160GB disk)
- connected via SSH (the AWS EC2 "Session Manager" option failed, presumably due to the full disk)
- found that Etherpad-Lite app had created >100GB of log files in the past day
- tracked down the troubleshooting instructions (idea: link to Troubleshooting page from the "/dev/root at 100% capacity" error dialog)
- ran `sudo cloudron-support --troubleshoot`; still saw DNS failures, but after rebooting everything worked normally again
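For anyone hitting the same wall, the commands I used to track down the oversized logs were along these lines (the platformdata path is an assumption based on what the disk usage chart labeled; adjust to your layout):

```shell
# Show the biggest directories two levels under platformdata, largest first.
sudo du -h --max-depth=2 /home/yellowtent/platformdata 2>/dev/null | sort -rh | head -20

# List any individual files over 1GB on the root filesystem (-xdev stays on one fs).
sudo find / -xdev -type f -size +1G -exec ls -lh {} + 2>/dev/null
```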
-
Oh wow, what an interesting (and slightly frightening) failure mode! "Your log files are so big that we've given up on managing them."