Emailing notifications of certain crucial system events, such as full disk space
Hi Cloudron Community,
Unfortunately, I had to deal with a full disk and its consequences.
I was quite stunned that the disk filled up so quickly. Without prior notice, services stopped and the dashboard became unreachable. After resizing the disk and the partitions I was able to start the server again. Even so, the unbound DNS server wasn't running properly, even though it did start after the resize, and that mainly caused the mail server to stop working.
The reason I believe the disk space was running full in the first place:
The mounted CIFS backup volume wasn't mounted anymore → backups were made locally. Why did that happen? No idea. But it would have been good to know that it happened.
These events showed me that I am missing a way to set up email notifications for certain system events that require immediate action, such as
- Backup volume not found/unmounted;
- Disk usage exceeds a threshold;
- Service X not running for Y minutes → indicates a problem;
- … you name it.
And since I quite often read that people's disks filled up "suddenly", I thought this might be caused by a lack of information.
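To make the first two items on that list concrete, here is a minimal sketch of what such checks could look like. It is not how Cloudron does or would implement it. The mount point `/mnt/backup`, the 90% threshold, the email addresses and all function names are assumptions for illustration only.

```python
import os
import shutil
from email.message import EmailMessage


def check_disk_usage(path, threshold_percent):
    """Return an alert string if usage of the filesystem at `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    if used_percent >= threshold_percent:
        return f"Disk usage on {path} is {used_percent:.1f}% (threshold {threshold_percent}%)"
    return None


def check_backup_mount(mount_point):
    """Return an alert string if `mount_point` is not currently a mount point."""
    if not os.path.ismount(mount_point):
        return f"Backup volume {mount_point} is not mounted"
    return None


def build_alert_email(alerts, sender, recipient):
    """Build (but do not send) an email that lists all alerts."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[server alert] {len(alerts)} condition(s) need attention"
    msg.set_content("\n".join(alerts))
    return msg


if __name__ == "__main__":
    # Hypothetical values: adjust the path, threshold and mount point to your system.
    results = [check_disk_usage("/", 90), check_backup_mount("/mnt/backup")]
    alerts = [a for a in results if a is not None]
    if alerts:
        # Sending is left out here; smtplib.SMTP(host).send_message(msg) would deliver it.
        print(build_alert_email(alerts, "server@example.com", "admin@example.com"))
```

Run from cron every few minutes, a script like this would at least have flagged the unmounted backup volume before local backups filled the disk.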
@dev-cb I've suggested in another thread that a pop-up message on the Cloudron Dashboard might be an even better reminder/notification about a given Cloudron using local storage. Even disks with close to 1TB fill up quickly if you use local storage and a healthy backup frequency! I don't think Cloudron should be responsible for our disk usage/monitoring, but it could certainly alert us to some conditions that might lead to a full disk.
I don't think Cloudron should be responsible for our disk usage/monitoring
I agree with you in general. It is the user's responsibility to take action anyway, not Cloudron's. But since one promise communicated on the website is the following, I see the details slightly differently:
Cloudron lets you focus on using the apps and not worry about system administration.
The users asking for help after having a full disk certainly had to deal with system administration – very suddenly and on a level which needs quite some knowledge (partitioning a disk).
A notification on the dashboard might give a good indication, but it also requires the user to be working with the dashboard constantly, which is not always the case. Imagine someone running just a few apps, for example to provide groupware for a small business.
Since Cloudron provides a built-in mail service it should be possible to implement the delivery of notifications or warnings to prevent a system failure.
I can only speak from the experience I had, which wasn't pleasant at all. The failure forced my business to stop running for a couple of hours, which cost quite a lot in time, effort and money. It could have been prevented.
What's the opinion of others here? How often has support been contacted about an issue such as a full disk and its consequences?
Backup volume not found/unmounted;
Disk usage exceeds a threshold;
Service X not running for Y minutes → indicates a problem;
… you name it.
I agree with parts of your statement.
Yes, an email notification about running out of space would be nice.
Why did the system do a local backup instead of notifying about the missing mount, or even better just try to remount the storage and then do the backup?
(I believe I talked with @nebulon about this and that it should be fixed with the new update. Big emphasis on "believe", since I am not sure whether we really talked or my mind is playing games with me.)
But now we are getting into territory where Cloudron itself would have to implement a full monitoring solution.
"Don't re-invent the wheel"
I use Zabbix for all my systems.
No, there is no Zabbix app for Cloudron yet.
With Zabbix I can monitor everything:
disk usage, Docker status, the Cloudron API and more.
You want to see some data for a system? Sure here:
I monitor each container and have all the statistics (disk space, RAM, CPU) for each one.
I can also set up notification flows and states that define how critical an error is and when to start notifying people.
Usage: 75% Warning, 80% Average, 85% High, 90% Critical, 95% Disaster.
I can define groups that receive messages at each level, and through which medium.
Warning goes out only via Rocket Chat and e-mail; High and Critical go out via mobile push, Telegram bot, SMS and alarm.
yada yada yada.
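For anyone who hasn't used this kind of escalation, here is a rough sketch of the idea in Python. It is not Zabbix configuration, just the thresholds and media from above expressed as a lookup. The channel names are made-up identifiers, and routing Average like Warning and Disaster like Critical is my assumption, since only Warning and High/Critical are spelled out above.

```python
# Thresholds as listed above, checked from the most severe level down.
SEVERITY_THRESHOLDS = [
    (95, "Disaster"),
    (90, "Critical"),
    (85, "High"),
    (80, "Average"),
    (75, "Warning"),
]

# Hypothetical channel identifiers. "Average" and "Disaster" routing is an assumption.
LOW_PRIORITY = ["rocket_chat", "email"]
HIGH_PRIORITY = ["mobile_push", "telegram_bot", "sms", "alarm"]
CHANNELS = {
    "Warning": LOW_PRIORITY,
    "Average": LOW_PRIORITY,
    "High": HIGH_PRIORITY,
    "Critical": HIGH_PRIORITY,
    "Disaster": HIGH_PRIORITY,
}


def classify(usage_percent):
    """Return the severity name for a usage value, or None if it is below 75%."""
    for threshold, severity in SEVERITY_THRESHOLDS:
        if usage_percent >= threshold:
            return severity
    return None


def channels_for(usage_percent):
    """Return the list of channels to notify for a usage value (empty if no alert)."""
    severity = classify(usage_percent)
    return CHANNELS.get(severity, [])
```

So 78% usage would only produce a chat message and an e-mail, while 92% would go to the media that wake someone up.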
I am unsure if it would be good to re-invent the wheel here.
To a certain degree OK but we should be careful.
I think this will end up in an initial feature which is going to grow because people will want more functionality for the internal monitoring.
. . .
Hope you can understand why I am concerned about this becoming a feature since the overhead could become rather large.
Zabbix runs on a master node, and each client has an agent. (Yes, the master is an external system.)
Zabbix can monitor clients actively and passively.
Passive means the master asks the client for data and the client delivers it.
This does not always work in special networks where the master cannot reach the client.
In that case you use active monitoring, where the client reports all of its data to the master at a fixed interval.
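The difference between the two modes is only about who opens the connection. Here is a tiny in-process sketch of that idea. It is not the Zabbix protocol or API, and the class and method names are invented for illustration.

```python
class Agent:
    """Runs on the monitored client and knows how to read local metrics."""

    def __init__(self, name, read_metrics):
        self.name = name
        self.read_metrics = read_metrics  # callable returning a dict of metrics

    def handle_poll(self):
        # Passive mode: the master connected to us and asked for data.
        return self.read_metrics()

    def push_to(self, master):
        # Active mode: we connect to the master and report on our own schedule.
        master.receive(self.name, self.read_metrics())


class Master:
    """Collects the latest metrics per client."""

    def __init__(self):
        self.latest = {}

    def poll(self, agent):
        # Passive mode requires the master to be able to reach the agent.
        self.latest[agent.name] = agent.handle_poll()

    def receive(self, name, metrics):
        # Active mode only requires the agent to be able to reach the master.
        self.latest[name] = metrics


master = Master()
agent = Agent("cloudron-1", lambda: {"disk_used_percent": 81.5})

master.poll(agent)     # passive check
agent.push_to(master)  # active check, e.g. triggered by a timer on the client
```

In a real setup `poll` and `push_to` would be network calls, which is why active mode still works when the client sits behind NAT or a firewall that blocks inbound connections.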
There can be a master / slave / proxy setup for large-scale monitoring. (Google "Zabbix HA Cluster Setup" for more details.)
For more detail, please consult the documentation: https://www.zabbix.com/documentation/current/en/manual/introduction/about