Emailing notifications of certain crucial system events, such as full disk space
-
@dev-cb said in Emailing notifications of certain crucial system events, such as full disk space:
Backup volume not found/unmounted;
Disk usage exceeds a threshold;
Service X not running since X minutes - indicates a problem;
… you name it.
I agree with parts of your statement.
Yes, an email notification about running out of space would be nice.
Why did the system do a local backup instead of notifying about the missing mount, or even better just try to remount the storage and then do the backup?
(I believe I talked with @nebulon about this and this should be fixed with the new update? big emphasis on believe since I am not sure if we really talked or my mind is playing games with me)
But now we come into territory where Cloudron itself would have to implement a full monitoring solution.
"Don't re-invent the wheel"
I use Zabbix for all my systems.
No, there is no Zabbix app for Cloudron yet.
With Zabbix I can monitor everything.
Disk usage, Docker status, the Cloudron API and more.
You want to see some data for a system? Sure here:
I monitor each container, I have all statistics space, ram, cpu for each container.
DATA!!!
Also, I can set up notification flows and states for how critical an error is and when to start notifying people.
Usage: 75% Warning, 80% Average, 85% High, 90% Critical, 95% Disaster.
And I can define groups to send messages to at each level, and with which media.
Warning: only via Rocket.Chat and e-mail. High > Critical: mobile push, Telegram bot, SMS, alarm.
yada yada yada.
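To make the escalation idea concrete, here is a minimal sketch of mapping a disk usage percentage to the severity names listed above. It is plain Python, not Zabbix configuration; the names SEVERITY_LEVELS and severity_for are made up for illustration, and only the percentages come from the post.

```python
# Thresholds as listed in the post; the structure and names are illustrative only.
SEVERITY_LEVELS = [
    (95, "Disaster"),
    (90, "Critical"),
    (85, "High"),
    (80, "Average"),
    (75, "Warning"),
]


def severity_for(usage_percent):
    """Return the highest severity whose threshold is reached, or None below 75%."""
    for threshold, name in SEVERITY_LEVELS:  # ordered from highest to lowest
        if usage_percent >= threshold:
            return name
    return None
```

A notification flow would then pick the media (chat, e-mail, SMS, ...) based on the returned severity.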
I am unsure if it would be good to re-invent the wheel here.
To a certain degree OK, but we should be careful.
I think this will end up as an initial feature which is going to grow because people will want more functionality for the internal monitoring.
. . .
Hope you can understand why I am concerned about this becoming a feature since the overhead could become rather large.
-
@BrutalBirdie Can you just describe how your Zabbix system is running? I'm guessing outside of the Cloudron VPS, right?
-
@benborges
Zabbix is running on a master node and each client has an agent. (Yes, the master is an external system.)
Zabbix can monitor clients actively and passively.
Passive means the master asks the client for data and the client delivers.
This does not always work within special networks where the master cannot reach the client.
In that case you use active monitoring, where the client reports all data to the master at a certain interval.
There can be a master / slave / proxy setup for large-scale monitoring solutions. (Google "Zabbix HA Cluster Setup" for more details.)
For more detail, please consult the docs: https://www.zabbix.com/documentation/current/en/manual/introduction/about
-
I also encountered a "disk full" issue, and I was quite dumbfounded that there was no email notification for this; that seems pretty basic as far as monitoring goes.
Cloudron is well-placed to add this functionality, and it would save us so many headaches.
-
@AmbroiseUnly for some reason, Linux doesn't have an event for nearing full disk space. The only way to do this, then, is to keep polling aggressively, but this causes a lot of disk churn. Also, the notification is then limited by how frequently you can poll. There is some quota support, but it also needs kernel support (which Cloudron cannot control).
-
Would it be possible to have a guide then? Something with best-practices in mind.
Another user mentioned Zabbix, but it feels complicated to use (the doc isn't so friendly, it doesn't look simple). I don't know if that really is complex to set up, but a guide with some sort of "Cloudron recommendation" would be really nice.
Typically, something that covers how to get alerted (email) when disk reaches 50/75/90/95/99/100% capacity, and maybe also some CPU watchers. A guide covering it from "how to install it" to "how to configure it" would be really helpful.
Also, if it uses a Cloudron App, it might also be beneficial for Cloudron, because customers would reach 3 Cloudron apps quicker, meaning more sales for you.
-
You could do something like this via cron and maybe ntfy.
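As a rough sketch of the cron plus ntfy idea (not the script from the linked threads): the Python below checks disk usage and, above a threshold, POSTs a plain-text message to an ntfy topic URL passed as the first argument. The 90% threshold, the function names and the message wording are assumptions. Note that used/total can differ slightly from the Use% column that df prints.

```python
#!/usr/bin/env python3
"""Cron-friendly disk check that posts to an ntfy topic (illustrative sketch)."""
import shutil
import sys
import urllib.request

THRESHOLD_PERCENT = 90  # assumed alert level, adjust to taste


def usage_percent(path="/"):
    """Used space on the filesystem containing `path`, as a percentage of its total."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def build_alert(path, percent, threshold=THRESHOLD_PERCENT):
    """Return an alert message, or None while usage is below the threshold."""
    if percent < threshold:
        return None
    return f"Disk usage on {path} is at {percent:.1f}% (threshold {threshold}%)"


def send_ntfy(topic_url, message):
    """Publish the message to ntfy: an HTTP POST with the text as the body."""
    request = urllib.request.Request(
        topic_url,
        data=message.encode("utf-8"),
        headers={"Title": "Disk space warning"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def main(topic_url, path="/"):
    message = build_alert(path, usage_percent(path))
    if message is not None:
        send_ntfy(topic_url, message)


# Only runs when a topic URL is given, e.g. from a crontab entry.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Run from cron every 30 minutes or so with your own topic URL as the argument; as written it re-sends the alert on every run while usage stays above the threshold.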
We had a discussion like this already, see an example here: https://forum.cloudron.io/post/72148
Otherwise, googling "cron alert disk full mail" brought up e.g. https://askubuntu.com/questions/1503361/script-to-notify-via-email-when-low-on-disk-space or https://github.com/corneliusroot/QuickStatus
-
For anyone interested in configuring proper monitoring on your Cloudron server, I wrote a guide about it, and I hope you'll find it useful!
It's the kind of guide I wish I had found when first looking at this topic.
-
I am wondering if this might be possible by now. I just got the notification "Server is running out of disk space" on the Cloudron notification tab. Since there is already the possibility to subscribe to email alerts for events like "App is down", couldn't this event be added as well?
I like the idea of Cloudron being a self-contained system, so I don't want to add a custom monitoring system to it that needs to be maintained alongside it.
-
@girish How about a more indirect solution?
Something that correlates to disk space, such as inodes or other low cost checks.
If not that, then how about creating a safety system for Cloudron? Let's call it AirBag with ABS brakes: when you're about to crash, it deploys in a controlled way.
AirBag with ABS might look like a series of 10 eagerly zeroed files, evenly dividing a threshold of, say, 1GB, always present on disk. When the system runs out of disk, 1 of the 10 is deleted and a notification is sent. Repeat 4 more times, then wait.
That way the system has a controlled descent to 0, with some space left for when an admin comes by and needs room to work with.
Thoughts?
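A rough Python sketch of the ballast-file idea, purely to illustrate the proposal; nothing like this exists in Cloudron as far as this thread shows. The function names, the "ballast-NN" file naming and the keep parameter (how many files stay reserved for the admin) are all assumptions.

```python
import os
import shutil


def create_ballast(directory, count=10, total_bytes=1024**3):
    """Write `count` zero-filled files that together reserve about `total_bytes`."""
    os.makedirs(directory, exist_ok=True)
    size = total_bytes // count
    chunk = b"\0" * min(size, 1024 * 1024)  # write real zeros, in chunks of at most 1 MiB
    for i in range(count):
        path = os.path.join(directory, f"ballast-{i:02d}")
        if os.path.exists(path):
            continue  # already reserved
        with open(path, "wb") as f:
            remaining = size
            while remaining > 0:
                n = min(len(chunk), remaining)
                f.write(chunk[:n])
                remaining -= n


def release_one(directory, keep=5, notify=print):
    """Delete one ballast file and notify, unless only `keep` files are left."""
    files = sorted(f for f in os.listdir(directory) if f.startswith("ballast-"))
    if len(files) <= keep:
        return None  # the rest stays reserved for the admin
    victim = files[-1]
    os.remove(os.path.join(directory, victim))
    notify(f"Disk low: released {victim}, {len(files) - 1} ballast files left")
    return victim


def check(directory, min_free_bytes, keep=5, notify=print):
    """Release one ballast file if free space is below `min_free_bytes`."""
    if shutil.disk_usage(directory).free < min_free_bytes:
        return release_one(directory, keep=keep, notify=notify)
    return None
```

check() would still have to be called periodically, so this does not remove the polling; it only buys time between the first notification and a truly full disk. On filesystems with transparent compression, zero-filled files may not actually reserve the space.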
-
Email notification can be added, but it will be unreliable (and we don't want to mislead users). See https://forum.cloudron.io/topic/7555/emailing-notifications-of-certain-crucial-system-events-such-as-full-disk-space/8
-
Sure, I do understand those limitations. I was just thinking that it would be nice to have an email notification equivalent (maybe with a note pointing out the limitations) for every notification type shown in the Cloudron dashboard.
-
Currently, we run df every 30 mins. Maybe this is accurate enough already. In which case, what is missing is the email notification. Can add that for next release.
@girish That sounds great! The last two incidents where this would have helped me developed over several days (exploding Rocket.Chat logs and syslog.js), so this should be within the necessary precision to prevent this type of situation.