Disk space should never bring a whole server down
-
Ugh, nope, disk full again, whole server down. Makes me dislike weekends when this stuff happens.
No clue what to do, but the post title remains valid - maybe Apps need disk-space limits, because I'm caught between hard server resets, brief windows of access, and then lockout again.
-
Whatever 1000% CPU is doing, it's not serving the Cloudron Dashboard:
-
# systemctl status box
● box.service - Cloudron Admin
   Loaded: loaded (/etc/systemd/system/box.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Sat 2021-03-06 23:22:48 UTC; 88ms ago
  Process: 311 ExecStart=/home/yellowtent/box/box.js (code=exited, status=1/FAILURE)
 Main PID: 311 (code=exited, status=1/FAILURE)
# systemctl status nginx
● nginx.service - nginx - high performance web server
   Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/nginx.service.d
           └─cloudron.conf
   Active: active (running) since Sat 2021-03-06 23:09:24 UTC; 14min ago
     Docs: http://nginx.org/en/docs/
  Process: 1431 ExecStart=/usr/sbin/nginx -c /etc/nginx/nginx.conf (code=exited, status=0/SUCCESS)
 Main PID: 1634 (nginx)
    Tasks: 17 (limit: 4915)
   CGroup: /system.slice/nginx.service
           ├─1634 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
           ├─1638 nginx: worker process
           ├─1639 nginx: worker process
           ├─1641 nginx: worker process
           ├─1642 nginx: worker process
           ├─1645 nginx: worker process
           ├─1646 nginx: worker process
           ├─1647 nginx: worker process
           ├─1648 nginx: worker process
           ├─1649 nginx: worker process
           ├─1650 nginx: worker process
           ├─1651 nginx: worker process
           ├─1652 nginx: worker process
           ├─1653 nginx: worker process
           ├─1654 nginx: worker process
           ├─1655 nginx: worker process
           └─1656 nginx: worker process

Mar 06 23:09:23 cloudron01 systemd[1]: Starting nginx - high performance web server...
Mar 06 23:09:24 cloudron01 systemd[1]: Started nginx - high performance web server.
Sorry, I have to work evenings and weekends, it's the only time I can concentrate on the deep work without all the email and message interruptions of the weekdays.
-
# systemctl status unbound
● unbound.service - Unbound DNS Resolver
   Loaded: loaded (/etc/systemd/system/unbound.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2021-03-06 23:26:38 UTC; 8s ago
 Main PID: 20802 (unbound)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/unbound.service
           └─20802 /usr/sbin/unbound -d

Mar 06 23:26:38 cloudron01 systemd[1]: Started Unbound DNS Resolver.
Mar 06 23:26:38 cloudron01 unbound[20802]: [20802:0] notice: init module 0: subnet
Mar 06 23:26:38 cloudron01 unbound[20802]: [20802:0] notice: init module 1: validator
Mar 06 23:26:38 cloudron01 unbound[20802]: [20802:0] notice: init module 2: iterator
Mar 06 23:26:38 cloudron01 unbound[20802]: [20802:0] info: start of service (unbound 1.6.7).
Mar 06 23:26:39 cloudron01 unbound[20802]: [20802:0] error: could not fflush(/var/lib/unbound/root.key): No space left on device
Mar 06 23:26:39 cloudron01 unbound[20802]: [20802:0] error: could not fflush(/var/lib/unbound/root.key): No space left on device
-
I just cannot get my head around how the disk can be allowed to fill to the point of a total system failure.
Slowdown, sure I understand - but it's a total fail and I can't see why this isn't preventable.
Is loading it up with data really all it takes to bring a Cloudron down?
There's a bunch of Apps that allow uploads; it really wouldn't take much effort to flood those with a few GB.
-
@nebulon @girish Apps have Memory & CPU allocations - any reason they can't have disk-space allocations too?
I'd rather a single app hits a wall than an entire server.
It seems all one would have to do to bring a whole Cloudron down this way would be to send a lot of email attachments to the point of disk saturation.
Maybe I'm wrong and it's something else - but feel free to delete this post and move to email if it's a reproducible risk.
-
@marcusquinn said in Disk space should never bring a whole server down:
@nebulon @girish Apps have Memory & CPU allocations - any reason they can't have disk-space allocations too?
Yes, the memory & CPU allocations are features of the Linux kernel cgroups. However, disk space allocation is not part of them.
I guess the issue to handle right now, at least, is that for some reason the disk is full. Running
docker image prune -a
sometimes frees up some disk space. Can you try that? Alternatively, if you drop me a mail on support, I can look into the server.
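For reference, a rough sketch of the wider Docker cleanup that sometimes helps in this situation (exact savings depend on the setup, so treat it as a starting point, not a Cloudron-blessed procedure):

# remove images not referenced by any container
docker image prune -a
# remove dangling build cache, if any
docker builder prune
# show what Docker thinks it is using overall
docker system df
-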
@girish said in Disk space should never bring a whole server down:
docker image prune -a
OK, thanks, tried that: "Total reclaimed space: 816.4MB"
Still not responding though. Have emailed support@ but it's 2am here and an early start, so back online in 8h or so, by which time it'll be your 2am, and I appreciate it's Saturday, so just grateful for pointers and hoping I might have some other requests for assistance waking up soon too.
-
Can we add a disk space email alert, and an event where, after some critical threshold, /tmp is cleaned up and Docker images are pruned by Cloudron?
Completely avoidable with a bit of this...
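As a very rough sketch of what I mean - the threshold, cron cadence, and cleanup actions below are just placeholders, nothing Cloudron actually ships:

#!/bin/bash
# hypothetical disk watchdog, run from cron every 15 minutes or so
THRESHOLD=90   # percent used on / at which cleanup kicks in
USED=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    docker image prune -af                # unused images
    journalctl --vacuum-size=200M         # cap the systemd journal
    find /tmp -type f -atime +1 -delete   # stale temp files
fi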
-
I'm wondering if maybe Cloudron should have its own volume by default.
A quick search on the subject, but kinda tired now:
- https://www.reddit.com/r/docker/comments/loleal/how_to_limit_disk_space_for_a_docker_container/
- https://guide.blazemeter.com/hc/en-us/articles/115003812129-Overcoming-Container-Storage-Limitation-Overcoming-Container-Storage-Limitation#:~:text=In the current Docker version,be left in the container
- https://stackoverflow.com/questions/38542426/docker-container-specific-disk-quota
-
@marcusquinn Managed to bring it up by truncating many logs. Should be coming up in a bit, hold on.
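For anyone else stuck at this point, the generic version of that rescue is roughly the following (paths are examples only):

# find the largest log files on the box
find /var/log -type f -size +100M -exec ls -lh {} \;
# empty a runaway log in place rather than deleting it,
# so running processes keep a valid file handle
truncate -s 0 /var/log/syslog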
-
@girish Ahhhh - thank you kindly!
I have an unused 1TB volume mounted, although I'm not sure how much of the remaining free space the Move function uses, as I guess that was what was killing it when I triggered the move of the 16GB Jira app data to it?
-
@girish said in Disk space should never bring a whole server down:
Managed to bring it up by truncating many logs
Is this perhaps related to the issue I reported a little while back too, regarding the logrotate not running properly under certain circumstances?
-
@d19dotca I remembered that mention, although my fading brain never found or got around to looking at that. I kinda think this situation is a bit too easy to get into and hard to get out of once it's Terminal-only.
-
Going to trigger a move on Confluence to the mounted volume; it's 4.5GB with 7.5GB free space now on the main volume - so hopefully that's enough working space, but I have to zzz; problems I don't immediately know how to solve are kinda exhausting.
-
@marcusquinn looks like things are back up! There is ~7GB left, so hopefully that should hold up for some time.
-
I am looking into some clues on what can be done to mitigate this and will report back. BTW, for the volume suggestion, this is possible. In fact, we used to do this a very long time ago, with each app having its own btrfs partition. Usually, people start with a simple VPS. This means that for this to work out of the box, one has to create a loopback file system, which is very slow. Also, when I logged in to your server, it was MySQL that was down, unhappy with the lack of disk space.
I am wondering if the solution involves suggesting a specific kind of setup to users who want to protect themselves against this kind of issue. That is totally doable (for example, suggesting the user move platformdata and boxdata to a separate volume/disk post-installation).
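The non-Cloudron part of that is ordinary Linux; assuming the extra disk shows up as /dev/sdb (device name and mount point below are only examples), it looks something like this, with the actual platformdata/boxdata relocation still following whatever the docs describe:

# format and mount an extra disk to hold Cloudron data
mkfs.ext4 /dev/sdb
mkdir -p /mnt/cloudron-data
mount /dev/sdb /mnt/cloudron-data
# keep the mount across reboots; nofail so a missing disk doesn't block boot
echo '/dev/sdb /mnt/cloudron-data ext4 defaults,nofail 0 2' >> /etc/fstab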
-
@robi We actually have a disk space alert, in fact, it's there right now in the dashboard.
But the above is not super useful because it's just checking space in a cronjob. This cronjob is quite conservative because we don't want to keep spinning the disk too much. I am not aware of a way to get a "signal" from the server when disk space limits are hit. If a server fills up too fast between cron runs, the whole thing is useless...
-
I've triggered some bigger app data moves to the mounted 1TB volume but it seems to have chewed through 3GB of the remaining free space on the main volume already and I'm back to "Cloudron is offline. Reconnecting". Probably just making hasty tiredness errors now.
-
@marcusquinn maybe it's best to move them by hand first. Can you send me the apps you want to move by email, and I can move them by hand, since this seems to keep hitting a wall - i.e. free space -> try to free up space -> run out of space and start over...
-
@girish yes, but does it email you when approaching the threshold?
- threshold setting? (twice a day should be plenty)
- action setting checkboxes? (maybe a custom one too?)
- heck, even deleting a non-critical app would be fine since it's restorable from backup.
-
@marcusquinn Hang in there. Good luck.
-
WHM has disk space limitations. Is it possible to copy their method and implement it in Cloudron?
-
Thanks for all the help - I managed to get some extra hands on deck this morning and we're moving lots of data to a mounted volume for much more headroom.
I still think it's a little too vulnerable having this hazard able to bring a server down.
Also, I couldn't see if there's a way to set Email storage to be a mounted volume too?
-
@girish Also, the current warning is IMO not very useful if the threshold is not configurable. Depending on how the server is used, a few GB may be enough for weeks, or for mere hours if there's media stuff on the server, or if a user uploads stuff on Nextcloud or something.
-
@marcusquinn said in Disk space should never bring a whole server down:
Also, I couldn't see if there's a way to set Email storage to be a mounted volume too?
Currently, emails are part of boxdata and you need to move the boxdata entirely. I've done this on my current server due to the amount of email stored for my clients. The steps for this are at https://docs.cloudron.io/storage/#default-data-directory for reference.
I'm making an assumption that by volume you meant an external disk vs the actual Volumes function that Cloudron has.
There is a feature request, I believe, to keep emails separate, but boxdata really doesn't contain much data at all other than emails, so it's doable as-is for now. It'd just be nice to see the GUI handle moving the email data much like it does for apps.
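As a rough outline of what that kind of move involves (the linked docs are the authoritative steps; the paths and target mount below are assumptions for illustration):

# stop the Cloudron admin service before touching boxdata
systemctl stop box
# copy mail and box data to the external disk, preserving ownership and attributes
rsync -aHAX /home/yellowtent/boxdata/ /mnt/cloudron-data/boxdata/
# ...then point Cloudron at the new location per the docs, and restart
systemctl start box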
-
@d19dotca Thanks. I'm an app specialist and anything more than a few minutes digging in the dirt is my kinda hell. Just getting brain fog now, as I've lost a bunch of important work and 2 days of progress on it.
-
Anyone know where /app/data actually is in the full file system structure?
I'm trying to navigate a snapshot clone to see if that has the missing config.php file that hasn't come back for EspoCRM but just not seeing anything obvious and searching docs hasn't found me the clue.
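If it helps anyone searching later: inside the container it's /app/data, and on the host I believe the per-app data normally sits under Cloudron's data directory (something like /home/yellowtent/appsdata/<app id>/data, but treat that as an assumption). When in doubt, a blunt search of the mounted snapshot works (the mount point here is just an example):

# look for every config.php in the snapshot clone and show sizes,
# to spot a usable (non-zero) copy
find /mnt/snapshot -name config.php -exec ls -lh {} \; 2>/dev/null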
-
The problem I have is that EspoCRM Administration writes changes back to
/app/data/data/config.php
- however, that file also contains all the database connection details, the password hash, basically everything that instance needs to work. So when the disk was full, it seems to have somehow written a 0 KB version of config.php.
And because the rsync encryption failed to back up EspoCRM, the Cloudron backups aren't complete.
So that leaves provider backup snapshot restore and dig around.
Basically, whatever anyone does - never allow the disk to get full - the cascade of problems that can happen from that interruption is just one massive time hole.
-
@marcusquinn Holy sh*t, with some dumb luck trying everything I know, I seem to have fixed it.
Lesson learnt - never run out of disk space - Sod's law says it will be the apps you rely on most that get corrupted.
Now, given the many open ways to load up a Cloudron with data (email/FilePizza/PrivateBin) maybe there's a way to avoid this causing a total fail?
-
@marcusquinn said in Disk space should never bring a whole server down:
Now, given the many open ways to load up a Cloudron with data (email/FilePizza/PrivateBin) maybe there's a way to avoid this causing a total fail?
I think FilePizza is fully P2P, so I'm not sure you could fill the server up with that (but you could with Jirafeau).
But yeah, I reckon configurable disk space notifications (e.g. email/notify me hourly/daily/whatever once I've only got x space left) would be a good first step to help this not happen.
-
Quick fix idea: maybe 70% full is a better nag threshold?
-
Thanks for all the feedback here. We discovered Cloudron a while back and have been testing it out on a number of servers over the last couple of months. We wanted to get a good handle on how everything works before rolling anything out into production. Firstly, it's an excellent platform and fills a great need.
But we did run into a little problem with one of our test servers running on a DigitalOcean droplet. About 2 weeks ago it went from using 20GB of space to nearly 80GB in the space of 4 hours. We received an alert from DigitalOcean, however things were happening so fast that all we could initially do was upgrade the instance; this gave us half an hour and then we had to do it again, then we just attached a 100GB volume.
Although just in testing, there was a WordPress app we were fond of, so we transferred it off the Cloudron and left a Pixelfed app. Somewhere between shutting down the server to add the volume and moving the WordPress app, the space usage stopped increasing. I know what you're thinking - WordPress, right? No, we checked the install beforehand and it was working fine on another server. We then removed the 100GB volume and resized the DigitalOcean server back to its original size and everything was back to normal. I figured that some server updates ran that morning and some out-of-control process started this, and resizing the server up and down somehow got rid of the problem.
-
@bestknownhost Did you perhaps have AdGuard installed?
-
@robi No we didn't.
-
@bestknownhost did you figure out what was filling up the disk with du -sh /* and drilling down?
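i.e. something like this, repeating on whichever directory turns out biggest (ncdu does the same job interactively):

# largest top-level directories on the root filesystem only
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -20
# then drill into the biggest one, e.g.
du -xh --max-depth=1 /var 2>/dev/null | sort -h | tail -20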
-
@bestknownhost For a start, to clarify: are you using external backup storage or just the local disk for now? Using the local disk may cause disk usage to go up quickly depending on how much data you've put into the server.
If that is not the case, then you may have hit some issue we recently saw with mysql binlogs https://forum.cloudron.io/topic/4510/able-to-clean-up-binlog-files-in-var-lib-mysql-directory?_=1616402616926 ?
And as @robi mentioned, do you have any idea so far what is using all that disk space?
-
@marcusquinn I was running into a similar issue while testing some stuff, most likely because of the Nextcloud plugin "External Sites".
I am not sure right now, but I don't think it recreates the files; more likely it writes a lot of logs, since CPU got pushed as well. (THAT'S NOT A TUTORIAL! IT'S ONLY FOR REPRODUCIBILITY OF BUGS!)
How to reproduce the issue:
1. Create a Nextcloud and share a folder (structure) to a public link.
2. Insert this link into any secondary website (WordPress etc.) as a button that does NOT open a new tab.
3. Add the plugin "External Sites" to Nextcloud - go to config and add the secondary website.
4. By using the embed mode of the External Sites implementation, this issue can be triggered by a user with access to the External Sites buttons.
4.1 *Actually, I did not test it using a non-admin user as the "trigger" user.
How to finally trigger the filling of disk space?
-> Now follow the link in Nextcloud to your secondary website.
-> By clicking the button back into Nextcloud, the issue is triggered. (THAT'S NOT A TUTORIAL! IT'S ONLY FOR REPRODUCIBILITY OF BUGS!)
-
Here is my SOLUTION:
It does not solve the root cause of why you are running out of space, but with this methodology you will buy yourself time.
Generate 3 files of 2 gigabytes each.
This is one way of generating these files:
fallocate -l 2G /storage-padding-buffer-2-gb-file1.img
fallocate -l 2G /storage-padding-buffer-2-gb-file2.img
fallocate -l 2G /storage-padding-buffer-2-gb-file3.img
When your server is out of storage, you may delete one or all of these padding files, so that you regain the space you need to rescue the server.
I have had the same issue with cloudron, because over time, storage will run out.
For now I chose not to upgrade the storage of my VPS because it would double my hosting cost for this node, from USD 400 to USD 800 per year. That's DigitalOcean pricing for you, but I digress.
This is a systems engineering issue and isn't caused by Cloudron. However, I would not have anything against an elegant solution from the team if it were possible :).
I want to say I am working on a post to describe how I did a massive cleanup, and exactly which steps I took to regain loads of space. TL;DR: use ncdu, analyse all containers and identify where apps are storing logs and rotate these, and clear the NPM package cache in each container. More to come.
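The container side of that was roughly the following (the container name and cache path are just examples from my case, not something universal):

# see which containers' writable layers have grown the most
docker ps --size
# clear a package-manager cache found with ncdu inside a container
docker exec <container-name> rm -rf /usr/local/share/.cache/yarn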
-
Maybe the Cloudron app needs to generate its own partition to run from, where regular app storage can't saturate the OS or Cloudron partitions?
-
@marcusquinn said in Disk space should never bring a whole server down:
Maybe the Cloudron app needs to generate its own partition to run from, where regular app storage can't saturate the OS or Cloudron partitions?
Right. The main issue is that it's not possible to create proper disk partitions on a VPS,
i.e. one can only create file-backed loopback file systems, but such things are not to be used in production and I have no idea about their reliability/durability.
-
@girish This is still a huge problem.
My production server has failing applications again due to disk space filling up. Luckily, DigitalOcean's backup functionality saved my setup this time again. If I had relied only on Cloudron for backups, it would have been disaster time.
I am not asking you for help to fix it, or trying to blame anyone. But this needs an engineered solution from upstream - you guys.
You could, for instance, recommend that we make use of separate storage volumes on the system drive. This brings the TCO down for storage. Recommendations and verified testing from you would be valuable for us as customers.
I can imagine hundreds of other customers of yours that are seeing the same issue.
Also:
Your assertion above that disk space is cheap is true for physical drive storage, but not for VPS storage. For me to double disk space from 80GB to 160GB also doubles my yearly VPS cost. I would believe a sizable portion of your users are hosting Cloudron on a VPS.
Of course, you could temporarily sidestep this problem by recommending 160GB of storage capacity. This might alienate some potential users.
Now onwards to repair my Cloudron install and apps!
edit:
Solving the root cause
Used ncdu to browse every container:
- Gained 1GB of storage by deleting /usr/local/share/.cache/yarn/ on a container volume
- Gained 500MB of storage by deleting the Anaconda distribution package cache within a container volume
Analysis: there seem to be space-wasteful ways of letting Meteor spread around old versions of libraries and builds (?).
-
@makemrproper said in Disk space should never bring a whole server down:
If I had relied only on Cloudron for backups it would have been disaster time.
Can you clarify this? Why are you unable to rely on Cloudron backups?
I agree with the bigger point though. Unfortunately, we have found no clear technical solution to solve disk space issues even outside of Cloudron. What do people generally do when hosting apps on a VPS?
As for the Anaconda cache and Meteor, are you referring to the Jupyter and Wekan apps? Maybe those packages can be fixed to clear the cache.
-
@makemrproper Hey makemrproper,
If you want a VPS with a lot of space, you might try looking into BuyVM's storage volumes.
They take a bit of time to set up (you have to mount the volume), but it's very difficult to get a comparable deal on the space elsewhere.
It's also very well cached, so I've found it to be almost as performant disk-wise as what I've used on DigitalOcean.
They are a smaller provider though, so the reliability won't be quite as good (you might have a little more downtime compared to other providers).
I know this doesn't necessarily solve your issue, but more disk space is always great :).
-
@girish I think what @makemrproper meant was that if their backups were on the same server as the Cloudron, they wouldn't be able to back up to, or restore from, them since the disk was not responsive. I am impressed again by how patient you all are with these situations. Keep up the good work and attitudes.
-
@makemrproper said in Disk space should never bring a whole server down:
This is still a huge problem.
I understand the desire for an approach that stops the problem happening in the first place. In the interim, I really recommend an alert system like ntfy. Use their hosted service or host it yourself (I have self-packaged it for Cloudron - a more recent version is at https://forum.cloudron.io/post/54552).
Set a cron job for as often as you want, running a script around df -h, and set alert levels in the script that send notifications to a dashboard or an iOS/Android device.
As self-hosters we want to rely on things working, but we can't escape our responsibility to keep an eye on things. ntfy handles this in one of the simplest ways.
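A minimal sketch of that kind of script, assuming an ntfy topic at https://ntfy.sh/your-disk-alerts (topic name and threshold are placeholders):

#!/bin/bash
# cron this hourly: push an ntfy notification when / passes a threshold
THRESHOLD=80
TOPIC_URL="https://ntfy.sh/your-disk-alerts"
USED=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    curl -s -d "Disk on $(hostname) is ${USED}% full" "$TOPIC_URL"
fi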