Disk space should never bring a whole server down
-
Anyone know where /app/data actually is in the full file system structure?
I'm trying to navigate a snapshot clone to see whether it has the config.php file that is still missing for EspoCRM, but I'm not seeing anything obvious and searching the docs hasn't given me a clue.
-
The problem I have is that EspoCRM Administration writes changes back to /app/data/data/config.php. However, that file also contains all the database connection details, password hash, basically everything for that instance to work.
So when the disk was full, it seems to have somehow written a 0 KB version of config.php.
And because the rsync encryption was failing to back up EspoCRM, the Cloudron backups aren't complete.
So that leaves restoring a provider backup snapshot and digging around.
Basically, whatever anyone does - never allow the disk to get full - the cascade of problems that can happen from that interruption is just one massive time hole.
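For anyone checking a restored snapshot for files that were truncated like this, here is a minimal sketch that lists zero-byte files under a directory. The function name `find_empty_files` is made up for illustration, and the path in the usage note is just the one mentioned above; whether other apps leave empty files behind after a full disk is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list zero-byte regular files under a directory.
find_empty_files() {
    local dir="$1"
    # -xdev: do not cross into other filesystems
    # -type f -size 0: regular files that are completely empty
    find "$dir" -xdev -type f -size 0 -print
}
```

Usage would be something like `find_empty_files /app/data`, run inside the app's container or against the mounted snapshot clone. An empty file is not always a corrupted one, so treat the output as a list of candidates to inspect.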
-
@marcusquinn Holy sh*t, with some dumb-luck trying everything I know, I seem to have fixed it.
Lesson learnt: never run out of disk space. Sod's law says it will be the apps you rely on the most that get corrupted.
Now, given the many open ways to load up a Cloudron with data (email/FilePizza/PrivateBin) maybe there's a way to avoid this causing a total fail?
-
@marcusquinn said in Disk space should never bring a whole server down:
Now, given the many open ways to load up a Cloudron with data (email/FilePizza/PrivateBin) maybe there's a way to avoid this causing a total fail?
I think FilePizza is fully P2P, so I'm not sure you could fill the server up with that (but you could with Jirafeau).
But yeah, I reckon configurable disk space notifications (e.g. email/notify me hourly/daily/whatever once I've only got x space left) would be a good first step to help this not to happen.
-
Quick fix idea: maybe 70% full is a better nag threshold?
-
Thanks for all the feedback here. We discovered Cloudron a while back and have been testing it out on a number of servers over the last couple of months. We wanted to get a good handle on how everything works before rolling anything out into production. Firstly, it's an excellent platform and fills a great need. But we did run into a little problem with one of our test servers running on a DigitalOcean droplet.
About 2 weeks ago it went from using 20 GB of space to nearly 80 GB in the space of 4 hours. We received an alert from DigitalOcean; however, things were happening so fast that all we could initially do was upgrade the instance. This gave us half an hour and then we had to do it again, then we just attached a 100 GB volume.
Although it was just in testing, there was a WordPress app we were fond of, so we transferred it off the Cloudron and left a Pixelfed app. Somewhere between shutting down the server to add the volume and moving the WordPress app, the space usage stopped increasing. I know what you're thinking: WordPress, right? No, we checked the install beforehand and it was working fine on another server.
We then removed the 100 GB volume and resized the DigitalOcean server back to its original size, and everything was back to normal. I figure that some server updates ran that morning and some out-of-control process started this, and resizing the server up and down somehow got rid of the problem.
-
@bestknownhost Did you perhaps have AdGuard installed?
-
@bestknownhost did you figure out what was filling up the disk with du -sh /* and drilling down?
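A rough sketch of that drill-down, for anyone who wants it sorted: the function below prints the largest immediate children of a directory, biggest first. The name `largest_dirs` and its defaults are made up for illustration. It assumes a `du` that supports `-d` (GNU coreutils and BSD both do).

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the largest entries directly under a directory.
largest_dirs() {
    local dir="${1:-/}" count="${2:-10}"
    # -x: stay on one filesystem, -k: sizes in KiB, -d 1: one level deep.
    # The first output line is the total for the directory itself.
    du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Start with `largest_dirs /`, then call it again on whichever directory tops the list (for example `largest_dirs /var`) and repeat until you reach the files that are growing.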
-
@bestknownhost for a start to clarify, are you using an external backup storage or just the local disk for now? Using the local disk may cause disk usage to go up quickly depending on how much data you've put into the server.
If that is not the case, then you may have hit some issue we recently saw with mysql binlogs https://forum.cloudron.io/topic/4510/able-to-clean-up-binlog-files-in-var-lib-mysql-directory?_=1616402616926 ?
And as @robi mentioned, do you have any idea so far what is using all that disk space?
-
@marcusquinn I was running into a similar issue while testing some stuff, most likely because of the Nextcloud plugin "External Sites".
I am not sure right now, but I don't think it recreates the files; more likely it writes a lot of logs, since the CPU got pushed as well.
(THAT'S NOT A TUTORIAL! IT'S ONLY FOR REPEATABILITY OF BUGS!)
How to reproduce the issue:
1. Create a Nextcloud and share a folder (structure) via a public link.
2. Insert this link into any secondary website (WordPress etc.) as a button that does NOT open a new tab.
3. Add the plugin "External Sites" to Nextcloud, go to its config and add the secondary website.
4. With the embed mode of the External Sites implementation, the issue can be triggered by a user with access to the External Sites buttons.
4.1 Note: I did not actually test it with a non-admin user as the "trigger" user.
How to finally trigger the filling of disk space:
-> Now follow the link in Nextcloud to your secondary website.
-> Clicking the button that leads back into Nextcloud triggers the issue.
(THAT'S NOT A TUTORIAL! IT'S ONLY FOR REPEATABILITY OF BUGS!)
-
-
Here is my SOLUTION:
It does not solve the root cause why you are running out of space, but with this methodology you will buy yourself time.
Generate 3 files of 2 gigabytes each.
This is one way of generating these files:
fallocate -l 2G /storage-padding-buffer-2-gb-file1.img
fallocate -l 2G /storage-padding-buffer-2-gb-file2.img
fallocate -l 2G /storage-padding-buffer-2-gb-file3.img
When your server is out of storage, you can delete one or all of these padding files to regain the space you need to rescue the server.
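The "delete one when you are in trouble" step could be wrapped in a small helper so that only one padding file is released at a time. This is only a sketch: the function name `release_padding` is made up, and it assumes the padding files keep the names used above and live in the directory passed as the first argument (default `/`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete exactly one padding file to free emergency space.
release_padding() {
    local dir="${1:-/}" f
    for f in "$dir"/storage-padding-buffer-2-gb-file*.img; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -e "$f" ] || continue
        rm -f -- "$f" && echo "released $f" && return 0
    done
    echo "no padding files left in $dir" >&2
    return 1
}
```

Running `release_padding` frees 2 GB per call and returns a non-zero status once all three files are gone, which is the signal that the buffer is used up and the real cleanup cannot wait any longer. Remember to recreate the files with `fallocate` once the server is healthy again.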
I have had the same issue with cloudron, because over time, storage will run out.
For now I chose not to upgrade the storage of my VPS because it would double my hosting cost for this node, from USD 400 to USD 800 per year. That's DigitalOcean pricing for you, but I digress.
This is a systems engineering issue and isn't caused by Cloudron. However, I would not have anything against an elegant solution from the team if it were possible :).
I am working on a post to describe how I did a massive cleanup, and exactly which steps I took to regain loads of space. TL;DR: use ncdu, analyze all containers, identify where apps are storing logs and rotate these, and clear the NPM package cache in each container. More to come.
-
-
Maybe the Cloudron app needs to generate its own partition to run from, where regular app storage can't saturate the OS or Cloudron partitions?
-
@marcusquinn said in Disk space should never bring a whole server down:
Maybe the Cloudron app needs to generate its own partition to run from, where regular app storage can't saturate the OS or Cloudron partitions?
Right. The main issue is that it's not possible to create proper disk partitions on a VPS, i.e. one can only create file-backed loopback file systems, but such things are not meant for production use and I have no idea about their reliability/durability.
-
@girish This is still a huge problem.
My production server has failing applications again due to disk space filling up. Luckily DigitalOcean's backup functionality saved my setup again this time. If I had relied only on Cloudron for backups it would have been disaster time.
I am not asking you for help to fix it or to blame anyone. But this needs an engineered solution by upstream, you guys.
You could for instance recommend that we make use of separate storage volumes instead of the system drive. This brings the TCO for storage down. Recommendations and verified testing from you would be valuable for us as customers.
I can imagine hundreds of other customers of yours that are seeing the same issue.
Also:
Your assertion above that disk space is cheap is true for physical drive storage, but not for VPS storage. For me, doubling disk space from 80 GB to 160 GB also doubles my yearly VPS cost. I believe a sizable portion of your users are hosting Cloudron on a VPS.
Of course you could temporarily sidestep this problem by recommending 160 GB of storage capacity. This might alienate some potential users.
Now onwards to repair my Cloudron install and apps!
edit:
Solving the root cause
Used ncdu to browse every container
- Gained 1 GB of storage by deleting /usr/local/share/.cache/yarn/ on a container volume
- Gained 500 MB of storage by deleting the Anaconda distribution package cache within a container volume
Analysis: Meteor seems to spread old versions of libraries and builds around in space-wasteful ways (?).
-
@makemrproper said in Disk space should never bring a whole server down:
If I had relied only on Cloudron for backups it would have been disaster time.
Can you clarify this? Why are you unable to rely on Cloudron backups?
I agree with the bigger point though. Unfortunately, we have found no clear technical solution to solve disk space issues even outside of Cloudron. What do people generally do when hosting apps on a VPS?
As for the Anaconda cache and Meteor, are you referring to the Jupyter and Wekan apps? Maybe those packages can be fixed to clear the cache.
-
@makemrproper Hey makemrproper,
If you want a VPS with a lot of space, you might try looking into BuyVM's storage volumes.
They take a bit of time to set up (you have to mount the volume), but it's very difficult to get a comparable deal on space elsewhere.
It's also very well cached, so I've found disk performance to be almost as good as what I've used on DigitalOcean.
They are a smaller provider though, so the reliability won't be quite as good (you might have a little more downtime compared to other providers).
I know this doesn't necessarily solve your issue, but more disk space is always great :).
-
@girish I think what @makemrproper meant was if their backups were on the same server as the Cloudron, they wouldn't be able to back up, or restore, from them since the disk was not responsive. I am impressed again by how patient you all are with these situations. Keep up the good work and attitudes.
-
@makemrproper said in Disk space should never bring a whole server down:
This is still a huge problem.
I understand the desire for an approach that stops the problem happening in the first place. In the interim, I really recommend an alert system like ntfy. Use their hosted service or host it yourself (I have packaged it myself for Cloudron; a more recent version is at https://forum.cloudron.io/post/54552).
Set a cron job for as often as you want, running a script that calls df -h, and set alert levels in the script to send notifications to a dashboard or an iOS/Android device.
As self-hosters we want to rely on things working, but we can't escape our responsibility to keep an eye on things. ntfy handles this in one of the simplest ways.
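A minimal sketch of such a script, in case it helps someone get started. Everything named here is an assumption for illustration: the function names, the `NTFY_URL` variable, the placeholder topic and the default threshold of 70% (borrowed from the suggestion earlier in this thread). It uses `df -P` rather than `df -h` because the POSIX output format is easier to parse, and it publishes by POSTing the message body to the topic URL with curl.

```shell
#!/usr/bin/env bash
# Print the used-space percentage (digits only) of the filesystem holding $1.
disk_usage_percent() {
    df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage ($1) is at or above the threshold ($2).
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Send a message to an ntfy topic. NTFY_URL is a placeholder; point it at
# your own topic on the hosted service or on a self-hosted instance.
notify() {
    curl -fsS -H "Title: Disk space warning" -d "$1" \
        "${NTFY_URL:-https://ntfy.sh/my-hypothetical-topic}"
}

# Check one mount point and alert if it crosses the threshold (default 70%).
check_disk() {
    local mount="${1:-/}" threshold="${2:-70}" used
    used=$(disk_usage_percent "$mount")
    if over_threshold "$used" "$threshold"; then
        notify "Disk ${mount} is at ${used}% (threshold ${threshold}%)"
    fi
}
```

Saved as a script that ends with a call such as `check_disk / 70`, it could be run from cron, e.g. `*/15 * * * * /usr/local/bin/check-disk.sh` (path and interval are examples only). Adding a second, higher threshold with a more urgent message would be a natural extension.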