Cloudron UI shows "Cloudron crashed/stopped" but logs show no error
-
@fbartels Do you see any OOM errors in `dmesg`? You can also try giving the backup process a lot more memory in Backups -> Configure -> Advanced.
BTW, the reason you are seeing backups created/deleted is that Cloudron is "cleaning" up the intermediate backups. Basically, let's say the Cloudron has 10 apps. It uploads 5 apps but the 6th app fails. It will then later clean up the backups of those 5 apps because it considers them intermediate artifacts.
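A minimal sketch of the kernel log check, in case it helps to narrow the output down. The function name `filter_oom_lines` and the list of patterns are illustrative assumptions, not anything Cloudron ships, and the patterns are not exhaustive.

```shell
# Hypothetical helper: read kernel log lines on stdin and keep only those
# that commonly indicate memory pressure. The pattern list is an assumption.
filter_oom_lines() {
    # "|| true" keeps the function from failing when nothing matches.
    grep -iE 'out of memory|oom-killer|SLUB: Unable to allocate memory' || true
}
```

Usage would be something like `dmesg -T | filter_oom_lines` as root. An empty result only means none of these patterns appeared, not that memory was never a problem.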
-
yes, seem to be an out of memory issue:
[2068863.944193] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068863.944205] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068863.944208] node 0: slabs: 179, objs: 2864, free: 0
[2068866.935154] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068866.935159] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068866.935161] node 0: slabs: 190, objs: 3040, free: 0
[2068867.686548] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068867.686556] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068867.686559] node 0: slabs: 192, objs: 3072, free: 0
[2068874.914186] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068874.914194] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068874.914198] node 0: slabs: 193, objs: 3088, free: 0
On my first try I set the limit to 1GB (from 400MB) but still had the same result. Now rerunning with a limit of 2GB.
The host itself reports the following memory usage:
root@my:~# LC_ALL=C free -m
              total        used        free      shared  buff/cache   available
Mem:           7985        3230         186         118        4568        3583
Swap:          4095        1906        2189
-
@fbartels Not entirely sure what's going on, but can you try `echo 3 > /proc/sys/vm/drop_caches`? This supposedly frees up that buff/cache column per this article. Also, as a first step, can you just disable the backup of that single app `cloud.domain.com` and see if that works? How big is `cloud.domain.com`? (You can do `du -hcs /home/yellowtent/appsdata/<appid>/` to figure it out.)
-
Sorry for not getting back earlier @girish. Was caught up in something else.
I misread your comment and stopped the app in question and that brought a successful backup. Seems I did not stop it when stopping most of the other apps.
The "cloud" app is currently at 8.6GB and slightly bigger than the next biggest app (which still succeeds):
root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a
8.6G /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a
8.6G total
root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe
8.4G /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe
8.4G total
root@my:~# LC_ALL=c du -hcs /home/yellowtent/boxdata/mail/
7.9G /home/yellowtent/boxdata/mail/
7.9G total
root@my:~#
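The per-directory `du` calls above could also be folded into one pass over all app data directories. This is only a sketch: `list_dir_sizes` is a made-up helper name, and it reports sizes in KiB rather than the human-readable units used above so that the sort stays purely numeric.

```shell
# Hypothetical helper: print the size in KiB of every subdirectory of $1,
# largest first.
list_dir_sizes() {
    # LC_ALL=C keeps the du and sort behaviour independent of the system locale.
    LC_ALL=C du -sk "$1"/*/ | LC_ALL=C sort -rn
}
```

Run as root, `list_dir_sizes /home/yellowtent/appsdata` would list one line per app id, which makes it easier to see which apps are the big ones.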
I will try another backup with really all apps stopped.
Edit: Ah, no that won't work as stopped apps do not seem to be part of the process when triggering a complete backup.
PS: On a related note, I am using encrypted backups (as tgz) and wanted to turn off encryption for a test, but even with the encryption password removed the new backup files still have the `.enc` file ending. I did not check whether the files were still encrypted, though.
-
If I trigger the app backup of the "cloud" app alone it succeeds, but as part of the box backup it seems to be that app that makes the full backup fail most of the time. I write "most of the time" because when I got back to it this morning the whole box backup had already failed at a much smaller app.
In my quest to move my whole Cloudron to another server, I spent yesterday working on a script that calls the Cloudron API directly to disable automatic backups of all apps, trigger each individual app's backup, immediately stop the app, and then finally do a box backup. But then I came to the realisation that apps not included in the automatic backup are also not part of the box backup (plus I cannot trigger app backups for stopped apps).
So my best course of action seems to be to manually back up just the "cloud" app and have everything else covered by the whole box backup, then restore the box backup on a new system and manually import the last backup of my "cloud" app on the new host.
edit: I was just waiting for a box backup to complete again (with the cloud app enabled) and while it finished backing up all apps it did then finally crash when backing up "box".
-
@fbartels you still see the same OOM errors in dmesg, correct? and any errors on the minio side as you previously reported?
BTW, about the `.enc` file ending, it's probably the old backup files. Currently, old backups from a previous config are not removed. You can safely remove them.
Do you think you can give us SSH access, so I can debug this a bit? If so, please drop me a mail at support@cloudron.io.
-
@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:
OOM errors in dmesg, correct?
yes and no. At times the backup stopped without the oom error.
@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:
give us SSH access
gladly. At the moment I have the automatic backup for my two biggest apps deactivated, which makes the box backup succeed. When triggering backups of these individually they succeed however.
-
@fbartels Thanks for the access! It turns out the issue has nothing to do with backups. It seems that whenever the "check disk space" cron job runs, the code crashes and this in turn brings down the backup process as well. This is related to the `LC_ALL=C` thread: we added that setting to box.service because the system is on a different locale. I think the update to 6.2.7 removed it (since the change is only in the 6.3 branch). I added it back and now the backup succeeds.
(And of course, the backup takes more time when those big apps are also backed up, and then it's just enough time for the cron task to run and crash.)
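To illustrate why the locale can matter for a disk space check: the thread does not say how Cloudron's check is implemented, so the following is only a sketch under the assumption that such a check parses the output of a command like `df`. The helper name `disk_free_kib` is made up for this example.

```shell
# Hypothetical helper: print the free space in KiB of the filesystem
# that contains the path given as $1.
disk_free_kib() {
    # -P selects the POSIX output format, -k fixes the unit to 1024-byte
    # blocks, and LC_ALL=C pins the output to the C locale for this call.
    # The fourth column of the second line is the "Available" value.
    LC_ALL=C df -Pk "$1" | awk 'NR == 2 { print $4 }'
}
```

For example, `disk_free_kib /home/yellowtent` prints a single number. If the setting lives in the unit as an `Environment=` entry (an assumption here), `systemctl show box.service -p Environment` should show whether `LC_ALL=C` is still present after an update.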
-
@girish Thanks for looking at this. I just triggered another backup with all apps active and it succeeded. I guess then it's finally time to move to the new hardware (plus making sure that the new system is using the C locale by default).
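A small sketch for checking which locale a process would pick up on the new system. It follows the usual POSIX precedence where `LC_ALL` overrides the category variables and `LANG` is the fallback. The function name `effective_locale` is made up, and only the message category is considered here.

```shell
# Hypothetical helper: print the locale name that would apply to messages,
# following the precedence LC_ALL > LC_MESSAGES > LANG > built-in "C".
effective_locale() {
    printf '%s\n' "${LC_ALL:-${LC_MESSAGES:-${LANG:-C}}}"
}
```

The plain `locale` command prints the full set of effective values and is the more complete check. The function is mainly handy inside a script that wants to warn when it is not running under `C`.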