Cloudron UI shows "Cloudron crashed/stopped" but logs show no error

fbartels

I just noticed that on the receiving side the following is logged:

minio_1    | API: PutObjectPart(bucket=cloudron-9wdeu, object=snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc)
minio_1    | Time: 18:30:08 UTC 04/14/2021
minio_1    | DeploymentID: 240381b9-8df8-4c64-bcf6-400d9433e406
minio_1    | RequestID: 1675CCF846C91F23
minio_1    | RemoteHost: 94.16.123.209
minio_1    | Host: minio2.host-consultants.de
minio_1    | UserAgent: aws-sdk-nodejs/2.850.0 linux/v14.15.4 callback
minio_1    | Error: unexpected EOF (*errors.errorString)
minio_1    |        4: cmd/fs-v1-helpers.go:323:cmd.fsCreateFile()
minio_1    |        3: cmd/fs-v1-multipart.go:317:cmd.(*FSObjects).PutObjectPart()
minio_1    |        2: cmd/object-handlers.go:2366:cmd.objectAPIHandlers.PutObjectPartHandler()
minio_1    |        1: net/http/server.go:2069:http.HandlerFunc.ServeHTTP()

it also does not always stop at the exact same position, but with the same app (around 45 minutes into the backup)

2021-04-14T18:29:29.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5527M@8MBps (cloud.domain.com)"}
2021-04-14T18:29:39.172Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5627M@8MBps (cloud.domain.com)"}
2021-04-14T18:29:49.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5733M@8MBps (cloud.domain.com)"}
2021-04-14T18:29:58.629Z box:shell backup-snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a (stdout): 2021-04-14T18:29:58.629Z box:backupupload process: rss: 228978688 heapTotal: 69120000 heapUsed: 28616976 external: 39227347
2021-04-14T18:29:58.629Z box:backupupload v8 heap : used 28622424 total: 69120000 max: 2197815296

2021-04-14T18:29:59.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5834M@8MBps (cloud.domain.com)"}

girish

@fbartels Do you see any OOM errors in dmesg ? You can also try giving the backup process lot more memory in Backups -> Configure -> Advanced.

BTW, the reason you are setting backups created/deleted is that Cloudron is "cleaning" up the intermediate backups. Basically, let's say the cloudron has 10 apps. It uploads 5 apps but the 6th app fails. It will then later clean up the backups of the 5 apps because it considers those as intermediate artifacts.

fbartels

yes, seem to be an out of memory issue:

[2068863.944193] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068863.944205]   cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068863.944208]   node 0: slabs: 179, objs: 2864, free: 0
[2068866.935154] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068866.935159]   cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068866.935161]   node 0: slabs: 190, objs: 3040, free: 0
[2068867.686548] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068867.686556]   cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068867.686559]   node 0: slabs: 192, objs: 3072, free: 0
[2068874.914186] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[2068874.914194]   cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0
[2068874.914198]   node 0: slabs: 193, objs: 3088, free: 0

On my first try I set the limit to 1GB (from 400MB) but still had the same result. Now rerunning with a limit of 2GB.

The host itself reports the following memory usage:

root@my:~# LC_ALL=C free -m
              total        used        free      shared  buff/cache   available
Mem:           7985        3230         186         118        4568        3583
Swap:          4095        1906        2189

fbartels

Is there anything else I can do? I have stopped most apps on the server to free up ram, but the result is still the same.

girish

@fbartels Not entirely sure what's going on , but can you try echo 3 > /proc/sys/vm/drop_caches . This supposedly frees up that buf/cache column per this article. Also, as a first step, can you just disable the backup of that single app cloud.domain.com and see if that works? How big is cloud.domain.com ? (you can do du -hcs /home/yellowtent/appsdata/<appid>/ to figure out).

fbartels

Sorry for not getting back earlier @girish. Was caught up in something else.

I misread your comment and stopped the app in question and that brought a successful backup. Seems I did not stop it when stopping most of the other apps.

The "cloud" app is currently at 8.6GB and slightly bigger than the next biggest app (which still succeeds):

root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a
8.6G	/home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a
8.6G	total
root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe
8.4G	/home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe
8.4G	total
root@my:~# LC_ALL=c du -hcs /home/yellowtent/boxdata/mail/
7.9G	/home/yellowtent/boxdata/mail/
7.9G	total
root@my:~#

I will try another backup will really all app stopped.

Edit: Ah, no that won't work as stopped apps do not seem to be part of the process when triggering a complete backup.

PS: on a related note. I am using encrypted backups (as tgz) and wanted to turn off encryption for a test, but even with the encryption password removed the new backup files still have the .enc file ending. I did not check though if the file was still encrypted however.

fbartels

If I trigger the app backup of the "cloud" app alone it succeeds, but as part of the box backup it seems to be that app that makes the full backup fail most of the time. I write most of the time since when getting back to it this morning the whole box backup already failed at a much smaller app.

In my quest to move my whole Cloudron to another server I have spent yesterday working on a script to directly trigger the cloudron api to disable automatic backups of all apps, trigger each individual apps backup and then immediately stop the app and then finally doing a box backup. But then I came to the realisation that apps not included in the automatic backup are also not part of the box backup (plus I cannot trigger app backups from stopped apps).

So my best of course of action seems to be to manually backup just the "cloud" app and have everything else covered through the whole box backup. And then restore the box backup on a new system and then manually import the last backup of my "cloud" app on the new host.

edit: I was just waiting for a box backup to complete again (with the cloud app enabled) and while it finished backing up all apps it did then finally crash when backing up "box".

girish

@fbartels you still see the same OOM errors in dmesg, correct? and any errors on the minio side as you previously reported?

BTW, about the .enc file ending, it's probably the old backup files. Currently, old backups of previous config are not removed. You can safely remove them.

Do you think you can give us SSH access, so I can debug this a bit? If so, please drop me a mail on support@cloudron.io .

fbartels

@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:

OOM errors in dmesg, correct?

yes and no. At times the backup stopped without the oom error.

@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:

give us SSH access

gladly. At the moment I have the automatic backup for my two biggest apps deactivated, which makes the box backup succeed. When triggering backups of these individually they succeed however.

girish

@fbartels Thanks for the access! It turns out the issue has nothing with backups. It seems that whenever the "check disk space" cron job runs, the code crashes and this in turn brings down the backup process as well. This is related to the LC_ALL=C thread that we added to box.service because the system is on a difference locale. I think the update to 6.2.7 removed it (since the change is only in 6.3 branch). I added it back and now the backup succeeds.

(And of course, the backup takes more time when those big apps are also backed up and then it's just enough time for the cron task to run and crash).

fbartels

@girish Thanks for looking at this. Just triggered another backup with all apps active and it succeeded. I guess then it's finally time to move to the new hardware (plus making sure that the new system is using the c locale by default)

Cloudron makes it easy to run web apps like WordPress, Nextcloud, GitLab on your server. Find out more or install now.

Cloudron Forum

Cloudron UI shows "Cloudron crashed/stopped" but logs show no error