Cloudron UI shows "Cloudron crashed/stopped" but logs show no error
-
Hi,
as written in the other topic I am testing for a migration at the moment. And checking backups on the old system I noticed that the ui says that "Cloudron crashed/stopped". Triggering a manual backup also leads to the same message. Curiously the backups listed below are also already three weeks old. I know from the storage usage of my backup location however that new backups are created and deleted.
The log shows the following at the end:
Apr 14 18:28:00 box:shell backup-snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a (stdout): 2021-04-14T16:28:00.203Z box:storage/s3 Uploaded snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc with partSize 26214400: {"Location":"https://minio2.storage.com/cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc","Bucket":"cloudron-domaincom","Key":"snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc","ETag":"\"05dc3018683b7c699e8d15337a3f822a-285\""} Apr 14 18:28:00 box:shell backup-snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a (stdout): 2021-04-14T16:28:00.204Z box:backupupload upload completed. error: null Apr 14 18:28:00 box:backups runBackupUpload: result - {"result":""} Apr 14 18:28:00 box:backups cloud.domain.com uploadAppSnapshot: snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a done. 878.055 seconds Apr 14 18:28:00 box:backups Rotating app backup of dd5f0f98-2b81-495d-8828-9c967128304a to id 2021-04-14-154526-702/app_dd5f0f98-2b81-495d-8828-9c967128304a_2021-04-14-162800-413_v4.8.9 Apr 14 18:28:00 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying files from 0-1 (cloud.domain.com)"} Apr 14 18:28:00 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying (multipart) snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc (cloud.domain.com)"} Apr 14 18:28:00 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 1 - /cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc bytes=0-1073741823 (cloud.domain.com)"} Apr 14 18:28:00 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 2 - /cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc bytes=1073741824-2147483647 (cloud.domain.com)"} Apr 14 18:28:00 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 3 - /cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc bytes=2147483648-3221225471 (cloud.domain.com)"} Apr 14 18:29:34 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 3 - Etag: \"9c395fe5ecf46a165a6dd714314ae9b0\" (cloud.domain.com)"} Apr 14 18:29:34 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 4 - /cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc bytes=3221225472-4294967295 (cloud.domain.com)"} Apr 14 18:29:34 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 2 - Etag: \"9226ec0a9d137b7d8139a5dd328854cf\" (cloud.domain.com)"} Apr 14 18:29:34 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 5 - /cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc bytes=4294967296-5368709119 (cloud.domain.com)"} Apr 14 18:29:37 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 1 - Etag: \"4e243fbd130369c189673495d608d1e3\" (cloud.domain.com)"} Apr 14 18:29:37 box:tasks 7109: {"percent":78.77777777777779,"message":"Copying part 6 - /cloudron-domaincom/snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc bytes=5368709120-6442450943 (cloud.domain.com)"}
Edit: now seeing it with proper syntax highlighting I see that the log obviously stops mid upload.
The thing that may have broken this (although I thought the backups looked fine afterwards) is that I recently changed the url of my Minio endpoint to a new instance. For this I simply discarded the old backup state and started with an empty bucket.
I just noticed that on the receiving side the following is logged:
minio_1 | API: PutObjectPart(bucket=cloudron-9wdeu, object=snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc) minio_1 | Time: 18:30:08 UTC 04/14/2021 minio_1 | DeploymentID: 240381b9-8df8-4c64-bcf6-400d9433e406 minio_1 | RequestID: 1675CCF846C91F23 minio_1 | RemoteHost: 94.16.123.209 minio_1 | Host: minio2.host-consultants.de minio_1 | UserAgent: aws-sdk-nodejs/2.850.0 linux/v14.15.4 callback minio_1 | Error: unexpected EOF (*errors.errorString) minio_1 | 4: cmd/fs-v1-helpers.go:323:cmd.fsCreateFile() minio_1 | 3: cmd/fs-v1-multipart.go:317:cmd.(*FSObjects).PutObjectPart() minio_1 | 2: cmd/object-handlers.go:2366:cmd.objectAPIHandlers.PutObjectPartHandler() minio_1 | 1: net/http/server.go:2069:http.HandlerFunc.ServeHTTP()
it also does not always stop at the exact same position, but with the same app (around 45 minutes into the backup)
2021-04-14T18:29:29.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5527M@8MBps (cloud.domain.com)"} 2021-04-14T18:29:39.172Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5627M@8MBps (cloud.domain.com)"} 2021-04-14T18:29:49.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5733M@8MBps (cloud.domain.com)"} 2021-04-14T18:29:58.629Z box:shell backup-snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a (stdout): 2021-04-14T18:29:58.629Z box:backupupload process: rss: 228978688 heapTotal: 69120000 heapUsed: 28616976 external: 39227347 2021-04-14T18:29:58.629Z box:backupupload v8 heap : used 28622424 total: 69120000 max: 2197815296 2021-04-14T18:29:59.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5834M@8MBps (cloud.domain.com)"}
-
I just noticed that on the receiving side the following is logged:
minio_1 | API: PutObjectPart(bucket=cloudron-9wdeu, object=snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a.tar.gz.enc) minio_1 | Time: 18:30:08 UTC 04/14/2021 minio_1 | DeploymentID: 240381b9-8df8-4c64-bcf6-400d9433e406 minio_1 | RequestID: 1675CCF846C91F23 minio_1 | RemoteHost: 94.16.123.209 minio_1 | Host: minio2.host-consultants.de minio_1 | UserAgent: aws-sdk-nodejs/2.850.0 linux/v14.15.4 callback minio_1 | Error: unexpected EOF (*errors.errorString) minio_1 | 4: cmd/fs-v1-helpers.go:323:cmd.fsCreateFile() minio_1 | 3: cmd/fs-v1-multipart.go:317:cmd.(*FSObjects).PutObjectPart() minio_1 | 2: cmd/object-handlers.go:2366:cmd.objectAPIHandlers.PutObjectPartHandler() minio_1 | 1: net/http/server.go:2069:http.HandlerFunc.ServeHTTP()
it also does not always stop at the exact same position, but with the same app (around 45 minutes into the backup)
2021-04-14T18:29:29.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5527M@8MBps (cloud.domain.com)"} 2021-04-14T18:29:39.172Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5627M@8MBps (cloud.domain.com)"} 2021-04-14T18:29:49.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5733M@8MBps (cloud.domain.com)"} 2021-04-14T18:29:58.629Z box:shell backup-snapshot/app_dd5f0f98-2b81-495d-8828-9c967128304a (stdout): 2021-04-14T18:29:58.629Z box:backupupload process: rss: 228978688 heapTotal: 69120000 heapUsed: 28616976 external: 39227347 2021-04-14T18:29:58.629Z box:backupupload v8 heap : used 28622424 total: 69120000 max: 2197815296 2021-04-14T18:29:59.173Z box:tasks 7111: {"percent":78.77777777777779,"message":"Uploading backup 5834M@8MBps (cloud.domain.com)"}
@fbartels Do you see any OOM errors in
dmesg
? You can also try giving the backup process lot more memory in Backups -> Configure -> Advanced.BTW, the reason you are setting backups created/deleted is that Cloudron is "cleaning" up the intermediate backups. Basically, let's say the cloudron has 10 apps. It uploads 5 apps but the 6th app fails. It will then later clean up the backups of the 5 apps because it considers those as intermediate artifacts.
-
@fbartels Do you see any OOM errors in
dmesg
? You can also try giving the backup process lot more memory in Backups -> Configure -> Advanced.BTW, the reason you are setting backups created/deleted is that Cloudron is "cleaning" up the intermediate backups. Basically, let's say the cloudron has 10 apps. It uploads 5 apps but the 6th app fails. It will then later clean up the backups of the 5 apps because it considers those as intermediate artifacts.
yes, seem to be an out of memory issue:
[2068863.944193] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068863.944205] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068863.944208] node 0: slabs: 179, objs: 2864, free: 0 [2068866.935154] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068866.935159] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068866.935161] node 0: slabs: 190, objs: 3040, free: 0 [2068867.686548] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068867.686556] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068867.686559] node 0: slabs: 192, objs: 3072, free: 0 [2068874.914186] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068874.914194] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068874.914198] node 0: slabs: 193, objs: 3088, free: 0
On my first try I set the limit to 1GB (from 400MB) but still had the same result. Now rerunning with a limit of 2GB.
The host itself reports the following memory usage:
root@my:~# LC_ALL=C free -m total used free shared buff/cache available Mem: 7985 3230 186 118 4568 3583 Swap: 4095 1906 2189
-
yes, seem to be an out of memory issue:
[2068863.944193] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068863.944205] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068863.944208] node 0: slabs: 179, objs: 2864, free: 0 [2068866.935154] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068866.935159] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068866.935161] node 0: slabs: 190, objs: 3040, free: 0 [2068867.686548] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068867.686556] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068867.686559] node 0: slabs: 192, objs: 3072, free: 0 [2068874.914186] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020) [2068874.914194] cache: kmalloc-256(1374:f791441d40ab7c205137475a27c83da974fcc4a9eb255e11f88a998d03abf118), object size: 256, buffer size: 256, default order: 0, min order: 0 [2068874.914198] node 0: slabs: 193, objs: 3088, free: 0
On my first try I set the limit to 1GB (from 400MB) but still had the same result. Now rerunning with a limit of 2GB.
The host itself reports the following memory usage:
root@my:~# LC_ALL=C free -m total used free shared buff/cache available Mem: 7985 3230 186 118 4568 3583 Swap: 4095 1906 2189
-
Is there anything else I can do? I have stopped most apps on the server to free up ram, but the result is still the same.
@fbartels Not entirely sure what's going on , but can you try
echo 3 > /proc/sys/vm/drop_caches
. This supposedly frees up that buf/cache column per this article. Also, as a first step, can you just disable the backup of that single appcloud.domain.com
and see if that works? How big iscloud.domain.com
? (you can dodu -hcs /home/yellowtent/appsdata/<appid>/
to figure out). -
@fbartels Not entirely sure what's going on , but can you try
echo 3 > /proc/sys/vm/drop_caches
. This supposedly frees up that buf/cache column per this article. Also, as a first step, can you just disable the backup of that single appcloud.domain.com
and see if that works? How big iscloud.domain.com
? (you can dodu -hcs /home/yellowtent/appsdata/<appid>/
to figure out).Sorry for not getting back earlier @girish. Was caught up in something else.
I misread your comment and stopped the app in question and that brought a successful backup. Seems I did not stop it when stopping most of the other apps.
The "cloud" app is currently at 8.6GB and slightly bigger than the next biggest app (which still succeeds):
root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a 8.6G /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a 8.6G total root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe 8.4G /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe 8.4G total root@my:~# LC_ALL=c du -hcs /home/yellowtent/boxdata/mail/ 7.9G /home/yellowtent/boxdata/mail/ 7.9G total root@my:~#
I will try another backup will really all app stopped.
Edit: Ah, no that won't work as stopped apps do not seem to be part of the process when triggering a complete backup.
PS: on a related note. I am using encrypted backups (as tgz) and wanted to turn off encryption for a test, but even with the encryption password removed the new backup files still have the
.enc
file ending. I did not check though if the file was still encrypted however. -
Sorry for not getting back earlier @girish. Was caught up in something else.
I misread your comment and stopped the app in question and that brought a successful backup. Seems I did not stop it when stopping most of the other apps.
The "cloud" app is currently at 8.6GB and slightly bigger than the next biggest app (which still succeeds):
root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a 8.6G /home/yellowtent/appsdata/dd5f0f98-2b81-495d-8828-9c967128304a 8.6G total root@my:~# LC_ALL=c du -hcs /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe 8.4G /home/yellowtent/appsdata/96832cf7-cec4-4b59-94c8-9fe500da24fe 8.4G total root@my:~# LC_ALL=c du -hcs /home/yellowtent/boxdata/mail/ 7.9G /home/yellowtent/boxdata/mail/ 7.9G total root@my:~#
I will try another backup will really all app stopped.
Edit: Ah, no that won't work as stopped apps do not seem to be part of the process when triggering a complete backup.
PS: on a related note. I am using encrypted backups (as tgz) and wanted to turn off encryption for a test, but even with the encryption password removed the new backup files still have the
.enc
file ending. I did not check though if the file was still encrypted however.If I trigger the app backup of the "cloud" app alone it succeeds, but as part of the box backup it seems to be that app that makes the full backup fail most of the time. I write most of the time since when getting back to it this morning the whole box backup already failed at a much smaller app.
In my quest to move my whole Cloudron to another server I have spent yesterday working on a script to directly trigger the cloudron api to disable automatic backups of all apps, trigger each individual apps backup and then immediately stop the app and then finally doing a box backup. But then I came to the realisation that apps not included in the automatic backup are also not part of the box backup (plus I cannot trigger app backups from stopped apps).
So my best of course of action seems to be to manually backup just the "cloud" app and have everything else covered through the whole box backup. And then restore the box backup on a new system and then manually import the last backup of my "cloud" app on the new host.
edit: I was just waiting for a box backup to complete again (with the cloud app enabled) and while it finished backing up all apps it did then finally crash when backing up "box".
-
If I trigger the app backup of the "cloud" app alone it succeeds, but as part of the box backup it seems to be that app that makes the full backup fail most of the time. I write most of the time since when getting back to it this morning the whole box backup already failed at a much smaller app.
In my quest to move my whole Cloudron to another server I have spent yesterday working on a script to directly trigger the cloudron api to disable automatic backups of all apps, trigger each individual apps backup and then immediately stop the app and then finally doing a box backup. But then I came to the realisation that apps not included in the automatic backup are also not part of the box backup (plus I cannot trigger app backups from stopped apps).
So my best of course of action seems to be to manually backup just the "cloud" app and have everything else covered through the whole box backup. And then restore the box backup on a new system and then manually import the last backup of my "cloud" app on the new host.
edit: I was just waiting for a box backup to complete again (with the cloud app enabled) and while it finished backing up all apps it did then finally crash when backing up "box".
@fbartels you still see the same OOM errors in dmesg, correct? and any errors on the minio side as you previously reported?
BTW, about the
.enc
file ending, it's probably the old backup files. Currently, old backups of previous config are not removed. You can safely remove them.Do you think you can give us SSH access, so I can debug this a bit? If so, please drop me a mail on support@cloudron.io .
-
@fbartels you still see the same OOM errors in dmesg, correct? and any errors on the minio side as you previously reported?
BTW, about the
.enc
file ending, it's probably the old backup files. Currently, old backups of previous config are not removed. You can safely remove them.Do you think you can give us SSH access, so I can debug this a bit? If so, please drop me a mail on support@cloudron.io .
@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:
OOM errors in dmesg, correct?
yes and no. At times the backup stopped without the oom error.
@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:
give us SSH access
gladly. At the moment I have the automatic backup for my two biggest apps deactivated, which makes the box backup succeed. When triggering backups of these individually they succeed however.
-
@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:
OOM errors in dmesg, correct?
yes and no. At times the backup stopped without the oom error.
@girish said in Cloudron UI shows "Cloudron crashed/stopped" but logs show no error:
give us SSH access
gladly. At the moment I have the automatic backup for my two biggest apps deactivated, which makes the box backup succeed. When triggering backups of these individually they succeed however.
@fbartels Thanks for the access! It turns out the issue has nothing with backups. It seems that whenever the "check disk space" cron job runs, the code crashes and this in turn brings down the backup process as well. This is related to the
LC_ALL=C
thread that we added to box.service because the system is on a difference locale. I think the update to 6.2.7 removed it (since the change is only in 6.3 branch). I added it back and now the backup succeeds.(And of course, the backup takes more time when those big apps are also backed up and then it's just enough time for the cron task to run and crash).
-
@fbartels Thanks for the access! It turns out the issue has nothing with backups. It seems that whenever the "check disk space" cron job runs, the code crashes and this in turn brings down the backup process as well. This is related to the
LC_ALL=C
thread that we added to box.service because the system is on a difference locale. I think the update to 6.2.7 removed it (since the change is only in 6.3 branch). I added it back and now the backup succeeds.(And of course, the backup takes more time when those big apps are also backed up and then it's just enough time for the cron task to run and crash).