Tar Backups timing out on too large part number
-
@adrw It looks like my suggestion was wrong. We always only do 1GB copies, I read the code wrong. Re-reading the logs it seems that after around 90 parts, the copies start failing (so ~90GB). It seems the server returned a 500. We might have to contact b2 and ask them if they have size restrictions on file copies.
-
@girish Thanks, it did fail though it seem to succeed up to 146 parts and then the logs start so I'm assuming a timeout of the backup job.
Would it be possible to extend the timeout on the Cloudron backup task?
... 2020-08-21T01:08:00.405Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploading backup 261505M@8MBps (cloud.alxdr.ca)"} 2020-08-21T01:08:10.403Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploading backup 261566M@8MBps (cloud.alxdr.ca)"} 2020-08-21T01:08:20.924Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploading backup 261631M@8MBps (cloud.alxdr.ca)"} 2020-08-21T01:08:30.844Z box:shell backup-snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a (stdout): 2020-08-21T01:08:30.844Z box:storage/s3 Uploaded snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc: {"Location":"/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc","Bucket":"my-adrw-xyz","Key":"snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc","ETag":"26a95d5fe8b46062247614c72801c523-2617"} 2020-08-21T01:08:30.844Z box:shell backup-snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a (stdout): 2020-08-21T01:08:30.844Z box:backupupload upload completed. error: null 2020-08-21T01:08:30.844Z box:backups runBackupUpload: result - {"result":""} 2020-08-21T01:08:31.020Z box:backups cloud.alxdr.ca uploadAppSnapshot: snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a done. 32181.605 seconds 2020-08-21T01:08:31.043Z box:backups Rotating app backup of c4dd3c6d-806e-454b-beba-4cdd3f29865a to id 2020-08-20-160520-230/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a_2020-08-21-010831-020_v4.6.2 2020-08-21T01:08:32.129Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying 0-1. 0 errors so far. concurrency set to 10 (cloud.alxdr.ca)"} 2020-08-21T01:08:32.130Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying (multipart) snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc (cloud.alxdr.ca)"} 2020-08-21T01:08:32.505Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 1 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=0-1073741824 (cloud.alxdr.ca)"} 2020-08-21T01:09:31.700Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 1 - Etag: \"d2ad221b465e2361c366950c1eefb51c\" (cloud.alxdr.ca)"} 2020-08-21T01:09:31.700Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 2 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=1073741825-2147483649 (cloud.alxdr.ca)"} 2020-08-21T01:10:23.055Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 2 - Etag: \"34af51edde0b3805ac418e080a860660\" (cloud.alxdr.ca)"} 2020-08-21T01:10:23.055Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 3 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=2147483650-3221225474 (cloud.alxdr.ca)"} 2020-08-21T01:13:38.471Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 3 - Etag: \"9c2fa7630ffb42e4879fa3454c52ba63\" (cloud.alxdr.ca)"} 2020-08-21T01:13:38.471Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 4 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=3221225475-4294967299 (cloud.alxdr.ca)"} ... 2020-08-21T04:02:05.786Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 143 - Etag: \"c2379cf098ff6e944931add463449190\" (cloud.alxdr.ca)"} 2020-08-21T04:02:05.786Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 144 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=153545080975-154618822799 (cloud.alxdr.ca)"} 2020-08-21T04:03:08.359Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 144 - Etag: \"d163000653b34764f469adf7ef1038cf\" (cloud.alxdr.ca)"} 2020-08-21T04:03:08.359Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 145 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=154618822800-155692564624 (cloud.alxdr.ca)"} 2020-08-21T04:04:20.333Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 145 - Etag: \"4b38a4e475683fefcb35e3d20af5d93d\" (cloud.alxdr.ca)"} 2020-08-21T04:04:20.333Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 146 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=155692564625-156766306449 (cloud.alxdr.ca)"}
-
@adrw The timeout comes from https://git.cloudron.io/cloudron/box/-/blob/master/src/storage/s3.js#L71 . It's set to 600 seconds. Maybe we can try changing it to 1000 ?
-
@girish I adjusted the timeout for the L71 you mentioned to 10000 and it the task is still getting killed for taking too long. Is there any broader Cloudron task wide timeout or backup configuration for the entire task vs the L71 which seems to be more focused on the individual network call timeout?
Task 1735 timed out Aug 28 10:57:29 box:tasks 1735: {"percent":71.58823529411765,"message":"Copying part 157 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=167503724700-168577466524 (cloud.alxdr.ca)"} Aug 28 10:58:20 box:tasks 1735: {"percent":71.58823529411765,"message":"Uploaded part 157 - Etag: \"f6e793337c89eb34cb8c62448a4fe6f7\" (cloud.alxdr.ca)"} Aug 28 10:58:20 box:tasks 1735: {"percent":71.58823529411765,"message":"Copying part 158 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=168577466525-169651208349 (cloud.alxdr.ca)"} Aug 28 10:59:18 box:tasks 1735: {"percent":71.58823529411765,"message":"Uploaded part 158 - Etag: \"81a168929546a77af7262d1d1304854d\" (cloud.alxdr.ca)"} Aug 28 10:59:18 box:tasks 1735: {"percent":71.58823529411765,"message":"Copying part 159 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=169651208350-170724950174 (cloud.alxdr.ca)"}
-
@adrw said in Tar Backups timing out on too large part number:
Task 1735 timed out
@adrw Something else is going on. The above error message suggests that the backup task took more than 12 hours. Is that the case? Was backup running for more than 12 hours?
-
@adrw In that case, the real issue is https://git.cloudron.io/cloudron/box/-/blob/master/src/backups.js#L1235 . There is a timeout of 12 hours for backup tasks. Can you bump it to like 36 or something?
For copy concurrency, there is already a slider for that. Don't you see it under advanced? Note that where it is failing now is that it has to copy the parts of the same file (basically, the single file is so big that we have to split it up into many parts). I have to check if parallel multi-part copy is allowed. If it is, it's easy to do.
-
@adrw It seems parallel multi-part copy will work per https://dzone.com/articles/amazon-s3-parallel-multipart . Looks like a good change to make.
-
-
@adrw I made the timeout now 24 hours. The timeout is really just there to kill "stuck" backups. This can usually only happen when there is a bug in our code.