Tar Backups timing out on too large part number
-
@girish The task failed again but for a different error. Seems to have lost connection when doing some backup cleaning/copying but succeeded in the huge upload?
2020-08-20T03:28:39.027Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261266M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:28:49.028Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261294M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:28:59.029Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261317M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:29:09.030Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261345M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:29:19.031Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261390M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:29:29.034Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261457M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:29:39.032Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261522M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:29:49.032Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261601M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:29:56.402Z box:tasks 1670: {"percent":69.75,"message":"Uploading backup 261637M@8MBps (cloud.alxdr.ca)"} 2020-08-20T03:30:06.215Z box:shell backup-snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a (stdout): 2020-08-20T03:30:06.214Z box:storage/s3 Uploaded snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc: {"Location":"/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc","Bucket":"my-adrw-xyz","Key":"snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc","ETag":"fbc3fed6990815ff54f0592796eca14c-2617"} 2020-08-20T03:30:06.215Z box:shell backup-snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a (stdout): 2020-08-20T03:30:06.215Z box:backupupload upload completed. error: null 2020-08-20T03:30:06.215Z box:backups runBackupUpload: result - {"result":""} 2020-08-20T03:30:06.242Z box:backups cloud.alxdr.ca uploadAppSnapshot: snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a done. 32823.048 seconds 2020-08-20T03:30:06.380Z box:backups Rotating app backup of c4dd3c6d-806e-454b-beba-4cdd3f29865a to id 2020-08-19-181531-419/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a_2020-08-20-033006-242_v4.6.2 2020-08-20T03:30:07.425Z box:tasks 1670: {"percent":69.75,"message":"Copying 0-1. 0 errors so far. concurrency set to 10 (cloud.alxdr.ca)"} 2020-08-20T03:30:07.425Z box:tasks 1670: {"percent":69.75,"message":"Copying (multipart) snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc (cloud.alxdr.ca)"} 2020-08-20T03:30:07.760Z box:tasks 1670: {"percent":69.75,"message":"Copying part 1 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=0-1073741824 (cloud.alxdr.ca)"} 2020-08-20T03:31:20.207Z box:tasks 1670: {"percent":69.75,"message":"Uploaded part 1 - Etag: \"1f5ae6cb5ec75ed806045d2f442cf177\" (cloud.alxdr.ca)"} 2020-08-20T03:31:20.207Z box:tasks 1670: {"percent":69.75,"message":"Copying part 2 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=1073741825-2147483649 (cloud.alxdr.ca)"} 2020-08-20T03:32:31.745Z box:tasks 1670: {"percent":69.75,"message":"Uploaded part 2 - Etag: \"74e13bdc924657c1c0c9eb034a3bf9a8\" (cloud.alxdr.ca)"} 2020-08-20T03:32:31.745Z box:tasks 1670: {"percent":69.75,"message":"Copying part 3 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=2147483650-3221225474 (cloud.alxdr.ca)"} 2020-08-20T03:33:30.986Z box:tasks 1670: ... {"percent":69.75,"message":"Copying part 90 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=95563022425-96636764249 (cloud.alxdr.ca)"} 2020-08-20T05:46:18.866Z box:tasks 1670: {"percent":69.75,"message":"Uploaded part 90 - Etag: \"f3d111e24e97840df094997cbd0c64e0\" (cloud.alxdr.ca)"} 2020-08-20T05:46:18.867Z box:tasks 1670: {"percent":69.75,"message":"Copying part 91 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=96636764250-97710506074 (cloud.alxdr.ca)"} 2020-08-20T05:47:20.624Z box:tasks 1670: {"percent":69.75,"message":"Uploaded part 91 - Etag: \"dacc244b6a06c615164bdc672fa99ee7\" (cloud.alxdr.ca)"} 2020-08-20T05:47:20.625Z box:tasks 1670: {"percent":69.75,"message":"Copying part 92 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=97710506075-98784247899 (cloud.alxdr.ca)"} 2020-08-20T05:48:29.146Z box:tasks 1670: {"percent":69.75,"message":"Uploaded part 92 - Etag: \"3f2e9e925b07a5a986c3ea30af6848fd\" (cloud.alxdr.ca)"} 2020-08-20T05:48:29.146Z box:tasks 1670: {"percent":69.75,"message":"Copying part 93 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=98784247900-99857989724 (cloud.alxdr.ca)"} 2020-08-20T05:53:29.466Z box:tasks 1670: {"percent":69.75,"message":"Retrying (1) multipart copy of snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc. Error: TimeoutError: Connection timed out after 300000ms undefined (cloud.alxdr.ca)"} 2020-08-20T05:58:49.907Z box:tasks 1670: {"percent":69.75,"message":"Retrying (2) multipart copy of snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc. Error: TimeoutError: Connection timed out after 300000ms undefined (cloud.alxdr.ca)"} 2020-08-20T06:04:10.395Z box:tasks 1670: {"percent":69.75,"message":"Retrying (3) multipart copy of snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc. Error: TimeoutError: Connection timed out after 300000ms undefined (cloud.alxdr.ca)"} 2020-08-20T06:06:46.823Z box:tasks 1670: {"percent":69.75,"message":"Retrying (4) multipart copy of snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc. Error: InternalError: connection pool timed out for https://pod-000-1071-01.backblaze.com/b2api/v2/b2_upload_part/4_z6736ccf3aafae32a7f1d0310_f200cfd9cb756e501_d20200820_m033007_c000_v0001071_t0039/0000 500 (cloud.alxdr.ca)"} 2020-08-20T06:12:07.075Z box:tasks 1670: {"percent":69.75,"message":"Retrying (5) multipart copy of snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc. Error: TimeoutError: Connection timed out after 300000ms 500 (cloud.alxdr.ca)"}
-
@adrw It looks like the upload succeeded! What has failed is the copying part number (which also has a part size). This is hardcoded to 5GB. It seems B2 takes more than 300 seconds to copy a 5GB file. Can you please adjust this line - https://git.cloudron.io/cloudron/box/-/blob/master/src/storage/s3.js#L258
const largeFileLimit = apiConfig.provider === 'exoscale-sos' ? 1024 * 1024 * 1024 : 5 * 1024 * 1024 * 1024;
to
const largeFileLimit = apiConfig.provider === 'exoscale-sos' ? 1024 * 1024 * 1024 : 1 * 1024 * 1024 * 1024;
So we are reducing part size to 1GB above. I will make copy part size configurable as well then.
-
@adrw It looks like my suggestion was wrong. We always only do 1GB copies, I read the code wrong. Re-reading the logs it seems that after around 90 parts, the copies start failing (so ~90GB). It seems the server returned a 500. We might have to contact b2 and ask them if they have size restrictions on file copies.
-
@girish Thanks, it did fail though it seem to succeed up to 146 parts and then the logs start so I'm assuming a timeout of the backup job.
Would it be possible to extend the timeout on the Cloudron backup task?
... 2020-08-21T01:08:00.405Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploading backup 261505M@8MBps (cloud.alxdr.ca)"} 2020-08-21T01:08:10.403Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploading backup 261566M@8MBps (cloud.alxdr.ca)"} 2020-08-21T01:08:20.924Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploading backup 261631M@8MBps (cloud.alxdr.ca)"} 2020-08-21T01:08:30.844Z box:shell backup-snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a (stdout): 2020-08-21T01:08:30.844Z box:storage/s3 Uploaded snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc: {"Location":"/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc","Bucket":"my-adrw-xyz","Key":"snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc","ETag":"26a95d5fe8b46062247614c72801c523-2617"} 2020-08-21T01:08:30.844Z box:shell backup-snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a (stdout): 2020-08-21T01:08:30.844Z box:backupupload upload completed. error: null 2020-08-21T01:08:30.844Z box:backups runBackupUpload: result - {"result":""} 2020-08-21T01:08:31.020Z box:backups cloud.alxdr.ca uploadAppSnapshot: snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a done. 32181.605 seconds 2020-08-21T01:08:31.043Z box:backups Rotating app backup of c4dd3c6d-806e-454b-beba-4cdd3f29865a to id 2020-08-20-160520-230/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a_2020-08-21-010831-020_v4.6.2 2020-08-21T01:08:32.129Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying 0-1. 0 errors so far. concurrency set to 10 (cloud.alxdr.ca)"} 2020-08-21T01:08:32.130Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying (multipart) snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc (cloud.alxdr.ca)"} 2020-08-21T01:08:32.505Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 1 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=0-1073741824 (cloud.alxdr.ca)"} 2020-08-21T01:09:31.700Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 1 - Etag: \"d2ad221b465e2361c366950c1eefb51c\" (cloud.alxdr.ca)"} 2020-08-21T01:09:31.700Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 2 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=1073741825-2147483649 (cloud.alxdr.ca)"} 2020-08-21T01:10:23.055Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 2 - Etag: \"34af51edde0b3805ac418e080a860660\" (cloud.alxdr.ca)"} 2020-08-21T01:10:23.055Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 3 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=2147483650-3221225474 (cloud.alxdr.ca)"} 2020-08-21T01:13:38.471Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 3 - Etag: \"9c2fa7630ffb42e4879fa3454c52ba63\" (cloud.alxdr.ca)"} 2020-08-21T01:13:38.471Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 4 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=3221225475-4294967299 (cloud.alxdr.ca)"} ... 2020-08-21T04:02:05.786Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 143 - Etag: \"c2379cf098ff6e944931add463449190\" (cloud.alxdr.ca)"} 2020-08-21T04:02:05.786Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 144 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=153545080975-154618822799 (cloud.alxdr.ca)"} 2020-08-21T04:03:08.359Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 144 - Etag: \"d163000653b34764f469adf7ef1038cf\" (cloud.alxdr.ca)"} 2020-08-21T04:03:08.359Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 145 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=154618822800-155692564624 (cloud.alxdr.ca)"} 2020-08-21T04:04:20.333Z box:tasks 1680: {"percent":71.58823529411765,"message":"Uploaded part 145 - Etag: \"4b38a4e475683fefcb35e3d20af5d93d\" (cloud.alxdr.ca)"} 2020-08-21T04:04:20.333Z box:tasks 1680: {"percent":71.58823529411765,"message":"Copying part 146 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=155692564625-156766306449 (cloud.alxdr.ca)"}
-
@adrw The timeout comes from https://git.cloudron.io/cloudron/box/-/blob/master/src/storage/s3.js#L71 . It's set to 600 seconds. Maybe we can try changing it to 1000 ?
-
@girish I adjusted the timeout for the L71 you mentioned to 10000 and it the task is still getting killed for taking too long. Is there any broader Cloudron task wide timeout or backup configuration for the entire task vs the L71 which seems to be more focused on the individual network call timeout?
Task 1735 timed out Aug 28 10:57:29 box:tasks 1735: {"percent":71.58823529411765,"message":"Copying part 157 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=167503724700-168577466524 (cloud.alxdr.ca)"} Aug 28 10:58:20 box:tasks 1735: {"percent":71.58823529411765,"message":"Uploaded part 157 - Etag: \"f6e793337c89eb34cb8c62448a4fe6f7\" (cloud.alxdr.ca)"} Aug 28 10:58:20 box:tasks 1735: {"percent":71.58823529411765,"message":"Copying part 158 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=168577466525-169651208349 (cloud.alxdr.ca)"} Aug 28 10:59:18 box:tasks 1735: {"percent":71.58823529411765,"message":"Uploaded part 158 - Etag: \"81a168929546a77af7262d1d1304854d\" (cloud.alxdr.ca)"} Aug 28 10:59:18 box:tasks 1735: {"percent":71.58823529411765,"message":"Copying part 159 - /my-adrw-xyz/snapshot/app_c4dd3c6d-806e-454b-beba-4cdd3f29865a.tar.gz.enc bytes=169651208350-170724950174 (cloud.alxdr.ca)"}
-
@adrw said in Tar Backups timing out on too large part number:
Task 1735 timed out
@adrw Something else is going on. The above error message suggests that the backup task took more than 12 hours. Is that the case? Was backup running for more than 12 hours?
-
@adrw In that case, the real issue is https://git.cloudron.io/cloudron/box/-/blob/master/src/backups.js#L1235 . There is a timeout of 12 hours for backup tasks. Can you bump it to like 36 or something?
For copy concurrency, there is already a slider for that. Don't you see it under advanced? Note that where it is failing now is that it has to copy the parts of the same file (basically, the single file is so big that we have to split it up into many parts). I have to check if parallel multi-part copy is allowed. If it is, it's easy to do.
-
@adrw It seems parallel multi-part copy will work per https://dzone.com/articles/amazon-s3-parallel-multipart . Looks like a good change to make.
-
-
@adrw I made the timeout now 24 hours. The timeout is really just there to kill "stuck" backups. This can usually only happen when there is a bug in our code.