@nebulon said in Backup fails due to long runtime - backup hangs:
@chymian I am not sure about the specific issue and why the upload stalls at some point. This is strange indeed. Is there any throttling happening for incoming connections over time with your latest storage product?
no, it's working very well. I ran some heavy load tests and it stayed very performant.
For the incremental backup, this is only supported using the "rsync" strategy. Depending on the backend, it either uses hardlinks or server side copying to avoid duplicate uploads.
I've had some not-so-good experiences with rsync to S3; it's more suited to a real FS. or do you have other experiences? what works best with rsync & hardlinks?
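(for context, the classic hardlink-based incremental pattern looks roughly like the sketch below – paths and snapshot names are just placeholders, not necessarily how Cloudron lays it out)

```sh
# minimal sketch of hardlink-based incrementals with rsync;
# paths and dates are placeholders, not Cloudron's actual layout
PREV=/backups/2021-09-01      # last completed snapshot
DEST=/backups/2021-09-02      # new snapshot directory
rsync -a --delete --link-dest="$PREV" /var/data/ "$DEST"/
# files unchanged since $PREV become hardlinks; only changed files take new space
```

which is also why it doesn't map well onto S3-style object storage: there are no hardlinks there, so the backend has to fall back to server-side copies.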
Generally our aim is to rather upload more than optimize for storage space as such.
to be on the safe side, I see. but that also puts more load on the server, which is not really necessary.
We have had various discussions already about using systems like borg backup and such, but so far have always decided that we will not use them, since we are no experts on those systems and we are talking about backups, where it is sometimes required to have deep knowledge about how exactly backups are stored on the backend in order to restore from broken setups (which is of course one of the main use-cases for backups). The problem is, if anything gets corrupted in the state of a more complex backup system, it is very hard or impossible to recover.
I see your point, but at some point a system architect/admin always has to go into trust mode and test some new software to develop further; besides that, restic, for example, is battle-proven.
and with the overall check & repair functions, this could enhance the whole backup security/reliability.
Already the encrypted tarball backup has a similar drawback, where say a few blocks of that tarball are corrupted, it is impossible to recover the rest,
that could be seen as a call to find another solution.
as mentioned, I have no experience with borg, but restic and bareos have a validation function – which tarballs don't have – and that gives an extra layer of reliability.
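for illustration, verification in restic is a single built-in command (the repository path here is just a placeholder):

```sh
# check the repository structure; --read-data additionally downloads and
# verifies every data blob against its checksum
# (restic will prompt for the repository password if it isn't set via env)
restic -r /srv/backup/repo check --read-data
```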
so from our perspective the simpler to understand and recover, the better, with the drawback of maybe using more space or slower backups overall. It is a tradeoff.
philosophers could lock horns about that – for sure.
there will be a time when the old system just cannot keep up with the development and a change is needed.
one point is the sheer amount of data which has to be backed up.
Regarding btrfs and zfs, we actually started out with btrfs long ago.…
However in the end we had to step away from it, since in our experience while it works 99% of the time, if it does not, you might end up with a corrupted filesystem AND corrupted backup snapshots.
I had the same experiences with the early versions of BTRFS, but that was long ago.
meanwhile even Proxmox, which is definitely more on the conservative side of system setups, is using it in PVE 7.x.
with ext4, we will never know about bitrot!
Which is the worst case for a backup system of course. Problem is also that those corruptions might be unnoticed for a long time.
a nightly/weekly scrub can easily find them and, where possible, repair them.
but compared to zfs, a sysadmin has to set up these cron jobs manually, which most people don't do or even know about – me included; that would have saved me some trouble back then…
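something like this in cron would cover it (mount point and schedule are only examples):

```
# /etc/cron.d/btrfs-scrub – monthly scrub of the root filesystem
# -B: stay in the foreground so cron can capture/mail errors, -d: per-device stats
0 3 1 * * root /usr/bin/btrfs scrub start -Bd /
```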
Further with regards to filesystems, we have seen VPS provider specific characteristics, which is why we essentially only rely on ext4 as the most stable one.
don't know what you are referring to?
but maybe xfs, which also gained snapshot features lately and has a much better reputation than BTRFS, would be a choice.
is Cloudron still strictly tied to ext4, or can it be installed on xfs or btrfs at one's own responsibility?
Having said all this, I guess we have to debug your system in order to see why it doesn't work for you.
I figured out at least one point so far:
the server is hosted with ssdnodes.com and they tend to oversubscribe their systems. after monitoring the HD throughput for a few days, I realized there are times when it goes down.
support fixed the issue for me yesterday, and at least last night the backup ran without problems for the first time in a week.
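for reference, something as simple as iostat from the sysstat package is enough to spot that kind of pattern (the interval is arbitrary):

```sh
# extended per-device stats every 60s; high await/%util combined with low MB/s
# read/write rates points at the host's storage being the bottleneck
iostat -dxm 60
```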
that leaves the question why – let's say on a not-so-performant system – the backup stalls completely for hours after all data has been transferred, until it runs into the timeout?
the log is here