Struggling to Replace MinIO - Advice Welcome!
-
We have been using MinIO for our company backups for some time. Each nightly backup with MinIO takes about 2-3 hours. When Cloudron updated from v8 to v9, something broke with the MinIO regions, and we need to find an alternative anyway as MinIO has gone into maintenance mode.
Requirements: We have about 150GB of data, increasing by a few GB a week. It is made up of a large number of small files; new ones are added while many old ones stay the same. We want frequent (7 daily, 4 weekly, 12 monthly, etc.) de-duplicated backups. So the first backup would be the full 150GB, and subsequent backups would be snapshots that store only the changed or new files (a few GB) plus "hard-links"(?) to the unchanged ones. We can tolerate an initial backup taking longer, but subsequent daily backups should take 5-6 hours at most.

We had our Cloudron server (in "the cloud") backing up to an on-premises device with a ZFS file system and plenty of storage space. We are open to either keeping this set-up or having storage somewhere else in "the cloud" for our backups.
Can anyone advise on a setup that would be a suitable replacement for MinIO?
Does anyone know which backup options in Cloudron are intended to provide de-duplicated, incremental backup snapshots?
Below is a breakdown of what we have tried so far and what the problems were:
Garage on the same on-premises device as MinIO. It was difficult to set up, but we got there in the end. However, each backup took progressively longer: 5 hours, then 7, then 16. I think the de-duplication was making things slower the more snapshots we had. (We also did not like that the files were stored in a non-human-readable way. With MinIO, we can browse the backup files as a normal file system, but Garage just stores chunks of bytes, so you can only access the data through Garage.)
NFS (and SSHFS) with rsync (as I believe the tarball option would just make a full copy of the data each time). These were just too slow. We first tried this when we had more data, and the backups would run for 24 hours and then get killed by Cloudron for taking too long. After reducing our data, we managed a complete backup, but it took around 13 hours each day, which isn't really workable.
Backblaze B2 (rsync, in "the cloud"): The first backup seemed to work fine, but subsequent backups did not appear to be de-duplicated. We had four 150GB Backblaze backups, but the bucket was showing as 860GB in size, which is even more than if all four backups were full copies of the data. Is Backblaze meant to de-duplicate? We ticked the encrypted option in Cloudron - would it de-duplicate if it was not encrypted? Does encrypting it make it bigger?
S3 API Compatible (v4) with PeaSoup in "the cloud": Too slow, and no de-duplication.
Has anyone with similar requirements found a true replacement for MinIO yet?
-
I have not tried this personally, but in a similar situation, I would maybe explore https://github.com/rustfs/rustfs
-
Not a replacement for MinIO, but FYI: I back up 240GB to a Hetzner Storage Box using SSHFS and tar.gz and it takes 4-5 hours. I imagine rsync would be much quicker after the first run. (I'll soon experiment with creating a 2nd backup site to a Scaleway bucket and will report back...)

-
@jdaviescoates 311.37 GB | 151456 file(s) | 13 app(s) to a Hetzner Storage Box using SSHFS and rsync in 27 minutes, 26 seconds. But that wasn't the initial backup.
-
Thanks for the responses. We are particularly interested in de-duplication; does anyone know if Cloudron backing up to a Hetzner Storage Box will do de-duplicated backups? I was surprised when Backblaze didn't, but maybe I configured something wrong?
-
Depending on your appetite for loss, I would consider backups-in-depth. That is, one backup site is not a backup.
- Use `rsync`-based backup over SSHFS to Hetzner or similar. You will want to select "use hardlinks" and, if you want it, encryption. The use of hardlinks is, essentially, your de-duplication. (See below.)
- For a second layer of depth, I would consider a (daily? weekly? monthly?) backup of your primary backup site to a secondary. This could be a sync to AWS S3, for example. Note that any S3-based backup (B2, Cloudflare R2, etc.) will have both a storage cost and an API cost. If you are dealing with millions of small files in your backups, the API costs will become real, because dedupe requires checking each object and then possibly transferring it (multiple PUT/GET requests per file).
- S3 has the ability to automatically keep multiple versions of a file. You could use this to have an in-place rotation/update of files.
- If you are doing an S3 backup, you can use lifecycle rules to automatically move your S3 content to Glacier. This is much cheaper than "hot" S3 storage, but you pay a penalty if you download or delete too early/too often.
- As a third, cheap-ish option, go get a 2- or 4-bay NAS that can run TrueNAS, and put a pair of 8-12TB HDDs in it. Configure the disks in a ZFS mirrored pair. Run a cron job once per day/week to pull down the contents of the Hetzner box. (Your cron job will want to, again, use `rsync` with hardlinks.) You now have a local machine mirroring your hot backups. It is arguably more expensive than some other options (~600 USD up front), but you don't have any "we might run out of space" issues. And, because you're using it to pull, you don't have any weird networking problems: just SCP the data down. (Or `rsync` it down over SSH.)
Whatever you are doing, consider targeting two different destinations at two different times (per day/alternating/etc.). Or, consider having some combination of backups that give you multiple copies at multiple sites. That could be Hetzner in two regions, with backups run on alternating days, or it could be you backup to a storage box and pull down a clone every day to a local NAS, or ... or ...
Ultimately, your 150GB is small. If you're increasing by a few GB per week, that's on the order of 150-250GB of growth per year, so you'll pass 1TB within a few years. Not knowing your company's finances, this is generally considered a small amount of data. Trying to optimize for cost immediately is possibly less important than just getting the backups somewhere.
Other strategies could involve backing up to the NAS locally first, and then using a cron job to `borg` or `rsync` to a remote host (possibly more annoying to set up), etc. But, you might have more "dedupe" options then. (`borg` has dedupe built in, I think, but...)

I have a suspicion that your desire to use object storage might be a red herring. But, again, I don't know your constraints/budget/needs/concerns.
Deduplication: If you use `rsync` with hardlinks, then each daily backup will automatically dedupe unchanged files. A hardlink is an extra directory entry pointing at the same file data. So, if you upload `super_ai_outputs_day_1.md` to your storage on Monday, and it remains unchanged for the rest of time, then each subsequent day is going to be a hardlink to that file. It will, for all intents and purposes, take up zero additional disk space. So, if you are backing up large numbers of small-to-medium sized files that do not change, SSHFS/rsync with hardlinks is going to naturally dedupe your unchanging old data.

This will not do binary deduplication across different files. So, if you're looking for a backup solution that would (say) notice that two 1GB files have an identical 500MB in the middle, and somehow dedupe that... you need more sophisticated tools and strategies. Rsync/hardlinks just makes sure that the same file, backed up every day, does not take (# days * size) of space. It takes the original size of the file once, plus a directory entry per snapshot (all the links share one inode).
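The "effectively zero disk space" behaviour is easy to check locally. A small sketch using only coreutils, with throwaway files in a temp directory:

```shell
#!/bin/sh
# Demonstrate that a hardlink adds a directory entry, not a second copy.
set -eu

DIR=$(mktemp -d)

# ~1MB file standing in for an unchanged backup file.
dd if=/dev/urandom of="$DIR/monday.bin" bs=1024 count=1024 2>/dev/null

# "Tuesday's backup" of the same unchanged file: a hardlink.
ln "$DIR/monday.bin" "$DIR/tuesday.bin"

du -sk "$DIR"     # ~1MB total, not ~2MB: the data is stored once
ls -li "$DIR"     # both names show the same inode, link count 2
```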
Note, though, that if you copy a snapshot of your hardlinked backups to an object store, every file may be stored at full size for every day. I'm possibly wrong on that, but I'm not confident that most tools would know what to do with those hardlinks when copying to an object store. I think you'd end up multiplying your disk usage significantly, because your backup tool will have to create a copy of each file in the object store. (Most object stores have no notion of symlinks/hardlinks.) An experiment with a subset of the data, or even a few files, will tell you the answer to that question.
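You can simulate that experiment locally: a plain `cp -R` (like most object-store uploaders) does not understand hardlinks and materialises a full copy of each name. A throwaway sketch:

```shell
#!/bin/sh
# Show that a hardlink-unaware copy duplicates deduped snapshots.
set -eu

DIR=$(mktemp -d)
mkdir "$DIR/snapshots"

# ~1MB file plus a hardlinked "second snapshot": ~1MB on disk total.
dd if=/dev/urandom of="$DIR/snapshots/day1.bin" bs=1024 count=1024 2>/dev/null
ln "$DIR/snapshots/day1.bin" "$DIR/snapshots/day2.bin"

# A hardlink-unaware copy, standing in for an object-store upload.
# (cp would need --preserve=links or -a to keep the hardlink.)
cp -R "$DIR/snapshots" "$DIR/uploaded"

du -sk "$DIR/snapshots" "$DIR/uploaded"   # source ~1MB, copy ~2MB
```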
If you have other questions, you can ask here, or DM me.