Migration from one server to another with a floating IP and minimizing downtime
-
I'm looking at possibly migrating Cloudron from one server to a different server over the next days to weeks (definitely by early December if I can). I have a floating/failover IP address that is pointing at my primary one (Server A) currently and my idea is to restore from a Cloudron backup on the new server (Server B) and then run something like rsync to keep various items in sync with each other (so if a new mail message comes in for example or a new file added in WordPress or edited, it syncs back to the new server that's running as the secondary server) until I'm ready to pull the plug on the primary server for promoting the secondary server to primary role.
Basically... Server A (current primary) and Server B (to use Cloudron backup and sync with Server A), and then eventually when comfortable will move the failover IP address to point to Server B instead and Server B then becomes the new primary server and I decommission Server A. I want to keep anything that is modified on Server A in sync with Server B so that Server B is always a duplicate/clone of Server A until I'm ready to pull the plug on Server A.
The purpose of this is to minimize downtime to near-zero if I can help it. Until now I have generally just run a backup on Server A, disabled the mail server when the backup reached the mail part to prevent lost emails, restored the backup on Server B, and then flipped the switch quickly by updating DNS to use the new IP. However, this time I have a floating IP address I want to use to minimize the downtime, as it takes an awfully long time to back up and restore everything given how much data there is and that I use object storage for backups. I have more clients now, some of them a bit more "mission critical" than in the past, so I want to reduce the impact where possible. I figured this migration may be a good test to nail down a good process for future migrations of Cloudron data.
So I guess my question becomes... once I've restored the latest Cloudron backup to Server B, what particular paths are recommended to be synced back to Server B from Server A to keep in-sync? I assume basically everything under the /home/yellowtent/ path, is that accurate or is that too wide or missing anything obvious? Has anyone done a similar kind of migration using floating IPs and having Server B already setup and ready to go with a diff-sync of files modified on Server A since the restore on Server B to ensure nothing is lost?
Curious for your thoughts / recommendations. Thanks in advance! Of course when I run through this if I encounter anything I can share back that I learn and can help others, I'll do so as always.
-
I recently migrated, but I went from old hardware to new, with the same IP.
I didn't care if the server was offline inbetween the short change over time, because other mail servers should simply queue the mail for delivery for around 72 hours normally. I had some SQL corruption upon restore, and support (girish) fixed this in good time. Adding DNS propagation to the recipe and wanting minimal downtime, seems ambitious.
-
@AartJansen said in Migration from one server to another with a floating IP and minimizing downtime:
I didn't care if the server was offline inbetween the short change over time, because other mail servers should simply queue the mail for delivery for around 72 hours normally.
Correct, incoming emails will be re-sent if the mail server is offline however because of the time it takes to back up and restore (even just the mail portion) the data in my Cloudron instance it can mean downtime of about 2-3 hours. If it could all be done in less than an hour I’d probably just stick to my regular process and deal with it, but as I’ve on-boarded more and more clients with more data, my backups (and in-turn the restore) takes quite a while to do.
I think that is also a difference here... In your case you were fine with downtime and in my case I’m wanting to avoid it or at least keep it at a minimum at this point. I’ve migrated Cloudron successfully several times in the past few years but luckily was in a position to tolerate an hour or so of downtime without any complaints from my clients. However my freelance business has grown to a point now where I really need to treat any kind of downtime as a higher priority than I used to.
So in an effort to grow and mature my business, I want to look at a migration method that would allow for minimal downtime for my clients. I think I have a plan that will work but I haven’t done it yet, so going to run a test in the coming days I think. Wanted to see if anyone has done something similar.
Adding DNS propagation to the recipe and wanting minimal downtime, seems ambitious.
DNS propagation won’t be an issue here at all as the hostnames already point to the failover/floating IP. In other words, the DNS is already propagated. So once I’m ready to make the cut over to Server B from Server A, all I have to do is update which server/service the floating IP is assigned to in OVH and it’s a change that takes place pretty much immediately.
-
@d19dotca said in Migration from one server to another with a floating IP and minimizing downtime:
Has anyone done a similar kind of migration using floating IPs and having Server B already setup and ready to go with a diff-sync of files modified on Server A since the restore on Server B to ensure nothing is lost?
Never tried this, but if you do let us know.
Are Server A and Server B in completely different VPS providers or are you switching regions? Since you have a floating IP, I imagine it is the latter. You can ask if your VPS has the tools for this. Usually, one can use disk replication like https://core.vmware.com/resource/vsphere-replication-faq but of course these tools are not exposed to end users. Despite all this technology, in DO and AWS, server snapshots take forever. It's 1-3 minutes per GB. So, for 100GB, this is like a couple of hours! I don't know why it's so slow within the same region. And transferring the snapshot to another region takes a lot of time as well.
So, in the end, I agree. Would be nice to have some sort of incremental restore or something. If there are ideas outside Cloudron we can copy, we can take a look.
-
@girish said in Migration from one server to another with a floating IP and minimizing downtime:
Are Server A and Server B in completely different VPS providers or are you switching regions? Since you have a floating IP, I imagine it is the latter.
Neither of those, funny enough. I am moving from one server to another in the same region with the same host (OVH) thankfully. Technically a different data centre I suppose (BHS 2 to BHS 6) but still the same overall region (BHS at OVH). It’s also a different instance type entirely though (i.e. dedicated to public cloud), so unfortunately moving ’images’ isn’t a possibility here.
So, in the end, I agree. Would be nice to have some sort of incremental restore or something. If there are ideas outside Cloudron we can copy, we can take a look.
I think it’d be awesome if Cloudron implemented a kind of sync or something so that it can automatically periodically sync with another Cloudron instance not only for my use-case of a migration but for a possibly more common use-case of simply having a standby server up and ready. Likely neither are super popular scenarios but they certainly do happen so would be good to have. Barring that for now, I am hoping something like rsync will be enough to do the job.
Do you think all I need to sync is the /home/yellowtent/* directory? Or are there other spots that would be required? All the main data is in that directory isn’t it so presumably that’s all we’d need to sync? And I guess I could be even more targeted than that so I’m not having to pull in logs and stuff too I guess.
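One way to be more targeted is to add an exclude rule to the dry run. The following is only a sketch under assumptions: the logs are assumed to live in platformdata/logs/ under /home/yellowtent/ (verify this on your own server), and `SRC_USER`, `SRC_HOST` and `SRC_PORT` are hypothetical placeholders. It prints the command for review rather than running it.

```shell
#!/bin/sh
# Sketch: print (not run) a dry-run rsync that skips the platform logs.
# SRC_USER, SRC_HOST and SRC_PORT are hypothetical placeholders.
SRC_USER="${SRC_USER:-user}"
SRC_HOST="${SRC_HOST:-203.0.113.10}"
SRC_PORT="${SRC_PORT:-22}"

print_sync_cmd() {
  # A leading "/" anchors the exclude pattern at the root of the transfer,
  # which here is /home/yellowtent/ (assumed location of the logs directory).
  printf '%s\n' "rsync --archive --compress --delete-delay --dry-run --exclude=/platformdata/logs/ --rsync-path='sudo rsync' -e 'ssh -p ${SRC_PORT}' ${SRC_USER}@${SRC_HOST}:/home/yellowtent/ /home/yellowtent/"
}

print_sync_cmd
```

Note that with a delete option, rsync leaves excluded paths on the destination alone unless `--delete-excluded` is also given.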
-
@girish Interestingly, I'm getting close I think but I am getting a lot of these kind of errors in an rsync dry run and this seems to then conclude in a "code 23" error.
The command I'm running is this:
rsync --progress --human-readable --delete-delay --archive --compress --rsync-path="sudo rsync" -e 'ssh -p {port}' {user}@{IP}:/home/yellowtent/ /home/yellowtent/ --dry-run
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/mysql/dfe0390f47991409/wp_usermeta.MYD": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/mysql/dfe0390f47991409/wp_usermeta.MYI": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/mysql/dfe0390f47991409/wp_usermeta_2014.sdi": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/mysql/dfe0390f47991409/wp_users.MYD": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/mysql/dfe0390f47991409/wp_users.MYI": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/mysql/dfe0390f47991409/wp_users_2015.sdi": Permission denied (13)
[...]
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/PG_VERSION": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/pg_hba.conf": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/pg_ident.conf": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/postgresql.auto.conf": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/postgresql.conf": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/postmaster.opts": Permission denied (13)
rsync: [generator] recv_generator: failed to stat "/home/yellowtent/platformdata/postgresql/14/main/postmaster.pid": Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [generator=3.2.7]
I assume this has to do with needing to use the yellowtent user or something, but still working on it. Figure I'd give a quick update in case there was some suggestions to get past it.
-
If I modify the command to this, I get a slightly different set of errors but ultimately still permission denied.
Command:
rsync --progress --human-readable --delete-delay --archive --compress --rsync-path="sudo -u yellowtent rsync" -e 'ssh -p {port}' {user}@{IP}:/home/yellowtent/ /home/yellowtent/ --dry-run
rsync: [sender] opendir "/home/yellowtent/platformdata/mysql/e895bea3fd021766" failed: Permission denied (13)
rsync: [sender] opendir "/home/yellowtent/platformdata/mysql/f426f5f438dd9395" failed: Permission denied (13)
rsync: [sender] opendir "/home/yellowtent/platformdata/mysql/fcd9711342a78f4d" failed: Permission denied (13)
rsync: [sender] opendir "/home/yellowtent/platformdata/mysql/mysql" failed: Permission denied (13)
rsync: [sender] opendir "/home/yellowtent/platformdata/mysql/performance_schema" failed: Permission denied (13)
rsync: [sender] opendir "/home/yellowtent/platformdata/mysql/sys" failed: Permission denied (13)
rsync: [sender] opendir "/home/yellowtent/platformdata/postgresql/14/main" failed: Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [generator=3.2.7]
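For reference, the "code 23" at the end of that output is rsync's exit status for a partial transfer. A small sketch of how a wrapper script could translate the codes most relevant to a live sync; the meanings are taken from the EXIT VALUES section of the rsync man page, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: translate a few rsync exit codes into text.
# explain_rsync_exit is a hypothetical helper name, not part of rsync.
explain_rsync_exit() {
  case "$1" in
    0)  echo "success" ;;
    23) echo "partial transfer due to error" ;;
    24) echo "partial transfer due to vanished source files" ;;
    *)  echo "see the EXIT VALUES section of the rsync man page" ;;
  esac
}

explain_rsync_exit 23
```

Code 24 is worth knowing when syncing from a server that is still live, since files can disappear between the file-list scan and the transfer.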
-
Btw, this looks wrong to me somehow... shouldn't this all be owned by yellowtent?
ubuntu@my:~$ ls -alh /home/yellowtent/platformdata/
total 112K
drwxr-xr-x 21 yellowtent yellowtent 4.0K Oct 12 08:17 .
drwxr-xr-x 7 yellowtent yellowtent 4.0K Nov 13 10:01 ..
-rw-r--r-- 1 yellowtent yellowtent 5 Nov 19 2022 CRON_SEED
-rw-r--r-- 1 yellowtent yellowtent 1.3K Nov 13 10:01 INFRA_VERSION
-rw-r--r-- 1 yellowtent yellowtent 5 Nov 13 10:01 VERSION
drwxr-xr-x 2 yellowtent yellowtent 4.0K Nov 22 10:10 acme
drwxr-xr-x 3 yellowtent yellowtent 4.0K Oct 12 08:17 addons
drwxr-xr-x 2 yellowtent yellowtent 4.0K Nov 25 2022 backup
drwxr-xr-x 2 yellowtent yellowtent 4.0K Nov 25 2022 cifs
drwxr-xr-x 2 yellowtent yellowtent 4.0K Nov 19 2022 collectd
-rw-r--r-- 1 yellowtent yellowtent 826 Nov 19 2022 dhparams.pem
-rw-r--r-- 1 yellowtent yellowtent 6.6K Nov 23 03:53 diskusage.json
-rw-r--r-- 1 yellowtent yellowtent 245 Nov 23 21:42 features-info.json
drwxr-xr-x 2 yellowtent yellowtent 4.0K Dec 30 2022 firewall
drwxr-xr-x 3 tss sgx 4.0K Apr 4 2023 graphite
drwxr-xr-x 2 root root 4.0K Nov 22 06:00 logrotate.d
drwxr-xr-x 75 yellowtent yellowtent 4.0K Nov 19 17:30 logs
drwxr-xr-x 8 tss root 4.0K Nov 23 22:25 mongodb
drwxr-xr-x 33 tss sgx 4.0K Nov 23 05:22 mysql
drwxr-xr-x 4 yellowtent yellowtent 4.0K Nov 23 10:10 nginx
drwxr-xr-x 2 yellowtent yellowtent 4.0K Jul 5 00:37 oidc
drwxr-xr-x 3 root root 4.0K Apr 4 2023 postgresql
drwxr-xr-x 25 root root 4.0K Nov 19 17:30 redis
drwxr-xr-x 3 yellowtent yellowtent 4.0K Nov 19 2022 sftp
drwxr-xr-x 2 yellowtent yellowtent 4.0K Nov 19 2022 sshfs
drwxr-xr-x 2 yellowtent yellowtent 4.0K Dec 1 2022 tls
drwxr-xr-x 2 yellowtent yellowtent 4.0K Nov 13 08:34 update
Maybe this is the issue, that some of these files/directories are owned by something called "tss" instead?
-
@d19dotca said in Migration from one server to another with a floating IP and minimizing downtime:
Maybe this is the issue, that some of these files/directories are owned by something called "tss" instead?
The usernames in containers appear different from the usernames on the host. This tss is possibly uid 1000 or something like that on the host (check /etc/passwd).
-
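To see what is behind a name like tss, it can help to look at the numeric ids instead of the names. A minimal sketch, assuming GNU coreutils `stat` and a system with `getent`; both helper function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: inspect numeric ownership instead of user/group names.
# Assumes GNU coreutils `stat` (the -c option) and `getent` are available.

numeric_owner() {
  # Prints "uid:gid" for the given path.
  stat -c '%u:%g' "$1"
}

owner_name_for_uid() {
  # Prints the local account name for a uid, or "unknown" if there is none.
  getent passwd "$1" | cut -d: -f1 | grep . || echo "unknown"
}

# Example usage (path taken from the listing in this thread):
#   numeric_owner /home/yellowtent/platformdata/mysql
#   owner_name_for_uid 1000
```

If the same uid maps to different names on the two servers, rsync's `--numeric-ids` option transfers the uid/gid values as-is rather than mapping them by user and group name.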
@girish Ah you called it - sorry for forgetting the sudo part, can't believe I forgot that, lol. Been trying too many things today I guess, haha.
Here's the command that seems to work and its stats output, although this seems maybe incorrect to me given the size of the data and how much needs to be created and such, considering the backup I restored from is only 8 hours old or so. However, as a percentage of how many files exist in total, what needs to be deleted or created is fairly low, so I guess maybe it makes sense still.
Command:
sudo rsync --stats --human-readable --delete-delay --archive --compress --rsync-path="sudo -u yellowtent rsync" -e 'ssh -p {port}' {user}@{IP}:/home/yellowtent/ /home/yellowtent/ --dry-run
Number of files: 555,244 (reg: 512,959, dir: 42,208, link: 77)
Number of created files: 2,308 (reg: 2,213, dir: 95)
Number of deleted files: 702 (reg: 690, dir: 12)
Number of regular files transferred: 11,365
Total file size: 92.88G bytes
Total transferred file size: 18.64G bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 14.17M
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 286.61K
Total bytes received: 24.61M

sent 286.61K bytes received 24.61M bytes 1.42M bytes/sec
total size is 92.88G speedup is 3,731.42 (DRY RUN)
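To put those stats in proportion, here is a quick sketch of the percentage arithmetic using the figures from the dry run above. The `pct` helper is a made-up name; it only needs `awk`.

```shell
#!/bin/sh
# Sketch: express one number as a percentage of another, to one decimal place.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f%%\n", part / total * 100 }'
}

pct 11365 512959   # regular files to transfer vs. all regular files
pct 18.64 92.88    # data to transfer (G) vs. total file size (G)
```

This works out to roughly 2.2% of the regular files but about 20.1% of the total bytes, so the files flagged for transfer are much larger than average.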
-
Ah, I figured out why it's so many. When I used the --itemize-changes flag, it showed that a lot of the files are being brought over because of their timestamps. The timestamp in the source is older than the timestamp in the destination, because some of these files got the time of the backup restoration on Server B as their timestamp. Since I'm using the --archive flag, rsync tries to keep the timestamps on the destination matching the source. I believe that's why it's picking up so many more changes than I expected.
-
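A rough way to quantify this is to count how many itemized entries are flagged for modification time only. This is a sketch, assuming the 11-character `YXcstpoguax` change string that rsync's `--itemize-changes` option prints (as in rsync 3.2.x); the function name and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: count regular files whose only flagged attribute is the mtime.
# Reads --itemize-changes output on stdin; count_time_only is a hypothetical name.
count_time_only() {
  # Field 1 is the change string: an update type (">" or "."), "f" for a
  # regular file, then attribute flags where only "t" (time) is set.
  awk '$1 ~ /^[>.]f\.\.t\.+$/ { n++ } END { print n + 0 }'
}

# Made-up sample lines in the itemized format, for illustration only.
printf '%s\n' \
  '>f..t...... data/a.txt' \
  '>f.st...... data/b.txt' \
  '>f+++++++++ data/c.txt' \
  '*deleting   data/d.txt' | count_time_only
```

With the sample input only the first line counts: the second also differs in size, the third is a new file, and the fourth is a deletion. In practice the output of the dry run with `--itemize-changes` added would be piped into the function.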
Okay, I've migrated servers today and flipped the switch, things seem to be running steady although since I've been combing through the logs I do see a few concerns but not sure if this is related to the migration or not. I'll write up more descriptive tasks I took and details for those who want to attempt this in the future by the way.
I'm seeing this somewhat often in my Mail logs:
Nov 23 17:34:25 [ERROR] [224C5AD2-AF35-4EE2-A3F3-05980A9896E2.1] [limit] conn_concur_decr:Error: MISCONF Redis is configured to save RDB snapshots, but it's currently unable to persist to disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
I assume that's the issue perhaps that I hadn't realized was related to the changelog mentioned here?
@girish said in Email Event Log loading very slowly, seems tied to overall Email domain list health checks:
This seems related to the redis issue . Think it gets fixed with https://git.cloudron.io/cloudron/box/-/commit/e64182d79134e8828c2fa953c676a8f6b08247b7
One issue I did run into btw was around MongoDB, where it refused to start up and kept complaining about possible corruption. No matter what I did it wouldn't work, so I just ended up backing it up, then removing the files in the `/home/yellowtent/platformdata/mongodb/` directory, running an rsync again to bring over the files from the working server, and restarting the MongoDB service. All went well after that.
-
@girish I'm also seeing this in my logs repeatedly for box logs, seems related perhaps to the email hosting part:
Nov 23 17:39:31 box:server no such route: GET eventlog?page=1&per_page=20&search=&types=&access_token=<redacted> [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at new NodeError (node:internal/errors:399:5)
    at ServerResponse.setHeader (node:_http_outgoing:645:11)
    at ServerResponse.header (/home/yellowtent/box/node_modules/express/lib/response.js:794:10)
    at ServerResponse.send (/home/yellowtent/box/node_modules/express/lib/response.js:174:12)
    at ServerResponse.json (/home/yellowtent/box/node_modules/express/lib/response.js:278:15)
    at ServerResponse.send (/home/yellowtent/box/node_modules/express/lib/response.js:162:21)
    at /home/yellowtent/box/node_modules/connect-lastmile/lib/index.js:80:28
    at Layer.handle_error (/home/yellowtent/box/node_modules/express/lib/router/layer.js:71:5)
    at trim_prefix (/home/yellowtent/box/node_modules/express/lib/router/index.js:326:13)
    at /home/yellowtent/box/node_modules/express/lib/router/index.js:286:9
Nov 23 17:39:31 box:server no such route: GET solr_config?access_token=<redacted> [ERR_HTTP_HEADERS_SENT]: Cannot set headers after they are sent to the client
    at new NodeError (node:internal/errors:399:5)
    at ServerResponse.setHeader (node:_http_outgoing:645:11)
    at ServerResponse.header (/home/yellowtent/box/node_modules/express/lib/response.js:794:10)
    at ServerResponse.send (/home/yellowtent/box/node_modules/express/lib/response.js:174:12)
    at ServerResponse.json (/home/yellowtent/box/node_modules/express/lib/response.js:278:15)
    at ServerResponse.send (/home/yellowtent/box/node_modules/express/lib/response.js:162:21)
    at /home/yellowtent/box/node_modules/connect-lastmile/lib/index.js:80:28
    at Layer.handle_error (/home/yellowtent/box/node_modules/express/lib/router/layer.js:71:5)
    at trim_prefix (/home/yellowtent/box/node_modules/express/lib/router/index.js:326:13)
    at /home/yellowtent/box/node_modules/express/lib/router/index.js:286:9
Nov 23 17:39:34 box:server no such route: GET usage?domain={domain}&access_token=<redacted>
Any suggestions on this one?
-
@d19dotca said in Migration from one server to another with a floating IP and minimizing downtime:
One issue I did run into btw was around MongoDB, where it refused to startup and it kept complaining about possible corruption.
Yeah, this is why the rsync solution is not entirely recommended. The databases probably hold state in memory as well. When we do a live rsync while the databases are running, it's possible that we are copying semi-baked stuff.
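One possible way to reduce that risk, sketched here as an assumption rather than a tested procedure, is to leave the database data directories out of the live sync passes and copy them only in a final pass once the databases on the source have been stopped (how to stop them is not covered in this thread). The directory names are assumptions based on a typical /home/yellowtent/platformdata/ layout, so verify them on your own server.

```shell
#!/bin/sh
# Sketch: emit --exclude flags so a live sync pass skips database data dirs.
# The directory names below are assumptions; check your own platformdata/.
db_excludes() {
  for d in mysql postgresql mongodb redis; do
    # A leading "/" anchors each pattern at the root of the transfer.
    printf '%s/platformdata/%s/ ' '--exclude=' "$d"
  done
  printf '\n'
}

db_excludes
```

The output is meant to be spliced into an rsync command line unquoted, for example `rsync --archive --dry-run $(db_excludes) SRC DEST`, so that each flag becomes its own argument.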
-
@d19dotca said in Migration from one server to another with a floating IP and minimizing downtime:
Any suggestions on this one?
Fix for this is coming next release, you can ignore it.
edit: fixed in https://git.cloudron.io/cloudron/box/-/commit/a056bcfdfe6c7bcb6d2f1cea2017c54f2ba6750f
-
@girish said in Migration from one server to another with a floating IP and minimizing downtime:
@d19dotca said in Migration from one server to another with a floating IP and minimizing downtime:
One issue I did run into btw was around MongoDB, where it refused to startup and it kept complaining about possible corruption.
Yeah, this is why the rsync solution is not entirely recommended. The databases probably hold state in memory as well. When we do a live rsync while the databases are running, it's possible that we are copying semi-baked stuff.
It seemed to work fine but just meant I had to delete the MongoDB files and bring back over from the source server. Everything was good after that. But yeah not the simplest migration process.
It worked though. I’ve been running fully on the new server since Friday and so far have seen no issues beyond discovering some known ones I hadn’t seen earlier, which you’ve confirmed have fixes coming soon. The amount of downtime was maybe 15-20 minutes, and that was mostly just for the apps using MongoDB (I only had 2 apps using it), largely due to restarting while trying to figure out the solution to the MongoDB errors. Once that was solved, the two apps depending on it came back up again, and there should be much less downtime next time now that I know how to avoid the MongoDB issue. For the apps that didn’t rely on MongoDB it was maybe 5-10 minutes. Much better than the 2+ hours of downtime needed for a normal migration due to backup and restore times with object storage. Going to try and write up some more notes on my experience for others who it may help in similar situations.