Mail Server redundancy - Establish backup/failover mail server on another (Cloudron) VPS instance
-
I am reaching out to the community with this question (if it is covered somewhere already, my apologies).
I am interested in creating a "Cloudron Mail Server redundancy". Let's say my primary Cloudron instance (including the mail server) goes down for an extended period of time: what's the best way to have a secondary server "jump in" so that e-mails don't get lost?
I am interested in full redundancy, which means the secondary mail server should hold all e-mails from the primary server. Does anyone have experience with such a setup?
-
I am interested in the same thing actually.
One thing I've considered is using a backup MX service provider for any extended outages of my own Cloudron server, but this requires a little bit of manual intervention too, so it isn't completely seamless when there's an outage. Still, it works as a partial solution: it at least makes sure no incoming messages are missed, though your own clients may not be able to send during that window.
The above is not a complete answer to your request though, because it isn't "full redundancy". For full redundancy, I suppose you could (I've never done this myself yet) run a secondary Cloudron server concurrently with your primary one, and then use rsync or something similar to copy all the mail across once the secondary Cloudron server is up (which requires restoring from a backup first), so they stay in sync. The one thing I can't quite picture is how this would work seamlessly, because the server automatically uses "my.<domain>.<tld>". You'd almost have to run the other Cloudron server on a different domain to have both operational at the same time, and in that case you'd still likely need to manually switch DNS records and such for clients to keep sending and receiving mail.
Oh, and of course a third option is simply to do it manually (which is my method at the moment until I find a more automatic approach): restore the Cloudron server to a new VPS if the old one goes down completely, and then everything should resume as normal with maybe only an hour or so of downtime (which is obviously not ideal, but I see it as a worst-case scenario).
The above is just thinking out loud though. A similar project has been in the back of my mind too, so if I come across any good plans, I'll be sure to share them.
-
This answer sidesteps your question, but for email, servers going up and down is part of normal operation. Sending mail servers will usually keep retrying for up to 3 days if your server is down.
I think achieving true HA with Cloudron is not easy. It requires the mail server to be built from the ground up to be HA. For example, the mail storage has to use some external storage/object store for a start. And then the mail server itself has to be designed for HA (maybe it has caches etc that will interfere if you have multiple instances of it). Haraka is highly concurrent but I don't know if it works well in a distributed setup.
Finally, some of our customers use external products like duocircle as the primary MX and then that service will proxy to Cloudron. See https://cloudron.io/documentation/email/#alternate-mx (we did this in 5.2 iirc)
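To put the retry window mentioned above into numbers, here is a minimal Python sketch. Only the 3-day give-up window comes from this thread; the backoff schedule (first retry after 15 minutes, doubling up to a 4-hour cap) is purely an assumption for illustration, since every sending server uses its own schedule.

```python
from datetime import timedelta
from typing import List


def retry_offsets(first_delay: timedelta,
                  max_delay: timedelta,
                  give_up_after: timedelta) -> List[timedelta]:
    """Return the times (measured from the first failed attempt) at which a
    hypothetical sender retries, doubling the delay up to max_delay and
    stopping once the give-up window would be exceeded."""
    offsets = []
    elapsed = timedelta(0)
    delay = first_delay
    while elapsed + delay <= give_up_after:
        elapsed += delay
        offsets.append(elapsed)
        delay = min(delay * 2, max_delay)
    return offsets


def delivered_after_outage(outage: timedelta, offsets: List[timedelta]) -> bool:
    """True if at least one retry happens once the outage is over."""
    return any(offset >= outage for offset in offsets)


# Assumed schedule: 15 min first retry, 4 h cap, 3-day give-up window.
schedule = retry_offsets(timedelta(minutes=15),
                         timedelta(hours=4),
                         timedelta(days=3))
```

With this assumed schedule, `delivered_after_outage(timedelta(days=1), schedule)` is true, while an outage of four days outlasts every retry and the message would bounce.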
-
@girish @d19dotca Thank you for your feedback, good exchange.
I think I need to crawl before I walk, meaning I can work with manual steps to get this done.
@d19dotca I think the idea with rsync (or maybe even better lsyncd, https://github.com/axkibe/lsyncd) is a good one, and I believe I could handle this part. That would pretty much replicate a live backup to another Cloudron instance on another domain. How would one handle the user database? All (e-mail) users would have to exist on the other instance, too, right?
And then, once the primary Cloudron instance goes down, one would have to remap DNS to the new instance, right?
Conceptually, is this what one would have to do? @d19dotca just following in your footsteps and thoughts...
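A rough sketch of the one-way sync step discussed above, in Python. It only assembles an rsync command line and leaves running it to the caller. The host name and directory paths are hypothetical placeholders, not Cloudron's actual mail data location, and dry-run is the default so that `--delete` cannot remove anything on the standby by accident.

```python
import subprocess
from typing import List


def build_rsync_command(source_dir: str, remote: str, dest_dir: str,
                        dry_run: bool = True) -> List[str]:
    """Assemble an rsync call that mirrors source_dir to remote:dest_dir.

    --archive keeps permissions, ownership and timestamps, --delete removes
    files on the standby that no longer exist on the primary, and
    --compress reduces traffic between the two servers.
    """
    cmd = ["rsync", "--archive", "--delete", "--compress", "-e", "ssh"]
    if dry_run:
        cmd.append("--dry-run")  # only report what would change
    # A trailing slash on the source copies the directory's contents
    # rather than the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{remote}:{dest_dir}"]
    return cmd


def run_sync(cmd: List[str]) -> int:
    """Run the assembled command and return rsync's exit code."""
    return subprocess.run(cmd, check=False).returncode


# Hypothetical paths and host, for illustration only.
command = build_rsync_command("/path/to/mail/data",
                              "root@standby.example.net",
                              "/path/to/mail/data")
```

A tool like lsyncd would trigger this kind of transfer on file changes instead of on a timer. Either way, the user accounts still have to exist on the standby before the synced mailboxes are of any use.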
-
@Mallewax If you're okay with the manual method, then to answer your question about how to handle the users: I think that's where you'd need to restore Cloudron from a backup on the second server so the users stay consistent. After that, everything in the file system will be identical for the user mailboxes, and you can run rsync on top of that.
Another suggestion I just thought of: if you are using a provider like DigitalOcean, or anyone else that offers a "floating IP" address (other providers use different names for the same thing, such as "elastic IP"), you can point all your DNS records at that address. Then, instead of manually updating DNS entries and waiting for them to propagate when something fails, you can just reassign the IP address to the other server, reboot, and you should be good to go. Again, not tested, but I suspect something like this will work and cut down on downtime a bit further.
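Deciding when to move the floating IP could be sketched like this in Python. This is only an assumed approach, not something tested against Cloudron: it checks whether the primary accepts TCP connections on the SMTP port and suggests failing over only after several consecutive failures. The threshold of three is arbitrary, and the actual IP reassignment is left out because it depends on your provider's API.

```python
import socket
from typing import Sequence


def smtp_port_open(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def should_fail_over(results: Sequence[bool], threshold: int = 3) -> bool:
    """True only if the most recent `threshold` checks all failed, so a
    single dropped probe does not trigger a switch."""
    recent = list(results)[-threshold:]
    return len(recent) == threshold and not any(recent)
```

In practice you would call `smtp_port_open` on a timer from a machine outside the primary's network, append each result to a list, and reassign the floating IP once `should_fail_over` returns true.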
-
As I'm thinking about similar issues - how do you handle this in practice?
I run Cloudron at home on a NUC-like device and am currently thinking about redundancy in case of hardware failure or internet provider outages. Even for Cloudron on a VPS this could be interesting (keeping in mind the OVH outage last year that @d19dotca reported on).
So I’m thinking about
- cloning and keeping up-to-date my main hard drive on a separate drive (so that I can switch to a separate NUC-like device);
- keeping a "dormant" clone installation of my Cloudron on a VPS that I can activate when needed.
My main concern is mail, since it is the only service that is time-critical in the sense that you’d lose data (i.e. incoming mail) irrevocably if the system is down.
Is something like this possible? What’s the best approach?
-
@necrevistonnezr Great questions! I'm still pondering this myself, but got delayed with some other projects.
I don't have an HA setup yet, but I'd assume the easiest way to achieve it may be to duplicate the server (i.e. restore a Cloudron backup on a new server) and then use rsync to keep the two in sync, on the boxdata directory for example. DNS would then handle failover: the second server gets its own MX record with a higher preference value (i.e. lower priority), so it comes last in the selection and is only used if the first MX server isn't responding.
Just my initial two cents anyways, but I haven't thought it out thoroughly yet. I will probably give it more thought once the Cloudron multi-server management feature is in place, as it may make things a little bit easier.
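The MX-based failover described above relies on how sending servers pick a target: they try the record with the lowest preference value first and only move on to the next one if it cannot be reached. A minimal Python sketch of that selection logic follows; the host names and the `is_reachable` callback are hypothetical, and real mail servers also randomize among records with equal preference, which is ignored here.

```python
from typing import Callable, Iterable, List, Optional, Tuple

MxRecord = Tuple[int, str]  # (preference, hostname)


def delivery_order(records: Iterable[MxRecord]) -> List[str]:
    """Hosts in the order a sender tries them: lowest preference first."""
    return [host for _, host in sorted(records, key=lambda rec: rec[0])]


def pick_mail_server(records: Iterable[MxRecord],
                     is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable host, or None if every MX is down
    (in which case the sender queues the message and retries later)."""
    for host in delivery_order(records):
        if is_reachable(host):
            return host
    return None


# Hypothetical records: the primary at preference 10, the standby at 20.
records = [(20, "backup.example.net"), (10, "mail.example.com")]
```

With these example records, `pick_mail_server(records, lambda host: host != "mail.example.com")` returns `"backup.example.net"`, which is the case where the primary is down and the standby takes the incoming mail.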