Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out"
-
Hello,
I noticed there was an incident with smtp.live.com about 3 hours ago according to the Cloudron notifications page. It only affected five domains, but they all share the same SMTP endpoint, so I suspect there was a blip on Microsoft's side. Just an FYI.
I'm not concerned because I know it's a false alert, but it did get me thinking: would it not be better for the healthcheck to try one more SMTP destination if the first one fails? That would likely avoid false positives like this one.
-
I continue to get periodic failures to specifically the smtp.live.com server, by the way, causing random health check failures even though it's really fine overall.
Jan 30 00:01:58 box:mail Ignored error - relay : Connect to smtp.live.com timed out.
Jan 30 00:01:58 box:mail Ignored error - relay : Connect to smtp.live.com timed out.
Jan 30 00:01:58 box:mail Ignored error - relay : Connect to smtp.live.com timed out.
Jan 30 00:01:59 box:mail Ignored error - relay : Connect to smtp.live.com timed out.
Jan 30 00:01:59 box:mail Ignored error - relay : Connect to smtp.live.com timed out.
Jan 30 00:01:59 box:mail Ignored error - relay : Connect to smtp.live.com timed out.
-
I have also been getting random healthcheck failures since yesterday. Always one or two mail domains randomly show red and if I refresh the page they show green but probably a different one or two will show red. Mail is working fine though and if I check the status tab of a domain showing red on the overview page, everything is green. Looks like a DNS error / timeout as @timconsidine mentioned.
-
@timconsidine & @ccfu - I don't think this is a DNS error at all (if it was I'd expect different log entries). This is just a simple timeout. It knows where smtp.live.com is and tries to connect but it times out (in other words smtp.live.com isn't responding within a specified time). I'm pretty sure the issue is on Microsoft's side in this case.
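To illustrate the difference: a failed lookup and a connect timeout surface as different errors, so they are easy to tell apart. A minimal Python sketch (Cloudron's own code is JavaScript; the function names here are made up for illustration):

```python
import socket


def classify_connect_error(exc: BaseException) -> str:
    """Map an exception raised while connecting to a coarse cause."""
    if isinstance(exc, socket.gaierror):
        return "dns"      # the hostname could not be resolved
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"  # the host resolved, but did not answer in time
    if isinstance(exc, ConnectionRefusedError):
        return "refused"  # the host answered and rejected the connection
    return "other"


def probe(host: str, port: int = 25, timeout: float = 5.0) -> str:
    """Try a plain TCP connect and report "ok" or the failure class."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connect_error(exc)
```

A log line like "Connect to smtp.live.com timed out" corresponds to the "timeout" branch, not the "dns" one.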
@archos - as long as it's intermittent for you, then yes it should be nothing to worry about. It's likely the same checks to smtp.live.com as I'm experiencing too.
@staff - I think it'd be great if we could have some redundancy built-in. This isn't the first time this has happened to my knowledge. Sometimes free SMTP services have issues. I think it'd be great to change the logic to be "Connect to SMTP A and see if it succeeds. If Connect to SMTP A fails, then attempt one more Connect but this time to SMTP B to verify the failure. If SMTP B is a success, mark as success. If both SMTP A and SMTP B are failures, mark as failure."
-
FYI - MXToolbox reports the same issue: https://mxtoolbox.com/SuperTool.aspx?action=smtp%3Asmtp.live.com&run=toolpage
Connecting to 204.79.197.212
1/30/2022 10:56:31 AM Connection attempt #1 - Unable to connect after 15 seconds. [15.02 sec]
LookupServer 15082ms
-
Perhaps the short list here needs to be updated with a few more too: https://git.cloudron.io/cloudron/box/-/blob/master/src/mail.js#L171-176
In doing some testing, I'd suggest adding these three to the list for checks too...
smtp.mail.yahoo.com (Report: https://mxtoolbox.com/SuperTool.aspx?action=smtp%3Asmtp.mail.yahoo.com&run=toolpage)
smtp.aol.com (Report: https://mxtoolbox.com/SuperTool.aspx?action=smtp%3Asmtp.aol.com&run=toolpage)
mail.gmx.com (Report: https://mxtoolbox.com/SuperTool.aspx?action=smtp%3Amail.gmx.com&run=toolpage)
-
Just adding to the chorus of people noticing this happen to them.
I just spotted this notification from 12 hours ago:
Relay error: Connect to smtp.live.com timed out. Check if port 25 (outbound) is blocked
I wonder if the timeout settings on smtp.live.com have recently changed or something to make it time out quicker.
-
Indeed, smtp.live.com is apparently gone or does not respond on port 25 anymore.
Some background: Cloudron tries to connect to some well-known servers on port 25 for diagnostic purposes. It uses this to check whether outbound port 25 is allowed on the VPS. It's not really used for anything else. The list of servers comes from https://git.cloudron.io/cloudron/box/-/blob/master/src/mail.js#L172
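As a rough illustration of that kind of check (not the actual mail.js code, which is JavaScript): the Python sketch below assumes one server is picked at random per check, and the host names and function name are placeholders.

```python
import random
import socket

# Placeholder hosts for illustration only; the real list is in the mail.js file linked above.
CANDIDATE_SERVERS = ["smtp.example.net", "smtp.example.org"]


def check_outbound_port_25(servers, timeout=5.0,
                           connector=socket.create_connection, rng=random):
    """Pick one server and try a TCP connect on port 25.

    Returns (host, None) on success or (host, error_message) on failure.
    """
    host = rng.choice(servers)
    try:
        conn = connector((host, 25), timeout=timeout)
    except OSError as exc:
        # Covers timeouts, refused connections and DNS failures alike.
        return host, f"Connect to {host} failed: {exc}"
    conn.close()
    return host, None
```

With a single probe like this, a dead or slow server is indistinguishable from a blocked port, which is how one unresponsive host can raise the warning.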
The warning can be ignored, for the moment. I have removed it in the next release.
I think we will try to create a smtpdiag.cloudron.io or something to test port 25 reachability.
-
@girish Hi Girish! I think it's a good idea to add a Cloudron-controlled SMTP server for testing purposes. I would still suggest a two-check failure workflow to avoid false positives like this, as that is best practice in similar scenarios outside of Cloudron (like liveness probes in Kubernetes, which generally require multiple failures before acting, to avoid false positives). If it's too much work, I understand. I just think it'd be really helpful for these types of scenarios, so I'd love to see health checks done in a way that avoids false positives like this one.
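For reference, Kubernetes probes have a failureThreshold setting (default 3) that requires that many consecutive failures before the probe counts as failed. A minimal Python sketch of the same idea, with a hypothetical class name that is not part of Cloudron or Kubernetes:

```python
from dataclasses import dataclass


@dataclass
class FailureThreshold:
    """Report a failure only after `threshold` consecutive failed probes."""
    threshold: int = 3
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True if the check should now alert."""
        if success:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A single timeout followed by a success never reaches the threshold, so it would not produce a notification.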
-
@d19dotca Thing is, since we don't control external services, it's hard to tell why something failed. Did they blacklist the server IP? Was it because outbound port 25 is blocked? Was it because the service died temporarily or even permanently (as is the case for this post)?
At least, when I wrote the code, I didn't expect these services to go away. By now, all but 2 services remain. We started with around 5 services, 5 years ago. Anyway, I have now deployed port25check.cloudron.io and the code from the next release will use that to check connectivity. Since we don't blacklist there and will keep it running, we can be fairly certain that the VPS outbound port 25 is blocked. Let's see.
-
@girish said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
Thing is since we don't control external services, it's hard to tell why something failed.
For sure, but that's also why double-checking after a failure would be best: it avoids false positives instead of letting one failure generate a ton of alerts. If one failure also caused Cloudron to skip that server for a few hours, any rate-limiting or blacklisting would have time to resolve on its own, and there would be no need to wait for an entire new release to update the list of SMTP servers as they change. If one fails but one succeeds, we automatically know port 25 outbound is not blocked.
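A minimal Python sketch of the "skip a failed server for a few hours" idea, under my own assumptions: the class name, the three-hour default and the injectable clock are all made up for illustration, and nothing like this exists in Cloudron today as far as I know.

```python
import time


class ServerPool:
    """Keep a list of test servers and bench any that recently failed."""

    def __init__(self, servers, cooldown_seconds=3 * 3600, clock=time.monotonic):
        self.servers = list(servers)
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.benched_until = {}  # host -> time at which it may be used again

    def mark_failed(self, server):
        """Take a server out of rotation for the cooldown period."""
        self.benched_until[server] = self.clock() + self.cooldown_seconds

    def available(self):
        """Servers that are not currently benched, in their original order."""
        now = self.clock()
        return [s for s in self.servers
                if s not in self.benched_until or self.benched_until[s] <= now]
```

A check would pick from `available()` and call `mark_failed()` on a timeout, so a rate-limited or dead server stops being retried on every run.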
I have now deployed port25check.cloudron.io and the code from the next release will use that to check connectivity. Since we don't blacklist there and will keep it running, we can be fairly certain that the VPS outbound port 25 is blocked.
That's awesome and will add to the troubleshooting ability!! Happy to see that too.
Personally I'd still love to see redundancy in place, as there will certainly be the rare outage on your end too as with other services, but this will at least add a bit more under your control to help lessen the likelihood of false positives which is still a step in the right direction. If I'm banging the drum well past my allotted time on this then that's understandable as it certainly isn't major, just something I'd love to see improved further still. I'll let it go now.
Thanks for everything you do!
-
@d19dotca no worries. I think the issue is this: let's say we add another external service as a dependency and the connection does not work. What should we show in the UI? That outbound port 25 works, or that it does not? Is it useful to have messages like "We managed to connect to port25check.cloudron.io but not to smtp.live.com" (or any of those combinations)? I suspect users will come back with the same questions/confusion as they do now. At least, the code is currently written with the assumption that connectivity (or not) is a "reliable" indicator of outbound port 25. Maybe I misunderstood what you mean by redundancy.
-
@girish said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
what should we show in the UI? That outbound port 25 works or it does not? Is it useful to have messages like "We managed to connect to port25check.cloudron.io but not to smtp.live.com" (or any of those combinations).
Good question! I don't actually think anything should be shown in the UI if only one SMTP test fails out of two, as that scenario would imply a false-positive.
So what I envision is the following (hopefully this explains it better):
-- Cloudron runs periodic checks against one of several SMTP servers for testing purposes.
---- If the check succeeds, then wait for the next check at the 30-minute interval.
---- If the check fails, then run one more test right away (or even 60 seconds later to avoid network blips on the VPS) against a second, different SMTP server to validate the finding.
------ If the second SMTP server succeeds, then ignore the initial failure and mark as successful. Possibly make a log entry, but nothing needed in the UI.
------ If the second SMTP server fails, log the errors with more details (mention both SMTP servers that were checked and failed). In the UI, show a message similar to: Relay error: SMTP connection tests failed. Check if port 25 (outbound) is blocked. View the Cloudron logs for more details.
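The steps above could be sketched roughly like this in Python. It is only an illustration: the function name, the `probe` callback and the logger name are made up, and the real code would live in Cloudron's JavaScript.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("mailcheck")  # hypothetical logger name

UI_MESSAGE = ("Relay error: SMTP connection tests failed. Check if port 25 (outbound) "
              "is blocked. View the Cloudron logs for more details.")


def check_relay(primary: str, secondary: str,
                probe: Callable[[str], Optional[str]],
                retry_delay: float = 60.0,
                sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Return None when the check passes, or the UI message when both servers fail.

    `probe(host)` returns None on success or an error string on failure.
    """
    first_error = probe(primary)
    if first_error is None:
        return None  # first check passed; wait for the next scheduled run

    sleep(retry_delay)  # brief pause to ride out a network blip
    second_error = probe(secondary)
    if second_error is None:
        # One failure, one success: treat it as a false positive, log only.
        log.info("Ignored failure for %s (%s); %s succeeded",
                 primary, first_error, secondary)
        return None

    # Both failed: log the details and show the generic message in the UI.
    log.error("SMTP connection tests failed: %s (%s), %s (%s)",
              primary, first_error, secondary, second_error)
    return UI_MESSAGE
```

The second probe only runs after a failure, so the normal case still costs a single connection.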
I don't really think the exact servers need to be listed in the UI if they're already in the logs. If both SMTP servers fail, it'll be with much higher confidence that port 25 outbound is blocked and that should be the admin's focus. If they can confirm that it's not blocked, then they can use the logs to get more details and run additional tests from their server.
That's how I picture it anyway. I see that as helping avoid false positives while also providing enough detail in the logs for when an issue is actually detected (and more confidently in that case too). The UI can be simplified slightly to refer the admin to their logs for further details while still suggesting that port 25 may be blocked.
Side note: I just checked and the "troubleshooting" hyperlink at the bottom of the alert message overall leads to an incorrect spot. May need to be updated to perhaps https://docs.cloudron.io/email/#outbound-smtp or something like that.
-
@girish are we supposed to be doing anything with our email dashboards? I have a number of domains which are shown as red, but when I check some of them in the domain's status panel, everything shows as green.
Not too worried, just not sure what we should be doing.
-
@timconsidine yes, correct, nothing to worry about here. It will be fixed in the upcoming update.
-
@girish
https://port25check.cloudron.io/ produces an error, as the cert is only for api.cloudron.io