Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out"
-
Just adding to the chorus of people noticing this happen to them.
I just spotted this notification from 12 hours ago:
Relay error: Connect to smtp.live.com timed out. Check if port 25 (outbound) is blocked
I wonder if the timeout settings on smtp.live.com have recently changed or something to make it time out quicker.
-
Indeed, smtp.live.com is apparently gone or does not respond to port 25 anymore.
Some background: Cloudron tries to connect to some well-known servers on port 25 for diagnostic purposes. It uses this to check if outbound port 25 is allowed on the VPS. It's not really used for anything else. The list of servers comes from https://git.cloudron.io/cloudron/box/-/blob/master/src/mail.js#L172
The warning can be ignored, for the moment. I have removed it in the next release.
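For readers wondering what such a diagnostic amounts to, here is a minimal sketch in Python of a port 25 reachability probe. It is not Cloudron's actual code (that lives in the linked mail.js); the function name `can_connect` and the 5-second default timeout are assumptions made for illustration.

```python
import socket


def can_connect(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures all end up here,
        # so a False result cannot say *why* the connection failed.
        return False
```

A probe like this only reports reachable or not reachable. It cannot tell a blocked outbound port apart from a remote server that has gone away, which is exactly the ambiguity behind the warning in this thread.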
I think we will try to create a smtpdiag.cloudron.io or something to test port 25 reachability.
-
@girish Hi Girish! I think it's a good idea to add in a Cloudron-controlled SMTP server for testing purposes. I would still suggest we have a two-check failure workflow to avoid false-positives like this, as that would be best practice in similar scenarios outside of Cloudron (like liveness probes in Kubernetes, which will generally work with multiple failure points to avoid false-positives). If it's too much work, I understand. I just think it would be really helpful for these types of scenarios, so I'd love to see health checks done in a way that avoids false-positives like this one.
-
@d19dotca Thing is, since we don't control external services, it's hard to tell why something failed. Did they blacklist the server IP? Was it because outbound port 25 is blocked? Was it because the service died temporarily or even permanently (like the case for this post)?
At least, when I wrote the code, I didn't expect these services to go away. By now, all but 2 services remain. We started with around 5 services, 5 years ago. Anyway, I have now deployed port25check.cloudron.io and the code from next release will use that to check connectivity. Since we don't blacklist there and will keep it running, we can be fairly certain that the VPS outbound port 25 is blocked. Let's see.
-
@girish said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
Thing is since we don't control external services, it's hard to tell why something failed.
For sure, but that's also why double-checking in the event of a failure would be best in order to avoid false-positives instead of one failure generating a ton of alerts. Logic that, after one failure, has Cloudron skip that server for a few hours would give any rate-limiting or blacklisting time to resolve on its own, and would avoid needing to wait for an entire new release to update the list of SMTP servers as they change, etc. If one fails but one succeeds, we automatically know port 25 outbound is not blocked.
I have now deployed port25check.cloudron.io and the code from next release will use that to check connectivity. Since we don't blacklist there and will keep it running, we can be fairly certain that the VPS outbound port 25 is blocked.
That's awesome and will add to the troubleshooting ability!! Happy to see that too.
Personally I'd still love to see redundancy in place, as there will certainly be the rare outage on your end too as with other services, but this will at least add a bit more under your control to help lessen the likelihood of false positives which is still a step in the right direction. If I'm banging the drum well past my allotted time on this then that's understandable as it certainly isn't major, just something I'd love to see improved further still. I'll let it go now.
Thanks for everything you do!
-
@d19dotca no worries. I think the issue is this: let's say we add another external service as a dependency and the connection does not work, what should we show in the UI? That outbound port 25 works or it does not? Is it useful to have messages like "We managed to connect to port25check.cloudron.io but not to smtp.live.com" (or any of those combinations)? I suspect users will come back with the same questions/confusion as they do now. At least, the code currently is written with the assumption that connectivity (or not) is a "reliable" indicator of outbound port 25. Maybe I misunderstood what you mean by redundancy.
-
@girish said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
what should we show in the UI? That outbound port 25 works or it does not? Is it useful to have messages like "We managed to connect to port25check.cloudron.io but not to smtp.live.com" (or any of those combinations).
Good question! I don't actually think anything should be shown in the UI if only one SMTP test fails out of two, as that scenario would imply a false-positive.
So what I envision is the following (hopefully this explains it better):
-- Cloudron runs periodic checks on one of several SMTP servers for testing purposes.
---- If the check succeeds, then wait for the next check at the 30-minute interval.
---- If the check fails, then run one more test right away (or even 60 seconds later to avoid network blips on the VPS) to a second/different SMTP server to validate the finding.
------ If the second SMTP server succeeds, then ignore the initial failure and mark as successful. Possibly make a log entry, but nothing needed in the UI.
------ If the second SMTP server fails, log the errors with more details (mention both SMTP servers that were checked and failed). In the UI, show a message similar to: Relay error: SMTP connection tests failed. Check if port 25 (outbound) is blocked. View the Cloudron logs for more details.
I don't really think the exact servers need to be listed in the UI if they're already in the logs. If both SMTP servers fail, it'll be with much higher confidence that port 25 outbound is blocked and that should be the admin's focus. If they can confirm that it's not blocked, then they can use the logs to get more details and run additional tests from their server.
That's how I picture it anyways. I see that as helping avoid false-positives while also providing enough details in the logs for when an issue is actually detected (and more confidently in that case too). The UI can be simplified in a small way to refer the admin to their logs for further details while still suggesting that port 25 may be blocked.
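The workflow proposed above can be sketched roughly as follows in Python. This is only an illustration of the idea, not how Cloudron implements its health check: the function name `check_outbound_port25`, the logger name, and the injected `probe` callable are all hypothetical, and the UI text is copied from the proposal.

```python
import logging
import time
from typing import Callable, Optional, Sequence

log = logging.getLogger("port25-healthcheck")  # hypothetical logger name

# UI wording taken from the proposal above.
UI_MESSAGE = (
    "Relay error: SMTP connection tests failed. "
    "Check if port 25 (outbound) is blocked. "
    "View the Cloudron logs for more details."
)


def check_outbound_port25(
    servers: Sequence[str],
    probe: Callable[[str], bool],
    retry_delay: float = 0.0,
) -> Optional[str]:
    """Return None when port 25 looks open, or UI_MESSAGE after two failures.

    `probe(host)` is any callable returning True when host:25 is reachable.
    """
    if len(servers) < 2:
        raise ValueError("need at least two distinct test servers")
    first, second = servers[0], servers[1]

    if probe(first):
        return None  # healthy; the caller waits for the next scheduled check

    # First failure: optionally pause to ride out a network blip, then
    # validate the finding against a different server.
    time.sleep(retry_delay)
    if probe(second):
        log.info("%s unreachable but %s reachable; ignoring failure", first, second)
        return None

    # Both servers failed: log the details and hand back the short UI text.
    log.error("outbound port 25 test failed for both %s and %s", first, second)
    return UI_MESSAGE
```

Scheduling the 30-minute interval is left to the caller, and passing `retry_delay=60` gives the 60-second pause mentioned in the proposal. Injecting `probe` keeps the decision logic separate from the network call, so it can be exercised without touching port 25.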
Side note: I just checked and the "troubleshooting" hyperlink at the bottom of the alert message overall leads to an incorrect spot. May need to be updated to perhaps https://docs.cloudron.io/email/#outbound-smtp or something like that.
-
@girish are we supposed to be doing anything with our email dashboards? I have a number of domains which are shown as red, but when I check some of them in that domain's status panel, everything shows as green.
Not too worried, just not sure what we should be doing.
-
@timconsidine yes, correct, nothing to worry here. It will be fixed in the upcoming update.
-
@girish https://port25check.cloudron.io/ produces an error, as the cert is only for api.cloudron.io
-
@roundhouse1924 that's expected. It's not a website and not meant to be connected to via http/https. It's only on port 25. You can try telnet port25check.cloudron.io 25.
-
Has anybody tried one of these delisting processes?
This one seems to be specifically for live.com and ...
https://support.microsoft.com/en-us/supportrequestform/8ad563e3-288e-2a61-8122-3ba03d6b8d75
https://sendersupport.olc.protection.outlook.com/snds/index.aspx
I did the first 2. The first is pretty quick: you receive an email and validate whether the IP of your server is in their internal block list.
The second is a form that is a little more elaborate: they ask for the error message and whether you have a website related to that domain.
-
Hi,
I have the same problem on my Cloudron right now:
Relay error: Connect to port25check.cloudron.io timed out. Check if port 25 (outbound) is blocked
Port 25 is not blocked.
-
Hi,
it does work on my server:
telnet port25check.cloudron.io 25
Trying 165.227.67.76...
Connected to api.cloudron.io.
Escape character is '^]'.
works
Connection closed by foreign host.
Is it important? I am using Postmark as a mail relay for all my outgoing mail, so I think it is not necessary to have port 25 open in general, because it is never used?
-
If you use a mail relay for all your domains, then this should not be relevant. I do wonder why it tests for it then and also why the check fails, since the code also just checks like that. Can you open the mail status tabs on all domains to see if this was just a temporary issue?
-
@nebulon said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
If you use a mail relay for all your domains, then this should not be relevant.
Thanks for clarification!
@nebulon said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
Can you open the mail status tabs on all domains to see if this was just a temporary issue?
I will check and let you know if I find something.
-
I had one domain using an external relay and having port 25 closed on the VPS. The above error was present, but disappeared when port 25 was opened.
So, the port 25 check seems to be unnecessary and confusing for domains that use external relays.
-
@RoundHouse1924 said in Email healthcheck notification: "Relay error: Connect to smtp.live.com timed out":
So, the port 25 check seems to be unnecessary and confusing for domains that use external relays.
The port 25 check is skipped for domains with a relay. If you find otherwise, please let us know, cause it's a bug. I just tested it with a relay and it is skipped.
-
@girish
The situation I described was with v7.3.6 when I had only one outgoing domain and it used an external relay.
Now with v7.4.1, I have 3 outgoing domains. One via the same external relay; the other 2 using the internal SMTP. Port 25 is open on the VPS and all 3 status lights are green.
So, in order to test your answer, I blocked outgoing Port 25 on the VPS firewall.
As expected, the 2 direct domains go red.
However, the external relay domain's Cloudron status page shows:-
MX record = Current value: [not set]
DMARC record = Current value: [not set]
SMTP Status Outbound SMTP (Relay) = Connection timeout
Looks to me that, with Port 25 closed, the SMTP check is made, but times out.
The puzzler is to know what could be causing the MX and DMARC record checks to fail --- just because Port 25 is closed.
EDIT:
With Port 25 closed, Uptime Kuma and Tiny Tiny RSS cannot do their stuff, so I've now reopened it.