Unusual Mail Issues

michaelpope

Hey all,

I was hoping some of you might have a bit of guidance on some email issues I've been having. If not, that's cool too though :).

I'm running about 300G of email on a Cloudron System. Currently, right now, I'm running into a case where sporadically, some IMAP and POP folder reads will time out. These aren't huge folders either - some only have 2-3 files. SMTP is not affected.

There's some extraneous situations which might be related to this. Two slightly problematic things happened over the past few weeks. (1) I ran an external backup disk out of storage - it isn't tied to the disk the mail is on in any way. (2) I had to Hard-stop the server on the VPS provider side of things because of a configuration issue.

I'm not entirely sure which of the two instances caused the issue, but it has been occurring ever since. I'm wondering if some files got corrupted or something.

One other thing I've noticed while doing this is that the mail docker instance is taking up a lot more memory. I've raised it from 500K to 1G to 2G, and sometimes it still even fills up that. I'm not sure if that's related or not - but the issue does occur whether it is full or not - I've had cases where the issue occurs when only using about 1G memory out of 2G allocated to it. It does tend to happen a bit more when it's full though.

When I've run a Cloudron Email Server in the past, it's not really had this issue of memory filling up, but they've also been for far less email (10G or so). Then again, those have all been sending email directly, not through SES.

Restarting the mail process fixes the issue until it happens again. However, it's very difficult to predict when the issue will happen again.

I've gone into the mail image inside and edited the dovecot config and turned on mail_deubg and auth_debug - I'll post some information from the logs when I get some information from this. EDIT: Not seen anything significant in logs. Will probably wait for another occurrence and see if I can spot any patterns.

I'm trying not to rely on the Cloudron Support team because I have not bought Premium yet (will probably do in the future, but I'm not quite ready to do so at the moment).

Have any of you seen issues like this before? If so, how have you proceeded towards fixing it?

girish

@michaelpope have you checked the mail logs already? It's a bit hidden, but Mail -> Logs (top right corner) -> Logs again in top right corner. This will open the raw logs of the mail server.

Now, do the problematic IMAP and POP reads. Do you see any error/crashes?

michaelpope

@girish Thanks.

Yeah, I have checked them, but unfortunately nothing really stands out.

Occasionally you get an error which seems to be related to [dkim_sign], that goes kinda like this:

2022-10-04T20:05:45.000Z [INFO] [-] [core] [outbound] Sending email as a transaction
2022-10-04T20:05:45.000Z [NOTICE] [-] [dkim_sign] skipped: no private key for domain_name_of_someone_else_here.com
2022-10-04T20:05:45.000Z [ERROR] [-] [core] missing key data for domain_name_of_someone_else_here.com
022-10-04T20:05:45.000Z [INFO] [-] [core] [outbound] Processing delivery for domain: insert_my_domain_name_here.com

It looks like the first and last message relate to an outgoing mail, but I'm not sure about the middle two. However, either way, it doesn't look like this is preventing anything as it's pretty commonly occurring (over 900 times).

I don't see any crashes though. I think that might be partly because SMTP continues to function well even when this occurs. So it's probably not an Auth issue I'm guessing.

I don't actually see any logs for accessing IMAP or POP3 at all in the log section you specified. Which is weird. Just outgoing and incoming.

I went and exported dovecot.log from the docker image (which seemed to have IMAP and POP3 logs I think) and was able to capture twice where dovecot restarted. One of them is preceeded by a few connections closing because they went idle, but... I think that's pretty normal for a mail server. I'll try and capture a few more logs tomorrow when the issue happens again and see if I can establish a pattern. It might also be a concurrent connection issue because one of the restarts had over 50 messages saying things like

Oct 04 21:32:50 imap(mail_box@insert_my_domain_here.com)<356><aOCBCjvqbfitxVGS>: Info: Server shutting down. in=3703 out=353254 deleted=0 expunged=0 trashed=0 hdr_count=4 hdr_bytes=2358 body_count=3 body_bytes=325302

I'll try and raise the limit in configuration and see if that does anything... although I doubt it will because we were not getting any concurrent connection error messages in our email clients.

doveadm logs error isn't listing anything... so nothing there.

The issue gets fixed (temporarily) when I get into the docker container using docker exec -ti mail /bin/bash and then run supervisorctl restart dovecot. However, a few hours later (usually when the server is rather busy... well not too busy, as it isn't stressing the server by any means, but busier than at dead times), the issue re-occurs. I am hesitant to write a script continuously restarting dovecot because... welll.. what if that's making the issue worse?

I'm thinking a little it might be tied to specific inboxes, because there are certain people it doesn't effect at all. However, I know of three that it absolutely does. So I might try migrating them back and forth and seeing if that resolves the issue.

Anyways, thanks for responding. Hope you have an awesome evening.

michaelpope

Update: No errors, even in the full dovecot logs with mail_debug turned on. However, I am noticing that this issue is occurring often when people log into a certain email account. As a result... I'm going to try and imapsync the data over to another account and imapsync it back... and see if this resolves the issue. May also try different email clients... perhaps this is only caused by certain email clients connecting.

girish

@michaelpope What clients are they using?

michaelpope

@girish Thunderbird, Outlook 2007, Outlook 365. Only receiving the issues on Outlook 2007 and Thunderbird, and only on some clients, so it seems to be tied to certain accounts.

I'm trying a few things to see if there are corrupt mails in inboxes though... this might be tied to that instead. I'll let you know.

michaelpope

Update.

Had some more issues early today even after the export, deletion, re-creation, and import of the mailboxes.

However... due to some circumstances, I had to power off the VPS. When I rebooted it, I reduced the mail process down to 1GB of memory (raising it to 3GB had caused the crash)... and have not had any issues since then. Crossing fingers it stays that way.

top is showing about 50% memory use, so only a little swap, which is fine. Hopefully I can get to no swap use in the future though.

However, the real test will be seeing what happens on Monday and Tuesday. If it's good on those days, I will be very happy :).

michaelpope

Some updates:

The issues is still happening despite the server reboot.

I was looking into disk issues, but I'm not so sure of that anymore.

Some things to note:

Sogo Webmail is still working when the issue occurs. This leads me to conclude a few things:

It means the issue is probably in 1 of 4 places:

It could be an issue in how port 9393 is forwarded over to 993 (or at least I'm assuming there is port forwarding from 9393 to 993 - I can check if this is the source the next time the issue occurs using Sogo to connect to both ports)
It could be an issue in how the mail is tied to the domain somehow. Sogo attempts to connect to the domain mail and it works fine. IMAP connects to my.insertdomainnamehere.com. (Once again, I can check this with Sogo the next time there is an issue).
There could be an issue in how dovecot is handling long lasting TCP connections (think the ones that go to IDLE). (don't know how I'd test this though tbh)
I suppose there could still be an issue in the underlying storage (an RDMA) that is causing issues - though I don't see why it wouldn't list any errors in the logs at that point.

It does mean that it's probably not an auth issue as Sogo is logging in fine

In the main time... probably gonna cronjob supervisorctl restart dovecot every few minutes. Won't be perfect but it restarts within a second, and is a lot quicker than restarting the entire mail process, so it probably won't be noticeable for end users.

@girish I was wondering... perhaps the 'mail' docker image is having an issue (I doubt this, but it's I guess a possibility). Is there a way to re-download the docker mail container for Cloudron without causing any issues? (should I just docker image pull it, or would that break things?)

michaelpope

An update on those possible issues.

Port 993 over SSL worked in SOGO, even when a desktop client was having issues.

This means that it is likely an issue with (3). I'm going to see if I can change dovecot.conf so dovecot kills long lasting connections quicker and see if that makes the problem go away or exacerbate it next.

michaelpope

Another update. Updated the 2007 Outlook Client to Office 365. Haven't had any issues with them so far, but need to wait to be sure. There's a small chance that this whole thing was actually two separate email client issues, one with Thunderbird, and one with Outlook.

Certain versions of Outlook actually have an issue where they don't respond to Dovecot's request to see if they are 'still there' and if an IDLE IMAP connection is closed, they will go offline automatically (https://wiki.dovecot.org/Timeouts and also see point 2 and 3 of the last post in https://social.technet.microsoft.com/Forums/exchange/en-US/b418d42a-ce66-4cfa-adb9-e6e24ec769f4/outlook-2007-error-your-imap-server-closed-the-connection). I'm thinking this might have been the issue. Not seeing this issue on Office 365 clients other users are running.

As for the Thunderbird issue... found one account of someone running into something similar here: https://www.mail-archive.com/dovecot%40dovecot.org/msg82805.html . Still gotta work on that one, and do some measurements to see if we are hitting the process limit when we have the issue. This would be a weird case though as we are only seeing this issue with some Thunderbird clients, despite all running the same version and with similar software on the computers... which again points to a mailbox level issue. Gotta do some more sleuthing.

michaelpope

Just wanted to give everyone an update.

Last week was pretty crazy for me, so I wasn't able to do much testing.

I will probably resume doing some more tests this week. In any case, restarting dovecot every hour does seem to be a suitable mitigation for anybody who has this issue (at least pending until a solution is found).

scooke

@michaelpope Keep up the good work. Updates like what you are providing are invaluable for those who come later with the same, or similar, issues.

michaelpope

Another update, a bit further down the line.

Here's what I discovered so far:

Some older Outlook clients just don't disconnect. As of such, they can sometimes run into issues where the server times out. Best way to fix this is to occasionally docker exec mail supervisorctl restart dovecot.
Modern Thunderbird clients might occasionally run into a situation where they can't connect to the server very consistently (except maybe on the Inbox). Best solution is to close Thunderbird and then run docker exec mail supervisorctl restart dovecot (as well). The client needs to be closed when you do this.

girish

@michaelpope this seems like maybe dovecot is unable to accept new connections and restarting it fixes it.

The next time you hit this: can you check the contents of /run/dovecot/dovecot.log in the mail container ? That should have some error as to why dovecot is not accepting new connections.

michaelpope

@girish Sure thing!

Cloudron makes it easy to run web apps like WordPress, Nextcloud, GitLab on your server. Find out more or install now.

Cloudron Forum

Unusual Mail Issues