Backup failing with "copy code: 1, signal: null": "cannot create hard link" "operation not permitted"
-
@d19dotca I guess this is the rsync backup? Merely deleting the backups won't suffice. Can you try quickly changing the backup method to disabled/no-op and then setting it up again? This will clear the cache of what Cloudron thinks has been backed up so far (i.e. the incremental backup state). I think after that you will see it take a full backup.
As for the original issue, this cp -al failure is the exact issue we had before, and it seems to keep popping up because of CIFS. I think @nebulon knows more about this, so I will let him comment.
-
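For context, cp -al is what builds the incremental snapshot: it creates a tree of hardlinks rather than copying file data. A minimal sketch of that behavior (the /tmp/demo paths are illustrative, not Cloudron's actual layout):

```shell
# Start clean, build a tiny source tree, then "copy" it the way the
# incremental backup does: -a preserves attributes, -l hardlinks
# files instead of duplicating their data.
rm -rf /tmp/demo
mkdir -p /tmp/demo/src
echo "hello" > /tmp/demo/src/mail.txt

cp -al /tmp/demo/src /tmp/demo/snapshot

# Both paths now refer to the same inode, so each shows a hardlink
# count of 2 and no extra data blocks were consumed.
stat -c '%h %i' /tmp/demo/src/mail.txt
stat -c '%h %i' /tmp/demo/snapshot/mail.txt
```

If the kernel refuses one of those link() calls, cp exits non-zero with exactly the "cannot create hard link ... Operation not permitted" message seen in the backup log.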
@girish Okay I can report that did the trick in terms of at least getting a proper full backup now, so I have at least one backup. It was 17 GB in size. Thanks for that suggestion.
My ext4 hard disk is not mounted with CIFS or NFS, so I'm not certain that's a factor in this case. Here is the relevant line in /etc/fstab:
/dev/vdc /cloudron-backups ext4 defaults 0 2
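As a sanity check that the kernel really mounted the backup disk as ext4 (and not something like CIFS via an overlooked mount), the live filesystem type can be queried with stat; / is used below as a stand-in for the actual mount point:

```shell
# Print the filesystem type name the kernel reports for the mount
# backing the given path; for /cloudron-backups this should read
# ext2/ext3 or ext4, never cifs or nfs.
stat -f -c '%T' /
```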
Since this worked fine after switching from no-op back to Filesystem (although I had also run rm -rf on the contents of the /cloudron-backups/ directory before that), I think this lends more evidence to this being an odd Cloudron issue (or at the very least a rare inconsistency/compatibility issue) rather than an infrastructure issue.
Thoughts?
-
@girish Yes, it seems all hardlinks were created properly this time. I'll let you know if it fails again, for sure. Definitely seems like an odd issue, and at first I assumed it was my infrastructure because of the other thread back in April, but at this point it seems more like a Cloudron issue. I can't quite get past the fact that this is always an issue with emails too; it's never a file in an app or anything like that, always email. Maybe a coincidence, since mail is also taking up most of the disk space, but still... just very strange to me. Hopefully this can be fixed in a future release once we know the root cause. Thanks again for the help getting me out of that jam.
-
It's happening again, I'm afraid. Just got notified a short bit ago of this.
Aug 06 13:46:42 box:shell copy (stdout): /bin/cp: cannot create hard link '/cloudron-backups/2020-08-06-204501-046/box_2020-08-06-204642-043_v5.4.1/mail/vmail/user@example.com/mail/.A - PC + FD Referrals INBOX/cur/1596744663.M967808P27952.6cc2d7e3254e,S=2825,W=2866:2,S' to '/cloudron-backups/snapshot/box/mail/vmail/user@example.com/mail/.A - PC + FD Referrals INBOX/cur/1596744663.M967808P27952.6cc2d7e3254e,S=2825,W=2866:2,S': Operation not permitted
Aug 06 13:46:50 box:shell copy code: 1, signal: null
If you think it's helpful to remote in or anything, just let me know and I can enable that option for you.
-
@girish Is there anything I need to do to help you guys out before I erase the backups again and start fresh? It's failed a second time now, so I suspect it will keep failing until I force-clear it again, but I don't want to take away anything that would help you narrow down the issue either.
-
Update: The backup was successful just 55 minutes ago, without me needing to do anything this time. Still very interested in getting this resolved, though, as I'm sure it will come back at some point; this has already happened several times this year.
-
@girish I just wanted to say that I'm still having this issue. It seems to happen very randomly: sometimes it doesn't occur for days or even a week or more, but then it might happen on two backup attempts in a row. It's very intermittent, but still an issue. Hoping we can get to the bottom of this at some point (considering backups are a pretty critical piece of the solution), but I totally understand this is a rare issue and so will likely be hard to nail down. If there's anything I can do to help you narrow it down, I'd be happy to.
-
@girish This continues to be an issue. Anything I can do to help narrow this one down? It's quite annoying when it fails to complete a backup every couple of days, or even multiple times in the same day. Thankfully all the apps seem to get backed up; it always fails while backing up emails, getting stuck on different email files each time.
-
@girish Never heard back on this, but I know you guys are super busy too. I'd love to work together to fix this issue, though, as it's been much more prevalent for me the past few months, and I suspect it coincides with the increase in email usage from my COVID-19 clinics, since they use email constantly. The email history on the Email tab used to span an hour or two of activity because there wasn't much going on, but now it covers maybe 3-5 minutes at any given time because those clinic mailboxes are in constant use. I suggest this possible relationship because every time the error occurs it's almost always (if not always) thrown on an email message file. It's not always the clinic's email account files, but it seems like this has become a much more frequent issue since email use skyrocketed on my server.
-
What I found so far is that the link operation is denied by the kernel. dmesg has the following lines:
[563683.439933] audit: type=1702 audit(1601496162.706:33): op=linkat ppid=5039 pid=10646 auid=4294967295 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001 egid=1001 sgid=1001 fsgid=1001 tty=(none) ses=4294967295 comm="cp" exe="/bin/cp" res=0
[563683.439938] audit: type=1302 audit(1601496162.706:34): item=0 name="/path/dovecot-uidlist.lock" inode=2246683 dev=fc:20 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
The type= values correspond to these entries in audit.h:
1702 /* Suspicious use of file links */
1302 /* Filename path information */
And according to https://access.redhat.com/articles/4409591, this is triggering for an ANOM_LINK event type. So far, there is little to no information on what all this means.
-
@girish This sounds to me like a resource or policy exhaustion issue, like when a ulimit is too low or we run out of inodes.
Is anything else running in the kernel, like SELinux?
Are we hitting limits on hardlinks with large enough backups? I believe the limit on ext4 is 65k.
It would be interesting to switch filesystems and see if it happens on xfs, for example.
Do you have an Object Store target option via S3?
-
I think the 65k is the maximum number of hardlinks to a single file, not the number of hardlinks on a filesystem.
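That per-file limit can be checked directly via pathconf with getconf; a quick check, assuming an ext4 volume (where the per-inode limit is 65000):

```shell
# LINK_MAX is the maximum number of hardlinks a single file may have
# on the filesystem containing the given path (65000 on ext4). It is
# a per-inode limit, not a filesystem-wide budget.
getconf LINK_MAX /
```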
The rabbit hole goes as deep as we want to
I think I found the problem, though of course I have to try it out. audit_log_path_denied here - https://elixir.bootlin.com/linux/latest/source/fs/namei.c#L955 - is where the audit log is raised. I am no kernel expert, but a casual reading of the comment "Allowed if owner and follower match" suggests that the owner of the file and the user creating the link do not match. The hardlinking process runs as user yellowtent.
root@my:/cloudron-backups/snapshot/box/mail# find . -user root -type f
./blah/blah/dovecot-uidlist.lock
./blah/blah/1600972556.M892349P24001.69d0c668883d,S=6124,W=6234:2,S
Bingo! For some reason, these 2 specific files are not owned by yellowtent; they are owned by root. Looks like some bug/race in the code that creates the snapshot. Curiously, both of the files above are 0 bytes, so maybe that's causing some strange event ordering.
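That would line up with the fs.protected_hardlinks sysctl (on by default since Linux 3.6), which makes the kernel refuse - and audit-log as ANOM_LINK - a hardlink to a file the caller neither owns nor can write. A sketch of how one might confirm the setting and clean up the stray root-owned snapshot files (the chown to yellowtent is an assumption about what the snapshot code expects, not an official remedy):

```shell
# fs.protected_hardlinks=1 means an unprivileged user (here: yellowtent)
# cannot hardlink a root-owned file it cannot write to, which matches
# the EPERM that cp -al reported.
cat /proc/sys/fs/protected_hardlinks

# Hypothetical cleanup: hand stray root-owned snapshot files back to
# yellowtent so the next incremental cp -al can link them. Guarded so
# it is a no-op on machines without this directory.
if [ -d /cloudron-backups/snapshot/box/mail ]; then
    find /cloudron-backups/snapshot/box/mail -user root -type f \
        -exec chown yellowtent:yellowtent {} +
fi
```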