nfs volumes gone until container reboot
-
If my NFS server goes offline, the NFS volumes in Cloudron are left in a failed state (even if you refresh the volume in the dashboard, files don't show up in the container) until the containers using them are restarted. Remounting the volume in the dashboard should fix any connection issues without manually restarting containers.
-
In this case, if you remount the NFS mount and then manually restart the app via the dashboard, does it work again, or do we need to actually recreate the container for the volume to work again?
Good question overall, though: what should happen if a mountpoint used as a volume in an app goes bad? I guess the apps using the volume could be stopped to avoid any strange behavior.
-
I guess this is the Linux behavior. Maybe a volume remount could restart the apps that use it?
-
@girish said in nfs volumes gone until container reboot:
I guess this is the Linux behavior. Maybe a volume remount could restart the apps that use it?
No, there is a more graceful way to recover NFS mounts, just need to find what it is in this combination stack.
-
@robi said in nfs volumes gone until container reboot:
No, there is a more graceful way to recover NFS mounts
What is it?
-
@girish I don't know what the code does on container start, and what it doesn't do when the connection drops (NFS server goes down, then comes back) [but I can guess, see below].
Is it NFS over TCP only, or UDP too? Sessions have a harder time recovering than session-less connections.
However, if you detect that disconnection and repeat the mount process when the server comes back, the volume + container + app won't need a restart (which is overkill).
This is highly dependent on the mount propagation options (private, slave, shared), which are explained here:
https://unix.stackexchange.com/questions/292999/mounting-a-nfs-directory-into-host-volume-that-is-shared-with-docker
I.e. private mounts seem to need a restart of the container to regain connectivity, while the others do not.
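To illustrate the difference, here is a sketch of how the propagation mode is chosen when a container is started (the paths and image name are made up, not from Cloudron):

```shell
# rprivate (Docker's default): a host-side remount of /mnt/nfs does NOT
# propagate into the container, so the container keeps the stale mount.
docker run -v /mnt/nfs:/data:rprivate myimage

# rslave: mount/unmount events on the host propagate into the container,
# so remounting the NFS share on the host restores /data without a restart.
docker run -v /mnt/nfs:/data:rslave myimage
```

This is a config fragment only; whether rslave actually helps depends on the host's own mount propagation setup, as discussed in the linked answer.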
-
I got to the bottom of this. The issue is that
mountpoint -q
doesn't exit if NFS hangs, and because we call execSync in node, the whole dashboard hangs. I have fixed this in https://git.cloudron.io/cloudron/box/-/commit/3da3ccedcbd8962ab309c591f6ae3561e5df4f28
The mount propagation flags like rslave don't apply to us; they only apply to recursive/shared mounts. That's my reading of it, anyway. It doesn't seem to matter in my tests.
-