Update to 7.6.1 failing
-
@eganonoa Quite strange. Searching around for this error turns up https://github.com/wallabag/docker/issues/290 , https://forums.docker.com/t/could-not-pull-image-caused-by-failed-to-register-layer-error-creating-overlay-mount-to-var-lib-docker-overlay2-too-many-levels-of-symbolic-links/123219/ etc., which show a similar error with no real resolution.
If you can drop us a mail at support@cloudron.io, I can try to debug. Maybe I can re-create the full docker storage and see if that helps.
-
@girish I found the same when trying to work this out. Seems like a totally random error, with the only solution seemingly to run
docker system prune
which I am hesitant to do as I don't know what will happen.
I've raised a ticket via the cloudron dashboard so that you can see my subscription info.
Many thanks.
-
@girish I think I understand what is happening here, so thought it best to write on this open thread as it may be "helpful" to others (helpful in quotations because there doesn't seem to be a nice solution!).
The problem appears to be that at some point, for whatever reason, the initial attempt to pull the cloudron/mail:3.11.2 image failed unexpectedly mid-pull (a system reboot perhaps). This seems to create a set of "dangling layers" related to that image that are hidden somewhere and cannot be found or removed with "docker system prune".
When I try to pull cloudron/mail:3.11.2 from the CLI, I see the following:
root@cloudron:/var/lib/docker# docker pull registry.docker.com/cloudron/mail:3.11.2
3.11.2: Pulling from cloudron/mail
445a6a12be2b: Already exists
4cfe0cdc770e: Already exists
e6a0eb1fa9b7: Already exists
e995e5b957f9: Already exists
e6d226089461: Already exists
b3243df2776e: Already exists
debd247c1af3: Already exists
ea1f575bfbef: Already exists
566e1eaf48e1: Already exists
68da526a8544: Already exists
2f3677647d18: Already exists
90984d402264: Already exists
802deede2955: Already exists
1861003a8fe7: Already exists
524cf22ec2b3: Already exists
7604fee16283: Already exists
930850c4bc07: Already exists
844343c15467: Already exists
367e21d14918: Already exists
6880f0889c4f: Already exists
5a544c4b0196: Already exists
7fb39aa7d081: Already exists
c10af7a3bade: Already exists
30c062ec98da: Already exists
956058cafb63: Already exists
e5c1b069b2dc: Already exists
27d50608d341: Already exists
d6d99d73528f: Already exists
35ad04d78685: Already exists
0fa0098bd9c2: Already exists
1289a176c743: Extracting [==================================================>] 690B/690B
bc9e5abd8c84: Download complete
8c8b3e2950c7: Download complete
025029896e5c: Download complete
fc053eff195d: Download complete
failed to register layer: error creating overlay mount to /var/lib/docker/overlay2/af6ccf9d7fbcd56d80cf2b84a6cedadc60e0f1615d06cf0947b140f244fb200d/merged: too many levels of symbolic links
As you can see, most layers are found as already existing, but when it hits layer 1289a176c743 it extracts it and then fails with the symbolic link error. No matter what I do, I cannot find and delete that layer.
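Since the error mentions symbolic links, one thing that could be checked is whether any links in docker's overlay2 storage no longer resolve. The sketch below is only a diagnostic under assumptions, not a fix: the function name is hypothetical, and it assumes the short-name links live in /var/lib/docker/overlay2/l, which is how the overlay2 driver normally lays things out. It only lists links and deletes nothing.

```shell
# Hypothetical helper: print every symlink directly inside a directory
# whose target cannot be resolved (missing target or a link loop).
list_unresolvable_links() {
  dir="$1"
  for entry in "$dir"/*; do
    # -L is true for a symlink itself; -e follows the link and fails
    # when the target is missing or cannot be resolved.
    if [ -L "$entry" ] && [ ! -e "$entry" ]; then
      printf '%s\n' "$entry"
    fi
  done
}
```

It could be called as list_unresolvable_links /var/lib/docker/overlay2/l (as root). An empty result would not rule out corruption; it would only mean no unresolvable links were found in that one directory.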
Others appear to have had this problem, with docker preventing future pulls where there were problems with the initial pull. See e.g. here, here, the current open issue, and here.
This docker problem is compounded with Cloudron. For other, more bespoke, systems you could just move to a different version of whatever it is you are pulling. But with Cloudron the whole package is delivered at once and as a complete package, so if one pull fails the entire update fails, leaving you stuck.
The only solution I can find is listed here, specifically
systemctl stop docker; rm -Rf /var/lib/docker; systemctl start docker
I don't like the sound of doing that! What would the consequences of that be for all my in-app configurations, etc.? Would this not be a fresh install?
The good news is that I can pull all other cloudron/mail images, including 3.11.3 and 3.11.4. So I imagine I can move around this issue if there were an update available that bundled any other version of cloudron/mail than 3.11.2. Is that possible or will my cloudron never allow me to access that update until version 7.6.1 is applied?
-
Maybe this one image layer, which is already cached on your system, is corrupted.
You can try to purge all fetched images which are not associated with any container with
docker image prune -a
This should make it redownload those image layers on the next update.
-
@nebulon Unfortunately, that doesn't work. Tried again just now, did it with and without docker restart and system restart, none worked.
From the various things I've found online it seems that when a pull is forcibly interrupted in some way that prevents docker from effectively clearing up the broken pull, docker is unable to find the broken fragments of the prior pull when you are trying to manually prune, but apparently can find it when trying another pull. From what I've seen this can happen for a number of reasons mid-pull, from a power outage, to a temporarily poor network connection, to a force shutdown of the host system.
And I've not seen anyone report being able to fix it other than to rm -Rf /var/lib/docker. The only reference to an actual fix I've been able to find comes from this issue in the Moby Project, which references a patch they've applied.
My assumption is that this hasn't been addressed, even though it looks on its face to be a major (and frankly ridiculous) problem and was raised as far back as 2017, because ultimately the issue "fixes" itself when a new version of whatever is being pulled from docker is released. This then allows whoever is experiencing the problem to simply move on keeping those broken fragments somewhere on their system doing nothing. So while the issue seems big, it isn't in practice.
What I'm worried about is that this will mean I'm stuck manually updating apps and not the cloudron instance until another version of cloudron is available because the whole update fails in perpetuity when a pull of a component app fails unexpectedly. And I can't remember whether cloudron will even show me that new version to update to until I've been able to update to 7.6.1 first.
-
One option to try is to stop docker and then move the image folder on the disk. Hopefully docker will then pull all images fresh when it is restarted and tries to bring the containers back up. If it doesn't, this may leave you with a broken system, so make sure you have backups that you can restore on a fresh system.
-
@nebulon That sounds horribly drastic! Whenever there is a later cloudron release (e.g. 7.6.2) or release candidate will I be able to update to that and skip 7.6.1 entirely? I'd probably rather wait than effectively going the "rm -Rf /var/lib/docker" route.
-
@eganonoa said in Update to 7.6.1 failing:
Whenever there is a later cloudron release (e.g. 7.6.2) or release candidate will I be able to update to that and skip 7.6.1 entirely?
There is already 7.6.2 and 7.6.3 but you can't skip versions.
-
@jdaviescoates said in Update to 7.6.1 failing:
@eganonoa said in Update to 7.6.1 failing:
Whenever there is a later cloudron release (e.g. 7.6.2) or release candidate will I be able to update to that and skip 7.6.1 entirely?
There is already 7.6.2 and 7.6.3 but you can't skip versions.
Ouch! So I'm stuck and my system is effectively broken.
-
@eganonoa good debugging. So, it looks like the docker storage is corrupt. Best is to re-create docker. To alleviate your fears a bit, it is totally safe to nuke /var/lib/docker. Cloudron is designed for immutable infrastructure and containers can be re-created without any loss of data.
Before you do anything: take a full backup. If possible, do a VM backup also.
To re-create docker, here's what you have to do:
- systemctl stop box - stops the box code.
- systemctl stop docker - annoyingly, docker might continue to run despite this because of socket activation and prevent the next step. See below.
- rm -rf /var/lib/docker - nukes the docker storage. This can fail, see below.
- mkdir /var/lib/docker - recreates the docker storage.
- systemctl start docker - starts docker. (If you disabled docker, see below, then systemctl enable docker as well.)
- docker network create --subnet=172.18.0.0/16 --ip-range=172.18.0.0/20 --gateway 172.18.0.1 cloudron --ipv6 --subnet=fd00:c107:d509::/64 - this creates the docker network used by cloudron.
- Edit the file /home/yellowtent/platformdata/INFRA_VERSION. At the top, there is a line like "version": "49.5.0". This version will be different for you. Just add one to the last part of this value. For example, 49.5.1. IMPORTANT: Don't change the first part, i.e. don't make it 50. Only increment the last part. This "hack" lets cloudron know that the version of the infrastructure has changed, and it will thus re-pull and re-create all containers.
- systemctl restart box - when the box code starts up, it will see that infra changed and re-create everything. (If you disabled box, see below, then systemctl enable box as well.) You can see logs with tail -f /home/yellowtent/platformdata/logs/box.log.
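For anyone following along, the steps above could be collected into a script roughly like this. It is only a sketch and not something shipped by Cloudron: the function names (bump_patch, recreate_docker_storage) are made up for illustration, the INFRA_VERSION edit is left as a manual step because the exact file layout is not shown in this thread, and nothing runs unless the function is called explicitly. Take the backup first.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the commands are the ones
# listed in the steps above.

# Print the version with only its last part incremented.
bump_patch() {
  version="$1"
  prefix="${version%.*}"   # everything before the last dot
  patch="${version##*.}"   # the last part
  printf '%s.%s\n' "$prefix" "$((patch + 1))"
}

# Run the re-creation steps in order, stopping at the first failure.
recreate_docker_storage() {
  systemctl stop box &&
  systemctl stop docker &&
  rm -rf /var/lib/docker &&
  mkdir /var/lib/docker &&
  systemctl start docker &&
  docker network create --subnet=172.18.0.0/16 --ip-range=172.18.0.0/20 \
    --gateway 172.18.0.1 cloudron --ipv6 --subnet=fd00:c107:d509::/64 &&
  echo "Now bump the last part of \"version\" in /home/yellowtent/platformdata/INFRA_VERSION, then run: systemctl restart box"
}
```

For example, bump_patch 49.5.0 prints 49.5.1, which is the value to write into INFRA_VERSION by hand before restarting box.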
The biggest annoyance is that sometimes
rm -rf /var/lib/docker
won't work. It will complain about 'Device or resource busy' etc. If this happens, disable the box code and docker using systemctl disable box and systemctl disable docker. Then you have to reboot the server (no other way). Once the server is back online, run rm -rf /var/lib/docker again and it should work.
As always, if this is too complicated, we can do this for you. Reach out on support@cloudron.io.
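A possible reason for the 'Device or resource busy' error is that something under /var/lib/docker is still mounted; that is an assumption, not something confirmed in this thread. A small sketch to list such mounts before retrying the rm follows. The function name is hypothetical, and it only reads the mount table without changing anything.

```shell
# Hypothetical helper: list mount points at or below a directory, reading a
# mount table in /proc/mounts format (second argument, default /proc/mounts).
list_mounts_under() {
  dir="$1"
  table="${2:-/proc/mounts}"
  # Field 2 of each line is the mount point; match the directory itself
  # or anything beneath it, but not siblings sharing the same prefix.
  awk -v d="$dir" '$2 == d || index($2, d "/") == 1 { print $2 }' "$table"
}
```

Calling list_mounts_under /var/lib/docker and getting no output would suggest nothing is mounted there any more; if entries remain, the disable-and-reboot route described above is the one reported to work.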
-
@girish I'm pleased to say that it worked. I had to disable docker and box and reboot, even though I didn't get the exact error you mentioned but something else related to creating the network (duplicate network). I also had to reboot after enabling and restarting box because otherwise the upgrade got stuck, I believe for a similar network creation problem. But once I did that everything worked well and I'm now happily sitting with version 7.6.3 after two subsequent updates. Many thanks for the help with this! Hopefully this thread will be useful for anyone else who might have this issue in future.
-