Cloudron Forum

Update to 7.6.1 failing

Solved | Support | updater
20 Posts, 4 Posters, 2.0k Views
eganonoa · #11

@girish I think I understand what is happening here, so I thought it best to write on this open thread as it may be "helpful" to others (helpful in quotes because there doesn't seem to be a nice solution!).

The problem appears to be that at some point, for whatever reason, an initial attempt to pull the cloudron/mail:3.11.2 image failed unexpectedly mid-pull (a system reboot, perhaps). This seems to have created a set of "dangling layers" related to that image that are hidden somewhere and cannot be found or removed with "docker system prune".
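As an aside, a read-only way to see how many layer directories actually sit on disk, dangling ones included, is to count them directly under docker's default storage path (the `layers` variable name is just for illustration):

```shell
# Read-only sketch: count layer directories on disk (dangling ones included)
# under docker's default overlay2 storage path.
docker_dir=/var/lib/docker/overlay2
if [ -d "$docker_dir" ]; then
    layers=$(ls "$docker_dir" 2>/dev/null | wc -l | tr -d ' ')
else
    layers=0
fi
echo "overlay2 layer directories on disk: $layers"
```

Comparing that count against what `docker system df` or `docker image ls` reports can hint at how much is unaccounted for; presumably the dangling layers show up on disk but not in docker's own accounting.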

    When I try to pull cloudron/mail:3.11.2 from the CLI, I see the following:

    root@cloudron:/var/lib/docker# docker pull registry.docker.com/cloudron/mail:3.11.2
    3.11.2: Pulling from cloudron/mail
    445a6a12be2b: Already exists 
    4cfe0cdc770e: Already exists 
    e6a0eb1fa9b7: Already exists 
    e995e5b957f9: Already exists 
    e6d226089461: Already exists 
    b3243df2776e: Already exists 
    debd247c1af3: Already exists 
    ea1f575bfbef: Already exists 
    566e1eaf48e1: Already exists 
    68da526a8544: Already exists 
    2f3677647d18: Already exists 
    90984d402264: Already exists 
    802deede2955: Already exists 
    1861003a8fe7: Already exists 
    524cf22ec2b3: Already exists 
    7604fee16283: Already exists 
    930850c4bc07: Already exists 
    844343c15467: Already exists 
    367e21d14918: Already exists 
    6880f0889c4f: Already exists 
    5a544c4b0196: Already exists 
    7fb39aa7d081: Already exists 
    c10af7a3bade: Already exists 
    30c062ec98da: Already exists 
    956058cafb63: Already exists 
    e5c1b069b2dc: Already exists 
    27d50608d341: Already exists 
    d6d99d73528f: Already exists 
    35ad04d78685: Already exists 
    0fa0098bd9c2: Already exists 
    1289a176c743: Extracting [==================================================>]     690B/690B
    bc9e5abd8c84: Download complete 
    8c8b3e2950c7: Download complete 
    025029896e5c: Download complete 
    fc053eff195d: Download complete 
    failed to register layer: error creating overlay mount to /var/lib/docker/overlay2/af6ccf9d7fbcd56d80cf2b84a6cedadc60e0f1615d06cf0947b140f244fb200d/merged: too many levels of symbolic links
    
    

As you can see, most layers are found to already exist, but when it hits layer 1289a176c743 it extracts it and then fails with the symbolic link error. No matter what I do, I cannot find and delete that layer.
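That "too many levels of symbolic links" message is the kernel's ELOOP error. The same class of failure can be reproduced with an ordinary symlink cycle, which is presumably what the corrupted layer directory effectively contains (a self-contained sketch in a temp directory, file names made up):

```shell
# Reproduce ELOOP with a plain symlink cycle in a temp directory.
dir=$(mktemp -d)
ln -s "$dir/a" "$dir/b"            # b -> a
ln -s "$dir/b" "$dir/a"            # a -> b, closing the loop
msg=$(cat "$dir/a" 2>&1 || true)   # any traversal now fails with ELOOP
echo "$msg"                        # e.g. cat: ...: Too many levels of symbolic links
```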

Others appear to have had this problem, with docker preventing future pulls after a broken initial pull. See e.g. here, here, and the current open issue here.

This docker problem is compounded on Cloudron. On other, more bespoke, systems you could just move to a different version of whatever you are pulling. But Cloudron is delivered as a complete package, so if one pull fails the entire update fails, leaving you stuck.

The only solution I can find is listed here, specifically:

    systemctl stop docker; rm -Rf /var/lib/docker; systemctl start docker
    

I don't like the sound of doing that! What would the consequences be for all my in-app configurations, etc.? Would this not amount to a fresh install?

The good news is that I can pull all other cloudron/mail images, including 3.11.3 and 3.11.4. So I imagine I could work around this issue if an update were available that bundled any version of cloudron/mail other than 3.11.2. Is that possible, or will my cloudron never show me that update until version 7.6.1 is applied?

nebulon (Staff) · #12

Maybe this one image layer, which is already cached on your system, is corrupted.
You can try to purge all fetched images which are not yet associated with a running container with docker image prune -a; this should make it re-download those image layers on the next update attempt.

eganonoa · #13

@nebulon Unfortunately, that doesn't work. I tried again just now, with and without a docker restart and a system restart; none worked.

From the various things I've found online, it seems that when a pull is forcibly interrupted in a way that prevents docker from cleaning up the broken pull, docker is unable to find the broken fragments of the prior pull when you try to prune manually, but apparently can find them when attempting another pull. From what I've seen this can happen for a number of reasons mid-pull, from a power outage, to a temporarily poor network connection, to a forced shutdown of the host system.

        And I've not seen anyone report being able to fix it other than to rm -Rf /var/lib/docker. The only reference to an actual fix I've been able to find comes from this issue in the Moby Project, which references a patch they've applied.

My assumption is that this hasn't been addressed, even though on its face it looks like a major (and frankly ridiculous) problem and was raised as far back as 2017, because ultimately the issue "fixes" itself when a new version of whatever is being pulled is released. Whoever is experiencing the problem can then simply move on, leaving those broken fragments sitting somewhere on their system doing nothing. So while the issue seems big, in practice it isn't.

What I'm worried about is that this means I'm stuck manually updating apps, but not the cloudron instance, until another version of cloudron is available, because the whole update fails in perpetuity when a pull of a component app fails unexpectedly. And I can't remember whether cloudron will even show me a newer version to update to before I've updated to 7.6.1 first.

nebulon (Staff) · #14

One option to try may be to stop docker and then move the image folder on disk. Hopefully docker will then pull all images fresh when it is restarted and tries to bring the containers back up. This may result in a broken system if it doesn't, though, so make sure to have backups you can restore onto a fresh system.

eganonoa · #15

@nebulon That sounds horribly drastic! When there is a later cloudron release (e.g. 7.6.2) or release candidate, will I be able to update to that and skip 7.6.1 entirely? I'd probably rather wait than effectively go the "rm -Rf /var/lib/docker" route.

jdaviescoates · #16

              @eganonoa said in Update to 7.6.1 failing:

              Whenever there is a later cloudron release (e.g. 7.6.2) or release candidate will I be able to update to that and skip 7.6.1 entirely?

              There is already 7.6.2 and 7.6.3 but you can't skip versions.

              I use Cloudron with Gandi & Hetzner

eganonoa · #17

                @jdaviescoates said in Update to 7.6.1 failing:

                @eganonoa said in Update to 7.6.1 failing:

                Whenever there is a later cloudron release (e.g. 7.6.2) or release candidate will I be able to update to that and skip 7.6.1 entirely?

                There is already 7.6.2 and 7.6.3 but you can't skip versions.

                Ouch! So I'm stuck and my system is effectively broken.

girish (Staff) · #18

@eganonoa good debugging. So it looks like docker storage is corrupt. Best is to re-create it. To alleviate your fears a bit: it is totally safe to nuke /var/lib/docker. Cloudron is designed for immutable infrastructure, and containers can be re-created without any loss of data.

Before you do anything: take a full backup. If possible, take a VM backup as well.

                  To re-create docker, here's what you have to do:

• systemctl stop box - stops the box code.
• systemctl stop docker - annoyingly, docker might continue to run despite this because of socket activation, preventing the next step; see below.
• rm -rf /var/lib/docker - nukes the docker storage. This can fail; see below.
• mkdir /var/lib/docker - recreates the docker storage.
• systemctl start docker - starts docker. (If you disabled docker, see below, then systemctl enable docker as well.)
• docker network create --subnet=172.18.0.0/16 --ip-range=172.18.0.0/20 --gateway 172.18.0.1 cloudron --ipv6 --subnet=fd00:c107:d509::/64 - this creates the docker network used by cloudron.
• Edit the file /home/yellowtent/platformdata/INFRA_VERSION. At the top there is a line like "version": "49.5.0" (the version will be different for you). Just add one to the last part of this value, for example 49.5.1. IMPORTANT: don't change the first part, i.e. don't make it 50; only increment the last part. This "hack" lets cloudron know that the infrastructure version has changed, so it will re-pull and re-create all containers.
• systemctl restart box - when the box code starts up, it will see that the infra changed and re-create everything. (If you disabled box, see below, then systemctl enable box.) You can watch the logs with tail -f /home/yellowtent/platformdata/logs/box.log.
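The INFRA_VERSION edit above can also be scripted. A hedged sketch that bumps only the last (patch) component, run here against a temp file with stand-in contents rather than the real /home/yellowtent/platformdata/INFRA_VERSION:

```shell
# Stand-in for INFRA_VERSION; we work on a temp copy so nothing real is touched.
tmp=$(mktemp)
printf '{\n  "version": "49.5.0"\n}\n' > "$tmp"

# Increment only the last dotted component of the "version" value
# (49.5.0 -> 49.5.1), never the first part.
bumped=$(awk -F'"' '/"version"/ {
    n = split($4, v, ".")
    v[n] = v[n] + 1
    s = v[1]; for (i = 2; i <= n; i++) s = s "." v[i]
    sub(/"[0-9.]+"/, "\"" s "\"")
}
{ print }' "$tmp")
echo "$bumped"   # "version" is now "49.5.1"
```

For a real run, point it at the actual file (after the full backup mentioned above) and write the result back.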

The biggest annoyance is that sometimes rm -rf /var/lib/docker won't work; it will complain about 'Device or resource busy' or similar. If this happens, disable the box code and docker using systemctl disable box and systemctl disable docker. Then you have to reboot the server (there is no other way). Once the server is back online, rm -rf /var/lib/docker should work.
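Since the 'Device or resource busy' failure comes from overlay mounts that are still active, a read-only pre-check before attempting the rm can be sketched like this (safe to run anywhere; /proc/mounts is Linux-specific):

```shell
# Check /proc/mounts for live overlay filesystems; these are what make
# rm -rf /var/lib/docker fail with 'Device or resource busy'.
if grep -q ' overlay ' /proc/mounts 2>/dev/null; then
    verdict="active overlay mounts found: stop/disable box and docker, then reboot"
else
    verdict="no active overlay mounts: rm -rf should go through"
fi
echo "$verdict"
```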

                  As always, if this is too complicated, we can do this for you. Reach out on support@cloudron.io .

eganonoa · #19

                    @girish Thanks. That is reassuring. I just found your response to my ticket in my spam folder for some reason. I'll see if I can work this out and will report back.

eganonoa · #20

@girish I'm pleased to say that it worked. I had to disable docker and box and reboot, even though I didn't get the exact error you mentioned but something else related to creating the network (a duplicate network). I also had to reboot after enabling and restarting box, because otherwise the upgrade got stuck, I believe due to a similar network-creation problem. But once I did that, everything worked well, and I'm now happily sitting on version 7.6.3 after two subsequent updates. Many thanks for the help with this! Hopefully this thread will be useful for anyone else who hits this issue in future.

eganonoa has marked this topic as solved.
nebulon referenced this topic.