Cloudron Forum

Server crashes caused by stopped app's runner container stuck in restart loop

Solved | Support | Tags: domains, cron
17 Posts, 5 Posters
  • mendoksai (#1, last edited by joseph)

    A domain expired for one of my apps. I stopped the app via the Cloudron dashboard. However, the runner container remained in "Created" state and kept trying to join the network namespace of the stopped app container, causing cascading failures:

    1. Runner repeatedly fails with: Cannot restart container <appid>-runner: cannot join network namespace of container: Container <id> is restarting, wait until the container is running
    2. This eventually causes Docker DNS resolution failures (internal Docker DNS timeouts)
    3. Host MySQL becomes unreachable (ECONNREFUSED 127.0.0.1:3306)
    4. SSH stops accepting connections
    5. Server becomes completely unresponsive, requiring hard reboot

    This has been happening daily for the past week.

    What I did

    • Stopped the app via Cloudron dashboard → runner remained in "Created" state
    • docker rm -f <appid>-runner removed the stuck runner
    • Main container shows "Exited (0)" and redis addon is still running — both untouched
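
    For reference, the manual cleanup was roughly this (a sketch; <appid> stands for the real app id shown by docker ps):

    # list the app's containers and their current states
    docker ps -a --filter "name=<appid>" --format "table {{.Names}}\t{{.Status}}"

    # remove the runner stuck in "Created" state; -f because it never started cleanly
    docker rm -f <appid>-runner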

    Questions

    1. Will Cloudron's scheduler recreate the runner container for a stopped app? If so, how do I prevent this?
    2. Is there a proper way to fully stop an app including its runner when the domain has expired?
    3. Should I also stop the redis addon container for this app?

    Relevant box.log pattern (repeating every 15-60 min):

    box:scheduler could not run task runner: (HTTP code 500) server error - Cannot restart container <appid>-runner: cannot join network namespace of container
    

    Also seeing on every boot:

    Error: listen EADDRNOTAVAIL: address not available 172.18.0.1:3003
    
    cloudron-support --troubleshoot
    Vendor: System manufacturer Product: System Product Name
    Linux: 6.8.0-106-generic
    Ubuntu: noble 24.04
    Execution environment: none
    none
    Processor: Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz
    BIOS Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz       To Be Filled By O.E.M. CPU @ 3.4GHz x 8
    RAM: 32796076KB
    
    Disk: /dev/sda3       909G
    [OK]    node version is correct
    [OK]    IPv6 is enabled and public IPv6 address is working
    [OK]    docker is running
    [OK]    docker version is correct
    [OK]    MySQL is running
    [OK]    netplan is good
    [OK]    DNS is resolving via systemd-resolved
    [OK]    unbound is running
    [OK]    nginx is running
    [OK]    dashboard cert is valid
    [OK]    dashboard is reachable via loopback
    [FAIL]  Database migrations are pending. Last migration in DB: /20260217120000-mailPasswords-create-table.js. Last migration file: /package.json.
            Please run 'cloudron-support --apply-db-migrations' to apply the migrations.
    [OK]    Service 'mysql' is running and healthy
    [OK]    Service 'postgresql' is running and healthy
    [OK]    Service 'mongodb' is running and healthy
    [OK]    Service 'mail' is running and healthy
    [OK]    Service 'graphite' is running and healthy
    [OK]    Service 'sftp' is running and healthy
    [OK]    box v9.1.3 is running
    [OK]    Dashboard is reachable via domain name
    [OK]    Domain  is valid and has not expired
    
  • mendoksai (#2)

      Update: Confirmed that Cloudron recreates the runner container on every boot, even though the app is stopped via the dashboard.

      After each reboot:

      • Main container: Exited (0) ✓
      • Runner container: Created ← this is the problem
      • Redis addon: Up ← also still running

      The runner in "Created" state triggers the scheduler loop → "cannot join network namespace" errors every 15-60 min → eventually cascading into Docker DNS failure → MySQL unreachable → full server lockup.

      I've been manually removing the runner with docker rm -f <appid>-runner after each reboot, but this is not sustainable.

      Is there a way to prevent the scheduler from recreating the runner for a stopped app? Or should I uninstall the app entirely to stop this cycle? The app's domain has expired but I'd like to keep the data for when I renew it.

      Thanks for any guidance.

  • girish, Staff (#3)

        @mendoksai the container getting created is not a problem. The container is created but not run for stopped apps (i.e. docker container create vs docker container run). The issue is also not related to domains (and their expiry).

        I haven't been able to reproduce this issue though.

        I think the issue is actually that the box code is unable to control docker. Or maybe docker is not running commands properly. For example, Container <id> is restarting, wait until the container is running. This already indicates the stopped app is in an incorrect state; the container should be in the stopped state. Are there any errors in journalctl -u docker -fa? The rest of the errors, like redis not stopping and the cron container error, are all the same issue of docker not running containers properly.

        Error: listen EADDRNOTAVAIL: address not available 172.18.0.1:3003 is similar. Docker is supposed to create the cloudron network with that gateway IP, so I can't see how it can be unavailable.
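
        A rough way to double-check that the cloudron bridge and its gateway actually exist (a sketch; adjust the address if your subnet differs):

        # subnet and gateway Docker has recorded for the cloudron network
        docker network inspect cloudron --format '{{range .IPAM.Config}}{{.Subnet}} gw={{.Gateway}}{{end}}'

        # is the gateway address actually assigned to a host interface?
        ip -br addr | grep 172.18.0.1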

        Can you give more information on your environment? Are other apps running properly?

  • mendoksai (#4)

          Thanks @girish for looking into this.

          You're right — this isn't just about the stopped app. After collecting detailed logs, I found multiple containers in incorrect states on every boot:

          <appid-1>-runner         Created          (stopped app - Lychee)
          <appid-2>                Restarting (1)   (Mattermost)
          <appid-3>                Restarting (1)   (Kimai)
          

          The Mattermost container is the main culprit. On boot, it tries to connect to MySQL before it's ready, fails, and enters an infinite restart loop:

          error: Failed to ping DB  error="dial tcp 172.18.30.1:3306: connect: connection refused"
          Error: failed to initialize platform: cannot create store: error setting up connections
          

          This restart loop seems to degrade Docker networking over time. The Docker journal shows a clear cascade:

          1. Boot → Mattermost enters restart loop (MySQL not ready yet)
          2. Docker resolver starts failing — first external DNS timeouts, then internal (172.18.0.1:53)
          3. Error: listen EADDRNOTAVAIL: address not available 172.18.0.1:3003 on every boot
          4. Eventually host MySQL becomes unreachable → full server lockup

          For journalctl -u docker, there are no explicit error-level entries from Docker daemon itself — only info level "ignoring event" / "cleaning up dead shim" messages repeating every 5 minutes for the same container, plus error level DNS timeout entries from the resolver.
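
           For what it's worth, the restart loop is easy to confirm from the Docker side with something like this (just a sketch):

           # containers currently stuck restarting
           docker ps --filter "status=restarting" --format "table {{.Names}}\t{{.Status}}"

           # restart count and last exit code for a given container
           docker inspect --format '{{.RestartCount}} restarts, last exit code {{.State.ExitCode}}' <container>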

          I've stopped both Mattermost and the Lychee runner for now. Will monitor.

          Environment details:

          • Cloudron 9.1.3
          • Ubuntu 24.04.4 LTS, Kernel 6.8.0-106-generic
          • Dedicated Server: 8 CPUs, 32GB RAM
          • ~35 containers on the cloudron network
          • Docker: Cgroup Driver: cgroupfs, Cgroup Version: 2
          • Hardware check by Hetzner: all clean (CPU, disks, NIC)
          • Issue started ~3 weeks ago, persisted through kernel 5.15 → 6.8 upgrade

          Happy to provide SSH access or full logs if needed.

  • mendoksai (#5)

            Quick update — I just noticed cloudron-support --troubleshoot was reporting:

            [FAIL] Database migrations are pending. Last migration in DB: /20260217120000-mailPasswords-create-table.js
            

            This migration has been pending since Feb 17 — which is exactly when the instability started. I missed this earlier. Just applied it:

            cloudron-support --apply-db-migrations
            [OK] Database migrations applied successfully
            

            I've also stopped the Mattermost container that was in a restart loop (it was failing to connect to MySQL on boot and never recovering).

            Will monitor for the next few days and report back. Fingers crossed this was the missing piece.


  • joseph, Staff (#6, in reply to mendoksai)

              @mendoksai said:

              Quick update — I just noticed cloudron-support --troubleshoot was reporting:

              [FAIL] Database migrations are pending. Last migration in DB: /20260217120000-mailPasswords-create-table.js

              This is a bug in the tool and not a real problem. It's fixed in 9.1.5.

  • mendoksai (#7)

                Happened again. This keeps happening every few days. 😕

  • nebulon, Staff (#8)

                  Have you by any chance made any custom modifications to your Ubuntu system, such as applying apt updates?

  • mendoksai (#9)

                    Yes, I followed your upgrade docs, as you recommend now that the old Ubuntu version is no longer supported, and this problem has been happening ever since. It just happened again, right now. Twice today.

  • mendoksai (#10)

                      @nebulon Yes, here's the full timeline of changes:

                      1. Server was stable on Ubuntu 20.04 + kernel 5.4 for months
                      2. Upgraded to Ubuntu 22.04 + kernel 5.15 (following Cloudron upgrade docs) — instability started
                      3. Upgraded to Ubuntu 24.04 + kernel 6.8 (following Cloudron upgrade docs) — issue persists
                      4. Installed fail2ban and smartmontools via apt
                      5. No other custom modifications

                      All upgrades were done following the official Cloudron documentation. The crashes happen on both kernel 5.15 and 6.8, so it doesn't seem kernel-specific.

                      One thing that may be relevant: Docker is using cgroupfs driver with cgroup v2. The Cloudron systemd unit explicitly sets --exec-opt native.cgroupdriver=cgroupfs. Could there be a compatibility issue with Ubuntu 24.04's default cgroup v2?
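
                       For reference, this is how I'm reading the current setup (a quick check, not a fix):

                       # cgroup driver and cgroup version as Docker reports them
                       docker info --format 'driver={{.CgroupDriver}} version={{.CgroupVersion}}'

                       # what the host mounts at /sys/fs/cgroup (cgroup2fs on a pure cgroup v2 host)
                       stat -fc %T /sys/fs/cgroup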

                      The server just crashed again twice in one hour. Happy to provide SSH access if that would help debug this. This is urgent as my mail server runs on this machine.

  • mendoksai (#11)

                        Update: I renewed the expired domain and the app (Lychee) is now running properly. No containers in restart loop currently. The earlier crashes today were likely caused by the runner container still being in a stale state from before the domain renewal.

                        I have a cron job cleaning up zombie runners every 5 minutes, which seems to be working (log shows it removed 5 runners since setup).
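
                         The cleanup job is nothing fancy; it is roughly along these lines (a sketch, and note that it removes any runner still sitting in "Created", so it is only safe while the affected app stays stopped):

                         # /etc/cron.d/cloudron-runner-cleanup (sketch)
                         */5 * * * * root docker ps -aq --filter "name=-runner" --filter "status=created" | xargs -r docker rm -f >> /var/log/runner-cleanup.log 2>&1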

                        Will monitor for the next few days and report back. If it stays stable, I'll mark this as resolved.

                        Thank you @girish @nebulon @joseph for your help!

  • mendoksai (#12)

                          @girish @nebulon Server crashed again last night. But this time the pattern is different — no containers in restart loop, no runner issues. The cron cleanup job is working. All containers were stable (Up 11 hours) before the crash.

                          The Docker journal shows the DNS resolver dying on its own:

                          23:38 - External DNS timeouts begin (185.12.64.2)
                          23:57 - Internal Docker DNS fails (172.18.0.1:53 i/o timeout)
                          23:59 - [resolver] connect failed: dial tcp 172.18.0.1:53: i/o timeout
                          00:xx - Server becomes unresponsive
                          

                          There's also a container (different ID each time) producing "ignoring event" / "cleaning up dead shim" messages every minute — not sure if related.

                          This happens roughly at the same time every night (~23:00-00:00 UTC). All previous fixes applied (no restart loops, domain renewed, hardware clean). I'm running out of ideas on my end.
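
                           In case it helps, the resolver can be probed from a throwaway container on the same network (a sketch; the hostname is just an example):

                           # query Docker's embedded DNS (127.0.0.11 inside every container)
                           docker run --rm --network cloudron alpine nslookup cloudron.io 127.0.0.11

                           # and the gateway resolver that is timing out in the logs
                           docker run --rm --network cloudron alpine nslookup cloudron.io 172.18.0.1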

                          Would it be possible to get SSH-level support to debug this? I can provide access anytime. This is really urgent as it's been impacting my mail service daily for weeks now.

                          Thank you.


  • joseph, Staff (#13, in reply to mendoksai)

                            @mendoksai yes, write to me at support@cloudron.io . I can investigate.

  • mendoksai (#14)

                              Server was stable for 14 days after I fixed the DNS configuration myself. The original daily crash issue was resolved.

                              This morning I received Cloudron's security reboot email. Rebooted via dashboard. Server never came back. Ping responds, SSH returns kex_exchange_identification: Connection reset by peer. Hard reset via Hetzner Robot didn't help either.

                              So now I'm locked out of my own server because of an automatic security update that I didn't ask for and had no control over. My mail server is down, again.

                              I have to ask: is anyone actually testing these updates before pushing them? Every major issue I've had in the past two months has been triggered by an automatic update or upgrade. The previous instability started after a Cloudron update in February. Now this.

                              I need:

                              1. Help getting my server back online — I'll likely need to use Hetzner rescue mode
                              2. A way to permanently disable automatic security updates so I can apply them manually at a time that works for me
                              3. Some assurance that updates are being properly tested before being pushed to production servers

                              This is a production server running critical mail services. I can't keep being the QA tester for untested updates.
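
                               (For what it's worth, the stock Ubuntu mechanism behind these is unattended-upgrades; turning it off at the apt level would look roughly like the below, though I don't know whether Cloudron manages or re-enables this itself, so treat it as a sketch rather than a supported setting.)

                               # /etc/apt/apt.conf.d/20auto-upgrades
                               APT::Periodic::Update-Package-Lists "1";
                               APT::Periodic::Unattended-Upgrade "0";

                               # or disable the service outright
                               systemctl disable --now unattended-upgrades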

                              Are you guys vibe coding?

  • james, Staff (#15)

                                Hello @mendoksai

                                Sorry to read that you are having issues.
                                This sounds like something went wrong with the unattended security updates.

                                @mendoksai said:

                                because of an automatic security update that I didn't ask for and had no control over

                                This is the default for every Cloudron installation.
                                Even if you don't want it now, you would want it the moment your server gets compromised because apt security updates were not applied.


                                While the server is up, even though you cannot connect over SSH, try opening the console from the Hetzner dashboard and see what is written on the terminal.
                                First we need to understand what the issue is before tinkering with the system.

  • mendoksai (#16)

                                  I used rescue mode and saw that the issue was with the kernel. I pinned a working kernel version and disabled Docker and Cloudron so I could connect again. Further investigation suggested the RAID controller had a problem, so I asked Hetzner to check it; they replaced the RAID controller and things went back to normal. So I think the issue really came from the RAID controller, and the timing with the upgrade and reboot just gave the wrong impression, although I'm not 100% sure. Anyway, it seems okay for now. You may close this ticket. Thank you.

                                  • james has marked this topic as solved
  • james, Staff (#17)

                                    Hello @mendoksai
                                    That checks out with our other topic.
                                    A RAID failure can cause such a catastrophic issue.

