

    Cloudron Forum


    Automatically repair app when the HealthCheck goes down (Not Responding)

    Feature Requests
    health monitoring
    • Lonkle
      Lonkle last edited by Lonkle

      I think there should be an option for a single "automatic repair" of an app as soon as it shows up as "Not Responding" (what do you have to lose at that point, really?). It should call the /repair endpoint rather than the /restart endpoint in your almost-fully-documented REST API. 😉 In my experience /repair almost never fails, and if this feature runs automatically in the background, it doesn't really matter how long the app takes to come back; /repair simply has a much higher likelihood of succeeding. So a hard reset (/repair) is better than a soft reset (/restart) if you do it just once, as soon as the app goes down.

      • fbartels
        fbartels App Dev @Lonkle last edited by fbartels

        @Lonk such functionality would need some parameters to work within, like "needs to be unresponsive for x checks" and "only try restarting y times before giving up".

        Supervisor can actually already help in some of these cases; I think if one of its processes fails, it tries restarting it.
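
The two parameters mentioned above ("unresponsive for x checks", "only try y times") could be sketched roughly as follows. This is only an illustration: the class name `RepairPolicy`, its field names, and the default values are hypothetical and are not taken from Cloudron or Supervisor.

```python
from dataclasses import dataclass


@dataclass
class RepairPolicy:
    """Decide when an automatic repair attempt should be triggered."""

    failures_before_action: int = 3  # "x checks" (hypothetical default)
    max_attempts: int = 1            # "y times" (hypothetical default)
    consecutive_failures: int = 0
    attempts: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one health check result; return True if a repair should run now."""
        if healthy:
            # A passing check ends the outage and resets both counters.
            self.consecutive_failures = 0
            self.attempts = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_before_action:
            return False
        if self.attempts >= self.max_attempts:
            return False  # give up and leave it to the admin
        self.attempts += 1
        # Count from zero again so a further attempt also waits for x failures.
        self.consecutive_failures = 0
        return True
```

With `failures_before_action=3` and `max_attempts=2`, the third and sixth consecutive failed checks trigger an attempt, and later failures do nothing until a passing check resets the policy.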

        • mehdi
          mehdi App Dev last edited by

          I do not understand what you are proposing here. Should apps implement these /repair and /restart API endpoints? If so, how are they expected to respond if they are already unresponsive by that point? Or is it supposed to be on the platform? Or do they already exist and you are proposing a change of behaviour? I am completely lost here ^^

          • girish
            girish Staff last edited by

            I would ideally like to remove Cloudron's healthcheck field and replace it with Docker's own HEALTHCHECK (https://github.com/moby/moby/pull/22719). When we started out, that feature didn't exist in docker and maybe it replaces what Cloudron does internally. Once we do that, we can get automatic restarts etc from upstream docker. Even though I note that https://github.com/moby/moby/issues/28400 is open for over 2 years now.

            • robi
              robi last edited by

              Perhaps upvote this or add additional comments there to make it happen.

              Life of Advanced Technology

              • Lonkle
                Lonkle @fbartels last edited by

                @fbartels said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                @Lonk such functionality would need some parameters to work within, like "needs to be unresponsive for x checks" and "only try restarting y times before giving up".

                box already works in a similar fashion. It waits a good few minutes and fails something like 30 health checks before the app is labeled unresponsive, which makes that a perfect opportunity to use the /repair endpoint on the app: you literally have nothing to lose, the app isn't responding, and the undocumented /repair API endpoint has solved way more issues than a simple restart.

                This is all within the Dashboard and box system. And my ideal number of tries is one: if one /repair doesn't fix it, it's unlikely a second will.

                • Lonkle
                  Lonkle @mehdi last edited by

                  @mehdi said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                  I do not understand what you are proposing here. Should apps implement these /repair and /restart API endpoints? If so, how are they expected to respond if they are already unresponsive by that point? Or is it supposed to be on the platform? Or do they already exist and you are proposing a change of behaviour? I am completely lost here ^^

                  Box / The dashboard would /repair the app. It's an undocumented Cloudron endpoint and always fixes any issue I have with apps, unlike simply restarting them. Right now, apps can stop responding to the health check and, within 5 to 10 minutes, Cloudron labels them "Not Responding".

                  I'm saying: why wouldn't the system try a /repair at that point? There's nothing to lose and a working app to gain.

                  • Lonkle
                    Lonkle @girish last edited by Lonkle

                    @girish said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                    I would ideally like to remove Cloudron's healthcheck field and replace it with Docker's own HEALTHCHECK (https://github.com/moby/moby/pull/22719). When we started out, that feature didn't exist in docker and maybe it replaces what Cloudron does internally. Once we do that, we can get automatic restarts etc from upstream docker. Even though I note that https://github.com/moby/moby/issues/28400 is open for over 2 years now.

                    Completely agree. Ideally we'd be using Docker HEALTHCHECKs, but until then this seems like a single line of code: if an app becomes unresponsive, then /repair it. Maybe it'll become responsive again, maybe it won't - but at least your system tried to fix it automatically before notifying a human, who has to take the /repair step manually anyway (and who knows what time it is).

                    • Lonkle
                      Lonkle @girish last edited by

                      @girish said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                      Even though I note that https://github.com/moby/moby/issues/28400 is open for over 2 years now.

                      You made a home-grown health check before they did. So you could make a home-grown /repair that runs once after "UNHEALTHY" (or "Not Responding" in the Dashboard sense). What do we have to lose by doing so, especially given it's just a single endpoint already exposed in your API? This would be a quick option to add and could be an extra benefit of your home-grown healthcheck even before you switch to Docker's internal one.

                      • robi
                        robi @Lonkle last edited by

                        @Lonk @girish I had the same thought: if Docker now reports the correct status, it's easy to grab that status from docker ps and restart the container, like the guy does in cron.
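
As a rough sketch of that idea, the snippet below lists containers that Docker itself reports as unhealthy and restarts them. It relies on the `health` filter of `docker ps`, so it only sees containers that define a Docker HEALTHCHECK; it knows nothing about Cloudron's own health check. The function names and the injectable `runner` parameter are choices made for this example so the logic can be tried without Docker installed.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def unhealthy_containers(runner: Runner = run) -> List[str]:
    """Names of running containers whose Docker health status is 'unhealthy'."""
    out = runner(["docker", "ps", "--filter", "health=unhealthy",
                  "--format", "{{.Names}}"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def restart_unhealthy(runner: Runner = run) -> List[str]:
    """Restart every unhealthy container and return the names restarted."""
    names = unhealthy_containers(runner)
    for name in names:
        runner(["docker", "restart", name])
    return names
```

Calling `restart_unhealthy()` from a small script scheduled by cron would give the behaviour described, with the caveat that it restarts blindly and records nothing about why the container was unhealthy.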

                        Life of Advanced Technology

                        • fbartels
                          fbartels App Dev @girish last edited by

                          @girish said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                          I would ideally like to remove Cloudron's healthcheck field and replace it with Docker's own HEALTHCHECK

                          +1 for that

                          • mehdi
                            mehdi App Dev @Lonkle last edited by

                            @Lonk said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                            Box / The dashboard would /repair the app. It's an undocumented Cloudron endpoint and always fixes any issue I have with apps, unlike simply restarting them.

                            But what does this /repair do? I am not clear on how an endpoint can magically repair an app in all cases.

                            • Lonkle
                              Lonkle @mehdi last edited by

                              @mehdi said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                              @Lonk said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                              Box / The dashboard would /repair the app. It's an undocumented Cloudron endpoint and always fixes any issue I have with apps, unlike simply restarting them.

                              But what does this /repair do? I am not clear on how an endpoint can magically repair an app in all cases.

                              It destroys the container and rebuilds it (it's undocumented but exists). If there's something wrong with the container, this will fix it. If there's something wrong with NGINX, this will fix it.

                              Restarting the container can help, but only about 10% of the time, whereas the repair endpoint fixes a "Not Responding" app for me about 90% of the time.

                              • nebulon
                                nebulon Staff last edited by

                                The idea is for the repair to be done consciously and, ideally, to be monitored by the admin. So I don't think it is good to just run it for good measure.

                                I think it would be more important to get down to why things have failed and then see how to prevent that instead. This may mean we have to add better logging or status reporting.

                                • Lonkle
                                  Lonkle @nebulon last edited by Lonkle

                                  @nebulon said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                                  The idea is for the repair to be done consciously and, ideally, to be monitored by the admin. So I don't think it is good to just run it for good measure.

                                  I think it would be more important to get down to why things have failed and then see how to prevent that instead. This may mean we have to add better logging or status reporting.

                                  The reason I brought this up in the forum is because I was going to make a script that checked the health status of all my apps and /repair them when they switch to Not Responding, which is about 5 minutes after they actually go down.

                                  Then I realized: what's the healthcheck for, if not to react when an app goes down? I'm not saying don't log it in notifications. If it's a one-off thing, then it didn't require human intervention to get working again (which could take a while if you infrequently monitor the Dashboard).

                                  So, give us the info, sure, not a bad idea - but automatically trying to repair the app first, and noting in notifications that it was successfully repaired after being down, is a better solution. Having this be a manual step so that the admin "knows about it" is silly if it's a random one-off thing, and irrelevant if they get a notification about it. If the /repair endpoint works, then all the admin has to do is review their notifications. It could be run a single time to keep the site up and notify the admin, and that sounds best IMO. Better logging, for sure, I can agree with that. But notifying + auto-repair (just one try) and then notifying whether that repair was a success is better. Much better. It's the reason healthchecks exist - the native Docker now has "restart on failed healthcheck" (as does supervisor) - but you guys can take it beyond that with a /repair.

                                  I can still write my script. I can even keep it on the same Cloudron instance, monitoring the status of all the apps and do a /repair on one if it "stops responding". If it starts responding cool, I'll check the logs as soon as I can to make sure it doesn't happen again. If it doesn't start responding after that, I'll find the problem and fix it.
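
The core of such a script, a single repair attempt per outage, might look like the sketch below. How the app list is fetched and how the repair is triggered are deliberately left to the caller, since the thread describes the /repair endpoint as undocumented; `sweep`, `tried`, and the `repair` callback are hypothetical names used only for this example.

```python
from typing import Callable, Iterable, List, Set, Tuple


def sweep(
    apps: Iterable[Tuple[str, bool]],
    tried: Set[str],
    repair: Callable[[str], None],
) -> List[str]:
    """Repair each non-responding app at most once per outage.

    apps   -- (app_id, responding) pairs from whatever status source is used
    tried  -- ids already repaired during their current outage (mutated here)
    repair -- callback that triggers the repair for one app id
    """
    repaired = []
    for app_id, responding in apps:
        if responding:
            # The app is back: forget the earlier attempt for a future outage.
            tried.discard(app_id)
        elif app_id not in tried:
            tried.add(app_id)
            repair(app_id)
            repaired.append(app_id)
    return repaired
```

Run on a timer with the same `tried` set each time, this repairs an app the first time it is seen as not responding and then leaves it alone until it has recovered at least once.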

                                  I just don't get the stance "we could automate the first troubleshooting step that every admin will take to get their site up and running again, but we don't think it's a good idea because we want them to intentionally take that step." My question to that is: why, when you could be losing precious uptime?

                                  • Lonkle
                                    Lonkle last edited by

                                    It took me a while to find that /repair endpoint and I think it's brilliant. Not even the built-in Docker auto-restart is as thorough. So why keep a site down when it could automatically be up again? That's the point of HEALTHCHECKs IMO.

                                    • Lonkle
                                      Lonkle last edited by Lonkle

                                      Cloudron already restarts an app when it starts using too much memory. Cloudron could just stop the app using too much memory and have the admin notified that it stopped the app so they can start it back up manually (which, of course, sounds ridiculous - and I don't see the difference between this and that). It doesn't do that though, it restarts it, why? I'd say to keep the app running.

                                      So not even attempting a single /repair when an app stops responding, just so that the admin can do it by hand, feels needlessly manual and doesn't keep the app running when it could be (say the admin is the only admin and they're on vacation for a week; this feature could save them).

                                      • nebulon
                                        nebulon Staff last edited by

                                        I still think we should not hide issues, which is what may happen if they are fixed by running repair. This is a bit like putting a bandage over an issue lurking in the dark. Generally, if repair does indeed fix the issue, then this indicates either problems in the app package or in the runtime management of apps in the platform, or maybe external dependencies that should be handled better (like DNS setup).

                                        This is a bit like the Docker issues we encounter every now and then. Sometimes a Docker restart fixes them, but the conclusion should probably not be to restart Docker for good measure whenever an app fails.

                                        Maybe it's just me, but I haven't experienced many systems which work well in a self-healing manner, and I am afraid of hiding real bugs through this. To me this conversation should be about the issue which triggered your thinking of automatically running repair in the first place.

                                        • Lonkle
                                          Lonkle @nebulon last edited by Lonkle

                                          @nebulon said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                                          1. My goal is not to hide anything from the admin, just to give the apps higher uptime by cutting out a human step. I don't see why this can't be handled the same way as running out of memory. Post a notification on the Dashboard and display the error:

                                          "App was not responding due to x error - Cloudron automatically repaired the app and it's running, but please look at the logs here to see what might have happened to cause this." (The log screen should show the last bit of the log file before the app stopped responding.)

                                          2. For me personally, this has only happened once. A Wordpress blog that wasn't even in use (no plugins yet, aside from the pre-installed ones) stopped responding and a restart didn't work. A "repair" did though. I still don't know what was wrong but it never went down after that (it's been a month now). So this feature is more of a comfort for me, to know that edge case would be covered when I'm on "vacation for a week."

                                          3. As a developer, I don't want "band-aids" - I want to fix the problem. But if your notification system gives me the date, time, and logs of when it had to /repair the app, then I can still do that with virtually only 5 minutes of downtime for my app. The alternative is me, a human, getting around to checking the logs and trying to figure out what happened, while I can't repair without adding more to the log. So a snapshot of the log plus a single automatic attempt to "repair" the app seems like a better flow for user and developer alike. I, of course, don't mind an app stopping since I only use Cloudron for development - but I'm thinking more about your everyday users. If people get this proposed repair notification feature, then their app doesn't stay down, which is good - and it still lets them post the logs and the app on the forums (mine was Wordpress Unmanaged, and like I said, it only happened once).

                                          4. As @fbartels mentioned, supervisor does things like this. And as @girish mentioned, the latest Docker itself now does this. So, this is definitely a valid suggestion. Why would all of these other app managers do this if it weren't useful? We can make it even more useful by snapshotting the logs and adding a notification saying that something may need to be fixed even though an auto-repair was successful (then click on the snapshotted logs and dig in).
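
The flow proposed in this list (snapshot the log, try one repair, notify either way) could be sketched like this. Everything here is hypothetical: `repair_and_report`, its callbacks, and the notification fields are invented for illustration and do not describe Cloudron's notification system.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def tail(lines: Iterable[str], n: int) -> List[str]:
    """Return the last n lines without keeping the whole log in memory."""
    return list(deque(lines, maxlen=n))


def repair_and_report(
    app_id: str,
    read_log: Callable[[str], Iterable[str]],
    repair: Callable[[str], None],
    is_responding: Callable[[str], bool],
    n: int = 50,
) -> Dict[str, object]:
    """Snapshot the log, attempt one repair, and build a notification."""
    # Take the snapshot first: the repair itself would add to the log.
    snapshot = tail(read_log(app_id), n)
    repair(app_id)
    repaired = is_responding(app_id)
    if repaired:
        message = f"{app_id} was not responding and was repaired automatically."
    else:
        message = f"{app_id} is not responding and one automatic repair did not help."
    return {"app": app_id, "repaired": repaired,
            "message": message, "log_before_repair": snapshot}
```

Because the snapshot is taken before the repair, the admin still gets the lines leading up to the outage whether or not the repair worked, which speaks to the concern about hiding the root cause.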

                                          • ruihildt
                                            ruihildt last edited by ruihildt

                                            Today, after the update to latest cloudron version, I had between 15-20 apps in a failed state.

                                            Clicking the repair fixed every one of them. And I had a client send me a message about downtime. I didn't look more into it, having spent at least 30 minutes going into each not responding app settings individually, clicking repair and waiting for it to successfully get back online. (And I wish there was a button to repair all apps at once^^)

                                            As much as I agree with not band-aiding issues, for my client and my reputation I wish the non-responding apps had been repaired automatically and the errors reported.

                                            Between repairing an app and starting an asynchronous debugging session on the forum, I'll always click on the repair button first.

                                            If repairing is detrimental to bug solving/reporting, I would suggest adding a place where the logs/errors of non-responding apps can be retrieved in one click.

                                            • Lonkle
                                              Lonkle last edited by

                                              Note 1: Even though it's only happened to me once organically, I would feel at peace knowing there's at least one extra precaution, one attempt, to keep a production app up.

                                              Note 2: The reason I often encounter this on purpose is that when I update the code of the VPN Client app I'm working on, it breaks the containers of any apps that were connected to it, and those apps take 8 minutes and then go to "Not responding..." status in the dashboard. Their containers need to be re-created, and /repair does that while /restart does not, so I consider /repair to be more thorough when attempting to bring back an app that went down (again, just one attempt) for any reason. @girish and I will make certain that when the OpenVPN Client on the app store gets updated, any apps connected to it get "re-connected" properly. But if I hadn't noticed that bug and we had done a v1.1 release of the OpenVPN Client app, people's apps could have gone down and they'd have had no way to know why. With that /repair protection in place, it's likely neither they nor their users would have noticed the 8 minute downtime.

                                              Of course, as the developer of this app, I would eventually notice the behavior and fix it. That's why I'm working with @girish on box code to make sure this doesn't happen.

                                              But I just explained to you a real case that could have happened and it helps users and doesn't hurt developers. All pros, no cons. So, you tell me why a user wouldn't want that added protection (but they should still be notified it happened like I said above)? Cause I understand your belief system, but when I see all pros with no cons, I have to fight my case.

                                              • Lonkle
                                                Lonkle @ruihildt last edited by Lonkle

                                                @ruihildt said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                                                Today, after the update to latest cloudron version, I had between 15-20 apps in a failed state.

                                                Clicking the repair fixed every one of them. And I had a client send me a message about downtime. I didn't look more into it, having spent at least 30 minutes going into each not responding app settings individually, clicking repair and waiting for it to successfully get back online. (And I wish there was a button to repair all apps at once^^)

                                                As much as I agree with not band-aiding issues, for my client and my reputation I wish the non-responding apps had been repaired automatically and the errors reported.

                                                Between repairing an app and starting an asynchronous debugging session on the forum, I'll always click on the repair button first.

                                                If repairing is detrimental to bug solving/reporting, I would suggest adding a place where the logs/errors of non-responding apps can be retrieved in one click.

                                                Nevermind, ignore my potential "use case" for adding this protection because it was theoretical and instead @ruihildt's is real (we literally posted at the same time and the cases were oddly similar) and I think it should be heavily considered and talked about with @girish and yourself ( @nebulon ). This is a real problem that occurred with one of your users in a production environment recently and this protection I'm proposing would have made the issue transparent to the user and his client. But us developers will eventually catch all the bugs as we build out better unit tests and whatnot. User experience is my number one priority, and nothing says it better than @ruihildt's testimony.

                                                • Lonkle
                                                  Lonkle @girish last edited by Lonkle

                                                  This would be a precaution added until @girish is ready to add what he wrote below (I don't know if dockerode has it which would be a huge difference in difficulty in adding that as a feature):

                                                  @girish said in Automatically "/repair" app when the HealthCheck goes down (Not Responding):

                                                  I would ideally like to remove Cloudron's healthcheck field and replace it with Docker's own HEALTHCHECK (https://github.com/moby/moby/pull/22719). When we started out, that feature didn't exist in docker and maybe it replaces what Cloudron does internally. Once we do that, we can get automatic restarts etc from upstream docker. Even though I note that https://github.com/moby/moby/issues/28400 is open for over 2 years now.

                                                  • nebulon
                                                    nebulon Staff last edited by

                                                    @ruihildt ideally we would know what the issue was if it affected so many apps. Do you still know the error? If you do maybe we can discuss that in a new forum thread to fix what caused it for the next update.

                                                    • d19dotca
                                                      d19dotca last edited by

                                                      I think this is a great conversation, and great points from different points of view. For what it's worth, my two cents is that the health check should indeed do more than just simply log it, it should take a predetermined action. This is how other health checks work on different platforms and I think is the general expectation of users coming to Cloudron from managing a Docker Swarm cluster, or using Kubernetes, etc.

                                                      I think the appropriate measure would be for health checks to be logged as they already are, plus a new option for an admin to set an action to be taken whenever a health check state goes to "Error", for example automatically restarting the node. I've personally never encountered an issue where an app in "Error" state isn't fixed by a simple reboot of the app. And even if it weren't, and the app ended up in a reboot loop, it wasn't working anyway, so not much has really changed; it can only help, not harm.

                                                      --
                                                      Dustin Dauncey
                                                      www.d19.ca

                                                      • Lonkle
                                                        Lonkle @nebulon last edited by Lonkle

                                                        @nebulon And that's what you should do, of course, but it would be a lot easier for him to do if he had gotten a notification with the portion of the log from when the app went down, and had the app automatically repaired (both the logs shortly before the repair and shortly after would be delivered with the notification). He could get you that info at any time (if a notification system like I proposed is created). It's far too late for that now, which leaves him less able to report what happened than if we had agreed in this thread on where we should go in terms of reporting to devs and pre-determined actions. I argue pre-determined actions are the point of a healthcheck but, hmm. 🤔

                                                        This is an ideal you and I fundamentally disagree on, and no matter what, you have the final say. But I feel my proposal meets in the middle: it makes it easier for users to report to the devs when apps go down, and it increases uptime.

                                                        That was me taking what you wanted (accountability and log reporting), and what I wanted (higher uptime, even if it is only a 0.0001% difference, it'll make me and other users feel safe in that way). I'm a developer, but user experience is key - so I always put myself in their shoes while debating the way to make these kinds of infrastructure changes.

                                                        Basically, @nebulon, what I'm asking you is: doesn't my proposal not only solve both of our problems at once, but actually increase the ability of users to do what you wanted them to do in the end anyway (hit repair manually and report the log), as long as we do it in the right way, as has been discussed?

                                                        • Lonkle
                                                          Lonkle last edited by Lonkle

                                                          I would also like to point out that @girish said in this thread that this is going to be a feature down the road as Cloudron gets more Docker features. So, it's already a Docker feature, and they must have also discussed this. I wonder if I can find the ticket for their discussion on this very subject. But thanks @robi for finding and posting this, I thought it was a good read about this whole issue:

                                                          https://github.com/moby/moby/issues/28400#issuecomment-712510304

                                                          • girish
                                                            girish Staff last edited by girish

                                                            @nebulon probably didn't communicate properly but it's not about implementing the feature or how it can be implemented. He is saying that we need to understand the root cause of apps not responding in the first place. Once we understand the problem, we can think of the correct solution.

                                                            We have 5-6 Cloudrons ourselves and essentially never have to repair/restart apps. In fact, this whole repair stuff was only added some releases ago and we thought even that was uncommon 🙂

                                                            Anyway, this thread exists so people can tell us if apps are not responding often. Like things stop working every day? every week? every month? Depending on the various experiences, we can try to figure out how to solve it. For example, if you had to restart like once a month, it's already not a priority with respect to our massive back log. So far, I have noted two, let's hear from more users. But we made Cloudron so one doesn't have to deal with all this stuff about apps going up and down.

                                                            • Lonkle
                                                              Lonkle @girish last edited by Lonkle

                                                              @girish Like I said, it happened to me only once between updates so the system is pretty stable. I outlined a theoretical situation where it would be needed. Then another user mentioned an important use case for simply adding a single endpoint to reaction to a Not Responding status dashboard function.

                                                              But you yourself said Docker supports this, and multiple people have expressed that customizable actions based on "status changes" of apps are beneficial. I even outlined how to make @nebulon's and your main issue with this work even better while still increasing uptime. Increasing uptime for users was important enough for Docker to implement it themselves with their healthchecks. Even supervisor does it.

                                                              The notifications can literally say "We restarted your app x, if you notice this happening often with this app - click this to send the log directly to the developers."

                                                              Only @mehdi appears opposed to this, though I couldn't tell. I'd say it's almost unanimous, so why wouldn't we compromise to make everyone happy: app devs, Cloudron developers (you and @nebulon), and especially users?

                                                              But the point you guys made (yourself and @nebulon) is valid; it's just outweighed by using the notification system you already have in place, which makes it much easier for users to report. That mitigates your concerns while simultaneously increasing user reporting and app uptime. Win-win - does anyone disagree with that?

                                                              Note: I agree with @girish that we should hear from more users and their opinions btw.

                                                              mehdi 1 Reply Last reply Reply Quote 0
                                                              • Lonkle
                                                                Lonkle last edited by Lonkle

                                                                Also, @girish, I wasn't saying that /repair wasn't overkill. I was saying: why not use the overkill option that already exists, and that users claim fixes more issues than /restart, to try to fix something as bad as literal website downtime? What if @ruihildt had been on vacation, Cloudron auto-updated (as it does), and that caused the need for a manual repair of all of those apps (he said 20 per installation)? Yes, a dumb admin user shouldn't exist, but what if he didn't know what to do and @ruihildt had to be interrupted to fix these apps, one by one? He could instead have clicked the "Report to Developers" button in his Notification Center after his vacation, when he sees them. He's more likely to report, you're more likely to have the data you want, and he's a happier user. And the rest of the users feel comfortable with that one more protection against downtime.

                                                                Btw, the one time it happened to me was after a Cloudron auto-update, with the WordPress Unmanaged app. So there could be a common factor here. But the fact is that we don't know it, and this solution accounts for this and other potential situations that cause downtime for users. It's just very important to consider for developers and users alike. But I think we've all made it a point to consider this, so maybe we'll revisit it later.

                                                                1 Reply Last reply Reply Quote 0
                                                                • mehdi
                                                                  mehdi App Dev @Lonkle last edited by

                                                                  @Lonk said in Automatically repair app when the HealthCheck goes down (Not Responding):

                                                                  Only @mehdi appears opposed to this, though I couldn't tell. I'd say it's almost unanimous, so why wouldn't we compromise to make everyone happy: app devs, Cloudron developers (you and @nebulon), and especially users?

                                                                  Nope, not opposed to it! I just did not understand what /repair does, more precisely what the difference was between it and /restart 🙂

                                                                  To be clear, I'm not opposed, but I am not lobbying for implementing this either. I've literally never had to restart / repair an app to fix it => I don't really have an opinion on the matter.

                                                                  Lonkle 1 Reply Last reply Reply Quote 1
                                                                  • Lonkle
                                                                    Lonkle @mehdi last edited by Lonkle

                                                                    @mehdi That makes perfect sense. Since I've encountered it personally, it makes me a little more opinionated on the resilience and uptime of the apps in an app manager. I'm only proposing a single repair as soon as an app's status moves to "Failed" or "Not responding...". If it doesn't work, it doesn't work. If it does, then we only have uptime to gain here.
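
To make the "single repair on a status change" idea concrete, here is a minimal Python sketch. It is not Cloudron code: `SingleRepairPolicy` and the injected `repair` callable are hypothetical names, and the status strings are simply the ones quoted in the post. The only rule it encodes is "repair once when an app moves into a down status, and do nothing while it stays down".

```python
from typing import Callable, Dict

# Statuses treated as "down" in this sketch (taken from the post above).
DOWN_STATUSES = {"Failed", "Not responding"}


class SingleRepairPolicy:
    """Attempt at most one repair each time an app transitions into a down status."""

    def __init__(self, repair: Callable[[str], None]) -> None:
        self._repair = repair              # hypothetical hook that repairs one app
        self._last: Dict[str, str] = {}    # last observed status per app id

    def observe(self, app_id: str, status: str) -> bool:
        """Record a status; return True if this observation triggered a repair."""
        previous = self._last.get(app_id)
        self._last[app_id] = status
        # Only the transition healthy/unknown -> down counts, so an app that
        # stays down (or flips between two down statuses) is not repaired again.
        went_down = status in DOWN_STATUSES and previous not in DOWN_STATUSES
        if went_down:
            self._repair(app_id)
        return went_down
```

If the repair does not bring the app back, the policy stays quiet until the app has been seen healthy again, which matches "if it doesn't work, it doesn't work".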

                                                                    1 Reply Last reply Reply Quote 0
                                                                    • robi
                                                                      robi last edited by

                                                                      It sounds like we need a multi-step approach.

                                                                      First, improve logging, capturing and surfacing of the actual errors in general, per cloudron. Notifications that an app went down then came back up are next to useless.

                                                                      Second, add more resilience in an escalating manner, and at different timeout lengths and counts. Maybe restart first and if that doesn't do it, repair.

                                                                      Third, have a way to get telemetry for the Cloudron team from all running Cloudrons, so they have an auto generated view of which apps need more attention because of triggered restarts, repairs and errors.

                                                                      Less chaff/noise, more signal/automation. That's what we love about Cloudron.
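
The escalating second step could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the class name `EscalationPolicy`, the thresholds, and the idea of counting consecutive failed health checks are not taken from Cloudron. It only shows the ladder "wait, then restart, then repair, then give up".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationPolicy:
    """Decide what to do after each health check; all thresholds are illustrative."""
    restart_after: int = 3   # consecutive failed checks before a restart
    repair_after: int = 6    # consecutive failed checks before a repair
    failures: int = 0        # current run of consecutive failed checks

    def on_check(self, healthy: bool) -> Optional[str]:
        """Return "restart", "repair", or None for this check result."""
        if healthy:
            self.failures = 0          # any healthy check resets the ladder
            return None
        self.failures += 1
        if self.failures == self.restart_after:
            return "restart"
        if self.failures == self.repair_after:
            return "repair"
        return None                    # keep waiting, or give up after the repair
```

Each action fires exactly once per outage, so the policy cannot loop: after the repair step it returns `None` until a healthy check resets the counter.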

                                                                      Life of Advanced Technology

                                                                      1 Reply Last reply Reply Quote 3
                                                                      • Lonkle
                                                                        Lonkle last edited by

                                                                        @robi Exactly, I want to add automation to this without losing what @girish and @nebulon want, a compromise where we all win.

                                                                        I really liked where you were going with that analytics of apps thing. If an app restarts more than once a day, in a week we can send a report to the admin.

                                                                        I'm advocating we treat a suddenly unresponsive app in the same way we treat an app that takes up too much memory. We restart it, and the process might fix itself.

                                                                        Girish did say when we update to the latest Cloudron we'll eventually get this anyway. I'm just saying, why wait a year? Why not instead hash out the fundamental disagreements between developers and users so that everyone can benefit? I liked where @fbartels was going with it in the beginning of this thread, and I like how the analytics you bring up could help send the admin maybe a weekly notification that their app went down a few times, so it may be "unhealthy" and need to be looked into. Something like that. I think notifications have their place, I'm just not 100% sure how, which is why I like this thread, to hash it out.
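
As a rough illustration of the weekly "unhealthy app" report, the Python sketch below flags apps that restarted more than once on any single day in the last week. The function name `flaky_apps`, the event format (app id plus timestamp), and the thresholds are assumptions for the example, not an existing Cloudron feature.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def flaky_apps(events: Iterable[Tuple[str, datetime]], now: datetime,
               max_per_day: int = 1, window_days: int = 7) -> Dict[str, List[str]]:
    """Map app id -> ISO dates in the window with more than max_per_day restarts."""
    cutoff = now - timedelta(days=window_days)
    # Count restarts per (app, calendar day), ignoring events outside the window.
    per_day = Counter((app, ts.date()) for app, ts in events if cutoff <= ts <= now)
    report: Dict[str, List[str]] = {}
    for (app, day), count in sorted(per_day.items()):
        if count > max_per_day:
            report.setdefault(app, []).append(day.isoformat())
    return report
```

An empty result means there is nothing worth notifying the admin about that week.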

                                                                        1 Reply Last reply Reply Quote 0
                                                                        • Lonkle
                                                                          Lonkle last edited by

                                                                          Just read this:

                                                                          https://github.com/moby/moby/issues/28400#issuecomment-713457999

                                                                          And liked this quote: “Ideological answers are not very useful when people ask for pragmatic solutions to real-life problems.” Although that person is fighting for configurable options on healthcheck fail. I just want a single repair.

                                                                          Because the healthcheck isn’t any more useful than UptimeRobot otherwise. And you handle running out of RAM the same way (by restarting and notifying the user it happened).

                                                                          I still want to find a solution where users, app devs, and the main developers all can compromise and figure out a solution to fit our needs / wants on this platform.

                                                                          1 Reply Last reply Reply Quote 0
                                                                          • nebulon
                                                                            nebulon Staff last edited by

                                                                            As @girish mentioned, the repair was added as a last resort, and it nearly always just covers up a real underlying issue, which should be tackled to avoid repair runs in the future. Given that those issues are currently not well understood, essentially what it does is tear everything down and start the app fresh. It may be Docker issues or other things. If we knew which logs or hints we should attach, we would do that, but it usually isn't that trivial.

                                                                            So for future reference, if you hit a situation needing a repair, please copy the error shown for the last task (visible in the repair UI), download the app logs, and ideally do some basic investigation on the server, like running cloudron-support and saving the resulting link. Do all this before hitting repair, since by using it errors might be obfuscated and hard to find afterwards. Also, if multiple apps are in this state, see if there is some correlation between the errors. For example, it could just be that the system got overloaded temporarily after a reboot or such. In many cases there are solutions we can build into the platform, but we have to first understand the underlying issue.
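
For the "see if there is some correlation between the errors" step, a small Python sketch: given the last-task error copied from each affected app, group the apps that share the same message. The function name `group_by_error` and the sample error strings used with it are invented for illustration; real Cloudron task errors will look different and may need smarter normalisation than trimming and lower-casing.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_error(last_errors: Dict[str, str]) -> Dict[str, List[str]]:
    """Group app ids by their normalised last-task error; keep only shared errors."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for app_id, message in last_errors.items():
        # Naive normalisation so trivially different copies still match.
        groups[message.strip().lower()].append(app_id)
    # An error seen on a single app says nothing about correlation, so drop it.
    return {msg: sorted(apps) for msg, apps in groups.items() if len(apps) > 1}
```

Several apps sharing one error right after a reboot would point towards a platform-level cause (for example temporary overload) rather than a problem in each app.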

                                                                            Lonkle ruihildt 2 Replies Last reply Reply Quote 2
                                                                            • Lonkle
                                                                              Lonkle @nebulon last edited by Lonkle

                                                                              @nebulon said in Automatically repair app when the HealthCheck goes down (Not Responding):

                                                                              if you hit a situation needing a repair, please copy the error shown for the last task (visible in the repair UI) and also download app logs and ideally do some basic investigation on the server, like running cloudron-support and save the resulting link

                                                                              Since you were able to explain how to do it manually, I thought about it, and literally everything you just said could be automated and shown in a notification, and there could be a “Send log to developers” button. So, again, if trying to auto-repair once doesn’t work, that’s it. I don’t see a single con, but a huge pro: more uptime by removing an unnecessary human first debug step. Automation is why we use Cloudron in the first place.

                                                                              Let me put it this way: why do you not just keep apps that run out of memory stopped? Why do you restart them? Uptime is the answer, and it’s what the users care about. This isn’t a developer platform where we stare at and love logs and make a million forum posts (like me). Users want a real solution to a real-world problem.

                                                                              This is already built into Docker now - so we’ll eventually get it anyway when you guys update Docker. That’s probs a year away though, so why not band-aid it till we all have official Docker support for repair on health check fail.

                                                                              Also, don’t think I don’t understand why you don’t want to band-aid an issue. But instead of band-aiding, your suggestion is to tell users to do work that could be automated and save them time. And, as @ruihildt proved, users don’t actually report this stuff unless it is consistent. They never would have told their story without my post.

                                                                              What I’m proposing is automatic and only serves to increase uptime. Your solution increases downtime and user annoyance if it was a one-off platform thing, which it appears to be. It only happened to me once after an auto-update, but I only had one app. Having 20 all do the same thing, with me not being there to repair them, and they’re in a production environment? Yikes.

                                                                              I’m not asking for an endless loop. Just a preemptive action to see if we can keep an app up without human intervention when it would otherwise stay up with my proposal.

                                                                              I don’t believe Docker should remove this behavior, nor should supervisor. I think there’s a solution out there for all of us, and I want to discuss in good faith ways to solve your problems with the proposal as well as putting users over developers in terms of UX.

                                                                              1 Reply Last reply Reply Quote 0
                                                                              • ruihildt
                                                                                ruihildt @nebulon last edited by

                                                                                @nebulon Would it be complicated to automate that 3 steps reporting? (For example, clicking repair would trigger this by default)

                                                                                Like I said, if I have clients complaining and if I'm not in front of my computer with free time on hand, I'm going to hit that repair button to get back online ASAP.

                                                                                This was my situation yesterday, I'm sick and stuck in bed with just my smartphone, it would have helped if it was automated, for you and for me.

                                                                                Lonkle 1 Reply Last reply Reply Quote 1
                                                                                • Lonkle
                                                                                  Lonkle @ruihildt last edited by

                                                                                  @ruihildt said in Automatically repair app when the HealthCheck goes down (Not Responding):

                                                                                  @nebulon Would it be complicated to automate that 3 steps reporting? (For example, clicking repair would trigger this by default)

                                                                                  Like I said, if I have clients complaining and if I'm not in front of my computer with free time on hand, I'm going to hit that repair button to get back online ASAP.

                                                                                  This was my situation yesterday, I'm sick and stuck in bed with just my smartphone, it would have helped if it was automated, for you and for me.

                                                                                  Exactly, what @nebulon and @girish want could be automated, and doing so increases what they want (more data to fix underlying problems) while at the same time keeping the users’ app uptime as high as possible. I want everyone to win in this scenario, so I want to hear everyone’s pros and cons.

                                                                                  1 Reply Last reply Reply Quote 0
                                                                                  • nebulon
                                                                                    nebulon Staff last edited by

                                                                                    Well, those steps are just generic debugging and investigation steps for sysadmins. They may or may not apply and are certainly not exhaustive, just what came to mind while writing this. Plus, we don't generally just send information wholesale from your server to us, also for privacy reasons. You can still manually issue a support ticket for that app from the support view, which will include the app logs.

                                                                                    The out-of-memory restart is something different though. The underlying issue is known here and restarting due to out-of-memory is the correct thing to do. There is also no solution code-wise, since we can't just up the memory limit automatically, risking over-provisioning the server. Nor can we add memory to the server automatically.

                                                                                    Again if you have concrete situations where a repair may solve the issue, we have to investigate. Keep in mind that in our experience this is not at all common across our users. Can you imagine Cloudron just issuing a server restart automatically since we found that this often fixed issues in the past?

                                                                                    Lonkle ruihildt 2 Replies Last reply Reply Quote 0
                                                                                    • Lonkle
                                                                                      Lonkle @nebulon last edited by Lonkle

                                                                                      @nebulon said in Automatically repair app when the HealthCheck goes down (Not Responding):

                                                                                      Can you imagine Cloudron just issuing a server restart automatically since we found that this often fixed issues in the past?

                                                                                      Well, of course not. That would defeat the entire purpose of this thread, which is to have your apps have as much uptime as needed. Your opposition is, “we, developers, need to make a perfect system and fix things so that our healthcheck isn’t even needed.” I agree that that’s the goal, but you’re not there yet (both I and the other users on this thread attest to that), so why not protect users from these one-off oddities that will never be reported if they don’t happen more than once? Because users are still going to hit repair and not do a single thing about it. But if in the notification there was an “ignore” button or “Send before and after logs to developers”, then they may just press that and automate all of your manual steps. Because you could make this more complicated if you want - but in the end, it boils down to a user choosing to send you logs to fix an underlying issue. Even if you choose not to repair automatically once per “Non-responsive” / “Failed” app, you should then go further in your direction and make it easy for users to log these events and report to you. I still think a single courtesy attempted repair is best, and Docker already has that built in, so we’re gonna get it.

                                                                                      Well, actually, I’m now making “Dot - The Repair Bot” to monitor the status of my apps and repair them if needed. This was always something I could do - even from inside a Cloudron app. And I intend to do it and release it for everyone. But this seems more suited as a built-in feature, a checkbox that says “Attempt a single repair on app failure.”

                                                                                      And sure, your steps weren’t exhaustive. But @ruihildt put it best. Average users don’t have time for any of that. I wasn’t advocating for automated sending of the logs. I was advocating for a button in the Notification Center that would send you, the developers, the relevant data you need - because users do not do what devs do (report bugs), but they would if it was a button click. Regardless of where you land on this matter, making it easier for devs to fix their apps is best if you’re going to say the only con to the pro of “more uptime” is “less developer communication.”

                                                                                      You guys for sure won’t be niche forever. So scaling ideas to reduce support requests should be considered heavily. Turns out for everyone in this thread, they repaired all of their apps and it never happened again. So no support requests, when I could totally see someone complaining about having to manually repair 20 apps. The fact that you have a health check and this one-off repair has worked for all of us and we never reported it means a singular automatic repair has no real-life con (aside from your so far theoretical con), but real-life pros (uptime and reliability).

                                                                                      Coincidentally, this unintentionally happened to me again while rebooting the server. The app booted up not responding (1 of 4) and I repaired it and went on with my day. I’m a developer and didn’t think anything of it except that I wished it had been automatic in case I hadn’t noticed.

                                                                                      So I just want to ask, do you feel like my goal and your goal can possibly coexist? I can wait for the Docker update which adds this, but I think I’ll create Dot to keep everything healthy automatically (which is something I would expect from an app manager). Average customer UX, saying this even as a developer, is more important than developer UX.

                                                                                      1 Reply Last reply Reply Quote 0
                                                                                      • nebulon
                                                                                        nebulon Staff last edited by

                                                                                        Well, really, if you hit this just now on a reboot, I don't understand why this would not be investigated. We are happy to help. Having to run a repair after a reboot is simply not acceptable. Repair in this case would just mean it recreates the containers after they were already recreated after the reboot. It makes no sense that it would fail first and then not; maybe there is some timing or resource issue lurking. I suspect there is an issue in the healthchecker state and the apps would have become healthy after some time. But all this is up in the air without further info. Repair may actually re-run the last task! So after a reboot it may run some random unrelated app configuration task.

                                                                                        I have seen many times where users would restart/repair apps just because they felt it took too long for them to start up, such actions are sometimes even risky if an app is starting up. This is also why we don't show the task cancellation button immediately. Randomly interrupting processes is not a good idea if it can be avoided, as it heavily depends on how an app is written and how it can recover from for example interrupted database migrations.
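
The concern about interrupting an app that is still starting suggests that any automatic action would need a startup grace period. A minimal Python sketch of such a guard follows; `may_intervene` and the ten-minute default are illustrative assumptions, not Cloudron behaviour, and a real guard would also need to know whether a task such as a database migration is still running.

```python
from datetime import datetime, timedelta


def may_intervene(started_at: datetime, now: datetime, healthy: bool,
                  grace: timedelta = timedelta(minutes=10)) -> bool:
    """True only if the app is unhealthy and its startup grace period has elapsed."""
    if healthy:
        return False                      # nothing to fix
    # While the app may still be booting, leave it alone.
    return now - started_at >= grace
```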

                                                                                        Lonkle 1 Reply Last reply Reply Quote 2
                                                                                        • ruihildt
                                                                                          ruihildt @nebulon last edited by

                                                                                          @nebulon I totally understand the privacy aspect, so that could get logged, but only sent when needed.

                                                                                          The idea would be to have a snapshot of the logs at the time, with all the information necessary to debug further, even if the app is restarted for obvious reasons.

                                                                                          Cloudron's promise is to make self-hosting easy, and such a system would make bug reporting easier.

                                                                                          Maybe that's unrealistic.

                                                                                          1 Reply Last reply Reply Quote 1
                                                                                          • Lonkle
                                                                                            Lonkle @nebulon last edited by Lonkle

                                                                                            @nebulon We can 100% agree on that. But I have nothing to give you. There was just a downtime notification until I Xed the notification and hit repair on the app. I tried rebooting again and it worked, so my impression is it was some kind of race condition, since I have four apps now and that’s the most I’ve ever had. I’ve rebooted plenty of times with nothing happening. But it’s kind of the point. Yes, this shouldn’t happen. But it does. It has. We shouldn’t be punishing users because we want a perfect system, simply because we want more data. We should have safeguards for them and make logs easier for us to receive if something does happen. If I had a notification that auto-repaired my app and then I could hit “Send before and after logs to developers” with a description of what happened, I’d do it.

                                                                                            I’m making Dot the Repair Bot for me to accomplish this, since I know we’re not going to agree on this. But why wouldn’t we be trying to find a win for you and a win for users? Cloudron has become my favorite platform. I presented an issue and a possible solution and it was rejected for being too “automated”, but that’s the point of Cloudron: automating the redundant tasks of running apps on the web.

                                                                                            1 Reply Last reply Reply Quote 0
                                                                                            • Lonkle
                                                                                              Lonkle last edited by

                                                                                              PS. @nebulon, do you ever sleep? 😂 #meneither

                                                                                              1 Reply Last reply Reply Quote 0
                                                                                              • Lonkle
                                                                                                Lonkle last edited by

                                                                                                PSPS. I hope no one takes any disrespect when having opposing ideas on here. I always try to debate in good faith. I view us, the forum members, as needing to decide on important infrastructure aspects like this. So it’s us vs the problem. Not my idea vs anyone else’s. The problem is that sometimes users have to repair their apps manually. It usually happens after auto-updates / reboots (the only two times I’ve experienced it or read about others experiencing it on here). So what’s the solution? My solution is to auto-repair apps like that until we get enough data to fix the underlying problem. Meaning, we need not only an auto-repair so we’re not punishing users for a bad update (that will always happen), but also, despite automatic repairs, easier log reporting of these errors to us.

                                                                                                So, it’s us vying for more logs to fix underlying issues, and automatic repair when something really is a one-off situation, which sometimes doesn’t have an explanation. Or it does, but an average user couldn’t find it. But if they hit the “Send us crash logs” button then we might be able to fix it or find the issue.

                                                                                                1 Reply Last reply Reply Quote 0