Automatically repair app when the HealthCheck goes down (Not Responding)

mehdi

@Lonk said in Automatically repair app when the HealthCheck goes down (Not Responding):

Only @mehdi appears opposed to this, though I couldn't tell. I'd say it's almost unanimous, so why wouldn't we compromise to make everyone happy: app devs, Cloudron developers (you and @nebulon), and especially users?

Nope, not opposed to it! I just did not understand what /repair does, or more precisely what the difference is between it and /restart.

To be clear, I'm not opposed, but I am not lobbying for implementing this either. I've literally never had to restart / repair an app to fix it => I don't really have an opinion on the matter.

Lonkle

@mehdi That makes perfect sense. Since I've encountered it personally, I'm a little more opinionated on the resilience and uptime of the apps in an app manager. I'm only proposing a single repair as soon as an app's status moves to "Failed" or "Not responding...". If it doesn't work, it doesn't work. If it does, then we only have uptime to gain here.
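As a rough illustration of the "single repair, then stop" idea, here is a minimal Python sketch. Everything in it is hypothetical: `SingleRepairGuard`, the `repair` callback and the status strings are stand-ins for illustration, not Cloudron's actual API. It only shows how one attempt per outage could be enforced so that there is never a repair loop.

```python
from typing import Callable, Set

# Assumed status labels, taken from the wording in this thread.
UNHEALTHY = {"Failed", "Not responding"}


class SingleRepairGuard:
    """Trigger at most one repair per outage of each app."""

    def __init__(self, repair: Callable[[str], None]) -> None:
        self._repair = repair              # hypothetical repair callback
        self._attempted: Set[str] = set()  # apps already repaired during this outage

    def on_status(self, app_id: str, status: str) -> bool:
        """Handle a status update; return True if a repair was triggered."""
        if status not in UNHEALTHY:
            # App is healthy again, so a future outage gets a fresh attempt.
            self._attempted.discard(app_id)
            return False
        if app_id in self._attempted:
            # Already tried once for this outage: leave it to the admin.
            return False
        self._attempted.add(app_id)
        self._repair(app_id)
        return True
```

If the one attempt does not bring the app back, the guard does nothing further until the app has been seen healthy again.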

robi

It sounds like we need a multi-step approach.

First, improve logging, capturing and surfacing of the actual errors in general, per Cloudron. Notifications that an app went down and then came back up are next to useless.

Second, add more resilience in an escalating manner, and at different timeout lengths and counts. Maybe restart first and if that doesn't do it, repair.

Third, have a way to get telemetry for the Cloudron team from all running Cloudrons, so they have an auto generated view of which apps need more attention because of triggered restarts, repairs and errors.

Less chaff/noise, more signal/automation. That's what we love about Cloudron.
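The escalating approach in the second step could look something like the Python sketch below. It is only a sketch under assumptions: `is_healthy`, `restart` and `repair` are hypothetical callbacks, and the waiting between attempts (the "different timeout lengths") is left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EscalationPolicy:
    """Try the cheaper action first and stop as soon as the app is healthy."""

    is_healthy: Callable[[], bool]  # hypothetical health probe
    restart: Callable[[], None]     # hypothetical restart action
    repair: Callable[[], None]      # hypothetical repair action
    restart_attempts: int = 2
    history: List[str] = field(default_factory=list)

    def recover(self) -> bool:
        """Return True if the app ends up healthy."""
        if self.is_healthy():
            return True
        for _ in range(self.restart_attempts):
            self.history.append("restart")
            self.restart()
            if self.is_healthy():
                return True
        # Restarts did not help: escalate to a single repair.
        self.history.append("repair")
        self.repair()
        return self.is_healthy()
```

The `history` list is the kind of record that could feed the error surfacing and telemetry mentioned in the first and third steps.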

Lonkle

@robi Exactly, I want to add automation to this without losing what @girish and @nebulon want, a compromise where we all win.

I really liked where you were going with that analytics of apps thing. If an app restarts more than once a day, we can send the admin a report at the end of the week.
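To make the report idea concrete, here is a small Python sketch. It assumes restart events are already recorded somewhere as `(app, timestamp)` pairs; that event log, the function name `flaky_apps` and the threshold parameter are all made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def flaky_apps(
    events: Iterable[Tuple[str, datetime]],
    now: datetime,
    max_per_day: int = 1,
) -> Dict[str, int]:
    """Map each app to its worst daily restart count over the last 7 days.

    Only apps that exceeded max_per_day on at least one day are included.
    """
    week_ago = now - timedelta(days=7)
    # Count restarts per (app, calendar day) inside the 7-day window.
    per_day = Counter(
        (app, ts.date()) for app, ts in events if week_ago <= ts <= now
    )
    worst: Dict[str, int] = {}
    for (app, _day), count in per_day.items():
        if count > max_per_day:
            worst[app] = max(worst.get(app, 0), count)
    return worst
```

A non-empty result would be what goes into the weekly notification.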

I'm advocating we treat a suddenly unresponsive app in the same way we treat an app that takes up too much memory. We restart it, and the process might fix itself.

Girish did say that when we update to the latest Cloudron we'll eventually get this anyway. I'm just saying, why wait a year instead of hashing out the fundamental disagreements between developers and users so that everyone can benefit? I liked where @fbartels was going with it at the beginning of this thread, and I like how the analytics you bring up could help send the admin a weekly notification that their app went down a few times, may be "unhealthy", and needs to be looked into. Something like that. I think notifications have their place, I'm just not 100% sure how, which is why I like this thread: to hash it out.

Lonkle

Just read this:

https://github.com/moby/moby/issues/28400#issuecomment-713457999

And liked this quote: “Ideological answers are not very useful when people ask for pragmatic solutions to real-life problems.” Although that person is fighting for configurable options on healthcheck fail. I just want a single repair.

Because the healthcheck isn’t any more useful than UptimeRobot otherwise. And you handle running out of RAM the same way (by restarting and notifying the user it happened).

I still want to find a solution where users, app devs, and the main developers all can compromise and figure out a solution to fit our needs / wants on this platform.

nebulon

As @girish mentioned, the repair was added as a last resort and nearly always it just covers up a real underlying issue, which should be tackled to avoid repair runs in the future. Given that those issues are not well understood and known currently, essentially what it does is to tear everything down and start the app fresh. It may be docker issues or other things, if we would know what logs or hints we should attach then we would do that, but it usually isn't that trivial.

So for future reference, if you hit a situation needing a repair, please copy the error shown for the last task (visible in the repair UI), download the app logs, and ideally do some basic investigation on the server, like running cloudron-support and saving the resulting link. Do all this before hitting repair, since using it might obscure the errors and make them hard to find afterwards. Also, if multiple apps are in this state, see if there is some correlation between the errors. For example, it could just be that the system got overloaded temporarily after a reboot or such. In many cases there are solutions we can build into the platform, but we have to first understand the underlying issue.

Lonkle

@nebulon said in Automatically repair app when the HealthCheck goes down (Not Responding):

if you hit a situation needing a repair, please copy the error shown for the last task (visible in the repair UI) and also download app logs and ideally do some basic investigation on the server, like running cloudron-support and save the resulting link

Since you were able to explain how to do it manually, I thought about it, and literally everything you just said could be automated and shown in a notification, and there could be a “Send log to developers” button. So, again, if trying to auto-repair once doesn’t work, that’s it. I don’t see a single con, but I do see a huge pro: more uptime by removing an unnecessary human-first debug step. Automation is why we use Cloudron in the first place.

Let me put it this way, why do you not just keep apps that run out of memory, stopped, why do you restart them? Uptime is the answer and it’s what the users care about. This isn’t a developer platform where we stare and love logs and make a million forum posts (like me). Users want a real solution to a real-world problem.

This is already built into Docker now - so we’ll eventually get it anyway when you guys update Docker. That’s probs a year away though, so why not band aid it till we all have official Docker support for repair on health check fail.

Also, don’t think I don’t understand why you don’t want to band aid an issue. But instead of band aid-ing your suggestion is to tell users to do work that could be automated and save them time. And like @ruihildt proved. Users don’t actually report this stuff unless it actually is consistent. They never would have told their story without my post.

What I’m proposing is automatic and only serves to increase uptime. Your solution increases downtime and user annoyance if it was a one-off platform thing, which it appears to be. It only happened to me once after an auto-update, but I only had one app. Having 20 apps all do the same thing in a production environment, with me not there to repair them? Yikes.

I’m not asking for an endless loop. Just a preemptive action to see if we can keep an app up without human intervention when it would otherwise stay down.

I don’t believe Docker should remove this behavior, nor should supervisor. I think there’s a solution out there for all of us and I want to discuss in good faith ways to solve your problems with the proposal as well as putting users over developers in terms of UX.

ruihildt

@nebulon Would it be complicated to automate that 3 steps reporting? (For example, clicking repair would trigger this by default)

Like I said, if I have clients complaining and if I'm not in front of my computer with free time on hand, I'm going to hit that repair button to get back online ASAP.

This was my situation yesterday, I'm sick and stuck in bed with just my smartphone, it would have helped if it was automated, for you and for me.

Lonkle

@ruihildt said in Automatically repair app when the HealthCheck goes down (Not Responding):

@nebulon Would it be complicated to automate that 3 steps reporting? (For example, clicking repair would trigger this by default)

Like I said, if I have clients complaining and if I'm not in front of my computer with free time on hand, I'm going to hit that repair button to get back online ASAP.

This was my situation yesterday, I'm sick and stuck in bed with just my smartphone, it would have helped if it was automated, for you and for me.

Exactly, what @nebulon and @girish want could be automated, and in doing so it gives them more of what they want (more data to fix underlying problems) while at the same time keeping the uptime of users’ apps as high as possible. I want everyone to win in this scenario, so I want to hear everyone’s pros and cons.

nebulon

Well, those steps are just generic debugging and investigation steps for sysadmins. They may or may not apply and are certainly not exhaustive, just what came to mind while writing this. Plus, we don't generally just send information wholesale from your server to us, also for privacy reasons. You can still manually issue a support ticket for that app from the support view, which will include the app logs.

The out-of-memory restart is something different though. The underlying issue is known here and restarting due to out-of-memory is the correct thing to do. There is also no solution code-wise, since we can't just up the memory limit automatically, risking over-provisioning the server. Nor can we add memory to the server automatically.

Again if you have concrete situations where a repair may solve the issue, we have to investigate. Keep in mind that in our experience this is not at all common across our users. Can you imagine Cloudron just issuing a server restart automatically since we found that this often fixed issues in the past?

Lonkle

@nebulon said in Automatically repair app when the HealthCheck goes down (Not Responding):

Can you imagine Cloudron just issuing a server restart automatically since we found that this often fixed issues in the past?

Well, of course not. That would defeat the entire purpose of this thread, which is to have your apps have as much uptime as needed. Your opposition is, “we, developers need to make a perfect system and fix things so that our healthcheck isn’t even needed.” I agree that that’s the goal, but you’re not there yet (both I and the other users on this thread can attest to that), so why not protect users from these one-off oddities that will never be reported if they don’t happen more than once? Because users are still going to hit repair and not do a single thing about it.

But if the notification had an “ignore” button or “Send before and after logs to developers”, then they may just press that, and that automates all of your manual steps. You could make this more complicated if you want, but in the end it boils down to a user choosing to send you logs to fix an underlying issue. Even if you choose not to repair automatically once per “Non-responsive” / “Failed” app, you should then go further in your direction and make it easy for users to log these events and report them to you.

I still think a single courtesy attempted repair is best, and Docker already has that built in, so we’re gonna get it. Well, actually, I’m now making “Dot - The Repair Bot” to monitor the status of my apps and repair them if needed. This was always something I could do, even from inside a Cloudron app. And I intend to do it and release it for everyone. But this seems more suited as a built-in feature, a checkbox that says “Attempt a single repair on app failure.”

And sure your steps weren’t exhaustive. But @ruihildt put it best. Average users don’t have time for any of that. I wasn’t advocating for automated sending of the logs. I was advocating for a button in the Notification Center that would send you, the developers, the relevant data you need - because users do not do what devs do (report bugs), but they would if it was a button click. Regardless of where you land on this matter, making it easier for devs to fix their apps is best if you’re going to say the only con to the pro of “more uptime” is “less developer communication.”

You guys for sure won’t be niche forever. So scaling ideas to reduce support requests should be considered heavily. It turns out that everyone in this thread repaired all of their apps and it never happened again. So there were no support requests, even though I could totally see someone complaining about having to manually repair 20 apps. The fact that you have a health check, that this one-off repair has worked for all of us, and that we never reported it means a single automatic repair has no real-life con (aside from your so far theoretical con), but real-life pros (uptime and reliability).

Coincidentally, this unintentionally happened to me again while rebooting the server. The app booted up not responding (1 of 4) and I repaired it and went on with my day. I’m a developer and didn’t think anything of it except that I wished it had been automatic in case I hadn’t noticed.

So I just want to ask, do you feel like my goal and your goal can possibly coexist? I can wait for the Docker update which adds this but I think I’ll create Dot to keep everything healthy automatically (which is something I would expect from an app manager). Average customer UX, saying this even as a developer, is more important than developer UX.

nebulon

Well, really, if you hit this just now on a reboot, I don't understand why this would not be investigated. We are happy to help. Having to run a repair after a reboot is simply not acceptable. Repair in this case would just mean it recreates the containers after they were already recreated after the reboot. It makes no sense that it would fail first and then not; maybe there is some timing or resource issue lurking. I suspect there is an issue in the healthchecker state and the apps would have become healthy after some time. But all this is up in the air without further info. Repair may actually re-run the last task! So after a reboot it may run some random unrelated app configuration task.

I have seen many times where users would restart/repair apps just because they felt it took too long for them to start up, such actions are sometimes even risky if an app is starting up. This is also why we don't show the task cancellation button immediately. Randomly interrupting processes is not a good idea if it can be avoided, as it heavily depends on how an app is written and how it can recover from for example interrupted database migrations.

ruihildt

@nebulon I totally understand the privacy aspect, so that could get logged, but only sent when needed.

The idea would be to have a snapshot of the logs at the time, with all the information necessary to debug further, even if the app is restarted for obvious reasons.

Cloudron's promise is to make selfhosting easy, and such a system would make bug reporting easier.

Maybe that's unrealistic.
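For what it's worth, the "snapshot first, send only when needed" idea can be sketched in a few lines of Python. The names here (`snapshot_then_repair`, `fetch_logs`, `repair`) and the snapshot directory are hypothetical; the point is only that the logs are written locally before the repair runs and nothing leaves the server.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def snapshot_then_repair(
    app_id: str,
    fetch_logs: Callable[[], str],  # hypothetical: returns the current app logs
    repair: Callable[[], None],     # hypothetical: performs the repair
    out_dir: Optional[Path] = None,
) -> Path:
    """Save the logs to a local file, then run the repair.

    Returns the path of the snapshot, which stays on the server until
    the admin decides to attach it to a report.
    """
    if out_dir is None:
        # Assumed location, for illustration only.
        out_dir = Path(tempfile.gettempdir()) / "app-snapshots"
    out_dir.mkdir(parents=True, exist_ok=True)

    taken_at = datetime.now(timezone.utc)
    snapshot = {
        "app": app_id,
        "taken_at": taken_at.isoformat(),
        "logs": fetch_logs(),
    }
    path = out_dir / f"{app_id}-{taken_at.strftime('%Y%m%dT%H%M%SZ')}.json"
    path.write_text(json.dumps(snapshot, indent=2))

    # Only now touch the app, so the pre-repair state is preserved on disk.
    repair()
    return path
```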

Lonkle

@nebulon We can 100% agree on that. But I have nothing to give you. There was just a downtime notification until I Xed the notification and hit repair on the app. I tried rebooting again and it worked, so my impression is it was some kind of race condition, since I have four apps now and that’s the most I’ve ever had. I’ve rebooted plenty of times with nothing happening. But that’s kind of the point. Yes, this shouldn’t happen. But it does. It has. We shouldn’t be punishing users just because we want a perfect system and more data. We should have safeguards for them and make logs easier for us to receive if something does happen. If I had a notification that my app was auto-repaired and then I could hit “Send before and after logs to developers” with a description of what happened, I’d do it.

I’m making Dot the Repair Bot for me to accomplish this, since I know we’re not going to agree on this. But why wouldn’t we try to find a win for you and a win for users? Cloudron has become my favorite platform. I presented an issue and a possible solution, and it was rejected for being too “automated”, but that’s the point of Cloudron: automating the redundant tasks of running apps on the web.

Lonkle

PS. @nebulon, do you ever sleep? #meneither

Lonkle

PSPS. I hope no one takes any disrespect when having opposing ideas on here. I always try to debate in good faith. I view us, the forum members, as needing to decide on important infrastructure aspects like this. So it’s us vs the problem. Not my idea vs anyone else’s. The problem is that sometimes users have to repair their apps manually. It usually happens after auto-updates / reboots (the only two times I’ve experienced it or read about others experiencing it on here). So what’s the solution? My solution is to auto-repair apps like that until we get enough data to fix the underlying problem. Meaning, we need not only an auto-repair so we’re not punishing users for a bad update (that will always happen), but also despite automatic repairs, easier log reporting of these errors to us.

So, it’s us vying for more logs to fix underlying issues, plus an automatic repair when something really is a one-off situation, which sometimes doesn’t have an explanation. Or it does, but an average user couldn’t find it. But if they hit the “Send us crash logs” button then we might be able to fix it or find the issue.

robi

While this has STILL not been addressed by docker upstream, there is a proposed solution that works:
https://hub.docker.com/r/willfarrell/autoheal

Alternatively, a simple cron script could check for unhealthy containers and restart them.
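A minimal Python sketch of such a script is below. It relies on the `health=unhealthy` filter of `docker ps` and on `docker restart`; the function names are made up, and the command runner is passed in as a parameter only so the logic can be exercised without Docker. Note that this is a plain container restart, not Cloudron's repair, and restarting containers behind the platform's back is an assumption that may not be safe on a Cloudron server.

```python
import subprocess
from typing import Callable, List


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True
    ).stdout


def restart_unhealthy(runner: Callable[[List[str]], str] = run) -> List[str]:
    """Restart every container Docker reports as unhealthy.

    Returns the names of the containers that were restarted.
    """
    out = runner(
        ["docker", "ps", "--filter", "health=unhealthy", "--format", "{{.Names}}"]
    )
    names = [line.strip() for line in out.splitlines() if line.strip()]
    for name in names:
        runner(["docker", "restart", name])
    return names
```

Calling `restart_unhealthy()` from a small script scheduled by cron every few minutes would give roughly the behaviour described above. Only containers that define a healthcheck can ever be reported as unhealthy.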

@staff what do you think?
