Solved Graphite keeps crashing OOM
-
So every few hours Graphite crashes, even when it has 2GB of memory. If I look at the log, this is what comes up: https://paste.armada.digital/todawelexi.sql
It seems there is a Python Twisted error that keeps repeating over and over again; my log file is 500+MB with a loop of the error in the link above.
-
@rmdes can you make sure you are running the latest Cloudron version, 6.2.7? There were fixes for graphite in that release.
-
@nebulon I've had it happen on the latest v6.2.7 as well. Something keeps spiking its memory usage.
-
Updating to 6.2.7 should definitely sort out the twisted errors. There was an error in the graphite web configuration in previous releases.
-
@girish Happened twice today with 6.2.7.
The memory limit is at 640MB, down from 1250MB, where it showed the same behavior.
-
@robi OOM twice more yesterday.
It would be great if the notification and the email included the limit that was reached and a timestamp.
The email only shows the time it was sent, though.
-
Apr 07 09:15:39 builtins.StopIteration:
Apr 07 09:15:39 07/04/2021 16:15:39 :: [console] Unhandled Error
Apr 07 09:15:39 Traceback (most recent call last):
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
Apr 07 09:15:39 result = inContext.theWork()
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
Apr 07 09:15:39 inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
Apr 07 09:15:39 return self.currentContext().callWithContext(ctx, func, *args, **kw)
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
Apr 07 09:15:39 return func(*args,**kw)
Apr 07 09:15:39 --- <exception caught here> ---
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 189, in writeForever
Apr 07 09:15:39 writeCachedDataPoints()
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 98, in writeCachedDataPoints
Apr 07 09:15:39 (metric, datapoints) = cache.drain_metric()
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 187, in drain_metric
Apr 07 09:15:39 metric = self.strategy.choose_item()
Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 116, in choose_item
Apr 07 09:15:39 return next(self.queue)
Apr 07 09:15:39 builtins.StopIteration:
Apr 07 09:15:40 07/04/2021 16:15:40 :: [console] Unhandled Error
Apr 07 09:15:40 Traceback (most recent call last):
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
Apr 07 09:15:40 result = inContext.theWork()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
Apr 07 09:15:40 inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
Apr 07 09:15:40 return self.currentContext().callWithContext(ctx, func, *args, **kw)
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
Apr 07 09:15:40 return func(*args,**kw)
Apr 07 09:15:40 --- <exception caught here> ---
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 189, in writeForever
Apr 07 09:15:40 writeCachedDataPoints()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 98, in writeCachedDataPoints
Apr 07 09:15:40 (metric, datapoints) = cache.drain_metric()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 187, in drain_metric
Apr 07 09:15:40 metric = self.strategy.choose_item()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 116, in choose_item
Apr 07 09:15:40 return next(self.queue)
Apr 07 09:15:40 builtins.StopIteration:
Apr 07 09:15:40 07/04/2021 16:15:40 :: [console] Unhandled Error
Apr 07 09:15:40 Traceback (most recent call last):
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
Apr 07 09:15:40 result = inContext.theWork()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
Apr 07 09:15:40 inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
Apr 07 09:15:40 return self.currentContext().callWithContext(ctx, func, *args, **kw)
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
Apr 07 09:15:40 return func(*args,**kw)
Apr 07 09:15:40 --- <exception caught here> ---
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 189, in writeForever
Apr 07 09:15:40 writeCachedDataPoints()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 98, in writeCachedDataPoints
Apr 07 09:15:40 (metric, datapoints) = cache.drain_metric()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 187, in drain_metric
Apr 07 09:15:40 metric = self.strategy.choose_item()
Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 116, in choose_item
Apr 07 09:15:40 return next(self.queue)
Apr 07 09:15:40 builtins.StopIteration:
It still runs out of memory with a 2.5GB limit.
Also my browser tab for the logs starts timing out as if it can't keep up.. maybe it's in a continuous crash loop.
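When the log is too big to open in a browser tab, a small script can at least show how fast the error repeats. The following is only a rough sketch: it assumes the "Mon DD HH:MM:SS message" line layout seen in the excerpt above, and the function name summarize is made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt: "Apr 07 09:15:39 <message>".
PREFIX = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (.*)$")
# A message that is just an exception name followed by a colon,
# e.g. "builtins.StopIteration:".
EXCEPTION = re.compile(r"[\w.]+(Error|Exception|StopIteration):.*")


def summarize(lines):
    """Count 'Unhandled Error' entries per second and exception names overall.

    `lines` can be any iterable of strings, including an open file, so a
    large log is streamed rather than loaded into memory at once.
    """
    per_second = Counter()
    exceptions = Counter()
    for line in lines:
        match = PREFIX.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines that do not follow the assumed layout
        stamp, message = match.groups()
        if message.endswith("Unhandled Error"):
            per_second[stamp] += 1
        elif EXCEPTION.fullmatch(message):
            exceptions[message.split(":", 1)[0]] += 1
    return per_second, exceptions
```

To run it on a real file, pass the open file object directly, for example summarize(open(path, errors="replace")), where path is wherever your graphite log lives. A high count per second would support the crash-loop theory.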
-
@robi same here, continued crash loop, log file is really huge
-
@nebulon running the latest cloudron version, confirmed
-
I don't see any Python Twisted errors anymore, but it seems the last two OOM events output this error: https://paste.armada.digital/enijufixep.coffeescript
-
I'm also on 6.2.7 but graphite has just started crashing with OOM in the last couple of days (it's never crashed for me before then).
Also agree with @robi that the notifications within Cloudron really ought to include the time it happened (which I can only tell by seeing what time the email notification arrived).
-
A manual restart by pushing the button in services seems to have calmed down the crashing for now.
I hope the log rotator does its job; we don't need to have so many huge logs around.
-
Just had another Graphite OOM crash.
Seems strange seeing it's mostly just me using my Cloudron atm and I'm not really doing anything on it.
What does Graphite actually do?
-
@jdaviescoates that service (graphite+collectd) collects the data used in the graphs, like memory usage over time. Given that it causes issues from time to time and that we don't really utilize it well, we are thinking of maybe collecting the data on our own and ditching graphite.
-
Thanks
@nebulon said in Graphite keeps crashing OOM:
collecting the data on our own
What would that look like?
-
@jdaviescoates we don't know yet
-
@nebulon Caprover uses Netdata... would that be possible?
-
After a server restart, graphite won't start. Reconfig doesn't help.
-
I decided to reboot the box for security upgrades (from notifications) and it came up without errors this time.
-
Graphite OOM, again.
-
@jdaviescoates what memory limit is set in your case? Also, does the server itself have enough free memory to allocate? The setting in Cloudron is only an upper limit; the service may still get OOM-killed if no memory is available system-wide.
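A quick way to check the second question on a Linux host is to compare the configured limit with MemAvailable from /proc/meminfo. This is a minimal sketch, not something Cloudron provides: the helper names parse_meminfo and headroom_ok are made up, and the limit you pass in is whatever you configured for the service.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB (e.g. "MemAvailable:  15728640 kB").
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def headroom_ok(meminfo_text, limit_mb):
    """Return True if the host reports at least `limit_mb` MB available.

    Returns None when MemAvailable is missing (very old kernels).
    """
    available_kb = parse_meminfo(meminfo_text).get("MemAvailable")
    if available_kb is None:
        return None
    return available_kb >= limit_mb * 1024
```

On the server itself this would be called as headroom_ok(open("/proc/meminfo").read(), 512) for a 512MB limit. A single snapshot says nothing about spikes, so it only rules out the obvious case of a host that is already short on memory.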
-
@nebulon it was at whatever the default is (256MB?) I've now upped it to 512MB to see if that stops it. Plenty of spare RAM on the machine.
-
@nebulon my graphite service has 1.60GB available, still OOM several times a day.
The machine where Cloudron is running has 30GB available; on average 15GB is being used, leaving half of the available memory free.
-
All this does not sound right then. Do you see anything suspicious in the graphite logs as such? Like frequent restarts of something or so?
-
@nebulon These are the only errors I find in the log, besides the restarts:
https://paste.armada.digital/xanopucuqu.sql
-
I get daily crashes too, with same/similar log messages about cache and draining issues.
-
When graphite crashes...
-
@rmdes It's like Graphite sees Nessie the Loch Ness monster and freaks out..
Thanks for the graphs, er laughs.
-
@robi here's another one, zoomed to 24h.
Funny thing is, I understand it crashes because of memory issues (resulting from the Python errors?),
but why/how does Graphite restart itself? I mean, why does it fail to restart for hours and then suddenly come back online?
-
@rmdes nice.. yep not how a health monitored app should behave.
looks like something got stuck for a while then finally failed to get kicked again.
-
Maybe this python error can help ? https://paste.armada.digital/ovurasajof.sql
-
@rmdes are you able to write to me on support@ and give me ssh access, so I can debug this? Would be good to understand what's happening here.
-
@girish Yes of course, doing this now, SSH has been enabled.
-
@rmdes thanks for the access. it seems your server somehow hits this carbon cache bug - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923464
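For anyone reading the traceback, the failing line is return next(self.queue) in choose_item: calling next() on an exhausted iterator raises StopIteration. The sketch below only illustrates that failure mode and one way to guard against it. Both classes are hypothetical stand-ins, not carbon's actual code and not the patch from the Debian bug report.

```python
class NaiveStrategy:
    """Hypothetical stand-in for a drain strategy; NOT carbon's code."""

    def __init__(self, metrics):
        self.queue = iter(metrics)  # one-shot iterator

    def choose_item(self):
        # Raises StopIteration once the iterator is exhausted.
        return next(self.queue)


class GuardedStrategy:
    """Same idea, but an exhausted queue is rebuilt instead of raising."""

    def __init__(self, get_metrics):
        # get_metrics: hypothetical callable returning current metric names.
        self.get_metrics = get_metrics
        self.queue = iter(())

    def choose_item(self):
        item = next(self.queue, None)  # default avoids StopIteration
        if item is None:
            # Queue exhausted: rebuild it once from the current metrics.
            self.queue = iter(self.get_metrics())
            item = next(self.queue, None)
        return item  # None means there is nothing to write right now
```

With the naive version, a caller that logs the error and immediately retries will hit the same exception on every attempt, which would be consistent with the repeating log entries above.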
-
@rmdes I have applied the patch from the bug report and it seems to fix the problem. The change is applied locally on your server for now and will be in the next release.
-
@girish So this only hit me?
Anyway, thanks a lot for applying the patch locally and fixing the issue!
-
@rmdes yes, I am not sure why. It doesn't happen in any of our demo servers or managed services. Quite strange. It could also be that maybe others have hit it but have not noticed it (since it only causes a CPU spike..) but clearly it's a bug since it's been fixed upstream.