

    Cloudron Forum


    Solved Graphite keeps crashing OOM

    Support
    graphs oom
    • robi
      robi @robi last edited by

      @robi OOM twice more yesterday.

      It would be great if the notification and the email included the limit that was reached and a timestamp.

      The email does show the time it was sent, though.

      Life of Advanced Technology

      • robi
        robi last edited by

        Apr 07 09:15:39 builtins.StopIteration:
        Apr 07 09:15:39 07/04/2021 16:15:39 :: [console] Unhandled Error
        Apr 07 09:15:39 Traceback (most recent call last):
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
        Apr 07 09:15:39 result = inContext.theWork()
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
        Apr 07 09:15:39 inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
        Apr 07 09:15:39 return self.currentContext().callWithContext(ctx, func, *args, **kw)
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
        Apr 07 09:15:39 return func(*args,**kw)
        Apr 07 09:15:39 --- <exception caught here> ---
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 189, in writeForever
        Apr 07 09:15:39 writeCachedDataPoints()
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 98, in writeCachedDataPoints
        Apr 07 09:15:39 (metric, datapoints) = cache.drain_metric()
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 187, in drain_metric
        Apr 07 09:15:39 metric = self.strategy.choose_item()
        Apr 07 09:15:39 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 116, in choose_item
        Apr 07 09:15:39 return next(self.queue)
        Apr 07 09:15:39 builtins.StopIteration:
        Apr 07 09:15:40 07/04/2021 16:15:40 :: [console] Unhandled Error
        Apr 07 09:15:40 Traceback (most recent call last):
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
        Apr 07 09:15:40 result = inContext.theWork()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
        Apr 07 09:15:40 inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
        Apr 07 09:15:40 return self.currentContext().callWithContext(ctx, func, *args, **kw)
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
        Apr 07 09:15:40 return func(*args,**kw)
        Apr 07 09:15:40 --- <exception caught here> ---
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 189, in writeForever
        Apr 07 09:15:40 writeCachedDataPoints()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 98, in writeCachedDataPoints
        Apr 07 09:15:40 (metric, datapoints) = cache.drain_metric()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 187, in drain_metric
        Apr 07 09:15:40 metric = self.strategy.choose_item()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 116, in choose_item
        Apr 07 09:15:40 return next(self.queue)
        Apr 07 09:15:40 builtins.StopIteration:
        Apr 07 09:15:40 07/04/2021 16:15:40 :: [console] Unhandled Error
        Apr 07 09:15:40 Traceback (most recent call last):
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
        Apr 07 09:15:40 result = inContext.theWork()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
        Apr 07 09:15:40 inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
        Apr 07 09:15:40 return self.currentContext().callWithContext(ctx, func, *args, **kw)
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
        Apr 07 09:15:40 return func(*args,**kw)
        Apr 07 09:15:40 --- <exception caught here> ---
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 189, in writeForever
        Apr 07 09:15:40 writeCachedDataPoints()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/writer.py", line 98, in writeCachedDataPoints
        Apr 07 09:15:40 (metric, datapoints) = cache.drain_metric()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 187, in drain_metric
        Apr 07 09:15:40 metric = self.strategy.choose_item()
        Apr 07 09:15:40 File "/usr/lib/python3/dist-packages/carbon/cache.py", line 116, in choose_item
        Apr 07 09:15:40 return next(self.queue)
        Apr 07 09:15:40 builtins.StopIteration:
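
Each traceback above ends at `return next(self.queue)` raising `StopIteration`. As a minimal standalone illustration (not carbon's code and not a patch for it), the sketch below shows how a bare `next()` on an exhausted generator raises that exception, and how passing a default to `next()` avoids it. The names `make_queue` and `choose_item` and the metric names are hypothetical.

```python
def make_queue(items):
    """Yield each item once; after that the generator is exhausted."""
    for item in items:
        yield item


def choose_item(queue, default=None):
    """Return the next queued item, or `default` once the queue is exhausted.

    Giving next() a default means an empty queue no longer raises StopIteration.
    """
    return next(queue, default)


# Guarded access: the third call finds the queue exhausted and gets the default.
queue = make_queue(["metric.a", "metric.b"])
drained = [choose_item(queue) for _ in range(3)]

# Bare access: next() on an exhausted generator raises StopIteration.
try:
    next(make_queue([]))
    raised = False
except StopIteration:
    raised = True
```

Whether a default, an explicit `try`/`except`, or rebuilding the queue is the right handling depends on what the caller expects from an empty queue; this only demonstrates the mechanism.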
        

        It still runs out of memory with 2.5GB limit.

        Also, my browser tab for the logs starts timing out, as if it can't keep up. Maybe it's in a continuous crash loop.

        • rmdes
          rmdes @robi last edited by

          @robi same here, continued crash loop, log file is really huge

          • rmdes
            rmdes @nebulon last edited by

            @nebulon running the latest cloudron version, confirmed

            • rmdes
              rmdes @rmdes last edited by

              I no longer see any Python Twisted errors, but the last two OOM events output this error: https://paste.armada.digital/enijufixep.coffeescript

              • jdaviescoates
                jdaviescoates last edited by

                I'm also on 6.2.7 but graphite has just started crashing with OOM in the last couple of days (it's never crashed for me before then).

                Also agree with @robi that the notifications within Cloudron really ought to include the time it happened (which I can only tell by seeing what time the email notification arrived).

                I use Cloudron with Gandi & Hetzner

                • robi
                  robi last edited by

                  A manual restart, by pushing the button in Services, seems to have calmed down the crashing for now.

                  I hope the log rotator does its job; we don't need so many huge logs around.

                  • jdaviescoates
                    jdaviescoates last edited by jdaviescoates

                    Just had another Graphite OOM crash.

                    Seems strange, seeing as it's mostly just me using my Cloudron atm and I'm not really doing anything on it.

                    What does Graphite actually do?

                    • nebulon
                      nebulon Staff @jdaviescoates last edited by

                      @jdaviescoates that service (graphite+collectd) collects the data used in the graphs, like memory usage over time. Given that it causes issues from time to time and we don't really utilize it well, we are thinking of maybe collecting the data on our own and ditching graphite.
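
To make "memory usage over time" concrete, here is a rough sketch of the kind of data such a service keeps: a bounded series of timestamped samples. This is purely illustrative and says nothing about how Graphite stores data or what Cloudron might build; the class `MemorySeries` and the sample values are hypothetical.

```python
from collections import deque
import time


class MemorySeries:
    """Keep only the most recent `capacity` (timestamp, used_mb) samples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest sample automatically when full.
        self.samples = deque(maxlen=capacity)

    def record(self, used_mb, timestamp=None):
        """Append one sample, stamping it with the current time if none is given."""
        if timestamp is None:
            timestamp = time.time()
        self.samples.append((timestamp, used_mb))

    def peak(self):
        """Return the highest retained usage, or None if nothing was recorded."""
        return max((used for _, used in self.samples), default=None)


# Four samples into a series that retains three: the oldest one is discarded.
series = MemorySeries(capacity=3)
for ts, used in [(0, 100), (60, 250), (120, 180), (180, 90)]:
    series.record(used, timestamp=ts)
```

A bounded buffer keeps memory use of the collector itself constant, which is one reason fixed-size storage is common for this kind of graph data.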

                      • jdaviescoates
                        jdaviescoates @nebulon last edited by

                        Thanks

                        @nebulon said in Graphite keeps crashing OOM:

                        collecting the data on our own

                        What would that look like?

                        • nebulon
                          nebulon Staff @jdaviescoates last edited by

                          @jdaviescoates we don't know yet 😉

                          • scooke
                            scooke @nebulon last edited by

                            @nebulon Caprover uses Netdata... would that be possible?

                            A life lived in fear is a life half-lived

                            • robi
                              robi last edited by

                              0bcb80f1-c3a8-4e0d-af61-6a02f89d7332-image.png
                              After a server restart, graphite won't start. Reconfig doesn't help.

                              • robi
                                robi @robi last edited by

                                I decided to reboot the box for security upgrades (from notifications) and it came up without errors this time.

                                • jdaviescoates
                                  jdaviescoates last edited by

                                  Graphite OOM, again.

                                  • nebulon
                                    nebulon Staff @jdaviescoates last edited by

                                    @jdaviescoates what memory limit is set in your case? Also, does the server itself have enough free memory to allocate? The settings in Cloudron are only the upper limit; the service may still get OOM-killed if no memory is available system-wide.
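
The point above, that a configured limit is only an upper bound and does not guarantee the memory exists, can be checked roughly by comparing the limit with the `MemAvailable` figure from `/proc/meminfo` on Linux. The sketch below is an assumption-laden illustration, not how Cloudron checks anything; the helper names and the sample figures are made up.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def limit_is_backed(limit_mb, meminfo_text):
    """Return True if MemAvailable is at least as large as the configured limit."""
    available_kb = parse_meminfo(meminfo_text).get("MemAvailable", 0)
    return available_kb >= limit_mb * 1024


# Illustrative sample only: 30 GiB total, 15 GiB available.
SAMPLE = """MemTotal:       31457280 kB
MemFree:         1048576 kB
MemAvailable:   15728640 kB
"""
```

On a real host the text would come from reading `/proc/meminfo`, for example `limit_is_backed(512, open("/proc/meminfo").read())`. A `True` result only means the memory was available at that moment.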

                                    • jdaviescoates
                                      jdaviescoates @nebulon last edited by

                                      @nebulon it was at whatever the default is (256MB?). I've now upped it to 512MB to see if that stops it. Plenty of spare RAM on the machine.

                                      • rmdes
                                        rmdes @nebulon last edited by

                                        @nebulon my graphite service has 1.60GB available and still goes OOM several times a day.
                                        The machine where Cloudron is running has 30GB available; on average 15GB is in use, leaving half of the memory free.

                                        • nebulon
                                          nebulon Staff last edited by

                                          All this does not sound right then. Do you see anything suspicious in the graphite logs as such? Like frequent restarts of something or so?

                                          • rmdes
                                            rmdes @nebulon last edited by

                                            @nebulon These are the only errors I find in the log, besides the restarts:
                                            https://paste.armada.digital/xanopucuqu.sql

                                            • robi
                                              robi @rmdes last edited by

                                              I get daily crashes too, with same/similar log messages about cache and draining issues.

                                              • rmdes
                                                rmdes @rmdes last edited by

                                                my.armada.digital_.png
                                                When graphite crashes...

                                                • robi
                                                  robi @rmdes last edited by

                                                  @rmdes It's like Graphite sees Nessie the Loch Ness monster and freaks out..

                                                  Thanks for the graphs, er laughs. 😆

                                                  • rmdes
                                                    rmdes @robi last edited by rmdes

                                                    @robi here's another one, zoomed at 24h my.armada.digital_ (1).png
                                                    Funny thing is, I understand it crashes because of memory issues (resulting from Python errors?),
                                                    but why/how does Graphite reboot itself? I mean, why does it fail to reboot for hours and then suddenly come back online?

                                                    • robi
                                                      robi @rmdes last edited by

                                                      @rmdes nice.. yep not how a health monitored app should behave.

                                                      looks like something got stuck for a while then finally failed to get kicked again.

                                                      • rmdes
                                                        rmdes last edited by

                                                        Maybe this python error can help ? https://paste.armada.digital/ovurasajof.sql

                                                        • girish
                                                          girish Staff @rmdes last edited by

                                                          @rmdes are you able to write to me on support@ and give me SSH access, so I can debug this? It would be good to understand what's happening here.

                                                          • rmdes
                                                            rmdes @girish last edited by

                                                            @girish Yes of course, doing this now, SSH has been enabled.

                                                            • girish
                                                              girish Staff @rmdes last edited by

                                                              @rmdes thanks for the access. It seems your server somehow hits this carbon cache bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923464

                                                              • girish
                                                                girish Staff @rmdes last edited by

                                                                @rmdes I have applied the patch from the bug report and it seems to fix the problem. I have applied the change to your server locally. It will be in the next release.

                                                                • rmdes
                                                                  rmdes @girish last edited by rmdes

                                                                  @girish So this only hit me?
                                                                  Anyway, thanks a lot for applying the patch locally and fixing the issue!

                                                                  • girish
                                                                    girish Staff @rmdes last edited by

                                                                    @rmdes yes, I am not sure why. It doesn't happen in any of our demo servers or managed services. Quite strange. It could also be that maybe others have hit it but have not noticed it (since it only causes a CPU spike..) but clearly it's a bug since it's been fixed upstream.
