Cloudron Forum

Sidekiq Loop takes down GitLab instance

    • timwedde
      last edited by timwedde
      #1

      After trying out the Container Registry integration (and eventually deciding that it's not that useful to us), something within the process outlined in the docs seems to have left our GitLab instance in an invalid state, breaking it to the point of being unusable.

      The problem is that, apparently, a Sidekiq job is stuck in an infinite loop: It keeps spawning. As soon as capacity is available on a Sidekiq worker, a new job pops up. The specific job is called ContainerRegistry::DeleteContainerRepositoryWorker; I have reproduced one of its queue representations below:

      {"retry"=>0, "queue"=>"default", "version"=>0, "status_expiration"=>1800, "queue_namespace"=>"container_repository_delete", "args"=>[], "class"=>"ContainerRegistry::DeleteContainerRepositoryWorker", "jid"=>"8e56ce71ec9cf7c4a0eae8fd", "created_at"=>1708097426.835311, "meta.caller_id"=>"ContainerRegistry::DeleteContainerRepositoryWorker", "correlation_id"=>"caa24956212e875d03537fd71282cba2", "meta.root_caller_id"=>"Cronjob", "meta.feature_category"=>"container_registry", "meta.client_id"=>"ip/", "worker_data_consistency"=>"always", "size_limiter"=>"validated", "enqueued_at"=>1708097426.838449}
      
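      For reference, a hash like the one above can be pulled out of the Rails console with the standard Sidekiq API. This is only a sketch, and it assumes the job really does sit on the default queue as the dump suggests:

      irb(main):001:0> q = Sidekiq::Queue.new("default")
      irb(main):002:0> q.size  # grows continuously while the loop is active
      irb(main):003:0> q.find { |job| job.klass == "ContainerRegistry::DeleteContainerRepositoryWorker" }&.item  # raw job hash, as reproduced above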

      This effectively suffocates the queue: no other task can be processed and the entire instance grinds to a halt. The web interface works, but most features simply hang because their background jobs never get the chance to run.

      Here is a very small excerpt from the instance logs:

      Feb 16 16:34:20{"severity":"INFO","time":"2024-02-16T15:34:20.544Z","retry":0,"queue":"default","version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","args":[],"class":"ContainerRegistry::DeleteContainerRepositoryWorker","jid":"1b599d25b46a565485c3f4f6","created_at":"2024-02-16T15:34:20.409Z","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"caa24956212e875d03537fd71282cba2","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2024-02-16T15:34:20.413Z","job_size_bytes":2,"pid":69152,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-1b599d25b46a565485c3f4f6: done: 0.120002 sec","job_status":"done","scheduling_latency_s":0.011408,"redis_calls":8,"redis_duration_s":0.007204,"redis_read_bytes":9,"redis_write_bytes":1246,"redis_queues_calls":4,"redis_queues_duration_s":0.002866,"redis_queues_read_bytes":5,"redis_queues_write_bytes":707,"redis_shared_state_calls":4,"redis_shared_state_duration_s":0.004338,"redis_shared_state_read_bytes":4,"redis_shared_state_write_bytes":539,"db_count":10,"db_write_count":3,"db_cached_count":1,"db_replica_count":0,"db_primary_count":10,"db_main_count":10,"db_ci_count":0,"db_main_replica_count":0,"db_ci_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":1,"db_main_cached_count":1,"db_ci_cached_count":0,"db_main_replica_cached_count":0,"db_ci_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_ci_wal_count":0,"db_main_replica_wal_count":0,"db_ci_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_ci_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_ci_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.033,"db_main_duration_s":0.033,"db_ci_duration_s":0.0,"db_main_replica_duration_s":0.0,"db_ci_replica_duration_s":0.0,"cpu_s":0.061245,"worker_id":"sidekiq","rate_limiting_gates":[],"duration_s":0.120002,"completed_at":"2024-02-16T15:34:20.544Z","load_balancing_strategy":"primary","db_duration_s":0.038678,"urgency":"low","target_duration_s":300,"target_scheduling_latency_s":60}
      Feb 16 16:34:20{"severity":"INFO","time":"2024-02-16T15:34:20.553Z","retry":0,"queue":"default","version":0,"status_expiration":1800,"queue_namespace":"container_repository_delete","args":[],"class":"ContainerRegistry::DeleteContainerRepositoryWorker","jid":"6be37d4007f559ced52a8eb5","created_at":"2024-02-16T15:34:20.535Z","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"caa24956212e875d03537fd71282cba2","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2024-02-16T15:34:20.538Z","job_size_bytes":2,"pid":69152,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-6be37d4007f559ced52a8eb5: start","job_status":"start","scheduling_latency_s":0.015051}
      

      It's just this, over and over again, multiple times a second. I've tried canceling the tasks manually via the Rails console, but they just keep spawning.
      Has anybody encountered this behavior before? Is there a way to prevent these tasks from spawning?
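
      The manual cancellation I attempted looked roughly like this (a sketch using the standard Sidekiq API; it deletes the currently queued jobs of that class, but new ones appear right away):

      # destructive: removes every queued instance of the runaway worker
      irb(main):004:0> Sidekiq::Queue.new("default").each { |job| job.delete if job.klass == "ContainerRegistry::DeleteContainerRepositoryWorker" }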

      The especially weird thing is that this only started happening after trying out the Container Registry integration, though not immediately. Even disabling the registry again does nothing to quell the issue, which is particularly frustrating: Currently I have no way back to a stable state (that I can see, at least).

      Here's a screenshot of when things went south; the spike in jobs is incredibly drastic:
      [Screenshot 2024-02-16 at 16.41.08.png: the sudden spike in Sidekiq jobs]

      Any info on this would be extremely appreciated.

      • girish (Staff)
        #2

        @timwedde there is a Sidekiq troubleshooting guide upstream at https://docs.gitlab.com/ee/administration/sidekiq/sidekiq_troubleshooting.html#remove-specific-job-id-destructive. On Cloudron, you can get a Rails console (from which Sidekiq can be inspected) by following https://docs.cloudron.io/apps/gitlab/#rails-console

        For example, on our instance:

        irb(main):002:0> Sidekiq::Queue.new("pipeline_processing:build_queue").size
        => 0
        

        I would try deleting that sidekiq job using some of the commands in https://docs.gitlab.com/ee/administration/sidekiq/sidekiq_troubleshooting.html#managing-sidekiq-queues
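
        If you want to remove a single runaway job by its JID instead, it works along these lines (a sketch based on the standard Sidekiq API; substitute a JID from your own queue dump):

        irb(main):003:0> queue = Sidekiq::Queue.new("default")
        irb(main):004:0> queue.each { |job| job.delete if job.jid == "8e56ce71ec9cf7c4a0eae8fd" }  # JID taken from the dump in the first post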

        • timwedde
          #3

          I did read through those pages, and that is in fact how I arrived at the queue state representation that I posted above. Unfortunately clearing the queue did not help for long: The tasks kept reappearing.

          I have, however, through many trials and tribulations, possibly managed to resolve this problem over the weekend: Logging into the GitLab Postgres DB, I found a table called container_repositories which contained references to the repos that were used to test the registry feature. After truncating that table, clearing the queued jobs and restarting GitLab (as well as the entire instance for good measure), it seems to have slowly reverted to normal over the past 24 hours, at least for the time being. I'm gonna keep an eye on it some more, but it's looking like that might've been it.
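
          Roughly what that cleanup amounted to, sketched from the Rails console rather than psql (this assumes ContainerRepository is the model backing the container_repositories table, and it is every bit as destructive as a TRUNCATE):

          irb(main):001:0> ContainerRepository.count       # leftover rows from the registry test (assuming this model maps to container_repositories)
          irb(main):002:0> ContainerRepository.delete_all  # removes those rows, same effect as truncating the table
          irb(main):003:0> Sidekiq::Queue.new("default").clear  # drops everything still queued on the default queue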

          It's a bit weird that GitLab would keep spinning up tasks endlessly like this, but then again it's a massive piece of software, so perhaps this is simply a strange bug in GitLab itself. In any case, the container registry integration instructions did not work well for us, since they produce this behavior whenever I try them out (I tried it twice to make sure), so maybe another look at those is warranted.
