Baserow - [CRITICAL] WORKER TIMEOUT

martinkbs

Hi guys,

Since last weekend, my Baserow instance has been consuming excessive server resources (the attached screenshot shows the CPU spikes) and the frontend is not working. I only noticed the outage because the n8n workflows that use it were consistently failing.

Additionally, when reviewing the application logs, I found numerous warnings like the following:

Jul 22 18:44:59 Not configuring telemetry due to BASEROW_ENABLE_OTEL not being set.
Jul 22 18:45:00 [2024-07-22 16:45:00,098: INFO/MainProcess] Scheduler: Sending due task baserow.core.notifications.tasks.beat_send_instant_notifications_summary_by_email() (baserow.core.notifications.tasks.beat_send_instant_notifications_summary_by_email)
Jul 22 18:45:00 [2024-07-22 16:45:00,103: INFO/MainProcess] Task baserow.core.notifications.tasks.beat_send_instant_notifications_summary_by_email[1c6e7cb7-e7eb-4809-ae45-c46778e72594] received
Jul 22 18:45:00 [2024-07-22 16:45:00,304: INFO/MainProcess] Task baserow.core.notifications.tasks.singleton_send_instant_notifications_summary_by_email[bc948e50-4c8f-4229-96a2-4a79cc986bed] received
Jul 22 18:45:00 [2024-07-22 16:45:00,503: INFO/ForkPoolWorker-8] Task baserow.core.notifications.tasks.beat_send_instant_notifications_summary_by_email[1c6e7cb7-e7eb-4809-ae45-c46778e72594] succeeded in 0.3006958370006032s: None
Jul 22 18:45:00 [2024-07-22 16:45:00 +0000] [23] [WARNING] Worker with pid 226 was terminated due to signal 9
Jul 22 18:45:00 [2024-07-22 16:45:00,696: INFO/ForkPoolWorker-1] Task baserow.core.notifications.tasks.singleton_send_instant_notifications_summary_by_email[bc948e50-4c8f-4229-96a2-4a79cc986bed] succeeded in 0.2945839520007212s: None
Jul 22 18:45:00 [2024-07-22 16:45:00 +0000] [232] [INFO] Booting worker with pid: 232
Jul 22 18:45:02 [2024-07-22 16:45:01 +0000] [23] [CRITICAL] WORKER TIMEOUT (pid:227)
Jul 22 18:45:02 [2024-07-22 16:45:01 +0000] [23] [CRITICAL] WORKER TIMEOUT (pid:228)
Jul 22 18:45:02 [2024-07-22 16:45:02 +0000] [228] [INFO] Worker exiting (pid: 228)
Jul 22 18:45:02 Not configuring telemetry due to BASEROW_ENABLE_OTEL not being set.
Jul 22 18:45:02 Not configuring telemetry due to BASEROW_ENABLE_OTEL not being set.
Jul 22 18:45:02 [2024-07-22 16:45:02 +0000] [227] [INFO] Worker exiting (pid: 227)
Jul 22 18:45:03 [2024-07-22 16:45:03 +0000] [23] [WARNING] Worker with pid 227 was terminated due to signal 9
Jul 22 18:45:03 [2024-07-22 16:45:03 +0000] [233] [INFO] Booting worker with pid: 233
Jul 22 18:45:03 [2024-07-22 16:45:03 +0000] [23] [WARNING] Worker with pid 228 was terminated due to signal 9
Jul 22 18:45:03 [2024-07-22 16:45:03 +0000] [234] [INFO] Booting worker with pid: 234
Jul 22 18:45:07 => Healtheck error: Error: Timeout of 7000ms exceeded
Jul 22 18:45:07 172.18.0.1 - - [22/Jul/2024:16:45:07 +0000] "GET /_health HTTP/1.1" 499 0 "-" "Mozilla (CloudronHealth)"
Jul 22 18:45:17 => Healtheck error: Error: Timeout of 7000ms exceeded
Jul 22 18:45:17 172.18.0.1 - - [22/Jul/2024:16:45:17 +0000] "GET /_health HTTP/1.1" 499 0 "-" "Mozilla (CloudronHealth)"
Jul 22 18:45:22 [2024-07-22 16:45:22 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:229)
Jul 22 18:45:23 [2024-07-22 16:45:23 +0000] [22] [WARNING] Worker with pid 229 was terminated due to signal 6
Jul 22 18:45:23 [2024-07-22 16:45:23 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:230)
Jul 22 18:45:23 [2024-07-22 16:45:23 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:231)
Jul 22 18:45:23 [2024-07-22 16:45:23 +0000] [235] [INFO] Booting worker with pid: 235
Jul 22 18:45:24 [2024-07-22 16:45:24 +0000] [22] [WARNING] Worker with pid 230 was terminated due to signal 6
Jul 22 18:45:24 [2024-07-22 16:45:24 +0000] [236] [INFO] Booting worker with pid: 236
Jul 22 18:45:24 [2024-07-22 16:45:24 +0000] [22] [WARNING] Worker with pid 231 was terminated due to signal 9
Jul 22 18:45:24 [2024-07-22 16:45:24 +0000] [237] [INFO] Booting worker with pid: 237
Jul 22 18:45:25 172.18.0.1 - - [22/Jul/2024:16:45:25 +0000] "GET /_health HTTP/1.1" 200 162736 "-" "Mozilla (CloudronHealth)"
Jul 22 18:45:31 Not configuring telemetry due to BASEROW_ENABLE_OTEL not being set.
Jul 22 18:45:31 [2024-07-22 16:45:30 +0000] [23] [CRITICAL] WORKER TIMEOUT (pid:232)
Jul 22 18:45:31 [2024-07-22 16:45:30 +0000] [232] [INFO] Worker exiting (pid: 232)
Jul 22 18:45:32 [2024-07-22 16:45:32 +0000] [23] [WARNING] Worker with pid 232 was terminated due to signal 9
Jul 22 18:45:32 [2024-07-22 16:45:32 +0000] [238] [INFO] Booting worker with pid: 238
Jul 22 18:45:33 [2024-07-22 16:45:33 +0000] [23] [CRITICAL] WORKER TIMEOUT (pid:233)
Jul 22 18:45:33 [2024-07-22 16:45:33 +0000] [23] [CRITICAL] WORKER TIMEOUT (pid:234)
Jul 22 18:45:33 Not configuring telemetry due to BASEROW_ENABLE_OTEL not being set.
Jul 22 18:45:33 [2024-07-22 16:45:33 +0000] [234] [INFO] Worker exiting (pid: 234)
Jul 22 18:45:33 [2024-07-22 16:45:33 +0000] [233] [INFO] Worker exiting (pid: 233)
Jul 22 18:45:34 Not configuring telemetry due to BASEROW_ENABLE_OTEL not being set.
Jul 22 18:45:35 [2024-07-22 16:45:35 +0000] [23] [WARNING] Worker with pid 234 was terminated due to signal 9
Jul 22 18:45:35 [2024-07-22 16:45:35 +0000] [23] [WARNING] Worker with pid 233 was terminated due to signal 9
Jul 22 18:45:35 [2024-07-22 16:45:35 +0000] [23] [DEBUG] 2 workers
Jul 22 18:45:35 [2024-07-22 16:45:35 +0000] [240] [INFO] Booting worker with pid: 240
Jul 22 18:45:35 [2024-07-22 16:45:35 +0000] [23] [DEBUG] 3 workers
Jul 22 18:45:35 172.18.0.1 - - [22/Jul/2024:16:45:35 +0000] "GET /_health HTTP/1.1" 200 162743 "-" "Mozilla (CloudronHealth)"
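
For anyone wanting to quantify a log like the one above, here is a minimal sketch that counts the gunicorn `WORKER TIMEOUT` lines and groups the terminations by signal. The function name `summarize` and the three sample lines (copied from the log) are only for illustration. The signal names assume the usual Linux numbering (6 is SIGABRT, 9 is SIGKILL).

```python
import re
from collections import Counter

# Patterns for the two gunicorn messages that appear in the log above.
TIMEOUT_RE = re.compile(r"\[CRITICAL\] WORKER TIMEOUT \(pid:(\d+)\)")
SIGNAL_RE = re.compile(r"Worker with pid (\d+) was terminated due to signal (\d+)")

# Assumption: standard Linux signal numbers for the two values seen in the log.
SIGNAL_NAMES = {6: "SIGABRT", 9: "SIGKILL"}


def summarize(lines):
    """Return (number of worker timeouts, terminations grouped by signal name)."""
    timeouts = 0
    signals = Counter()
    for line in lines:
        if TIMEOUT_RE.search(line):
            timeouts += 1
        match = SIGNAL_RE.search(line)
        if match:
            number = int(match.group(2))
            signals[SIGNAL_NAMES.get(number, f"signal {number}")] += 1
    return timeouts, dict(signals)


# Three lines copied from the log above, used as sample input.
sample = [
    "Jul 22 18:45:22 [2024-07-22 16:45:22 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:229)",
    "Jul 22 18:45:23 [2024-07-22 16:45:23 +0000] [22] [WARNING] Worker with pid 229 was terminated due to signal 6",
    "Jul 22 18:45:24 [2024-07-22 16:45:24 +0000] [22] [WARNING] Worker with pid 231 was terminated due to signal 9",
]

print(summarize(sample))
```

Feeding it the full log file (for example `summarize(open(path))`, with `path` pointing at an exported log) shows how often workers are being killed compared with how often they time out.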

I have tried both restoring a backup from a few days earlier and updating to the latest version. Each operation takes far longer than 30 minutes, which is not normal.

Any suggestions or indications of what might be happening?

nebulon

Did this start after a Baserow update on your instance? Something must have changed to cause this, either the data or a connected service such as n8n, since you mention it.

martinkbs

It happened when upgrading from package version 1.19.2 to 1.20.0.

I tested the application in Recovery Mode by running /app/pkg/start.sh manually, and it works. When I exit Recovery Mode, however, the application stays in the Not responding state.

nebulon

Hm, this usually indicates that the app is running out of memory (recovery mode does not limit memory) or that something is trying to write to the read-only filesystem. The issue does not show up in my test instance, at least. Are there any special settings or app configs you were using?
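
Both suspicions can be checked from inside the app container with a short script. This is only a rough sketch: the directory list is an assumption to adapt, and the default `/sys/fs/cgroup/memory.max` path assumes cgroup v2 (it will simply report nothing on other setups). The helper names `is_writable` and `cgroup_memory_limit` are made up for this example.

```python
import tempfile
from pathlib import Path


def is_writable(directory):
    """Return True if a temporary file can be created inside `directory`."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers read-only filesystems, missing directories and permission errors.
        return False


def cgroup_memory_limit(path="/sys/fs/cgroup/memory.max"):
    """Return the memory limit in bytes, or None if unlimited or not readable."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        return None
    if raw == "max":  # cgroup v2 writes "max" when no limit is set
        return None
    try:
        return int(raw)
    except ValueError:
        return None


if __name__ == "__main__":
    # Assumption: these are the directories the app is expected to write to.
    for directory in ("/app/data", "/tmp", "/run"):
        print(directory, "writable" if is_writable(directory) else "NOT writable")
    print("memory limit (bytes):", cgroup_memory_limit())
```

Running it once in recovery mode and once in normal mode, and comparing the output, should show whether either of the two conditions actually differs between the modes.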

martinkbs

Hi @nebulon

No, the Baserow application has the standard configuration it was installed with. In daily use it only stores about 100 rows in a table corresponding to the current month. It had been operating normally until this anomaly occurred.

I have raised the memory available to the application to the maximum, and it still does not work. The application is shown as running, but the frontend cannot be accessed.

Furthermore, as you can see in the following screenshot, the memory barely exceeds normal consumption, but the CPU consumption slows down the operation of other applications on the server.

However, in recovery mode, with an allocation of 8 GB of memory, both memory and CPU consumption are normal (you can see this in the hours before the screenshot, before the changes).

nebulon

It very much seems like you are hitting an app bug in the new version. I am not sure we can help you debug this without app-specific expertise. Have you already contacted the upstream app developers or checked their issue tracker about this?

To stay in operation, have you tried to clone the non-updated, working version and then only update the cloned app to see if this is easily reproducible?


Cloudron Forum
