@alex-adestech That was quite some debugging session 🙂
Wanted to leave some notes here... The server was an EC2 r5.xlarge instance. It worked well, but resizing any app would just hang, and eventually the whole server would stop responding. One curious thing: the instance has 32GB of RAM, and ~20GB of it was sitting in buff/cache in the `free -m` output. I had never seen the kernel cache that much. We also found this backtrace in the `dmesg` output:
```
INFO: task docker:111571 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
docker D 0000000000000000 0 111571 1 0x00000080
ffff881c01527ab0 0000000000000086 ffff881c332f5080 ffff881c01527fd8
ffff881c01527fd8 ffff881c01527fd8 ffff881c332f5080 ffff881c01527bf0
ffff881c01527bf8 7fffffffffffffff ffff881c332f5080 0000000000000000
Call Trace:
[<ffffffff8163a909>] schedule+0x29/0x70
[<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
[<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
[<ffffffff8163acd6>] wait_for_completion+0x116/0x170
[<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
[<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
[<ffffffff810ab190>] ? call_srcu+0x70/0x70
[<ffffffff81219ebf>] ? __sync_blockdev+0x1f/0x40
[<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
[<ffffffffa000318d>] __dm_suspend+0x5d/0x220 [dm_mod]
[<ffffffffa0004c9a>] dm_suspend+0xca/0xf0 [dm_mod]
[<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[<ffffffffa000a174>] dev_suspend+0x194/0x250 [dm_mod]
[<ffffffffa0009fe0>] ? table_load+0x380/0x380 [dm_mod]
[<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[<ffffffff8128bc6e>] ? file_has_perm+0xae/0xc0
[<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[<ffffffff816408d9>] ? do_async_page_fault+0x29/0xe0
[<ffffffff81645909>] system_call_fastpath+0x16/0x1b
```
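A quick way to see how much of that buff/cache is actually dirty data waiting to be written back (these are standard `/proc/meminfo` fields, nothing specific to this server):

```
# Watch dirty pages and pages currently under writeback, refreshed every second
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
```

If `Dirty` keeps growing while `Writeback` barely moves, the flusher cannot keep up with the incoming writes.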
Googling that backtrace led to this Red Hat article, but the answer to it is locked. More digging turned up answers like this and this. The final answer was found here:
```
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5
```
With the explanation: "By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk, causing all following IOs to go synchronous. For flushing out this data to disk there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data within 120 seconds." On a 32GB instance that 40% works out to ~12.8GB of dirty data before writes turn synchronous; `vm.dirty_ratio=10` caps it at ~3.2GB, which gives the disk a much smaller backlog to flush within the 120-second limit. Crazy 🙂 After we applied those settings, it actually worked (!). Still can't believe that the choice of AWS instance matters this much.
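One follow-up note for anyone landing here later: `sysctl -w` only changes the running kernel, so these values are lost on reboot. To make them permanent, the standard approach is a sysctl drop-in config (the file name here is just an example):

```
# Persist the lower dirty-page thresholds across reboots
echo 'vm.dirty_ratio = 10'           | sudo tee /etc/sysctl.d/99-dirty.conf
echo 'vm.dirty_background_ratio = 5' | sudo tee -a /etc/sysctl.d/99-dirty.conf
sudo sysctl --system   # reload all sysctl configuration files
```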