Task 'mvvolume.sh' dies from oom-kill during move of data-directory from local drive to NFS share
Just to close this one out - I added more physical RAM to the box and upped the allocation to the VM to 48Gb, re-ran the migration, and it worked fine. Still not sure why moving files should cause this type of oom-kill, but more RAM was the solution.
-
After Ubuntu 22/24 Upgrade syslog getting spammed and grows way too much clogging up the diskspace
Applied this fix tonight and it worked for me to quiet the errors in the syslog file - thanks for finding and sharing the solution.
-
Task 'mvvolume.sh' dies from oom-kill during move of data-directory from local drive to NFS share
I upped the VM memory allocation from 24Gb to 30Gb (the whole box has 32Gb, I wanted to leave a little for the hypervisor).
Same problem. Partway through the file copy the oom-kill steps in and spoils the party. The total-vm size reported by the oom-killer was a little larger - so maybe something is using a lot of RAM during that process - but I cannot for the life of me think what it would be.
Have ordered more physical RAM for the server, will add that in the next service window and re-try this to see if it helps.
What's odd is that outside the VM there is no evidence of that amount of memory usage. The Proxmox host itself tracks the memory usage and it's not changing significantly during that time, and looking at the system monitoring inside the Ubuntu VM shows a peak memory usage of less than 10Gb overall.
-
Task 'mvvolume.sh' dies from oom-kill during move of data-directory from local drive to NFS share
Looking at the source of the .sh file here - https://git.cloudron.io/platform/box/-/blob/master/src/scripts/mvvolume.sh?ref_type=heads
I'd guess that replacing these lines:
find "${source_dir}" -maxdepth 1 -mindepth 1 -not -wholename "${target_dir}" -exec cp -ar '{}' "${target_dir}" \;
find "${source_dir}" -maxdepth 1 -mindepth 1 -not -wholename "${target_dir}" -exec rm -rf '{}' \;
with something along the lines of:
rsync -a --remove-source-files --exclude "${target_dir}" "${source_dir}/" "${target_dir}/"
may end up being less of a resource hog.
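For anyone who wants to poke at this by hand first, here's a rough way to sanity-check that rsync form against throwaway directories before pointing it at real data (the /tmp paths below are placeholders, not anything Cloudron actually uses):
mkdir -p /tmp/mvvolume-src/subdir /tmp/mvvolume-dst
echo "test" > /tmp/mvvolume-src/subdir/file.txt
# --dry-run makes rsync report what it would do without changing anything
rsync -a --remove-source-files --dry-run --exclude "/tmp/mvvolume-dst" "/tmp/mvvolume-src/" "/tmp/mvvolume-dst/"
# note: --remove-source-files removes the copied files but leaves the (now empty) source directories behind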
I'll schedule an increase of RAM for the VM this weekend and reboot it all - see if that helps.
-
Task 'mvvolume.sh' dies from oom-kill during move of data-directory from local drive to NFS share
Additional output from dmesg
around the oom-kill below. It looks like the process is getting up to ~11Gb before being killed, but even that should fit on the box with no issues based on the free memory. The system is running in a VM, so I could assign it more overall RAM to see if that helps, but would prefer not to restart everything in the middle of the week to increase this.[106869.999957] cp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [106869.999964] CPU: 7 PID: 1054475 Comm: cp Not tainted 6.8.0-55-generic #57-Ubuntu [106869.999967] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [106869.999968] Call Trace: [106869.999969] <TASK> [106869.999972] dump_stack_lvl+0x76/0xa0 [106869.999976] dump_stack+0x10/0x20 [106869.999978] dump_header+0x47/0x1f0 [106869.999981] oom_kill_process+0x118/0x280 [106869.999983] out_of_memory+0x103/0x350 [106869.999984] mem_cgroup_out_of_memory+0x145/0x170 [106869.999987] try_charge_memcg+0x6d6/0x7d0 [106870.000021] mem_cgroup_swapin_charge_folio+0x7d/0x160 [106870.000023] __read_swap_cache_async+0x215/0x2b0 [106870.000025] swap_cluster_readahead+0x192/0x330 [106870.000027] swapin_readahead+0x7f/0x110 [106870.000028] do_swap_page+0x281/0xad0 [106870.000030] ? __pte_offset_map+0x1c/0x1b0 [106870.000033] handle_pte_fault+0x17b/0x1d0 [106870.000034] __handle_mm_fault+0x654/0x800 [106870.000036] handle_mm_fault+0x18a/0x380 [106870.000037] do_user_addr_fault+0x1f4/0x670 [106870.000039] ? virtscsi_queuecommand+0x149/0x3c0 [106870.000042] exc_page_fault+0x83/0x1b0 [106870.000044] asm_exc_page_fault+0x27/0x30 [106870.000046] RIP: 0010:_copy_to_iter+0x98/0x590 [106870.000049] Code: 48 89 d9 31 f6 48 01 d7 48 01 f9 40 0f 92 c6 48 85 c9 0f 88 0a 01 00 00 48 85 f6 0f 85 01 01 00 00 0f 01 cb 48 89 d9 4c 89 ee <f3> a4 0f 1f 00 0f 01 ca 49 8b 54 24 08 49 89 de 49 8b 44 24 18 49 [106870.000050] RSP: 0018:ffffb5cf4fdaba50 EFLAGS: 00050246 [106870.000052] RAX: 0000000000071000 RBX: 0000000000001000 RCX: 0000000000001000 [106870.000053] RDX: 000000000008f000 RSI: ffff8aaa4d46b000 RDI: 00007367cc58e000 [106870.000054] RBP: ffffb5cf4fdabac8 R08: 0000000000000000 R09: 0000000000000000 [106870.000055] R10: 000000004948f000 R11: ffff8aaefce4f968 R12: ffffb5cf4fdabcb0 [106870.000055] R13: ffff8aaa4d46b000 R14: 0000000000000000 R15: 0000000000001000 [106870.000058] ? filemap_get_pages+0xa9/0x3b0 [106870.000061] copy_page_to_iter+0x9f/0x170 [106870.000063] filemap_read+0x227/0x470 [106870.000065] generic_file_read_iter+0xbb/0x110 [106870.000067] ext4_file_read_iter+0x63/0x210 [106870.000070] vfs_read+0x25c/0x390 [106870.000073] ksys_read+0x73/0x100 [106870.000075] __x64_sys_read+0x19/0x30 [106870.000077] x64_sys_call+0x1bf0/0x25a0 [106870.000079] do_syscall_64+0x7f/0x180 [106870.000081] ? handle_pte_fault+0x17b/0x1d0 [106870.000082] ? __handle_mm_fault+0x654/0x800 [106870.000084] ? rseq_get_rseq_cs+0x22/0x280 [106870.000086] ? rseq_ip_fixup+0x90/0x1f0 [106870.000088] ? count_memcg_events.constprop.0+0x2a/0x50 [106870.000090] ? irqentry_exit_to_user_mode+0x7b/0x260 [106870.000091] ? irqentry_exit+0x43/0x50 [106870.000092] ? clear_bhb_loop+0x15/0x70 [106870.000094] ? clear_bhb_loop+0x15/0x70 [106870.000095] ? 
clear_bhb_loop+0x15/0x70 [106870.000097] entry_SYSCALL_64_after_hwframe+0x78/0x80 [106870.000098] RIP: 0033:0x7367cc71ba61 [106870.000110] Code: 00 48 8b 15 b9 73 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 40 c4 01 00 f3 0f 1e fa 80 3d e5 f5 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec [106870.000111] RSP: 002b:00007ffd2da52e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [106870.000112] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007367cc71ba61 [106870.000113] RDX: 0000000000100000 RSI: 00007367cc4ff000 RDI: 0000000000000004 [106870.000119] RBP: 00007ffd2da52f60 R08: 00000000ffffffff R09: 00007367cc5ff000 [106870.000119] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 [106870.000120] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [106870.000121] </TASK> [106870.000123] memory: usage 1048576kB, limit 1048576kB, failcnt 786574 [106870.000124] swap: usage 35080kB, limit 9007199254740988kB, failcnt 0 [106870.000125] Memory cgroup stats for /system.slice/box-task-146.service: [106870.000145] anon 0 [106870.000146] file 1065398272 [106870.000146] kernel 8138752 [106870.000147] kernel_stack 229376 [106870.000147] pagetables 1585152 [106870.000148] sec_pagetables 0 [106870.000148] percpu 960 [106870.000148] sock 0 [106870.000149] vmalloc 0 [106870.000149] shmem 0 [106870.000150] zswap 0 [106870.000150] zswapped 0 [106870.000150] file_mapped 0 [106870.000151] file_dirty 1065353216 [106870.000151] file_writeback 200704 [106870.000152] swapcached 204800 [106870.000152] anon_thp 0 [106870.000152] file_thp 0 [106870.000153] shmem_thp 0 [106870.000153] inactive_anon 204800 [106870.000154] active_anon 0 [106870.000154] inactive_file 1065324544 [106870.000154] active_file 40960 [106870.000155] unevictable 0 [106870.000155] slab_reclaimable 5885024 [106870.000155] slab_unreclaimable 379336 [106870.000156] slab 6264360 [106870.000156] workingset_refault_anon 2653 [106870.000157] workingset_refault_file 70006 [106870.000157] workingset_activate_anon 272 [106870.000157] workingset_activate_file 30798 [106870.000170] workingset_restore_anon 272 [106870.000171] workingset_restore_file 20 [106870.000171] workingset_nodereclaim 0 [106870.000172] pgscan 935071 [106870.000172] pgsteal 894602 [106870.000173] pgscan_kswapd 0 [106870.000173] pgscan_direct 935071 [106870.000174] pgscan_khugepaged 0 [106870.000180] pgsteal_kswapd 0 [106870.000180] pgsteal_direct 894602 [106870.000181] pgsteal_khugepaged 0 [106870.000181] pgfault 28424 [106870.000182] pgmajfault 622 [106870.000183] pgrefill 466537803 [106870.000183] pgactivate 28389 [106870.000184] pgdeactivate 0 [106870.000184] pglazyfree 0 [106870.000185] pglazyfreed 0 [106870.000185] zswpin 0 [106870.000185] zswpout 0 [106870.000186] zswpwb 0 [106870.000186] thp_fault_alloc 0 [106870.000187] thp_collapse_alloc 0 [106870.000187] thp_swpout 0 [106870.000187] thp_swpout_fallback 0 [106870.000188] Tasks state (memory values in pages): [106870.000188] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name [106870.000190] [1054414] 808 1054414 2882105 10976 0 10976 0 1343488 7840 0 node [106870.000192] [1054470] 808 1054470 4189 1472 0 1472 0 77824 288 0 sudo [106870.000195] [1054471] 0 1054471 1835 832 0 832 0 57344 64 0 mvvolume.sh [106870.000197] [1054474] 0 1054474 1734 640 0 640 0 53248 32 0 find [106870.000198] [1054475] 0 1054475 1924 512 0 512 0 53248 320 0 cp [106870.000199] 
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=box-task-146.service,mems_allowed=0,oom_memcg=/system.slice/box-task-146.service,task_memcg=/system.slice/box-task-146.service,task=node,pid=1054414,uid=808 [106870.000240] Memory cgroup out of memory: Killed process 1054414 (node) total-vm:11528420kB, anon-rss:0kB, file-rss:43904kB, shmem-rss:0kB, UID:808 pgtables:1312kB oom_score_adj:0
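The dump above reports a memory-cgroup OOM (constraint=CONSTRAINT_MEMCG) against /system.slice/box-task-146.service, whose limit is 1048576kB - i.e. the limit on the task's cgroup rather than the VM's total memory. A rough way to inspect that limit while a task is running (service name taken from the log above; this assumes the usual cgroup v2 layout, and the unit only exists for the duration of the task):
systemctl show -p MemoryMax box-task-146.service
cat /sys/fs/cgroup/system.slice/box-task-146.service/memory.max
cat /sys/fs/cgroup/system.slice/box-task-146.service/memory.current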
-
Task 'mvvolume.sh' dies from oom-kill during move of data-directory from local drive to NFS share
I have a Nextcloud app instance that I want to move from the local data-directory to an NFS share. The process starts ok, but after a few gig of transfer the task is killed claiming it is out of memory. I've done the same process on other apps without problems, but this app has the most data to move (approximately 22Gb so far).
Logs from the container at the time it dies:
Mar 17 20:32:43 box:tasks update 146: {"percent":60,"message":"Moving data dir"}
Mar 17 20:32:43 box:apptask moveDataDir: migrating data from /home/yellowtent/appsdata/ce1c66ad-34d1-43c6-abba-46e24aa1aac9/data to /mnt/volumes/88f699476db947efa134d42b4abc6885
Mar 17 20:32:43 box:shell apptask /usr/bin/sudo -S /home/yellowtent/box/src/scripts/mvvolume.sh /home/yellowtent/appsdata/ce1c66ad-34d1-43c6-abba-46e24aa1aac9/data /mnt/volumes/88f699476db947efa134d42b4abc6885
Logs from journalctl --system:
The box itself has plenty of system RAM free - output of free -h:
The Nextcloud app instance initially had 600Mb assigned; I increased this to 2Gb, 4Gb and 8Gb, but the same thing happened each time - the mvvolume.sh task died having used 1.0Gb of memory at peak.
Any ideas on what else could be tweaked or changed to allow that script to complete successfully?
Thanks.
-
Cloudron + Proxmox + Cloudflare tunnels
I struggled to find information on how to make Cloudron work behind Cloudflare tunnels, so I took a crack at it and figured I'd share my notes in the hope they can save someone else from banging their head on the desk quite as much.
Goal
Get Cloudron running with Nextcloud installed as an app, on Proxmox, behind a router that we have no admin control over, and available via a proper domain name. To do this we'll use a Cloudflare tunnel. (This requires your DNS to be managed at Cloudflare.)
Cloudflare tunnel setup
Once the Cloudflare tunnel is set up, the Proxmox host will create the tunnel out to the Cloudflare edge servers (avoiding the need to set anything up on the router to allow inbound traffic). The tunnel configuration controls what is allowed into the tunnel, and also what the cloudflared daemon will do with the traffic once it arrives on the Proxmox host. This setup will allow some traffic to localhost to access the Proxmox GUI, and other traffic to be sent to other IPs/ports on the local private network, specifically our Cloudron box/VM inside Proxmox.
- Starting with a working Proxmox host that has access to the internet, but is on a private IP address behind a router that we do not have admin access to (so we cannot set up a DMZ host, port forwarding, etc.).
- You'll also need a domain to use (e.g. example.com) that is already set up on Cloudflare, with Cloudflare running the DNS.
- On Cloudflare, set up "Zero Trust" - you'll need billing info entered, but you can select the free plan so the card will not be charged.
- In the "Zero Trust" section, go to "Tunnels" and choose "create a tunnel"
- Give the tunnel a name (e.g. cloudron-pve), and then select the Debian, 64-bit option for the connector installer details.
- On the Proxmox host, run the connector installer scripts:
curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb && sudo dpkg -i cloudflared.deb && sudo cloudflared service install <your token details here>
- On the tunnel config, add a public hostname. The hostname will be something to access the Proxmox GUI, e.g. proxmox.example.com. Set the type to "HTTPS" and the URL to https://localhost:8006. Under "Additional application settings" -> "TLS", enable "No TLS Verify" so that Cloudflare does not freak out about the self-signed certificates on Proxmox. (A quick check of the connector and the new hostname is sketched after this list.)
Cloudron setup
- On the Proxmox host, create a VM (not an LXC container) and install the current Ubuntu LTS server. Ensure the CPU type is set to host so that AVX support is exposed to the VM (required since Cloudron 7.6 updated the MongoDB version to 5.0). (A command-line sketch covering this and the static-IP step is after this list.)
- Set a static IP address for the VM, one that is in the router's network range but won't be handed out by the router's DHCP if possible - e.g. 192.168.0.201.
- Install Cloudron on the server (you need sudo/root for the last command), using the commands from the Cloudron website:
wget https://cloudron.io/cloudron-setup
chmod +x ./cloudron-setup
./cloudron-setup
- (If you have a local machine on the same local network as the Cloudron VM, use that for the setup steps; if not, install another Linux VM with a GUI on the Proxmox box and use a browser in the Proxmox console for these steps.)
- On the Cloudron setup web page, add the domain (example.com), choose Cloudflare as the DNS provider, and set the API token (I use the API token option).
- Under "Advanced settings" for the network setup, choose "Network Interface" as the provider and then add the interface name in the box below (e.g. eth0 or ens18).
- Finish the Cloudron setup steps for user details etc.
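For reference, a rough command-line version of the CPU-type and static-IP steps above. The VM ID (100), interface name (ens18), gateway and DNS servers are placeholders/assumptions - adjust them for your own setup; only the 192.168.0.201 address comes from the example used in these notes:
# On the Proxmox host: set the VM's CPU type to host
qm set 100 --cpu host
# Inside the Ubuntu VM: confirm AVX is visible to the guest (should print a non-zero count)
grep -c avx /proc/cpuinfo
# Inside the Ubuntu VM: one way to pin the static IP with netplan
sudo tee /etc/netplan/01-static.yaml >/dev/null <<'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [192.168.0.201/24]
      routes:
        - to: default
          via: 192.168.0.1
      nameservers:
        addresses: [1.1.1.1, 8.8.8.8]
EOF
sudo netplan apply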
Configure tunnel to allow access to Cloudron GUI
- Go back to the Cloudflare domain DNS; there should be an A record that has been created for my.example.com - delete this.
- On the Cloudflare Zero Trust -> Tunnels -> cloudron-pve tunnel, add another public hostname my.example.com, set the type to "HTTPS" and the URL to https://192.168.0.201, and again set the Advanced -> TLS -> No TLS Verify option.
- At this point you should be able to access the Cloudron GUI in a web browser, locally or remotely, via the https://my.example.com address. (A quick DNS/HTTP check for this is sketched after this list.)
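A quick way to confirm the cut-over: the hostname should now resolve to Cloudflare edge addresses (via the tunnel's CNAME) rather than the private 192.168.0.201 address, and Cloudron should answer through the tunnel. A rough check, using the example hostname from above:
# Should return Cloudflare IPs, not 192.168.0.201
dig +short my.example.com
# Should return an HTTP response from Cloudron via the tunnel
curl -sI https://my.example.com | head -n 5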
App installation/setup
Adding apps via the Cloudflare tunnel has a couple of extra steps vs a normal Cloudron app install.
- Install the app from the Cloudron App Store as per usual - let it finish installing completely. For testing I installed Nextcloud, and gave it the name nc on my domain.
- Once the app has finished installing and shows "running" as the status, go to the Cloudflare DNS settings for the example.com domain, and find the newly added A record (nc), which should be pointing at the private IP of the interface you set Cloudron up on (192.168.0.201). Delete that A record. (A rough way to script this lookup/delete via the Cloudflare API is sketched after this list.)
- Go to the Zero Trust -> Tunnels -> cloudron-pve tunnel, and add another public hostname. Set the type to "HTTPS" and the URL to https://192.168.0.201, and set the Advanced -> TLS -> No TLS Verify option.
- You should now be able to access the Nextcloud GUI in a web browser, locally or remotely, via https://nc.example.com
Repeat the install -> delete A record -> add public hostname each time you add another app.
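If you end up doing this for lots of apps, the "delete the stray A record" step can also be scripted against the Cloudflare API rather than clicked through the dashboard. A rough sketch - it assumes a zone ID and an API token with DNS edit rights in the CF_ZONE_ID and CF_API_TOKEN variables, jq installed, and the nc.example.com record from the example above:
# Find the A record Cloudron published for the app
RECORD_ID=$(curl -s "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records?type=A&name=nc.example.com" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" | jq -r '.result[0].id')
# Delete it, leaving only the tunnel's CNAME for the public hostname
curl -s -X DELETE "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${RECORD_ID}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}"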
Caveats
If you re-publish the DNS records, Cloudflare will end up with both an A record pointing to the private IP, and a CNAME record pointing to the tunnel. That's likely to cause problems. I'm uncertain if Cloudflare will prevent the A records being added, or if it'll break in new and interesting ways. Any issues can be resolved by deleting the A records again, and ensuring the tunnel CNAME records are there (re-enter the public hostnames if needed).
If you want to use a subdomain (e.g. cloudron.example.com) as your base domain, and Cloudflare knows only the top-level example.com zone, then you'll need to pay Cloudflare for their advanced TLS option to generate certificates for the sub-subdomain names (e.g. my.cloudron.example.com).
These notes were brain dumped after I worked through this and finally got it all working. Their accuracy is not guaranteed...
General reminder - don't expose stuff to the internet unless you know roughly what you are doing. Are you really sure you want your Proxmox GUI out there on the internet? You can limit access to the things in the Cloudflare tunnel using the Cloudflare "Access" controls, but this is left as an exercise for the reader.