Disk usage issues
-
My Cloudron storage consumption has ballooned again recently. It's not MySQL this time.
Here are the disk usage stats from the dashboard:
| Item | Size |
| --- | --- |
| Platform data | 37.13 GB |
| Docker images | 11.33 GB |
| matrix.example.com | 385.45 MB |
| Email data | 12.14 MB |
| Box data | 8.26 MB |
| vault.apps.example.com | 4.95 MB |
| music.apps.example.com | 4.41 MB |
| firefly.apps.example.com | 3.21 MB |
| example.com | 120 kB |
| mail.apps.example.com | 24 kB |
| chat.apps.example.com | 12 kB |
| source.example.com | Not available yet |
| Everything else (Ubuntu, Swap, etc) | 25.36 GB |
I recently used synadm to purge remote media and delete some unused rooms, which brought Matrix down to 380 MB from 12 GB. I expected that to give me more headroom, but apparently it has not.
The biggest consumers of the ~80GB total disk are:
- `/var/lib/docker/overlay2` consumes 49 GB
- `/home/yellowtent/platformdata/postgresql/12/main/base/18609` consumes 36 GB

These are the top several offenders based on `du -sh * | sort -hr` from that directory:
```
1.1G  19931.9
1.1G  19931.8
1.1G  19931.7
1.1G  19931.6
1.1G  19931.5
1.1G  19931.4
1.1G  19931.3
1.1G  19931.2
1.1G  19931.10
1.1G  19931.1
1.1G  19931
1.1G  19342.9
1.1G  19342.8
1.1G  19342.7
1.1G  19342.6
1.1G  19342.5
1.1G  19342.4
1.1G  19342.3
1.1G  19342.2
1.1G  19342.18
1.1G  19342.17
1.1G  19342.16
1.1G  19342.15
1.1G  19342.14
1.1G  19342.13
1.1G  19342.12
1.1G  19342.11
1.1G  19342.10
1.1G  19342.1
1.1G  19342
785M  18882
569M  19342.19
558M  19931.11
234M  19739
228M  19191
205M  18835
178M  18953
153M  18864
149M  19673
116M  19909
111M  19838
107M  18917
104M  18686
100M  18935
91M   18843
79M   19253
69M   18678
68M   123897
67M   19911
67M   123899
66M   19687
65M   19835
65M   19834
62M   19325
60M   19693
58M   19852
58M   19681
```
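Those numeric files under `base/18609` are PostgreSQL relation segments: `19931`, `19931.1`, ... are 1 GB slices of a single relation with filenode `19931`. A sketch for mapping filenodes back to table names from a psql session in that database (the filenode value below is taken from the listing above):

```sql
-- Resolve a filenode from base/<db_oid>/ to its relation (0 = default tablespace)
SELECT pg_filenode_relation(0, 19931);

-- Or just list the largest relations directly
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```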
-
I am not yet sure about the Postgres consumption, and whether we also have to disable some journaling or so, similar to MySQL.
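Postgres's closest analogue to that journaling is the WAL, which lives alongside the data files; a quick check (a sketch, assuming superuser access) of whether WAL or table data is eating the space:

```sql
-- Total size of WAL segments currently on disk
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();

-- Size of each database's data files
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS db_size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```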
The Docker overlay folder, however, is not actual usage. Most of that is virtual and runtime layers shared between images; the default Unix tools are simply unable to report the correct disk usage there (`docker system df` should give a more accurate accounting).
-
Automatic vacuuming is enabled by default in PostgreSQL, so maybe it's because the autovacuum worker is unable to finish. @infogulch Do you see any crashes in PostgreSQL in the logs? The full logs are at `/home/yellowtent/platformdata/logs/postgresql`.
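It's also worth checking whether autovacuum has been finishing on the big tables at all (a quick sketch against the standard statistics views, run inside the Synapse database):

```sql
-- Dead-tuple counts and the last time each table was (auto)vacuumed
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```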
-
@girish Yes, it seems to have crashed about once a week from June through August, and again from September until now.
```
root@localhost:/home/yellowtent/platformdata/logs/postgresql# ls -la
total 15672
drwxr-xr-x  2 yellowtent yellowtent     4096 Feb  4 03:58 .
drwxr-xr-x 22 yellowtent yellowtent     4096 Jan 30 00:00 ..
-rw-r--r--  1 yellowtent yellowtent  4994229 Feb  3 07:00 app.log
-rw-r--r--  1 yellowtent yellowtent 11032805 May 20  2021 app.log.1
root@localhost:/home/yellowtent/platformdata/logs/postgresql# grep -i 'server stopped' app.log
2021-06-24T16:37:18.000Z server stopped
2021-07-04T07:38:11.000Z server stopped
2021-08-13T01:11:59.000Z server stopped
2021-08-20T05:42:41.000Z server stopped
2021-08-26T02:41:56.000Z server stopped
2021-10-23T02:57:27.000Z server stopped
2021-11-08T14:39:15.000Z server stopped
2021-11-08T14:39:49.000Z server stopped
2021-11-08T14:42:02.000Z server stopped
2021-11-21T19:28:33.000Z server stopped
2021-11-21T23:23:10.000Z server stopped
2021-11-26T03:09:58.000Z server stopped
2021-12-13T02:31:36.000Z server stopped
2021-12-26T20:14:03.000Z server stopped
2021-12-28T20:00:52.000Z server stopped
2022-01-07T03:39:58.000Z server stopped
2022-01-14T05:20:33.000Z server stopped
2022-01-21T07:20:47.000Z server stopped
root@localhost:/home/yellowtent/platformdata/logs/postgresql# grep -i 'server stopped' app.log.1
2021-05-19T23:00:57.000Z server stopped
```
-
I've disabled backups while I'm dealing with this issue, because the Postgres DB is larger than all the free space remaining on the server, so backups cannot succeed until this is resolved.
- Perhaps Cloudron could stream the pg backup directly to the backup provider without requiring additional disk space on the local server.
I did some cleanup using `synadm` and manual SQL commands (see below) to delete Matrix room data that was taking up a lot of space. Now I have a table that's 31 GB with 125M dead tuples and only 4% live-tuple utilization, and it needs vacuuming very badly:

```sql
create extension pgstattuple;
SELECT tuple_percent, dead_tuple_count, dead_tuple_percent, free_space, free_percent
FROM pgstattuple('state_groups_state');
```
```
 tuple_percent | dead_tuple_count | dead_tuple_percent | free_space | free_percent
---------------+------------------+--------------------+------------+--------------
          3.72 |        125312880 |              87.23 |  854177576 |         4.06
```
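As a cross-check (a small addition, using the standard size functions), the table's on-disk footprint:

```sql
-- ~31 GB total, including indexes and TOAST
SELECT pg_size_pretty(pg_total_relation_size('state_groups_state'));
```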
Unfortunately, running vacuum against this table eventually stalls. My last attempt ran for 48 hours without progressing according to `select * from pg_stat_progress_vacuum;`, with `heap_blks_scanned` stopping at exactly `583939` blocks (a fuller monitoring query is sketched after the questions below). I found some references to the `pg_repack` tool, which reportedly performs vacuum-style table rebuilds online and so wouldn't freeze on a table lock, but I was unable to install it due to missing dependencies in the postgres container.

- Do you think we could compile and install the `pg_repack` extension as part of the standard postgresql container image?
- Any other ideas?
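For watching where a vacuum is stuck, this is a sketch built on the standard `pg_stat_progress_vacuum` columns:

```sql
-- Shows the vacuum phase and the percentage of heap blocks scanned so far
SELECT pid, phase, heap_blks_total, heap_blks_scanned,
       round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_vacuum;
```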
These are the commands I used to clear room data that was left over after running `synadm room delete ...`. Synapse's room deletion tool is vulnerable to failures partway through the deletion procedure, which leaves the room's storage frozen and impossible to free via the Synapse API. These scripts were derived by translating the Synapse room deletion Python source at purge_events.py#_purge_room_txn and store.py#_purge_room_state_txn into Postgres SQL functions. Perhaps in the future Synapse could run something like this wrapped inside a single transaction, instead of performing the steps independently in Python where they are vulnerable to interruption.

Warning: I'm unfamiliar with Synapse internals and have not vetted this solution with the Synapse developers. Posted here for learning purposes only. Proceed at your own risk. Take backups first. Etc etc.
```sql
drop function if exists delete_room;
create function delete_room(text) returns bigint as $$
declare
  d integer;
  ds bigint = 0;
  rid alias for $1;
begin
  raise info 'deleting data for room %', rid;

  -- collect every event id belonging to the room
  drop table if exists t_deleted_event_ids;
  create temp table t_deleted_event_ids(event_id text) on commit drop;
  insert into t_deleted_event_ids select event_id from events where room_id=rid;
  get diagnostics d = row_count;
  raise info 'found % event ids to purge', d;
  analyze t_deleted_event_ids;

  -- purge tables keyed by event_id
  delete from event_auth where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_auth';
  delete from event_edges where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_edges';
  delete from event_json where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_json';
  delete from event_push_actions_staging where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_push_actions_staging';
  delete from event_reference_hashes where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_reference_hashes';
  delete from event_relations where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_relations';
  delete from event_to_state_groups where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_to_state_groups';
  delete from event_auth_chains where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_auth_chains';
  delete from event_auth_chain_to_calculate where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_auth_chain_to_calculate';
  delete from redactions where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'redactions';
  delete from rejections where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'rejections';
  delete from state_events where event_id in (select event_id from t_deleted_event_ids); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'state_events';

  -- purge tables keyed by room_id
  delete from current_state_events where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'current_state_events';
  delete from destination_rooms where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'destination_rooms';
  delete from event_backward_extremities where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_backward_extremities';
  delete from event_forward_extremities where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_forward_extremities';
  delete from event_push_actions where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_push_actions';
  delete from event_search where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_search';
  delete from events where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'events';
  delete from group_rooms where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'group_rooms';
  delete from receipts_graph where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'receipts_graph';
  delete from receipts_linearized where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'receipts_linearized';
  delete from room_aliases where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_aliases';
  delete from room_depth where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_depth';
  delete from room_memberships where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_memberships';
  delete from room_stats_state where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_stats_state';
  delete from room_stats_current where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_stats_current';
  delete from room_stats_earliest_token where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_stats_earliest_token';
  delete from rooms where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'rooms';
  delete from stream_ordering_to_exterm where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'stream_ordering_to_exterm';
  delete from users_in_public_rooms where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'users_in_public_rooms';
  delete from users_who_share_private_rooms where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'users_who_share_private_rooms';
  delete from appservice_room_list where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'appservice_room_list';
  delete from e2e_room_keys where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'e2e_room_keys';
  delete from event_push_summary where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'event_push_summary';
  delete from pusher_throttle where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'pusher_throttle';
  delete from group_summary_rooms where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'group_summary_rooms';
  delete from room_account_data where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_account_data';
  delete from room_tags where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'room_tags';
  delete from local_current_membership where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'local_current_membership';
  delete from state_groups_state where room_id=rid; get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'state_groups_state';

  -- collect and purge the room's state groups
  drop table if exists t_deleted_state_groups;
  create temp table t_deleted_state_groups(state_group int) on commit drop;
  insert into t_deleted_state_groups
    select distinct state_group from events inner join event_to_state_groups using(event_id) where events.room_id=rid
    union
    select distinct id from state_groups where room_id=rid;
  get diagnostics d = row_count;
  raise info 'found % state group ids to purge', d;
  analyze t_deleted_state_groups;
  delete from state_group_edges where state_group in (select state_group from t_deleted_state_groups); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'state_group_edges';
  delete from state_groups where id in (select state_group from t_deleted_state_groups); get diagnostics d = row_count; ds := ds + d; raise info 'deleted % rows from %', d, 'state_groups';

  return ds;
end;
$$ LANGUAGE plpgsql;

drop function if exists cleanup_rooms;
-- report and purge every room that still has event/state data but no longer appears in rooms
create function cleanup_rooms() returns table(room_id text, events_found bigint, state_groups_found bigint, rows_deleted bigint) as $$
  select room_id
       , (select count(*) from events where room_id=a.room_id) as events_found
       , (select count(*) from state_groups where room_id=a.room_id) as state_groups_found
       , delete_room(room_id) as rows_deleted
  from (
    select room_id from (
      select distinct room_id from events
      union select distinct room_id from state_groups
      union select distinct room_id from state_groups_state
    ) a
    except
    select room_id from rooms
  ) a
$$ language sql;
```
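Usage is then a single call per orphaned room, or one sweep over everything (the room ID below is a made-up example):

```sql
-- Purge one room by its ID (hypothetical example room ID)
SELECT delete_room('!abcdefgh:example.com');

-- Or find and purge every room that still has event/state data
-- but no longer appears in the rooms table
SELECT * FROM cleanup_rooms();
```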
-
Ok, finally sorted, after stopping the queries that were pinning old transaction IDs and then running a full vacuum. I suppose restarting the postgres container could have accomplished the same thing.
This was useful to find blocking queries, in addition to `pg_terminate_backend`:

```sql
SELECT pid, datname, usename, state, backend_xmin, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
```
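The rough sequence that unblocked things looks like this (the pid is an example value taken from the query above, not a real one):

```sql
-- Terminate the session pinning the oldest backend_xmin (example pid)
SELECT pg_terminate_backend(12345);

-- Then rewrite the table to reclaim the dead space; note that VACUUM FULL
-- takes an exclusive lock and needs disk space for a copy of the live rows
VACUUM FULL VERBOSE state_groups_state;
```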
Went from 88% full to 38% full in about 10 minutes.