Solved Nextcloud in Error state even though it's running (after Cloudron 5.5 update)
-
Experiencing an odd behaviour on one of my servers with a 700GB Nextcloud instance. The dashboard/app info says "Error : - Error restoring postgresql. Status code: 500 message: Failed to import database. Code: 3"
Restarting the app didn't change anything, and stopping doesn't work because it's in an erroneous state.
Error logs show this:
Aug 13 01:12:04 box:tasks setCompleted - 4453: {"result":null,"error":{"stack":"BoxError: Unknown install command in apptask:error\n at /home/yellowtent/box/src/apptask.js:1070:29\n at /home/yellowtent/box/src/apps.js:520:13\n at Query.<anonymous> (/home/yellowtent/box/src/appdb.js:147:13)\n at Query.<anonymous> (/home/yellowtent/box/node_modules/mysql/lib/Connection.js:526:10)\n at Query._callback (/home/yellowtent/box/node_modules/mysql/lib/Connection.js:488:16)\n at Query.Sequence.end (/home/yellowtent/box/node_modules/mysql/lib/protocol/sequences/Sequence.js:83:24)\n at Query._handleFinalResultPacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/sequences/Query.js:149:8)\n at Query.EofPacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/sequences/Query.js:133:8)\n at Protocol._parsePacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/Protocol.js:291:23)\n at Parser._parsePacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/Parser.js:433:10)","name":"BoxError","reason":"Internal Error","details":{},"message":"Unknown install command in apptask:error"}}
Aug 13 01:12:04 box:tasks 4453: {"percent":100,"result":null,"error":{"stack":"BoxError: Unknown install command in apptask:error\n at /home/yellowtent/box/src/apptask.js:1070:29\n at /home/yellowtent/box/src/apps.js:520:13\n at Query.<anonymous> (/home/yellowtent/box/src/appdb.js:147:13)\n at Query.<anonymous> (/home/yellowtent/box/node_modules/mysql/lib/Connection.js:526:10)\n at Query._callback (/home/yellowtent/box/node_modules/mysql/lib/Connection.js:488:16)\n at Query.Sequence.end (/home/yellowtent/box/node_modules/mysql/lib/protocol/sequences/Sequence.js:83:24)\n at Query._handleFinalResultPacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/sequences/Query.js:149:8)\n at Query.EofPacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/sequences/Query.js:133:8)\n at Protocol._parsePacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/Protocol.js:291:23)\n at Parser._parsePacket (/home/yellowtent/box/node_modules/mysql/lib/protocol/Parser.js:433:10)","name":"BoxError","reason":"Internal Error","details":{},"message":"Unknown install command in apptask:error"}}
Not sure if those are related, but the app is still up and running. Any suggestions on what to do?
-
Did you attempt to retry the restore in the repair section of the app configure view?
-
Just had a quick restore session with @girish. He suggested that even though postgres had 3.5GB of RAM available, this still wasn't enough to import/migrate a 400MB+ dump of the database. We upped the limit to 4GB and did another restore, which fixed the app status. I have rescanned the files now and am waiting for feedback on whether anything else is missing.
-
What happened was that the db migration failed because postgres wanted more memory. What I did was to give it more memory and trigger an in-place import. That did the trick.
-
Something is not right with my Nextcloud instance, either, after the Cloudron 5.5 update. I had to increase the memory to 8 GB and CPU to 50 %, otherwise the app was in a "not responding" state.
All clients (mac, PC, iOS) are in an endless loop to sync but never actually do. My Nextcloud website takes forever to load (all other Cloudron services like FreshRSS are fine). I re-setup the iOS client which takes forever. After entering credentials on the login dialog, I'm not being redirected to the app but I see a webview of Nextcloud.
The Nextcloud logs don't show anything odd at a first glance except this:
"Aug 18 09:23:29 [Tue Aug 18 07:23:29.826254 2020] [rewrite:error] [pid 8495] [client 172.18.0.1:50318] AH00670: Options FollowSymLinks and SymLinksIfOwnerMatch are both off, so the RewriteRule directive is also forbidden due to its similar ability to circumvent directory restrictions : /app/code/config"
and (?)
Aug 18 09:27:12 58:C 18 Aug 07:27:12.149 * DB saved on disk
Aug 18 09:27:12 58:C 18 Aug 07:27:12.159 * RDB: 0 MB of memory used by copy-on-write
Aug 18 09:27:12 15:M 18 Aug 07:27:12.220 * Background saving terminated with success
-
Also: Suddenly there's a new folder "uploads" that wasn't there before and that I didn't create.
-
I think the culprit is PostgreSQL 11 - was that recently changed in the Nextcloud Docker? My CPU runs at 100 % the whole time....
-
Yes, Cloudron moved to Postgres 11 in the previous release (Cloudron 5.5). Can you just try restarting Postgres under services?
Another thing: in /home/yellowtent/platformdata/logs/box.log, do you see some error like "Error importing postgresql"?
-
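The box.log check suggested above could be scripted roughly as follows. This is only a sketch: the function name find_import_errors is made up for illustration, and the search strings are simply the messages quoted in this thread ("Error importing postgresql", "Error restoring postgresql", "Failed to import database"); the exact wording in a given box.log may differ.

```shell
#!/bin/sh
# Sketch: print log lines (with line numbers) that look like a failed
# PostgreSQL import. find_import_errors is a hypothetical helper name;
# the patterns are the messages quoted in this thread, nothing more.
find_import_errors() {
    # $1: path to a log file, e.g. /home/yellowtent/platformdata/logs/box.log
    grep -n -i -E 'Error (importing|restoring) postgresql|Failed to import database' "$1"
}
```

Usage on the server would be something like: find_import_errors /home/yellowtent/platformdata/logs/box.log. No output means none of these messages were found in that file.
-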
@necrevistonnezr said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
Aug 18 09:27:12 58:C 18 Aug 07:27:12.149 * DB saved on disk
Aug 18 09:27:12 58:C 18 Aug 07:27:12.159 * RDB: 0 MB of memory used by copy-on-write
Aug 18 09:27:12 15:M 18 Aug 07:27:12.220 * Background saving terminated with success
This one is from redis, you can ignore it.
-
@girish said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
Yes, Cloudron moved to Postgres 11 in the previous release (Cloudron 5.5). Can you just try restarting Postgres under services?
Another thing: in /home/yellowtent/platformdata/logs/box.log, do you see some error like "Error importing postgresql"?
No error in box.log.
After restarting Postgres, it immediately goes back to 100 % CPU.
-
Just to narrow the issue down, if you stop the nextcloud app, does the postgresql cpu usage go back to normal? From the screenshot it seems it's busy in some SELECT command.
-
@girish said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
Just to narrow the issue down, if you stop the nextcloud app, does the postgresql cpu usage go back to normal? From the screenshot it seems it's busy in some SELECT command.
After stopping the app, CPU cores go down to the usual 5-15 %
-
@necrevistonnezr Do you think you can stop the existing nextcloud and then maybe clone from the latest backup and check if postgres is still using a lot of CPU? If it works out, maybe you can then just move stopped nextcloud into another domain and then put the cloned one there.
-
@girish said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
@necrevistonnezr Do you think you can stop the existing nextcloud and then maybe clone from the latest backup and check if postgres is still using a lot of CPU? If it works out, maybe you can then just move stopped nextcloud into another domain and then put the cloned one there.
Clone Nextcloud into another subdomain you mean? How do I do that?
EDIT: Found it.
-
What was the root cause if you found it?
On a side note, postgres really gets hammered with SELECTs during, for example, a rescan of nextcloud files.
-
@nebulon said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
What was the root cause if you found it?
On a side note postgres really gets hammered with SELECTs during for example a rescan of nextcloud files.
I meant I found the cloning process, I haven't found the cause for the CPU spikes.
I'm trying to clone a backup to a new subdomain but I don't have enough free space to clone a 300 GB Nextcloud instance...
-
@girish said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
@necrevistonnezr Do you think you can stop the existing nextcloud and then maybe clone from the latest backup and check if postgres is still using a lot of CPU? If it works out, maybe you can then just move stopped nextcloud into another domain and then put the cloned one there.
I did that now. Took 10 hours. Result is the same. 100 % CPU on Postgres on Nextcloud (app id 410c...). This is HUGELY frustrating. And I can't even login, it takes forever.
-
For a start, do you have some nextcloud client running on your laptop or so? Maybe that fires requests like crazy and is hammering postgres as a result?
-
No, I switched off all clients on purpose - and after cloning to a new subdomain, there would be no connection, anyway.
-
Maybe some plugin causes this? Can you use the occ tool via terminal into the app to disable some?
-
@nebulon said in Nextcloud in Error state even though it's running (after Cloudron 5.5 update):
maybe some plugin causes this? Can you use the occ tool via terminal into the app to disable some?
I think only the bare minimum is enabled....
occ app:list
Enabled:
  - accessibility: 1.5.0
  - activity: 2.12.0
  - admin_audit: 1.9.0
  - calendar: 2.0.3
  - cloud_federation_api: 1.2.0
  - comments: 1.9.0
  - contacts: 3.3.0
  - contactsinteraction: 1.0.0
  - dav: 1.15.0
  - encryption: 2.7.0
  - federatedfilesharing: 1.9.0
  - files: 1.14.0
  - files_external: 1.10.0
  - files_pdfviewer: 1.8.0
  - files_rightclick: 0.16.0
  - files_sharing: 1.11.0
  - files_trashbin: 1.9.0
  - files_versions: 1.12.0
  - files_videoplayer: 1.8.0
  - firstrunwizard: 2.8.0
  - logreader: 2.4.0
  - lookup_server_connector: 1.7.0
  - nextcloud_announcements: 1.8.0
  - notifications: 2.7.0
  - oauth2: 1.7.0
  - password_policy: 1.9.1
  - photos: 1.1.0
  - privacy: 1.3.0
  - provisioning_api: 1.9.0
  - recommendations: 0.7.0
  - serverinfo: 1.9.0
  - settings: 1.1.0
  - sharebymail: 1.9.0
  - spreed: 9.0.3
  - support: 1.2.1
  - systemtags: 1.9.0
  - text: 3.0.1
  - theming: 1.10.0
  - twofactor_backupcodes: 1.8.0
  - twofactor_totp: 4.1.3
  - updatenotification: 1.9.0
  - user_ldap: 1.9.0
  - viewer: 1.3.0
  - workflowengine: 2.1.0
Disabled:
  - bookmarks
  - bruteforcesettings
  - documentserver_community
  - federation
  - mail
  - maps
  - ransomware_detection
  - survey_client
  - tasks
  - twofactor_admin
-
This is very strange, since if no one accesses nextcloud there shouldn't be long-running processes accessing the database either.
-
To be clear: Postgres goes crazy if I try to login from a browser or a client...
-
Can I somehow go back to an earlier Postgres version? This right now is killing my server and my workflow.
-
You would have to reinstall Cloudron for that old version altogether.
Maybe if you enable remote SSH support we could take a more direct look. If so, please mail your domain to support@cloudron.io
-
This is not related to the thread directly (but I was wondering whether we should even make db rollbacks possible). Do you use other apps that use Postgres? I realize this is not immediately obvious and hard to tell. For example, GitLab is now incompatible with the older postgres, and some of the newer apps like loomio require some of the Postgres extensions we have enabled (maybe one of these extensions is causing CPU use). If it's possible, as @nebulon said, we can take a look.
-
@necrevistonnezr Ah ok, I guess all of them are mysql. That does make it easier to debug. Please write when possible, we can look into it asap.
-
We hit this with another user now and I think the root cause is that the migration only partially imported the database. This is causing nextcloud to do a lot of queries (maybe some internal loop).
To fix this (please do all this carefully, if you are not confident just reach out to support@cloudron.io and we can do it for you):
- Give the postgresql service a lot more memory (Services -> PostgreSQL). There is no good number for this, just give it as much as you can. It's harmless since you can always scale it down later after the import.
- First, identify the backup of the app that was created before Cloudron was updated to 5.5.0. From this backup, copy over the postgresqldump file. Assuming f6e87030-2102-4c6c-b8eb-b2d86a268917 is the id of the nextcloud app:
    # cp /home/yellowtent/appsdata/f6e87030-2102-4c6c-b8eb-b2d86a268917/postgresqldump /root/postgresqldump.current
    # cp /from/the/app/backups/postgresqldump /home/yellowtent/appsdata/f6e87030-2102-4c6c-b8eb-b2d86a268917/postgresqldump
- Then, on your PC/Mac (not Cloudron!), use the CLI tool to import the data in-place. This command simply re-imports the database that we just copied above.
    $ cloudron import --in-place --app nextcloud.domain.com
If you had generated some files in the past few days, you should run the occ scan again after nextcloud is running again - https://cloudron.io/documentation/apps/nextcloud/#rescan-files
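For anyone who prefers to script the two cp steps with a small safety net, a minimal sketch could look like this. The function name swap_dump and its three arguments are hypothetical and not part of Cloudron; the sketch only refuses to continue when the dump taken from the backup is missing or empty, and keeps a copy of the current dump before overwriting it.

```shell
#!/bin/sh
# Sketch only: swap_dump and its arguments are illustrative names, not a Cloudron tool.
# The body runs in a subshell (parentheses) so its variables stay local.
swap_dump() (
    app_dir=$1       # e.g. /home/yellowtent/appsdata/<app id>
    backup_dump=$2   # postgresqldump taken from the pre-5.5.0 backup
    keep_copy=$3     # where to keep the current dump, e.g. /root/postgresqldump.current

    # Refuse to continue if the replacement dump is missing or empty.
    if [ ! -s "$backup_dump" ]; then
        echo "backup dump missing or empty: $backup_dump" >&2
        exit 1
    fi

    # Keep the current dump first so the step can be undone.
    if [ -f "$app_dir/postgresqldump" ]; then
        cp -- "$app_dir/postgresqldump" "$keep_copy" || exit 1
    fi

    # Put the older dump in place for the in-place import.
    cp -- "$backup_dump" "$app_dir/postgresqldump"
)
```

With the paths from the post above, it would be called as root on the server as swap_dump /home/yellowtent/appsdata/f6e87030-2102-4c6c-b8eb-b2d86a268917 /from/the/app/backups/postgresqldump /root/postgresqldump.current, followed by the cloudron import command from your PC/Mac.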
-
-
CPU usage after the re-import:
-
Since I was pressured for time, I re-setup Nextcloud from scratch, imported the backup and went that route. Really stressful and I hope I don't have to do that again. Makes you realize why you pay for certain cloud services...
-
@necrevistonnezr thanks for the update. We have fixed the code in the meantime that causes this.
-
@girish Now I know why support didn't work out: Cloudron blocked my answer from my Cloudron mail account to you guys - as mail relay via Sendgrid - as spam.... (!)
FYI: the shown IP 167.89.12.138 does indeed belong to Sendgrid.
So mail relay via Sendgrid from the Cloudron mail server is not reliable, I guess....
-
Ah, looks like the sendgrid IP is blacklisted by zen spamhaus (which cloudron uses by default).
$ host -t TXT 138.12.89.167.zen.spamhaus.org
138.12.89.167.zen.spamhaus.org descriptive text "https://www.spamhaus.org/sbl/query/SBL491387"
https://www.spamhaus.org/sbl/query/SBL491387 says phishing mails are originating from that IP. Can you tell sendgrid about this (the link says you as customer can do nothing about it)?
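The lookup above works by reversing the octets of the IP (167.89.12.138 becomes 138.12.89.167) and appending the blocklist zone. A small sketch for building that query name for any IPv4 address; reverse_ipv4 and dnsbl_name are illustrative helper names, and the zone defaults to zen.spamhaus.org as used in this post:

```shell
#!/bin/sh
# Sketch: build the DNSBL query name for an IPv4 address.
# reverse_ipv4 and dnsbl_name are hypothetical helper names.
# Function bodies run in subshells (parentheses) so variables stay local.

# Reverse the octets: 167.89.12.138 -> 138.12.89.167
reverse_ipv4() (
    IFS=. read -r a b c d <<EOF
$1
EOF
    printf '%s.%s.%s.%s\n' "$d" "$c" "$b" "$a"
)

# Append the blocklist zone ($2, defaulting to zen.spamhaus.org).
dnsbl_name() (
    printf '%s.%s\n' "$(reverse_ipv4 "$1")" "${2:-zen.spamhaus.org}"
)
```

The result can then be passed to the same host command, for example host -t TXT "$(dnsbl_name 167.89.12.138)".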
-
I reported it to Sendgrid, this was their answer:
When sending email through an account that is under the Free or Essentials pricing package, your account will be utilizing a shared IP pool. Being grouped with others in a pool of shared IP's can offer several benefits, especially if you are only sending a moderate amount of email.
Although there are benefits to sending on shared IP's, there are also risks which can sometimes produce unintended negative consequences. If some of these users display poor sending habits or behavior, it can negatively affect others (you) within the group.
Essentially, you need to be on a paid plan, otherwise you end up in spam lists. The thing is, no one tells you that. And you only find out that your mail was blocked when you login to the account and go to the "Blocked" subsite. In my case, I found out that a job application didn't go through 14 days ago. I get that they wanna sell you something but at least tell me about it. I learned the hard way, this is the end for me for self hosted mail. Imapsynced my Cloudron mail back to my old provider and that's that.
-
You can use their web api and catch events like those through webhooks. Not sure if I set it up at sendgrid, because I left after a few days of testing (and for that exact reason, getting randomly blocked because of bad IPs) and went to Mailjet.
There I've got a webhook which is pulled by a zapier task a few times a day, which notifies me when an email got blocked/bounced, maybe that's something to consider to set up.
-
Correction: it wasn't zapier, but integromat
-
Related to Sendgrid and why their IPs are identified as spam sources: https://krebsonsecurity.com/2020/08/sendgrid-under-siege-from-hacked-accounts/
-
well that was months ago, so they had that problem before that as well