Improve DNS updates to avoid rate limits
-
Currently, adding multiple aliases or redirections applies them one at a time. If 3 more are added to a list of, let's say, 20 existing ones that are not changing, all 23 are re-applied on each update, which can trigger rate limits at the upstream DNS provider (Route53).
Another approach would be to compare the existing list with the new list and only apply the 3 that are new or different.
I'd prefer this to slowing down the request rate.
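To make the list-compare idea concrete, here is a minimal sketch. It is not Cloudron's actual code; `plan_changes`, the record tuples and the example hostnames are made up for illustration. It only computes which names need an API call and leaves the rest alone.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (record type, value), e.g. ("A", "203.0.113.10")


def plan_changes(
    existing: Dict[str, Record], desired: Dict[str, Record]
) -> Tuple[List[str], List[str]]:
    """Return (names to upsert, names to delete); unchanged names are skipped."""
    to_upsert = [
        name for name, record in desired.items() if existing.get(name) != record
    ]
    to_delete = [name for name in existing if name not in desired]
    return to_upsert, to_delete


# 20 existing aliases that stay the same, plus 3 new ones (hypothetical names).
ip = ("A", "203.0.113.10")
existing = {f"alias{i}.example.com": ip for i in range(20)}
desired = dict(existing)
desired.update({f"new{i}.example.com": ip for i in range(3)})

to_upsert, to_delete = plan_changes(existing, desired)
print(len(to_upsert), len(to_delete))  # prints: 3 0
```

With this, the 20 unchanged aliases cost no update requests at all, although fetching the existing list from the provider still counts towards the limit.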
-
robi replied to girish on Dec 15, 2023, 10:30 AM, last edited by robi Dec 15, 2023, 10:44 AM
@girish Yes, I see that in the log before the "External Error, rate limit reached" message comes up.
It's more than one per second and worse if there's IPv6 as it doubles the number of records it sets, even when no IP changes are needed.
If triggered, it leaves the app in an Error state (does it need to be?), and if you retry, it starts the whole thing from the beginning, not just the addresses that weren't set yet.
Hence failure loops and extra downtime.
So why set things that don't need to be set?
-
@girish I would think that only needs to happen for the primary subdomain, not all the aliases/redirects.
The app should still work.
And yes, since it's an API, all queries count towards the rate limit.
Even if there are 100+ aliases, it needs to have a way to succeed.
Perhaps in an async way with eventual consistency.
One would think there would also be a "bulk" API that can deal with lists. Or at least the ability to validate all existing ones.
-
What has also come up is the inability to uninstall an app due to the same rate limits.
Uninstalling should work for the primary domain and the aliases could be done async with some +1 sec backoff if rate limits are hit.
Now the app is stuck in an Error state, and the uninstall can't be cancelled because it errors before the task stop button (red x) appears.
App could be running, but isn't.
-
The code does things in sequence and also retries. I am not sure what the solution is if rate-limits are so low (I mean, we are only talking about tens of requests at most within a short time frame, if I understand the situation correctly).
Seems like there is something else at play in your case?
As a quick solution, you can set the DNS provider temporarily to no-op for the uninstallation.
-
@nebulon said in Improve DNS updates to avoid rate limits:
The code does things in sequence and also retries.
Not seeing that in the logs. It mentions the error is one that can be continued, however that does not happen either.
IMO the error should not be fatal, so it can continue (not from the beginning) from where it left off and eventually complete all requests.
I am not sure what the solution is if rate-limits are so low
That burst seems to be the problem. If the rate limit is hit and handled in the code, requests can be slowed to 1/sec and still complete. That would be a more universal solution for all providers.
As I already mentioned, optimize the retry logic so it doesn't start from scratch and make things worse.
You can assume all but the last of the previous API requests succeeded, right?
Seems like there is something else at play in your case?
No, happening on the demo server as well.
As a quick solution, you can set the DNS provider temporarily to no-op for the uninstallation.
Yes, plus cleanup of the DNS manually.
Or the retry dialog which gives a checklist of the aliases.
-
Route 53 (from Amazon) and Bunny.net are mentioned in the Cloudron blog on running a Cloudron at home.
https://blog.cloudron.io/installing-cloudron-on-a-home-server/
If you visit e.g. Bunny, they don't seem to sell domain names in the way that other companies do, e.g. Epik. How would you get a domain name from Bunny or Route 53?
-
@LoudLemur I guess you are posting on the wrong thread?
-
@girish said in Improve DNS updates to avoid rate limits:
@LoudLemur I guess you are posting on the wrong thread?
I thought I would throw the question about Bunny and Route 53 in here, since people seem to be familiar with the services and they are up for discussion. Maybe it would be better to create another thread. ...
-
With a bit of hunting, I found https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests. It says "Five requests per second per AWS account per Region". Looks like they recommend implementing a backoff as the solution. I will try to add this in our route53 backend.
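For readers wondering what a backoff looks like in practice, here is a rough sketch of exponential backoff with a little jitter. It is not the actual route53 backend patch; `RateLimitError`, `call_with_backoff` and `flaky_upsert` are hypothetical names, and the delays are arbitrary defaults.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider's throttling error."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(); after each throttle, wait base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
    raise ValueError("max_attempts must be at least 1")


# Usage: a fake request that is throttled twice and then succeeds.
calls = {"n": 0}


def flaky_upsert() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("Rate exceeded")
    return "ok"


delays: List[float] = []  # record the waits instead of really sleeping
result = call_with_backoff(flaky_upsert, sleep=delays.append)
```

Passing `sleep` as a parameter is only there so the example can run without actually waiting; real code would use the default.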
-
Okay, thanks for the retry delay patch, can you apply it to the demo server?
This, however, does not solve the actual topic. With the list-compare approach, a large list with one change results in just that one change being applied as an atomic subset, not the entire list as a global reset.
In Multi Cloudron this could potentially create API storms.
We're getting close to this with the large netblock lists etc.
Please think ahead on this one.
-
@robi I have patched the demo server.
As for the optimization: Cloudron checks if the record is up to date first and only updates it if it is not up to date. It's not a blind update each time. This is the best we can do without complex state tracking.
To write code such as "continue from where it last errored", "update whatever you can right now", "only update what is minimally required" etc. requires Cloudron to track state. In practically all cases, this state tracking is not needed. The most common case (by which I mean 99.9999%) is that there are, say, < 5 domains for an app. So, all this code and state tracking to reduce some API requests? It's not worth the complexity, and it's not like we are generating some 1000 requests. Also, think about all the UI buttons and error messages that now need to be created just for this. You need a button to say "continue", messages that show what exactly went wrong, which domains did not update and why exactly, and so on.
-
@girish Thanks! Will test ;-]
Q: Why not trust the last state?
You already have a button to re-sync DNS in case that becomes an issue.
Q: Why do you need any new UI buttons or error messages when it's "handled" in the background transparently?
Isn't it iterative retry to eventual completion even if failures occur?
-
@robi said in Improve DNS updates to avoid rate limits:
Q: Why not trust the last state?
You already have a button to re-sync DNS in case that becomes an issue.
Maybe I am not completely understanding. Let's say you have 5 domains right now and say 2 failed. The app now says 'Running', but I also assume it has to show some message or error somewhere that 2 failed. Otherwise, people will say it didn't really succeed. To track that 2 failed, we have to maintain state. But ok, let's say we ignore this. User adds a domain. Now, we only try this new domain. The 2 failed are silently ignored. And maybe the user adds more, and more failures add up, all silently ignored. There's no way to know what has succeeded and what has failed unless we maintain state.
-
@girish You're right, I think not.
Currently the app is NOT in a 'Running' state if any failures occur.
An alternate method would be to bring up the app anyway (as long as the primary subdomain is up), and continue to retry the failed API calls until some 'threshold' is reached (then alert) or they succeed and all is well.
No need to ignore the failed ones, they just try again async as with your API rate backoff code, or at the next addition of a new sub if you want to be less proactive and more lazy. (The non-working subs, if used, will draw attention back to the app anyway.)
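Roughly what I have in mind, as a sketch only: `sync_pending`, `fake_upsert` and the hostnames are hypothetical, and this runs synchronously for clarity, where a real version would run in the background and wait between passes. Failed names are kept in a queue and only those are retried, up to a threshold of passes, after which whatever is left would be reported.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Tuple


def sync_pending(
    names: Iterable[str],
    upsert: Callable[[str], bool],
    max_passes: int = 3,
) -> Tuple[List[str], List[str]]:
    """Try each name; re-queue failures and retry only those on later passes.

    upsert(name) returns True on success. Returns (done, still_failing).
    """
    pending: Deque[str] = deque(names)
    done: List[str] = []
    for _ in range(max_passes):
        if not pending:
            break
        failed: Deque[str] = deque()
        while pending:
            name = pending.popleft()
            if upsert(name):
                done.append(name)
            else:
                failed.append(name)
        pending = failed  # the next pass resumes with the failures only
    return done, list(pending)


# Usage: "b" is throttled on its first attempt, "d" never succeeds.
attempts: Dict[str, int] = {}


def fake_upsert(name: str) -> bool:
    attempts[name] = attempts.get(name, 0) + 1
    if name == "d.example.com":
        return False
    if name == "b.example.com":
        return attempts[name] > 1
    return True


done, failing = sync_pending(
    ["a.example.com", "b.example.com", "c.example.com", "d.example.com"],
    fake_upsert,
)
```

Here `a` and `c` are only ever set once, `b` succeeds on the second pass, and `d` ends up in `failing`, which is the point where an alert would go out.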