We had downtime yesterday and again today. Initially I thought it was caused by some tweaks I made to the database, but the second outage was far more severe and didn't resolve itself before I figured out what it was.
The problem was caused by our provider (DigitalOcean), and I had to go through quite an ordeal to get them to even look at the issue. I found a workaround myself, and frankly the delay I went through was far beyond what I consider acceptable.
I will be evaluating the possibility of switching providers, because this is probably the biggest pile of horseshit I've gone through in a while.
For your amusement, here's the conversation I had with them. I'm gonna go to bed now.
Oh, also this:
EDIT01: In case someone is actually curious how this impacted us, it's pretty simple. I use round robin for load balancing: one load balancer distributes traffic between three web servers. To keep sessions working, the balancer has to play around with DNS; otherwise, if you logged in and came back six hours later on a different web server, your session wouldn't carry over. (There are more sophisticated ways of handling this, but I honestly never expected it to be an issue; see the sketch below.) So when the servers tried to return traffic to the balancer, the requests timed out for lack of DNS resolution, which is why the site first dished out gateway errors and eventually database errors, as a vast number of open sessions piled up, still trying to resolve.
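As an aside, here's a minimal sketch of one of those more sophisticated approaches: pick the backend by hashing the client's IP, so the same visitor always lands on the same web server and DNS never enters the picture. The addresses are made up for illustration; this isn't what's actually running here:

```python
# Minimal sketch of hash-based session affinity: the same client IP
# always maps to the same backend, so sessions survive without any
# DNS tricks. The backend addresses below are hypothetical.
import hashlib

BACKENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # the 3 web servers

def pick_backend(client_ip: str) -> str:
    # Hash the client IP and map it onto the backend list; a given
    # client lands on the same web server every time, even days later.
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(BACKENDS)
    return BACKENDS[index]

print(pick_backend("203.0.113.42"))  # same output on every call
```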
All of this made the database slow to a crawl and die, not that it mattered, since traffic wasn't going out anyway. DO somewhat forces which DNS servers you use, so I had to do some tweaks to stop them from doing that, and then I got the site back up. There are still a lot of issues, though: plenty of things under the hood use Google-related services, so you can expect some things to be sorta wonky until DO stops being shit and fixes it, or until I get angry enough to switch back to Linode or something.
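If you ever need to check whether a particular resolver is actually answering before pointing a box at it, a quick sanity check like this does the job. This is a sketch using the dnspython library, and the resolver IPs shown are Google's and Cloudflare's public ones, not DO's:

```python
# Check whether a given resolver answers within a deadline, using
# dnspython (pip install dnspython). Swap in whichever resolvers
# you're testing; these IPs are just well-known public examples.
import dns.resolver

def resolver_works(nameserver: str, name: str = "example.com") -> bool:
    res = dns.resolver.Resolver(configure=False)  # ignore /etc/resolv.conf
    res.nameservers = [nameserver]
    res.lifetime = 2.0  # give up after 2 seconds instead of hanging
    try:
        res.resolve(name, "A")
        return True
    except Exception:
        return False

for ns in ["8.8.8.8", "1.1.1.1"]:
    print(ns, "ok" if resolver_works(ns) else "DEAD")
```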
EDIT02: The second downtime was caused by leftover problems from the DNS resolution issue. A few things running under the hood were still choking on the original issue and caused two of the nodes to pretty much hang at 100% CPU usage about four hours after I went to bed. I just spent the last two hours or so modifying those jobs, re-creating all three web servers, and re-adding them to the load balancer, so hopefully things will run a bit smoother now. Sorry for the downtime.
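For the curious, the change to those jobs amounts to something like the sketch below: a timeout on network calls plus capped backoff between retries, so a dead resolver makes a job sleep and retry instead of spinning at 100% CPU. `fetch_updates` is a hypothetical stand-in for whatever the real job does:

```python
# Sketch of the kind of guard the jobs were missing: timeouts and
# capped exponential backoff instead of a tight retry loop.
import socket
import time

def fetch_updates() -> None:
    # Hypothetical stand-in for the real work: anything that resolves
    # a hostname and talks to the network. A DNS failure here raises
    # socket.gaierror, a subclass of OSError.
    conn = socket.create_connection(("example.com", 443), timeout=5)
    conn.close()

def run_job(max_attempts: int = 5) -> None:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            fetch_updates()
            return
        except OSError as err:
            print(f"attempt {attempt}/{max_attempts} failed: {err}")
            time.sleep(delay)             # back off instead of hot-looping
            delay = min(delay * 2, 60.0)  # cap the backoff at one minute
    print("giving up; the next scheduled run will try again")

run_job()
```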