Incident: 15.02.2022 (downtime)
On 15th February 2022 at 9pm (UTC) our validator went down and effectively suspended validation for 11 hours until 8am of the following day.
We were notified straight away that the validator had gone down.
Shortly afterwards we de-registered the validator until the issue could be resolved, so that no further proposals would be missed. More missed proposals would have further decreased the recent uptime percentage (which is at 99.3% right now). The validator was re-registered the next morning with a lower fee of 0.1% (instead of 1.99%). The fee will return to normal in about 2 weeks.
Here we go through the details of what went wrong and how we aim to prevent this kind of situation in the future.
All times given are in UTC (London time).
The currently active validator node was shut down and put into maintenance mode by our hosting provider. The maintenance started at 9pm and went on until 7am the next morning.
The maintenance of 3 of our servers (on 3 separate days) was actually announced ahead of time so all of this was entirely preventable. This was a human (my) error. I mixed up the server names and had thought this was only going to occur the following day. That’s my bad and I apologise to our stakers for the unnecessary downtime.
Now, even with this human error, this should not have been a big deal.
Just switch over to the failover and the downtime should only be a couple of minutes long.
After all I was alerted and able to react immediately.
However, due to unexpected complications with the failover node, it wasn’t ready to jump in immediately.
While it did have a snapshot of the ledger and associated DB files, they seemed to be corrupt, so the failover started synchronizing from scratch.
A full sync from zero takes at least 12 hours. That is a big problem.
So, as one does, I went to moan about this in the #node-runners channel on the RadixDLT discord server.
The Radix community being as awesome as it is, people heard me right away.
Faraz from radstakes.com immediately offered his help. Rigel (Marco) from StakingCoins also answered right away and offered me a link to a backup of the database from his validator to speed up re-synchronisation. The backup was a bit older but would still save a lot of time.
Faraz then went on to point out that Stuart from RadixPool still shared a relatively recent backup with the community on the RadixPool website. As Faraz said in discord, we definitely all owe Stuart more than a few beers for his continuing service to the Radix community.
With Stuart’s backup I was able to re-synchronize the failover node by around 11pm.
Thanks to Faraz and Rigel for being so kind and helpful! And thanks to Stuart for providing these snapshots.
At this point I had re-activated the validator but still left it unregistered. The problem was again an unexpected one for me [1]. The original validator server could not actually be controlled by me in any way. The hosting provider doesn’t allow deactivating or removing the server, removing its internet access, closing ports or anything of the sort. So after maintenance was done it would then restart automatically and try to keep on validating.
This is a problem because there must never be 2 nodes active with the same key which both try to validate. This will cause missed proposals as they fight over the spot among the validators.
Now usually, in my experience with other hosting providers, this wouldn’t have been a problem. I would just deactivate the inactive node so that it does not automatically start again.
Unfortunately, and I didn’t expect this, I had no such control over the server with our current provider.
I should of course have realised this before this situation came up. My mistake.
This means I could not prevent it from eventually coming back online automatically and ruining things.
This is why, with a heavy heart, I made the decision to keep the validator de-registered until the next morning when the maintenance window was going to end.
With the maintenance window finally over the original validator came back online as expected.
Now at this point it was of course not in sync again as it hadn’t been online for 11 or so hours.
If I had known for sure that the old validator would only come back online at a specific time, I wouldn’t have had to de-register the validator. I would’ve had all the time in the world to turn off the Radix node on the old server while it was catching up, as soon as it came back online.
Unfortunately it could’ve come back up online at any time within an 8 hour time window in the middle of the night. We observed this during the maintenance of another server.
Now I had control over the server again and deactivated the Radix node on it.
Then I promoted the still synchronized failover to validator and re-registered the validator so that it would again participate in the network.
As mentioned above, when re-registering the validator I did this with a much lower fee of 0.1% as opposed to our normal fee of 1.99%. I had originally intended to lower the fee to make up for the missed rewards. I then changed my mind after some discussion in our discord.
Full disclosure: It’s only as low as it is now because during re-registration I didn’t consider that the fee parameter in the transaction was actually expected not in percent (1.99) but in units of one per 10,000 (199). On the plus side, our stakers now do get more rewards for a while, as I originally intended.
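To make the unit mix-up concrete, here is a minimal sketch of the conversion from a fee in percent to the per-10,000 unit described above. The function name percent_to_fee_units is made up for illustration and is not part of any Radix tooling. It only shows the arithmetic: multiply the percentage by 100.

```python
from decimal import Decimal


def percent_to_fee_units(percent: str) -> int:
    """Convert a fee given in percent (e.g. "1.99") to units of 1/10,000 (e.g. 199).

    Hypothetical helper for illustration only. Decimal is used instead of
    float so that values like 1.99 are not affected by rounding errors.
    """
    units = Decimal(percent) * 100
    if units != units.to_integral_value():
        raise ValueError(f"{percent}% cannot be expressed in whole units of 1/10,000")
    if not 0 <= units <= 10000:
        raise ValueError(f"{percent}% is outside the range 0% to 100%")
    return int(units)


print(percent_to_fee_units("1.99"))  # normal fee
print(percent_to_fee_units("0.1"))   # temporary lower fee
```

Running a check like this before submitting a transaction would have made the difference between the two units visible up front.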
Still, this change will be reverted back to 1.99% in about 2 weeks.
To summarize: The issue was caused by human error (1), that is me, to begin with. The failover then was only available with a delay (2). And even with the failover ready we were unable to re-register the validator due to double validation (3) concerns.
I can only pledge to not let my mistakes happen again. I’ve learned from them and will take extra care from now on.
As for technical issues, these shall be mitigated in the future.
[1] Which is of course not the provider’s fault. The issue was just me being too used to cloud hosting providers such as AWS or Hetzner where you can easily do this sort of thing. But for our node we chose to use closer-to-the-metal root servers which are of course not nearly as flexible.
The most critical problem here was the double validation issue (3). Without this the validator would have been back up and running within 2 hours after it went down despite issue number 2 (the corrupted database on the failover server). Still not ideal but a whole lot better than 11 hours.
Preventing double validation
To prevent this from happening in the future we will adjust our docker setup.
Currently before starting the docker services on a server the configuration is adjusted on the server once (validator vs full node configuration). From then onwards the server will always start in that configured role, i.e. either as a validator or only as a full node.
If a server becomes completely unavailable, so that not even the configuration on its disk can be adjusted (as happened in this case), we cannot safely switch the other server over to the validator role.
This step should have happened just before the maintenance window. It’s a problem that this requires a manual step before the server goes down because with this provider we can’t make sure that it stays down.
To mitigate this we will make the configuration source for the switch between validator and full node external. That is, there will be a configuration file, say on a public S3 bucket, which will simply contain the IP of each server mapped to a role.
We will customize the docker container so that it checks this external source to decide whether it will start as a validator or full node. This means that even if the original validator server comes back online it will check this file and once it sees that it doesn’t have the validator role anymore, it will simply start as a full node only, preventing the double validation issue.
Should the external configuration source be unavailable for some reason, we could still fall back to the manual configuration on the server. But by default the node container wouldn’t start at all.
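The start-up decision described above could look roughly like the following sketch. It is not our final implementation. The JSON format (an object mapping IP to role), the role names "validator" and "fullnode", and the function name resolve_role are all assumptions made for illustration. The way the file is fetched is passed in as a function, so the logic stays the same regardless of where the file is hosted.

```python
import json
from typing import Callable, Optional

# Role names are assumptions for this sketch.
VALID_ROLES = {"validator", "fullnode"}


def resolve_role(own_ip: str,
                 fetch_config: Callable[[], str],
                 manual_role: Optional[str] = None) -> Optional[str]:
    """Return the role this server should start in, or None if it must not start.

    fetch_config returns the raw text of the external role file, assumed to be
    a JSON object such as {"10.0.0.1": "fullnode", "10.0.0.2": "validator"}.
    """
    try:
        mapping = json.loads(fetch_config())
        if not isinstance(mapping, dict):
            raise ValueError("role file must be a JSON object")
    except Exception:
        # External source unavailable or malformed: do not start by default.
        # Only an explicitly set manual role allows the node to come up.
        return manual_role if manual_role in VALID_ROLES else None
    # Only an explicit "validator" entry for this IP grants the validator role.
    # A missing or different entry means the node starts as a full node only.
    return "validator" if mapping.get(own_ip) == "validator" else "fullnode"


if __name__ == "__main__":
    example = lambda: '{"10.0.0.1": "fullnode", "10.0.0.2": "validator"}'
    print(resolve_role("10.0.0.1", example))  # old validator after the switch
    print(resolve_role("10.0.0.2", example))  # promoted failover
```

The important design choice is that the validator role is never the default: a server only validates if the external file explicitly says so for its IP, and it refuses to start if it cannot read the file and no manual role was set.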
External database snapshots
Issue number 2 was the corrupt database on the failover node. This, of course, shouldn’t happen to begin with and we will fix that issue too. We believe this happened due to a restart of the server during the local snapshot process. Overwriting the last snapshot then corrupted it. We will change this so that we keep multiple snapshots and can fall back on a previous one in that case.
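A minimal sketch of such a snapshot rotation is shown below. The directory layout, the "snapshot-" naming scheme and the function name take_snapshot are assumptions for illustration. It also assumes the node is stopped (or the database files are otherwise consistent) while the copy runs. The idea is to never overwrite an existing snapshot: each new one is copied into a temporary ".partial" directory and only renamed once the copy has finished, so an interrupted run leaves the older snapshots untouched.

```python
import os
import shutil
import time
from pathlib import Path


def take_snapshot(db_dir: Path, snapshot_root: Path, keep: int = 3) -> Path:
    """Copy db_dir into a new snapshot directory and keep only the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snapshot_root.mkdir(parents=True, exist_ok=True)

    # Zero-padded timestamp so that sorting by name equals sorting by age.
    name = f"snapshot-{time.time_ns():020d}"
    partial = snapshot_root / (name + ".partial")
    final = snapshot_root / name

    shutil.copytree(db_dir, partial)  # an interrupted copy only leaves a .partial dir
    os.rename(partial, final)         # publish the snapshot once it is complete

    complete = sorted(
        p for p in snapshot_root.iterdir()
        if p.is_dir() and p.name.startswith("snapshot-")
        and not p.name.endswith(".partial")
    )
    for old in complete[:-keep]:      # everything except the newest `keep`
        shutil.rmtree(old)
    return final
```

Leftover ".partial" directories from an interrupted run are ignored by the rotation and can simply be deleted.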
Moreover we will take a leaf out of Stuart’s book and store additional external snapshots of the database which we can then download as a last resort.
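Since the whole problem started with a corrupt snapshot, it makes sense to verify a downloaded snapshot before using it. The sketch below assumes that a SHA-256 checksum is published alongside each external snapshot. That is an assumption on my part and not something that exists yet. The function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_hex: str) -> bool:
    """Return True if the file matches the published checksum."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A snapshot archive would only be unpacked on the failover if is_intact returns True for it.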
This long downtime of 11 hours was caused by multiple separate issues that all came together but in the end were triggered by my mistakes. We recognize that our setup was not ideal and that we could’ve prevented an issue of this extent even in the face of human error on my side.
We have learned from this and will improve our infrastructure accordingly so that we can offer a more reliable service in the future.
A big thank you to all our stakers who stay with us despite these temporary issues, and especially to those who expressed their support in our Telegram channel.