So long … and thanks for the radish

A few months ago we dropped out of the top 100 validators and have since not managed to get back in.

We recommend that everyone who hasn’t done so yet unstake from us and stake with somebody else near the bottom of the validator list.

Perhaps we will come back when Xi’an hits. Until then, we will shut down our servers. This page will follow soon after.

Thanks to everyone who staked with us and who supported us throughout our Radix validator journey.

~Markus

Incident: 15.02.2022 (downtime)

On 15th February 2022 at 9pm (UTC) our validator went down and effectively suspended validation for 11 hours until 8am of the following day.

We were notified of the validator going down straight away.
Shortly afterwards we de-registered the validator until the issue could be resolved, so that no further proposals would be missed and the recent uptime percentage (currently at 99.3%) would not drop any further. The validator was re-registered the next morning with a lower fee of 0.1% (instead of 1.99%). The fee will return to normal in about 2 weeks.

Post Mortem

Here we go through the details of what went wrong and how we aim to prevent this kind of situation in the future.

All times given are in UTC (London time).

15.02.2022, 9pm

The currently active validator node was shut down and put into maintenance mode by our hosting provider. The maintenance started at 9pm and went on until 7am the next morning.

The maintenance of 3 of our servers (on 3 separate days) was actually announced ahead of time, so all of this was entirely preventable. This was a human (my) error: I mixed up the server names and thought this was only going to occur the following day. That’s my bad and I apologise to our stakers for the unnecessary downtime.

Now, even with this human error, this should not have been a big deal.
Just switch over to the failover and the downtime should only be a couple of minutes long.
After all, I was alerted and able to react immediately.

However, due to unexpected complications with the failover node, it wasn’t ready to jump in immediately.
While it did have a snapshot of the ledger and the associated DB files, these seemed to be corrupt, so the failover started synchronizing from scratch again.

A full sync from zero takes at least 12 hours. That is a big problem.

So, as one does, I went to moan about this in the #node-runners channel on the RadixDLT Discord server.
The Radix community being as awesome as it is, people heard me right away.

Faraz from radstakes.com immediately offered his help. Rigel (Marco) from StakingCoins also answered right away and offered me a link to a backup of the database from his validator to speed up re-synchronisation. The backup was a bit older but would still save a lot of time.

Faraz then went on to point out that Stuart from RadixPool still shared a relatively recent backup with the community on the RadixPool website. As Faraz said in discord, we definitely all owe Stuart more than a few beers for his continuing service to the Radix community.

With Stuart’s backup I was able to re-synchronize the failover node by around 11pm.

Thanks to Faraz and Rigel for being so kind and helpful! And thanks to Stuart for providing these snapshots.

15.02.2022, 11pm

At this point I had re-activated the validator but still left it unregistered. The problem was again an unexpected one for me¹. The original validator server could not actually be controlled by me in any way. The hosting provider doesn’t allow deactivating or removing the server, removing its internet access, closing ports or anything of the sort. So after maintenance was done it would then restart automatically and try to keep on validating.

This is a problem because there must never be two nodes active with the same key that both try to validate. This causes missed proposals as the two fight over the same spot among the validators.

Now usually, in my experience with other hosting providers, this wouldn’t have been a problem. I would just deactivate the inactive node so that it does not automatically start again.
Unfortunately, and I didn’t expect this, I had no such control over the server with our current provider.
I should of course have realised this before this situation came up. My mistake.

This meant I could not prevent it from eventually coming back online automatically and ruining things.

This is why, with a heavy heart, I made the decision to keep the validator de-registered until the next morning when the maintenance window was going to end.

16.02.2022, 7am

With the maintenance window finally over the original validator came back online as expected.
Now at this point it was of course not in sync anymore, as it hadn’t been online for 11 or so hours.

If I had known for sure that the old validator would only come back online at a specific time, I wouldn’t have had to de-register the validator. I would’ve had all the time in the world to turn off the old server while it was still catching up, as soon as it came back online.

Unfortunately, it could’ve come back online at any time within an 8-hour window in the middle of the night. We observed this during the maintenance of another server.

Now I had control over the server again and deactivated the Radix node on it.

Then I promoted the still synchronized failover to validator and re-registered the validator so that it would again participate in the network.

As mentioned above, when re-registering the validator I did this with a much lower fee of 0.1% as opposed to our normal fee of 1.99%. I had originally intended to lower the fee to make up for the missed rewards, but then changed my mind after some discussion in our Discord.

Full disclosure: the fee is only as low as it is now because during re-registration I didn’t consider that the fee parameter in the transaction is expected not in percent (1.99) but in per 10,000 (199). On the plus side, our stakers now do get more rewards for a while, as I had originally intended.

Still, this change will be reverted back to 1.99% in about 2 weeks.

Summary

To summarize: the issue was caused to begin with by human error (1), that is, by me. The failover was then only available with a delay (2). And even with the failover ready, we were unable to re-register the validator due to double validation concerns (3).

I can only pledge to not let my mistakes happen again. I’ve learned from them and will take extra care from now on.

As for the technical issues, we describe below how they will be mitigated in the future.

¹ Which is of course not the provider’s fault. The issue was just me being too used to cloud hosting providers such as AWS or Hetzner, where you can easily do this sort of thing. But for our node we chose to use closer-to-the-metal root servers, which are of course not nearly as flexible.

The Future

The most critical problem here was the double validation issue (3). Without this the validator would have been back up and running within 2 hours after it went down despite issue number 2 (the corrupted database on the failover server). Still not ideal but a whole lot better than 11 hours.

Preventing double validation

To prevent this from happening in the future we will adjust our docker setup.
Currently, before the docker services are started on a server, the configuration is adjusted once on that server (validator vs. full node configuration). From then onwards the server will always start in that configured role, i.e. either as a validator or only as a full node.

If a server becomes completely unavailable, so that not even the configuration on its disk can be adjusted (as happened in this case), we cannot safely switch the other server over to the validator role.

This switch should have happened just before the maintenance window. The problem is that it requires a manual step before the server goes down, and with this provider we can’t make sure that the server stays down.

To mitigate this we will make the configuration source for the switch between validator and full node external. That is, there will be a configuration file, say on a public S3 bucket, which simply maps the IP of each server to a role.

We will customize the docker container so that it checks this external source to decide whether it will start as a validator or as a full node. This means that even if the original validator server comes back online, it will check this file, and once it sees that it no longer has the validator role, it will simply start as a full node only, preventing the double validation issue.

Should the external configuration source be unavailable for some reason, we could still fall back to the manual configuration on the server. But by default the node container wouldn’t start at all.
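To make this concrete, below is a minimal sketch of how such a role check at container start-up could look. The URL, file format and role names are placeholders we made up for the example, not our final implementation.

```python
# role_check.py - sketch of the planned start-up role check (placeholder names and URL)
import json
import socket
import sys
import urllib.request

# Hypothetical public configuration file mapping server IPs to roles, e.g.
# {"203.0.113.10": "validator", "203.0.113.11": "fullnode"}
ROLE_CONFIG_URL = "https://example-bucket.s3.amazonaws.com/radix-roles.json"


def my_ip() -> str:
    """Determine this server's outward-facing IP (simplified for the sketch)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))  # no packet is sent; this just selects a route
    ip = s.getsockname()[0]
    s.close()
    return ip


def main() -> None:
    try:
        with urllib.request.urlopen(ROLE_CONFIG_URL, timeout=10) as resp:
            roles = json.load(resp)
    except Exception as err:
        # External source unreachable: refuse to start rather than risk two
        # nodes validating with the same key.
        print(f"could not read role config ({err}); not starting node", file=sys.stderr)
        sys.exit(1)

    role = roles.get(my_ip(), "fullnode")  # unknown servers default to the harmless role
    print(role)  # consumed by the container start script


if __name__ == "__main__":
    main()
```

The container start script would only launch the node with the validator configuration if this check prints “validator”; in every other case, including errors, the node comes up as a full node or not at all, which is exactly the fail-safe behaviour described above.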

External database snapshots

Issue number 2 was the corrupt database on the failover node. This, of course, shouldn’t happen to begin with, and we will fix that issue too. We believe it happened due to a restart of the server during the local snapshot process: overwriting the last snapshot then corrupted it. We will adapt this so that we keep multiple snapshots and can fall back on a previous one in that case.

Moreover, we will take a leaf out of Stuart’s book and store additional external snapshots of the database, which we can then download as a last resort.
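As a rough illustration of what we mean, here is a minimal sketch of such a snapshot rotation, keeping the last few local snapshots instead of overwriting a single one. The paths and the retention count are made-up example values.

```python
# snapshot_rotate.py - sketch of keeping several DB snapshots instead of one (example paths)
import shutil
import time
from pathlib import Path

DB_DIR = Path("/opt/radix/db")               # hypothetical live database directory
SNAPSHOT_DIR = Path("/opt/radix/snapshots")  # hypothetical snapshot location
KEEP = 3                                     # number of snapshots to retain


def take_snapshot() -> Path:
    """Copy the node's database into a new timestamped snapshot directory."""
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    target = SNAPSHOT_DIR / time.strftime("db-%Y%m%d-%H%M%S")
    shutil.copytree(DB_DIR, target)          # the node should be stopped while copying
    return target


def prune_old_snapshots() -> None:
    """Delete the oldest snapshots, keeping only the most recent KEEP of them."""
    snapshots = sorted(SNAPSHOT_DIR.glob("db-*"))  # timestamped names sort chronologically
    for old in snapshots[:-KEEP]:
        shutil.rmtree(old)


if __name__ == "__main__":
    path = take_snapshot()
    prune_old_snapshots()
    print(f"snapshot written to {path}")
```

Because each run writes into a fresh directory, a restart in the middle of a copy can at worst corrupt the newest snapshot; the previous ones stay intact and can be used to restore the failover. The same directories can also be uploaded to external storage as a last resort.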

Closing thoughts

This long downtime of 11 hours was caused by multiple separate issues that all came together, but in the end they were triggered by my mistakes. We recognize that our setup was not ideal and that with a better setup an outage of this extent could have been prevented, even in the face of human error on my side.

We have learned from this and will improve our infrastructure accordingly so that we can offer a more reliable service in the future.

A big thank you to all our stakers who stay with us despite these temporary issues, and especially to those who expressed their support in our Telegram channel.

Mainnet Go-Live

Congratulations to everyone in the Radix team for the launch of the mainnet!

We are still soaking up all the new information available in the docs. Our nodes are already synced and online. We are just waiting to be able to register as a validator which is not yet possible.

Until then make sure to go to https://wallet.radixdlt.com/ and create your mainnet wallet address!

Once we’re registered we’ll post an update. Our validator’s address will be:

rv1qw6m5nrwnjx2estgkv8zsvp77es6yea0p99zkregud6dqad8q5wg7yvr4na

Update (11pm): Our validator is now live!

Radix Betanet. We’re in.

Good news! We are one of 81 Radix nodes, out of more than 300 applicants, allowed to participate as full validators in the Radix Betanet. The 2-month beta phase starts on April 28th. We’ll be there from the start. We’re ready to go as soon as the documentation is available!

More on this in the official Radix blog entry.

Proposal #302 Found: RadixRadar.de

Hello Radix Community,

You are sure to be surprised to see a German post. There is of course a reason! We at Radixradar.de want to strengthen the Radix network, especially in German-speaking countries (Germany, Austria and Switzerland), and act as a point of contact for the broader population.
We are also convinced of the Radix DeFi system and would therefore like to act as a validator!

You can find our proposal (#302) here.

Our technology

In our proposal, we originally considered hosting our servers at Hetzner.
However, we have now changed our mind and will host our servers in Düsseldorf at myloc.de.
Even though they are based in only one city, they operate 5 data centers there and offer the best possible security and redundancy. The plan is currently to run 2 root servers that will both host Radix nodes.
Should we be chosen as a validator, 1 of them will act as the validator. In the event of a failure of one of the two servers – in theory – the other can then be made the validator without much delay.

We will only know more once the betanet is available.

Our team

Our team from Radixradar.de currently consists of three people, Markus, Mathias and Robert.

Markus: I have a lot of experience with the operation of server infrastructures on various platforms (e.g. AWS). I will keep the servers running and be available for technical questions. I’m very interested in Radix as a dApp platform and I can’t wait to jump into Scrypto as soon as possible!

Mathias: I am an experienced software engineer and I will support Markus in the operation of the servers and take care of our website.

Robert: Drawing on years of experience as a lecturer at universities, I will promote the spread of Radix across all possible social networks and present instructional videos and information in German. So I listen to you, support you and spread the most important information … to put it another way, I’ll be the “mom” for everything. 🙂

Community

The main point of contact should be our website. We also have our own subreddit as well as a Twitter account to keep the community up to date.

Our goal

  • Permanent operation of our Radix-Nodes and Community website
  • Support with technical questions about Radix, staking and crypto in general
  • Develop the community and bring in your ideas
  • Spread the word about Radix in German-speaking countries