These best practices describe an ideal setup. Follow them as closely as you can. Validators in a Proof of Stake network are responsible for keeping the network in consensus and verifying state transitions. Because the number of validators is limited, each validator in the set is responsible for staying online and faithfully executing its tasks.
This means that validators:
- Must have high availability.
- Must have infrastructure that protects the validator’s signing keys so that an attacker cannot take control and commit slashable behavior.
High Availability
High availability set-ups that involve redundant validator nodes may seem attractive at first. However, they can be very dangerous if they are not set up perfectly, because the session keys used by a validator should always be isolated to a single node. Replicating session keys across multiple nodes can lead to equivocation slashes, which can cost you 100% of your staked funds. The good news is that 100% uptime is not really needed: a validator has some buffer within each era to go offline briefly, for example to upgrade. For this reason, we advise that you only attempt a high availability set-up if you are confident you know exactly what you are doing. Many expert validators have made mistakes in the past when handling session keys. Remember, even if your validator goes offline for some time, the offline slash is much more forgiving than the equivocation slash.
Secure Key Management
The keys of primary concern for validator infrastructure are the Session keys. These keys sign messages related to consensus. Although Session keys are not account keys and therefore cannot transfer funds, an attacker could use them to commit slashable behavior. Session keys are normally generated inside the node via an RPC call and should stay within your client. Whenever you generate new Session keys, you must submit an extrinsic (a Session certificate) from your Controller key to tell the chain about them. Session keys can also be generated outside the client and inserted into the client’s keystore via RPC, but for most users we recommend the key generation functionality within the client.
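As a rough illustration of the RPC call mentioned above, the sketch below builds and sends a key-rotation request. It assumes that Geode exposes the standard Substrate JSON-RPC method `author_rotateKeys` and that the node's RPC endpoint listens on the local address shown; both are assumptions, so check them against your own node before relying on this. The function names are ours and purely illustrative.

```python
import json
import urllib.request

# Assumption: the node serves JSON-RPC over HTTP on this local address.
# Adjust the host and port to match your own node's configuration.
RPC_URL = "http://127.0.0.1:9933"


def build_rotate_keys_request(request_id=1):
    """Build the JSON-RPC payload asking the node to generate new Session keys.

    Assumes the standard Substrate method name `author_rotateKeys`.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "author_rotateKeys",
        "params": [],
    }


def rotate_keys(url=RPC_URL, timeout=10):
    """Send the request and return the node's result (the new public Session keys)."""
    body = json.dumps(build_rotate_keys_request()).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(f"RPC error: {reply['error']}")
    return reply["result"]
```

Run `rotate_keys()` on the validator machine itself so that the RPC endpoint never has to be reachable from outside. The value it returns is what you would then submit on chain from your Controller key.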
Signing Outside the Client – In the future, Geode may support signing payloads outside the client so that keys can be stored on another device, e.g. a hardware security module (HSM) or secure enclave. For the time being, however, Session key signatures are performed within the client only.
Monitoring Tools
- The Geode Portal, which shows a range of statistics for your node on the Network > Staking > Targets page.
- (More advanced) A Prometheus-based monitoring stack, with Grafana for dashboards and log aggregation. It provides alerting, querying, visualization, and monitoring features and works for both cloud and home-based systems. The data from your node can be made available to Prometheus through an exporter.
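To give a feel for what an alerting rule on top of such metrics does, here is a minimal sketch that parses Prometheus-style text metrics and flags two common problems. The metric names (`node_peers`, `node_best_block`, `node_finalized_block`), the thresholds, and the sample data are hypothetical and chosen only for illustration; use the names your node or exporter actually publishes. The parser is deliberately simplified and ignores optional timestamps.

```python
# Hypothetical sample of Prometheus text-format output.
SAMPLE_METRICS = """\
# HELP node_peers Number of connected peers (hypothetical metric)
# TYPE node_peers gauge
node_peers 2
node_best_block 1050
node_finalized_block 1000
"""


def parse_metrics(text):
    """Parse 'name value' lines into a dict, skipping comments and blanks."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # not a numeric sample, ignore it
    return metrics


def find_alerts(metrics, min_peers=3, max_finality_lag=20):
    """Return human-readable alerts for low peer count or lagging finality."""
    alerts = []
    peers = metrics.get("node_peers")
    if peers is None:
        alerts.append("peer count metric missing")
    elif peers < min_peers:
        alerts.append(f"low peer count: {peers:g}")
    best = metrics.get("node_best_block")
    finalized = metrics.get("node_finalized_block")
    if best is not None and finalized is not None:
        lag = best - finalized
        if lag > max_finality_lag:
            alerts.append(f"finality lag: {lag:g} blocks")
    return alerts
```

In a real deployment these checks would normally live in Prometheus alerting rules rather than in a script. The sketch only shows the kind of conditions worth alerting on.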
Linux Best Practices For Cloud VMs
- Never use the root user. Create a separate user and give it only the privileges it needs (for example, through sudo).
- Always apply the latest security patches for your OS.
- Enable and set up a firewall.
- Never allow password-based SSH; use key-based SSH only.
- Disable non-essential SSH subsystems (banner, motd, scp, X11 forwarding) and harden your SSH configuration.
- Back up your storage regularly.
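Some of the SSH items in the list above can be checked automatically. The sketch below audits the text of an `sshd_config` file for three settings that follow from these recommendations: `PasswordAuthentication no`, `PermitRootLogin no`, and `X11Forwarding no`. The function name and the choice of settings are ours. The parser is simplified: it ignores `Match` blocks, `Include` directives, and `key=value` syntax, and it reports a setting as a problem when it is not set explicitly.

```python
# Settings we expect to find, keyed by lowercase keyword
# (sshd_config keywords are case-insensitive).
REQUIRED_SETTINGS = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
    "x11forwarding": "no",
}


def audit_sshd_config(text, required=REQUIRED_SETTINGS):
    """Return a list of problems found in the given sshd_config text."""
    seen = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        # sshd uses the first value it obtains for each keyword.
        seen.setdefault(key, value)

    problems = []
    for key, expected in required.items():
        actual = seen.get(key)
        if actual is None:
            problems.append(f"{key}: not set explicitly (expected '{expected}')")
        elif actual != expected:
            problems.append(f"{key}: expected '{expected}', found '{actual}'")
    return problems
```

A typical use is to read `/etc/ssh/sshd_config`, pass its contents to `audit_sshd_config`, and treat a non-empty result as a reason to review the file by hand.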
Conclusions
Keys
- At the moment, Geode can’t interact with HSM/SGX, so we need to provide the signing key seeds to the validator machine. These keys are kept in memory for signing operations and persisted to disk (encrypted with a password).
- Given that HA setups are always at risk of double-signing and there is currently no built-in mechanism to prevent it, we recommend running a single instance of the validator to avoid slashing. Slashing penalties for being offline are much lower than those for equivocation.
Validators
- Validators should only run the Geode binary, and they should not listen on any port other than the configured p2p port.
- Session keys should be generated and provided in a secure way, for example with SubKey.
- Geode should be started at boot and restarted if it stops for any reason (use a supervisor process).
- Geode should run as a non-root user.
Monitoring
- There should be an on-call rotation of team members for managing the alerts.
- There should be a clear protocol with actions to perform for each level of each alert and an escalation policy.