Sharing infrastructure best practices for running validators on Cosmos-based chains like Dymension

An overview of Kiln’s infrastructure practices for Cosmos-based blockchains

As we run validators on several Cosmos-based blockchains, we have devoted significant effort to automating their setup and maintenance. This improves our efficiency and resilience, and reduces the chance of human error.

In this post, we’ll explore how we’ve built our system and the tools we use daily to ensure smooth operations. Let’s dive into the specifics of our Cosmos validator infrastructure.

Kiln validators infrastructure

Our infrastructure is deployed on Kubernetes across various providers: cloud providers such as AWS and GCP, but mainly bare-metal providers such as OVH and Data Packet, all managed by a central control plane.

We’ve standardized on large servers, which allows us to run multiple nodes for different blockchains on a single machine. This approach gives us great flexibility in where we run our nodes, primarily across Europe and Asia. It also enables dynamic adjustments of the CPU, RAM, and disk resources as needed. However, for blockchains with high block rates, like Injective or dYdX, we use dedicated servers equipped with high-frequency CPUs.

Our entire system is managed through a GitOps workflow, which offers several benefits: a single source of truth, a complete change history, and integration with Continuous Integration processes.

We’ve also fully automated the bootstrap and maintenance of Cosmos-based chains, including default configuration setup, snapshot management, and upgrade procedures. This level of automation streamlines our operations and ensures consistency across our deployments.

Security & resilience

Our security strategy is comprehensive and multifaceted, ensuring the protection of sensitive data and the uninterrupted operation of our infrastructure:

  • Key management: All sensitive information, especially validator keys, is securely stored in HashiCorp Vault. This system not only provides robust security but also facilitates easy access when necessary.
  • Constrained access: Strict policy rules are in place, allowing each node access only to its own key. This limitation is crucial for preventing unauthorized use of keys and preserving the integrity of each node.
  • Network isolation: Nodes within the same blockchain network can communicate with each other but are isolated from the rest of our infrastructure. This network isolation per namespace is a key security measure, preventing potential cross-contamination or breaches.
  • Port management: The only ports open to the public are P2P ports, and node IPs are configured dynamically. This enhances our operational resilience, allowing us to swiftly relocate nodes in response to outages.
  • Geographic distribution: The combination of these security measures makes it easy to move validators when needed. This mobility is essential for maintaining uninterrupted service and quick response to any network issues.
  • Operator wallets: For an additional security layer, operator accounts are managed exclusively via hardware wallets like Ledger. This practice ensures that even if a system is compromised, the operators’ access remains secure.
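
As a minimal sketch of the "constrained access" idea above, the snippet below renders a Vault policy that lets one node read only its own key. The path layout (`<mount>/data/validators/<chain>/<node>` on a KV v2 mount) and the function name `node_policy` are assumptions made for illustration, not the layout actually used here. Vault policies are deny-by-default, so any path not listed stays inaccessible.

```python
def node_policy(chain, node, mount="secret"):
    """Render a Vault policy granting read access to exactly one node's key.

    The path layout below is a hypothetical convention for a KV v2 mount.
    """
    for part in (chain, node, mount):
        # Reject wildcards and separators so a name cannot widen the path.
        if not part or any(c in part for c in "*+/\" \n"):
            raise ValueError(f"unsafe path segment: {part!r}")
    path = f"{mount}/data/validators/{chain}/{node}"
    # Only "read" is granted: the node can fetch its key but not list or modify.
    return f'path "{path}" {{\n  capabilities = ["read"]\n}}\n'


if __name__ == "__main__":
    print(node_policy("dymension", "validator-0"))
```

Generating one policy per node, rather than one shared policy with a wildcard, is what keeps a compromised node from reading any other node's key.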

These measures uphold a high-security standard across our infrastructure, safeguarding our validators and the reliability of the networks they support.
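
The network isolation and port management rules above could be expressed as a Kubernetes NetworkPolicy. The sketch below builds such a manifest as a Python dict. It assumes one namespace per blockchain network and the default CometBFT P2P port 26656. The policy name `isolate-chain` and the function name are hypothetical. It covers ingress only, and it only takes effect on clusters whose network plugin enforces NetworkPolicy.

```python
import json


def namespace_isolation_policy(namespace, p2p_port=26656):
    """Build a NetworkPolicy allowing same-namespace traffic plus public P2P."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "isolate-chain", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                # Rule 1: traffic from any pod in this same namespace, on any port.
                {"from": [{"podSelector": {}}]},
                # Rule 2: traffic from any source, but only on the P2P port.
                {"ports": [{"protocol": "TCP", "port": p2p_port}]},
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(namespace_isolation_policy("dymension"), indent=2))
```

Because ingress rules are additive, everything not matched by one of the two rules is dropped, which is what isolates one chain's nodes from the rest of the cluster.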

Remote signing with Horcrux

Initially, our configuration included running a single validator per chain, supplemented by a spare located in a different geographical location.

As we evolved, we embraced the use of Horcrux cosigners, which substantially enhanced our system’s security and efficiency.

  • Sentries: For each validator key, we now deploy at least two sentries. These sentries are strategically distributed across different geographic zones to ensure diversity and mitigate the risk of simultaneous downtime.
  • Cosigners: Alongside the sentries, we operate 3 Horcrux cosigners, which is essential for maintaining the integrity and reliability of our validation process.
  • Colocation for Reduced Latency: To optimize performance, the cosigners are often colocated within the same servers as the sentries. This proximity minimizes latency, critical for maintaining swift and efficient validation processes.
  • Network: In this architecture, the sentries are the ones interfacing with the public P2P network. In contrast, the Horcrux cosigners are configured to operate exclusively over a private network, reducing the possible attack surface.
  • WireGuard Mesh-VPN: A key element of our setup is the use of a WireGuard mesh-VPN. This VPN creates a secure, private network that envelops our nodes. The Horcrux cluster, in particular, communicates over this VPN, ensuring that our internal communications are shielded from external threats.

This evolved validator architecture underscores our commitment to not only maintaining but continually enhancing the security, efficiency, and resilience of our blockchain operations. By leveraging advanced technologies like Horcrux cosigners and WireGuard VPNs, we ensure that our infrastructure remains robust and capable of adapting to the ever-evolving landscape of blockchain technology.
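
To illustrate why a cluster of three cosigners tolerates the loss of one, here is a toy Shamir secret sharing example in which any 2 of 3 shares recover a secret. This is not Horcrux's protocol: Horcrux signs with key shards and never reassembles the key, whereas this sketch reconstructs the secret purely to demonstrate the k-of-n property. The 2-of-3 threshold is an assumption, since the text above only states that there are three cosigners.

```python
import secrets

# Field modulus: 2**127 - 1 is a Mersenne prime, larger than our toy secret.
PRIME = 2**127 - 1


def split(secret, threshold, shares):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    # Each share is a point (x, f(x)) with x = 1..shares (never x = 0).
    return [(x, f(x)) for x in range(1, shares + 1)]


def recover(points):
    """Lagrange-interpolate the polynomial at x = 0 to get the secret back."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    shards = split(123456789, threshold=2, shares=3)
    # Any two shards are enough, so one cosigner can be offline.
    print(recover(shards[:2]) == 123456789)
    print(recover([shards[0], shards[2]]) == 123456789)
```

A single share reveals nothing about the secret on its own, which is why compromising one cosigner is not enough to sign.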

Monitoring

To monitor all our validators efficiently, we rely on two sources of metrics:

  • CometBFT Metrics: We actively monitor CometBFT (formerly known as Tendermint) metrics from our nodes, including current block height, peer connections, and mempool status, alongside standard host metrics like CPU usage, memory, and disk throughput. Together, these metrics provide us with a comprehensive overview of each node’s health and performance.
  • Blackbox Monitoring: In addition to internal metrics, we employ blackbox monitoring through our custom open-source tool, cosmos-validator-watcher. This tool checks our uptime by connecting simultaneously to internal and external RPCs, which helps ensure that no critical information is missed.

The cosmos-validator-watcher extends beyond monitoring, offering insights into total stakes, reward commissions, and tracking our votes on current on-chain governance proposals.
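
The blackbox check described above can be sketched as two small functions operating on already-parsed CometBFT RPC responses (`/status` and `/block`). This is an illustration of the idea, not the code of cosmos-validator-watcher. The function names are hypothetical, and fetching the JSON over HTTP is left out so the logic stays self-contained.

```python
# In CometBFT, block_id_flag 2 means the validator committed to the block
# (1 = absent, 3 = voted nil).
BLOCK_ID_FLAG_COMMIT = 2


def latest_height(status_responses):
    """Highest block height reported by any RPC /status response.

    Querying several RPCs (internal and external) means that one lagging
    node does not hide the real chain tip.
    """
    return max(
        int(r["result"]["sync_info"]["latest_block_height"])
        for r in status_responses
    )


def signed_block(block_response, validator_address):
    """True if the validator's commit signature is in the block's last_commit.

    Note: last_commit in block H carries the signatures for block H-1.
    """
    signatures = block_response["result"]["block"]["last_commit"]["signatures"]
    return any(
        (s.get("validator_address") or "").upper() == validator_address.upper()
        and s.get("block_id_flag") == BLOCK_ID_FLAG_COMMIT
        for s in signatures
    )


if __name__ == "__main__":
    # Hand-written sample data shaped like RPC responses, not real chain data.
    status = [
        {"result": {"sync_info": {"latest_block_height": "100"}}},
        {"result": {"sync_info": {"latest_block_height": "102"}}},
    ]
    block = {"result": {"block": {"last_commit": {"signatures": [
        {"block_id_flag": 2, "validator_address": "ABCD"},
        {"block_id_flag": 1, "validator_address": ""},
    ]}}}}
    print(latest_height(status), signed_block(block, "abcd"))
```

Counting consecutive `False` results from `signed_block` gives a missed-block signal that does not depend on the validator's own metrics endpoint being reachable.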

Leveraging our GitOps workflow, the validator watcher enables us to automate the upgrade process through webhooks. This approach is more efficient than swapping out binaries (especially in an immutable container), a common practice with tools like Cosmovisor.
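
One way to picture the webhook-driven upgrade is a handler that turns an upgrade notification into an image change committed to Git. The sketch below is hypothetical throughout: the payload shape, the `tags_by_upgrade` mapping, and the function name are assumptions, not the watcher's actual webhook format. It only decides which container image should be deployed. Committing that change to the GitOps repository is left out.

```python
def image_for_upgrade(payload, current_image, tags_by_upgrade):
    """Return the image to deploy, or None if nothing should be committed.

    `payload` is a hypothetical webhook body of the form:
        {"plan": {"name": str, "height": int}, "current_height": int}
    `current_image` is assumed to carry an explicit tag ("repo/name:tag").
    """
    plan = payload["plan"]
    if payload["current_height"] < plan["height"]:
        return None  # upgrade height not reached yet
    tag = tags_by_upgrade.get(plan["name"])
    if tag is None:
        return None  # unknown upgrade name: leave it to a human
    repository = current_image.rsplit(":", 1)[0]
    new_image = f"{repository}:{tag}"
    # Returning None when already up to date keeps the handler idempotent.
    return None if new_image == current_image else new_image


if __name__ == "__main__":
    payload = {"plan": {"name": "v4", "height": 100}, "current_height": 100}
    print(image_for_upgrade(payload, "example.org/dymd:v3.1.0", {"v4": "v4.0.0"}))
```

Because the result is a new image reference rather than a binary swapped inside a running container, the container stays immutable and the upgrade appears in the Git history like any other change.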

For those interested in a deeper dive into how cosmos-validator-watcher operates, the code and documentation are available in our GitHub repository, kilnfi/cosmos-validator-watcher (real-time monitoring for Cosmos-based chains).

About Horcrux

Horcrux is a Multi-Party Computation (MPC) signing service for Tendermint nodes. It enhances validator security and availability by using a cluster of signer nodes, which provides fault tolerance, secures private keys through threshold Ed25519 signatures, and improves performance. Explore the documentation to upgrade your validator infrastructure with Horcrux.

About Kiln

Kiln is the leading enterprise-grade staking platform, enabling institutional customers to stake their digital assets programmatically and whitelabel staking functionality into their offerings. Kiln runs validators on all major PoS blockchains, with over $5b of stake under management. As an experienced Cosmos-based chains node operator, we offer staking services and real-time data for various chains, including DYM, ATOM, Osmosis, TIA, INJ, KAVA, and more, to fully meet our customers’ requirements.


Nice read. I’m already applying some of these techniques and will think about implementing more of them. Thanks.


Good technique, I’ll try to apply it.


As I delve into learning how to set up a node for Dym, this post has truly enlightened me on security and best practices.

One question arose while I was engrossed in understanding how you configured your system, particularly around this passage:

I can acknowledge that these solutions assure an uptime of nearly 99%, but doesn’t this centralize the entire process if everyone depends on these providers to host their nodes?

Also I would love to read some more about what kind of hardware you are using and which other tools you may recommend :smile:


Does it centralize the entire process? Yes and no. Yes, of course: using node operators like Kiln implies a form of centralization. From a blockchain point of view, it would be much better to have only solo stakers running their own hardware at home. But since most DPoS chains have a limited active set, this is not possible, hence us, the node operators.

Nevertheless, we run our infrastructure on multiple providers, in multiple regions, and in multiple datacenters within those regions. So the reality is quite different, and we are committed to staying as decentralized as we can.

We don’t really communicate about the hardware we use. It is not a secret, but we use a very wide range of hardware specs depending on what’s available from our different providers, and it would be time-consuming to report on it. For Cosmos-based chains, we like using high-frequency CPUs (> 3.5 GHz).


Thanks for the insight guys!

Didn’t try to bash you with my question, I was just curious to hear your perspective on this topic since I know that a couple of providers (like Hetzner) don’t like that their infra is used for this kind of stuff :no_good_man:
