General Management¶
Introduction¶
Validator performance is pivotal in maintaining the security and stability of the Polkadot network. As a validator, optimizing your setup ensures efficient transaction processing, minimizes latency, and maintains system reliability during high-demand periods. Proper configuration and proactive monitoring also help mitigate risks like slashing and service interruptions.
This guide covers essential practices for managing a validator, including performance tuning techniques, security hardening, and tools for real-time monitoring. Whether you're fine-tuning CPU settings, configuring NUMA balancing, or setting up a robust alert system, these steps will help you build a resilient and efficient validator operation.
Configuration Optimization¶
For those seeking to optimize their validator's performance, the following configurations can improve responsiveness, reduce latency, and ensure consistent performance during high-demand periods.
Deactivate Simultaneous Multithreading¶
Polkadot validators run their critical paths primarily single-threaded, so optimizing for single-core CPU performance can reduce latency and improve stability. Deactivating simultaneous multithreading (SMT) prevents virtual cores from affecting performance. SMT is marketed as Hyper-Threading on Intel CPUs and 2-way SMT on AMD Zen. The following loop takes every secondary (virtual) core offline:
# Run as root: take each SMT sibling (the second logical CPU of every physical core) offline
for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un)
do
    echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
done
To save the changes permanently, add nosmt=force as a kernel parameter. Edit /etc/default/grub and add nosmt=force to the GRUB_CMDLINE_LINUX_DEFAULT variable as follows:
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="nosmt=force"
GRUB_CMDLINE_LINUX=""
After updating the variable, be sure to update GRUB to apply changes:
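On Debian or Ubuntu systems, for example, this is done with update-grub followed by a reboot:

sudo update-grub
sudo reboot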
After the reboot, you should see that half of the cores are offline. To confirm, run:
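One way to check is lscpu, which lists each logical CPU and whether it is online:

lscpu --extended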
Deactivate Automatic NUMA Balancing¶
Deactivating NUMA (Non-Uniform Memory Access) balancing for multi-CPU setups helps keep processes on the same CPU node, minimizing latency. Run the following command to deactivate NUMA balancing in runtime:
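On most distributions the runtime toggle is the kernel.numa_balancing sysctl:

sudo sysctl kernel.numa_balancing=0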
To deactivate NUMA balancing permanently, add numa_balancing=disable to the GRUB settings:
GRUB_DEFAULT=0
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="numa_balancing=disable"
GRUB_CMDLINE_LINUX=""
After updating the variable, be sure to update GRUB to apply changes:
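As before, on Debian-based systems:

sudo update-grub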
Confirm the deactivation by running the following command:
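For example, by reading the sysctl back:

sysctl -n kernel.numa_balancing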
If you successfully deactivated NUMA balancing, the preceding command should return 0.
Spectre and Meltdown Mitigations¶
Spectre and Meltdown are well-known vulnerabilities in modern CPUs that exploit speculative execution to access sensitive data. These vulnerabilities have been patched in recent Linux kernels, but the mitigations can slightly impact performance, especially in high-throughput or containerized environments.
If your security needs allow it, you may selectively deactivate specific mitigations for performance gains. The Spectre V2 mitigation and Speculative Store Bypass Disable (SSBD, for Spectre V4) govern speculative execution and are particularly impactful in containerized environments. Deactivating them can help regain performance if your environment doesn't require these security layers.
To selectively deactivate the Spectre mitigations, update the GRUB_CMDLINE_LINUX_DEFAULT variable in your /etc/default/grub configuration:
GRUB_DEFAULT=0
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="spec_store_bypass_disable=prctl spectre_v2_user=prctl"
GRUB_CMDLINE_LINUX=""
After updating the variable, be sure to update GRUB to apply changes and then reboot:
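On Debian-based systems:

sudo update-grub
sudo reboot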
This approach selectively deactivates the Spectre V2 and Spectre V4 mitigations, leaving other protections intact. For full security, keep mitigations activated unless there's a significant performance need, as disabling them could expose the system to potential attacks on affected CPUs.
Monitor Your Node¶
Monitoring your node's performance is critical to maintaining network reliability and security. Tools like Prometheus and Grafana provide insights into block height, peer connections, CPU and memory usage, and more. This section walks through setting up these tools and configuring alerts to notify you of potential issues.
Prepare Environment¶
Before installing Prometheus, it's important to set up the environment securely so that Prometheus runs with restricted user privileges. You can set up Prometheus securely as follows; a command sketch for these steps follows the list:
- Create a Prometheus user - ensure Prometheus runs with minimal permissions
- Set up directories - create directories for configuration and data storage
- Change directory ownership - ensure Prometheus has access
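A minimal sketch of these three steps, using the conventional /etc/prometheus and /var/lib/prometheus locations referenced later in this guide:

# Create a system user with no home directory and no login shell
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus

# Directories for configuration and time-series data
sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus

# Give the prometheus user ownership of both directories
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus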
Install and Configure Prometheus¶
After preparing the environment, install and configure the latest version of Prometheus as follows:
- Download Prometheus - obtain the respective release binary for your system architecture from the Prometheus releases page. Replace the placeholder text with the respective release binary, e.g.
https://github.com/prometheus/prometheus/releases/download/v3.0.0/prometheus-3.0.0.linux-amd64.tar.gz
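For example, to download and unpack that release (adjust the version and architecture for your system):

wget https://github.com/prometheus/prometheus/releases/download/v3.0.0/prometheus-3.0.0.linux-amd64.tar.gz
tar xvf prometheus-3.0.0.linux-amd64.tar.gz
cd prometheus-3.0.0.linux-amd64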
- Set up Prometheus - copy binaries and directories, assign ownership of these files to the prometheus user, and clean up the download directory as follows:
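A sketch of this step; note that the console directories ship with 2.x-style archives and may be absent from newer releases:

# Copy the binaries into place
sudo cp prometheus promtool /usr/local/bin/

# Copy console assets, if the archive includes them
sudo cp -r consoles console_libraries /etc/prometheus/

# Assign ownership to the prometheus user
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chown -R prometheus:prometheus /etc/prometheus

# Clean up the download directory
cd .. && rm -rf prometheus-3.0.0.linux-amd64 prometheus-3.0.0.linux-amd64.tar.gz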
- Create prometheus.yml for configuration - open /etc/prometheus/prometheus.yml in an editor and define global settings, rule files, and scrape targets:

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'substrate_node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9615']

In this example configuration, Prometheus scrapes itself every 5 seconds, ensuring detailed internal metrics. Node metrics, with customizable intervals, are scraped from port 9615, the node's default.
- Validate configuration with promtool - use Prometheus's bundled promtool utility to check your configuration:
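For example:

promtool check config /etc/prometheus/prometheus.yml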
- Assign ownership - save the configuration file and change its ownership to the prometheus user:
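For example:

sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml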
Start Prometheus¶
- Launch Prometheus - use the following command to launch Prometheus with a given configuration, set the storage location for metric data, and enable web console templates and libraries:
sudo -u prometheus /usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
If you set the server up properly, you should see terminal output confirming that the server has started and is ready to serve web requests.
- Verify access - verify you can access the Prometheus interface by visiting the following address:
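Assuming a local setup, the default address is:

http://localhost:9090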
If the interface appears to work as expected, exit the process using Control + C.
- Create new systemd service file - this will automatically start the server during the boot process
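For example, with nano (any editor works), using the standard systemd unit path:

sudo nano /etc/systemd/system/prometheus.service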
Add the following code to the service file:

prometheus.service

[Unit]
Description=Prometheus Monitoring
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Once you save the file, execute the commands shown below to reload systemd and enable the service so that it will load automatically during the operating system's startup. 4. Verify service - return to the Prometheus interface at http://localhost:9090 to verify the service is running
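A sketch of the reload-and-enable step, assuming the unit file above was saved as /etc/systemd/system/prometheus.service:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus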
Install and Configure Grafana¶
Grafana provides a powerful, customizable interface to visualize metrics collected by Prometheus. This guide follows Grafana's canonical installation instructions. To install and configure Grafana, follow these steps:
- Install Grafana prerequisites - run the following commands to install the required packages:
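Per Grafana's APT installation instructions, the prerequisites are typically:

sudo apt-get install -y apt-transport-https software-properties-common wget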
- Import the GPG key:
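A sketch based on Grafana's current instructions (the keyring path follows their documented convention):

sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null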
- Configure the stable release repo and update packages:
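For example:

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update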
- Install the latest stable version of Grafana:
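For example:

sudo apt-get install grafana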
After installing Grafana, you can move on to the configuration steps:
- Set Grafana to auto-start - configure Grafana to start automatically on system boot and start the service:
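A typical sequence:

sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server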
- Verify the Grafana service is running with the following command:
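For example:

sudo systemctl status grafana-server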
If necessary, you can stop or restart the service with the following commands:
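sudo systemctl stop grafana-server
sudo systemctl restart grafana-server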
- Access Grafana - open your browser, navigate to the following address, and use the default user and password admin to login:
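Grafana's default HTTP port is 3000, so on a local setup:

http://localhost:3000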
Change default port

If you want to run Grafana on another port, edit the file /usr/share/grafana/conf/defaults.ini and change the http_port value as desired. Then restart Grafana, as shown below.
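For example (any editor works):

sudo vim /usr/share/grafana/conf/defaults.ini
sudo systemctl restart grafana-server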
Follow these steps to visualize node metrics:
- Select the gear icon for settings to configure the Data Sources
- Select Add data source to define the data source
- Select Prometheus
- Enter http://localhost:9090 in the URL field, then select Save & Test. If you see the message "Data source is working" your connection is configured correctly
- Next, select Import from the menu bar on the left, select Prometheus in the dropdown list, and select Import
- Finally, start your Polkadot node by running ./polkadot. You should now be able to monitor your node's performance, such as the current block height, network traffic, and running tasks, on the Grafana dashboard
Import via grafana.com
The Grafana dashboards page features user-created dashboards made available for public use. Visit "Substrate Node Metrics" for an example of available dashboards.
Install and Configure Alertmanager¶
The optional Alertmanager complements Prometheus by handling alerts and notifying users of potential issues. Follow these steps to install and configure Alertmanager:
- Download and extract Alertmanager - download the latest version from the Prometheus Alertmanager releases page. Replace the placeholder text with the respective release binary, e.g. https://github.com/prometheus/alertmanager/releases/download/v0.28.0-rc.0/alertmanager-0.28.0-rc.0.linux-amd64.tar.gz
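For example (adjust the version and architecture as needed):

wget https://github.com/prometheus/alertmanager/releases/download/v0.28.0-rc.0/alertmanager-0.28.0-rc.0.linux-amd64.tar.gz
tar xvf alertmanager-0.28.0-rc.0.linux-amd64.tar.gz
cd alertmanager-0.28.0-rc.0.linux-amd64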
- Move binaries and set permissions - copy the binaries to a system directory and set appropriate permissions
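A sketch of this step, assuming /usr/local/bin as the target directory (amtool is the CLI tool bundled with Alertmanager):

sudo cp alertmanager amtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/alertmanager /usr/local/bin/amtool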
- Create configuration file - create a new alertmanager.yml file under /etc/alertmanager. Add the following code to the configuration file to define email notifications:

alertmanager.yml

global:
  resolve_timeout: 1m

route:
  receiver: 'gmail-notifications'

receivers:
  - name: 'gmail-notifications'
    email_configs:
      - to: INSERT_YOUR_EMAIL
        from: INSERT_YOUR_EMAIL
        smarthost: smtp.gmail.com:587
        auth_username: INSERT_YOUR_EMAIL
        auth_identity: INSERT_YOUR_EMAIL
        auth_password: INSERT_YOUR_APP_PASSWORD
        send_resolved: true
App password

You must generate an app password in your Gmail account to allow Alertmanager to send you alert notification emails.

Ensure the configuration file has the correct permissions:
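One option, if you want the prometheus user created earlier to own this configuration as well (the service below only needs read access):

sudo chown -R prometheus:prometheus /etc/alertmanager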
4. Configure as a service - set up Alertmanager to run as a service by creating a systemd service file. Add the following code to the service file:

alertmanager.service

[Unit]
Description=AlertManager Server Service
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --web.external-url=http://SERVER_IP:9093 --cluster.advertise-address='0.0.0.0:9093'

[Install]
WantedBy=multi-user.target

Reload systemd and enable the service:
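A sketch, assuming the unit file above was saved as /etc/systemd/system/alertmanager.service:

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager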
Verify the service status using the following command. If you have configured Alertmanager properly, the Active field should display active (running) similar to below:

sudo systemctl status alertmanager

alertmanager.service - AlertManager Server Service
   Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-08-20 22:01:21 CEST; 3 days ago
 Main PID: 20592 (alertmanager)
    Tasks: 70 (limit: 9830)
   CGroup: /system.slice/alertmanager.service
Grafana Plugin¶
There is an Alertmanager plugin in Grafana that can help you monitor alert information. Follow these steps to use the plugin:
- Install the plugin - use the following command:
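Assuming the community Prometheus Alertmanager datasource plugin (the plugin ID below is an assumption; check the Grafana plugin catalog for the current one):

sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource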
- Restart Grafana
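For example:

sudo systemctl restart grafana-server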
- Configure datasource - go to your Grafana dashboard at SERVER_IP:3000 and configure the Alertmanager datasource as follows:
  - Go to Configuration -> Data Sources, and search for Prometheus Alertmanager
  - Fill in the URL to your server location followed by the port number used in Alertmanager. Select Save & Test to test the connection
- To monitor the alerts, import the 8010 dashboard, which is used for Alertmanager. Make sure to select the Prometheus Alertmanager in the last column, then select Import
Integrate Alertmanager¶
A few more steps are required to allow the Prometheus server to talk to the Alertmanager and to configure rules for detection and alerts. Complete the integration as follows:
- Update configuration - update the configuration file /etc/prometheus/prometheus.yml to add the following code:
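These are the additions; the complete updated file appears at the end of this section:

rule_files:
  - 'rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093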
- Create rules file - here you will define the rules for detection and alerts. Run the following command to create the rules file:
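For example, with nano (any editor works; the relative 'rules.yml' entry above resolves next to prometheus.yml):

sudo nano /etc/prometheus/rules.yml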
If any of the conditions defined in the rules file are met, an alert will be triggered. The following sample rule checks for the node being down and triggers an email notification if an outage of more than five minutes is detected:
rules.yml

groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Instance [{{ $labels.instance }}] down'
          description: '[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 5 minutes.'

See Alerting Rules and additional alerts in the Prometheus documentation to learn more about defining and using alerting rules.
- Update ownership of rules file - ensure user prometheus has access by running the first command shown below
- Check rules - ensure the rules defined in rules.yml are syntactically correct by running the promtool command shown below
- Restart Alertmanager so the new rules take effect
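A sketch of these three steps, assuming the file locations used earlier in this guide:

# Give the prometheus user ownership of the rules file
sudo chown prometheus:prometheus /etc/prometheus/rules.yml

# Validate the rules file syntax
promtool check rules /etc/prometheus/rules.yml

# Restart so Prometheus and Alertmanager pick up the changes
sudo systemctl restart prometheus
sudo systemctl restart alertmanager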
Now you will receive an email alert if one of your rule triggering conditions is met.
Updated prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'substrate_node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9615']
Secure Your Validator¶
Validators in Polkadot's Proof of Stake network play a critical role in maintaining network integrity and security by keeping the network in consensus and verifying state transitions. To ensure optimal performance and minimize risks, validators must adhere to strict guidelines around security and reliable operations.
Key Management¶
Although session keys don't control funds, they are essential for validators because they sign messages related to consensus and parachains. Securing session keys is crucial: if they are exploited or used across multiple nodes, the validator's staked funds can be lost via slashing.
Given the current limitations in high-availability setups and the risks associated with double-signing, it’s recommended to run only a single validator instance. Keys should be securely managed, and processes automated to minimize human error.
There are two approaches for generating session keys:
- Generate and store in node - using the author.rotateKeys RPC call. For most users, generating keys directly within the client is recommended. You must submit a session certificate from your staking proxy to register new keys. See the How to Validate guide for instructions on setting keys
Generate outside node and insert - using the
author.setKeys
RPC call. This flexibility accommodates advanced security setups and should only be used by experienced validator operators
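As an illustration of the first approach, a common way to invoke author.rotateKeys is a curl call against the node's local JSON-RPC endpoint; the port is an assumption that depends on your node's configuration (9944 is the unified RPC default on recent releases, while older nodes exposed HTTP RPC on 9933):

curl -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"author_rotateKeys", "params":[]}' \
  http://localhost:9944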
Signing Outside the Client¶
Polkadot plans to support external signing, allowing session keys to reside in secure environments like Hardware Security Modules (HSMs). However, these modules can sign any payload they receive, potentially enabling an attacker to perform slashable actions.
Secure-Validator Mode¶
Polkadot's Secure-Validator mode offers an extra layer of protection through strict filesystem, networking, and process sandboxing. This secure mode is activated by default if the machine meets the following requirements:
- Linux (x86-64 architecture) - usually Intel or AMD
- Enabled seccomp - this kernel feature facilitates a more secure approach for process management on Linux. Verify by running the command shown below; if seccomp is enabled, you should see output similar to the following:
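One way to check, assuming the running kernel's config is exposed under /boot:

grep CONFIG_SECCOMP= /boot/config-$(uname -r)

# Expected output when seccomp is enabled:
CONFIG_SECCOMP=y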
Note
Optionally, Linux 5.13 may also be used, as it provides access to even more strict filesystem protections.
Linux Best Practices¶
Follow these best practices to keep your validator secure:
- Use a non-root user for all operations
- Regularly apply OS security patches
- Enable and configure a firewall
- Use key-based SSH authentication; deactivate password-based login
- Regularly back up data and harden your SSH configuration. Visit this SSH guide for more details
Validator Best Practices¶
These best practices can add a further layer of security and operational reliability:
- Only run the Polkadot binary, and only listen on the configured p2p port
- Run on bare-metal machines, as opposed to virtual machines
- Provisioning of the validator machine should be automated and defined in code which is kept in private version control, reviewed, audited, and tested
- Generate and provide session keys in a secure way
- Start Polkadot at boot and restart if stopped for any reason
- Run Polkadot as a non-root user
- Establish and maintain an on-call rotation for managing alerts
- Establish and maintain a clear protocol with actions to perform for each level of each alert with an escalation policy
Additional Resources¶
- Certus One's Knowledge Base
- EOS Block Producer Security List
- HSM Policies and the Importance of Validator Security
For additional guidance, connect with other validators and the Polkadot engineering team in the Polkadot Validator Lounge on Element.