General Management

Introduction

Validator performance is pivotal in maintaining the security and stability of the Polkadot network. As a validator, optimizing your setup ensures efficient transaction processing, minimizes latency, and maintains system reliability during high-demand periods. Proper configuration and proactive monitoring also help mitigate risks like slashing and service interruptions.

This guide covers essential practices for managing a validator, including performance tuning techniques, security hardening, and tools for real-time monitoring. Whether you're fine-tuning CPU settings, configuring NUMA balancing, or setting up a robust alert system, these steps will help you build a resilient and efficient validator operation.

Configuration Optimization

For those seeking to optimize their validator's performance, the following configurations can improve responsiveness, reduce latency, and ensure consistent performance during high-demand periods.

Deactivate Simultaneous Multithreading

Polkadot validators operate primarily in single-threaded mode for critical tasks, so optimizing single-core CPU performance can reduce latency and improve stability. Deactivating simultaneous multithreading (SMT) can prevent virtual cores from affecting performance. SMT is called Hyper-Threading on Intel and 2-way SMT on AMD Zen.

Take the following steps to deactivate every other (vCPU) core:

  1. Loop through all the CPU cores and deactivate the virtual cores associated with them (run as root):

    for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | \
    cut -s -d, -f2- | tr ',' '\n' | sort -un)
    do
    echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
    done
    
  2. To permanently save the changes, add nosmt=force to the GRUB_CMDLINE_LINUX_DEFAULT variable in /etc/default/grub:

    sudo nano /etc/default/grub
    # Add to GRUB_CMDLINE_LINUX_DEFAULT
    
    /etc/default/grub
    GRUB_HIDDEN_TIMEOUT=0
    GRUB_HIDDEN_TIMEOUT_QUIET=true
    GRUB_TIMEOUT=10
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    GRUB_CMDLINE_LINUX_DEFAULT="nosmt=force"
    GRUB_CMDLINE_LINUX=""
    
  3. Update GRUB to apply the changes, then reboot:

    sudo update-grub
    sudo reboot
    
  4. After the reboot, half of the cores should be offline. To confirm, run:

    lscpu --extended
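
To preview which cores the loop in step 1 would take offline, the same pipeline can be run against sample data first. The following is a dry-run sketch under stated assumptions: `secondary_siblings` is a hypothetical helper name, and the sample input imitates a 4-core, 8-thread CPU whose `thread_siblings_list` files use the comma-separated form. Some systems report sibling ranges such as `0-1` instead, which this pipeline skips.

```shell
# Hypothetical helper: read thread_siblings_list lines on stdin and print
# every sibling except the first one in each group (the "virtual" cores).
secondary_siblings() {
  cut -s -d, -f2- | tr ',' '\n' | sort -un
}

# Sample data (assumed): each physical core N is paired with thread N+4.
printf '0,4\n1,5\n2,6\n3,7\n0,4\n1,5\n2,6\n3,7\n' | secondary_siblings
```

With the sample data, this prints 4 through 7, one ID per line. On a real machine, feed it with `cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | secondary_siblings` and compare the result against `lscpu --extended` before changing anything.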
    

Deactivate Automatic NUMA Balancing

Deactivating NUMA (Non-Uniform Memory Access) balancing for multi-CPU setups helps keep processes on the same CPU node, minimizing latency.

Follow these steps:

  1. Deactivate NUMA balancing at runtime:

    sudo sysctl kernel.numa_balancing=0
    
  2. Deactivate NUMA balancing permanently by adding numa_balancing=disable to the GRUB settings:

    sudo nano /etc/default/grub
    # Add to GRUB_CMDLINE_LINUX_DEFAULT
    
    /etc/default/grub
    GRUB_DEFAULT=0
    GRUB_HIDDEN_TIMEOUT=0
    GRUB_HIDDEN_TIMEOUT_QUIET=true
    GRUB_TIMEOUT=10
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    GRUB_CMDLINE_LINUX_DEFAULT="numa_balancing=disable"
    GRUB_CMDLINE_LINUX=""
    
  3. Update GRUB to apply changes:

    sudo update-grub
    
  4. Confirm the deactivation:

    sysctl -a | grep 'kernel.numa_balancing'
    

If you successfully deactivated NUMA balancing, the preceding command should print kernel.numa_balancing = 0.
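
For use in a provisioning or health-check script, the same check can be wrapped in a function. This is a minimal sketch: `numa_balancing_status` is a hypothetical helper, and it assumes the setting is exposed at `/proc/sys/kernel/numa_balancing`, which is the file behind the `kernel.numa_balancing` sysctl. The optional path argument exists only so the function can be tried against a test file.

```shell
# Hypothetical helper: report the automatic NUMA balancing state.
# Optional argument: path to read instead of the default procfs file.
numa_balancing_status() {
  file="${1:-/proc/sys/kernel/numa_balancing}"
  if [ ! -r "$file" ]; then
    # The file is absent when the kernel has no NUMA balancing support.
    echo "unavailable"
    return 0
  fi
  case "$(cat "$file")" in
    0) echo "disabled" ;;
    *) echo "enabled" ;;
  esac
}

numa_balancing_status
```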

Spectre and Meltdown Mitigations

Spectre and Meltdown are well-known CPU vulnerabilities that exploit speculative execution to access sensitive data. These vulnerabilities have been patched in recent Linux kernels, but the mitigations can slightly impact performance, especially in high-throughput or containerized environments.

If your security requirements allow it, you can deactivate specific mitigations, such as Spectre V2 and Speculative Store Bypass Disable (SSBD), to improve performance.

To selectively deactivate the Spectre mitigations, take these steps:

  1. Update the GRUB_CMDLINE_LINUX_DEFAULT variable in your /etc/default/grub configuration:

    sudo nano /etc/default/grub
    # Add to GRUB_CMDLINE_LINUX_DEFAULT
    
    /etc/default/grub
    GRUB_DEFAULT=0
    GRUB_HIDDEN_TIMEOUT=0
    GRUB_HIDDEN_TIMEOUT_QUIET=true
    GRUB_TIMEOUT=10
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    GRUB_CMDLINE_LINUX_DEFAULT="spec_store_bypass_disable=prctl spectre_v2_user=prctl"
    
  2. Update GRUB to apply changes and then reboot:

    sudo update-grub
    sudo reboot

This approach relaxes the Spectre V2 (user-space) and Spectre V4 mitigations: with the prctl setting, they apply only to processes that explicitly request them, while other protections stay intact. For full security, keep mitigations activated unless there's a significant performance need, as disabling them could expose the system to potential attacks on affected CPUs.
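
After the reboot, the kernel reports the state of each mitigation under `/sys/devices/system/cpu/vulnerabilities`, with entries such as `spectre_v2` and `spec_store_bypass`. The sketch below lists them; `show_mitigations` is a hypothetical helper, and its optional directory argument is there only for trying it against test data.

```shell
# Hypothetical helper: print "name: status" for each vulnerability entry.
# Optional argument: directory to read instead of the default sysfs path.
show_mitigations() {
  dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
  if [ ! -d "$dir" ]; then
    echo "no vulnerability information at $dir"
    return 0
  fi
  for f in "$dir"/*; do
    [ -f "$f" ] && printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
  done
  return 0
}

show_mitigations
```

Comparing the output before and after the change is one way to confirm that only the intended entries differ.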

Monitor Your Node

Monitoring your node's performance is critical for network reliability and security. Tools like the following provide valuable insights:

  • Prometheus - an open-source monitoring toolkit for collecting and querying time-series data
  • Grafana - a visualization tool for real-time metrics, providing interactive dashboards
  • Alertmanager - a tool for managing and routing alerts based on Prometheus data

This section covers setting up these tools and configuring alerts to notify you of potential issues.

Environment Setup

Before installing Prometheus, ensure the environment is set up securely by running Prometheus with restricted user privileges.

Follow these steps:

  1. Create a Prometheus user to ensure Prometheus runs with minimal permissions:

    sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus
    
  2. Create directories for configuration and data storage:

    sudo mkdir /etc/prometheus
    sudo mkdir /var/lib/prometheus
    
  3. Change directory ownership to ensure Prometheus has access:

    sudo chown -R prometheus:prometheus /etc/prometheus
    sudo chown -R prometheus:prometheus /var/lib/prometheus
    

Install and Configure Prometheus

After setting up the environment, install and configure the latest version of Prometheus as follows:

  1. Download Prometheus for your system architecture from the releases page. Replace INSERT_RELEASE_DOWNLOAD_LINK with the release binary URL (e.g., https://github.com/prometheus/prometheus/releases/download/v3.0.0/prometheus-3.0.0.linux-amd64.tar.gz):

    sudo apt-get update && sudo apt-get upgrade
    wget INSERT_RELEASE_DOWNLOAD_LINK
    tar xfz prometheus-*.tar.gz
    cd prometheus-3.0.0.linux-amd64
    
  2. Set up Prometheus:

    1. Copy binaries:

      sudo cp ./prometheus /usr/local/bin/
      sudo cp ./promtool /usr/local/bin/
      
    2. Copy directories and assign ownership of these files to the prometheus user:

      sudo cp -r ./consoles /etc/prometheus
      sudo cp -r ./console_libraries /etc/prometheus
      sudo chown -R prometheus:prometheus /etc/prometheus/consoles
      sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries
      
    3. Clean up the download directory:

      cd .. && rm -r prometheus*
      
  3. Create prometheus.yml to define global settings, rule files, and scrape targets:

    sudo nano /etc/prometheus/prometheus.yml
    
    prometheus-config.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      # - "first.rules"
      # - "second.rules"
    
    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'substrate_node'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9615']
    

    In this example configuration, Prometheus scrapes its own metrics and the node's metrics every 5 seconds, overriding the 15-second global default for both jobs. The node exposes its metrics on port 9615 by default, and each job's scrape_interval can be adjusted as needed.

  4. Verify the configuration with promtool, the command-line utility bundled with Prometheus:

    promtool check config /etc/prometheus/prometheus.yml
    
  5. Save the configuration and change the ownership of the file to prometheus user:

    sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
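
Before relying on the `substrate_node` scrape job, it can help to confirm that the node actually serves metrics on port 9615. The sketch below assumes a node running on localhost and filters the Prometheus text format with a hypothetical helper, `metric_lines`. The metric name `substrate_block_height` is used for illustration only and may differ between node versions.

```shell
# Hypothetical helper: keep only the sample lines for one metric name,
# dropping the "# HELP" and "# TYPE" comment lines of the text format.
metric_lines() {
  grep -v '^#' | grep "^$1"
}

# Assumed endpoint: a node exposing metrics on localhost:9615.
# The "|| true" keeps the sketch harmless when no node is running.
curl -fsS --max-time 5 http://localhost:9615/metrics 2>/dev/null \
  | metric_lines substrate_block_height || true
```

If nothing is printed, the node may not be running, or its metrics endpoint may be bound to a different address or port.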
    

Start Prometheus

  1. Launch Prometheus with the appropriate configuration file, storage location, and necessary web resources, running it with restricted privileges for security:

    sudo -u prometheus /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries
    

    If you set the server up properly, Prometheus stays in the foreground and prints its startup log messages to the terminal.

  2. Verify you can access the Prometheus interface by navigating to:

    http://SERVER_IP_ADDRESS:9090/graph
    

    If the interface appears to work as expected, exit the process using Control + C.

  3. Create a systemd service file to ensure Prometheus starts on boot:

    sudo nano /etc/systemd/system/prometheus.service
    
    prometheus.service
    [Unit]
    Description=Prometheus Monitoring
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
     --config.file /etc/prometheus/prometheus.yml \
     --storage.tsdb.path /var/lib/prometheus/ \
     --web.console.templates=/etc/prometheus/consoles \
     --web.console.libraries=/etc/prometheus/console_libraries
    ExecReload=/bin/kill -HUP $MAINPID
    
    [Install]
    WantedBy=multi-user.target
    
  4. Reload systemd and enable the service to start on boot:

    sudo systemctl daemon-reload && sudo systemctl enable prometheus && sudo systemctl start prometheus
    
  5. Verify the service is running by visiting the Prometheus interface again at:

    http://SERVER_IP_ADDRESS:9090/
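
When these steps are scripted, a readiness probe can stand in for the browser check. Prometheus serves a `/-/ready` endpoint that returns success once it can handle requests. The sketch below polls it with a hypothetical `retry` helper; the address `localhost:9090` and the `RETRY_DELAY` variable are assumptions made for illustration.

```shell
# Hypothetical helper: run a command until it succeeds, up to N attempts.
# RETRY_DELAY (seconds between attempts) is an illustrative knob, default 1.
retry() {
  attempts="$1"; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "${RETRY_DELAY:-1}"
  done
  return 1
}

# Assumed address: Prometheus listening on localhost:9090.
if retry 3 curl -fsS --max-time 2 -o /dev/null http://localhost:9090/-/ready 2>/dev/null; then
  echo "Prometheus is ready"
else
  echo "Prometheus is not reachable yet"
fi
```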
    

Install and Configure Grafana

This guide follows Grafana's canonical installation instructions.

To install and configure Grafana, follow these steps:

  1. Install Grafana prerequisites:

    sudo apt-get install -y apt-transport-https software-properties-common wget    
    
  2. Import the GPG key:

    sudo mkdir -p /etc/apt/keyrings/
    wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
    
  3. Configure the stable release repo and update packages:

    echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
    sudo apt-get update
    
  4. Install the latest stable version of Grafana:

    sudo apt-get install grafana
    

To configure Grafana, take these steps:

  1. Configure Grafana to start automatically on boot and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable grafana-server.service
    sudo systemctl start grafana-server
    
  2. Check if Grafana is running:

    sudo systemctl status grafana-server
    

    If necessary, you can stop or restart the service with the following commands:

    sudo systemctl stop grafana-server
    sudo systemctl restart grafana-server
    
  3. Access Grafana by navigating to the following URL and logging in with the default username and password (both admin):

    http://SERVER_IP_ADDRESS:3000/login
    

    Change default port

    To change Grafana's port, edit /usr/share/grafana/conf/defaults.ini:

    sudo vim /usr/share/grafana/conf/defaults.ini
    

    Modify the http_port value, then restart Grafana:

    sudo systemctl restart grafana-server
    

To visualize node metrics, follow these steps:

  1. Select the gear icon to access Data Sources settings
  2. Select Add data source to define the data source
  3. Select Prometheus
  4. Enter http://localhost:9090 in the URL field and click Save & Test. If "Data source is working" appears, your connection is configured correctly
  5. Select Import from the left menu, choose Prometheus from the dropdown, and click Import
  6. Start your Polkadot node by running ./polkadot. You should now be able to monitor node performance, block height, network traffic, and tasks on the Grafana dashboard

The Grafana dashboards page features user-created dashboards made available for public use. For an example, see the Substrate Node Metrics dashboard.

Install and Configure Alertmanager

Alertmanager is an optional component that complements Prometheus by managing alerts and notifying users about potential issues.

Follow these steps to install and configure Alertmanager:

  1. Download Alertmanager for your system architecture from the releases page. Replace INSERT_RELEASE_DOWNLOAD_LINK with the release binary URL (e.g., https://github.com/prometheus/alertmanager/releases/download/v0.28.0-rc.0/alertmanager-0.28.0-rc.0.linux-amd64.tar.gz):

    wget INSERT_RELEASE_DOWNLOAD_LINK
    tar -xvzf alertmanager*
    
  2. Copy the binaries to the system directory and set permissions:

    cd alertmanager-0.28.0-rc.0.linux-amd64
    sudo cp ./alertmanager /usr/local/bin/
    sudo cp ./amtool /usr/local/bin/
    sudo chown prometheus:prometheus /usr/local/bin/alertmanager
    sudo chown prometheus:prometheus /usr/local/bin/amtool
    
  3. Create the alertmanager.yml configuration file under /etc/alertmanager:

    sudo mkdir /etc/alertmanager
    sudo nano /etc/alertmanager/alertmanager.yml
    

    Generate an app password in your Google account to enable email notifications from Alertmanager. Then, add the following code to the configuration file to define email notifications using your email and app password:

    alertmanager.yml
    global:
      resolve_timeout: 1m
    
    route:
      receiver: 'gmail-notifications'
    
    receivers:
      - name: 'gmail-notifications'
        email_configs:
          - to: INSERT_YOUR_EMAIL
            from: INSERT_YOUR_EMAIL
            smarthost: smtp.gmail.com:587
            auth_username: INSERT_YOUR_EMAIL
            auth_identity: INSERT_YOUR_EMAIL
            auth_password: INSERT_YOUR_APP_PASSWORD
            send_resolved: true
    
    sudo chown -R prometheus:prometheus /etc/alertmanager
    
  4. Configure Alertmanager as a service by creating a systemd service file:

    sudo nano /etc/systemd/system/alertmanager.service
    
    alertmanager.service
    [Unit]
    Description=AlertManager Server Service
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=root
    Group=root
    Type=simple
    ExecStart=/usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --web.external-url=http://SERVER_IP:9093 --cluster.advertise-address='0.0.0.0:9093'
    
    [Install]
    WantedBy=multi-user.target
    
  5. Reload and enable the service:

    sudo systemctl daemon-reload
    sudo systemctl enable alertmanager
    sudo systemctl start alertmanager
    
  6. Verify the service status:

    sudo systemctl status alertmanager
    

    If you have configured Alertmanager properly, the Active field should display active (running) similar to below:

    sudo systemctl status alertmanager
    alertmanager.service - AlertManager Server Service
       Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
       Active: active (running) since Thu 2020-08-20 22:01:21 CEST; 3 days ago
     Main PID: 20592 (alertmanager)
        Tasks: 70 (limit: 9830)
       CGroup: /system.slice/alertmanager.service
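
A running service does not prove that notifications are delivered. One way to exercise the email receiver is to post a test alert by hand to Alertmanager's v2 alerts API. The sketch below only builds and prints the payload; the `curl` call is left commented out. `test_alert_payload`, the alert name, and the address `localhost:9093` are assumptions made for illustration.

```shell
# Hypothetical helper: build a one-alert JSON payload for the v2 alerts API.
# Arguments: alert name, summary text (neither may contain double quotes,
# because this sketch does no JSON escaping).
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"%s"}}]\n' \
    "$1" "$2"
}

payload="$(test_alert_payload ManualTest 'Test alert sent by hand')"
echo "$payload"

# Assumed address: Alertmanager on localhost:9093. Uncomment to send the alert.
# curl -fsS -X POST -H 'Content-Type: application/json' \
#   -d "$payload" http://localhost:9093/api/v2/alerts
```

If the receiver is configured correctly, the posted alert should be routed to it like any alert coming from Prometheus.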

Grafana Plugin

There is an Alertmanager plugin in Grafana that can help you monitor alert information.

Follow these steps to use the plugin:

  1. Install the plugin:

    sudo grafana-cli plugins install camptocamp-prometheus-alertmanager-datasource
    
  2. Restart Grafana:

    sudo systemctl restart grafana-server
    
  3. Configure Alertmanager as a data source in your Grafana dashboard (SERVER_IP:3000):

    1. Go to Configuration > Data Sources and search for Prometheus Alertmanager
    2. Enter the server URL and port for the Alertmanager service, and select Save & Test to verify the connection
  4. Import the 8010 dashboard for Alertmanager, selecting Prometheus Alertmanager in the last column, then select Import

Integrate Alertmanager

Complete the integration by following these steps to enable communication between Prometheus and Alertmanager and configure detection and alert rules:

  1. Update the /etc/prometheus/prometheus.yml configuration file to include the following code:

    prometheus.yml
    rule_files:
      - 'rules.yml'
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - localhost:9093
    

    The complete prometheus.yml file should then look like this:

    prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    rule_files:
      - 'rules.yml'
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - localhost:9093
    
    scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'substrate_node'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9615']
    
  2. Create the rules file for detection and alerts:

    sudo nano /etc/prometheus/rules.yml
    

    Add a sample rule to trigger email notifications for node downtime over five minutes:

    rules.yml
    groups:
      - name: alert_rules
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: 'Instance [{{ $labels.instance }}] down'
              description: '[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 5 minutes.'
    

    If any of the conditions defined in the rules file are met, an alert will be triggered. For more on alert rules, refer to Alerting Rules and additional alerts.

  3. Update the file ownership to prometheus:

    sudo chown prometheus:prometheus /etc/prometheus/rules.yml
    
  4. Validate the rules syntax:

    sudo -u prometheus promtool check rules /etc/prometheus/rules.yml
    
  5. Restart Prometheus and Alertmanager:

    sudo systemctl restart prometheus && sudo systemctl restart alertmanager
    

Now you will receive an email alert whenever one of your alert rules fires.

Secure Your Validator

Validators in Polkadot's Proof of Stake (PoS) network play a critical role in maintaining network integrity and security by keeping the network in consensus and verifying state transitions. To ensure optimal performance and minimize risks, validators must adhere to strict guidelines around security and reliable operations.

Key Management

Though they don't transfer funds, session keys are essential for validators as they sign messages related to consensus and parachains. Securing session keys is crucial as allowing them to be exploited or used across multiple nodes can lead to a loss of staked funds via slashing.

Given the current limitations in high-availability setups and the risks associated with double-signing, it’s recommended to run only a single validator instance. Keys should be securely managed, and processes automated to minimize human error.

There are two approaches for generating session keys:

  • Generate and store in node - using the author.rotateKeys RPC call. For most users, generating keys directly within the client is recommended. You must submit a session certificate from your staking proxy to register new keys. See the How to Validate guide for instructions on setting keys

  • Generate outside node and insert - using the author.setKeys RPC call. This flexibility accommodates advanced security setups and should only be used by experienced validator operators
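
As a sketch of the first approach, the key rotation can be requested over JSON-RPC, where the method is named `author_rotateKeys`. The example below only prints the request body; the `curl` call is commented out. It assumes the node's RPC server listens on `localhost:9944` and that the command runs on the validator host itself. `rpc_request` is a hypothetical helper.

```shell
# Hypothetical helper: build a JSON-RPC 2.0 request body with no parameters.
rpc_request() {
  printf '{"jsonrpc":"2.0","id":1,"method":"%s","params":[]}\n' "$1"
}

rpc_request author_rotateKeys

# Assumed endpoint: the node's RPC server on localhost:9944. Uncomment to run;
# the response carries the newly generated public session keys.
# curl -fsS -H 'Content-Type: application/json' \
#   -d "$(rpc_request author_rotateKeys)" http://localhost:9944
```

Keeping the RPC port closed to the outside and calling it only from the host itself is consistent with the security practices described in this guide.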

Signing Outside the Client

Polkadot plans to support external signing, allowing session keys to reside in secure environments like Hardware Security Modules (HSMs). However, these modules can sign any payload they receive, potentially enabling an attacker to perform slashable actions.

Secure-Validator Mode

Polkadot's Secure-Validator mode offers an extra layer of protection through strict filesystem, networking, and process sandboxing. This secure mode is activated by default if the machine meets the following requirements:

  • Linux (x86-64 architecture) - usually Intel or AMD
  • Enabled seccomp - this kernel feature facilitates a more secure approach for process management on Linux. Verify by running:

    cat /boot/config-`uname -r` | grep CONFIG_SECCOMP=
    

    If seccomp is enabled, you should see output similar to the following:

    CONFIG_SECCOMP=y
    

Tip

Optionally, Linux kernel 5.13 or later may also be used, as it provides access to even more strict filesystem protections.
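
The seccomp check above can be folded into a preflight script. This is a minimal sketch, assuming the kernel configuration is available under `/boot` as in the command shown earlier. `seccomp_enabled` is a hypothetical helper, and its optional argument is there only for trying it against a test file.

```shell
# Hypothetical helper: succeed if the given kernel config enables seccomp.
# Optional argument: config file to read instead of the running kernel's.
seccomp_enabled() {
  config="${1:-/boot/config-$(uname -r)}"
  [ -r "$config" ] && grep -q '^CONFIG_SECCOMP=y' "$config"
}

if seccomp_enabled; then
  echo "seccomp: enabled"
else
  echo "seccomp: not confirmed (config missing or option disabled)"
fi

# Print the running kernel release to compare against 5.13 by eye.
uname -r
```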

Linux Best Practices

Follow these best practices to keep your validator secure:

  • Use a non-root user for all operations
  • Regularly apply OS security patches
  • Enable and configure a firewall
  • Use key-based SSH authentication; deactivate password-based login
  • Regularly back up data and harden your SSH configuration. Visit this SSH guide for more details
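
As one example of auditing these practices, the sketch below checks whether password-based SSH login is explicitly turned off. `password_login_disabled` is a hypothetical helper, and `/etc/ssh/sshd_config` is assumed as the configuration path. Drop-in files or `Match` blocks can override this setting, so treat the result as a hint; `sudo sshd -T` prints the effective configuration.

```shell
# Hypothetical helper: succeed if the file explicitly disables password login.
# Matching is case-insensitive because sshd keywords are case-insensitive.
password_login_disabled() {
  file="${1:-/etc/ssh/sshd_config}"
  [ -r "$file" ] && \
    grep -Eiq '^[[:space:]]*PasswordAuthentication[[:space:]]+no' "$file"
}

if password_login_disabled; then
  echo "PasswordAuthentication is set to no"
else
  echo "PasswordAuthentication is not explicitly disabled"
fi
```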

Validator Best Practices

The following validator-specific practices add another layer of security and operational reliability:

  • Only run the Polkadot binary, and only listen on the configured p2p port
  • Run on bare-metal machines, as opposed to virtual machines
  • Automate provisioning of the validator machine and define it in code that is kept in private version control, reviewed, audited, and tested
  • Generate and provide session keys in a secure way
  • Start Polkadot at boot and restart if stopped for any reason
  • Run Polkadot as a non-root user
  • Establish and maintain an on-call rotation for managing alerts
  • Establish and maintain a clear protocol with actions to perform for each level of each alert with an escalation policy
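
Two of these practices, starting Polkadot at boot and restarting it when it stops, can be handled by systemd in the same way as the Prometheus service earlier in this guide. The sketch below prints a minimal unit file. `polkadot_unit` is a hypothetical helper, and the service user `polkadot`, the binary path `/usr/local/bin/polkadot`, and the 10-second restart delay are assumptions to adapt to your setup.

```shell
# Hypothetical helper: print a minimal unit file for a validator service.
# Arguments: service user, path to the polkadot binary (both assumptions).
polkadot_unit() {
  cat <<EOF
[Unit]
Description=Polkadot Validator
Wants=network-online.target
After=network-online.target

[Service]
User=$1
ExecStart=$2 --validator
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

polkadot_unit polkadot /usr/local/bin/polkadot
```

To use it, redirect the output to a file such as /etc/systemd/system/polkadot.service with `sudo tee`, then run `sudo systemctl daemon-reload` and `sudo systemctl enable --now polkadot`.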

Additional Resources

For additional guidance, connect with other validators and the Polkadot engineering team in the Polkadot Validator Lounge on Element.

Last update: February 12, 2025
| Created: October 16, 2024