Runtime Metrics and Monitoring¶
Introduction¶
Maintaining a stable, secure, and efficient network requires continuous monitoring. Polkadot SDK-based nodes are equipped with built-in telemetry components that automatically collect and transmit detailed data about node performance in real-time. This telemetry system is a core feature of the Substrate framework, allowing for easy monitoring of network health without complex setup.
Substrate's client telemetry enables real-time data ingestion, which can be visualized on a client dashboard. The telemetry process uses tracing and logging to gather operational data. This data is sent through a tracing layer to a background task called the TelemetryWorker
, which then forwards it to configured remote telemetry servers.
If multiple Substrate nodes run within the same process, the telemetry system uses a tracing::Span
to distinguish data from each node. This ensures that each task, managed by the sc-service
's TaskManager
, inherits a span for data consistency, making it easy to track parallel node operations. Each node can be monitored for basic metrics, such as block height, peer connections, CPU usage, and memory. Substrate nodes expose these metrics at the host:9615/metrics
endpoint, accessible locally by default. To expose metrics on all interfaces, start a node with the --prometheus-external
flag.
As a developer or node operator, the telemetry system handles most of the technical setup. Collected data is automatically sent to a default telemetry server, where it’s aggregated and displayed on a dashboard, making it easy to monitor network performance and identify issues.
Runtime Metrics¶
Substrate exposes a variety of metrics about the operation of your network, such as the number of peer connections, memory usage, and block production. To capture and visualize these metrics, you can configure and use tools like Prometheus and Grafana. At a high level, Substrate exposes telemetry data that can be consumed by the Prometheus endpoint and then presented as visual information in a Grafana dashboard or graph. The provided diagram offers a simplified overview of how the interaction between Substrate, Prometheus, and Grafana can be configured to display information about node operations.
graph TD
subNode([Substrate Node]) --> telemetryStream[Exposed Telemetry Stream]
telemetryStream --> prometheus[Prometheus]
prometheus --> endpoint[Endpoint: Every 1 minute]
endpoint --> grafana[Grafana]
grafana --> userOpen[User Opens a Graph]
prometheus --> localData[Local Prometheus Data]
localData --> getmetrics[Get Metrics]
The diagram shows the flow of data from the Substrate node to the monitoring and visualization components. The Substrate node exposes a telemetry stream, which is consumed by Prometheus. Prometheus is configured to collect data every minute and store it. Grafana is then used to visualize the data, allowing the user to open graphs and retrieve specifc metrics from the telemetry stream.
Visual Monitoring¶
The Polkadot telemetry dashboard provides a real-time view of how currently online nodes are performing. This dashboard, allows users to select the network you need to check on, and also the information you want to display by turning visible columns on and off from the list of columns available. The monitoring dashboard provides the following indicators and metrics:
- Validator - identifies whether the node is a validator node or not
- Location - displays the geographical location of the node
- Implementation - shows the version of the software running on the node
- Network ID - displays the public network identifier for the node
- Peer count - indicates the number of peers connected to the node
- Transactions in queue - shows the number of transactions waiting in the
Ready
queue for a block author - Upload bandwidth - graphs the node's recent upload activity in MB/s
- Download bandwidth - graphs the node's recent download activity in MB/s
- State cache size - graphs the size of the node's state cache in MB
- Block - displays the current best block number to ensure synchronization with peers
- Block hash - shows the block hash for the current best block number
- Finalized block - displays the most recently finalized block number to ensure synchronization with peers
- Finalized block hash - shows the block hash for the most recently finalized block
- Block time - indicates the time between block executions
- Block propagation time - displays the time it took to import the most recent block
- Last block time - shows the time it took to author the most recent block
- Node uptime - indicates the number of days the node has been online without restarting
Displaying Network-Wide Statistics¶
In addition to the details available for individual nodes, you can view statistics that provide insights into the broader network. The network statistics provide detailed information about the hardware and software configurations of the nodes in the network, including:
- Software version
- Operating system
- CPU architecture and model
- Number of physical CPU cores
- Total memory
- Whether the node is a virtual machine
- Linux distribution and kernel version
- CPU and memory speed
- Disk speed
Customizing Monitoring Tools¶
The default telemetry dashboard offers core metrics without additional setup. However, many projects prefer custom telemetry setups with more advanced monitoring and alerting policies.
Typically, setting up a custom telemetry solution involves establishing monitoring and alerting policies for both on-chain events and individual node operations. This allows for more tailored monitoring and reporting compared to the default telemetry setup.
On-Chain Activity¶
You can monitor specific on-chain events like transactions from certain addresses or changes in the validator set. Connecting to RPC nodes allows tracking for delays or specific event timings. Running your own RPC servers is recommended for reliable queries, as public RPC nodes may occasionally be unreliable.
Monitoring Tools¶
To implement customized monitoring and alerting, consider using the following stack:
- Prometheus - collects metrics at intervals, stores data in a time series database, and applies rules for evaluation
- Grafana - visualizes collected data through customizable dashboards
- Node exporter - reports host metrics, including CPU, memory, and bandwidth usage
- Alert manager - manages alerts, routing them based on defined rules
- Loki - scalable log aggregator for searching and viewing logs across infrastructure
Change the Telemetry Server¶
Once backend monitoring is configured, use the --telemetry-url
flag when starting a node to specify telemetry endpoints and verbosity levels. Multiple telemetry URLs can be provided, and verbosity ranges from 0 (least verbose) to 9 (most verbose).
For instance, setting a custom telemetry server with verbosity level 5 would look like:
./target/release/node-template --dev \
--telemetry-url "wss://192.168.48.1:9616 5" \
--prometheus-port 9616 \
--prometheus-external
For more information on the backend components for telemetry or configuring your own server, you can refer to the substrate-telemetry
project or the Substrate Telemetry Helm Chart for Kubernetes deployments.
| Created: October 18, 2024