Monitor the health of the SystemLink Enterprise DataFrame Service using OpenTelemetry and Prometheus metrics.

Refer to the following table of metrics emitted by the DataFrame Service and related dependencies.
Note: You can deploy and configure the OpenTelemetry Collector to expose all OpenTelemetry metrics as Prometheus metrics. Exposing the metrics to Prometheus allows you to view them in a tool such as Grafana.

For metrics that contain ni.dataframe.row_data_store.{object_storage}_stream_pool, the service automatically replaces {object_storage} with s3 or azure when emitting the metric, depending on the object storage provider the service is connected to.
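The substitution above can be illustrated with a minimal sketch. This is not service code; the template string and helper function are hypothetical and exist only to show how the placeholder resolves to the concrete metric names you will see in your metrics backend.

```python
# Illustrative sketch (not service code): how the {object_storage}
# placeholder in a metric name resolves to a concrete provider name.

METRIC_TEMPLATE = "ni.dataframe.row_data_store.{object_storage}_stream_pool.blocks.count"

def resolve_metric_name(template: str, provider: str) -> str:
    """Substitute the configured object storage provider into a metric name."""
    if provider not in ("s3", "azure"):
        raise ValueError(f"unsupported object storage provider: {provider}")
    return template.format(object_storage=provider)

print(resolve_metric_name(METRIC_TEMPLATE, "s3"))
# ni.dataframe.row_data_store.s3_stream_pool.blocks.count
```

A service connected to Azure blob storage would emit ni.dataframe.row_data_store.azure_stream_pool.blocks.count instead.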

DataFrame Service

Table 60. Performance Metrics for the DataFrame Service
KPI? Metric Type Description Labels
Yes ni.dataframe.staged_row_data_processor.staging.files.found.count Counter The number of staging files found in storage.

Use with ni.dataframe.staged_row_data_processor.staging.files.orphaned.count to understand if the service is falling behind in processing files.

None
Yes ni.dataframe.staged_row_data_processor.staging.files.orphaned.count Counter The number of staging files deleted as orphans.
Use with ni.dataframe.staged_row_data_processor.staging.files.found.count to understand if the service is falling behind in processing files. In ideal operation, this metric is zero. One of the following situations can cause a nonzero value.
  • The connection between the service and MongoDB is intermittent.
  • A client is writing data using a pattern that needs adjustment.
None
Yes ni.dataframe.staged_row_data_processor.staging.files.missing.count Counter The number of staging files that are missing.
This metric indicates one of the following issues.
  • The S3 storage is not consistent.
  • A backup and restore operation broke the consistency between S3 and MongoDB.
  • When coupled with a ni.dataframe.staged_row_data_processor.staging.files.orphaned.count value that is not zero, the dataframeservice.ingestion.stagedDataProcessor.stagingFileExpiration Helm value is too low.
None
Yes ni.dataframe.staged_row_data_processor.claims.lost.count Counter The number of claims lost during processing.
This metric indicates one of the following issues.
  • The dataframeservice.ingestion.stagedDataProcessor.tableClaimExpiration Helm value is too low.
  • Users are deleting tables that are still receiving new data.

None
Yes ni.dataframe.staged_row_data_processor.claims.with.errors.count Counter The number of claims that encountered errors during processing.

A value greater than zero indicates that the service is returning 500 errors.

ni_dataframe_staged_row_data_processor_phase: [1, 2]
No ni.dataframe.staged_row_data_processor.skipped.storage.ids.count Counter The number of discovered storage IDs that were not processed. None
No ni.dataframe.staged_row_data_processor.failed.to.claim.count Counter The number of discovered storage IDs without a claim. None
No ni.dataframe.staged_row_data_processor.claims.processed.count Counter The number of claims processed. ni_dataframe_staged_row_data_processor_phase: [1, 2]
No ni.dataframe.staged_row_data_processor.sent.notifications.count Counter The number of notifications sent. None
No ni.dataframe.row_data_store.{object_storage}_stream_pool.blocks.count Counter The number of free blocks in the stream pool for object storage. None
No ni.dataframe.row_data_store.{object_storage}_stream_pool.allocations.count Counter The number of blocks allocated in the stream pool for object storage. None
No ni.dataframe.row_data_store.{object_storage}_stream_pool.discards.count Counter The number of buffers discarded from the stream pool for object storage. None
No ni.dataframe.row_data_store.{object_storage}_stream_pool.free.size.bytes Counter The number of bytes allocated but unused in the stream pool for object storage. None
No ni.dataframe.row_data_store.{object_storage}_stream_pool.used.size.bytes Counter The number of bytes currently in use by the stream pool for object storage. None
Yes ni.dataframe.table_reaper.tables.reaped.count Counter The number of tables deleted.

Use this metric to monitor the cleanup of tables.

ni_dataframe_table_reaper_reaped_result: [deleted, skipped, failed]
Yes ni.dataframe.tables.count Gauge The total number of data tables.

Use this metric to monitor data table growth. MongoDB resource requirements increase with the number of data tables.

None
Yes ni.dataframe.tables.appendable.count Gauge The number of active tables that can be appended to.

Use this metric to compare the number of appendable tables to the appendable table limit.

None
Yes ni.dataframe.iceberg_operations.duration Histogram The duration of Iceberg operations.
  • ni_dataframe_iceberg_operations_job_state: [Complete, Error]
  • ni_dataframe_iceberg_operations_operation_type: [Promoting, CompactingData, CompactingManifests, Vacuuming, FinalCompactingData, FinalCompactingManifests, FinalVacuuming]
  • ni_dataframe_iceberg_operations_changes_made: [true, false]
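As a worked example of using the staging-file counters together, the sketch below classifies processing health from counter deltas over a sampling window. The counter values and the 10% threshold are assumptions for illustration; the helper name is hypothetical and not part of the service, and the only grounded rule is the one stated in the table: an ideal system keeps the orphaned count at zero.

```python
# Hedged sketch: comparing staging-file counter deltas as the table suggests.
# The 0.1 ratio threshold is an assumed example value, not a documented limit.

def staging_health(found_delta: int, orphaned_delta: int) -> str:
    """Classify staging-file processing health over a sampling window.

    In ideal operation the orphaned count stays at zero; a nonzero
    value suggests the service is falling behind in processing files.
    """
    if orphaned_delta == 0:
        return "healthy"
    ratio = orphaned_delta / max(found_delta, 1)
    return "degraded" if ratio < 0.1 else "falling behind"

print(staging_health(found_delta=1000, orphaned_delta=0))
# healthy
```

In practice you would compute the deltas in your monitoring tool (for example, with a rate query in Grafana) rather than in application code; the function only makes the comparison explicit.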

DataFrame Service Dependencies

To learn about other available performance metrics and how to use the metrics, refer to the documentation for the DataFrame Service dependencies.
Table 61. References for Performance Metrics for DataFrame Service Dependencies
Dependency Where to Find Information
ASP.NET For a list of ASP.NET metrics, refer to ASP.NET Core Metrics and ASP.NET Runtime Metrics.
Kubernetes For a list of Kubernetes metrics, refer to Kubernetes Metrics Reference, cAdvisor Metrics, and the kube-state-metrics Documentation.
Dremio For a list of Dremio metrics, refer to Available JMX Metrics.