Performance Metrics for the DataFrame Service

Monitor the health of the SystemLink Enterprise DataFrame Service using OpenTelemetry metrics and Prometheus metrics.

The following table lists the metrics that the DataFrame Service emits. For metrics emitted by the dependencies of the DataFrame Service, refer to the documentation for each dependency. To view the metrics, deploy the OpenTelemetry Collector and configure it to expose all OpenTelemetry metrics as Prometheus metrics. You can then view the Prometheus metrics in a tool such as Grafana.
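
When OpenTelemetry metrics are exposed as Prometheus metrics, dotted names such as ni.dataframe.tables.appendable.count usually appear with underscores instead, as the label names in the table below already do. The following Python sketch approximates that renaming so that you can guess which name to search for in Grafana. It is an assumption, not the exact algorithm of the collector: the helper name to_prometheus_name is hypothetical, and depending on the exporter configuration, the exported name may also carry a suffix such as _total for counters or a unit.

```python
import re


def to_prometheus_name(otel_name: str) -> str:
    """Approximate the Prometheus-style name for an OpenTelemetry metric name.

    Hypothetical helper for illustration only. The real exporter may also
    append suffixes (for example "_total" or a unit) to the result.
    """
    # Replace every character outside the classic Prometheus set with "_".
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Classic Prometheus metric names must not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name


print(to_prometheus_name("ni.dataframe.tables.appendable.count"))
# prints: ni_dataframe_tables_appendable_count
```

If a search for the converted name returns nothing, try the same name with a _total suffix before concluding that the metric is not being exported.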

DataFrame Service

Table 52. Performance Metrics for the DataFrame Service
KPI? Metric Type Description Labels
Yes ni.dataframe.staged_row_data_processor.staging.files.found.count Counter The number of staging files detected in storage

Use with ni.dataframe.staged_row_data_processor.staging.files.orphaned.count to determine whether the service is falling behind in processing files.

None
Yes ni.dataframe.staged_row_data_processor.staging.files.orphaned.count Counter The number of staging files deleted as orphans
Use with ni.dataframe.staged_row_data_processor.staging.files.found.count to determine whether the service is falling behind in processing files. In ideal operation, this metric is zero. One of the following situations can cause a value greater than zero.
  • The connection between the service and MongoDB is intermittent.
  • A client is writing data using a pattern that you must adjust.
None
Yes ni.dataframe.staged_row_data_processor.staging.files.missing.count Counter The number of staging files missing
This metric indicates one of the following issues.
  • The S3 storage is not consistent.
  • A backup and restore operation broke the consistency between S3 and MongoDB.
  • When coupled with a ni.dataframe.staged_row_data_processor.staging.files.orphaned.count value that is not zero, the dataframeservice.ingestion.stagedDataProcessor.stagingFileExpiration Helm value is set too low.
None
Yes ni.dataframe.staged_row_data_processor.claims.lost.count Counter The number of claims lost during processing
This metric indicates one of the following issues.
  • The dataframeservice.ingestion.stagedDataProcessor.tableClaimExpiration Helm value is set too low.
  • Tables are being deleted while they are still being written to.
None
Yes ni.dataframe.staged_row_data_processor.claims.with.errors.count Counter The number of claims that encountered errors during processing

Treat values greater than zero as equivalent to the service returning HTTP 500 errors.

ni_dataframe_staged_row_data_processor_phase: [1, 2]
No ni.dataframe.staged_row_data_processor.skipped.storage.ids.count Counter The number of discovered storage IDs that were not processed None
No ni.dataframe.staged_row_data_processor.failed.to.claim.count Counter The number of discovered storage IDs that were not claimed None
No ni.dataframe.staged_row_data_processor.claims.processed.count Counter The number of claims processed ni_dataframe_staged_row_data_processor_phase: [1, 2]
No ni.dataframe.staged_row_data_processor.sent.notifications.count Counter The number of notifications sent None
No ni.dataframe.row_data_store.s3_stream_pool.blocks.count Counter The number of free blocks in the S3 stream pool None
No ni.dataframe.row_data_store.s3_stream_pool.allocations.count Counter The number of blocks allocated in the S3 stream pool None
No ni.dataframe.row_data_store.s3_stream_pool.discards.count Counter The number of buffers discarded from the S3 stream pool None
No ni.dataframe.row_data_store.s3_stream_pool.free.size.bytes Counter The number of bytes allocated but unused in the S3 stream pool None
No ni.dataframe.row_data_store.s3_stream_pool.used.size.bytes Counter The number of bytes currently in use by the S3 stream pool None
Yes ni.dataframe.table_reaper.tables.reaped.count Counter The number of tables deleted

Use this metric to monitor the cleanup of tables.

ni_dataframe_table_reaper_reaped_result: [deleted, skipped, failed]
Yes ni.dataframe.tables.appendable.count Gauge The number of active tables that can be appended to

Use this metric to compare the number of appendable tables against the limit for appendable tables.

None
Yes ni.dataframe.iceberg_operations.duration Histogram The duration of Iceberg operations
  • ni_dataframe_iceberg_operations_job_state: [Complete, Error]
  • ni_dataframe_iceberg_operations_operation_type: [Promoting, CompactingData, CompactingManifests, Vacuuming, FinalCompactingData, FinalCompactingManifests, FinalVacuuming]
  • ni_dataframe_iceberg_operations_changes_made: [true, false]
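
The KPI guidance in the table above can be sketched as a small check over metric values. This Python example is illustrative only: the function name check_kpis, the dictionary of samples keyed by OpenTelemetry metric name, and the appendable_limit parameter are all assumptions, and how you obtain the values (for example, from a Prometheus query) is left out. Because counters are cumulative, pass the increase over your observation window rather than the raw totals.

```python
from typing import Dict, List, Optional

PREFIX = "ni.dataframe.staged_row_data_processor."


def check_kpis(samples: Dict[str, float],
               appendable_limit: Optional[float] = None) -> List[str]:
    """Return warnings derived from the KPI guidance (hypothetical helper)."""
    found = samples.get(PREFIX + "staging.files.found.count", 0)
    orphaned = samples.get(PREFIX + "staging.files.orphaned.count", 0)
    missing = samples.get(PREFIX + "staging.files.missing.count", 0)
    lost = samples.get(PREFIX + "claims.lost.count", 0)
    errors = samples.get(PREFIX + "claims.with.errors.count", 0)
    appendable = samples.get("ni.dataframe.tables.appendable.count", 0)

    warnings: List[str] = []
    if orphaned > 0:
        # Ideal operation keeps the orphaned count at zero.
        warnings.append(
            f"{orphaned:g} of {found:g} staging files orphaned: check the "
            "MongoDB connection and client write patterns")
    if missing > 0:
        if orphaned > 0:
            # Missing plus orphaned files points at the expiration setting.
            warnings.append("staging files missing: stagingFileExpiration "
                            "may be set too low")
        else:
            warnings.append("staging files missing: check consistency "
                            "between S3 and MongoDB")
    if lost > 0:
        warnings.append("claims lost: tableClaimExpiration may be set too "
                        "low, or tables were deleted during writes")
    if errors > 0:
        warnings.append("claims with errors: treat like HTTP 500 errors")
    if appendable_limit is not None and appendable >= appendable_limit:
        warnings.append("appendable tables have reached the limit")
    return warnings


# Example with made-up values for one observation window.
example = {
    PREFIX + "staging.files.found.count": 120,
    PREFIX + "staging.files.orphaned.count": 3,
    PREFIX + "staging.files.missing.count": 1,
}
for warning in check_kpis(example):
    print(warning)
```

The checks only restate the conditions in the table. Any real alerting thresholds, such as how many orphaned files to tolerate before paging someone, are a deployment decision that this sketch does not make.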

DataFrame Service Dependencies

To learn about other available performance metrics and how to use the metrics, refer to the documentation for the DataFrame Service dependencies.
Table 53. References for Performance Metrics for DataFrame Service Dependencies
Dependency Where to Find Information
ASP.NET For a list of ASP.NET metrics, refer to ASP.NET Core Metrics and ASP.NET Runtime Metrics.
Kubernetes For a list of Kubernetes metrics, refer to Kubernetes Metrics Reference, cAdvisor Metrics, and the kube-state-metrics Documentation.
Dremio For a list of Dremio metrics, refer to Available JMX Metrics.