Enterprise AI Analysis
PromCopilot: Simplifying Prometheus Metric Querying in Cloud Native Online Service Systems via Large Language Models
This paper proposes PromCopilot, a framework that simplifies Prometheus metric querying in cloud-native online service systems by combining knowledge graphs with large language models.
Executive Impact Summary
PromCopilot translates natural language questions into PromQL queries, removing the burden of manual query writing. It uses a knowledge graph to capture system context (metrics, components, dependencies) and LLMs to reason over that context during query generation. The approach achieves 69.1% accuracy in translating natural language to PromQL, demonstrating its effectiveness and potential for improving operational efficiency.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Engineers struggle to write complex PromQL queries because doing so demands strong programming skills and a deep understanding of system context. Existing LLM-based approaches fall short: they lack domain knowledge, cannot keep up with dynamically changing system components, and struggle with the multi-hop reasoning these queries require.
PromCopilot uses a knowledge graph to model system context (metrics, components, dependencies) and LLMs for natural language understanding and query generation. It retrieves relevant knowledge from the graph to augment LLM prompts, enabling accurate PromQL query generation.
A custom benchmark dataset of 280 PromQL queries was created. PromCopilot with GPT-4-Turbo achieved 69.1% query accuracy, 91.3% metric retrieval accuracy, and significantly reduced query completion time in user studies compared to baseline approaches.
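The core retrieval-augmented idea can be sketched in Python. All function names, the dictionary-based knowledge graph, and the prompt wording below are hypothetical illustrations under stated assumptions, not PromCopilot's actual implementation:

```python
# Sketch of retrieval-augmented PromQL generation: look up system facts
# for entities mentioned in the question, then fold them into the prompt.
# The knowledge-graph structure here is a plain dict for illustration.

def retrieve_knowledge(kg: dict, question_entities: list[str]) -> list[str]:
    """Fetch facts about each entity the question mentions."""
    facts = []
    for entity in question_entities:
        facts.extend(kg.get(entity, []))
    return facts

def build_prompt(question: str, facts: list[str]) -> str:
    """Augment the LLM prompt with retrieved system context."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "You are a PromQL assistant.\n"
        f"System knowledge:\n{context}\n"
        f"Question: {question}\n"
        "Answer with a single PromQL query."
    )

kg = {
    "ts-auth": [
        "service ts-auth is called by ts-gateway-service and ts-user-service",
        "pod CPU time is exposed as container_cpu_usage_seconds_total",
    ]
}
prompt = build_prompt(
    "CPU time of pods calling ts-auth over the last 30 minutes?",
    retrieve_knowledge(kg, ["ts-auth"]),
)
```

The augmented prompt would then be sent to an LLM such as GPT-4-Turbo; the retrieved facts are what let the model pick the correct metric name instead of hallucinating one.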
Accuracy Comparison
| Approach | Metric Acc. | Syntax Acc. | Query Acc. |
|---|---|---|---|
| Basic Prompt | 28.3% | 86.1% | 2.6% |
| Basic Prompt + 10-shot | 77.0% | 96.5% | 37.4% |
| PromCopilot | 91.3% | 96.1% | 69.1% |
Successful Case Example
PromCopilot successfully generates a PromQL query for the CPU time of pods belonging to services that call 'ts-auth', by retrieving the relevant service and pod information along with the correct metric 'container_cpu_usage_seconds_total' and its associated label-value pairs.
- Natural Language Input: 'Calculate the CPU time used by each individual pod in the services that call the ts-auth service over the last 30 minutes.'
- Knowledge Retrieved: Services 'ts-gateway-service' and 'ts-user-service', their corresponding pods, and the metric 'container_cpu_usage_seconds_total' with relevant pod labels.
- PromQL Output: `increase(container_cpu_usage_seconds_total{pod=~'ts-gateway-service-6f99b4b794-.*|ts-user-service-5fc7759cf4-.*'}[30m])`
Calculate Your Potential ROI
Estimate the time and cost savings your enterprise could achieve by implementing AI-powered solutions like PromCopilot.
PromCopilot Implementation Roadmap
Our structured approach ensures a smooth transition and rapid value realization for your enterprise.
Knowledge Graph Construction
Automatic extraction of entities and relationships from Prometheus, Kubernetes, Traces, and Documents.
Question Parsing
LLMs extract component relation paths and metric-component pairs from natural language questions.
Knowledge Retrieval
System component and metric knowledge are retrieved from the knowledge graph based on parsed information.
PromQL Query Generation
LLMs generate the final PromQL query using the original question and retrieved knowledge as context.
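The four stages above can be wired together as a pipeline. In this runnable sketch, trivial stand-ins replace the LLM calls and the real knowledge graph; everything here is an assumed illustration, not PromCopilot's code:

```python
# Hypothetical end-to-end sketch of the four pipeline stages.

def parse_question(question: str) -> dict:
    """Stage 2: extract component names and a metric hint.
    A keyword scan stands in for the LLM-based parser."""
    return {
        "components": [w for w in question.split() if w.startswith("ts-")],
        "metric_hint": "cpu",
    }

def retrieve(kg: dict, parsed: dict) -> list[str]:
    """Stage 3: fetch component and metric knowledge from the graph."""
    return [fact for c in parsed["components"] for fact in kg.get(c, [])]

def generate_promql(question: str, facts: list[str]) -> str:
    """Stage 4: in PromCopilot an LLM writes the query from the question
    plus retrieved facts; a canned answer keeps this sketch runnable."""
    return "increase(container_cpu_usage_seconds_total[30m])"

kg = {"ts-auth": ["callers: ts-gateway-service, ts-user-service"]}
question = "CPU time of pods calling ts-auth over the last 30 minutes"
query = generate_promql(question, retrieve(kg, parse_question(question)))
```

Stage 1 (knowledge graph construction) is an offline step that populates `kg` from Prometheus, Kubernetes, traces, and documents before any question is asked.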
Ready to Transform Your Operations?
Connect with our AI specialists to explore how PromCopilot can enhance your enterprise's monitoring capabilities and developer productivity.