Prometheus Metrics
TFGQL exposes Prometheus-compatible metrics at GET /metrics.
The endpoint is authenticated and executes configured GraphQL queries, then converts
the result rows into Prometheus metric families (# HELP, # TYPE, sample lines).
Runnable example stack:
examples/prometheus-grafanaincludes Docker Compose, Prometheus config, Grafana provisioning, and a starter dashboard for this endpoint.
How It Works
- Your scraper calls
GET /metricswithAuthorization: Bearer <encrypted-jwt>. - TFGQL validates the JWT (issued by
POST /auth/token). - For each configured metric definition, TFGQL runs its GraphQL query.
- TFGQL extracts rows using
resultPath, resolves labels, and computes a numeric value. - TFGQL renders Prometheus exposition text and returns
text/plain; version=0.0.4.
Notes:
- Values are normalized to numbers. Booleans become
1/0. - Non-numeric values for gauge/counter rows are skipped.
- Metric names and label keys are sanitized to valid Prometheus names.
- Results are cached per token hash for
TFGQL_METRICS_CACHE_TTLseconds. - Cache floor is enforced by deployment target:
- TFC: minimum
600s(10 minutes), always mandatory - TFE: minimum
120s(2 minutes), even if you set a lower env value
- TFC: minimum
Scrape Cadence (Important)
/metrics is not a cheap host-level exporter endpoint. Each scrape executes live
GraphQL queries and upstream Terraform API calls, which can be slow.
Scrape this endpoint occasionally, not every few seconds.
Recommended starting point:
scrape_interval:30mscrape_timeout:120s
Practical guidance:
- Use
15mto60mintervals for most business metrics. - For broad organization-wide queries, prefer
30mor longer. - Keep query scope narrow (
includeOrgs, time windows, focusedresultPath) to reduce scrape latency.
Runtime Configuration
| Variable | Description | Default |
|---|---|---|
TFGQL_METRICS_ENABLED | Enables the /metrics endpoint. Set to false to disable. | true |
TFGQL_METRICS_CONFIG | Optional path to a JSON metrics definition file. If unset, built-in defaults are used. | (unset) |
TFGQL_METRICS_CACHE_TTL | Cache TTL in seconds for rendered /metrics output (per token). Enforced minimum: 600 (TFC) and 120 (TFE). | 600 on TFC, 120 on TFE |
Authenticated Scrape Example
For long-running Prometheus jobs, you can mint either a default TTL token or an infinite-lifetime token.
Default TTL token (30 days by default)
JWT=$(curl -s -X POST http://127.0.0.1:4000/auth/token \
-H 'content-type: application/json' \
-d '{"tfcToken":"<your terraform token>"}' | jq -r '.token')
curl -s http://127.0.0.1:4000/metrics \
-H "Authorization: Bearer ${JWT}"
/auth/token returns an expiresAt timestamp for this mode.
Infinite-lifetime token
Set infinite: true when calling /auth/token:
JWT=$(curl -s -X POST http://127.0.0.1:4000/auth/token \
-H 'content-type: application/json' \
-d '{"tfcToken":"<your terraform token>","infinite":true}' | jq -r '.token')
In this mode, the response has expiresAt: null and the JWT has no expiration claim.
Use infinite-lifetime tokens carefully. If you need tokens to survive process restarts,
set a stable TFGQL_JWT_ENCRYPTION_KEY; otherwise, tokens become invalid when the
server restarts.
If auth is missing/invalid, /metrics returns HTTP 401.
Built-In Metrics
When TFGQL_METRICS_CONFIG is not set, TFGQL provides these defaults:
tfgql_workspace_resource_counttfgql_workspace_run_failurestfgql_workspace_lockedtfgql_workspace_plan_duration_avgtfgql_workspace_apply_duration_avgtfgql_workspace_policy_check_failurestfgql_agent_statustfgql_explorer_terraform_version_workspace_count(TFC only)
Custom Metrics Config File
Set TFGQL_METRICS_CONFIG to a JSON file with a metrics array:
{
"metrics": [
{
"name": "tfgql_workspace_resource_count",
"help": "Number of resources per workspace",
"type": "gauge",
"query": "query($orgs: [String!]) { workspaces(includeOrgs: $orgs) { name resourceCount organization { name } } }",
"variables": { "orgs": ["my-org"] },
"resultPath": "workspaces",
"valueField": "resourceCount",
"labels": {
"organization": "organization.name",
"workspace": "name"
}
}
]
}
Field behavior:
type:gauge,counterresultPath: dot path to an array in the GraphQL resultvalueField: required forgauge/counter, ignored forinfolabels: map oflabel_name -> dot.pathTFGQL_METRICS_CONFIGcurrently expects JSON (not YAML)
Advanced pathing:
resultPathsupports[]flattening, for example:agentPools[].agents- Child rows created through
[]can reference parent fields via__parent, for example:organization: "__parent.organizationName"
Policy Failure Metrics (Legacy Sentinel + Agent Mode Sentinel/OPA)
Policy failure telemetry spans two paths in Terraform:
- Legacy policy checks on
run.policyChecks(Sentinel legacy flow) - Agent-mode policy evaluations on
run.policyEvaluations(Sentinel and OPA)
Both can exist on the same run. Treat them as separate metric families so you do not accidentally double count failures.
Example definitions:
{
"metrics": [
{
"name": "tfgql_run_policy_check",
"help": "Legacy Sentinel policy check state per run",
"type": "info",
"query": "query($orgs: [String!], $since: DateTime) { runs(includeOrgs: $orgs, filter: { createdAt: { _gt: $since } }) { id workspace { name organization { name } } policyChecks { status scope } } }",
"variables": {
"orgs": null,
"since": "2026-01-01T00:00:00Z"
},
"resultPath": "runs[].policyChecks",
"valueField": "",
"labels": {
"path": "legacy_policy_checks",
"organization": "__parent.workspace.organization.name",
"workspace": "__parent.workspace.name",
"run_id": "__parent.id",
"status": "status",
"scope": "scope"
}
},
{
"name": "tfgql_run_policy_eval_mandatory_failed",
"help": "Mandatory failures from agent-mode policy evaluations",
"type": "gauge",
"query": "query($orgs: [String!], $since: DateTime) { runs(includeOrgs: $orgs, filter: { createdAt: { _gt: $since } }) { id workspace { name organization { name } } policyEvaluations { status policyKind resultCount { mandatoryFailed advisoryFailed errored } } } }",
"variables": {
"orgs": null,
"since": "2026-01-01T00:00:00Z"
},
"resultPath": "runs[].policyEvaluations",
"valueField": "resultCount.mandatoryFailed",
"labels": {
"path": "agent_policy_evaluations",
"organization": "__parent.workspace.organization.name",
"workspace": "__parent.workspace.name",
"run_id": "__parent.id",
"policy_kind": "policyKind",
"status": "status"
}
},
{
"name": "tfgql_run_policy_eval_advisory_failed",
"help": "Advisory failures from agent-mode policy evaluations",
"type": "gauge",
"query": "query($orgs: [String!], $since: DateTime) { runs(includeOrgs: $orgs, filter: { createdAt: { _gt: $since } }) { id workspace { name organization { name } } policyEvaluations { status policyKind resultCount { mandatoryFailed advisoryFailed errored } } } }",
"variables": {
"orgs": null,
"since": "2026-01-01T00:00:00Z"
},
"resultPath": "runs[].policyEvaluations",
"valueField": "resultCount.advisoryFailed",
"labels": {
"path": "agent_policy_evaluations",
"organization": "__parent.workspace.organization.name",
"workspace": "__parent.workspace.name",
"run_id": "__parent.id",
"policy_kind": "policyKind",
"status": "status"
}
}
]
}
Recommended interpretation:
- Use
tfgql_run_policy_check{status=...}for legacy Sentinel checks - Use
tfgql_run_policy_eval_*{policy_kind=...}for agent-mode Sentinel/OPA - Keep a time/window filter in your run query (
createdAt) to control scrape cost and cardinality
Troubleshooting
- Empty output with HTTP
200means no metric samples were produced (for example, no rows matched your configured queries). - A failing metric query is logged and skipped; other metrics still render.
- Invalid
TFGQL_METRICS_CONFIGcauses metrics evaluation to fail until fixed. - If scrapes time out, increase
scrape_interval/scrape_timeoutand reduce query breadth.