Add per db metrics #4183

Merged (3 commits) on May 29, 2025.

Changes from all commits
13 changes: 13 additions & 0 deletions config/crd/bases/postgres-operator.crunchydata.com_pgadmins.yaml
@@ -2120,6 +2120,13 @@ spec:
type: string
x-kubernetes-validations:
- rule: duration("0") <= self && self <= duration("60m")
databases:
description: |-
The databases to target with added custom queries.
Default behavior is to target `postgres`.
items:
type: string
type: array
name:
description: |-
The name of this batch of queries, which will be used in naming the OTel
@@ -2165,6 +2172,12 @@ spec:
type: string
type: array
type: object
perDBMetricTargets:
@cbandy (Member) commented on May 20, 2025:
🤔 IIUC, this and customQueries only make sense on Postgres. Is there a way to arrange the structs so the PGAdmin API doesn't have these fields? Not a blocker; this is v1beta1.

Contributor Author:

I didn't want to make any changes to the spec that were too big, but I wonder whether the pgadmin and postgrescluster versions will diverge even more (or whether we'll eventually want to slice up instrumentation by target type, e.g., postgres-instrumentation, pgbouncer-instrumentation, pgadmin-instrumentation).

Contributor:

Hmm, yeah, good point... It probably makes sense to at least have a pgadmin-instrumentation and a postgrescluster-instrumentation and then pick and choose which structs make sense in each.

Contributor Author:

Should I add "taking apart the instrumentation struct" to this PR?

Contributor:

Might make sense if you have the bandwidth...

Contributor Author:

Given the other work due this sprint, I'm going to make a ticket for this topic and not include it in this PR.

Contributor Author:

Made that other ticket.
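For reference while that ticket is pending, here is a rough sketch of what split, target-specific instrumentation specs could look like (the shapes below are hypothetical, not the ticketed design):

```yaml
# Hypothetical shapes only, sketched for discussion.
# A pgadmin-scoped instrumentation spec could drop the Postgres-only fields:
instrumentation:
  metrics: {}          # no customQueries, no perDBMetricTargets
---
# ...while a postgrescluster-scoped spec keeps the query-oriented fields:
instrumentation:
  metrics:
    customQueries:
      add: []
      remove: []
    perDBMetricTargets: [postgres]
```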

description: User defined databases to target for default
per-db metrics
items:
type: string
type: array
type: object
resources:
description: Resources holds the resource requirements for the
@@ -11965,6 +11965,13 @@ spec:
type: string
x-kubernetes-validations:
- rule: duration("0") <= self && self <= duration("60m")
databases:
description: |-
The databases to target with added custom queries.
Default behavior is to target `postgres`.
items:
type: string
type: array
name:
description: |-
The name of this batch of queries, which will be used in naming the OTel
@@ -12010,6 +12017,12 @@ spec:
type: string
type: array
type: object
perDBMetricTargets:
description: User defined databases to target for default
per-db metrics
items:
type: string
type: array
type: object
resources:
description: Resources holds the resource requirements for the
1 change: 0 additions & 1 deletion internal/collector/generated/gte_pg16_slow_metrics.json

This file was deleted.

1 change: 0 additions & 1 deletion internal/collector/generated/lt_pg16_slow_metrics.json

This file was deleted.

1 change: 0 additions & 1 deletion internal/collector/generated/pgbackrest_metrics.json

This file was deleted.

Some generated files are not rendered by default.

127 changes: 0 additions & 127 deletions internal/collector/gte_pg16_slow_metrics.yaml

This file was deleted.

135 changes: 0 additions & 135 deletions internal/collector/lt_pg16_slow_metrics.yaml

This file was deleted.

161 changes: 161 additions & 0 deletions internal/collector/postgres_5m_per_db_metrics.yaml
@@ -0,0 +1,161 @@
# This list of queries configures an OTel SQL Query Receiver to read pgMonitor
# metrics from Postgres.
#
# https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/-/receiver/sqlqueryreceiver#metrics-queries
# https://github.com/CrunchyData/pgmonitor/blob/v5.2.1/sql_exporter/common/crunchy_per_db_collector.yml
#
# Note: Several metrics in the `crunchy_per_db_collector` track the materialized views and
# pgMonitor-extension version -- metrics that aren't meaningful in the CPK environment.
# The metrics that fall into this category include:
# * ccp_metric_matview_refresh_last_run_fail_count
# * ccp_metric_matview_refresh_longest_runtime_seconds
# * ccp_metric_matview_refresh_longest_runtime
# * ccp_metric_table_refresh_longest_runtime
# * ccp_pgmonitor_extension_per_db

- sql: >
SELECT current_database() as dbname
, n.nspname as schemaname
, c.relname
, pg_catalog.pg_total_relation_size(c.oid) as bytes
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE NOT pg_is_other_temp_schema(n.oid)
AND relkind IN ('r', 'm', 'f');
metrics:
- metric_name: ccp_table_size_bytes
value_type: double
value_column: bytes
description: "Table size in bytes including indexes"
attribute_columns: ["dbname", "schemaname", "relname"]
static_attributes:
server: "localhost:5432"
- sql: >
SELECT current_database() as dbname
, p.schemaname
, p.relname
, p.seq_scan
, p.seq_tup_read
, COALESCE(p.idx_scan, 0) AS idx_scan
, COALESCE(p.idx_tup_fetch, 0) as idx_tup_fetch
, p.n_tup_ins
, p.n_tup_upd
, p.n_tup_del
, p.n_tup_hot_upd
, CASE
WHEN current_setting('server_version_num')::int >= 160000
THEN p.n_tup_newpage_upd
ELSE 0::bigint
END AS n_tup_newpage_upd
, p.n_live_tup
, p.n_dead_tup
, p.vacuum_count
, p.autovacuum_count
, p.analyze_count
, p.autoanalyze_count
FROM pg_catalog.pg_stat_user_tables p;
metrics:
- metric_name: ccp_stat_user_tables_seq_scan
data_type: sum
value_column: seq_scan
description: "Number of sequential scans initiated on this table"
attribute_columns: ["dbname", "schemaname", "relname"]
static_attributes:
server: "localhost:5432"
- metric_name: ccp_stat_user_tables_seq_tup_read
data_type: sum
value_column: seq_tup_read
description: "Number of live rows fetched by sequential scans"
attribute_columns: ["dbname", "schemaname", "relname"]
static_attributes:
server: "localhost:5432"
- metric_name: ccp_stat_user_tables_idx_scan
data_type: sum
description: "Number of index scans initiated on this table"
value_column: idx_scan
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_idx_tup_fetch
data_type: sum
description: "Number of live rows fetched by index scans"
value_column: idx_tup_fetch
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_tup_ins
data_type: sum
description: "Number of rows inserted"
value_column: n_tup_ins
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_tup_upd
data_type: sum
description: "Number of rows updated"
value_column: n_tup_upd
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_tup_del
data_type: sum
description: "Number of rows deleted"
value_column: n_tup_del
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_tup_hot_upd
data_type: sum
description: "Number of rows HOT updated (i.e., with no separate index update required)"
value_column: n_tup_hot_upd
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_tup_newpage_upd
data_type: sum
description: "Number of rows updated where the successor version goes onto a new heap page, leaving behind an original version with a t_ctid field that points to a different heap page. These are always non-HOT updates."
value_column: n_tup_newpage_upd
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_live_tup
description: "Estimated number of live rows"
value_column: n_live_tup
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_n_dead_tup
description: "Estimated number of dead rows"
value_column: n_dead_tup
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_vacuum_count
data_type: sum
description: "Number of times this table has been manually vacuumed (not counting VACUUM FULL)"
value_column: vacuum_count
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_autovacuum_count
data_type: sum
description: "Number of times this table has been vacuumed by the autovacuum daemon"
value_column: autovacuum_count
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_analyze_count
data_type: sum
description: "Number of times this table has been manually analyzed"
value_column: analyze_count
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
- metric_name: ccp_stat_user_tables_autoanalyze_count
data_type: sum
description: "Number of times this table has been analyzed by the autovacuum daemon"
value_column: autoanalyze_count
static_attributes:
server: "localhost:5432"
attribute_columns: ["dbname", "schemaname", "relname"]
115 changes: 75 additions & 40 deletions internal/collector/postgres_metrics.go
@@ -21,6 +21,9 @@ import (
//go:embed "generated/postgres_5s_metrics.json"
var fiveSecondMetrics json.RawMessage

//go:embed "generated/postgres_5m_per_db_metrics.json"
var fiveMinutePerDBMetrics json.RawMessage

//go:embed "generated/postgres_5m_metrics.json"
var fiveMinuteMetrics json.RawMessage

@@ -33,15 +36,9 @@ var ltPG17Fast json.RawMessage
//go:embed "generated/eq_pg16_fast_metrics.json"
var eqPG16Fast json.RawMessage

//go:embed "generated/gte_pg16_slow_metrics.json"
var gtePG16Slow json.RawMessage

//go:embed "generated/lt_pg16_fast_metrics.json"
var ltPG16Fast json.RawMessage

//go:embed "generated/lt_pg16_slow_metrics.json"
var ltPG16Slow json.RawMessage

type queryMetrics struct {
Metrics []*metric `json:"metrics"`
Query string `json:"sql"`
@@ -71,6 +68,7 @@ func EnablePostgresMetrics(ctx context.Context, inCluster *v1beta1.PostgresClust
// will continually append to it and blow up our ConfigMap
fiveSecondMetricsClone := slices.Clone(fiveSecondMetrics)
fiveMinuteMetricsClone := slices.Clone(fiveMinuteMetrics)
fiveMinutePerDBMetricsClone := slices.Clone(fiveMinutePerDBMetrics)

if inCluster.Spec.PostgresVersion >= 17 {
fiveSecondMetricsClone, err = appendToJSONArray(fiveSecondMetricsClone, gtePG17Fast)
@@ -91,20 +89,11 @@ func EnablePostgresMetrics(ctx context.Context, inCluster *v1beta1.PostgresClust
log.Error(err, "error compiling metrics for postgres 16")
}

if inCluster.Spec.PostgresVersion >= 16 {
fiveMinuteMetricsClone, err = appendToJSONArray(fiveMinuteMetricsClone, gtePG16Slow)
if err != nil {
log.Error(err, "error compiling metrics for postgres 16 and greater")
}
} else {
if inCluster.Spec.PostgresVersion < 16 {
fiveSecondMetricsClone, err = appendToJSONArray(fiveSecondMetricsClone, ltPG16Fast)
if err != nil {
log.Error(err, "error compiling fast metrics for postgres versions less than 16")
}
fiveMinuteMetricsClone, err = appendToJSONArray(fiveMinuteMetricsClone, ltPG16Slow)
if err != nil {
log.Error(err, "error compiling slow metrics for postgres versions less than 16")
}
}

// Remove any queries that user has specified in the spec
@@ -117,7 +106,7 @@ func EnablePostgresMetrics(ctx context.Context, inCluster *v1beta1.PostgresClust
var fiveSecondMetricsArr []queryMetrics
err := json.Unmarshal(fiveSecondMetricsClone, &fiveSecondMetricsArr)
if err != nil {
log.Error(err, "error compiling postgres metrics")
log.Error(err, "error compiling five second postgres metrics")
}

// Remove any specified metrics from the five second metrics
@@ -128,19 +117,31 @@ func EnablePostgresMetrics(ctx context.Context, inCluster *v1beta1.PostgresClust
var fiveMinuteMetricsArr []queryMetrics
err = json.Unmarshal(fiveMinuteMetricsClone, &fiveMinuteMetricsArr)
if err != nil {
log.Error(err, "error compiling postgres metrics")
log.Error(err, "error compiling five minute postgres metrics")
}

// Remove any specified metrics from the five minute metrics
fiveMinuteMetricsArr = removeMetricsFromQueries(
inCluster.Spec.Instrumentation.Metrics.CustomQueries.Remove, fiveMinuteMetricsArr)

// Convert json to array of queryMetrics objects
var fiveMinutePerDBMetricsArr []queryMetrics
err = json.Unmarshal(fiveMinutePerDBMetricsClone, &fiveMinutePerDBMetricsArr)
if err != nil {
log.Error(err, "error compiling per-db postgres metrics")
}

// Remove any specified metrics from the five minute per-db metrics
fiveMinutePerDBMetricsArr = removeMetricsFromQueries(
inCluster.Spec.Instrumentation.Metrics.CustomQueries.Remove, fiveMinutePerDBMetricsArr)

// Convert back to json data
// The error return value can be ignored as the errchkjson linter
// deems the []queryMetrics to be a safe argument:
// https://github.com/breml/errchkjson
fiveSecondMetricsClone, _ = json.Marshal(fiveSecondMetricsArr)
fiveMinuteMetricsClone, _ = json.Marshal(fiveMinuteMetricsArr)
fiveMinutePerDBMetricsClone, _ = json.Marshal(fiveMinutePerDBMetricsArr)
}

// Add Prometheus exporter
@@ -180,31 +181,65 @@ func EnablePostgresMetrics(ctx context.Context, inCluster *v1beta1.PostgresClust
Exporters: []ComponentID{Prometheus},
}

// Add custom queries if they are defined in the spec
// Add custom queries and per-db metrics if they are defined in the spec
if inCluster.Spec.Instrumentation != nil &&
inCluster.Spec.Instrumentation.Metrics != nil &&
inCluster.Spec.Instrumentation.Metrics.CustomQueries != nil &&
inCluster.Spec.Instrumentation.Metrics.CustomQueries.Add != nil {

for _, querySet := range inCluster.Spec.Instrumentation.Metrics.CustomQueries.Add {
// Create a receiver for the query set
receiverName := "sqlquery/" + querySet.Name
config.Receivers[receiverName] = map[string]any{
"driver": "postgres",
"datasource": fmt.Sprintf(
`host=localhost dbname=postgres port=5432 user=%s password=${env:PGPASSWORD}`,
MonitoringUser),
"collection_interval": querySet.CollectionInterval,
// Give Postgres time to finish setup.
"initial_delay": "15s",
"queries": "${file:/etc/otel-collector/" +
querySet.Name + "/" + querySet.Queries.Key + "}",
inCluster.Spec.Instrumentation.Metrics != nil {

if inCluster.Spec.Instrumentation.Metrics.CustomQueries != nil &&
inCluster.Spec.Instrumentation.Metrics.CustomQueries.Add != nil {

for _, querySet := range inCluster.Spec.Instrumentation.Metrics.CustomQueries.Add {
// Create a receiver for the query set

dbs := []string{"postgres"}
if len(querySet.Databases) != 0 {
dbs = querySet.Databases
}
Contributor commented on lines +194 to +197:

Definitely need to make a note in the documentation that, if the user provides any databases for a custom query set, the default `postgres` database will not be included (unless they include it themselves).

Contributor Author:

For sure, working on the documentation now.
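To make that concrete, a hypothetical query set that wants to keep the default target would need to list `postgres` explicitly alongside any extra databases:

```yaml
instrumentation:
  metrics:
    customQueries:
      add:
        - name: my-queries               # hypothetical query set name
          # once databases is set, `postgres` is targeted only if listed explicitly
          databases: [postgres, pikachu]
          queries:
            name: my-queries-configmap   # hypothetical ConfigMap holding the queries
            key: queries.yaml
```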

for _, db := range dbs {
receiverName := fmt.Sprintf(
"sqlquery/%s-%s", querySet.Name, db)
Contributor commented on lines +199 to +200:

👍

config.Receivers[receiverName] = map[string]any{
"driver": "postgres",
"datasource": fmt.Sprintf(
`host=localhost dbname=%s port=5432 user=%s password=${env:PGPASSWORD}`,
db,
MonitoringUser),
"collection_interval": querySet.CollectionInterval,
// Give Postgres time to finish setup.
"initial_delay": "15s",
"queries": "${file:/etc/otel-collector/" +
querySet.Name + "/" + querySet.Queries.Key + "}",
}

// Add the receiver to the pipeline
pipeline := config.Pipelines[PostgresMetrics]
pipeline.Receivers = append(pipeline.Receivers, receiverName)
config.Pipelines[PostgresMetrics] = pipeline
}
}
}
if inCluster.Spec.Instrumentation.Metrics.PerDBMetricTargets != nil {

for _, db := range inCluster.Spec.Instrumentation.Metrics.PerDBMetricTargets {
// Create a receiver for the query set for the db
receiverName := "sqlquery/" + db
config.Receivers[receiverName] = map[string]any{
"driver": "postgres",
"datasource": fmt.Sprintf(
`host=localhost dbname=%s port=5432 user=%s password=${env:PGPASSWORD}`,
db,
MonitoringUser),
"collection_interval": "5m",
// Give Postgres time to finish setup.
"initial_delay": "15s",
"queries": slices.Clone(fiveMinutePerDBMetricsClone),
}

// Add the receiver to the pipeline
pipeline := config.Pipelines[PostgresMetrics]
pipeline.Receivers = append(pipeline.Receivers, receiverName)
config.Pipelines[PostgresMetrics] = pipeline
}
}
}
}
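As a rough sanity check on the logic above, a spec with `perDBMetricTargets: [pikachu]` should produce a receiver along these lines (a sketch: the datasource, interval, and delay come from the code, while the monitoring user and pipeline name are placeholders):

```yaml
receivers:
  sqlquery/pikachu:
    driver: postgres
    datasource: host=localhost dbname=pikachu port=5432 user=<MonitoringUser> password=${env:PGPASSWORD}
    collection_interval: 5m
    initial_delay: 15s      # give Postgres time to finish setup
    queries: [...]          # the embedded five-minute per-db metrics
service:
  pipelines:
    <PostgresMetrics>:      # the code appends the receiver to this pipeline
      receivers: [sqlquery/pikachu]
```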
@@ -107,6 +107,11 @@ type InstrumentationMetricsSpec struct {
// ---
// +optional
CustomQueries *InstrumentationCustomQueriesSpec `json:"customQueries,omitempty"`

// User defined databases to target for default per-db metrics
// ---
// +optional
PerDBMetricTargets []string `json:"perDBMetricTargets,omitempty"`
}

type InstrumentationCustomQueriesSpec struct {
@@ -159,6 +164,12 @@ type InstrumentationCustomQueries struct {
// +default="5s"
// +optional
CollectionInterval *Duration `json:"collectionInterval,omitempty"`

// The databases to target with added custom queries.
// Default behavior is to target `postgres`.
// ---
// +optional
Databases []string `json:"databases,omitempty"`
}

// ---
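A minimal manifest exercising the new field might look like this (cluster and database names are illustrative):

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: example
spec:
  instrumentation:
    metrics:
      # the default five-minute per-db metrics are collected for each listed database
      perDBMetricTargets:
        - db1
        - db2
```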

This file was deleted.

@@ -0,0 +1,4 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/11--add-per-db-metrics.yaml
@@ -0,0 +1,32 @@
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
# First, check that all containers in the instance pod are ready.
# Then, grab the collector metrics output and check that the per-db metrics
# are present for the single added target.
- script: |
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
check_containers_ready() { bash -ceu 'echo "$1" | jq -e ".[] | select(.type==\"ContainersReady\") | .status==\"True\""' - "$@"; }
contains() { bash -ceu '[[ "$1" == *"$2"* ]]' - "$@"; }
pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
-l postgres-operator.crunchydata.com/cluster=otel-cluster,postgres-operator.crunchydata.com/data=postgres)
[ "$pod" = "" ] && retry "Pod not found" && exit 1
condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
[ "$condition_json" = "" ] && retry "conditions not found" && exit 1
{ check_containers_ready "$condition_json"; } || {
retry "containers not ready"
exit 1
}
scrape_metrics=$(kubectl exec "${pod}" -c collector -n "${NAMESPACE}" -- \
curl --insecure --silent http://localhost:9187/metrics)
{ contains "${scrape_metrics}" 'ccp_table_size_bytes{dbname="pikachu"'; } || {
retry "ccp_table_size_bytes not found for pikachu"
exit 1
}
{ ! contains "${scrape_metrics}" 'ccp_table_size_bytes{dbname="onix"'; } || {
retry "ccp_table_size_bytes found for onix"
exit 1
}
@@ -0,0 +1,4 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/13--add-per-db-metrics.yaml

This file was deleted.

@@ -0,0 +1,32 @@
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
# First, check that all containers in the instance pod are ready.
# Then, grab the collector metrics output and check that the per-db metrics
# are present for both added targets.
- script: |
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
check_containers_ready() { bash -ceu 'echo "$1" | jq -e ".[] | select(.type==\"ContainersReady\") | .status==\"True\""' - "$@"; }
contains() { bash -ceu '[[ "$1" == *"$2"* ]]' - "$@"; }
pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
-l postgres-operator.crunchydata.com/cluster=otel-cluster,postgres-operator.crunchydata.com/data=postgres)
[ "$pod" = "" ] && retry "Pod not found" && exit 1
condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
[ "$condition_json" = "" ] && retry "conditions not found" && exit 1
{ check_containers_ready "$condition_json"; } || {
retry "containers not ready"
exit 1
}
scrape_metrics=$(kubectl exec "${pod}" -c collector -n "${NAMESPACE}" -- \
curl --insecure --silent http://localhost:9187/metrics)
{ contains "${scrape_metrics}" 'ccp_table_size_bytes{dbname="pikachu"'; } || {
retry "ccp_table_size_bytes not found for pikachu"
exit 1
}
{ contains "${scrape_metrics}" 'ccp_table_size_bytes{dbname="onix"'; } || {
retry "ccp_table_size_bytes not found for onix"
exit 1
}
@@ -0,0 +1,4 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/15--remove-per-db-metrics.yaml
@@ -0,0 +1,32 @@
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
# First, check that all containers in the instance pod are ready.
# Then, grab the collector metrics output and check that the per-db metrics
# are absent from the targets since they've been removed.
- script: |
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
check_containers_ready() { bash -ceu 'echo "$1" | jq -e ".[] | select(.type==\"ContainersReady\") | .status==\"True\""' - "$@"; }
contains() { bash -ceu '[[ "$1" == *"$2"* ]]' - "$@"; }
pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
-l postgres-operator.crunchydata.com/cluster=otel-cluster,postgres-operator.crunchydata.com/data=postgres)
[ "$pod" = "" ] && retry "Pod not found" && exit 1
condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
[ "$condition_json" = "" ] && retry "conditions not found" && exit 1
{ check_containers_ready "$condition_json"; } || {
retry "containers not ready"
exit 1
}
scrape_metrics=$(kubectl exec "${pod}" -c collector -n "${NAMESPACE}" -- \
curl --insecure --silent http://localhost:9187/metrics)
{ ! contains "${scrape_metrics}" 'ccp_table_size_bytes{dbname="pikachu"'; } || {
retry "ccp_table_size_bytes found for pikachu"
exit 1
}
{ ! contains "${scrape_metrics}" 'ccp_table_size_bytes{dbname="onix"'; } || {
retry "ccp_table_size_bytes found for onix"
exit 1
}
@@ -0,0 +1,6 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/17--add-custom-queries-per-db.yaml
assert:
- files/17-custom-queries-per-db-added.yaml
@@ -0,0 +1,42 @@
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
# First, check that all containers in the instance pod are ready.
# Then, grab the collector metrics output and check that the two metrics that we
# checked for earlier are no longer there.
# Then, check that the two custom metrics that we added are present
# only for the targets that were specified.
- script: |
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
check_containers_ready() { bash -ceu 'echo "$1" | jq -e ".[] | select(.type==\"ContainersReady\") | .status==\"True\""' - "$@"; }
contains() { bash -ceu '[[ "$1" == *"$2"* ]]' - "$@"; }
pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
-l postgres-operator.crunchydata.com/cluster=otel-cluster,postgres-operator.crunchydata.com/data=postgres)
[ "$pod" = "" ] && retry "Pod not found" && exit 1
condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
[ "$condition_json" = "" ] && retry "conditions not found" && exit 1
{ check_containers_ready "$condition_json"; } || {
retry "containers not ready"
exit 1
}
scrape_metrics=$(kubectl exec "${pod}" -c collector -n "${NAMESPACE}" -- \
curl --insecure --silent http://localhost:9187/metrics)
{ contains "${scrape_metrics}" 'ccp_table_size_bytes_1{dbname="pikachu"'; } || {
retry "custom metric not found for pikachu db"
exit 1
}
{ contains "${scrape_metrics}" 'ccp_table_size_bytes_1{dbname="onix"'; } || {
retry "custom metric found for onix db"
exit 1
}
{ contains "${scrape_metrics}" 'ccp_table_size_bytes_2{dbname="onix"'; } || {
retry "custom metric not found for onix db"
exit 1
}
{ ! contains "${scrape_metrics}" 'ccp_table_size_bytes_2{dbname="pikachu"'; } || {
retry "custom metric found for pikachu db"
exit 1
}
@@ -0,0 +1,6 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/19--add-logs-exporter.yaml
assert:
- files/19-logs-exporter-added.yaml
@@ -0,0 +1,6 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/21--create-cluster.yaml
assert:
- files/21-cluster-created.yaml
@@ -1,6 +1,6 @@
apiVersion: kuttl.dev/v1beta1
kind: TestStep
apply:
- files/15--add-backups.yaml
- files/23--add-backups.yaml
assert:
- files/15-backups-added.yaml
- files/23-backups-added.yaml
@@ -4,4 +4,4 @@ commands:
- command: kubectl annotate postgrescluster otel-cluster-no-backups postgres-operator.crunchydata.com/authorizeBackupRemoval="true"
namespaced: true
assert:
- files/17-backups-removed.yaml
- files/25-backups-removed.yaml
15 changes: 10 additions & 5 deletions testing/kuttl/e2e/otel-logging-and-metrics/README.md
@@ -6,24 +6,29 @@ This test assumes that the operator has both OpenTelemetryLogs and OpenTelemetry

## Process

1. Create a basic cluster with pgbouncer and pgadmin in place.
1. Create a basic cluster with pgbouncer and pgadmin in place. (00)
1. Ensure cluster comes up, that all containers are running and ready, and that the initial backup is complete.
2. Add the `instrumentation` spec to both PostgresCluster and PGAdmin manifests.
2. Add the `instrumentation` spec to both PostgresCluster and PGAdmin manifests. (01-08)
1. Ensure that OTel collector containers and `crunchy-otel-collector` labels are added to the four pods (postgres instance, repo-host, pgbouncer, & pgadmin) and that the collector containers are running and ready.
2. Assert that the instance pod collector is getting postgres and patroni metrics and postgres, patroni, and pgbackrest logs.
3. Assert that the pgbouncer pod collector is getting pgbouncer metrics and logs.
4. Assert that the pgAdmin pod collector is getting pgAdmin and gunicorn logs.
5. Assert that the repo-host pod collector is NOT getting pgbackrest logs. We do not expect logs yet as the initial backup completed and created a log file; however, we configure the collector to only ingest new logs after it has started up.
6. Create a manual backup and ensure that it completes successfully.
7. Ensure that the repo-host pod collector is now getting pgbackrest logs.
3. Add both "add" and "remove" custom queries to the PostgresCluster `instrumentation` spec and create a ConfigMap that holds the custom queries to add.
3. Add both "add" and "remove" custom queries to the PostgresCluster `instrumentation` spec and create a ConfigMap that holds the custom queries to add. (09-10)
1. Ensure that the ConfigMap is created.
2. Assert that the metrics that were removed (which we checked for earlier) are in fact no longer present in the collector metrics.
3. Assert that the custom metrics that were added are present in the collector metrics.
4. Add an `otlp` exporter to both PostgresCluster and PGAdmin `instrumentation` specs and create a standalone OTel collector to receive data from our sidecar collectors.
4. Exercise per-db metric functionality by adding users and per-db targets, removing metrics from the per-db defaults, and adding a custom-metric db target. (11-18)
1. Add users and per-db target, assert that per-db default metric is available for named target.
2. Add second per-db target, assert that per-db default metric is available for all named targets.
3. Remove per-db metric, assert that the per-db default metric is absent for all targets.
4. Add custom metrics with a specified db, assert that we get that metric just for the specified target.
5. Add an `otlp` exporter to both PostgresCluster and PGAdmin `instrumentation` specs and create a standalone OTel collector to receive data from our sidecar collectors. (19-20)
1. Ensure that the ConfigMap, Service, and Deployment for the standalone OTel collector come up and that the collector container is running and ready.
2. Assert that the standalone collector is receiving logs from all of our components (i.e. the standalone collector is getting logs for postgres, patroni, pgbackrest, pgbouncer, pgadmin, and gunicorn).
5. Create a new cluster with `instrumentation` spec in place, but no `backups` spec to test the OTel features with optional backups.
6. Create a new cluster with `instrumentation` spec in place, but no `backups` spec to test the OTel features with optional backups. (21-25)
1. Ensure that the cluster comes up and the database and collector containers are running and ready.
2. Add a backups spec to the new cluster and ensure that pgbackrest is added to the instance pod, a repo-host pod is created, and the collector runs on both pods.
3. Remove the backups spec from the new cluster.
@@ -0,0 +1,17 @@
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: otel-cluster
spec:
users:
- name: ash
databases:
- pikachu
- name: brock
databases:
- onix
instrumentation:
metrics:
perDBMetricTargets:
- pikachu
@@ -0,0 +1,11 @@
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: otel-cluster
spec:
instrumentation:
metrics:
perDBMetricTargets:
- pikachu
- onix
@@ -0,0 +1,13 @@
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: otel-cluster
spec:
instrumentation:
metrics:
customQueries:
remove:
- ccp_connection_stats_active
- ccp_database_size_bytes
- ccp_table_size_bytes
@@ -0,0 +1,62 @@
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: otel-cluster
spec:
instrumentation:
metrics:
customQueries:
add:
- name: custom1
databases: [pikachu, onix]
queries:
name: my-custom-queries2
key: custom1.yaml
- name: custom2
databases: [onix]
queries:
name: my-custom-queries2
key: custom2.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
name: my-custom-queries2
data:
custom1.yaml: |
- sql: >
SELECT current_database() as dbname
, n.nspname as schemaname
, c.relname
, pg_catalog.pg_total_relation_size(c.oid) as bytes
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE NOT pg_is_other_temp_schema(n.oid)
AND relkind IN ('r', 'm', 'f');
metrics:
- metric_name: ccp_table_size_bytes_1
value_type: double
value_column: bytes
description: "Table size in bytes including indexes"
attribute_columns: ["dbname", "schemaname", "relname"]
static_attributes:
server: "localhost:5432"
custom2.yaml: |
- sql: >
SELECT current_database() as dbname
, n.nspname as schemaname
, c.relname
, pg_catalog.pg_total_relation_size(c.oid) as bytes
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE NOT pg_is_other_temp_schema(n.oid)
AND relkind IN ('r', 'm', 'f');
metrics:
- metric_name: ccp_table_size_bytes_2
value_type: double
value_column: bytes
description: "Table size in bytes including indexes"
attribute_columns: ["dbname", "schemaname", "relname"]
static_attributes:
server: "localhost:5432"
@@ -0,0 +1,124 @@
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
name: otel-cluster
status:
instances:
- name: instance1
readyReplicas: 1
replicas: 1
updatedReplicas: 1
proxy:
pgBouncer:
readyReplicas: 1
replicas: 1
---
apiVersion: v1
kind: Pod
metadata:
labels:
postgres-operator.crunchydata.com/data: postgres
postgres-operator.crunchydata.com/role: master
postgres-operator.crunchydata.com/cluster: otel-cluster
postgres-operator.crunchydata.com/crunchy-otel-collector: "true"
status:
containerStatuses:
- name: collector
ready: true
started: true
- name: database
ready: true
started: true
- name: pgbackrest
ready: true
started: true
- name: pgbackrest-config
ready: true
started: true
- name: replication-cert-copy
ready: true
started: true
phase: Running
---
apiVersion: v1
kind: Pod
metadata:
labels:
postgres-operator.crunchydata.com/data: pgbackrest
postgres-operator.crunchydata.com/cluster: otel-cluster
postgres-operator.crunchydata.com/crunchy-otel-collector: "true"
status:
containerStatuses:
- name: collector
ready: true
started: true
- name: pgbackrest
ready: true
started: true
- name: pgbackrest-config
ready: true
started: true
phase: Running
---
apiVersion: v1
kind: Pod
metadata:
labels:
postgres-operator.crunchydata.com/role: pgbouncer
postgres-operator.crunchydata.com/cluster: otel-cluster
postgres-operator.crunchydata.com/crunchy-otel-collector: "true"
status:
containerStatuses:
- name: collector
ready: true
started: true
- name: pgbouncer
ready: true
started: true
- name: pgbouncer-config
ready: true
started: true
phase: Running
---
apiVersion: v1
kind: Service
metadata:
name: otel-cluster-primary
---
apiVersion: v1
kind: ConfigMap
metadata:
labels:
postgres-operator.crunchydata.com/role: pgadmin
postgres-operator.crunchydata.com/pgadmin: otel-pgadmin
---
apiVersion: v1
kind: Pod
metadata:
labels:
postgres-operator.crunchydata.com/data: pgadmin
postgres-operator.crunchydata.com/role: pgadmin
postgres-operator.crunchydata.com/pgadmin: otel-pgadmin
postgres-operator.crunchydata.com/crunchy-otel-collector: "true"
status:
containerStatuses:
- name: collector
ready: true
started: true
- name: pgadmin
ready: true
started: true
phase: Running
---
apiVersion: v1
kind: Secret
metadata:
labels:
postgres-operator.crunchydata.com/role: pgadmin
postgres-operator.crunchydata.com/pgadmin: otel-pgadmin
type: Opaque
---
apiVersion: v1
kind: ConfigMap
metadata:
name: my-custom-queries2