Skip to content

test: hookcmds integration tests for secret access and port management#2562

Open
tonyandrewmeyer wants to merge 23 commits into
canonical:mainfrom
tonyandrewmeyer:hookcmds-followups-3-4-5
Open

test: hookcmds integration tests for secret access and port management#2562
tonyandrewmeyer wants to merge 23 commits into
canonical:mainfrom
tonyandrewmeyer:hookcmds-followups-3-4-5

Conversation

@tonyandrewmeyer

@tonyandrewmeyer tonyandrewmeyer commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Follow-up to #2520. This PR adds three integration tests to test/integration/test_hookcmds.py:

  • test_secret_grant_revoke. Exercises secret_grant and secret_revoke over the real cross-app relation.
  • test_ports_endpoint_scoped. Opens TCP port 7766 scoped to the peer endpoint, asserts opened_ports(endpoints=True) returns endpoints=['peer'], then closes it.
  • test_relation_model_get. This really exists for CMR, but doing that seems too much for these tests, so it just validates it's the right shape and consistent across use.

I also (had Claude) rewrite test_juju_reboot_queues_reboot. I wasn't paying enough last time - you can't reboot in K8s, and you can't reboot in an action (I definitely knew both of these already). I did the same kind of workaround as I've done previously: trigger it via config.

Also a few extra fixes for problems that running against all of 3/stable, 4.0/stable, 4.0/edge, 4.1/edge × machine/k8s) showed up. Some of this was that I failed to run the whole matrix after changes from refiew: I'll make sure to do that this time, sorry!

  • test-file resource declaration needed filename: for juju 3/stable and newer to accept the deploy.
  • data storage needed multiple: { range: 1-2 } so test_storage_add could attach a second instance.
  • test_setup deploys with trust=True so credential_get works.
  • test_network_get_returns_addresses xfails the Juju 4.0/k8s network-get placeholder issue (juju/juju#22616) via the new shape; the interface-name non-empty assertion is gated on machine substrates.
  • _xfail_juju4_commit_bug helper centralises the xfail-on-error pattern for the Juju 4 uniter commit-phase bugs (juju/juju#22523, juju/juju#22524), widened to catch TimeoutError alongside TaskError since the symptom on 4.0/k8s is sometimes a driver timeout.

Run on my fork: tonyandrewmeyer/operator run 27414403142. All eight hookcmds cells green:

Cell Result
hookcmds (machine, 3/stable) ✅ 8m37s
hookcmds (k8s, 3/stable) ✅ 6m13s
hookcmds (machine, 4.0/stable) ✅ 7m38s
hookcmds (k8s, 4.0/stable) ✅ 10m03s
hookcmds (machine, 4.0/edge) ✅ 7m59s
hookcmds (k8s, 4.0/edge) ✅ 9m14s
hookcmds (machine, 4.1/edge) ✅ 7m31s
hookcmds (k8s, 4.1/edge) ✅ 9m28s

The secrets (4.0/stable) failure is the pre-existing flaky test_secrets cell from main, not anything this PR touches.

@tonyandrewmeyer tonyandrewmeyer changed the title test(hookcmds): integration-test follow-ups 3 + 4 test: hookcmds integration tests for secret access and port management Jun 11, 2026
…-scoped ports)

Follow-up 3: add any-charm as a second application in test_setup, add
test-secret-grant and test-secret-revoke actions and a corresponding
test_secret_grant_revoke jubilant test that exercises secret_grant and
secret_revoke over a real cross-app relation.

Follow-up 4: add test-ports-endpoint-scoped action and test_ports_endpoint_scoped
jubilant test that opens a TCP port scoped to the peer endpoint, verifies
it appears with endpoints=['peer'] from opened_ports(endpoints=True), and
confirms it is closed cleanly.

31 tests collected; no import or fixture-resolution errors.
The lockfile pinned ops at 3.8.0.dev0 from before the 3.7.1 release;
charmcraft pack failed with "Package metadata version 3.7.1 does not
match given version 3.8.0.dev0" because uv built ops.tar.gz from the
current 3.7.1 source while the lock demanded 3.8.0.dev0.

The other three integration-test charms (test_relation, test_secrets,
test_tracing) ship the same pyproject shape with no uv.lock and
resolve ops cleanly against whatever the local tarball provides.
Match that pattern so a future ops version bump doesn't break the
hookcmds matrix again.
- test_network_get_returns_addresses: on Juju 4.0/k8s, network-get puts
  an empty "k8s service" placeholder at bind_addresses[0] and the
  cloud-container with the pod IP at bind_addresses[1], so the action's
  `first-address` field is absent. Filed as juju/juju#22616. Juju 3.6
  and Juju 4.1 both put the address-carrying entry first.

- test_secret_grant_revoke: jubilant raises Python's built-in
  TimeoutError (not jubilant.TaskError) when juju run times out, so
  widen the except. Same Juju 4.0 uniter commit-phase root cause as
  the existing _JUJU4_COMMIT_BUG xfails (juju/juju#22524).

- test_ports_endpoint_scoped: same TimeoutError symptom on Juju 4.0/k8s
  endpoint-scoped opens; xfail with the same constant pending a more
  specific upstream issue.

A new _is_caas() helper checks model.type for the k8s gate on
network-get; the constant _JUJU4_NETGET_K8S_BUG points at juju/juju#22616.
…grant step too

The grant action (secret_add + secret_grant) queues changes that the
Juju 4.0 uniter commit-phase bug also affects, and on 4.0/k8s the grant
times out at least as often as the revoke. Move the existing
try/(TaskError, TimeoutError) wrapper around the grant call too.
@tonyandrewmeyer tonyandrewmeyer force-pushed the hookcmds-followups-3-4-5 branch from 67ee265 to cc34d9b Compare June 11, 2026 07:31
Widen the _xfail_juju4_commit_bug helper to accept any Exception (the
follow-up tests catch TimeoutError in addition to jubilant.TaskError),
and route test_secret_grant_revoke and test_ports_endpoint_scoped
through it. They now also fail loudly if Juju 4 ever stops hitting the
bug.
Comment thread test/integration/test_hookcmds.py Outdated
Comment thread test/integration/test_hookcmds.py Outdated
The matrix expanded in the merged PR canonical#2520 to include juju 3/stable
(newer 3.x), 4.0/edge, and 4.1/edge. Those releases reject a deploy of
a charm whose 'type: file' resource declaration is missing the
'filename:' field with the error 'resource missing filename'. The older
3.6/stable cell tolerated it. Set filename: test-file.txt; the existing
integration assert that the resource path's basename equals
'test-file.txt' continues to hold.
After canonical#2520 squash-merged, the matrix expanded to 3/stable, 4.0/edge
and 4.1/edge — channels the original PR was never tested against.
Five distinct problems showed up across cells. Fix all in one pass:

- test_network_get_returns_addresses: the rebase took the follow-up's
  old-shape assertions against the new-shape charm action (which was
  rewritten in PR review to emit JSON). Restore the JSON-parsing
  assertions and re-express the Juju 4.0/k8s xfail in terms of the new
  shape (first bind-address has an empty addresses list, juju/juju#22616).

- _xfail_juju4_commit_bug: was over-aggressive on Juju 4 machine cells
  — the action succeeded so the strict pytest.fail fired ('no longer
  hits ... remove the guard'), but the commit-phase bug only affects
  k8s. Gate the strict-fail on _is_caas(juju).

- test_storage_add: storage-add queues an end-of-hook commit and hits
  the same Juju 4.0/k8s commit-phase regression. Route through the
  helper.

- test_relation_model_get: was calling juju.show_model() which doesn't
  exist on the pinned jubilant~=1.5. Read the UUID from 'juju show-model'
  directly.

- test_credential_get / test_juju_reboot_queues_reboot: the action
  fails on substrates without --trust (credential-get) or where the
  unit can't actually reboot (containers, k8s). The hookcmd plumbing
  itself is fine; skip with a clear reason rather than fail.
Iteration on top of the previous matrix-fix commit, after the dispatch
narrowed the failures down to three:

- _xfail_juju4_commit_bug was over-gated: the previous version only
  fired on Juju 4 + k8s, but the commit-phase bug is observed on
  machine too (state-delete and secret-remove fail on both substrates,
  per the existing module-level comment). Restore xfail-on-error for
  any Juju >=4 while still strict-failing only on Juju 4 k8s success
  (where the bug is most reliable).

- test_relation_model_get was comparing the action's uuid result to
  'juju show-model'.model-uuid. That equality holds on Juju 3.6 machine
  but not on Juju 4 or Juju 3 k8s — relation_model_get is really meant
  for cross-model relations, and the local-model coincidence on 3.6 was
  incidental. Drop the comparison and assert only that the call
  returned a UUID-shaped string. Equality belongs in a future
  cross-model scenario.

- test_storage_add was failing 'cannot attach, storage is singular'.
  The data storage in test_hookcmds/charmcraft.yaml had no 'multiple:'
  clause, so Juju treated it as singular and rejected the second
  attach. Add 'multiple: { range: 1-2 }'.

- test_network_get_returns_addresses was asserting non-empty
  interface-name; on k8s the first bind-address may be a service-
  network entry with no NIC, so the field is empty. Gate the assertion
  on machine substrates.
The strict-fail semantics work for a single-version test target but
break in a multi-channel matrix: 4.0/edge and 4.1/edge ship the upstream
fix for juju/juju#22523/#22524, so the action succeeds there — but
4.0/stable still has the bug, so the guard must stay. Strict-failing
on edge success makes the guard impossible to keep.

Drop the strict-fail branch; xfail-on-error remains, which is what
the original PR shipped. The 'remind me to remove the guard' signal
is the cells that succeed in the matrix — no test-author action
needed until 4.0/stable also moves past the fix.
…ften the relation-model-get note

- test_credential_get: deploy with trust=True instead of try/except-skip.
  The hookcmd needs trust to read the credential; granting it in
  test_setup is the right default. Negative-case testing (no trust → fails;
  juju.trust() → succeeds; clean up) is plausible but out of scope here.

- test_juju_reboot_queues_reboot: skip when _is_caas(juju) instead of
  catching the runtime TaskError. Kubernetes pods can't reboot themselves;
  it's an upfront substrate condition, not a runtime failure to recover
  from.

- test_relation_model_get: keep the shape-only check, but soften the
  rationale comment. Whether the action's uuid should equal the local
  model UUID for a peer relation is being validated against a real
  juju locally; tighten the assertion back once that lands.
…peer relations

Validated live on Juju 3.6.23 and 4.0.5 LXD: for a peer relation, the
server-side path (apiserver/facades/agent/uniter/uniter.go) defaults
otherApplication.ModelUUID to the local model UUID and only overrides
on a successful cross-model lookup. So relation_model_get returns
the local model UUID, matching JUJU_MODEL_UUID inside the hook env.

The earlier comparison via 'juju show-model' from the test side was a
fragile cross-environment dance and was getting the wrong UUID under
CI; emit JUJU_MODEL_UUID from the action and assert equality directly
inside the same hook context.
The reboot hookcmd needs an actual rebootable substrate; both
Kubernetes pods and LXD/localhost containers refuse it (juju-reboot
exits status 1). Skip on cloud == 'localhost' too so machine cells
don't fail with an unactionable error.
…tly forbidden

Verified live on Juju 3.6 LXD: juju-reboot returns
'ERROR juju-reboot is not supported when running an action.' regardless
of substrate. The previous CAAS + localhost gating attributed it to
container substrates; the real constraint is the action context itself.

Exercising this hookcmd needs a regular-hook trigger (e.g. install /
config-changed) and a harness that drives that hook, not an action-
invoked path. Skip with the accurate reason until such a harness lands.
juju forbids juju-reboot in action contexts (verified live + confirmed
in juju source at internal/worker/uniter/runner/jujuc/reboot.go:82-85).
So the previous action-based test could never run.

Add a config option 'reboot-trigger'; the config-changed handler calls
hookcmds.juju_reboot() the first time it sees the trigger value, using a
marker file under JUJU_CHARM_DIR to prevent the post-reboot
config-changed re-run from looping. A read-only action exposes the
marker existence so the integration test can verify the path actually
ran.

Drop the test-juju-reboot action and handler (always failed in juju
4.x and could only ever fail with 'not supported when running an
action'); replace with test-reboot-marker. Skip on CAAS — Kubernetes
pods can't reboot themselves and the hookcmd surfaces as a runtime
error there.
…uptime

- Reformat the _REBOOT_MARKER constant to match ruff --preview's
  preferred layout (lint was failing on the previous form).

- Have the test-reboot-marker action also return /proc/uptime so the
  test can compare before/after. The marker alone only proves the
  config-changed handler ran the path; a fresh (lower) uptime after the
  trigger is what proves juju-reboot actually rebooted the unit.
… list

test/charms/test_hookcmds/.gitignore had a single '*' line, which
silently hid every new file inside that directory and made git add
emit an 'ignored' warning even for tracked files. Replace with the
explicit pattern test_secrets already uses (uv.lock, *.tar.gz, *.charm).
Tracked files are unaffected; new files under this dir will now show up
in git status as intended.
@tonyandrewmeyer tonyandrewmeyer marked this pull request as ready for review June 11, 2026 11:38
@tonyandrewmeyer

Copy link
Copy Markdown
Collaborator Author

@dwilding and @tromai sorry to hit you up again immediately following the other one, but I assume bringing over that context is the most efficient thing to do. This one should be the end of this work!

Comment thread test/charms/test_hookcmds/.gitignore
Comment thread test/integration/test_hookcmds.py Outdated
Comment thread test/integration/test_hookcmds.py Outdated
Comment thread test/integration/test_hookcmds.py Outdated
Comment on lines +594 to +596
# On Juju 4.0/k8s this action times out on the driver side, matching the
# uniter commit-phase regression pattern. May be a separate upstream bug
# from the secret ones; xfail until pinned down.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this (potential) timeout bug tracked somewhere for further investigation?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has the same underlying cause and will get fixed at the same time as the other one. To be honest, partly I got tired digging into Juju bugs deep enough to write a decent issue, even with agent help, and didn't want to do another one when I thought it was likely the same.

When the other one is fixed, if this doesn't start passing, I'll do the investigation then. The tracking is this comment plus my subscriptions to the Juju issues.

Comment thread test/integration/test_hookcmds.py Outdated
Comment thread test/charms/test_hookcmds/src/charm.py Outdated
Comment thread test/integration/test_hookcmds.py Outdated
tonyandrewmeyer and others added 3 commits June 12, 2026 20:15
Co-authored-by: Dave Wilding <tech@dpw.me>
Co-authored-by: Dave Wilding <tech@dpw.me>
Review follow-ups on canonical#2562:

- Rename the test-side substrate helper from `_is_caas` to `_is_k8s` so the
  Kubernetes terminology matches what we use elsewhere. Document the juju-
  internal `caas`/`iaas` model-type pair in the docstring.

- Switch the juju-reboot proof from `/proc/uptime` to `/proc/stat`'s `btime`.
  The previous strict `uptime_after < uptime_before` check has a false-
  negative race when `uptime_before` is small: if recovery takes longer than
  the pre-reboot uptime, the post-reboot uptime can already exceed it.
  `btime` is the epoch second the kernel booted and changes if and only if
  a reboot happened, so `btime_after > btime_before` is race-free.

Also chmod +x on the test charm to satisfy the shebang-executable pre-commit
hook (matches the other test charms).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tonyandrewmeyer and others added 2 commits June 12, 2026 21:10
/proc/stat's `btime` reports the host kernel's boot time, which doesn't
change when a juju-reboot restarts an LXD container — so the previous
check fired a false negative on every machine cell. /proc/uptime is
virtualised per container by lxcfs and does reset on container restart,
so deriving the boot epoch as `time.time() - uptime` gives a value that
changes if and only if the unit actually rebooted, while still avoiding
the original strict-inequality race on raw uptime.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On Juju 4 machine substrates, juju-reboot queued during config-changed
is silently dropped — the container is never restarted, even though the
hook commits its other side-effects (marker file, status-set). The
previous reboot test (before the tightened boot-time assertion) only
checked the marker, so it accidentally passed on Juju 4; the new
boot-epoch check correctly catches the missing reboot.

Add a juju-major-version-gated xfail pointing at the upstream bug
(juju/juju#22639) so the test fails on Juju 3 (where reboot still works)
and xfails cleanly on Juju 4 until the upstream fix lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants