test: hookcmds integration tests for secret access and port management#2562
Open
tonyandrewmeyer wants to merge 23 commits into
Open
test: hookcmds integration tests for secret access and port management#2562tonyandrewmeyer wants to merge 23 commits into
tonyandrewmeyer wants to merge 23 commits into
Conversation
…-scoped ports) Follow-up 3: add any-charm as a second application in test_setup, add test-secret-grant and test-secret-revoke actions and a corresponding test_secret_grant_revoke jubilant test that exercises secret_grant and secret_revoke over a real cross-app relation. Follow-up 4: add test-ports-endpoint-scoped action and test_ports_endpoint_scoped jubilant test that opens a TCP port scoped to the peer endpoint, verifies it appears with endpoints=['peer'] from opened_ports(endpoints=True), and confirms it is closed cleanly. 31 tests collected; no import or fixture-resolution errors.
The lockfile pinned ops at 3.8.0.dev0 from before the 3.7.1 release; charmcraft pack failed with "Package metadata version 3.7.1 does not match given version 3.8.0.dev0" because uv built ops.tar.gz from the current 3.7.1 source while the lock demanded 3.8.0.dev0. The other three integration-test charms (test_relation, test_secrets, test_tracing) ship the same pyproject shape with no uv.lock and resolve ops cleanly against whatever the local tarball provides. Match that pattern so a future ops version bump doesn't break the hookcmds matrix again.
- test_network_get_returns_addresses: on Juju 4.0/k8s, network-get puts an empty "k8s service" placeholder at bind_addresses[0] and the cloud-container with the pod IP at bind_addresses[1], so the action's `first-address` field is absent. Filed as juju/juju#22616. Juju 3.6 and Juju 4.1 both put the address-carrying entry first. - test_secret_grant_revoke: jubilant raises Python's built-in TimeoutError (not jubilant.TaskError) when juju run times out, so widen the except. Same Juju 4.0 uniter commit-phase root cause as the existing _JUJU4_COMMIT_BUG xfails (juju/juju#22524). - test_ports_endpoint_scoped: same TimeoutError symptom on Juju 4.0/k8s endpoint-scoped opens; xfail with the same constant pending a more specific upstream issue. A new _is_caas() helper checks model.type for the k8s gate on network-get; the constant _JUJU4_NETGET_K8S_BUG points at juju/juju#22616.
…grant step too The grant action (secret_add + secret_grant) queues changes that the Juju 4.0 uniter commit-phase bug also affects, and on 4.0/k8s the grant times out at least as often as the revoke. Move the existing try/(TaskError, TimeoutError) wrapper around the grant call too.
67ee265 to
cc34d9b
Compare
Widen the _xfail_juju4_commit_bug helper to accept any Exception (the follow-up tests catch TimeoutError in addition to jubilant.TaskError), and route test_secret_grant_revoke and test_ports_endpoint_scoped through it. They now also fail loudly if Juju 4 ever stops hitting the bug.
tonyandrewmeyer
commented
Jun 11, 2026
tonyandrewmeyer
commented
Jun 11, 2026
The matrix expanded in the merged PR canonical#2520 to include juju 3/stable (newer 3.x), 4.0/edge, and 4.1/edge. Those releases reject a deploy of a charm whose 'type: file' resource declaration is missing the 'filename:' field with the error 'resource missing filename'. The older 3.6/stable cell tolerated it. Set filename: test-file.txt; the existing integration assert that the resource path's basename equals 'test-file.txt' continues to hold.
After canonical#2520 squash-merged, the matrix expanded to 3/stable, 4.0/edge and 4.1/edge — channels the original PR was never tested against. Five distinct problems showed up across cells. Fix all in one pass: - test_network_get_returns_addresses: the rebase took the follow-up's old-shape assertions against the new-shape charm action (which was rewritten in PR review to emit JSON). Restore the JSON-parsing assertions and re-express the Juju 4.0/k8s xfail in terms of the new shape (first bind-address has an empty addresses list, juju/juju#22616). - _xfail_juju4_commit_bug: was over-aggressive on Juju 4 machine cells — the action succeeded so the strict pytest.fail fired ('no longer hits ... remove the guard'), but the commit-phase bug only affects k8s. Gate the strict-fail on _is_caas(juju). - test_storage_add: storage-add queues an end-of-hook commit and hits the same Juju 4.0/k8s commit-phase regression. Route through the helper. - test_relation_model_get: was calling juju.show_model() which doesn't exist on the pinned jubilant~=1.5. Read the UUID from 'juju show-model' directly. - test_credential_get / test_juju_reboot_queues_reboot: the action fails on substrates without --trust (credential-get) or where the unit can't actually reboot (containers, k8s). The hookcmd plumbing itself is fine; skip with a clear reason rather than fail.
Iteration on top of the previous matrix-fix commit, after the dispatch
narrowed the failures down to three:
- _xfail_juju4_commit_bug was over-gated: the previous version only
fired on Juju 4 + k8s, but the commit-phase bug is observed on
machine too (state-delete and secret-remove fail on both substrates,
per the existing module-level comment). Restore xfail-on-error for
any Juju >=4 while still strict-failing only on Juju 4 k8s success
(where the bug is most reliable).
- test_relation_model_get was comparing the action's uuid result to
'juju show-model'.model-uuid. That equality holds on Juju 3.6 machine
but not on Juju 4 or Juju 3 k8s — relation_model_get is really meant
for cross-model relations, and the local-model coincidence on 3.6 was
incidental. Drop the comparison and assert only that the call
returned a UUID-shaped string. Equality belongs in a future
cross-model scenario.
- test_storage_add was failing 'cannot attach, storage is singular'.
The data storage in test_hookcmds/charmcraft.yaml had no 'multiple:'
clause, so Juju treated it as singular and rejected the second
attach. Add 'multiple: { range: 1-2 }'.
- test_network_get_returns_addresses was asserting non-empty
interface-name; on k8s the first bind-address may be a service-
network entry with no NIC, so the field is empty. Gate the assertion
on machine substrates.
The strict-fail semantics work for a single-version test target but break in a multi-channel matrix: 4.0/edge and 4.1/edge ship the upstream fix for juju/juju#22523/#22524, so the action succeeds there — but 4.0/stable still has the bug, so the guard must stay. Strict-failing on edge success makes the guard impossible to keep. Drop the strict-fail branch; xfail-on-error remains, which is what the original PR shipped. The 'remind me to remove the guard' signal is the cells that succeed in the matrix — no test-author action needed until 4.0/stable also moves past the fix.
…ften the relation-model-get note - test_credential_get: deploy with trust=True instead of try/except-skip. The hookcmd needs trust to read the credential; granting it in test_setup is the right default. Negative-case testing (no trust → fails; juju.trust() → succeeds; clean up) is plausible but out of scope here. - test_juju_reboot_queues_reboot: skip when _is_caas(juju) instead of catching the runtime TaskError. Kubernetes pods can't reboot themselves; it's an upfront substrate condition, not a runtime failure to recover from. - test_relation_model_get: keep the shape-only check, but soften the rationale comment. Whether the action's uuid should equal the local model UUID for a peer relation is being validated against a real juju locally; tighten the assertion back once that lands.
…peer relations Validated live on Juju 3.6.23 and 4.0.5 LXD: for a peer relation, the server-side path (apiserver/facades/agent/uniter/uniter.go) defaults otherApplication.ModelUUID to the local model UUID and only overrides on a successful cross-model lookup. So relation_model_get returns the local model UUID, matching JUJU_MODEL_UUID inside the hook env. The earlier comparison via 'juju show-model' from the test side was a fragile cross-environment dance and was getting the wrong UUID under CI; emit JUJU_MODEL_UUID from the action and assert equality directly inside the same hook context.
The reboot hookcmd needs an actual rebootable substrate; both Kubernetes pods and LXD/localhost containers refuse it (juju-reboot exits status 1). Skip on cloud == 'localhost' too so machine cells don't fail with an unactionable error.
…tly forbidden Verified live on Juju 3.6 LXD: juju-reboot returns 'ERROR juju-reboot is not supported when running an action.' regardless of substrate. The previous CAAS + localhost gating attributed it to container substrates; the real constraint is the action context itself. Exercising this hookcmd needs a regular-hook trigger (e.g. install / config-changed) and a harness that drives that hook, not an action- invoked path. Skip with the accurate reason until such a harness lands.
juju forbids juju-reboot in action contexts (verified live + confirmed in juju source at internal/worker/uniter/runner/jujuc/reboot.go:82-85). So the previous action-based test could never run. Add a config option 'reboot-trigger'; the config-changed handler calls hookcmds.juju_reboot() the first time it sees the trigger value, using a marker file under JUJU_CHARM_DIR to prevent the post-reboot config-changed re-run from looping. A read-only action exposes the marker existence so the integration test can verify the path actually ran. Drop the test-juju-reboot action and handler (always failed in juju 4.x and could only ever fail with 'not supported when running an action'); replace with test-reboot-marker. Skip on CAAS — Kubernetes pods can't reboot themselves and the hookcmd surfaces as a runtime error there.
…uptime - Reformat the _REBOOT_MARKER constant to match ruff --preview's preferred layout (lint was failing on the previous form). - Have the test-reboot-marker action also return /proc/uptime so the test can compare before/after. The marker alone only proves the config-changed handler ran the path; a fresh (lower) uptime after the trigger is what proves juju-reboot actually rebooted the unit.
… list test/charms/test_hookcmds/.gitignore had a single '*' line, which silently hid every new file inside that directory and made git add emit an 'ignored' warning even for tracked files. Replace with the explicit pattern test_secrets already uses (uv.lock, *.tar.gz, *.charm). Tracked files are unaffected; new files under this dir will now show up in git status as intended.
Collaborator
Author
dwilding
approved these changes
Jun 12, 2026
Comment on lines
+594
to
+596
| # On Juju 4.0/k8s this action times out on the driver side, matching the | ||
| # uniter commit-phase regression pattern. May be a separate upstream bug | ||
| # from the secret ones; xfail until pinned down. |
Contributor
There was a problem hiding this comment.
Is this (potential) timeout bug tracked somewhere for further investigation?
Collaborator
Author
There was a problem hiding this comment.
I think it has the same underlying cause and will get fixed at the same time as the other one. To be honest, partly I got tired digging into Juju bugs deep enough to write a decent issue, even with agent help, and didn't want to do another one when I thought it was likely the same.
When the other one is fixed, if this doesn't start passing, I'll do the investigation then. The tracking is this comment plus my subscriptions to the Juju issues.
Co-authored-by: Dave Wilding <tech@dpw.me>
Co-authored-by: Dave Wilding <tech@dpw.me>
Review follow-ups on canonical#2562: - Rename the test-side substrate helper from `_is_caas` to `_is_k8s` so the Kubernetes terminology matches what we use elsewhere. Document the juju- internal `caas`/`iaas` model-type pair in the docstring. - Switch the juju-reboot proof from `/proc/uptime` to `/proc/stat`'s `btime`. The previous strict `uptime_after < uptime_before` check has a false- negative race when `uptime_before` is small: if recovery takes longer than the pre-reboot uptime, the post-reboot uptime can already exceed it. `btime` is the epoch second the kernel booted and changes if and only if a reboot happened, so `btime_after > btime_before` is race-free. Also chmod +x on the test charm to satisfy the shebang-executable pre-commit hook (matches the other test charms). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
/proc/stat's `btime` reports the host kernel's boot time, which doesn't change when a juju-reboot restarts an LXD container — so the previous check fired a false negative on every machine cell. /proc/uptime is virtualised per container by lxcfs and does reset on container restart, so deriving the boot epoch as `time.time() - uptime` gives a value that changes if and only if the unit actually rebooted, while still avoiding the original strict-inequality race on raw uptime. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On Juju 4 machine substrates, juju-reboot queued during config-changed is silently dropped — the container is never restarted, even though the hook commits its other side-effects (marker file, status-set). The previous reboot test (before the tightened boot-time assertion) only checked the marker, so it accidentally passed on Juju 4; the new boot-epoch check correctly catches the missing reboot. Add a juju-major-version-gated xfail pointing at the upstream bug (juju/juju#22639) so the test fails on Juju 3 (where reboot still works) and xfails cleanly on Juju 4 until the upstream fix lands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #2520. This PR adds three integration tests to
test/integration/test_hookcmds.py:test_secret_grant_revoke. Exercisessecret_grantandsecret_revokeover the real cross-app relation.test_ports_endpoint_scoped. Opens TCP port 7766 scoped to thepeerendpoint, assertsopened_ports(endpoints=True)returnsendpoints=['peer'], then closes it.test_relation_model_get. This really exists for CMR, but doing that seems too much for these tests, so it just validates it's the right shape and consistent across use.I also (had Claude) rewrite
test_juju_reboot_queues_reboot. I wasn't paying enough last time - you can't reboot in K8s, and you can't reboot in an action (I definitely knew both of these already). I did the same kind of workaround as I've done previously: trigger it via config.Also a few extra fixes for problems that running against all of 3/stable, 4.0/stable, 4.0/edge, 4.1/edge × machine/k8s) showed up. Some of this was that I failed to run the whole matrix after changes from refiew: I'll make sure to do that this time, sorry!
test-fileresource declaration neededfilename:for juju 3/stable and newer to accept the deploy.datastorage neededmultiple: { range: 1-2 }sotest_storage_addcould attach a second instance.test_setupdeploys withtrust=Truesocredential_getworks.test_network_get_returns_addressesxfails the Juju 4.0/k8snetwork-getplaceholder issue (juju/juju#22616) via the new shape; theinterface-namenon-empty assertion is gated on machine substrates._xfail_juju4_commit_bughelper centralises the xfail-on-error pattern for the Juju 4 uniter commit-phase bugs (juju/juju#22523,juju/juju#22524), widened to catchTimeoutErroralongsideTaskErrorsince the symptom on 4.0/k8s is sometimes a driver timeout.Run on my fork:
tonyandrewmeyer/operatorrun 27414403142. All eight hookcmds cells green:hookcmds (machine, 3/stable)hookcmds (k8s, 3/stable)hookcmds (machine, 4.0/stable)hookcmds (k8s, 4.0/stable)hookcmds (machine, 4.0/edge)hookcmds (k8s, 4.0/edge)hookcmds (machine, 4.1/edge)hookcmds (k8s, 4.1/edge)The
secrets (4.0/stable)failure is the pre-existing flakytest_secretscell from main, not anything this PR touches.