Skip to content

Add HA/CARP safety for PPP link startup#9896

Open
RedVortex wants to merge 2 commits intoopnsense:masterfrom
RedVortex:patch-1
Open

Add HA/CARP safety for PPP link startup#9896
RedVortex wants to merge 2 commits intoopnsense:masterfrom
RedVortex:patch-1

Conversation

@RedVortex
Copy link
Copy Markdown

Implement HA/CARP safety checks for PPP links to prevent startup if no parent interface is CARP MASTER when 'Disconnect dialup interfaces' is enabled.

Implement HA/CARP safety checks for PPP links to prevent startup if no parent interface is CARP MASTER when 'Disconnect dialup interfaces' is enabled.
@RedVortex
Copy link
Copy Markdown
Author

This patch fixes a long-standing issue affecting PPPoE interfaces in OPNsense HA deployments using CARP with the disconnectppps option enabled.

Problem
In an HA setup, when a backup node boots, CARP can correctly transition the interface to BACKUP and trigger rc.syshook.d/carp/20-ppp, which attempts to suspend the PPP interface. However, at that point in the boot sequence, the PPP pseudo-interface may not exist yet, so the suspend happens too early to be effective.

Later in the normal boot process, interface_ppps_configure() is still called and starts PPP anyway, even though the node is already in CARP BACKUP state.

This causes an unwanted PPPoE session to be established on the backup node during boot.

Impact
In real HA environments, this can lead to multiple problems:

  • the backup node may briefly or fully establish the PPPoE session while the master is already active
  • the active session on the master may be disrupted or conflicted
  • the backup may temporarily receive a public IP it should never own
  • Dynamic DNS or other WAN-triggered services may publish or act on the wrong IP
  • the issue only self-corrects later if a new CARP transition occurs, which means the backup can remain in a bad state until failover/failback happens

In short, the existing CARP hook reacts to the state change, but the later PPP startup path can override that decision during boot.

Goal of this patch
The goal of this patch is to prevent PPP from starting at all when the node is not the CARP master.

This patch adds a guard inside interface_ppps_configure() so that when hasync.disconnectppps is enabled, PPP startup is allowed only if at least one of the underlying parent interfaces is currently in carp: MASTER state.

If no parent interface is MASTER, the function returns before launching mpd5.

Why this approach
This issue is specifically a boot-order / startup sequencing issue in HA environments.

The existing 20-ppp CARP hook is still useful for runtime CARP transitions, but it is not sufficient during boot because it can run before the PPP interface exists. The correct place to enforce this safely is therefore in the PPP startup path itself.

This makes the behavior fail closed:

  • MASTER -> PPP is allowed to start
  • BACKUP / INIT / not yet settled -> PPP is not started

This is safer for HA because it prevents the backup node from briefly establishing an unwanted PPP session.

Testing
I tested this patch in my own OPNsense HA environment using CARP with PPPoE on WAN and disconnectppps enabled.

In my test case, before this patch:

  • the backup node would boot
  • CARP would correctly transition to BACKUP
  • 20-ppp would run
  • but later in boot, PPPoE would still start on the backup node

After this patch:

  • the backup node no longer starts PPPoE during boot
  • the active/master node continues to operate normally
  • the behavior now matches the intended HA expectation for disconnectppps

Request for review
This patch has been validated in my own HA environment and resolves the issue for me, but I would appreciate review from maintainers to confirm:

  • that this change does not introduce side effects in other PPP configurations
  • that the logic is correct for other HA / CARP scenarios
  • that the implementation style and placement are aligned with the surrounding code in interfaces.inc
  • and whether there is a preferred alternative location or helper for this check within the core codebase

I believe the current behavior is a real HA bug, but I would like reviewer feedback to ensure the fix is correct and consistent with the project’s expectations before it is merged.

Proper indentation and better comment for HA/CARP safety checks for PPP links
@fichtner fichtner self-assigned this Mar 5, 2026
Comment on lines +1352 to +1362
foreach ($ports as $port) {
$ifconfig_out = shell_exec('/sbin/ifconfig ' . escapeshellarg($port) . ' 2>/dev/null');
if (!is_string($ifconfig_out)) {
continue;
}

if (preg_match('/\bcarp:\s+MASTER\b/m', $ifconfig_out)) {
$carp_master = true;
break;
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can adjust this for better use of helper functions but for now my question is if we're applying the right metric here for multi-link PPP and general startup.

  1. Shouldn't $carp_master be true unless one or all ports are not master?
  2. Perhaps "BACKUP || INIT is a better metric? Looking at
    if ($type == 'BACKUP' || $type == 'INIT') {
  3. We're kinda too far into the configuration at this point do to anything drastic such as interface_suspend() in contrast to 20-ppp so I'm wondering where the best spot is to run this code in this function or elsewhere as a preamble. It's only called in 3 places after all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last thought is why not defer to 20-ppp completely. I suspect the master doesn't come up then on a reboot or both machines think they are master for a bit? I'm talking about a hard return of the code, but that would be ouside the function kinda like 3.) implies

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the funny thing is 20-ppp isn't using MLPPP

if ($ppp['ports'] == $iface) {

Maybe we should emulate this here and annotate it so the behaviour matches (more or less point 2)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I really wanted and tried to use 20-ppp but unfortunately it doesn't work since it is called early in the bootup. It tries and fails to bring down pppoe because it is not even up yet, we even see errors logs in the bootup regarding this.

And some time during bootup, after 20-ppp was called, the bootup brings up all interfaces and the code that handles bringing up all interfaces actually doesn't seem to care if it should or not bring up pppoe (because of CARP status) and since CARP is stable at this point (BACKUP since this is the BACKUP CARP node, the other node is MASTER), 20-ppp is never called again. Furthermore, even if it would be called later on, the damage is done during bootup since it started a pppoe session but it shouldn't have (thus creating issues since we now have 2 nodes having a public WAN on the same PPPoE user and also breaking Dynamic DNS thus flapping the Dynamic DNS public IP between the 2 nodes now).

This issue really happens only during a reboot of opnsense (during bootup), never otherwise in my case (except if I edit the pppoe interface and save it, then it also brings it up (but it shouldn't since the node is CARP BACKUP) but I haven't tried that (save the pppoe interface on the backup to see if my patch also fixes/handle this, I'll give it a shot later tonight).

Thus why I tried to put this code in this akward place to handle this situation (pppoe coming up on bootup whether or not in CARP BACKUP mode). This patch also only applies when the HA setting is configured to bring down ppp interfaces when CARP becomes BACKUP.

Also, to be honest I never considered if it would have an impact on MLPPP but I assume the logic is the same, I'm not sure it is handled at the same place in the script though.

I thought also about creating a custom 99-scriptsomething that would trigger and bring down pppoe later or a cronjob checking the state of CARP and bring it down in case it came up after bootup but it is very bad that the pppoe came up at all. Some providers with block you or rate limit you if you try to bring up 2 pppoe on the same user or link or even create conflicts if, for example at my work, we have a static IP on our PPPoE username, thus resulting in conflict between the 2 nodes. So it definitely should never come up at all if the node is BACKUP even if bring it down later, it creates important problems.

@fichtner
Copy link
Copy Markdown
Member

fichtner commented Mar 6, 2026

I've just gone over the code bits and reduce your solution. Can you confirm this one works on top? 2d30a94

I want do take this further but not without getting some basics out of the way first.

Thanks,
Franco

@RedVortex
Copy link
Copy Markdown
Author

RedVortex commented Mar 7, 2026

I've just gone over the code bits and reduce your solution. Can you confirm this one works on top? 2d30a94

Sorry about the delay getting back to you, I needed to make sure nobody would be complaining if I brought the network down, hehehe. And I also wanted to test this properly for you.

Also, I had some issues applying the patch, seems like I had to apply everything in between 26.1.3 and this patch for both files

My 2 patches after 26.1.3 on src/etc/inc/interfaces.inc

opnsense-patch 0c1404e
opnsense-patch 673acfb

2 patches you did on 20-ppp after 26.1.3 on src/etc/rc.syshook.d/carp/20-ppp

opnsense-patch b7166dc
opnsense-patch 81a0db2

And finally your last one on top of those 4 for both src/etc/inc/interfaces.inc and src/etc/rc.syshook.d/carp/20-ppp

opnsense-patch 2d30a94

I'm happy to report that this works very well for me.

I tested:

  • Manual Failover and Failback (Enter Persistent CARP Maintenance Mode) and it worked without any issue. pppoe got enabled on the BACKUP once it became MASTER and MASTER brought pppoe DOWN when it became BACKUP

  • Reboot of the BACKUP node and pppoe remained DOWN as it should be (that's the fix we want, yesss !)
    /usr/local/etc/rc.bootup: interface_ppps_configure() skipped for wan (disconnectppps enabled, no parent interface is CARP MASTER)

  • Went into the interface on the BACKUP node, saved it and applied it and it remained DOWN as it should be (bonus fix, great !)
    /interfaces.php: interface_ppps_configure() skipped for wan (disconnectppps enabled, no parent interface is CARP MASTER)

  • Went into the interface on the MASTER node, saved it and applied it and pppoe restarted the pppoe properly as it should

So far so good to me, looks great on my side, no issue to report from my point of view and setup. And I definitely prefer the way you did that versus my way too looong code in the interfaces includes...

Thanks,
Patrick

@fichtner
Copy link
Copy Markdown
Member

@RedVortex thanks for your tests. I'm not really satisfied with the code for historic reasons. Did a futile refactor in 5914ce7 but realized we're depending on CARP when we start before CARP is even configured (the first time). This needs more brain cells and time, but I will continue.

Cheers,
Franco

@RedVortex
Copy link
Copy Markdown
Author

@fichtner Thanks.

This is definitely not urgent at all, I've been living with this little issue for years so I'm used to it, there is absolutely no reason to rush this in any way.

This is indeed a "touchy" patch to do, especially in those include files. I totally understand that we have to play it very safe here and not burn ourselves and especially not burn everyone else either. Especially to fix something that is as small as this bug. Pretty sure not a lot of people have this issue and/or they don't bother to try to fix or debug it, lol, I took a chance.

I'll be monitor this thread closely in case you need me for anything, in the meantime I'll use our little patches that are good enough for me until we get something more robust.

Pat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants