Add HA/CARP safety for PPP link startup#9896
Add HA/CARP safety for PPP link startup#9896RedVortex wants to merge 2 commits intoopnsense:masterfrom
Conversation
Implement HA/CARP safety checks for PPP links to prevent startup if no parent interface is CARP MASTER when 'Disconnect dialup interfaces' is enabled.
|
This patch fixes a long-standing issue affecting PPPoE interfaces in OPNsense HA deployments using CARP with the Problem Later in the normal boot process, This causes an unwanted PPPoE session to be established on the backup node during boot. Impact
In short, the existing CARP hook reacts to the state change, but the later PPP startup path can override that decision during boot. Goal of this patch This patch adds a guard inside If no parent interface is Why this approach The existing This makes the behavior fail closed:
This is safer for HA because it prevents the backup node from briefly establishing an unwanted PPP session. Testing In my test case, before this patch:
After this patch:
Request for review
I believe the current behavior is a real HA bug, but I would like reviewer feedback to ensure the fix is correct and consistent with the project’s expectations before it is merged. |
Proper indentation and better comment for HA/CARP safety checks for PPP links
| foreach ($ports as $port) { | ||
| $ifconfig_out = shell_exec('/sbin/ifconfig ' . escapeshellarg($port) . ' 2>/dev/null'); | ||
| if (!is_string($ifconfig_out)) { | ||
| continue; | ||
| } | ||
|
|
||
| if (preg_match('/\bcarp:\s+MASTER\b/m', $ifconfig_out)) { | ||
| $carp_master = true; | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
we can adjust this for better use of helper functions but for now my question is if we're applying the right metric here for multi-link PPP and general startup.
- Shouldn't $carp_master be true unless one or all ports are not master?
- Perhaps "BACKUP || INIT is a better metric? Looking at
core/src/etc/rc.syshook.d/carp/20-ppp
Line 54 in 43c933d
- We're kinda too far into the configuration at this point do to anything drastic such as interface_suspend() in contrast to 20-ppp so I'm wondering where the best spot is to run this code in this function or elsewhere as a preamble. It's only called in 3 places after all.
There was a problem hiding this comment.
The last thought is why not defer to 20-ppp completely. I suspect the master doesn't come up then on a reboot or both machines think they are master for a bit? I'm talking about a hard return of the code, but that would be ouside the function kinda like 3.) implies
There was a problem hiding this comment.
the funny thing is 20-ppp isn't using MLPPP
core/src/etc/rc.syshook.d/carp/20-ppp
Line 50 in 43c933d
Maybe we should emulate this here and annotate it so the behaviour matches (more or less point 2)
There was a problem hiding this comment.
Yeah I really wanted and tried to use 20-ppp but unfortunately it doesn't work since it is called early in the bootup. It tries and fails to bring down pppoe because it is not even up yet, we even see errors logs in the bootup regarding this.
And some time during bootup, after 20-ppp was called, the bootup brings up all interfaces and the code that handles bringing up all interfaces actually doesn't seem to care if it should or not bring up pppoe (because of CARP status) and since CARP is stable at this point (BACKUP since this is the BACKUP CARP node, the other node is MASTER), 20-ppp is never called again. Furthermore, even if it would be called later on, the damage is done during bootup since it started a pppoe session but it shouldn't have (thus creating issues since we now have 2 nodes having a public WAN on the same PPPoE user and also breaking Dynamic DNS thus flapping the Dynamic DNS public IP between the 2 nodes now).
This issue really happens only during a reboot of opnsense (during bootup), never otherwise in my case (except if I edit the pppoe interface and save it, then it also brings it up (but it shouldn't since the node is CARP BACKUP) but I haven't tried that (save the pppoe interface on the backup to see if my patch also fixes/handle this, I'll give it a shot later tonight).
Thus why I tried to put this code in this akward place to handle this situation (pppoe coming up on bootup whether or not in CARP BACKUP mode). This patch also only applies when the HA setting is configured to bring down ppp interfaces when CARP becomes BACKUP.
Also, to be honest I never considered if it would have an impact on MLPPP but I assume the logic is the same, I'm not sure it is handled at the same place in the script though.
I thought also about creating a custom 99-scriptsomething that would trigger and bring down pppoe later or a cronjob checking the state of CARP and bring it down in case it came up after bootup but it is very bad that the pppoe came up at all. Some providers with block you or rate limit you if you try to bring up 2 pppoe on the same user or link or even create conflicts if, for example at my work, we have a static IP on our PPPoE username, thus resulting in conflict between the 2 nodes. So it definitely should never come up at all if the node is BACKUP even if bring it down later, it creates important problems.
|
I've just gone over the code bits and reduce your solution. Can you confirm this one works on top? 2d30a94 I want do take this further but not without getting some basics out of the way first. Thanks, |
Sorry about the delay getting back to you, I needed to make sure nobody would be complaining if I brought the network down, hehehe. And I also wanted to test this properly for you. Also, I had some issues applying the patch, seems like I had to apply everything in between 26.1.3 and this patch for both files My 2 patches after 26.1.3 on src/etc/inc/interfaces.inc 2 patches you did on 20-ppp after 26.1.3 on src/etc/rc.syshook.d/carp/20-ppp And finally your last one on top of those 4 for both src/etc/inc/interfaces.inc and src/etc/rc.syshook.d/carp/20-ppp
I'm happy to report that this works very well for me. I tested:
So far so good to me, looks great on my side, no issue to report from my point of view and setup. And I definitely prefer the way you did that versus my way too looong code in the interfaces includes... Thanks, |
|
@RedVortex thanks for your tests. I'm not really satisfied with the code for historic reasons. Did a futile refactor in 5914ce7 but realized we're depending on CARP when we start before CARP is even configured (the first time). This needs more brain cells and time, but I will continue. Cheers, |
|
@fichtner Thanks. This is definitely not urgent at all, I've been living with this little issue for years so I'm used to it, there is absolutely no reason to rush this in any way. This is indeed a "touchy" patch to do, especially in those include files. I totally understand that we have to play it very safe here and not burn ourselves and especially not burn everyone else either. Especially to fix something that is as small as this bug. Pretty sure not a lot of people have this issue and/or they don't bother to try to fix or debug it, lol, I took a chance. I'll be monitor this thread closely in case you need me for anything, in the meantime I'll use our little patches that are good enough for me until we get something more robust. Pat |
Implement HA/CARP safety checks for PPP links to prevent startup if no parent interface is CARP MASTER when 'Disconnect dialup interfaces' is enabled.