Skip to content

Conversation

@maxkpower
Copy link
Contributor

🎟️ Tracking

https://bitwarden.atlassian.net/browse/PM-28548

📔 Objective

Migrates phishing detection from domain-based to URL-based matching.

Detection Logic:

  • Changed from hostname-only matching to full URL matching with prefix support
  • Added support for subpath matching (e.g., /login matches /login/oauth)
  • Implemented case-insensitive URL normalization
  • Handles trailing slashes and query parameters correctly

📸 Screenshots

⏰ Reminders before review

  • Contributor guidelines followed
  • All formatters and local linters executed and passed
  • Written new unit and / or integration tests where applicable
  • Protected functional changes with optionality (feature flags)
  • Used internationalization (i18n) for all UI strings
  • CI builds passed
  • Communicated to DevOps any deployment requirements
  • Updated any necessary documentation (Confluence, contributing docs) or informed the documentation team

🦮 Reviewer guidelines

  • 👍 (:+1:) or similar for great changes
  • 📝 (:memo:) or ℹ️ (:information_source:) for notes or general info
  • ❓ (:question:) for questions
  • 🤔 (:thinking:) or 💭 (:thought_balloon:) for more open inquiry that's not quite a confirmed issue and could potentially benefit from discussion
  • 🎨 (:art:) for suggestions / improvements
  • ❌ (:x:) or ⚠️ (:warning:) for more significant problems or concerns needing attention
  • 🌱 (:seedling:) or ♻️ (:recycle:) for future improvements or indications of technical debt
  • ⛏ (:pick:) for minor or nitpick changes

@maxkpower maxkpower requested a review from a team as a code owner November 21, 2025 00:29
@maxkpower maxkpower enabled auto-merge (squash) November 21, 2025 00:29
@claude
Copy link

claude bot commented Nov 21, 2025

Claude finished @maxkpower's task —— View job


Code Review Complete

Summary

This PR migrates phishing detection from domain-based to URL-based matching. The implementation adds valuable functionality for path-specific phishing detection, but there are several concerns that should be addressed before merging.

Findings

Finding 1: Performance issue - Linear search on every page navigation could cause noticeable delays with large databases (apps/browser/src/dirt/phishing-detection/services/phishing-data.service.ts:147)

Finding 2: Query parameter bypass vulnerability - URLs with query params in database can be bypassed (apps/browser/src/dirt/phishing-detection/services/phishing-data.service.ts:155)

Finding 3: Session-based ignore behavior changed without documentation - Security trade-off not mentioned in PR description (apps/browser/src/dirt/phishing-detection/services/phishing-detection.service.ts:76)

Finding 4: Test link protocol mismatch - QA test link changed from domain to HTTP-only URL (apps/browser/src/dirt/phishing-detection/services/phishing-data.service.ts:63)

Finding 5: Redundant trailing slash handling adds unnecessary complexity (apps/browser/src/dirt/phishing-detection/services/phishing-data.service.ts:160)

Finding 6: Technical debt - Field named domains stores URLs, creating maintainability issues (apps/browser/src/dirt/phishing-detection/services/phishing-data.service.ts:24)

Finding 7: Missing test coverage for edge cases - Query params, fragments, ports, protocols, and path case sensitivity not tested (apps/browser/src/dirt/phishing-detection/services/phishing-data.service.spec.ts:123)

PR Description Feedback

💭 The PR description could be improved by documenting:

  1. The behavioral change to session-based hostname ignoring (previously warned on each visit)
  2. How the system handles protocol differences (HTTP vs HTTPS)
  3. Performance considerations for the linear search implementation
  4. Migration strategy for existing cached domain data


// Prefix match: Check if the current URL starts with the phishing link
// This handles cases where the user visits a subpath of a known phishing URL
if (normalizedUrl.startsWith(link + "/") || normalizedUrl.startsWith(link + "?")) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential bypass with query parameters

The prefix matching logic has a gap. Consider this scenario:

Database contains: http://phish.com/login?param=malicious
User visits: http://phish.com/login

The current logic checks normalizedUrl.startsWith(link + "?") which would fail here since:

  • normalizedUrl = "http://phish.com/login"
  • link = "http://phish.com/login?param=malicious"
  • Check: "http://phish.com/login".startsWith("http://phish.com/login?param=malicious" + "?") → false

Impact: If the phishing database contains URLs with query parameters, users can bypass detection by visiting the same path without those parameters.

Suggested fix:

// Parse both URLs to compare base path separately from query params
const normalizedUrlObj = new URL(normalizedUrl);
const linkObj = new URL(link);

// Compare without query params first
const normalizedWithoutQuery = normalizedUrlObj.origin + normalizedUrlObj.pathname;
const linkWithoutQuery = linkObj.origin + linkObj.pathname;

if (normalizedWithoutQuery === linkWithoutQuery) {
  return true;
}

Or consider if the database should only contain paths without query parameters.

// - "http://phish.com/login/oauth" (subpath match)
// - "http://phish.com/login?param=value" (with query params)

for (const link of links) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance issue: O(n) lookup on every navigation

This linear search through all phishing links happens on every single page navigation. With the phishing database containing thousands of URLs, this could cause noticeable delays.

Current complexity: O(n) where n = number of phishing links
Expected database size: Potentially 10,000+ URLs

Impact:

  • Slows down every page load in the browser
  • Scales poorly as database grows
  • Multiple string operations per link (prefix checks, trailing slash handling)

Recommended solution:
Use a more efficient data structure. Consider:

  1. Trie (Prefix Tree): Optimal for prefix matching, O(m) lookup where m = URL length
  2. Hash map by hostname: Pre-group by hostname for O(1) hostname lookup, then check paths
  3. Pre-computed normalized Set: At minimum, remove the trailing slash logic from the hot path

Example optimization using hostname grouping:

private _linksByHostname$ = this._cachedState.state$.pipe(
  map((state) => {
    const map = new Map<string, Set<string>>();
    for (const link of state?.domains ?? []) {
      const url = new URL(link);
      const hostname = url.hostname.toLowerCase();
      if (!map.has(hostname)) {
        map.set(hostname, new Set());
      }
      map.get(hostname)!.add(url.pathname.toLowerCase().replace(/\/$/, ""));
    }
    return map;
  })
);

This would reduce the search space significantly when checking URLs.

}

// Also check if the link in database has a trailing slash
if (link.endsWith("/")) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Redundant trailing slash handling

The code handles trailing slashes in three different places:

  1. Line 60: Normalization when loading data
  2. Line 139: Normalization of input URL
  3. Lines 160-168: Additional trailing slash checks during matching

Question: Is the third check (lines 160-168) necessary given that both the database links and input URL are already normalized to remove trailing slashes?

If both link and normalizedUrl have trailing slashes removed at lines 60 and 139, when would link.endsWith("/") be true at line 160?

Consider removing this redundant check or clarifying why it's needed despite prior normalization.

.map((line) => line.toLowerCase().replace(/\/$/, "")) // Normalize URLs
.concat(
this._testLinks,
"http://phishing.testcategory.com", // Included for QA to test in prod
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Test link missing protocol

The test link "http://phishing.testcategory.com" is added here, but it's being concatenated with already-normalized URLs that have protocols.

However, the old test link was "phishing.testcategory.com" (line 63 in the previous version). Now it's "http://phishing.testcategory.com".

Question: Does QA expect to test with http:// or https:// protocol? If they use https://phishing.testcategory.com, it won't match since only the http:// version is in the set.

Consider adding both protocols or documenting which protocol QA should use.

const url = new URL("HTTP://PHISH.COM");
const result = await service.isPhishingDomain(url);
expect(result).toBe(true);
});
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎨 Missing test cases for comprehensive URL matching

The test suite doesn't cover several important edge cases for the new URL-based matching logic:

Missing scenarios:

  1. Query parameter handling: Does http://phish.com/login?param=value match http://phish.com/login?
  2. Fragment/hash handling: Does http://phish.com/login#section match http://phish.com/login?
  3. Port numbers: Does http://phish.com:8080/login match http://phish.com/login?
  4. Protocol differences: Does https://phish.com/login match http://phish.com/login?
  5. Case sensitivity in paths: Does http://phish.com/LOGIN match http://phish.com/login?
  6. Similar prefixes (false positive check): Ensure http://phish.com/log doesn't match http://phish.com/login

Consider adding tests for these scenarios to ensure the matching logic behaves as expected.

concatMap(async (message) => {
const url = new URL(message.url);
// Store the hostname so we ignore all URLs on this domain for the session
this._ignoredHostnames.add(url.hostname);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Session-based ignore behavior change

The behavior changed from "block again on next visit" to "allow all pages on this domain for the session."

Previous behavior: User clicks "continue" → navigates to phishing site → if they visit again later in the same session, they'd be warned again
New behavior: User clicks "continue" once → all future visits to that hostname are allowed for the entire session

Questions:

  1. Is this intentional? The comment was updated but the actual behavioral change isn't mentioned in the PR description.
  2. Should this be more granular (per-URL) instead of per-hostname? For example, if a user allows http://phish.com/safe-page, should http://phish.com/login also be allowed?

This seems like a security trade-off that should be explicitly acknowledged in the PR description.


export type PhishingData = {
domains: string[];
domains: string[]; // Note: Despite the name, this now stores full phishing URLs/links
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Technical debt: Misleading field name

The field is named domains but now stores full URLs. This creates confusion and technical debt.

Why this matters:

  • Code readers will expect domain strings like "phish.com", not URLs like "http://phish.com/login"
  • The type definition doesn't reflect the actual data structure
  • Future maintainers might introduce bugs by treating these as domains

Recommendation:
While backward compatibility is important for the storage key name, consider:

  1. Renaming the TypeScript field to urls or links in the type definition
  2. Using a migration/transformation layer when reading/writing to maintain storage compatibility
  3. Updating the KeyDefinition comment to explain the mismatch

This would make the code more maintainable without breaking cached data.

@github-actions
Copy link
Contributor

Logo
Checkmarx One – Scan Summary & Details3a8a78ef-67df-4a6c-b4e4-2ef59966e88e

Great job! No new security vulnerabilities introduced in this pull request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants