This repository contains the open-source artifact of:
"LeakyLinks: Measuring the Security and Privacy Risks of URL Scanning Services"
Accepted at IEEE Symposium on Security and Privacy (S&P) 2026.
The LeakyLinks framework identifies URLs that expose Sensitive Personal Information (SPI URLs) by analyzing data from multiple URL scanning services. It processes URLs through a multi-stage pipeline:
- Scraping: Collects URLs from 6 URL scanning services
- Live Crawl: Visits URLs and captures before/after snapshots to detect session state
- Token Detection: Identifies high-entropy tokens in URLs (potential session identifiers)
- Page Difference Check: For URLs without tokens, compares before/after pages to detect session state changes
- Screenshot Analysis: Analyzes screenshots of potentially sensitive URLs using a vision-based LLM
The 6 URL scanning services are:
- Anyrun
- Cloudflare Radar
- Hybrid-Analysis
- Joe Sandbox
- URLQuery
- URLScan
- Scraper (scraper/): Continuously collects URLs from the 6 services and stores them in service-specific result tables (*_results)
- Database Triggers: Automatically create entries in the analysis_output table when new URLs are scraped
- Pipeline Workers (run in sequence):
  - CrawlWorker (--crawl): Visits each URL twice (before/after dropping session) and captures snapshots
  - URLTokenCheckWorker (--url_token_check): Detects high-entropy tokens in the final URL
  - PageDifferenceCheckWorker (--page_difference_check): Only processes URLs without tokens; compares before/after pages to detect session state
  - ScreenshotAnalysisWorker (--spi_detector): Analyzes screenshots for URLs that have tokens OR page differences
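The hand-off from scraper to pipeline via database triggers can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library purely for demonstration; the database engine, the column names (source_table, source_id, url) and the trigger name are assumptions and not taken from this repository. Only the *_results naming pattern and the analysis_output table name come from the description above.

```python
import sqlite3

# Hypothetical, reduced schema: the real tables have more columns and may
# live in a different database engine.
SCHEMA = """
CREATE TABLE urlscan_results (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL
);
CREATE TABLE analysis_output (
    id           INTEGER PRIMARY KEY,
    source_table TEXT NOT NULL,
    source_id    INTEGER NOT NULL,
    url          TEXT NOT NULL
);
-- Every newly scraped row is queued for the pipeline automatically.
CREATE TRIGGER urlscan_results_to_analysis
AFTER INSERT ON urlscan_results
BEGIN
    INSERT INTO analysis_output (source_table, source_id, url)
    VALUES ('urlscan_results', NEW.id, NEW.url);
END;
"""


def demo() -> list:
    """Insert one scraped URL and return what the trigger queued."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # The scraper only writes to its own service table ...
    conn.execute(
        "INSERT INTO urlscan_results (url) VALUES (?)",
        ("https://example.com/reset?token=abc",),
    )
    # ... and the trigger has already created the analysis_output entry.
    rows = conn.execute(
        "SELECT source_table, source_id, url FROM analysis_output"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

With this design the scraper does not need to know about the pipeline at all; it only writes to its own result table.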
The pipeline uses the task_phase_status table to track progress through phases:
- live_crawl: Visit URL, capture before/after snapshots, store in live_crawl_analysis JSON
- url_token_check: Check if finalUrlBefore contains high-entropy tokens → sets finalurlbefore_has_token
- page_difference_check: Only for URLs with finalurlbefore_has_token = False; compares HTML similarity → sets page_different
- spi_detector: Only for URLs with (finalurlbefore_has_token = True OR page_different = True); analyzes screenshots for sensitive content
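The gating between phases can be summarized as a small decision function. This is a minimal sketch, not the code in run_pipeline.py: it assumes a row is a plain dict, that a value of None means "phase not run yet", and it introduces a hypothetical spi_done flag to mark completed screenshot analysis. The actual pipeline tracks progress in the task_phase_status table instead.

```python
from typing import Optional


def next_phase(row: dict) -> Optional[str]:
    """Return the next phase a URL should go through, or None if finished.

    `row` mimics an analysis_output entry. None values stand for "not set
    yet" (an assumption of this sketch); `spi_done` is a hypothetical flag.
    """
    if row.get("live_crawl_analysis") is None:
        return "live_crawl"

    has_token = row.get("finalurlbefore_has_token")
    if has_token is None:
        return "url_token_check"

    page_different = row.get("page_different")
    # The page comparison only runs for URLs without a token.
    if has_token is False and page_different is None:
        return "page_difference_check"

    # Screenshots are analyzed if the URL has a token OR the pages differ.
    if (has_token or page_different) and not row.get("spi_done", False):
        return "spi_detector"

    return None


if __name__ == "__main__":
    crawled = {"live_crawl_analysis": {"redirects": []}}
    print(next_phase({}))
    print(next_phase(crawled))
    print(next_phase({**crawled, "finalurlbefore_has_token": False}))
```

A URL with neither a token nor a page difference never reaches the screenshot analysis, which keeps the number of LLM calls down.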
- State Drop: The process of visiting a URL twice: once normally, then again after dropping session cookies/values. If the page content differs, it indicates the URL is an SPI URL. This is implemented in the live_crawl phase and analyzed in the page_difference_check phase.
- analysis_output table: Central table that tracks all URLs through the pipeline. Contains:
  - live_crawl_analysis: JSON with before/after snapshots and redirects
  - finalurlbefore_has_token: Boolean flag set by token detection
  - page_different: Boolean flag set by page difference check
  - has_redirection: Boolean flag indicating redirects occurred
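To make the State Drop comparison concrete, here is a minimal sketch of how a before/after HTML comparison could set page_different. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure; the measure used in page_difference_checker/ and the 0.9 threshold are assumptions of this example, not values taken from the repository.

```python
from difflib import SequenceMatcher


def page_different(html_before: str, html_after: str,
                   threshold: float = 0.9) -> bool:
    """Return True if the two snapshots are less similar than `threshold`.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters in long inputs, which would distort the ratio for HTML.
    ratio = SequenceMatcher(None, html_before, html_after,
                            autojunk=False).ratio()
    return ratio < threshold


if __name__ == "__main__":
    before = ("<html><body><h1>Order 1234 for Alice</h1>"
              "<p>Shipping to 1 Main St</p></body></html>")
    after = "<html><body><h1>Session expired</h1></body></html>"
    print(page_different(before, after))   # snapshots clearly differ
    print(page_different(before, before))  # identical snapshots
```

SequenceMatcher runs in quadratic time in the worst case, so for large pages a cheaper comparison (for example on extracted text) may be preferable.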
- Build and start the services
docker compose up -d --build
- Exec into the main application container
docker compose exec leakylinks bash
- Add fake scraped examples to the database
python config/fake_plugin_fill.py examples
- Run the pipeline phases in order:
# Phase 1: Live crawl (visits URLs, captures snapshots)
python pipeline/pipeline/run_pipeline.py --crawl
# Phase 2: Token detection (checks for high-entropy tokens in URLs)
python pipeline/pipeline/run_pipeline.py --url_token_check
# Phase 3: Page difference check (only for URLs without tokens)
python pipeline/pipeline/run_pipeline.py --page_difference_check
# Phase 4: Screenshot analysis (for URLs with tokens or page differences)
python pipeline/pipeline/run_pipeline.py --spi_detector
- Scraper (scraper/): Collects data from the 6 URL scanning services. It gathers details like the URL, screenshot URL, and results from the API. Runs continuously to accumulate data over time.
- URL Token Checker (url_token_checker/): Parses URLs (with full path+query), applies basic checks, then uses entropy analysis to detect high-entropy tokens and flag potentially sensitive URLs.
- Live Crawl (live_crawl/): Visits URLs twice (with and without session values) to capture before/after snapshots. This implements the "State Drop" technique to detect SPI URLs.
- Page Difference Checker (page_difference_checker/): Compares before/after HTML pages to detect session state changes. Only processes URLs that don't have tokens.
- Screenshot Analyzer (spi_detector/): Processes screenshots from URLs that have tokens or showed page differences, using vision-based LLM analysis to detect sensitive content. Performs concurrent batch processing with checkpointing support.
- Honey (honey/): Infrastructure for the honeypot experiment, including submitters and the base honeypage used.
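As an illustration of the entropy analysis mentioned for the URL Token Checker, the following sketch computes the Shannon entropy of each path segment and query value and flags long, high-entropy strings. The function names and both thresholds (min_length=16, min_entropy=3.5) are hypothetical choices for this example; the checks and cut-offs used in url_token_checker/ may differ.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def shannon_entropy(s: str) -> float:
    """Shannon entropy of `s` in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def has_high_entropy_token(url: str, min_length: int = 16,
                           min_entropy: float = 3.5) -> bool:
    """Return True if any path segment or query value looks like a token.

    Both thresholds are assumptions made for this sketch.
    """
    parts = urlsplit(url)
    candidates = [seg for seg in parts.path.split("/") if seg]
    candidates += [value for _, value in parse_qsl(parts.query)]
    return any(
        len(tok) >= min_length and shannon_entropy(tok) >= min_entropy
        for tok in candidates
    )


if __name__ == "__main__":
    print(has_high_entropy_token(
        "https://example.com/reset?token=9f8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c"))
    print(has_high_entropy_token("https://example.com/about/contact?page=2"))
```

Entropy alone can misfire on long human-readable slugs, whose per-character entropy can come close to that of short hex strings, which is one reason to combine it with additional checks.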
- The pipeline configuration is located in config/settings.py
- Use .env as a reference for environment variables
- The model used in the actual project was qwen3-vl:30b-a3b-instruct-q8_0, which needs more than 34 GB of VRAM. This Docker setup uses the smaller qwen3-vl:2b-instruct instead. docker compose only finishes starting up once the LLM is downloaded and ready. Make sure you have 8 GB of VRAM available.
The main tables are:
- *_results: Service-specific tables storing scraped URLs
- analysis_output: Central table tracking URLs through the pipeline
- task_phase_status: Tracks progress through pipeline phases
- screenshot_analysis_results: Stores screenshot analysis results
Ali Mustafa — ali.mustafa@cispa.de
The paper will be available at the IEEE Computer Society Digital Library after publication.
@INPROCEEDINGS {,
author = { Mustafa, Ali and Rautenstrauch, Jannis and Hantke, Florian and Agarwal, Shubham and Calzavara, Stefano and Stock, Ben },
booktitle = { 2026 IEEE Symposium on Security and Privacy (SP) },
title = {{ LeakyLinks: Measuring the Security and Privacy Risks of URL Scanning Services }},
year = {2026},
volume = {},
ISSN = {2375-1207},
pages = {834-853},
abstract = { URL scanning services are widely used in security workflows to detect malicious websites and protect users from online threats. However, their common practice of publicly indexing scanned URLs may unintentionally expose sensitive user information through URL-embedded access credentials. Although isolated accounts of such privacy incidents exist, a systematic assessment of their prevalence is still lacking. We present LEAKYLINKS, an automated analysis pipeline that combines URL filtering with LLM-driven semantic classification to identify URLs exposing Sensitive Personal Information (SPI). Using LEAKYLINKS, we analyze URLs collected from public feeds of six prominent URL scanning services over a period of three weeks. With the framework, we visited 338k URLs, identifying over 4k URLs which leak SPI with a precision of 97%. To further assess the extent to which published URLs are actively accessed by third parties, we deploy honeypages and submit their links to the selected URL scanning services. Our measurements confirm that external entities access URLs submitted to these scanners, often from potentially suspicious IPs exhibiting behavior commonly associated with reconnaissance or opportunistic probing. Taken together, these findings indicate that URL scanning services represent a valuable target for web adversaries and may already be subject to active exploitation in the wild. },
keywords = {},
doi = {10.1109/SP63933.2026.00130},
url = {https://doi.ieeecomputersociety.org/10.1109/SP63933.2026.00130},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = May
}