LeakyLinks

This repository contains the open-source artifact of:

"LeakyLinks: Measuring the Security and Privacy Risks of URL Scanning Services"

Accepted at the IEEE Symposium on Security and Privacy (S&P) 2026.

Overview

The LeakyLinks framework identifies URLs that expose Sensitive Personal Information (SPI), referred to here as SPI URLs, by analyzing data from multiple URL scanning services. It processes URLs through a multi-stage pipeline:

Scraping: Collects URLs from 6 URL scanning services
Live Crawl: Visits URLs and captures before/after snapshots to detect session state
Token Detection: Identifies high-entropy tokens in URLs (potential session identifiers)
Page Difference Check: For URLs without tokens, compares before/after pages to detect session state changes
Screenshot Analysis: Analyzes screenshots of potentially sensitive URLs using a vision-based LLM

The 6 URL Scanning Services

Anyrun
Cloudflare Radar
Hybrid-Analysis
Joe Sandbox
URLQuery
URLScan

Pipeline Architecture

Data Flow

Scraper (scraper/): Continuously collects URLs from the 6 services and stores them in service-specific result tables (*_results)
Database Triggers: Automatically create entries in the analysis_output table when new URLs are scraped
Pipeline Workers (run in sequence):
- CrawlWorker (--crawl): Visits each URL twice (before/after dropping session) and captures snapshots
- URLTokenCheckWorker (--url_token_check): Detects high-entropy tokens in the final URL
- PageDifferenceCheckWorker (--page_difference_check): Only processes URLs without tokens; compares before/after pages to detect session state
- ScreenshotAnalysisWorker (--spi_detector): Analyzes screenshots for URLs that have tokens OR page differences
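The trigger step in the data flow above can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module with an in-memory database purely for demonstration. This README does not show the project's database engine, column names, or trigger definitions, so the table urlscan_results, its columns, the extra analysis_output columns, and the trigger name are all assumptions.

```python
import sqlite3

# Hypothetical, simplified schema: the real tables and columns may differ.
SCHEMA = """
CREATE TABLE urlscan_results (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL
);
CREATE TABLE analysis_output (
    id                       INTEGER PRIMARY KEY,
    source_table             TEXT NOT NULL,
    source_id                INTEGER NOT NULL,
    url                      TEXT NOT NULL,
    finalurlbefore_has_token INTEGER,  -- NULL until url_token_check runs
    page_different           INTEGER   -- NULL until page_difference_check runs
);
-- Assumed trigger: mirror each newly scraped URL into analysis_output.
CREATE TRIGGER urlscan_results_to_analysis
AFTER INSERT ON urlscan_results
BEGIN
    INSERT INTO analysis_output (source_table, source_id, url)
    VALUES ('urlscan_results', NEW.id, NEW.url);
END;
"""


def demo_trigger():
    """Insert one scraped URL and return the rows the trigger created."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO urlscan_results (url) VALUES (?)",
        ("https://example.com/reset?token=abc",),
    )
    conn.commit()
    rows = conn.execute(
        "SELECT source_table, url, finalurlbefore_has_token FROM analysis_output"
    ).fetchall()
    conn.close()
    return rows
```

Calling demo_trigger() returns one analysis_output row whose flag column is still unset, which is the state a pipeline worker would pick up next.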

Pipeline Phases

The pipeline uses the task_phase_status table to track progress through phases:

live_crawl: Visit URL, capture before/after snapshots, store in live_crawl_analysis JSON
url_token_check: Check if finalUrlBefore contains high-entropy tokens → sets finalurlbefore_has_token
page_difference_check: Only for URLs with finalurlbefore_has_token = False; compares HTML similarity → sets page_different
spi_detector: Only for URLs with (finalurlbefore_has_token = True OR page_different = True); analyzes screenshots for sensitive content
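The gating between phases can be summarized as a small decision function. The sketch below is only an illustration of the rules listed above. The function name, its parameters, and the use of None for "phase not run yet" are assumptions and do not reflect how task_phase_status is actually queried.

```python
from typing import Optional


def next_phase(crawled: bool,
               has_token: Optional[bool] = None,
               page_different: Optional[bool] = None,
               screenshot_done: bool = False) -> Optional[str]:
    """Return the next phase for a URL, or None if no phase applies.

    A flag value of None means the phase that sets it has not run yet.
    """
    if not crawled:
        return "live_crawl"
    if has_token is None:
        return "url_token_check"
    # The page comparison only runs for URLs without a token.
    if not has_token and page_different is None:
        return "page_difference_check"
    # Screenshots are analyzed if a token OR a page difference was found.
    if (has_token or page_different) and not screenshot_done:
        return "spi_detector"
    return None
```

For example, a crawled URL with a detected token skips the page comparison and goes straight to spi_detector, while a URL with neither a token nor a page difference is not forwarded at all.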

Key Concepts

State Drop: The process of visiting a URL twice: once normally, then again after dropping session cookies and other session values. If the page content differs between the two visits, this indicates that the URL is an SPI URL. The visits are performed in the live_crawl phase and the comparison is done in the page_difference_check phase.
analysis_output table: Central table that tracks all URLs through the pipeline. Contains:
- live_crawl_analysis: JSON with before/after snapshots and redirects
- finalurlbefore_has_token: Boolean flag set by token detection
- page_different: Boolean flag set by page difference check
- has_redirection: Boolean flag indicating redirects occurred
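The comparison behind page_different can be sketched as follows. This README only states that HTML similarity is compared, so the similarity metric (difflib.SequenceMatcher from the standard library), the threshold value, and the function names below are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Assumed threshold; the value used by the project is not stated in this README.
SIMILARITY_THRESHOLD = 0.9


def html_similarity(before_html: str, after_html: str) -> float:
    """Return a similarity ratio in [0, 1] between two HTML documents."""
    matcher = SequenceMatcher(None, before_html, after_html, autojunk=False)
    return matcher.ratio()


def page_different(before_html: str, after_html: str,
                   threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Flag the page as different if similarity falls below the threshold."""
    return html_similarity(before_html, after_html) < threshold


# Made-up snapshots: the first visit shows an order page, the second visit
# (after the state drop) shows a login prompt instead.
before = ("<html><body><h1>Order #1234</h1>"
          "<p>Ship to: Jane Doe, 1 Main St</p></body></html>")
after = "<html><body><h1>Please log in</h1></body></html>"
```

With these snapshots, page_different(before, after) evaluates to True, while comparing a snapshot with itself evaluates to False.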

Quickstart (with Docker and Docker Compose installed)

Build and start the services

docker compose up -d --build

Exec into the main application container

docker compose exec leakylinks bash

Add fake scraped examples to the database

python config/fake_plugin_fill.py examples

Run the pipeline phases in order:

# Phase 1: Live crawl (visits URLs, captures snapshots)
python pipeline/pipeline/run_pipeline.py --crawl

# Phase 2: Token detection (checks for high-entropy tokens in URLs)
python pipeline/pipeline/run_pipeline.py --url_token_check

# Phase 3: Page difference check (only for URLs without tokens)
python pipeline/pipeline/run_pipeline.py --page_difference_check

# Phase 4: Screenshot analysis (for URLs with tokens or page differences)
python pipeline/pipeline/run_pipeline.py --spi_detector
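For convenience, the four commands above could be wrapped in a small driver script. The sketch below is not part of the repository. The helper names are hypothetical, and stopping at the first failing phase is a design choice made here, not documented behavior of run_pipeline.py. Only the script path and the four flags are taken from the commands above.

```python
import subprocess
import sys

RUNNER = "pipeline/pipeline/run_pipeline.py"

# Flags in the order given in the Quickstart.
PHASE_FLAGS = [
    "--crawl",
    "--url_token_check",
    "--page_difference_check",
    "--spi_detector",
]


def build_commands(python: str = sys.executable):
    """Return one command (as an argument list) per pipeline phase, in order."""
    return [[python, RUNNER, flag] for flag in PHASE_FLAGS]


def run_all(dry_run: bool = True) -> None:
    """Print the commands, or run them one after another if dry_run is False."""
    for cmd in build_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError, so later phases are
            # skipped if an earlier one fails.
            subprocess.run(cmd, check=True)
```

Calling run_all() prints the commands without executing them; run_all(dry_run=False) would execute them inside the application container.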

Components

Scraper (scraper/): Collects data from the 6 URL scanning services. It gathers details like the URL, screenshot URL, and results from the API. Runs continuously to accumulate data over time.
URL Token Checker (url_token_checker/): Parses URLs (with full path+query), applies basic checks, then uses entropy analysis to detect high-entropy tokens and flag potentially sensitive URLs.
Live Crawl (live_crawl/): Visits URLs twice (with and without session values) to capture before/after snapshots. This implements the "State Drop" technique to detect SPI URLs.
Page Difference Checker (page_difference_checker/): Compares before/after HTML pages to detect session state changes. Only processes URLs that don't have tokens.
Screenshot Analyzer (spi_detector/): Processes screenshots from URLs that have tokens or showed page differences, using vision-based LLM analysis to detect sensitive content. Performs concurrent batch processing with checkpointing support.
Honey (honey/): Infrastructure for the honeypot experiment, including the submitters and the base honeypage.
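The entropy analysis mentioned for the URL Token Checker can be illustrated with a minimal sketch based on Shannon entropy. The actual checks in url_token_checker/ are not described in this README, so the way the URL is split into candidate tokens, the minimum length, the threshold, and all function names below are assumptions.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

MIN_TOKEN_LENGTH = 16    # assumed: ignore short path segments and values
ENTROPY_THRESHOLD = 3.5  # assumed: bits per character


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def candidate_tokens(url: str):
    """Collect path segments and query values as candidate tokens."""
    parts = urlsplit(url)
    segments = [seg for seg in parts.path.split("/") if seg]
    values = [value for _, value in parse_qsl(parts.query)]
    return segments + values


def has_high_entropy_token(url: str,
                           min_length: int = MIN_TOKEN_LENGTH,
                           threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Return True if any sufficiently long candidate exceeds the threshold."""
    return any(
        len(token) >= min_length and shannon_entropy(token) >= threshold
        for token in candidate_tokens(url)
    )
```

Under these assumed settings, a URL such as https://example.com/reset?token=9f8a7c6b5e4d3c2b1a0f9e8d7c6b5a4f is flagged, while https://example.com/products/list?page=2 is not, because none of its components is long enough to be considered.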

Configuration

The pipeline configuration is located in config/settings.py
Use .env as a reference for environment variables
The model used in the actual project was qwen3-vl:30b-a3b-instruct-q8_0, which needs more than 34 GB of VRAM. To keep the setup small, this Docker configuration uses qwen3-vl:2b-instruct instead. docker compose only finishes starting up once the LLM has been downloaded and is ready. Make sure to have 8 GB of VRAM available.

Database Schema

The main tables are:

*_results: Service-specific tables storing scraped URLs
analysis_output: Central table tracking URLs through the pipeline
task_phase_status: Tracks progress through pipeline phases
screenshot_analysis_results: Stores screenshot analysis results

Contact

Ali Mustafa — ali.mustafa@cispa.de

Citation

The paper will be available at the IEEE Computer Society Digital Library after publication.

@INPROCEEDINGS {,
author = { Mustafa, Ali and Rautenstrauch, Jannis and Hantke, Florian and Agarwal, Shubham and Calzavara, Stefano and Stock, Ben },
booktitle = { 2026 IEEE Symposium on Security and Privacy (SP) },
title = {{ LeakyLinks: Measuring the Security and Privacy Risks of URL Scanning Services }},
year = {2026},
volume = {},
ISSN = {2375-1207},
pages = {834-853},
abstract = { URL scanning services are widely used in security workflows to detect malicious websites and protect users from online threats. However, their common practice of publicly indexing scanned URLs may unintentionally expose sensitive user information through URL-embedded access credentials. Although isolated accounts of such privacy incidents exist, a systematic assessment of their prevalence is still lacking. We present LEAKYLINKS, an automated analysis pipeline that combines URL filtering with LLM-driven semantic classification to identify URLs exposing Sensitive Personal Information (SPI). Using LEAKYLINKS, we analyze URLs collected from public feeds of six prominent URL scanning services over a period of three weeks. With the framework, we visited 338k URLs, identifying over 4k URLs which leak SPI with a precision of 97%. To further assess the extent to which published URLs are actively accessed by third parties, we deploy honeypages and submit their links to the selected URL scanning services. Our measurements confirm that external entities access URLs submitted to these scanners, often from potentially suspicious IPs exhibiting behavior commonly associated with reconnaissance or opportunistic probing. Taken together, these findings indicate that URL scanning services represent a valuable target for web adversaries and may already be subject to active exploitation in the wild. },
keywords = {},
doi = {10.1109/SP63933.2026.00130},
url = {https://doi.ieeecomputersociety.org/10.1109/SP63933.2026.00130},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month =May}
