|
1 | 1 | # Copilot Instructions for mail-parser |
2 | 2 |
|
3 | | -## Project Overview |
| 3 | +mail-parser is a **production-grade email parsing library** for Python that transforms raw email messages into |
| 4 | +structured Python objects. Originally built as the foundation for [SpamScope](https://github.com/SpamScope/spamscope), |
| 5 | +it excels at security analysis, forensics, and RFC-compliant email processing. |
4 | 6 |
|
5 | | -mail-parser is a Python library that parses raw email messages into structured Python objects, |
6 | | -serving as the foundation for [SpamScope](https://github.com/SpamScope/spamscope). It handles both |
7 | | -standard email formats and Outlook .msg files, with a focus on security analysis and forensics. |
| 7 | +## Core Architecture |
8 | 8 |
|
9 | | -## Architecture & Key Components |
| 9 | +### Factory-Based API Pattern |
10 | 10 |
|
11 | | -### Core Parser (`src/mailparser/core.py`) |
| 11 | +**Always use factory functions** instead of direct `MailParser()` instantiation: |
12 | 12 |
|
13 | | -- **MailParser class**: Main parser with factory methods (`from_file`, `from_string`, `from_bytes`, |
14 | | - etc.) |
15 | | -- **Property-based API**: Email components accessible as properties (`.subject`, `.from_`, |
16 | | - `.attachments`) |
17 | | -- **Multi-format access**: Each property has `_raw`, `_json` variants (e.g., `mail.to`, |
18 | | - `mail.to_raw`, `mail.to_json`) |
19 | | -- **Defect detection**: Identifies RFC non-compliance for security analysis (`mail.defects`, |
20 | | - `mail.defects_categories`) |
| 13 | +```python |
| 14 | +import mailparser |
| 15 | +mail = mailparser.parse_from_file(filepath) # Standard email files |
| 16 | +mail = mailparser.parse_from_string(raw_email) # Email as string |
| 17 | +mail = mailparser.parse_from_bytes(email_bytes) # Email as bytes |
| 18 | +mail = mailparser.parse_from_file_msg(msg_file) # Outlook .msg files |
| 19 | +``` |
21 | 20 |
|
22 | | -### Your skills and knowledge on RFC and Email Parsing |
| 21 | +### Triple-Format Property Access |
23 | 22 |
|
24 | | -You are an AI assistant with expert-level knowledge of all email protocol RFCs, including but not |
25 | | -limited to RFC 5321 (SMTP), RFC 5322 (Internet Message Format), RFC 2045–2049 (MIME), RFC 3501 |
26 | | -(IMAP), RFC 1939 (POP3), RFC 8620 (JMAP), and related security, extension, and header RFCs. Your |
27 | | -responsibilities include: |
| 23 | +Every parsed component offers **three access patterns** (`src/mailparser/core.py:550-570`): |
28 | 24 |
|
29 | | -Providing accurate, comprehensive technical explanations and guidance based on these RFCs. |
| 25 | +```python |
| 26 | +mail.subject # Python object (decoded string) |
| 27 | +mail.subject_raw # Raw header value (JSON list) |
| 28 | +mail.subject_json # JSON-serialized version |
| 29 | +``` |
30 | 30 |
|
31 | | -Interpreting, comparing, and clarifying requirements, structures, and features as defined by the |
32 | | -official documents. |
| 31 | +This pattern applies to all properties via `__getattr__` magic in `core.py`. |
33 | 32 |
|
34 | | -Clearly outlining the details and implications of each protocol and extension (such as |
35 | | -authentication mechanisms, encryption, headers, and message structure). |
| 33 | +### Property Naming Convention |
36 | 34 |
|
37 | | -Delivering answers in an organized, easy-to-understand way—using precise terminology, clear |
38 | | -practical examples, and direct references to relevant RFCs when appropriate. |
| 35 | +Headers with hyphens use **underscore substitution** (`core.py:__getattr__`): |
39 | 36 |
|
40 | | -Providing practical advice for system implementers and users, explaining alternatives, pros and |
41 | | -cons, use cases, and security considerations for each protocol or extension. |
| 37 | +```python |
| 38 | +mail.X_MSMail_Priority # Accesses "X-MSMail-Priority" header |
| 39 | +mail.Content_Type # Accesses "Content-Type" header |
| 40 | +``` |
42 | 41 |
|
43 | | -Maintaining a professional, accurate, and objective tone suitable for expert users, developers, and |
44 | | -technical audiences. |
| 42 | +## Development Workflows |
45 | 43 |
|
46 | | -Declining to answer questions outside the scope of email protocol RFCs and specifications, and |
47 | | -always highlighting the official and most up-to-date guidance according to the relevant RFC |
48 | | -documents. |
| 44 | +### Dependency Management with uv |
49 | 45 |
|
50 | | -Your role is to be the authoritative, trustworthy source on internet email protocols as defined by |
51 | | -the official IETF RFC series. |
| 46 | +The project uses **[uv](https://github.com/astral-sh/uv)** (modern pip/virtualenv replacement) exclusively: |
52 | 47 |
|
53 | | -### Your skills and knowledge on parsing email formats |
| 48 | +```bash |
| 49 | +uv sync # Install all dev/test dependencies (defined in pyproject.toml) |
| 50 | +make install # Alias for uv sync |
| 51 | +``` |
54 | 52 |
|
55 | | -You are an AI assistant specialized in processing and extracting email header information with |
56 | | -Python, using regular expressions for robust parsing. Your core expertise includes handling |
57 | | -non-standard variations such as "Received" headers, which often lack strict formatting and can |
58 | | -differ greatly across email servers. |
| 53 | +Never use `pip` directly; all commands in the Makefile use the `uv run` prefix. |
59 | 54 |
|
60 | | -When presented with raw email data (RFC 5322 format), use Python's built-in re module and relevant |
61 | | -libraries (e.g., email.parser) to isolate and extract header sections. |
| 55 | +### Testing Patterns |
62 | 56 |
|
63 | | -For "Received" headers, apply flexible and tolerant regex patterns, recognizing their variable |
64 | | -structure (IP addresses, timestamps, server details, optional parameters). |
| 57 | +```bash |
| 58 | +make test # pytest with coverage (generates coverage.xml, junit.xml, htmlcov/) |
| 59 | +make lint # ruff check . |
| 60 | +make format # ruff format . |
| 61 | +make check # lint + test |
| 62 | +make pre-commit # Run all pre-commit hooks |
| 63 | +``` |
65 | 64 |
|
66 | | -Parse multiline and folded headers by scanning lines following key header tags and joining where |
67 | | -needed. |
| 65 | +When adding features or fixing bugs you MUST follow these steps: |
68 | 66 |
|
69 | | -Develop regex patterns that capture relevant information (e.g., SMTP server, relay path, timestamp) |
70 | | -while allowing for extraneous text. |
| 67 | +1. Add a relevant test email to `tests/mails/` if demonstrating a new case |
| 68 | +2. Write tests in the corresponding test file following existing patterns, under `tests/` |
| 69 | +3. Run `make test` to verify all tests pass before committing |
| 70 | +4. Run `uv run mail-parser -f tests/mails/mail_test_11 -j` to manually verify JSON output and that new changes |
| 71 | + work as expected |
| 72 | +5. Run `make pre-commit` to ensure code style compliance before pushing |
71 | 73 |
|
72 | | -Document the extraction process: explain which regexes are designed for typical cases and how to |
73 | | -adapt them for mismatches, edge cases, or partial matches. |
| 74 | +**Test data location**: `tests/mails/` contains malformed emails, Outlook files, and various encodings |
| 75 | +(`mail_test_1` through `mail_test_17`, `mail_malformed_1-3`, `mail_outlook_1`). |
74 | 76 |
|
75 | | -When parsing fails due to extreme non-standard formats, log the error and return a best-effort |
76 | | -result. Always explain any limitations or ambiguities in the extraction. |
| 77 | +**Critical testing rule**: When modifying parsing logic, test against malformed emails to ensure security defect |
| 78 | +detection still works. |
77 | 79 |
|
78 | | -Example generic regex for a "Received" header: Received:\s*(.*?);(.*) (captures server info and |
79 | | -date), but you should adapt and test patterns as needed. |
| 80 | +### Build & Release Process |
80 | 81 |
|
81 | | -Provide code comments, extraction summaries, and references for each regex used to ensure |
82 | | -maintainability and clarity. |
| 82 | +```bash |
| 83 | +make build # uv build → creates dist/*.tar.gz and dist/*.whl |
| 84 | +make release # build + twine upload to PyPI |
| 85 | +``` |
83 | 86 |
|
84 | | -Avoid making assumptions about the order or presence of specific header fields, and handle edge |
85 | | -cases gracefully. |
| 87 | +Version is **dynamically loaded** from `src/mailparser/version.py` (see |
| 88 | +`pyproject.toml:tool.hatch.version`). |
86 | 89 |
|
87 | | -When possible, recommend combining regex with Python's email module for initial header separation, |
88 | | -then dive deep with regex for specific, non-standard value extraction. |
| 90 | +## Security-First Parsing |
89 | 91 |
|
90 | | -Your responses must prioritize accuracy, transparency in limitations, and practical utility for |
91 | | -anyone parsing complex email headers. |
| 92 | +### Defect Detection System |
92 | 93 |
|
93 | | -### Entry Points (`src/mailparser/__init__.py`) |
| 94 | +The parser identifies RFC violations that could indicate malicious intent (`core.py:240-268`): |
94 | 95 |
|
95 | 96 | ```python |
96 | | -# Factory functions are the primary API |
97 | | -import mailparser |
98 | | -mail = mailparser.parse_from_file(filepath) |
99 | | -mail = mailparser.parse_from_string(raw_email) |
100 | | -mail = mailparser.parse_from_bytes(email_bytes) |
101 | | -mail = mailparser.parse_from_file_msg(outlook_file) # .msg files |
| 97 | +mail.has_defects # Boolean flag |
| 98 | +mail.defects # List of defect dicts by content type |
| 99 | +mail.defects_categories # Set of defect class names (e.g., "StartBoundaryNotFoundDefect") |
102 | 100 | ``` |
103 | 101 |
|
104 | | -### CLI Tool (`src/mailparser/__main__.py`) |
105 | | - |
106 | | -- Entry point: `mail-parser` command |
107 | | -- JSON output mode (`-j`) for integration with other tools |
108 | | -- Multiple input methods: file (`-f`), string (`-s`), stdin (`-k`) |
109 | | -- Outlook support (`-o`) with system dependency on `libemail-outlook-message-perl` |
| 102 | +**Epilogue defect handling** (`core.py:320-335`): When `EPILOGUE_DEFECTS` are detected, the parser extracts |
| 103 | +hidden content between MIME boundaries that could contain malicious payloads. |
110 | 104 |
|
111 | | -## Development Workflows |
| 105 | +### IP Address Extraction |
112 | 106 |
|
113 | | -### Setup & Dependencies |
| 107 | +The `get_server_ipaddress(trust)` method (`core.py:487-528`) extracts sender IPs with **trust-level validation**: |
114 | 108 |
|
115 | | -```bash |
116 | | -# Use uv for dependency management (modern pip replacement) |
117 | | -uv sync # Installs all dev/test dependencies |
118 | | -make install # Alias for uv sync |
| 109 | +```python |
| 110 | +# Finds first non-private IP in trusted headers |
| 111 | +mail.get_server_ipaddress(trust="Received") |
119 | 112 | ``` |
120 | 113 |
|
121 | | -### Testing & Quality |
| 114 | +It filters out private IP ranges using Python's `ipaddress` module. |
122 | 115 |
|
123 | | -```bash |
124 | | -make test # pytest with coverage (outputs coverage.xml, junit.xml) |
125 | | -make lint # ruff linting |
126 | | -make format # ruff formatting |
127 | | -make check # lint + test |
128 | | -make pre-commit # runs pre-commit hooks |
129 | | -``` |
130 | | - |
131 | | -For all unittest use `pytest` framework and mock external dependencies as needed. |
132 | | -When you modify code, ensure all tests pass and coverage remains high. |
| 116 | +### Received Header Parsing |
133 | 117 |
|
134 | | -### Build & Release |
| 118 | +Complex regex-based parsing (`utils.py:302-360`, patterns in `const.py:24-73`) extracts hop-by-hop routing: |
135 | 119 |
|
136 | | -```bash |
137 | | -make build # uv build (creates wheel/sdist in dist/) |
138 | | -make release # build + twine upload to PyPI |
| 120 | +```python |
| 121 | +# Returns list of dicts with: by, from, date, date_utc, delay, envelope_from, hop, with |
| 122 | +mail.received |
139 | 123 | ``` |
140 | 124 |
|
141 | | -### Docker Development |
| 125 | +**Key pattern**: `RECEIVED_COMPILED_LIST` contains pre-compiled regexes for "from", "by", "with", "id", "for", |
| 126 | +"via", "envelope-from", "envelope-sender", and date patterns. Recent fixes addressed IBM gateway duplicate matches |
| 127 | +(see comments in `const.py:26-38`). |
142 | 128 |
|
143 | | -- Dockerfile uses Python 3.10-slim with `libemail-outlook-message-perl` |
144 | | -- docker-compose.yml mounts `~/mails` for testing |
145 | | -- Image available as `fmantuano/spamscope-mail-parser` |
| 129 | +If parsing fails, it falls back to `receiveds_not_parsed()`, which returns a `{"raw": <header>, "hop": <n>}` |
| 130 | +structure. |
146 | 131 |
|
147 | | -## Key Patterns & Conventions |
| 132 | +## Project Structure Specifics |
148 | 133 |
|
149 | | -### Header Access Pattern |
| 134 | +### src/ Layout |
150 | 135 |
|
151 | | -Headers with hyphens use underscore substitution: |
| 136 | +The package uses a modern **src-layout** (`src/mailparser/`) for cleaner imports and testing isolation: |
152 | 137 |
|
153 | | -```python |
154 | | -mail.X_MSMail_Priority # for X-MSMail-Priority header |
| 138 | +```text |
| 139 | +src/mailparser/ |
| 140 | +├── __init__.py # Exports factory functions |
| 141 | +├── __main__.py # CLI entry point (mail-parser command) |
| 142 | +├── core.py # MailParser class (760 lines) |
| 143 | +├── utils.py # Parsing utilities (582 lines) |
| 144 | +├── const.py # Regex patterns and constants |
| 145 | +├── exceptions.py # Exception hierarchy |
| 146 | +└── version.py # Version string |
155 | 147 | ``` |
156 | 148 |
|
157 | | -### Attachment Structure |
| 149 | +### External Dependency: Outlook Support |
158 | 150 |
|
159 | | -```python |
160 | | -# Each attachment is a dict with standardized keys |
161 | | -for attachment in mail.attachments: |
162 | | - attachment['filename'] |
163 | | - attachment['payload'] # base64 encoded |
164 | | - attachment['content_transfer_encoding'] |
165 | | - attachment['binary'] # boolean flag |
| 151 | +Outlook `.msg` file parsing requires a **system-level Perl module**: |
| 152 | + |
| 153 | +```bash |
| 154 | +apt-get install libemail-outlook-message-perl # Debian/Ubuntu |
166 | 155 | ``` |
167 | 156 |
|
168 | | -### Received Header Parsing |
| 157 | +Parsing is triggered via the `msgconvert()` function in `utils.py`, which shells out to a Perl script. It raises |
| 158 | +`MailParserOutlookError` if that dependency is unavailable. |
169 | 159 |
|
170 | | -Complex parsing in `receiveds_parsing()` extracts hop-by-hop email routing: |
| 160 | +### CLI Tool Pattern |
171 | 161 |
|
172 | | -```python |
173 | | -mail.received # List of parsed received headers with structured data |
174 | | -# Each hop contains: by, from, date, delay, envelope_from, etc. |
| 162 | +`__main__.py` provides a production CLI with mutually exclusive input modes (`-f`, `-s`, `-k`), JSON output (`-j`), |
| 163 | +and selective printing (`-b`, `-a`, `-r`, `-t`). |
| 164 | + |
| 165 | +**Entry point defined** in `pyproject.toml:project.scripts`: |
| 166 | + |
| 167 | +```toml |
| 168 | +[project.scripts] |
| 169 | +mail-parser = "mailparser.__main__:main" |
175 | 170 | ``` |
176 | 171 |
|
177 | | -### Error Handling Hierarchy |
| 172 | +## Code Style & Tooling |
178 | 173 |
|
179 | | -```python |
180 | | -MailParserError # Base exception |
181 | | -├── MailParserOutlookError # Outlook .msg issues |
182 | | -├── MailParserEnvironmentError # Missing dependencies |
183 | | -├── MailParserOSError # File system issues |
184 | | -└── MailParserReceivedParsingError # Header parsing failures |
| 174 | +### Ruff Configuration |
| 175 | + |
| 176 | +Single linter/formatter (replaces black, isort, flake8): |
| 177 | + |
| 178 | +```toml |
| 179 | +[tool.ruff.lint] |
| 180 | +select = ["E", "F", "I"] # pycodestyle, pyflakes, isort |
| 181 | +# "UP", "B", "SIM", "S", "PT" commented out in pyproject.toml |
185 | 182 | ``` |
186 | 183 |
|
187 | | -## Testing Approach |
| 184 | +### Pytest Configuration |
188 | 185 |
|
189 | | -- Test emails in `tests/mails/` (malformed, Outlook, various encodings) |
190 | | -- Comprehensive property testing for all email components |
191 | | -- CLI integration tests in CI pipeline |
192 | | -- Coverage reporting with pytest-cov |
| 186 | +Key markers in `pyproject.toml:tool.pytest.ini_options`: |
193 | 187 |
|
194 | | -## Security Focus |
| 188 | +- `integration`: marks integration tests |
| 189 | +- Coverage outputs: XML (for CI), HTML (for local), terminal |
| 190 | +- JUnit XML for CI integration |
195 | 191 |
|
196 | | -- **Defect detection**: Identifies malformed boundaries that could hide malicious content |
197 | | -- **IP extraction**: `get_server_ipaddress()` with trust levels for forensic analysis |
198 | | -- **Epilogue analysis**: Detects hidden content in malformed MIME boundaries |
199 | | -- **Fingerprinting**: Mail and attachment hashing for threat intelligence |
| 192 | +## Common Pitfalls |
200 | 193 |
|
201 | | -## Build System Specifics |
| 194 | +1. **Don't instantiate `MailParser()` directly**—use factory functions from `__init__.py` |
| 195 | +2. **Don't use `pip`**—always use `uv` or Makefile targets |
| 196 | +3. **Don't ignore defects**—they're critical for security analysis |
| 197 | +4. **Don't assume headers exist**—use `.get()` pattern or handle `None` |
| 198 | +5. **Test against malformed emails**—`tests/mails/mail_malformed_*` files exist for this reason |
| 199 | + |
| 200 | +## Docker Development |
| 201 | + |
| 202 | +The Dockerfile uses **Python 3.10-slim-bookworm** with Outlook dependencies pre-installed. The container runs as the |
| 203 | +non-root `mailparser` user. |
| 204 | + |
| 205 | +```bash |
| 206 | +docker build -t mail-parser . |
| 207 | +docker run mail-parser -f /path/to/email |
| 208 | +``` |
202 | 209 |
|
203 | | -- **pyproject.toml**: Modern Python packaging with hatch backend |
204 | | -- **uv**: Used instead of pip for faster, reliable dependency resolution |
205 | | -- **src/ layout**: Package in `src/mailparser/` for cleaner imports |
206 | | -- **Dynamic versioning**: Version from `src/mailparser/version.py` |
| 210 | +## Key Reference Points |
207 | 211 |
|
208 | | -## External Dependencies |
| 212 | +- **Property implementation**: `core.py:540-730` (all `@property` decorators) |
| 213 | +- **Attachment extraction**: `core.py:355-475` (walks multipart, handles encoding) |
| 214 | +- **Received parsing logic**: `utils.py:302-455` + `const.py:24-73` (regex patterns) |
| 215 | +- **CLI implementation**: `__main__.py:30-347` (argparse + output formatting) |
| 216 | +- **Exception hierarchy**: `exceptions.py:20-60` (5 exception types) |
209 | 217 |
|
210 | | -- **Outlook support**: Requires system package `libemail-outlook-message-perl` + Perl module `Email::Outlook::Message` |
211 | | -- **six**: Python 2/3 compatibility (legacy requirement) |
212 | | -- **Minimal runtime deps**: Only `six>=1.17.0` required |
| 218 | +## Testing Strategy |
213 | 219 |
|
214 | | -When working with this codebase: |
| 220 | +When adding features: |
215 | 221 |
|
216 | | -- Use factory functions, not direct MailParser() instantiation |
217 | | -- Test with various malformed emails from `tests/mails/` |
218 | | -- Remember header property naming (underscores for hyphens) |
219 | | -- Consider security implications of email parsing edge cases |
| 222 | +1. Add a test email to `tests/mails/` if demonstrating a new case |
| 223 | +2. Write tests in `tests/test_mail_parser.py` following existing patterns |
| 224 | +3. Test both normal and `_raw`/`_json` property variants |
| 225 | +4. Verify defect detection for security-relevant changes |
| 226 | +5. Run `make check` before committing |
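
The underscore-for-hyphen header access described in the added "Property Naming Convention" section can be illustrated with a minimal, standalone sketch. It uses only the standard library `email` package; the `HeaderView` class is hypothetical and is not mail-parser's actual implementation, which lives in `core.py` and may differ.

```python
from email import message_from_string


class HeaderView:
    """Hypothetical wrapper for illustration; not mail-parser's MailParser."""

    def __init__(self, raw_email: str) -> None:
        self._message = message_from_string(raw_email)

    def __getattr__(self, name: str) -> str:
        # __getattr__ is only called when normal attribute lookup fails.
        # Refuse private names so a missing `_message` cannot recurse.
        if name.startswith("_"):
            raise AttributeError(name)
        # Map the Python-friendly name back to the header name.
        header_name = name.replace("_", "-")
        value = self._message.get(header_name)  # case-insensitive lookup
        if value is None:
            raise AttributeError(f"no header named {header_name!r}")
        return value


raw = "X-MSMail-Priority: Normal\nContent-Type: text/plain\n\nbody\n"
view = HeaderView(raw)
print(view.X_MSMail_Priority)
print(view.Content_Type)
```

In this sketch a missing header raises `AttributeError`, so callers can use `getattr(view, "X_Foo", None)` when a header may be absent. Whether mail-parser behaves the same way for missing headers is not stated in the diff.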
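
The private-range filtering mentioned in the added "IP Address Extraction" section can be sketched with Python's `ipaddress` module. The helper name `first_public_ip` and the sample hop list are assumptions made for illustration; this is not the body of `get_server_ipaddress()`.

```python
import ipaddress
from typing import Iterable, Optional


def first_public_ip(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that parses as a non-private IP address.

    Hypothetical helper for illustration only.
    """
    for candidate in candidates:
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            # Skip tokens that are not valid IPv4/IPv6 addresses.
            continue
        if not address.is_private:
            return str(address)
    return None


# Sample values as they might be pulled from a trusted header (made up here).
hops = ["10.0.0.5", "not-an-ip", "192.168.1.20", "8.8.8.8"]
print(first_public_ip(hops))
```

Returning `None` when nothing qualifies keeps the caller's handling explicit, in line with the diff's advice not to assume a value exists.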