Commit 093e889

Update Python version requirements to support up to 3.14 and adjust CI workflow accordingly
1 parent a59e785 commit 093e889

4 files changed

+160
-152
lines changed

.github/copilot-instructions.md

Lines changed: 156 additions & 149 deletions
@@ -1,219 +1,226 @@
11
# Copilot Instructions for mail-parser
22

3-
## Project Overview
3+
mail-parser is a **production-grade email parsing library** for Python that transforms raw email messages into
4+
structured Python objects. Originally built as the foundation for [SpamScope](https://github.com/SpamScope/spamscope),
5+
it excels at security analysis, forensics, and RFC-compliant email processing.
46

5-
mail-parser is a Python library that parses raw email messages into structured Python objects,
6-
serving as the foundation for [SpamScope](https://github.com/SpamScope/spamscope). It handles both
7-
standard email formats and Outlook .msg files, with a focus on security analysis and forensics.
7+
## Core Architecture
88

9-
## Architecture & Key Components
9+
### Factory-Based API Pattern
1010

11-
### Core Parser (`src/mailparser/core.py`)
11+
**Always use factory functions** instead of direct `MailParser()` instantiation:
1212

13-
- **MailParser class**: Main parser with factory methods (`from_file`, `from_string`, `from_bytes`,
14-
etc.)
15-
- **Property-based API**: Email components accessible as properties (`.subject`, `.from_`,
16-
`.attachments`)
17-
- **Multi-format access**: Each property has `_raw`, `_json` variants (e.g., `mail.to`,
18-
`mail.to_raw`, `mail.to_json`)
19-
- **Defect detection**: Identifies RFC non-compliance for security analysis (`mail.defects`,
20-
`mail.defects_categories`)
13+
```python
14+
import mailparser
15+
mail = mailparser.parse_from_file(filepath) # Standard email files
16+
mail = mailparser.parse_from_string(raw_email) # Email as string
17+
mail = mailparser.parse_from_bytes(email_bytes) # Email as bytes
18+
mail = mailparser.parse_from_file_msg(msg_file) # Outlook .msg files
19+
```
2120

22-
### Your skills and knowledge on RFC and Email Parsing
21+
### Triple-Format Property Access
2322

24-
You are an AI assistant with expert-level knowledge of all email protocol RFCs, including but not
25-
limited to RFC 5321 (SMTP), RFC 5322 (Internet Message Format), RFC 2045–2049 (MIME), RFC 3501
26-
(IMAP), RFC 1939 (POP3), RFC 8620 (JMAP), and related security, extension, and header RFCs. Your
27-
responsibilities include:
23+
Every parsed component offers **three access patterns** (`src/mailparser/core.py:550-570`):
2824

29-
Providing accurate, comprehensive technical explanations and guidance based on these RFCs.
25+
```python
26+
mail.subject # Python object (decoded string)
27+
mail.subject_raw # Raw header value (JSON list)
28+
mail.subject_json # JSON-serialized version
29+
```
3030

31-
Interpreting, comparing, and clarifying requirements, structures, and features as defined by the
32-
official documents.
31+
This pattern applies to all properties via `__getattr__` magic in `core.py`.
3332

34-
Clearly outlining the details and implications of each protocol and extension (such as
35-
authentication mechanisms, encryption, headers, and message structure).
33+
### Property Naming Convention
3634

37-
Delivering answers in an organized, easy-to-understand way—using precise terminology, clear
38-
practical examples, and direct references to relevant RFCs when appropriate.
35+
Headers with hyphens use **underscore substitution** (`core.py:__getattr__`):
3936

40-
Providing practical advice for system implementers and users, explaining alternatives, pros and
41-
cons, use cases, and security considerations for each protocol or extension.
37+
```python
38+
mail.X_MSMail_Priority # Accesses "X-MSMail-Priority" header
39+
mail.Content_Type # Accesses "Content-Type" header
40+
```
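A minimal sketch of how underscore-to-hyphen attribute access can be built on `__getattr__`, using only the standard library. `HeaderView` is a hypothetical class written for illustration, not mail-parser's implementation, and raising `AttributeError` for a missing header is an assumption of this sketch.

```python
from email import message_from_string


class HeaderView:
    """Hypothetical wrapper: attribute names with underscores map to hyphenated headers."""

    def __init__(self, raw_email: str):
        self._message = message_from_string(raw_email)

    def __getattr__(self, name: str):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        header_name = name.replace("_", "-")
        value = self._message.get(header_name)  # header lookup is case-insensitive
        if value is None:
            raise AttributeError(f"no header named {header_name!r}")
        return value


raw = "X-MSMail-Priority: High\nContent-Type: text/plain\n\nbody\n"
view = HeaderView(raw)
print(view.X_MSMail_Priority)  # value of the "X-MSMail-Priority" header
print(view.Content_Type)       # value of the "Content-Type" header
```

Because hyphens are not valid in Python identifiers, the substitution is what makes arbitrary header names reachable as attributes.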
4241

43-
Maintaining a professional, accurate, and objective tone suitable for expert users, developers, and
44-
technical audiences.
42+
## Development Workflows
4543

46-
Declining to answer questions outside the scope of email protocol RFCs and specifications, and
47-
always highlighting the official and most up-to-date guidance according to the relevant RFC
48-
documents.
44+
### Dependency Management with uv
4945

50-
Your role is to be the authoritative, trustworthy source on internet email protocols as defined by
51-
the official IETF RFC series.
46+
The project uses **[uv](https://github.com/astral-sh/uv)** (modern pip/virtualenv replacement) exclusively:
5247

53-
### Your skills and knowledge on parsing email formats
48+
```bash
49+
uv sync # Install all dev/test dependencies (defined in pyproject.toml)
50+
make install # Alias for uv sync
51+
```
5452

55-
You are an AI assistant specialized in processing and extracting email header information with
56-
Python, using regular expressions for robust parsing. Your core expertise includes handling
57-
non-standard variations such as "Received" headers, which often lack strict formatting and can
58-
differ greatly across email servers.
53+
Never use `pip` directly—all commands in Makefile use `uv run` prefix.
5954

60-
When presented with raw email data (RFC 5322 format), use Python's built-in re module and relevant
61-
libraries (e.g., email.parser) to isolate and extract header sections.
55+
### Testing Patterns
6256

63-
For "Received" headers, apply flexible and tolerant regex patterns, recognizing their variable
64-
structure (IP addresses, timestamps, server details, optional parameters).
57+
```bash
58+
make test # pytest with coverage (generates coverage.xml, junit.xml, htmlcov/)
59+
make lint # ruff check .
60+
make format # ruff format .
61+
make check # lint + test
62+
make pre-commit # Run all pre-commit hooks
63+
```
6564

66-
Parse multiline and folded headers by scanning lines following key header tags and joining where
67-
needed.
65+
When adding features or fixing bugs you MUST follow these steps:
6866

69-
Develop regex patterns that capture relevant information (e.g., SMTP server, relay path, timestamp)
70-
while allowing for extraneous text.
67+
1. Add relevant test email to `tests/mails/` if demonstrating new case
68+
2. Write tests in the corresponding test file following existing patterns, under `tests/`
69+
3. Run `make test` to verify all tests pass before committing
70+
4. Run `uv run mail-parser -f tests/mails/mail_test_11 -j` to manually verify JSON output and that new changes
71+
work as expected
72+
5. Run `make pre-commit` to ensure code style compliance before pushing
7173

72-
Document the extraction process: explain which regexes are designed for typical cases and how to
73-
adapt them for mismatches, edge cases, or partial matches.
74+
**Test data location**: `tests/mails/` contains malformed emails, Outlook files, and various encodings
75+
(`mail_test_1` through `mail_test_17`, `mail_malformed_1-3`, `mail_outlook_1`).
7476

75-
When parsing fails due to extreme non-standard formats, log the error and return a best-effort
76-
result. Always explain any limitations or ambiguities in the extraction.
77+
**Critical testing rule**: When modifying parsing logic, test against malformed emails to ensure security defect
78+
detection still works.
7779

78-
Example generic regex for a "Received" header: Received:\s*(.*?);(.*) (captures server info and
79-
date), but you should adapt and test patterns as needed.
80+
### Build & Release Process
8081

81-
Provide code comments, extraction summaries, and references for each regex used to ensure
82-
maintainability and clarity.
82+
```bash
83+
make build # uv build → creates dist/*.tar.gz and dist/*.whl
84+
make release # build + twine upload to PyPI
85+
```
8386

84-
Avoid making assumptions about the order or presence of specific header fields, and handle edge
85-
cases gracefully.
87+
Version is **dynamically loaded** from `src/mailparser/version.py` (see
88+
`pyproject.toml:tool.hatch.version`).
8689

87-
When possible, recommend combining regex with Python's email module for initial header separation,
88-
then dive deep with regex for specific, non-standard value extraction.
90+
## Security-First Parsing
8991

90-
Your responses must prioritize accuracy, transparency in limitations, and practical utility for
91-
anyone parsing complex email headers.
92+
### Defect Detection System
9293

93-
### Entry Points (`src/mailparser/__init__.py`)
94+
The parser identifies RFC violations that could indicate malicious intent (`core.py:240-268`):
9495

9596
```python
96-
# Factory functions are the primary API
97-
import mailparser
98-
mail = mailparser.parse_from_file(filepath)
99-
mail = mailparser.parse_from_string(raw_email)
100-
mail = mailparser.parse_from_bytes(email_bytes)
101-
mail = mailparser.parse_from_file_msg(outlook_file) # .msg files
97+
mail.has_defects # Boolean flag
98+
mail.defects # List of defect dicts by content type
99+
mail.defects_categories # Set of defect class names (e.g., "StartBoundaryNotFoundDefect")
102100
```
103101

104-
### CLI Tool (`src/mailparser/__main__.py`)
105-
106-
- Entry point: `mail-parser` command
107-
- JSON output mode (`-j`) for integration with other tools
108-
- Multiple input methods: file (`-f`), string (`-s`), stdin (`-k`)
109-
- Outlook support (`-o`) with system dependency on `libemail-outlook-message-perl`
102+
**Epilogue defect handling** (`core.py:320-335`): When `EPILOGUE_DEFECTS` are detected, parser extracts hidden
103+
content between MIME boundaries that could contain malicious payloads.
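The defect class names come from Python's standard `email` package, so the idea can be illustrated without mail-parser itself. The sketch below is an assumption about how such a collection could work, not the code in `core.py`. `collect_defect_categories` is a hypothetical helper, and the sample message declares a boundary that never appears in its body.

```python
from email import message_from_string

# The header declares a multipart boundary, but the body never opens it.
raw = (
    "From: a@example.com\n"
    'Content-Type: multipart/mixed; boundary="frontier"\n'
    "\n"
    "This body never opens the declared boundary.\n"
)


def collect_defect_categories(message):
    """Return the class names of every defect on the message and its parts."""
    categories = set()
    for part in message.walk():
        for defect in part.defects:
            categories.add(type(defect).__name__)
    return categories


message = message_from_string(raw)
categories = collect_defect_categories(message)
has_defects = bool(categories)
print(has_defects, sorted(categories))
```

With the default (non-raising) policy, the standard parser records defects on the message object instead of failing, which is what allows them to be inspected afterwards.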
110104

111-
## Development Workflows
105+
### IP Address Extraction
112106

113-
### Setup & Dependencies
107+
`get_server_ipaddress(trust)` method (`core.py:487-528`) extracts sender IPs with **trust-level validation**:
114108

115-
```bash
116-
# Use uv for dependency management (modern pip replacement)
117-
uv sync # Installs all dev/test dependencies
118-
make install # Alias for uv sync
109+
```python
110+
# Finds first non-private IP in trusted headers
111+
mail.get_server_ipaddress(trust="Received")
119112
```
120113

121-
### Testing & Quality
114+
Filters out private IP ranges using Python's `ipaddress` module.
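A minimal sketch of the private-range filtering described above, using the standard `ipaddress` module. `first_public_ip` and `IPV4_CANDIDATE` are hypothetical names, and the sketch handles IPv4 only. It does not model the `trust` argument and is not the library's implementation.

```python
import ipaddress
import re

# Loose IPv4 matcher; candidates are validated by ipaddress afterwards.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def first_public_ip(header_values):
    """Return the first non-private IPv4 address found in the header values."""
    for value in header_values:
        for candidate in IPV4_CANDIDATE.findall(value):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # looks like an IP (e.g. "999.1.1.1") but is not valid
            if not address.is_private:
                return str(address)
    return None


received = [
    "from mail.internal (10.0.0.5) by mx.internal; Mon, 1 Jan 2024 10:00:00 +0000",
    "from relay.example.org (8.8.8.8) by mx.example.com; Mon, 1 Jan 2024 10:00:01 +0000",
]
print(first_public_ip(received))
```

Validating with `ipaddress` rather than the regex alone keeps the pattern simple and lets the standard library decide which ranges count as private.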
122115

123-
```bash
124-
make test # pytest with coverage (outputs coverage.xml, junit.xml)
125-
make lint # ruff linting
126-
make format # ruff formatting
127-
make check # lint + test
128-
make pre-commit # runs pre-commit hooks
129-
```
130-
131-
For all unittest use `pytest` framework and mock external dependencies as needed.
132-
When you modify code, ensure all tests pass and coverage remains high.
116+
### Received Header Parsing
133117

134-
### Build & Release
118+
Complex regex-based parsing (`utils.py:302-360`, patterns in `const.py:24-73`) extracts hop-by-hop routing:
135119

136-
```bash
137-
make build # uv build (creates wheel/sdist in dist/)
138-
make release # build + twine upload to PyPI
120+
```python
121+
# Returns list of dicts with: by, from, date, date_utc, delay, envelope_from, hop, with
122+
mail.received
139123
```
140124

141-
### Docker Development
125+
**Key pattern**: `RECEIVED_COMPILED_LIST` contains pre-compiled regexes for "from", "by", "with", "id", "for",
126+
"via", "envelope-from", "envelope-sender", and date patterns. Recent fixes addressed IBM gateway duplicate matches
127+
(see comments in `const.py:26-38`).
142128

143-
- Dockerfile uses Python 3.10-slim with `libemail-outlook-message-perl`
144-
- docker-compose.yml mounts `~/mails` for testing
145-
- Image available as `fmantuano/spamscope-mail-parser`
129+
If parsing fails, falls back to `receiveds_not_parsed()` returning `{"raw": <header>, "hop": <n>}`
130+
structure.
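The parse-then-fall-back behavior can be sketched as follows. The patterns here are deliberately simplified stand-ins, not the contents of `RECEIVED_COMPILED_LIST`. `parse_received` and `CLAUSE_PATTERNS` are hypothetical names, and numbering hops from 1 in input order is an assumption.

```python
import re

# Hypothetical, simplified clause patterns; the project's real ones live in const.py.
CLAUSE_PATTERNS = {
    "from": re.compile(r"\bfrom\s+(?P<value>\S+)", re.IGNORECASE),
    "by": re.compile(r"\bby\s+(?P<value>\S+)", re.IGNORECASE),
    "with": re.compile(r"\bwith\s+(?P<value>\S+)", re.IGNORECASE),
}


def parse_received(headers):
    """Parse each Received value into a dict, falling back to the raw text."""
    hops = []
    for hop, header in enumerate(headers, start=1):
        # The date conventionally follows the last semicolon.
        clauses, separator, date = header.rpartition(";")
        if not separator:
            clauses, date = header, ""
        fields = {}
        for name, pattern in CLAUSE_PATTERNS.items():
            match = pattern.search(clauses)
            if match:
                fields[name] = match.group("value")
        if fields:
            fields["date"] = date.strip()
            fields["hop"] = hop
            hops.append(fields)
        else:
            # Nothing recognizable: keep the raw header so no evidence is lost.
            hops.append({"raw": header, "hop": hop})
    return hops


hops = parse_received([
    "from mail.example.org by mx.example.com with ESMTP id abc123; "
    "Mon, 1 Jan 2024 10:00:00 +0000",
    "garbage value",
])
print(hops)
```

Keeping the unparsed header under a `raw` key means a non-standard hop still appears in the output at its position in the chain.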
146131

147-
## Key Patterns & Conventions
132+
## Project Structure Specifics
148133

149-
### Header Access Pattern
134+
### src/ Layout
150135

151-
Headers with hyphens use underscore substitution:
136+
Package uses modern **src-layout** (`src/mailparser/`) for cleaner imports and testing isolation:
152137

153-
```python
154-
mail.X_MSMail_Priority # for X-MSMail-Priority header
138+
```text
139+
src/mailparser/
140+
├── __init__.py # Exports factory functions
141+
├── __main__.py # CLI entry point (mail-parser command)
142+
├── core.py # MailParser class (760 lines)
143+
├── utils.py # Parsing utilities (582 lines)
144+
├── const.py # Regex patterns and constants
145+
├── exceptions.py # Exception hierarchy
146+
└── version.py # Version string
155147
```
156148

157-
### Attachment Structure
149+
### External Dependency: Outlook Support
158150

159-
```python
160-
# Each attachment is a dict with standardized keys
161-
for attachment in mail.attachments:
162-
attachment['filename']
163-
attachment['payload'] # base64 encoded
164-
attachment['content_transfer_encoding']
165-
attachment['binary'] # boolean flag
151+
Outlook `.msg` file parsing requires **system-level Perl module**:
152+
153+
```bash
154+
apt-get install libemail-outlook-message-perl # Debian/Ubuntu
166155
```
167156

168-
### Received Header Parsing
157+
Triggered via `msgconvert()` function in `utils.py` that shells out to Perl script. Raises `MailParserOutlookError`
158+
if unavailable.
169159

170-
Complex parsing in `receiveds_parsing()` extracts hop-by-hop email routing:
160+
### CLI Tool Pattern
171161

172-
```python
173-
mail.received # List of parsed received headers with structured data
174-
# Each hop contains: by, from, date, delay, envelope_from, etc.
162+
`__main__.py` provides production CLI with mutually exclusive input modes (`-f`, `-s`, `-k`), JSON output (`-j`),
163+
and selective printing (`-b`, `-a`, `-r`, `-t`).
164+
165+
**Entry point defined** in `pyproject.toml:project.scripts`:
166+
167+
```toml
168+
[project.scripts]
169+
mail-parser = "mailparser.__main__:main"
175170
```
176171

177-
### Error Handling Hierarchy
172+
## Code Style & Tooling
178173

179-
```python
180-
MailParserError # Base exception
181-
├── MailParserOutlookError # Outlook .msg issues
182-
├── MailParserEnvironmentError # Missing dependencies
183-
├── MailParserOSError # File system issues
184-
└── MailParserReceivedParsingError # Header parsing failures
174+
### Ruff Configuration
175+
176+
Single linter/formatter (replaces black, isort, flake8):
177+
178+
```toml
179+
[tool.ruff.lint]
180+
select = ["E", "F", "I"] # pycodestyle, pyflakes, isort
181+
# "UP", "B", "SIM", "S", "PT" commented out in pyproject.toml
185182
```
186183

187-
## Testing Approach
184+
### Pytest Configuration
188185

189-
- Test emails in `tests/mails/` (malformed, Outlook, various encodings)
190-
- Comprehensive property testing for all email components
191-
- CLI integration tests in CI pipeline
192-
- Coverage reporting with pytest-cov
186+
Key markers in `pyproject.toml:tool.pytest.ini_options`:
193187

194-
## Security Focus
188+
- `integration`: marks integration tests
189+
- Coverage outputs: XML (for CI), HTML (for local), terminal
190+
- JUnit XML for CI integration
195191

196-
- **Defect detection**: Identifies malformed boundaries that could hide malicious content
197-
- **IP extraction**: `get_server_ipaddress()` with trust levels for forensic analysis
198-
- **Epilogue analysis**: Detects hidden content in malformed MIME boundaries
199-
- **Fingerprinting**: Mail and attachment hashing for threat intelligence
192+
## Common Pitfalls
200193

201-
## Build System Specifics
194+
1. **Don't instantiate `MailParser()` directly**—use factory functions from `__init__.py`
195+
2. **Don't use `pip`**—always use `uv` or Makefile targets
196+
3. **Don't ignore defects**—they're critical for security analysis
197+
4. **Don't assume headers exist**—use `.get()` pattern or handle `None`
198+
5. **Test against malformed emails**: `tests/mails/mail_malformed_*` files exist for this reason
199+
200+
## Docker Development
201+
202+
Dockerfile uses **Python 3.10-slim-bookworm** with Outlook dependencies pre-installed. Container runs as non-root
203+
`mailparser` user.
204+
205+
```bash
206+
docker build -t mail-parser .
207+
docker run mail-parser -f /path/to/email
208+
```
202209

203-
- **pyproject.toml**: Modern Python packaging with hatch backend
204-
- **uv**: Used instead of pip for faster, reliable dependency resolution
205-
- **src/ layout**: Package in `src/mailparser/` for cleaner imports
206-
- **Dynamic versioning**: Version from `src/mailparser/version.py`
210+
## Key Reference Points
207211

208-
## External Dependencies
212+
- **Property implementation**: `core.py:540-730` (all `@property` decorators)
213+
- **Attachment extraction**: `core.py:355-475` (walks multipart, handles encoding)
214+
- **Received parsing logic**: `utils.py:302-455` + `const.py:24-73` (regex patterns)
215+
- **CLI implementation**: `__main__.py:30-347` (argparse + output formatting)
216+
- **Exception hierarchy**: `exceptions.py:20-60` (5 exception types)
209217

210-
- **Outlook support**: Requires system package `libemail-outlook-message-perl` + Perl module `Email::Outlook::Message`
211-
- **six**: Python 2/3 compatibility (legacy requirement)
212-
- **Minimal runtime deps**: Only `six>=1.17.0` required
218+
## Testing Strategy
213219

214-
When working with this codebase:
220+
When adding features:
215221

216-
- Use factory functions, not direct MailParser() instantiation
217-
- Test with various malformed emails from `tests/mails/`
218-
- Remember header property naming (underscores for hyphens)
219-
- Consider security implications of email parsing edge cases
222+
1. Add test email to `tests/mails/` if demonstrating new case
223+
2. Write tests in `tests/test_mail_parser.py` following existing patterns
224+
3. Test both normal and `_raw`/`_json` property variants
225+
4. Verify defect detection for security-relevant changes
226+
5. Run `make check` before committing

.github/workflows/main.yml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ jobs:
1212
runs-on: ubuntu-latest
1313
strategy:
1414
matrix:
15-
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12', '3.13']
15+
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12', '3.13', '3.14']
1616

1717
steps:
1818
- uses: actions/checkout@v4
