Added findInText for ArXivIdentifier #14760

D-Prasanth-Kumar · 2025-12-29T13:20:24Z

User description

This change improves arXiv identifier detection when pasting arXiv URLs that include URL fragments, such as links copied from arxiv.org HTML pages. JabRef now correctly recognizes these identifiers and fetches the corresponding entries. Unit tests were added to cover the fixed behavior.

Steps to test

Open JabRef.
Create a new empty library or open any existing library.
Use BibTeX → New entry from plain text (or paste into the search / fetch dialog).
Paste an arXiv URL copied from an arXiv HTML page, for example: https://arxiv.org/html/2503.08641v1#bib.bib5
Confirm that JabRef correctly detects the identifier as arXiv and fetches the corresponding entry.

Mandatory checks

I own the copyright of the code submitted and I license it under the MIT license
I manually tested my changes in running JabRef (always required)
I added JUnit tests for changes (if applicable)
I added screenshots in the PR description (if change is visible to the user)
I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

PR Type

Bug fix, Tests

Description

Added findInText() method to ArXivIdentifier for robust identifier extraction
Handles arXiv URLs with fragments by stripping them before parsing
Uses regex pattern to extract identifiers from various URL formats
Updated CompositeIdFetcher and Identifier to use new method
Added comprehensive unit tests covering edge cases

Diagram Walkthrough

flowchart LR
  A["User pastes arXiv URL<br/>with fragment"] -->|"e.g., arxiv.org/html/...#bib"| B["ArXivIdentifier.findInText()"]
  B -->|"Strip fragment"| C["Clean text"]
  C -->|"Try direct parse"| D{Success?}
  D -->|"Yes"| E["Return identifier"]
  D -->|"No"| F["Apply regex pattern"]
  F -->|"Match found"| G["Parse matched text"]
  G --> E
  F -->|"No match"| H["Return empty"]
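
As a rough illustration of the flow in the diagram, the sketch below strips the fragment, tries a strict parse, and falls back to a regex search. `ArXivIdSketch`, its `String`-based return type, and the simplified pattern (modern identifiers only) are assumptions made for this example; JabRef's actual `ArXivIdentifier` class and pattern may differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for ArXivIdentifier, not JabRef's actual class. */
final class ArXivIdSketch {
    // Assumed pattern: optional URL or "arxiv:" prefix, then a modern identifier
    // with an optional version suffix such as "v1".
    private static final Pattern ID = Pattern.compile(
            "(?:https?://arxiv\\.org/(?:abs|html|pdf)/|arxiv:)?(\\d{4}\\.\\d{4,5})(v\\d+)?",
            Pattern.CASE_INSENSITIVE);

    /** Strict parse: the whole (trimmed) input must be an identifier. */
    static Optional<String> parse(String text) {
        Matcher m = ID.matcher(text.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String version = m.group(2) == null ? "" : m.group(2);
        return Optional.of(m.group(1) + version);
    }

    /** Lenient search: locate an identifier anywhere in the text. */
    static Optional<String> findInText(String text) {
        if (text == null || text.isBlank()) {
            return Optional.empty();
        }
        // Strip the URL fragment, if any.
        int hash = text.indexOf('#');
        String cleaned = hash >= 0 ? text.substring(0, hash) : text;

        // Try the direct parse first, then fall back to searching.
        Optional<String> direct = parse(cleaned);
        if (direct.isPresent()) {
            return direct;
        }
        Matcher m = ID.matcher(cleaned);
        return m.find() ? parse(m.group()) : Optional.empty();
    }
}
```

Under these assumptions, `ArXivIdSketch.findInText("https://arxiv.org/html/2503.08641v1#bib.bib5")` yields `2503.08641v1`, and text without an identifier yields an empty `Optional`.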

File Walkthrough

Relevant files

Enhancement

CompositeIdFetcher.java (+1/-1): Update to use new findInText method
jablib/src/main/java/org/jabref/logic/importer/CompositeIdFetcher.java
Changed `ArXivIdentifier.parse()` to `ArXivIdentifier.findInText()` in `performSearchById()` method
Enables detection of arXiv identifiers from URLs with fragments

ArXivIdentifier.java (+26/-0): Add findInText method with fragment handling
jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java
Added new `findInText()` static method for robust identifier extraction
Strips URL fragments before processing using `split("#")[0]`
Attempts direct parsing first, then falls back to regex pattern matching
Regex pattern handles multiple URL formats: `arxiv.org/abs`, `arxiv.org/html`, `arxiv.org/pdf`, and plain identifiers
Pattern captures version numbers (e.g., `v1`) as optional component

Identifier.java (+1/-1): Update Identifier factory method
jablib/src/main/java/org/jabref/model/entry/identifier/Identifier.java
Updated `from()` method to use `ArXivIdentifier.findInText()` instead of `parse()`
Ensures consistent identifier detection across the codebase

Tests

ArXivIdentifierTest.java (+34/-0): Add comprehensive findInText tests
jablib/src/test/java/org/jabref/model/entry/identifier/ArXivIdentifierTest.java
Added test for HTML URLs with fragments: `findInTextFindsArxivFromHtmlUrlWithFragment()`
Added test for standard arXiv URLs: `findInTextFindsArxivInsideText()`
Added test for non-arXiv text: `findInTextReturnsEmptyForNonArxivText()`
Tests verify correct identifier extraction and version number handling

Documentation

CHANGELOG.md (+1/-0): Document arXiv identifier improvement
CHANGELOG.md
Added entry documenting improved arXiv identifier detection with URL fragments
References issue Addd `findInText` for ArXivIdentifier #14659

qodo-free-for-open-source-projects · 2025-12-29T13:20:56Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
⚪ Potential ReDoS vulnerability
Description: The `split("#")[0]` operation may be vulnerable to ReDoS (Regular Expression Denial of Service) if the input text contains an extremely long string before the '#' character, potentially causing performance degradation or denial of service.
ArXivIdentifier.java [147-147]
Referred Code:
    String cleanedText = text.split("#")[0];
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢 Generic: Comprehensive Audit Trails
Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance.
Status: Passed

🟢 Generic: Meaningful Naming and Self-Documenting Code
Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting
Status: Passed

🟢 Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation
Status: Passed

🟢 Generic: Secure Error Handling
Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging.
Status: Passed

🟢 Generic: Secure Logging Practices
Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data.
Status: Passed

⚪ Generic: Security-First Input Validation and Data Handling
Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities
Status: Regex injection risk: The `findInText()` method applies regex pattern matching on unsanitized user input without apparent length limits or validation, which could potentially be exploited for ReDoS attacks.
Referred Code:
    Pattern pattern = Pattern.compile(
        "(?:http(s)?://arxiv.org/(?:abs|html|pdf)/|arxiv:|arXiv:)?" +
        "(\\d{4}\\.\\d{4,5})(v\\d+)?",
        Pattern.CASE_INSENSITIVE
    );
    Matcher matcher = pattern.matcher(cleanedText);
    if (matcher.find()) {
        return parse(matcher.group());

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-free-for-open-source-projects · 2025-12-29T13:22:29Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: High-level
Suggestion: Improve ArXiv identifier detection logic
Impact: Medium

The `findInText` method's regex should be updated to detect legacy arXiv identifiers (e.g., `math/1234567`), not just modern ones, to ensure consistent and comprehensive detection.

Examples: jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java [142-166]

    public static Optional<ArXivIdentifier> findInText(String text) {
        if (StringUtil.isBlank(text)) {
            return Optional.empty();
        }
        String cleanedText = text.split("#")[0];
        Optional<ArXivIdentifier> directParse = parse(cleanedText);
        if (directParse.isPresent()) {
            return directParse;
        ... (clipped 15 lines)

Solution Walkthrough:

Before:

    public static Optional<ArXivIdentifier> findInText(String text) {
        // ...
        String cleanedText = text.split("#")[0];
        Optional<ArXivIdentifier> directParse = parse(cleanedText);
        if (directParse.isPresent()) {
            return directParse;
        }
        // Regex only matches modern format: YYYY.NNNNN
        Pattern pattern = Pattern.compile(
            "(?:...)" + "(\\d{4}\\.\\d{4,5})(v\\d+)?",
            // ...
        );
        Matcher matcher = pattern.matcher(cleanedText);
        if (matcher.find()) {
            return parse(matcher.group());
        }
        return Optional.empty();
    }

After:

    public static Optional<ArXivIdentifier> findInText(String text) {
        // ...
        String cleanedText = text.split("#")[0];
        Optional<ArXivIdentifier> directParse = parse(cleanedText);
        if (directParse.isPresent()) {
            return directParse;
        }
        // Regex now also matches legacy formats like "math/1234567"
        Pattern pattern = Pattern.compile(
            "(?:...)" + "((?:[a-z-]+(?:\\.[A-Z]{2})?/\\d{7})|\\d{4}\\.\\d{4,5})(v\\d+)?",
            // ...
        );
        Matcher matcher = pattern.matcher(cleanedText);
        if (matcher.find()) {
            return parse(matcher.group());
        }
        return Optional.empty();
    }

Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies a significant regression where the new `findInText` method fails to detect legacy arXiv identifiers embedded in text, a capability partially supported by the old `parse` method and expected from a "find" function.

Category: Possible issue
Suggestion: Prevent potential array index out-of-bounds
Impact: Medium

Replace `text.split("#")[0]` with a safer `indexOf`/`substring` combination to prevent a potential `ArrayIndexOutOfBoundsException` when handling URL fragments.

jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java [147]

    -String cleanedText = text.split("#")[0];
    +int hashIndex = text.indexOf('#');
    +String cleanedText = (hashIndex != -1) ? text.substring(0, hashIndex) : text;

Suggestion importance[1-10]: 8
Why: The suggestion correctly identifies a potential `ArrayIndexOutOfBoundsException` for edge cases like an input of `"#"` and provides a more robust and efficient solution using `indexOf` and `substring`.
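
The edge case behind the second suggestion can be reproduced in isolation. In the sketch below, `FragmentStripSketch` and `stripFragment` are hypothetical names introduced for illustration; only `String.split`, `indexOf`, and `substring` are real JDK calls. `String.split` drops trailing empty strings, so splitting `"#"` on `"#"` produces an empty array, and indexing element `0` would throw.

```java
/** Hypothetical helper illustrating the suggested indexOf/substring approach. */
final class FragmentStripSketch {

    /** Returns the text before the first '#', or the whole text if there is none. */
    static String stripFragment(String text) {
        int hashIndex = text.indexOf('#');
        return hashIndex != -1 ? text.substring(0, hashIndex) : text;
    }

    public static void main(String[] args) {
        // split() removes trailing empty strings, leaving an empty array for "#",
        // so "#".split("#")[0] would throw ArrayIndexOutOfBoundsException.
        System.out.println("#".split("#").length);

        // The indexOf/substring variant handles the same input without an exception.
        System.out.println("[" + stripFragment("#") + "]");
        System.out.println(stripFragment("https://arxiv.org/html/2503.08641v1#bib.bib5"));
    }
}
```

The `indexOf` variant also avoids allocating an array for the common case where the input has no fragment at all.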

CHANGELOG.md

jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java

InAnYan · 2025-12-29T14:18:20Z

Can you reuse the existing regex pattern? I think it can be easily reduced: the parse method would require that the whole text is the ID, while findInText would search for this pattern without requiring a full match.

D-Prasanth-Kumar · 2025-12-29T15:20:15Z

The parse() method requires strict validation, but that is not the case for findInText(), since it is only responsible for locating an arXiv identifier. Is there a possibility of modifying the existing pattern? If so, it would be good to adapt it.

InAnYan · 2025-12-29T15:27:09Z

Yes, parse really requires strict validation, but the pattern itself does not. Or maybe I'm mistaken? parse uses matches, while findInText uses find, but the pattern stays the same. Or not?

D-Prasanth-Kumar · 2025-12-29T16:00:45Z

You’re right, the strictness mainly comes from how the pattern is used (matches() in parse() vs find() in findInText()), not from the pattern itself. The existing pattern in parse() is primarily designed for full-string validation, which is why I initially used a simpler pattern for searching inside text. I agree that the same pattern could be reused with find() in findInText(), and I’m happy to refactor it that way if you think it would be cleaner.
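
The distinction discussed in this exchange can be shown with a single compiled pattern. This is a minimal sketch: `MatchesVsFindSketch` and the simplified `MODERN_ID` pattern are assumptions for illustration, not JabRef's actual pattern. `Matcher.matches()` requires the entire input to match, while `Matcher.find()` scans for the first matching subsequence.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: one pattern, strict validation vs. lenient search. */
final class MatchesVsFindSketch {
    // Assumed simplified pattern for modern identifiers with an optional version.
    static final Pattern MODERN_ID = Pattern.compile("(\\d{4}\\.\\d{4,5})(v\\d+)?");

    /** Strict: true only if the whole string is an identifier. */
    static boolean isWholeId(String text) {
        return MODERN_ID.matcher(text).matches();
    }

    /** Lenient: returns the first identifier found anywhere in the string. */
    static Optional<String> firstId(String text) {
        Matcher m = MODERN_ID.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

With this pattern, `isWholeId("see 2503.08641v1 for details")` is false, while `firstId` on the same input returns `2503.08641v1`, so the same pattern serves both roles.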

D-Prasanth-Kumar · 2025-12-30T05:31:28Z

@InAnYan

Is this behaviour valid? The identifier type is grayed out.

InAnYan · 2025-12-30T09:15:09Z

Hmm, I would look into the code that automatically determines the identifier type. Which method it uses

D-Prasanth-Kumar · 2025-12-30T09:31:46Z

Here in Identifier, using findInText instead of parse will fix it.

InAnYan · 2025-12-30T09:32:59Z

Yeah, it also would be useful if you change this method

D-Prasanth-Kumar · 2025-12-30T09:35:19Z

Yes I will check

D-Prasanth-Kumar · 2025-12-30T12:57:20Z

@InAnYan

Even though everything looks good, I don't know why the CHANGELOG.md check is failing.

InAnYan · 2025-12-30T13:20:05Z

Since your PR is not merged, your changes should be in the Unreleased section.

github-actions · 2025-12-31T03:36:57Z

Your pull request conflicts with the target branch.

Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line.

subhramit · 2025-12-31T09:46:23Z

"Unreleased" is the master heading under which there are "Added", "Fixed", etc. You have put the entry directly under the master heading.

D-Prasanth-Kumar added 2 commits December 29, 2025 17:03

fix/added findInText for ArXivIdentifier

032db7b

updated CHANGELOG.md

4feed60

github-actions bot added the "good first issue" label (An issue intended for project-newcomers. Varies in difficulty.) Dec 29, 2025

qodo-free-for-open-source-projects bot added the Review effort 2/5 label Dec 29, 2025

InAnYan requested changes Dec 29, 2025

github-actions bot added the "status: changes-required" label (Pull requests that are not yet complete) Dec 29, 2025

D-Prasanth-Kumar added 2 commits December 30, 2025 17:42

updated ArXivIdentifier

4aaa536

updated CHANGELOG.md

81326af

updated CHANGELOG.md

a5f7589

github-actions bot added and removed the "status: changes-required" label (Pull requests that are not yet complete) Dec 30, 2025

github-actions bot removed the "status: changes-required" label (Pull requests that are not yet complete) Dec 31, 2025

updated CHANGELOG.md

596e358

D-Prasanth-Kumar force-pushed the task/add-findInText-ArXivIdentifier branch from 8501ef9 to 596e358 Compare December 31, 2025 09:56

github-actions bot added the "status: changes-required" label (Pull requests that are not yet complete) Dec 31, 2025

D-Prasanth-Kumar added 3 commits December 31, 2025 15:31

updated CHANGELOG.md

003e22d

some conflict fixes

d197bc3

some indentation fix

b9f952b

github-actions bot removed the "status: changes-required" label (Pull requests that are not yet complete) Dec 31, 2025

4 participants
