Conversation

@D-Prasanth-Kumar
Contributor

@D-Prasanth-Kumar D-Prasanth-Kumar commented Dec 29, 2025

User description

Closes #14659

This change improves arXiv identifier detection when pasting arXiv URLs that include URL fragments, such as links copied from arxiv.org HTML pages. JabRef now correctly recognizes these identifiers and fetches the corresponding entries. Unit tests were added to cover the fixed behavior.

Steps to test

  1. Open JabRef.

  2. Create a new empty library or open any existing library.

  3. Use BibTeX → New entry from plain text (or paste into the search / fetch dialog).

  4. Paste an arXiv URL copied from an arXiv HTML page, for example: https://arxiv.org/html/2503.08641v1#bib.bib5

  5. Confirm that JabRef correctly detects the identifier as arXiv and fetches the corresponding entry.

Mandatory checks


PR Type

Bug fix, Tests


Description

  • Added findInText() method to ArXivIdentifier for robust identifier extraction

  • Handles arXiv URLs with fragments by stripping them before parsing

  • Uses regex pattern to extract identifiers from various URL formats

  • Updated CompositeIdFetcher and Identifier to use new method

  • Added comprehensive unit tests covering edge cases


Diagram Walkthrough

flowchart LR
  A["User pastes arXiv URL<br/>with fragment"] -->|"e.g., arxiv.org/html/...#bib"| B["ArXivIdentifier.findInText()"]
  B -->|"Strip fragment"| C["Clean text"]
  C -->|"Try direct parse"| D{Success?}
  D -->|"Yes"| E["Return identifier"]
  D -->|"No"| F["Apply regex pattern"]
  F -->|"Match found"| G["Parse matched text"]
  G --> E
  F -->|"No match"| H["Return empty"]
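
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `ArXivFinderSketch`, the simplified `MODERN_ID` pattern, and the `String` return type are assumptions made to keep the example self-contained (the PR's method returns `Optional<ArXivIdentifier>`), and the fragment is stripped with `indexOf` here rather than `split`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for ArXivIdentifier; identifiers are plain Strings here. */
final class ArXivFinderSketch {

    // Assumed pattern: modern identifiers only (four digits, dot, 4-5 digits, optional version).
    private static final Pattern MODERN_ID = Pattern.compile("\\d{4}\\.\\d{4,5}(?:v\\d+)?");

    /** Strict variant: the whole trimmed text must be an identifier. */
    static Optional<String> parse(String text) {
        String trimmed = text.trim();
        return MODERN_ID.matcher(trimmed).matches() ? Optional.of(trimmed) : Optional.empty();
    }

    /** Lenient variant: strip the fragment, try a direct parse, then search inside the text. */
    static Optional<String> findInText(String text) {
        if (text == null || text.isBlank()) {
            return Optional.empty();
        }
        // Drop everything from the first '#' onwards (the URL fragment).
        int hashIndex = text.indexOf('#');
        String cleaned = hashIndex >= 0 ? text.substring(0, hashIndex) : text;

        Optional<String> direct = parse(cleaned);
        if (direct.isPresent()) {
            return direct;
        }
        // Fall back to locating the identifier anywhere in the cleaned text.
        Matcher matcher = MODERN_ID.matcher(cleaned);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findInText("https://arxiv.org/html/2503.08641v1#bib.bib5")); // Optional[2503.08641v1]
        System.out.println(findInText("no identifier here"));                           // Optional.empty
    }
}
```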

File Walkthrough

Relevant files
Enhancement
CompositeIdFetcher.java
Update to use new findInText method                                           

jablib/src/main/java/org/jabref/logic/importer/CompositeIdFetcher.java

  • Changed ArXivIdentifier.parse() to ArXivIdentifier.findInText() in
    performSearchById() method
  • Enables detection of arXiv identifiers from URLs with fragments
+1/-1     
ArXivIdentifier.java
Add findInText method with fragment handling                         

jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java

  • Added new findInText() static method for robust identifier extraction
  • Strips URL fragments before processing using split("#")[0]
  • Attempts direct parsing first, then falls back to regex pattern
    matching
  • Regex pattern handles multiple URL formats: arxiv.org/abs,
    arxiv.org/html, arxiv.org/pdf, and plain identifiers
  • Pattern captures version numbers (e.g., v1) as optional component
+26/-0   
Identifier.java
Update Identifier factory method                                                 

jablib/src/main/java/org/jabref/model/entry/identifier/Identifier.java

  • Updated from() method to use ArXivIdentifier.findInText() instead of
    parse()
  • Ensures consistent identifier detection across the codebase
+1/-1     
Tests
ArXivIdentifierTest.java
Add comprehensive findInText tests                                             

jablib/src/test/java/org/jabref/model/entry/identifier/ArXivIdentifierTest.java

  • Added test for HTML URLs with fragments:
    findInTextFindsArxivFromHtmlUrlWithFragment()
  • Added test for standard arXiv URLs: findInTextFindsArxivInsideText()
  • Added test for non-arXiv text: findInTextReturnsEmptyForNonArxivText()
  • Tests verify correct identifier extraction and version number handling
+34/-0   
Documentation
CHANGELOG.md
Document arXiv identifier improvement                                       

CHANGELOG.md

+1/-0     

@github-actions github-actions bot added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Dec 29, 2025
@qodo-free-for-open-source-projects
Contributor

qodo-free-for-open-source-projects bot commented Dec 29, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Potential ReDoS vulnerability

Description: The split("#")[0] operation may be vulnerable to ReDoS (Regular Expression Denial of
Service) if the input text contains an extremely long string before the '#' character,
potentially causing performance degradation or denial of service.
ArXivIdentifier.java [147-147]

Referred Code
String cleanedText = text.split("#")[0];
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Regex injection risk: The findInText() method applies regex pattern matching on unsanitized user input without
apparent length limits or validation, which could potentially be exploited for ReDoS
attacks.

Referred Code
Pattern pattern = Pattern.compile(
        "(?:http(s)?://arxiv.org/(?:abs|html|pdf)/|arxiv:|arXiv:)?"
                + "(\\d{4}\\.\\d{4,5})(v\\d+)?",
        Pattern.CASE_INSENSITIVE
);

Matcher matcher = pattern.matcher(cleanedText);
if (matcher.find()) {
    return parse(matcher.group());
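
One way to address the concern about unbounded input, sketched under assumptions: compile the pattern once as a constant and reject overly long input before matching. The class name, the `MAX_INPUT_LENGTH` value, and the escaped dots in the host name are illustrative choices and not part of the PR. The pattern has no nested quantifiers, so catastrophic backtracking seems unlikely, but a length cap is a cheap extra guard.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedArXivSearchSketch {
    // Hypothetical cap; pasted URLs and citations are normally far shorter.
    private static final int MAX_INPUT_LENGTH = 2_000;

    // Compiled once instead of on every call; dots in the host name are escaped.
    private static final Pattern ARXIV_ID = Pattern.compile(
            "(?:https?://arxiv\\.org/(?:abs|html|pdf)/|arxiv:)?(\\d{4}\\.\\d{4,5})(v\\d+)?",
            Pattern.CASE_INSENSITIVE);

    static Optional<String> findIdentifier(String text) {
        if (text == null || text.length() > MAX_INPUT_LENGTH) {
            return Optional.empty();
        }
        Matcher matcher = ARXIV_ID.matcher(text);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Group 1 is the identifier, group 2 the optional version suffix.
        String version = matcher.group(2) == null ? "" : matcher.group(2);
        return Optional.of(matcher.group(1) + version);
    }

    public static void main(String[] args) {
        System.out.println(findIdentifier("https://arxiv.org/html/2503.08641v1")); // Optional[2503.08641v1]
        System.out.println(findIdentifier("x".repeat(3_000)));                     // Optional.empty
    }
}
```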

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-free-for-open-source-projects
Contributor

qodo-free-for-open-source-projects bot commented Dec 29, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
High-level
Improve ArXiv identifier detection logic

The findInText method's regex should be updated to detect legacy arXiv
identifiers (e.g., math/1234567), not just modern ones, to ensure consistent and
comprehensive detection.

Examples:

jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java [142-166]
    public static Optional<ArXivIdentifier> findInText(String text) {
        if (StringUtil.isBlank(text)) {
            return Optional.empty();
        }

        String cleanedText = text.split("#")[0];

        Optional<ArXivIdentifier> directParse = parse(cleanedText);
        if (directParse.isPresent()) {
            return directParse;

 ... (clipped 15 lines)

Solution Walkthrough:

Before:

public static Optional<ArXivIdentifier> findInText(String text) {
    // ...
    String cleanedText = text.split("#")[0];

    Optional<ArXivIdentifier> directParse = parse(cleanedText);
    if (directParse.isPresent()) {
        return directParse;
    }

    // Regex only matches modern format: YYYY.NNNNN
    Pattern pattern = Pattern.compile(
            "(?:...)"
            + "(\\d{4}\\.\\d{4,5})(v\\d+)?",
            // ...
    );

    Matcher matcher = pattern.matcher(cleanedText);
    if (matcher.find()) {
        return parse(matcher.group());
    }
    return Optional.empty();
}

After:

public static Optional<ArXivIdentifier> findInText(String text) {
    // ...
    String cleanedText = text.split("#")[0];

    Optional<ArXivIdentifier> directParse = parse(cleanedText);
    if (directParse.isPresent()) {
        return directParse;
    }

    // Regex now also matches legacy formats like "math/1234567"
    Pattern pattern = Pattern.compile(
            "(?:...)"
            + "((?:[a-z-]+(?:\\.[A-Z]{2})?/\\d{7})|\\d{4}\\.\\d{4,5})(v\\d+)?",
            // ...
    );

    Matcher matcher = pattern.matcher(cleanedText);
    if (matcher.find()) {
        return parse(matcher.group());
    }
    return Optional.empty();
}
Suggestion importance[1-10]: 8

Why: The suggestion correctly identifies a significant regression where the new findInText method fails to detect legacy arXiv identifiers embedded in text, a capability partially supported by the old parse method and expected from a "find" function.
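
A minimal sketch of a search pattern that accepts both identifier schemes, assuming the shapes named in the suggestion (legacy `archive/NNNNNNN`, optionally with a two-letter subject class, and modern `NNNN.NNNNN`). The class and method names are hypothetical, the pattern is case-sensitive unlike the one in the PR, and a pattern this loose can also match unrelated text such as `page/1234567`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacyAwareFinderSketch {
    // Legacy form: archive name, optional subject class, slash, seven digits (e.g. math.GT/0309136).
    // Modern form: four digits, dot, four or five digits (e.g. 2503.08641).
    // Either form may carry a version suffix such as v2.
    private static final Pattern ANY_ID = Pattern.compile(
            "(?:[a-z-]+(?:\\.[A-Z]{2})?/\\d{7}|\\d{4}\\.\\d{4,5})(?:v\\d+)?");

    static Optional<String> findAnyId(String text) {
        Matcher matcher = ANY_ID.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findAnyId("see hep-th/9901001v2 for details"));   // Optional[hep-th/9901001v2]
        System.out.println(findAnyId("https://arxiv.org/abs/math/0612345")); // Optional[math/0612345]
        System.out.println(findAnyId("https://arxiv.org/abs/2503.08641v1")); // Optional[2503.08641v1]
    }
}
```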

Medium
Possible issue
Prevent potential array index out-of-bounds

Replace text.split("#")[0] with a safer indexOf/substring combination to prevent
a potential ArrayIndexOutOfBoundsException when handling URL fragments.

jablib/src/main/java/org/jabref/model/entry/identifier/ArXivIdentifier.java [147]

-String cleanedText = text.split("#")[0];
+int hashIndex = text.indexOf('#');
+String cleanedText = (hashIndex != -1) ? text.substring(0, hashIndex) : text;
Suggestion importance[1-10]: 8

Why: The suggestion correctly identifies a potential ArrayIndexOutOfBoundsException for edge cases like an input of "#" and provides a more robust and efficient solution using indexOf and substring.
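
The edge case can be reproduced in isolation. `String.split` discards trailing empty strings, so splitting `"#"` on `"#"` yields an empty array and indexing `[0]` would throw. The sketch below contrasts that with the `indexOf`/`substring` variant; the class and method names are hypothetical.

```java
final class FragmentStripSketch {
    /** Returns the text before the first '#', or the whole text if there is none. */
    static String stripFragment(String text) {
        int hashIndex = text.indexOf('#');
        return hashIndex >= 0 ? text.substring(0, hashIndex) : text;
    }

    public static void main(String[] args) {
        // Trailing empty strings are removed, so nothing is left to index into.
        System.out.println("#".split("#").length);        // prints 0
        // The indexOf variant returns an empty string instead of throwing.
        System.out.println(stripFragment("#").isEmpty()); // prints true
        System.out.println(stripFragment("https://arxiv.org/html/2503.08641v1#bib.bib5"));
    }
}
```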

Medium

@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 29, 2025
@InAnYan
Member

InAnYan commented Dec 29, 2025

Can you reuse the existing regex pattern? I think it can be easily reduced: the parse method would require that the whole text is the ID, while findInText would search for the same pattern without requiring a full match.

@D-Prasanth-Kumar
Contributor Author

The parse() method requires strict validation, but that is not the case for findInText(), since it is only responsible for locating an arXiv identifier. Is it possible to modify the existing pattern? If so, it would be good to adapt it.

@InAnYan
Member

InAnYan commented Dec 29, 2025

Yes, parse really does require strict validation, but the pattern itself does not. Or maybe I'm mistaken? parse uses matches, while findInText uses find, but the pattern stays the same. Or not?

@D-Prasanth-Kumar
Contributor Author

You’re right, the strictness mainly comes from how the pattern is used (matches() in parse() vs find() in findInText()), not from the pattern itself. The existing pattern in parse() is primarily designed for full-string validation, which is why I initially used a simpler pattern for searching inside text. I agree that the same pattern could be reused with find() in findInText(), and I’m happy to refactor it that way if you think it would be cleaner.
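
The distinction discussed here can be shown with a single compiled pattern used in two ways. This is only an illustration: the pattern below is a simplified assumption, not the existing JabRef pattern, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class MatchesVersusFindSketch {
    // Assumed simplified identifier pattern, shared by both checks.
    private static final Pattern ID = Pattern.compile("\\d{4}\\.\\d{4,5}(?:v\\d+)?");

    /** matches(): the entire input must be an identifier (strict, parse-style). */
    static boolean isWholeTextAnId(String text) {
        return ID.matcher(text).matches();
    }

    /** find(): an identifier may appear anywhere in the input (lenient, findInText-style). */
    static boolean containsAnId(String text) {
        return ID.matcher(text).find();
    }

    public static void main(String[] args) {
        String url = "https://arxiv.org/abs/2503.08641v1";
        System.out.println(isWholeTextAnId("2503.08641v1")); // true
        System.out.println(isWholeTextAnId(url));            // false
        System.out.println(containsAnId(url));               // true
    }
}
```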

@D-Prasanth-Kumar
Contributor Author

@InAnYan
(image)
Is this behaviour valid? The identifier type is grayed out.

@InAnYan
Member

InAnYan commented Dec 30, 2025

Hmm, I would look into the code that automatically determines the identifier type and check which method it uses.

@D-Prasanth-Kumar
Contributor Author

(image: IMG_20251230_150009.jpg)
Here in Identifier, using findInText instead of parse will fix it.

@InAnYan
Member

InAnYan commented Dec 30, 2025

Yeah, it would also be useful to change this method.

@D-Prasanth-Kumar
Contributor Author

Yes, I will check.

@D-Prasanth-Kumar
Contributor Author

@InAnYan
(image)
Even though everything looks good, I don't know why the CHANGELOG.md check is failing.

@InAnYan
Member

InAnYan commented Dec 30, 2025

Since your PR is not merged yet, your changes should be in the Unreleased section.

@github-actions github-actions bot added status: changes-required Pull requests that are not yet complete and removed status: changes-required Pull requests that are not yet complete labels Dec 30, 2025
@github-actions
Contributor

Your pull request conflicts with the target branch.

Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line.

@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 31, 2025
@subhramit
Member

(image) "Unreleased" is the master heading, under which there are "Added", "Fixed", etc. You have put the entry directly under the master heading.

@D-Prasanth-Kumar D-Prasanth-Kumar force-pushed the task/add-findInText-ArXivIdentifier branch from 8501ef9 to 596e358 Compare December 31, 2025 09:56
@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 31, 2025
@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 31, 2025
Labels

good first issue (An issue intended for project-newcomers. Varies in difficulty.)
Review effort 2/5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add findInText for ArXivIdentifier

4 participants