@Muskan244
Contributor

Closes #14085

This PR implements citation context extraction from PDF Related Work sections. When viewing an entry with an attached PDF, users can extract citations and their surrounding context, match them to library entries, and automatically add the context descriptions to cited entries' comment fields in the format [sourceCitationKey]: context description.
Changes:

  • Add new "Show tab 'Citation contexts'" setting in Entry editor preferences
  • Create CitationContextIntegrationService to orchestrate PDF extraction and matching
  • Create PdfSectionExtractor to identify Related Work and References sections
  • Create CitationContextExtractor to parse citation markers and surrounding text
  • Add UI component to display extraction results with clickable apply functionality
  • Contexts are written to cited entries' comment-{username} field when applied
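
As a rough illustration of the comment format described above, the following sketch builds the field name and the `[sourceCitationKey]: context description` line. The class and method names are hypothetical and not taken from the PR; only the format and the `comment-{username}` field name come from the description.

```java
// Hypothetical helper; class and method names are illustrative, not the PR's API.
final class CitationCommentFormatSketch {

    // Name of the per-user comment field, e.g. "comment-alice".
    static String commentFieldName(String username) {
        return "comment-" + username;
    }

    // Builds "[sourceCitationKey]: context description" from the citing entry's
    // citation key and the extracted context, trimming surrounding whitespace.
    static String formatContext(String sourceCitationKey, String contextText) {
        return "[" + sourceCitationKey + "]: " + contextText.strip();
    }

    public static void main(String[] args) {
        System.out.println(commentFieldName("alice") + " = "
                + formatContext("Smith2020", " Extends the approach of Doe et al. "));
    }
}
```

How the PR handles an already populated comment field (appending, de-duplicating) is not covered by this sketch.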

Steps to test

  1. Open JabRef and load a library with at least one entry that has a PDF attached
  2. Go to Options → Preferences → Entry editor and ensure "Show tab 'Citation contexts'" is enabled
  3. Open an entry that has an academic PDF attached (ideally one with a Related Work or Literature Review section)
  4. Click on the Citation contexts tab in the entry editor
  5. Click "Extract from this PDF" button
  6. Wait for extraction to complete - you should see a table with:
    • Citation markers found (e.g., "(Smith 2020)", "[1]")
    • Cited entry (matched library entry or "Not found")
    • Context text (the surrounding sentences)
    • Status (Existing, New entry, or Unmatched)
    Screenshot 2025-12-14 at 6 27 01 PM
  7. Select the contexts you want to apply using the checkboxes
  8. Click "Apply selected" button
  9. Check the cited entries in your library: they should now have a comment-{yourusername} field containing the context in the format [SourcePdfKey]: description text
Screenshot 2025-12-14 at 6 27 33 PM

Mandatory checks

  • I own the copyright of the code submitted and I license it under the MIT license
  • I manually tested my changes in running JabRef (always required)
  • I added JUnit tests for changes (if applicable)
  • I added screenshots in the PR description (if change is visible to the user)
  • I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
  • [/] I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

  Implements Issue JabRef#14085: Extract citation contexts from academic PDFs and add them to cited entries' comment fields.
  Changes:
  - Add Citation contexts tab in entry editor with extraction workflow UI
  - Create CitationContextExtractor to parse citation markers from PDF text
  - Create PdfSectionExtractor to identify Related Work and References sections
  - Create PdfReferenceParser to parse bibliography entries from PDFs
  - Create CitationMatcher to match citation markers to reference entries
  - Create LibraryEntryResolver to match references to library entries
  - Create CitationCommentWriter to write contexts to comment-{username} field
  - Add CitationContext and ReferenceEntry data models
  - Add preference to enable/disable Citation contexts tab
  - Display clickable extraction results with match status in table UI
  - Add new cited entries to library when applying contexts
@github-actions github-actions bot added good third issue status: changes-required Pull requests that are not yet complete labels Dec 14, 2025
…_en.properties and Replace custom calculateSimilarity method with existing StringSimilarity class in CitationCommentWriter
  - Use AuthorListParser to extract first author family name in CitationMatcher and LibraryEntryResolver instead of custom regex
  - Extract inline regex patterns as constants (BRACKETS_PATTERN, WHITESPACE_PATTERN) in CitationMatcher
  - Replace Objects.requireNonNull with jspecify @NonNull annotations in CitationContext record
@github-actions github-actions bot added status: changes-required Pull requests that are not yet complete and removed status: changes-required Pull requests that are not yet complete labels Dec 14, 2025
@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 15, 2025
@Muskan244
Contributor Author

Future work.

Can you also add a functionality to create new entries based on the text?

Add a new tab "Related work text"

Just to clarify, for this PR should I focus only on the current changes, and treat the “Related work text” tab as a follow-up, or do you want it included here as well?

Muskan244 and others added 2 commits December 18, 2025 20:07
…Objects.requireNonNull with @NonNull, simplify getUsername() to return String, update tests accordingly, remove null-related tests, and remove logging of tinylog
@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 18, 2025
@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 18, 2025
@koppor
Member

koppor commented Dec 18, 2025

Future work.

Can you also add a functionality to create new entries based on the text?

Add a new tab "Related work text"

Just to clarify, for this PR should I focus only on the current changes, and treat the “Related work text” tab as a follow-up, or do you want it included here as well?

Follow up. Needs more thought....

Knowing that https://github.com/koppor/magic-merge-commit exists, you could start in parallel...

@Muskan244
Contributor Author

Future work.
Can you also add a functionality to create new entries based on the text?
Add a new tab "Related work text"

Just to clarify, for this PR should I focus only on the current changes, and treat the “Related work text” tab as a follow-up, or do you want it included here as well?

Follow up. Needs more thought....

Knowing that koppor/magic-merge-commit exists, you could start in parallel...

Got it, I’ll focus on the current changes for this PR and treat the “Related work text” tab as a follow-up and start exploring it in parallel.

@github-actions
Contributor

Your pull request conflicts with the target branch.

Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line.

@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 22, 2025
@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 22, 2025
@koppor
Member

koppor commented Dec 22, 2025

/review

@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

13109 - Partially compliant

Compliant requirements:

Non-compliant requirements:

  • Make org.jabref.logic.pseudonymization.Pseudonymization available via the CLI.
  • Provide a CLI user experience similar to the consistency check command.
  • Use org.jabref.cli.CheckConsistency as implementation reference for the CLI command structure and behavior.

Requires further human verification:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The author-key matching branch checks the wrong optional (authorYearMatch) before returning authorKeyMatch, which can prevent author-key matches from ever being returned and can also log incorrect diagnostics.

Optional<ReferenceEntry> authorYearMatch = matchAuthorYearMarker(normalizedMarker, references);
if (authorYearMatch.isPresent()) {
    LOGGER.debug("Found author-year match for '{}'", citationMarker);
    return authorYearMatch;
}

Optional<ReferenceEntry> authorKeyMatch = matchAuthorKeyMarker(normalizedMarker, references);
if (authorYearMatch.isPresent()) {
    LOGGER.debug("Found author-key match for '{}'", citationMarker);
    return authorKeyMatch;
}
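
One way to avoid this kind of copy-paste slip is to chain the fallbacks with `Optional.or`, so no intermediate variable can be checked by mistake. The sketch below is only an illustration: `ReferenceEntry` is reduced to a hypothetical one-field record, and the two matcher methods are simplified stand-ins that assume author-year markers start with "(" and author-key markers start with "[".

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the PR's ReferenceEntry; only the marker is modeled.
record ReferenceEntry(String marker) { }

final class CitationMatchSketch {

    // Assumption for this sketch: author-year markers look like "(Smith 2020)".
    static Optional<ReferenceEntry> matchAuthorYearMarker(String marker, List<ReferenceEntry> references) {
        return marker.startsWith("(") ? findByMarker(marker, references) : Optional.empty();
    }

    // Assumption for this sketch: author-key markers look like "[Smi20]".
    static Optional<ReferenceEntry> matchAuthorKeyMarker(String marker, List<ReferenceEntry> references) {
        return marker.startsWith("[") ? findByMarker(marker, references) : Optional.empty();
    }

    private static Optional<ReferenceEntry> findByMarker(String marker, List<ReferenceEntry> references) {
        return references.stream()
                         .filter(reference -> reference.marker().equals(marker))
                         .findFirst();
    }

    // Tries author-year first and falls back to author-key only if that is empty.
    static Optional<ReferenceEntry> match(String marker, List<ReferenceEntry> references) {
        return matchAuthorYearMarker(marker, references)
                .or(() -> matchAuthorKeyMarker(marker, references));
    }
}
```

If the per-branch debug logging should stay, the minimal fix is to check `authorKeyMatch.isPresent()` in the second `if` instead.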
UX/Logic

The Apply selected button enablement only depends on whether the table has items, not on whether any row is selected and matched; this can lead to a clickable action that immediately shows “No selection” and feels broken. Consider binding disable state to “any selected & matched rows” instead.

Button applyButton = new Button(Localization.lang("Apply selected"));
applyButton.setGraphic(IconTheme.JabRefIcons.ADD.getGraphicNode());
applyButton.setOnAction(e -> applySelectedContexts());
applyButton.setDisable(true);

resultsTable.getItems().addListener((javafx.collections.ListChangeListener<ExtractedContextRow>) change -> {
    applyButton.setDisable(resultsTable.getItems().isEmpty());
});
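
The suggested condition ("any selected and matched rows") can be expressed as a small predicate, sketched below in plain Java so it runs without JavaFX. `ContextRow` is a hypothetical simplification of the PR's `ExtractedContextRow`, which presumably exposes similar state through properties.

```java
import java.util.List;

// Hypothetical simplification of ExtractedContextRow: just the two flags that matter here.
record ContextRow(boolean selected, boolean matched) { }

final class ApplyButtonStateSketch {

    // The button should be disabled unless at least one row is both selected and matched.
    static boolean shouldDisableApply(List<ContextRow> rows) {
        return rows.stream().noneMatch(row -> row.selected() && row.matched());
    }

    public static void main(String[] args) {
        List<ContextRow> rows = List.of(new ContextRow(true, false), new ContextRow(false, true));
        System.out.println(shouldDisableApply(rows));
    }
}
```

In the actual tab, a predicate like this could back `applyButton.disableProperty().bind(Bindings.createBooleanBinding(...))`. That only updates correctly if the items list also fires on per-row selection changes, for example via an extractor passed to `FXCollections.observableArrayList`; whether the PR's row class supports that is an assumption.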
Performance

Regex patterns are recompiled inside extractMarker (Pattern.compile(...) calls), which is avoidable overhead when parsing many references. Consider reusing the existing static patterns (or making new ones static finals) to reduce allocations and improve throughput.

private String extractMarker(String text, int index) {
    Matcher numericBracketedMatcher = Pattern.compile("^\\s*\\[(\\d{1,3})\\]").matcher(text);
    if (numericBracketedMatcher.find()) {
        return "[" + numericBracketedMatcher.group(1) + "]";
    }

    Matcher numericDottedMatcher = Pattern.compile("^\\s*(\\d{1,3})\\.\\s").matcher(text);
    if (numericDottedMatcher.find()) {
        return "[" + numericDottedMatcher.group(1) + "]";
    }
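
A minimal sketch of the suggested change, hoisting both patterns into `static final` constants so they are compiled once. The regular expressions are copied from the snippet above; the class name, the constant names, and the `Optional` return type are assumptions, since the quoted method is truncated and its fallback behavior is not shown.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkerPatternSketch {

    // Compiled once at class load instead of on every call.
    private static final Pattern NUMERIC_BRACKETED = Pattern.compile("^\\s*\\[(\\d{1,3})\\]");
    private static final Pattern NUMERIC_DOTTED = Pattern.compile("^\\s*(\\d{1,3})\\.\\s");

    // Returns the normalized "[n]" marker for numeric reference styles, or empty
    // if the text starts with neither "[n]" nor "n. ".
    static Optional<String> extractNumericMarker(String text) {
        Matcher bracketed = NUMERIC_BRACKETED.matcher(text);
        if (bracketed.find()) {
            return Optional.of("[" + bracketed.group(1) + "]");
        }
        Matcher dotted = NUMERIC_DOTTED.matcher(text);
        if (dotted.find()) {
            return Optional.of("[" + dotted.group(1) + "]");
        }
        return Optional.empty();
    }
}
```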

@Muskan244
Contributor Author

Hi! Just checking in to see if there's anything more I should do.

@JabRef JabRef deleted a comment from koppor Dec 30, 2025
@github-actions github-actions bot added status: changes-required Pull requests that are not yet complete and removed status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers labels Dec 30, 2025
@github-actions github-actions bot removed the status: changes-required Pull requests that are not yet complete label Dec 30, 2025
@palukku
Member

palukku commented Dec 31, 2025

Minor detail: if I add a pdf after seeing the "no pdf attached" error message and switch back to the citation contexts tab it still shows the same message. I have to deselect the entry and reselect it to update the citation contexts.

But idk if this is your fault or the design of JavaFX, just a thing I stumbled upon.

@github-actions github-actions bot added the status: changes-required Pull requests that are not yet complete label Dec 31, 2025
Member

@palukku palukku left a comment

First feedback when looking through.
Didn't have the time to think through most of the newly created classes and logic till now. I will try to look into those next year xD

Comment on lines +101 to 103
Map.entry(AiTemplate.CITATION_CONTEXT_EXTRACTION_SYSTEM_MESSAGE, new SimpleStringProperty()),
Map.entry(AiTemplate.CITATION_CONTEXT_EXTRACTION_USER_MESSAGE, new SimpleStringProperty())
);
Member

I don't see them in the AI settings templates

Found\ %0\ citation\ context(s),\ but\ none\ could\ be\ matched\ to\ library\ entries.\ Ensure\ the\ cited\ papers\ are\ in\ your\ library\ with\ matching\ author\ names\ and\ years.=Found %0 citation context(s), but none could be matched to library entries. Ensure the cited papers are in your library with matching author names and years.
Found\ %0\ citation\ context(s)...=Found %0 citation context(s)...
Found\ %0\ citation\ context(s)\:\ %1\ matched,\ %2\ unmatched.\ Select\ which\ to\ apply.=Found %0 citation context(s): %1 matched, %2 unmatched. Select which to apply.
New\ entry=New entry
Member

This is a duplicate; we already have "New\ Entry"

Comment on lines +11 to +20
private static final List<String> CITATION_RELEVANT_SECTIONS = List.of(
"related work",
"literature review",
"background",
"previous work",
"state of the art",
"related studies",
"theoretical background",
"prior work"
);
Member

Maybe this could be configurable so I can use it in other languages as well (could be a follow-up PR, that's fine)
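
A rough sketch of what a configurable heading list could look like: the matcher takes the list as a parameter instead of reading a hard-coded constant. The class and method names, and the handling of leading section numbers, are assumptions for illustration; how the list would be stored in preferences is left open.

```java
import java.util.List;
import java.util.Locale;

final class SectionHeadingSketch {

    // Default headings, taken from the list in the PR.
    static final List<String> DEFAULT_SECTIONS = List.of(
            "related work", "literature review", "background", "previous work",
            "state of the art", "related studies", "theoretical background", "prior work");

    // Checks a heading line against a caller-supplied list, ignoring case and a
    // leading section number such as "2 " or "3.1. " (assumed numbering style).
    static boolean isCitationRelevantHeading(String headingLine, List<String> configuredSections) {
        String normalized = headingLine.strip()
                                       .toLowerCase(Locale.ROOT)
                                       .replaceFirst("^\\d+(\\.\\d+)*\\.?\\s+", "");
        return configuredSections.stream()
                                 .anyMatch(section -> normalized.equals(section.toLowerCase(Locale.ROOT)));
    }
}
```

With this shape, a German-language setup could pass a list such as `List.of("verwandte arbeiten")` without touching the extractor.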

Comment on lines +30 to +39
Objects.requireNonNull(rawText, "Raw text cannot be null");
Objects.requireNonNull(marker, "Marker cannot be null");
Objects.requireNonNull(authors, "Authors optional cannot be null");
Objects.requireNonNull(title, "Title optional cannot be null");
Objects.requireNonNull(year, "Year optional cannot be null");
Objects.requireNonNull(journal, "Journal optional cannot be null");
Objects.requireNonNull(volume, "Volume optional cannot be null");
Objects.requireNonNull(pages, "Pages optional cannot be null");
Objects.requireNonNull(doi, "DOI optional cannot be null");
Objects.requireNonNull(url, "URL optional cannot be null");
Member

Comment on lines +103 to +110
case NUMERIC_BRACKETED ->
references.addAll(splitByPattern(normalizedText, Pattern.compile("(?=\\[\\d{1,3}\\])")));
case NUMERIC_DOTTED ->
references.addAll(splitByPattern(normalizedText, Pattern.compile("(?=(?:^|\\n)\\d{1,3}\\.\\s)")));
case AUTHOR_YEAR ->
references.addAll(splitByBlankLinesOrIndentation(normalizedText));
case AUTHOR_KEY ->
references.addAll(splitByPattern(normalizedText, Pattern.compile("(?=\\[[A-Z][a-zA-Z]+\\d{2,4}[a-z]?\\])")));
Member

Can't you extract those into constants too? Or why not reuse already existing patterns like AUTHOR_KEY_MARKER_PATTERN?
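
For illustration, the numeric-bracketed case could use a precompiled lookahead pattern as sketched below. The constant and method names are hypothetical, and `splitByPattern` from the PR is replaced by a direct `Pattern.split` call. Whether an existing marker pattern can be reused directly depends on how it is anchored, which is not visible in the quoted lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ReferenceSplitSketch {

    // Zero-width lookahead: splits before each "[n]" marker without consuming it.
    private static final Pattern NUMERIC_BRACKETED_SPLIT = Pattern.compile("(?=\\[\\d{1,3}\\])");

    // Splits a references section into individual entries, one per "[n]" marker.
    static List<String> splitNumericBracketed(String referencesText) {
        return Arrays.stream(NUMERIC_BRACKETED_SPLIT.split(referencesText))
                     .map(String::strip)
                     .filter(part -> !part.isEmpty())
                     .toList();
    }
}
```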

@calixtus
Member

Minor detail: if I add a pdf after seeing the "no pdf attached" error message and switch back to the citation contexts tab it still shows the same message. I have to deselect the entry and reselect it to update the citation contexts.

But idk if this is your fault or the design of JavaFX, just a thing I stumbled upon.

Just means that somewhere a listener is missing

Development

Successfully merging this pull request may close these issues.

Extract text about papers from "related work" sections

5 participants