-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Add citation context extraction from PDF Related Work sections #14602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implements Issue JabRef#14085: Extract citation contexts from academic PDFs and add them to cited entries' comment fields. Changes: - Add Citation contexts tab in entry editor with extraction workflow UI - Create CitationContextExtractor to parse citation markers from PDF text - Create PdfSectionExtractor to identify Related Work and References sections - Create PdfReferenceParser to parse bibliography entries from PDFs - Create CitationMatcher to match citation markers to reference entries - Create LibraryEntryResolver to match references to library entries - Create CitationCommentWriter to write contexts to comment-{username} field - Add CitationContext and ReferenceEntry data models - Add preference to enable/disable Citation contexts tab - Display clickable extraction results with match status in table UI - Add new cited entries to library when applying contexts
jablib/src/main/java/org/jabref/logic/citation/contextextractor/CitationCommentWriter.java
Outdated
Show resolved
Hide resolved
…_en.properties and Replace custom calculateSimilarity method with existing StringSimilarity class in CitationCommentWriter
jablib/src/main/java/org/jabref/logic/citation/contextextractor/CitationContextExtractor.java
Show resolved
Hide resolved
jablib/src/main/java/org/jabref/logic/citation/contextextractor/CitationMatcher.java
Show resolved
Hide resolved
jablib/src/main/java/org/jabref/logic/citation/contextextractor/CitationMatcher.java
Outdated
Show resolved
Hide resolved
jablib/src/main/java/org/jabref/logic/citation/contextextractor/CitationMatcher.java
Show resolved
Hide resolved
jablib/src/main/java/org/jabref/logic/citation/contextextractor/LibraryEntryResolver.java
Show resolved
Hide resolved
jablib/src/main/java/org/jabref/logic/citation/contextextractor/LibraryEntryResolver.java
Outdated
Show resolved
Hide resolved
jablib/src/main/java/org/jabref/model/citation/CitationContext.java
Outdated
Show resolved
Hide resolved
jabgui/src/main/java/org/jabref/gui/preferences/JabRefGuiPreferences.java
Show resolved
Hide resolved
…o use assertInstanceOf instead
of assertTrue with instanceof check
- Use AuthorListParser to extract first author family name in CitationMatcher and LibraryEntryResolver instead of custom regex - Extract inline regex patterns as constants (BRACKETS_PATTERN, WHITESPACE_PATTERN) in CitationMatcher - Replace Objects.requireNonNull with jspecify @nonnull annotations in CitationContext record
…oading and fallback
Just to clarify, for this PR should I focus only on the current changes, and treat the “Related work text” tab as a follow-up, or do you want it included here as well? |
…Objects.requireNonNull with @nonnull, simplify getUsername() to return String, update tests accordingly, remove null-related tests, and remove logging of tinylog
…bref into fix-for-issue-14085
Follow up. Needs more thought.... Knowing that https://github.com/koppor/magic-merge-commit exists, you could start in parallel... |
Got it, I’ll focus on the current changes for this PR and treat the “Related work text” tab as a follow-up and start exploring it in parallel. |
|
Your pull request conflicts with the target branch. Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line. |
|
/review |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
|
Hi! Just checking in to see if there's anything more I should do. |
|
Minor detail: if I add a pdf after seeing the "no pdf attached" error message and switch back to the citation contexts tab it still shows the same message. I have to deselect the entry and reselect it to update the citation contexts. But idk if this is your fault or the design of JabaFX, just a thing I stumbled upon. |
palukku
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First feedback when looking through.
Didn't have the time to think through most of the newly created classes and logic till now. I will try to look into those next year xD
| Map.entry(AiTemplate.CITATION_CONTEXT_EXTRACTION_SYSTEM_MESSAGE, new SimpleStringProperty()), | ||
| Map.entry(AiTemplate.CITATION_CONTEXT_EXTRACTION_USER_MESSAGE, new SimpleStringProperty()) | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see them in the AI settings templates
| Found\ %0\ citation\ context(s),\ but\ none\ could\ be\ matched\ to\ library\ entries.\ Ensure\ the\ cited\ papers\ are\ in\ your\ library\ with\ matching\ author\ names\ and\ years.=Found %0 citation context(s), but none could be matched to library entries. Ensure the cited papers are in your library with matching author names and years. | ||
| Found\ %0\ citation\ context(s)...=Found %0 citation context(s)... | ||
| Found\ %0\ citation\ context(s)\:\ %1\ matched,\ %2\ unmatched.\ Select\ which\ to\ apply.=Found %0 citation context(s): %1 matched, %2 unmatched. Select which to apply. | ||
| New\ entry=New entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is duplicate, we already have "New\ Entry"
| private static final List<String> CITATION_RELEVANT_SECTIONS = List.of( | ||
| "related work", | ||
| "literature review", | ||
| "background", | ||
| "previous work", | ||
| "state of the art", | ||
| "related studies", | ||
| "theoretical background", | ||
| "prior work" | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be configurable so I can use it in other languages as well (could be a follow up pr, thats fine)
| Objects.requireNonNull(rawText, "Raw text cannot be null"); | ||
| Objects.requireNonNull(marker, "Marker cannot be null"); | ||
| Objects.requireNonNull(authors, "Authors optional cannot be null"); | ||
| Objects.requireNonNull(title, "Title optional cannot be null"); | ||
| Objects.requireNonNull(year, "Year optional cannot be null"); | ||
| Objects.requireNonNull(journal, "Journal optional cannot be null"); | ||
| Objects.requireNonNull(volume, "Volume optional cannot be null"); | ||
| Objects.requireNonNull(pages, "Pages optional cannot be null"); | ||
| Objects.requireNonNull(doi, "DOI optional cannot be null"); | ||
| Objects.requireNonNull(url, "URL optional cannot be null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use jspecify for nullness checks: https://devdocs.jabref.org/decisions/0052-jspecify-nullable-annotations.html
| case NUMERIC_BRACKETED -> | ||
| references.addAll(splitByPattern(normalizedText, Pattern.compile("(?=\\[\\d{1,3}\\])"))); | ||
| case NUMERIC_DOTTED -> | ||
| references.addAll(splitByPattern(normalizedText, Pattern.compile("(?=(?:^|\\n)\\d{1,3}\\.\\s)"))); | ||
| case AUTHOR_YEAR -> | ||
| references.addAll(splitByBlankLinesOrIndentation(normalizedText)); | ||
| case AUTHOR_KEY -> | ||
| references.addAll(splitByPattern(normalizedText, Pattern.compile("(?=\\[[A-Z][a-zA-Z]+\\d{2,4}[a-z]?\\])"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you extract those too or why can't you use already existing patterns like AUTHOR_KEY_MARKER_PATTERN?
Just means that somewhere a listener is missing |
Closes #14085
This PR implements citation context extraction from PDF Related Work sections. When viewing an entry with an attached PDF, users can extract citations and their surrounding context, match them to library entries, and automatically add the context descriptions to cited entries' comment fields in the format
[sourceCitationKey]: context description.Changes:
Steps to test
comment-{yourusername}field containing the context in format:[SourcePdfKey]: description textMandatory checks
CHANGELOG.mdin a way that is understandable for the average user (if change is visible to the user)