Integrate with clangd to build reduced context as input to LLM prompt? #5803
Comments
@bartlettroscoe this is a great idea, we'd likely want to use tree-sitter or similar to make adding support for more languages easy. Do you think a solution like this would be good?:
thoughts on what clangd could provide that this might fall short on?
You mean: ? I would have to look into this in more detail, but unless it uses the LLVM AST on the backend, I would not trust its ability to understand complex C++, and therefore you would not get the correct context.
Clangd is built on the LLVM AST and, therefore, should be able to handle all C++ code correctly, no matter how complex. Other efforts that try to create tools for C++ without the LLVM AST on the backend can be problematic and less than useful. For example, we found that https://metrixplusplus.github.io/metrixplusplus/ can't actually give correct metrics for a lot of complex C++ code. (We had to develop a tool based on clang-tidy, which uses the LLVM AST.) So, perhaps, I really need to be contacting the developers of tree-sitter to ask these questions?
Digging into this some more with the help of Gemini 2.5 Flash, I asked:
and it gave the response:
and:
with the conclusion:
See the full chat and responses at: If this is correct, a custom RAG for C++ context built on tree-sitter is unlikely to be able to provide correct context for a large and complex C++ code base (needed to understand some "code of interest"). If you don't give an AI model sufficient and correct context, we can never expect it to correctly refactor complex C++ code with 100% accuracy (no matter how smart or capable the models become).
Problem
Description
A feature that is badly needed for the application of AI models to C++ code is the ability to build a reduced context for large C++ code bases, where only a minimal context is extracted and used to generate a prompt for a large language model (LLM). The problem is that large C++ code bases can't fit in the prompt of even the best LLMs, and providing extra C++ code that is not relevant context just confuses the LLM and degrades performance. What is needed is a tool where you can point to selections of C++ "code of interest" (e.g., some functions, classes, or just a few lines of C++ code) and have it recursively look up all of the classes, functions, variables, etc. that are used in that "code of interest", then produce a listing of C++ code containing just those upstream dependencies as context. This basically turns a very large C++ project into a smaller C++ project (at least as far as what the LLM needs to know).
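The recursive lookup described above is essentially a graph closure over a symbol index. A minimal sketch in Python, using a hand-written toy index in place of a real clangd/clang-AST index (all symbol names here are invented for illustration):

```python
from collections import deque

# Hypothetical symbol index: maps each symbol to the symbols its
# definition references. In a real tool this graph would come from
# the clangd index (or the clang AST), not a hand-written dict.
INDEX = {
    "run_solver":   ["Mesh", "assemble", "tolerance"],
    "assemble":     ["Mesh", "SparseMatrix"],
    "Mesh":         ["Node"],
    "SparseMatrix": [],
    "Node":         [],
    "tolerance":    [],
}

def context_closure(code_of_interest):
    """Breadth-first walk of the symbol graph: collect every symbol
    transitively reachable from the selected 'code of interest'."""
    seen = set(code_of_interest)
    queue = deque(code_of_interest)
    while queue:
        sym = queue.popleft()
        for dep in INDEX.get(sym, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

Calling `context_closure(["run_solver"])` on this toy index returns all six symbols; emitting the definitions of just that set would be the "smaller C++ project" handed to the LLM.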
The need for this type of context gathering is described in the paper:
The basic outline of such a tool based on LLVM is described in the paper:
The clangd tool already indexes a large C++ project and has access to all of the source code (and the AST if needed), so it would seem like the logical place to add such a recursive context lookup. Together with integration into the (VSCode) Continue.dev extension, this would provide a seamless way to build the context passed to the LLM prompt so that the model has a shot at fully understanding a selection of C++ code (so it can explain, refactor, or add unit tests for the code of interest).
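Since clangd speaks the Language Server Protocol over stdio, an extension could drive the lookup with ordinary LSP requests. A sketch of how a client frames such a request (the file path and cursor position are made up; a real context builder would follow definition/reference responses recursively):

```python
import json

def lsp_frame(method, params, req_id=1):
    """Frame a JSON-RPC 2.0 request the way LSP clients send it to
    clangd over stdio: a Content-Length header, a blank line, then
    the JSON body."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": method,
        "params": params,
    })
    return f"Content-Length: {len(body)}\r\n\r\n{body}".encode()

# Ask clangd where the symbol under the cursor is defined. The URI
# and position below are illustrative placeholders, not from the issue.
msg = lsp_frame("textDocument/definition", {
    "textDocument": {"uri": "file:///proj/src/solver.cpp"},
    "position": {"line": 41, "character": 10},
})
```

Chasing `textDocument/definition` and `textDocument/references` results in a loop is one plausible way an extension could assemble the dependency closure without new clangd-side features, though a dedicated bulk request on the clangd side would likely be far more efficient.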
Claude 4.0 suggests this is what you need to do:
I created a matching llvm-project issue on the clangd side:
Solution
No response