
Integrate with clangd to build reduced context as input to LLM prompt? #5803


Open
bartlettroscoe opened this issue May 22, 2025 · 3 comments
@bartlettroscoe

bartlettroscoe commented May 22, 2025

Validations

  • I believe this is a way to improve. I'll try to join the Continue Discord for questions
  • I'm not able to find an open issue that requests the same enhancement

Problem

Description

A feature that is badly needed for applying AI models to C++ code is the ability to build a reduced context for large C++ code bases, where only a minimal context is extracted and used to generate a prompt for a large language model (LLM). The problem is that large C++ code bases can't fit in the prompt of even the best LLMs, and providing extra C++ code that does not add context just confuses the LLM and degrades performance. What is needed is a tool where you can point to selections of C++ "code of interest" (e.g., some functions, classes, or just a few lines of C++ code), and it then goes off and recursively looks up all of the classes, functions, variables, etc. that are used in that "code of interest" and produces a listing of C++ code, as context, containing just those upstream dependencies. What this does is basically take a very large C++ project and turn it into a smaller C++ project (at least as far as the LLM needs to know).
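As a rough illustration only (not an existing Continue or clangd API), here is a minimal sketch of the recursive lookup being asked for, assuming it runs inside a VS Code extension host where clangd answers the standard go-to-definition requests behind the scenes; `buildReducedContext` and `identifierPositions` are hypothetical names:

```typescript
import * as vscode from "vscode";

// Hypothetical sketch: starting from a selected range of "code of interest",
// repeatedly resolve the definitions of the symbols it references and pull in
// their source, until no new definitions are found or a depth limit is hit.
async function buildReducedContext(
  doc: vscode.TextDocument,
  selection: vscode.Range,
  maxDepth = 3
): Promise<string> {
  const collected = new Map<string, string>(); // "uri:line" -> source snippet
  let frontier = [{ uri: doc.uri, range: selection }];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: { uri: vscode.Uri; range: vscode.Range }[] = [];
    for (const item of frontier) {
      const itemDoc = await vscode.workspace.openTextDocument(item.uri);
      // Ask the active language server (clangd for C++) where each identifier
      // in the range is defined. A real tool would walk the AST instead of
      // sampling word positions like this.
      for (const pos of identifierPositions(itemDoc, item.range)) {
        const defs = await vscode.commands.executeCommand<
          (vscode.Location | vscode.LocationLink)[]
        >("vscode.executeDefinitionProvider", item.uri, pos);
        for (const def of defs ?? []) {
          const uri = "targetUri" in def ? def.targetUri : def.uri;
          const range = "targetRange" in def ? def.targetRange : def.range;
          const key = `${uri.toString()}:${range.start.line}`;
          if (collected.has(key)) continue;
          const defDoc = await vscode.workspace.openTextDocument(uri);
          collected.set(key, defDoc.getText(range));
          next.push({ uri, range });
        }
      }
    }
    frontier = next;
  }
  return [...collected.values()].join("\n\n");
}

// Naive helper: yield the start position of every identifier-like word.
function* identifierPositions(
  doc: vscode.TextDocument,
  range: vscode.Range
): Generator<vscode.Position> {
  for (let line = range.start.line; line <= range.end.line; line++) {
    for (const m of doc.lineAt(line).text.matchAll(/[A-Za-z_]\w*/g)) {
      yield new vscode.Position(line, m.index ?? 0);
    }
  }
}
```

The returned concatenation of definition snippets is the "smaller C++ project" that would be handed to the LLM as context.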

The need for this type of context gathering, and the basic outline of such a tool based on LLVM, is described in the paper:

  • "CITYWALK: Enhancing LLM-Based C++ Unit Test Generation via Project-Dependency Awareness and Language-Specific Knowledge", submitted 1/27/2025, https://arxiv.org/abs/2501.16155

The clangd tool already indexes a large C++ project and has access to all of the source code (and the AST if needed). clangd would therefore seem like the logical place to add such a recursive context lookup for a large C++ project. This, together with integration into the (VS Code) Continue.dev extension, would provide a seamless way to gather the context to pass to the LLM prompt so that the model has a shot at fully understanding a selection of C++ code (and can explain, refactor, or add unit tests for the code of interest).
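To make the clangd side concrete: the queries such a lookup needs are already standard LSP requests that clangd answers today (textDocument/definition, textDocument/typeDefinition, call hierarchy, etc.). Below is a minimal, heavily simplified sketch of driving clangd directly over stdio, assuming clangd is on PATH and the project has a compile_commands.json; the file path, line, and column are placeholders, and a real client would wait for each response before sending the next message:

```typescript
import { spawn } from "node:child_process";
import { readFileSync } from "node:fs";

// Simplified sketch: speak LSP (JSON-RPC over stdio) to clangd directly.
const clangd = spawn("clangd", ["--background-index"]);

function send(msg: object): void {
  const body = JSON.stringify(msg);
  clangd.stdin.write(`Content-Length: ${Buffer.byteLength(body)}\r\n\r\n${body}`);
}

let nextId = 1;
function request(method: string, params: object): void {
  send({ jsonrpc: "2.0", id: nextId++, method, params });
}

// Print everything clangd sends back (naive Content-Length framing parser).
let buf = "";
clangd.stdout.on("data", (chunk: Buffer) => {
  buf += chunk.toString("utf8");
  for (let end = buf.indexOf("\r\n\r\n"); end !== -1; end = buf.indexOf("\r\n\r\n")) {
    const len = Number(/Content-Length: (\d+)/.exec(buf.slice(0, end))?.[1]);
    if (buf.length < end + 4 + len) break;
    console.log(JSON.parse(buf.slice(end + 4, end + 4 + len)));
    buf = buf.slice(end + 4 + len);
  }
});

// Placeholder project and file.
const rootUri = "file:///path/to/project";
const fileUri = `${rootUri}/src/foo.cpp`;

request("initialize", { processId: process.pid, rootUri, capabilities: {} });
send({ jsonrpc: "2.0", method: "initialized", params: {} });
send({
  jsonrpc: "2.0",
  method: "textDocument/didOpen",
  params: {
    textDocument: {
      uri: fileUri,
      languageId: "cpp",
      version: 1,
      text: readFileSync(new URL(fileUri), "utf8"),
    },
  },
});
// The same kind of query a recursive context builder would issue for every
// symbol referenced by the "code of interest" (LSP positions are zero-based).
request("textDocument/definition", {
  textDocument: { uri: fileUri },
  position: { line: 41, character: 9 },
});
```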

Claude 4.0 suggests this is what you need to do:

I created a matching llvm-project issue on the clangd side:

Solution

No response

@dosubot dosubot bot added the area:indexing (Relates to embedding and indexing), area:integration (Integrations: context providers, model providers, etc.), and kind:enhancement (Indicates a new feature request, improvement, or extension) labels on May 22, 2025
@RomneyDa
Collaborator

@bartlettroscoe this is a great idea; we'd likely want to use tree-sitter or similar to make adding support for more languages easy. Do you think a solution like this would be good?

  1. A @related context provider that retrieves related imports (back to packages) for the currently selected range in a file, or for a whole file
  2. A get_related_symbols tool for agent mode that effectively does the same thing for a range or file (a rough sketch of the idea is included below)

thoughts on what clangd could provide that this might fall short on?
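For concreteness on the tree-sitter side, here is a minimal sketch of the syntactic half of such a get_related_symbols tool using the Node tree-sitter bindings; the function name and the exact set of node kinds are illustrative, not an existing Continue API:

```typescript
import Parser from "tree-sitter";
// tree-sitter-cpp ships without TypeScript typings in some versions.
const Cpp = require("tree-sitter-cpp");

// Syntactic half of a hypothetical get_related_symbols: list the identifiers
// and #include directives that a selected line range refers to. Mapping those
// names to their defining code (the semantic half) is where a compiler-backed
// index such as clangd would still be needed.
export function relatedSymbolNames(
  source: string,
  startLine: number,
  endLine: number
): { identifiers: string[]; includes: string[] } {
  const parser = new Parser();
  parser.setLanguage(Cpp);
  const tree = parser.parse(source);

  const inRange = (node: Parser.SyntaxNode) =>
    node.startPosition.row >= startLine && node.endPosition.row <= endLine;

  const identifiers = new Set<string>();
  for (const kind of ["identifier", "type_identifier", "field_identifier"]) {
    for (const node of tree.rootNode.descendantsOfType(kind)) {
      if (inRange(node)) identifiers.add(node.text);
    }
  }

  // #include directives anywhere in the file, since they bound what the
  // selection can legally refer to.
  const includes = tree.rootNode
    .descendantsOfType("preproc_include")
    .map((n) => n.text.trim());

  return { identifiers: [...identifiers], includes };
}
```

Even this much shows the split being discussed below: tree-sitter can name the symbols a selection uses, but resolving those names across translation units (templates, overloads, macros) is the part that needs a real C++ frontend.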

@bartlettroscoe
Author

bartlettroscoe commented May 27, 2025

@bartlettroscoe this is a great idea, we'd likely want to use tree-sitter or similar to make adding support for more languages easy. Do you think a solution like this would be good?:

You mean:

?

I would have to look into this in more detail, but unless it were using the LLVM AST on the backend, I would not trust its ability to understand complex C++, and therefore you would not get the correct context.

thoughts on what clangd could provide that this might fall short on?

Clangd is built on the LLVM AST and, therefore, should be correct for all C++ code, no matter how complex.

Other efforts to create tools for C++ that don't use the LLVM AST on the backend can be problematic and less than useful. For example, we found that https://metrixplusplus.github.io/metrixplusplus/ can't actually give correct metrics for a lot of complex C++ code. (We had to develop a tool based on clang-tidy, which uses the LLVM AST.)

So, perhaps, I really need to be contacting the developers of tree-sitter to ask these questions?

@bartlettroscoe
Author

Digging into this some more with the help of Gemini 2.5 Flash, I asked:

When trying to extract the context needed to successfully understand a piece of C++ code (e.g. a C++ function, a C++ class or just a few lines of C++) to use as context for an AI model (e.g. Claude Code 3.7), is it likely that a tool based on tree-sitter could recursively look up the context of the upstream C++ code on a large C++ project to provide sufficient information to do a complex C++ refactoring? This could include reading documentation for APIs of code being called.

and it gave the response:

It's unlikely that a tool based solely on Tree-sitter could recursively look up all the necessary "upstream" C++ code context on a large project to provide sufficient information for a complex C++ refactoring by an AI model.

and:

For serious C++ context extraction for AI models, a tool like Clangd (or direct use of the Clang compiler API) is far more suitable.

with the conclusion:

While Tree-sitter is excellent for syntactic understanding and local, interactive editor features, it falls short for the deep semantic understanding required for complex C++ refactoring by an AI. For that, you need a full compiler frontend like Clang, typically exposed through tools like Clangd, which can provide the complete, project-wide, type-aware context. The most effective solutions would likely integrate both: Tree-sitter for immediate, structural interaction and Clangd for comprehensive semantic analysis.

See the full chat and responses at:

If this is correct, a custom RAG for C++ context built on tree-sitter is unlikely to be able to provide correct context for a large and complex C++ code base (the context needed to understand some "code of interest").

If you don't give an AI model sufficient and correct context, we can never expect it to correctly refactor complex C++ code with 100% accuracy (no matter how smart or capable the models become).
