
Integrate with clangd to build reduced context as input to LLM prompt? #5803


Open
bartlettroscoe opened this issue May 22, 2025 · 3 comments
@bartlettroscoe

bartlettroscoe commented May 22, 2025

Validations

  • I believe this is a way to improve. I'll try to join the Continue Discord for questions
  • I'm not able to find an open issue that requests the same enhancement

Problem

Description

A feature that is badly needed for applying AI models to C++ code is the ability to build a reduced context for large C++ code bases, where only a minimal context is extracted and used to generate a prompt for a large language model (LLM). The problem is that large C++ code bases can't fit in the prompt of even the best LLMs, and providing extra C++ code that does not add context just confuses the LLM and degrades performance. What is needed is a tool where you can point to selections of C++ "code of interest" (e.g., some functions, classes, or just a few lines of C++ code), and it then goes off and recursively looks up all of the classes, functions, variables, etc. that are used in that "code of interest" and produces a listing of C++ code, as context, containing just those upstream dependencies. What this does is basically take a very large C++ project and turn it into a smaller C++ project (at least as far as the LLM needs to know).
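As a rough illustration only (not an existing Continue or clangd API), here is a minimal sketch of the recursive lookup being asked for, assuming it runs inside a VS Code extension host where clangd answers the standard go-to-definition requests behind the scenes; `buildReducedContext` and `identifierPositions` are hypothetical names:

```typescript
import * as vscode from "vscode";

// Hypothetical sketch: starting from a selected range of "code of interest",
// repeatedly resolve the definitions of the symbols it references and pull in
// their source, until no new definitions are found or a depth limit is hit.
async function buildReducedContext(
  doc: vscode.TextDocument,
  selection: vscode.Range,
  maxDepth = 3
): Promise<string> {
  const collected = new Map<string, string>(); // "uri:line" -> source snippet
  let frontier = [{ uri: doc.uri, range: selection }];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: { uri: vscode.Uri; range: vscode.Range }[] = [];
    for (const item of frontier) {
      const itemDoc = await vscode.workspace.openTextDocument(item.uri);
      // Ask the active language server (clangd for C++) where each identifier
      // in the range is defined. A real tool would walk the AST instead of
      // sampling word positions like this.
      for (const pos of identifierPositions(itemDoc, item.range)) {
        const defs = await vscode.commands.executeCommand<
          (vscode.Location | vscode.LocationLink)[]
        >("vscode.executeDefinitionProvider", item.uri, pos);
        for (const def of defs ?? []) {
          const uri = "targetUri" in def ? def.targetUri : def.uri;
          const range = "targetRange" in def ? def.targetRange : def.range;
          const key = `${uri.toString()}:${range.start.line}`;
          if (collected.has(key)) continue;
          const defDoc = await vscode.workspace.openTextDocument(uri);
          collected.set(key, defDoc.getText(range));
          next.push({ uri, range });
        }
      }
    }
    frontier = next;
  }
  return [...collected.values()].join("\n\n");
}

// Naive helper: yield the start position of every identifier-like word.
function* identifierPositions(
  doc: vscode.TextDocument,
  range: vscode.Range
): Generator<vscode.Position> {
  for (let line = range.start.line; line <= range.end.line; line++) {
    for (const m of doc.lineAt(line).text.matchAll(/[A-Za-z_]\w*/g)) {
      yield new vscode.Position(line, m.index ?? 0);
    }
  }
}
```

The returned concatenation of definition snippets is the "smaller C++ project" that would be handed to the LLM as context.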

The need for this type of context gathering, and the basic outline of such a tool based on LLVM, is described in the paper:

  • "CITYWALK: Enhancing LLM-Based C++ Unit Test Generation via Project-Dependency Awareness and Language-Specific Knowledge", submitted 1/27/2025, https://arxiv.org/abs/2501.16155

The clangd tool already indexes a large C++ project and has access to all of the source code (and the AST if needed). clangd would therefore seem like the logical place to add such a recursive context lookup for a large C++ project. This, together with integration into the (VS Code) Continue.dev extension, would provide a seamless way to gather the context to pass to the LLM prompt so that the model has a shot at fully understanding a selection of C++ code (and can explain, refactor, or add unit tests for the code of interest).
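To make the clangd side concrete: the queries such a lookup needs are already standard LSP requests that clangd answers today (textDocument/definition, textDocument/typeDefinition, call hierarchy, etc.). Below is a minimal, heavily simplified sketch of driving clangd directly over stdio, assuming clangd is on PATH and the project has a compile_commands.json; the file path, line, and column are placeholders, and a real client would wait for each response before sending the next message:

```typescript
import { spawn } from "node:child_process";
import { readFileSync } from "node:fs";

// Simplified sketch: speak LSP (JSON-RPC over stdio) to clangd directly.
const clangd = spawn("clangd", ["--background-index"]);

function send(msg: object): void {
  const body = JSON.stringify(msg);
  clangd.stdin.write(`Content-Length: ${Buffer.byteLength(body)}\r\n\r\n${body}`);
}

let nextId = 1;
function request(method: string, params: object): void {
  send({ jsonrpc: "2.0", id: nextId++, method, params });
}

// Print everything clangd sends back (naive Content-Length framing parser).
let buf = "";
clangd.stdout.on("data", (chunk: Buffer) => {
  buf += chunk.toString("utf8");
  for (let end = buf.indexOf("\r\n\r\n"); end !== -1; end = buf.indexOf("\r\n\r\n")) {
    const len = Number(/Content-Length: (\d+)/.exec(buf.slice(0, end))?.[1]);
    if (buf.length < end + 4 + len) break;
    console.log(JSON.parse(buf.slice(end + 4, end + 4 + len)));
    buf = buf.slice(end + 4 + len);
  }
});

// Placeholder project and file.
const rootUri = "file:///path/to/project";
const fileUri = `${rootUri}/src/foo.cpp`;

request("initialize", { processId: process.pid, rootUri, capabilities: {} });
send({ jsonrpc: "2.0", method: "initialized", params: {} });
send({
  jsonrpc: "2.0",
  method: "textDocument/didOpen",
  params: {
    textDocument: {
      uri: fileUri,
      languageId: "cpp",
      version: 1,
      text: readFileSync(new URL(fileUri), "utf8"),
    },
  },
});
// The same kind of query a recursive context builder would issue for every
// symbol referenced by the "code of interest" (LSP positions are zero-based).
request("textDocument/definition", {
  textDocument: { uri: fileUri },
  position: { line: 41, character: 9 },
});
```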

Claude 4.0 suggests this is what you need to do:

I created a matching llvm-project issue on the clangd side:

Solution

No response

@dosubot dosubot bot added the area:indexing (Relates to embedding and indexing), area:integration (Integrations: context providers, model providers, etc.), and kind:enhancement (Indicates a new feature request, improvement, or extension) labels on May 22, 2025
@RomneyDa
Collaborator

@bartlettroscoe this is a great idea; we'd likely want to use tree-sitter or similar to make adding support for more languages easy. Do you think a solution like this would be good?

  1. A @related context provider that retrieves related imports (back to packages) for the currently selected range in a file, or for a whole file
  2. A get_related_symbols tool for agent mode that effectively does the same thing for a range or file (a rough sketch of the idea is included below)

thoughts on what clangd could provide that this might fall short on?
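For concreteness on the tree-sitter side, here is a minimal sketch of the syntactic half of such a get_related_symbols tool using the Node tree-sitter bindings; the function name and the exact set of node kinds are illustrative, not an existing Continue API:

```typescript
import Parser from "tree-sitter";
// tree-sitter-cpp ships without TypeScript typings in some versions.
const Cpp = require("tree-sitter-cpp");

// Syntactic half of a hypothetical get_related_symbols: list the identifiers
// and #include directives that a selected line range refers to. Mapping those
// names to their defining code (the semantic half) is where a compiler-backed
// index such as clangd would still be needed.
export function relatedSymbolNames(
  source: string,
  startLine: number,
  endLine: number
): { identifiers: string[]; includes: string[] } {
  const parser = new Parser();
  parser.setLanguage(Cpp);
  const tree = parser.parse(source);

  const inRange = (node: Parser.SyntaxNode) =>
    node.startPosition.row >= startLine && node.endPosition.row <= endLine;

  const identifiers = new Set<string>();
  for (const kind of ["identifier", "type_identifier", "field_identifier"]) {
    for (const node of tree.rootNode.descendantsOfType(kind)) {
      if (inRange(node)) identifiers.add(node.text);
    }
  }

  // #include directives anywhere in the file, since they bound what the
  // selection can legally refer to.
  const includes = tree.rootNode
    .descendantsOfType("preproc_include")
    .map((n) => n.text.trim());

  return { identifiers: [...identifiers], includes };
}
```

Even this much shows the split being discussed below: tree-sitter can name the symbols a selection uses, but resolving those names across translation units (templates, overloads, macros) is the part that needs a real C++ frontend.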

@bartlettroscoe
Author

bartlettroscoe commented May 27, 2025

@bartlettroscoe this is a great idea, we'd likely want to use tree-sitter or similar to make adding support for more languages easy. Do you think a solution like this would be good?:

You mean:

?

I would have to look into this in more detail, but unless it were using the LLVM AST on the backend, I would not trust its ability to understand complex C++, and therefore you would not get the correct context.

thoughts on what clangd could provide that this might fall short on?

Clangd is built on the LLVM AST and, therefore, should be correct for all C++ code, no matter how complex.

Other efforts to create tools for C++ that don't use the LLVM AST on the backend can be problematic and less than useful. For example, we found that https://metrixplusplus.github.io/metrixplusplus/ can't actually give correct metrics for a lot of complex C++ code. (We had to develop a tool based on clang-tidy, which uses the LLVM AST.)

So, perhaps, I really need to be contacting the developers of tree-sitter to ask these questions?

@bartlettroscoe
Author

Digging into this some more with the help of Gemini 2.5 Flash, I asked:

When trying to extract the context needed to successfully understand a piece of C++ code (e.g. a C++ function, a C++ class or just a few lines of C++) to use as context for an AI model (e.g. Claude Code 3.7), is it likely that a tool based on tree-sitter could recursively look up the context of the upstream C++ code on a large C++ project to provide sufficient information to do a complex C++ refactoring? This could include reading documentation for APIs of code being called.

and it gave the response:

It's unlikely that a tool based solely on Tree-sitter could recursively look up all the necessary "upstream" C++ code context on a large project to provide sufficient information for a complex C++ refactoring by an AI model.

and:

For serious C++ context extraction for AI models, a tool like Clangd (or direct use of the Clang compiler API) is far more suitable.

with the conclusion:

While Tree-sitter is excellent for syntactic understanding and local, interactive editor features, it falls short for the deep semantic understanding required for complex C++ refactoring by an AI. For that, you need a full compiler frontend like Clang, typically exposed through tools like Clangd, which can provide the complete, project-wide, type-aware context. The most effective solutions would likely integrate both: Tree-sitter for immediate, structural interaction and Clangd for comprehensive semantic analysis.

See the full chat and responses at:

If this is correct, a custom RAG for C++ context built on tree-sitter is unlikely to be able to provide correct context for a large and complex C++ code base (the context needed to understand some "code of interest").

If you don't give an AI model sufficient and correct context, we can never expect it to correctly refactor complex C++ code with 100% accuracy (no matter how smart or capable the models become).
