Question: Does TruLens Support Custom LLMs for Evaluation? #2305

LingeshSD · 2025-12-10T04:21:58Z

LingeshSD
Dec 10, 2025

Hi all,
I am currently working on a POC and exploring TruLens. However, on the website I only see information about using OpenAI integrations for evaluations. Our company does not allow the use of OpenAI, but we do have an internally hosted LLM in our on-premise cloud.
Is it possible to use TruLens with a custom LLM for evaluation instead of OpenAI?

Answered by dosubot[bot]

Dec 10, 2025

Yes, TruLens supports evaluation and instrumentation for LangChain-based models that implement the BaseChatModel interface, including custom/internal LLMs. You can use both the LangChain integration (TruChain) and the Langchain provider to run feedback functions without relying on OpenAI.

The Langchain provider in TruLens accepts any LangChain LLM or ChatModel (i.e., anything implementing BaseLLM or BaseChatModel), so your custom class is supported. The provider wraps your model and uses its predict or predict_messages methods to generate completions for feedback evaluation—there’s no requirement for the underlying model to be OpenAI-based. This is confirmed in the provider implementatio…

View full answer

@LingeshSD · 2025-12-10T04:24:52Z

dosubot[bot]
Bot Dec 10, 2025

Hi @LingeshSD! I'm Dosu and I’m helping the trulens team.

Yes, TruLens supports using custom or internally hosted LLMs for evaluation. The framework is designed with a provider abstraction that allows you to plug in different LLMs by subclassing the LLMProvider interface and implementing the required methods, such as _create_chat_completion. This makes it possible to integrate non-OpenAI models, including those hosted on-premise or via private APIs.

There are out-of-the-box provider packages for various LLMs, including LiteLLM (which can connect to local models like Ollama), Huggingface, Langchain, Bedrock, and more. For example, you can use the LiteLLM provider to connect to a local LLM by specifying the model engine and API base URL:

from trulens.providers.litellm import LiteLLM

provider = LiteLLM(
    model_engine="ollama/llama3.1:8b", 
    api_base="http://localhost:11434"
)

You can then use this provider with TruLens feedback functions for evaluation. The modular architecture allows you to extend or implement your own provider if your LLM exposes a compatible API. For more details and example code, see the TruLens documentation and provider examples.

If your LLM has a unique API, you may need to subclass LLMProvider and implement the required methods to handle prompt/response formatting and endpoint communication. The LiteLLM provider is a good reference for how to do this for custom endpoints and models "LiteLLM provider implementation".

In summary: you do not need to use OpenAI—TruLens is flexible and supports a wide range of LLMs, including custom and on-premise deployments.

To reply, just mention @dosu.

Docs are dead. Just use Dosu.

0 replies

LingeshSD · 2025-12-10T04:50:30Z

LingeshSD
Dec 10, 2025
Author

@dosu We have an internal LLM integrated with LangChain using a custom class that implements the LangChain BaseChatModel interface. Does TruLens support evaluation and instrumentation for LangChain-based models like this? Specifically, can we use TruLens’ LangChain integration (TruChain) or the LangChain provider to run feedback functions without relying on OpenAI?

2 replies

dosubot[bot] Bot Dec 10, 2025

Yes, TruLens supports evaluation and instrumentation for LangChain-based models that implement the BaseChatModel interface, including custom/internal LLMs. You can use both the LangChain integration (TruChain) and the Langchain provider to run feedback functions without relying on OpenAI.

The Langchain provider in TruLens accepts any LangChain LLM or ChatModel (i.e., anything implementing BaseLLM or BaseChatModel), so your custom class is supported. The provider wraps your model and uses its predict or predict_messages methods to generate completions for feedback evaluation—there’s no requirement for the underlying model to be OpenAI-based. This is confirmed in the provider implementation and supporting tests, which show feedback functions running on arbitrary LangChain models, not just OpenAI ones (source, source).

TruChain can instrument any LangChain Runnable app, including those using custom BaseChatModel implementations, for logging and evaluation (source). There are no hardcoded dependencies on OpenAI in these integrations.

To use your custom model, just pass it to the Langchain provider and/or wrap your chain with TruChain as you would with any other LangChain model. Here’s a minimal example:

from trulens.providers.langchain import Langchain
from trulens.apps.langchain import TruChain
from trulens.core import Feedback

# Suppose `my_custom_llm` is your BaseChatModel implementation
provider = Langchain(chain=my_custom_llm)
f_relevance = Feedback(provider.relevance_with_cot_reasons).on_input_output()

# Suppose `my_chain` is your LangChain Runnable using your custom LLM
tru_recorder = TruChain(my_chain, feedbacks=[f_relevance])

with tru_recorder as recording:
    my_chain.invoke("Your prompt here")

This setup will work with your internal LLM and does not require OpenAI at any stage.

To reply, just mention @dosu.

Docs are dead. Just use Dosu.

Answer selected by sfc-gh-jreini

sfc-gh-jreini Dec 12, 2025
Maintainer

Hey @LingeshSD - @dosu's response is right! Just noting that the TruChain usage in this example is not needed to use the Langchain feedback provider. These are separate pieces and you are free to use the Langchain feedback provider with other app recorders like TruApp, TruGraph or TruLlama.

Hope that helps!

LingeshSD · 2025-12-14T14:14:29Z

LingeshSD
Dec 14, 2025
Author

hi @dosu I’m trying to integrate TruLens with a custom LangChain flow and add evaluation metrics like responsiveness, hallucinations, and actionability using a custom LLM (not OpenAI).
Could you share a clear, complete example showing:

Wrapping a LangChain chain with TruChain
Adding multiple custom feedback functions
Logging results and viewing them in the dashboard

Current docs mostly show OpenAI or RAG triad examples, but not a full custom flow. An official snippet would help a lot.
Thanks!

1 reply

dosubot[bot] Bot Dec 14, 2025

Here's a complete example for integrating TruLens with a custom LangChain flow using your own (non-OpenAI) LLM, adding multiple feedback functions, and logging/viewing results in the dashboard. This pattern works for any LangChain chain using a custom BaseChatModel.

# 1. Import required modules
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.dashboard import run_dashboard

# 2. Define your custom LLM (must implement BaseChatModel)
from langchain_core.language_models.chat_models import BaseChatModel

class MyInternalLLM(BaseChatModel):
    # Implement required methods for your internal LLM
    def _generate(self, messages, stop=None):
        # Your LLM's logic here
        return "Custom LLM response"

my_llm = MyInternalLLM()

# 3. Build your LangChain chain
prompt = PromptTemplate.from_template("Answer the following: {question}")
chain = LLMChain(llm=my_llm, prompt=prompt)

# 4. Define custom feedback functions (example: responsiveness, hallucination, actionability)
from trulens.core import Provider

class MyFeedbackProvider(Provider):
    def responsiveness(self, prompt: str, response: str) -> float:
        # Custom logic: e.g., score 1.0 if response is non-empty
        return float(bool(response.strip()))
    def hallucination(self, prompt: str, response: str) -> float:
        # Dummy: always returns 1.0 (no hallucination)
        return 1.0
    def actionability(self, prompt: str, response: str) -> float:
        # Dummy: always returns 1.0 (fully actionable)
        return 1.0

provider = MyFeedbackProvider()

f_responsiveness = Feedback(provider.responsiveness, name="Responsiveness").on_input_output()
f_hallucination = Feedback(provider.hallucination, name="Hallucination").on_input_output()
f_actionability = Feedback(provider.actionability, name="Actionability").on_input_output()

# 5. Wrap your chain with TruChain and attach feedbacks
tru_recorder = TruChain(
    chain,
    app_name="My Custom LLM App",
    feedbacks=[f_responsiveness, f_hallucination, f_actionability]
)

# 6. Run your chain and log results
with tru_recorder as recording:
    chain.invoke({"question": "How do I integrate TruLens with my LLM?"})

# 7. Launch the dashboard to view results
session = TruSession()  # Uses default.sqlite for logging
run_dashboard(session)  # Opens Streamlit dashboard

You can customize the feedback logic in MyFeedbackProvider to fit your actual evaluation criteria. The dashboard will show all logged traces and feedback scores for each run. You can also connect to a different database if needed for persistent storage and team sharing.

For more on custom feedbacks and dashboard usage, see the instrumentation guide, custom feedbacks, and dashboard docs.

Let me know if you want a more advanced example or have questions about adapting this to your specific LLM or feedback logic!

To reply, just mention @dosu.

Docs are dead. Just use Dosu.

LingeshSD · 2025-12-15T05:18:17Z

LingeshSD
Dec 15, 2025
Author

@dosu Thanks for the example!
For Deepeval I can run:

deepeval test run tests/"project"/"filename".py
Is there an equivalent CLI command for TruLens (e.g. trulens ...) to run the script and open the dashboard, or is the recommended way simply:

python tests/"project"/test_trulens_evaluation.py
using the with tru_recorder block and run_dashboard(session) as in your example?

1 reply

dosubot[bot] Bot Dec 15, 2025

TruLens doesn’t have a CLI command that runs your evaluation script and opens the dashboard in one step, like Deepeval’s test run command. The recommended workflow is to run your evaluation script with Python (using the with tru_recorder block to log results), then launch the dashboard to view results.

You can start the dashboard either programmatically with run_dashboard(session) in Python, or from the command line using trulens or trulens-eval—but these CLI commands only launch the dashboard, they don’t execute your test files or evaluation scripts automatically. So the typical pattern is:

Run your evaluation script:

python tests/<project>/test_trulens_evaluation.py

Launch the dashboard to explore results:
```
from trulens.dashboard import run_dashboard
run_dashboard(session)
```
or from the CLI (in the same folder as your database):
```
trulens
```
Docs and examples, notebook example.

You can specify the dashboard port, stop the dashboard, and connect to different databases if needed. But for now, running your script and then opening the dashboard is the standard approach.

To reply, just mention @dosu.

Share context across your team and agents. Try Dosu.

LingeshSD · 2025-12-17T04:33:19Z

LingeshSD
Dec 17, 2025
Author

hi @dosu
I’m trying to deeply understand how TruLens internally handles evaluation metrics across different application types (LLM tasks, RAG, agents, multi-turn, and end‑to‑end scenarios). I have a few questions about the design decisions and underlying mechanics:

Feedback Function Architecture
TruLens uses Feedback functions as the core abstraction, whereas other evaluators (e.g., DeepEval) expose many distinct metric classes (faithfulness, relevance, hallucination, etc.).

What was the design reasoning behind consolidating all evaluation logic under a single Feedback interface instead of separate metric types?
Does this choice give advantages in terms of flexibility, composability, or pipeline instrumentation?

Internal Metric Logic

How are the built-in metric prompts written internally?
Are metric prompts always generated by TruLens, or does the user need to provide prompts for certain cases?
For LLM-based metrics, how does TruLens handle scoring extraction, scaling, and standardization of outputs from different models?

Editing Built‑in Metrics

Is there an official way to extend or override built-in metrics (e.g., Groundedness, Faithfulness) with custom prompts or logic?
If so, what is the recommended pattern for modifying a built-in evaluation while retaining its existing structure?

Evaluation Coverage
TruLens supports a wide range of evaluation dimensions—single-turn, multi-turn, component-level (retriever, re-ranker, generator), and full end-to-end flows.

How does the internal tracing system decide which components a feedback function applies to?
How scalable is this approach for agentic systems where tool calls vary dynamically?

Non‑LLM vs LLM Metrics

How does TruLens differentiate execution paths for LLM-based metrics vs. non‑LLM metrics (similarity functions, lexicon-based checks, latency/cost metrics)?
Is there guidance on when to prefer one type over the other?

Complete Official Metric List
Is there one official page or document that lists every metric currently supported in TruLens (LLM, non‑LLM, RAG, agents, single/multi turn), so users can refer to a complete source of truth?

Overall, I’m trying to understand the philosophy behind the feedback‑centric design and how it compares to frameworks that expose many discrete metric classes. Any insight into the architecture or examples of best practices would be very helpful.

1 reply

dosubot[bot] Bot Dec 17, 2025

TruLens consolidates all evaluation logic under a single, flexible Feedback abstraction instead of separate metric classes. This design lets you define, compose, and instrument any evaluation—LLM-based or not—using the same interface. Each Feedback object wraps the evaluation logic, selectors for inputs/outputs, aggregation, and metadata, making it easy to plug into any pipeline and trace results across complex, multi-component, or agentic flows. This approach improves flexibility (custom logic, prompt overrides, aggregation), composability (mix and match providers, chain feedbacks), and instrumentation (unified logging, tracing, and dashboarding) compared to frameworks with many discrete metric classes source.

Built-in metric prompts are generated internally from class-level templates and criteria in the feedback metric classes (like Groundedness, Relevance, etc.). You don’t need to supply prompts for standard metrics—TruLens handles this—but you can override criteria, examples, or prompt templates to customize behavior. For LLM-based metrics, scoring is extracted and normalized to a 0–1 scale regardless of the original output space, using regex and parsing logic to enforce expected numeric ranges and standardize outputs source.

To extend or override built-in metrics, subclass the relevant feedback metric class and modify the prompt templates, criteria, or logic as needed. This pattern lets you retain the structure and output normalization of the original metric while customizing the evaluation to your needs source.

The tracing system uses OpenTelemetry spans and context managers to record computations at every component (retriever, generator, tool call, etc.). Each feedback function is mapped to the relevant component(s) via selectors, and the system dynamically tracks live objects and their spans, making it scalable for agentic and dynamic flows source.

LLM-based metrics use prompt-driven scoring pipelines (with normalization and parsing), while non-LLM metrics (similarity, sentiment, toxicity, PII, latency, etc.) use provider-specific endpoints and return native scores. Choose LLM-based metrics for nuanced, rubric-driven evaluation; use non-LLM metrics for fast, deterministic checks or when you have a specialized classifier source.

There isn’t a single, up-to-date official page listing every supported metric; definitions live in the codebase under feedback metric classes and provider implementations. For a comprehensive view, check the feedback classes in the code or the docs for each provider.

The feedback-centric design gives you a unified, extensible, and composable architecture for evaluation—contrasting with frameworks that use many discrete metric classes—making it easier to scale, customize, and trace evaluation across all types of LLM and agentic applications.

To reply, just mention @dosu.

Share context across your team and agents. Try Dosu.

LingeshSD · 2025-12-17T08:52:10Z

LingeshSD
Dec 17, 2025
Author

hi @dosu What if a required metric is not available in the feedback wrapper?
Is there any provision to create a new custom metric for the use case using Trulens?
Is there any way to define our own metric?

How are the built‑in metrics calculated and combined into a single score?
What is the underlying mechanism—are there specific mathematical calculations or processes involved?

1 reply

dosubot[bot] Bot Dec 17, 2025

If a required metric isn't available in the built-in feedback wrappers, you can define your own custom metric in TruLens. You do this by creating a new Provider class (or extending an existing one) and adding a method that implements your metric logic—this method should return a float score between 0 and 1. Then, wrap your method with the Feedback class and specify how to extract the relevant inputs (using selectors like on_input_output or custom mappings). Your custom feedback function can be used just like any built-in metric in evaluation and logging workflows. See the official custom feedback guide for examples.

Built-in metrics are implemented as Feedback subclasses with internal prompt templates, criteria, and output space definitions (e.g., Likert 0–3, binary 0–1). For LLM-based metrics, TruLens generates prompts, parses the model's output, and normalizes scores to a standard scale. For non-LLM metrics, the provider returns a native score directly.

When a feedback function produces multiple results (e.g., chunked evaluation or multi-output), TruLens aggregates them using an aggregator function. By default, this is numpy.mean (simple average), but you can override it with any custom function using the .aggregate method on your Feedback instance. The aggregation happens inside the Feedback run loop, so you have full control over how scores are combined—whether that's mean, weighted mean, or something more complex. For advanced use cases, you can use or extend the GroundTruthAggregator class, which supports statistical and correlation-based aggregation (AUC, F1, precision, recall, etc.) source.

Here's a minimal example of a custom metric and aggregation:

from trulens.core import Feedback, Provider, Select
import numpy as np

class MyProvider(Provider):
    def my_metric(self, prompt: str, response: str) -> float:
        # Custom logic here
        return float(len(response) > 10)

provider = MyProvider()
f_custom = Feedback(provider.my_metric, name="MyMetric").on_input_output().aggregate(np.mean)

You can now use f_custom in your evaluation pipeline, and the results will be aggregated as you specify.

To reply, just mention @dosu.

Docs are dead. Just use Dosu.

Question: Does TruLens Support Custom LLMs for Evaluation? #2305

Uh oh!

LingeshSD Dec 10, 2025

Replies: 6 comments · 6 replies

Uh oh!

Uh oh!

dosubot[bot] Bot Dec 10, 2025

Uh oh!

LingeshSD Dec 10, 2025 Author

Uh oh!

Uh oh!

dosubot[bot] Bot Dec 10, 2025

Uh oh!

sfc-gh-jreini Dec 12, 2025 Maintainer

Uh oh!

LingeshSD Dec 14, 2025 Author

Uh oh!

Uh oh!

dosubot[bot] Bot Dec 14, 2025

Uh oh!

Uh oh!

LingeshSD Dec 15, 2025 Author

Uh oh!

Uh oh!

dosubot[bot] Bot Dec 15, 2025

Uh oh!

LingeshSD Dec 17, 2025 Author

Uh oh!

Uh oh!

dosubot[bot] Bot Dec 17, 2025

Uh oh!

LingeshSD Dec 17, 2025 Author

Uh oh!

Uh oh!

dosubot[bot] Bot Dec 17, 2025

LingeshSD
Dec 10, 2025

Replies: 6 comments 6 replies

dosubot[bot]
Bot Dec 10, 2025

LingeshSD
Dec 10, 2025
Author

sfc-gh-jreini Dec 12, 2025
Maintainer

LingeshSD
Dec 14, 2025
Author

LingeshSD
Dec 15, 2025
Author

LingeshSD
Dec 17, 2025
Author

LingeshSD
Dec 17, 2025
Author