191 changes: 191 additions & 0 deletions docs/sampling-scope.md
@@ -0,0 +1,191 @@
# Sampling Scope

The `sampling` configuration defines which tables are allowed to have **row samples** collected during database introspection.

This feature is for users who want metadata introspection (catalogs, schemas, tables, columns) without reading actual row values from objects they consider sensitive.

The scope operates at the **(catalog, schema, table)** level.

---

## Scope of application

- A **catalog + schema + table** that is *in sampling scope* may have sample rows collected.
- A **catalog + schema + table** that is *out of sampling scope* will **not** have sample rows collected.
- Sampling scope is applied **only to the sampling step** (i.e., “collect samples from each table”).
- Sampling scope does **not** change what is introspected for metadata. Use `introspection-scope` to control metadata introspection.

Database engines that do not support catalogs are treated as having a single implicit catalog. For such engines, omit the `catalog` field from rules.

---

## YAML configuration

```yaml
sampling:
  enabled: true # optional (default: true)
  scope: # optional
    include:
      - catalog: <glob-pattern> # optional
        schemas: [<glob>, <glob>] # optional (string also allowed)
        tables: [<glob>, <glob>] # optional (string also allowed)
    exclude:
      - catalog: <glob-pattern> # optional
        schemas: [<glob>, <glob>] # optional (string also allowed)
        tables: [<glob>, <glob>] # optional (string also allowed)
        except_schemas: [<glob>] # optional (string also allowed)
        except_tables: [<glob>] # optional (string also allowed)
```

### Rules

Each entry under `include` or `exclude` is a **rule**.

A rule may contain:

- `catalog`
  A glob pattern matching catalog names.
  If omitted, the rule matches **any catalog**.

- `schemas`
  One or more glob patterns matching schema names.
  If omitted, the rule matches **any schema**.

- `tables`
  One or more glob patterns matching table names.
  If omitted, the rule matches **any table**.

- `except_schemas` / `except_tables` (exclude rules only)
  One or more glob patterns defining exceptions.
  If a target matches `except_schemas` / `except_tables`, it is **not excluded by that rule**, even if the rule otherwise matches.

Each rule **must specify at least one** of `catalog`, `schemas`, or `tables`.

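The shape of a rule can be sketched with a small standard-library model. This is an illustration only, not the project's code: the `Rule` dataclass and the `as_list` and `make_rule` helpers are hypothetical names, and the sketch assumes that a bare string is treated the same as a one-element list, as the "string also allowed" comments above suggest. It covers only the three selector fields and leaves out `except_schemas` / `except_tables`.

```python
from __future__ import annotations

from dataclasses import dataclass


def as_list(value: str | list[str] | None) -> list[str] | None:
    """Normalize a bare string to a one-element list; keep lists and None."""
    if value is None:
        return None
    if isinstance(value, str):
        return [value]
    return list(value)


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for one include/exclude entry."""

    catalog: str | None = None
    schemas: list[str] | None = None
    tables: list[str] | None = None


def make_rule(raw: dict) -> Rule:
    """Build a Rule from a parsed YAML mapping, rejecting empty selectors."""
    rule = Rule(
        catalog=raw.get("catalog"),
        schemas=as_list(raw.get("schemas")),
        tables=as_list(raw.get("tables")),
    )
    if rule.catalog is None and rule.schemas is None and rule.tables is None:
        raise ValueError("rule must specify at least one of: catalog, schemas, tables")
    return rule
```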
---

## Glob pattern matching

All matching uses **glob patterns** and is **case-insensitive**.

### Supported glob syntax

| Pattern | Meaning | Example |
|------|--------|--------|
| `*` | Matches any number of characters (including zero) | `order_*` matches `order_items` |
| `?` | Matches exactly one character | `dev?` matches `dev1`, `devA` |
| `[seq]` | Matches any single character in `seq` | `dev[12]` matches `dev1`, `dev2` |
| `[!seq]` | Matches any single character **not** in `seq` | `dev[!0-9]` matches `devA` |

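The syntax in the table coincides with what Python's standard `fnmatch` module accepts, so the matching behaviour can be sketched as follows. Whether the matcher actually uses `fnmatch` is an assumption, because its source is not shown here, and `glob_match` is a hypothetical helper name. Case-insensitivity is modelled by lowercasing both the pattern and the name before an exact-case match.

```python
from fnmatch import fnmatchcase


def glob_match(pattern: str, name: str) -> bool:
    """Case-insensitive glob match: lowercase both sides, then match exactly."""
    return fnmatchcase(name.lower(), pattern.lower())


# The examples from the table above:
print(glob_match("order_*", "ORDER_ITEMS"))  # `*` spans any run of characters
print(glob_match("dev?", "dev1"))            # `?` needs exactly one character
print(glob_match("dev[12]", "dev3"))         # `3` is not in the set
print(glob_match("dev[!0-9]", "devA"))       # `A` is not a digit
```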
---

## Semantics and precedence

Sampling scope evaluation follows these rules:

### 1. Master switch

- If `sampling.enabled` is `false`, **no tables are sampled**, regardless of `scope`.

### 2. Initial scope selection

- If `scope.include` is **absent or empty**, the initial sampling scope consists of **all discovered tables** (subject to `introspection-scope` and any ignored schemas).
- If `scope.include` is **present and non-empty**, the initial sampling scope consists only of tables that match **at least one include rule**.

### 3. Exclusion

After the initial scope is determined:

- Any table that matches **any exclude rule** is removed from the sampling scope.
- Exclusion always takes precedence over inclusion (**exclude wins**).
- `except_schemas` / `except_tables` apply only to the exclude rule in which they are defined and prevent that rule from excluding matching targets.

Exclude rules are combined using **OR** semantics.

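The three steps above can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the project's `SamplingScopeMatcher`: the `Rule` class and the standalone `should_sample` function are hypothetical, glob matching is assumed to behave like `fnmatch` with lowercased inputs, and a table is assumed to be spared from an exclude rule when it matches either `except_schemas` or `except_tables`. Filtering by `introspection-scope` and ignored schemas is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from fnmatch import fnmatchcase


def _any_match(patterns: list[str] | None, name: str) -> bool:
    """True when patterns is omitted (wildcard) or any glob matches, ignoring case."""
    if patterns is None:
        return True
    return any(fnmatchcase(name.lower(), p.lower()) for p in patterns)


@dataclass
class Rule:
    """Hypothetical rule; the except_* fields are meaningful for exclude rules only."""

    catalog: str | None = None
    schemas: list[str] | None = None
    tables: list[str] | None = None
    except_schemas: list[str] = field(default_factory=list)
    except_tables: list[str] = field(default_factory=list)

    def selects(self, catalog: str, schema: str, table: str) -> bool:
        """All specified selectors must match; omitted selectors match anything."""
        catalog_patterns = None if self.catalog is None else [self.catalog]
        return (
            _any_match(catalog_patterns, catalog)
            and _any_match(self.schemas, schema)
            and _any_match(self.tables, table)
        )

    def excludes(self, catalog: str, schema: str, table: str) -> bool:
        """The rule excludes a table it selects, unless an except_* list spares it."""
        if not self.selects(catalog, schema, table):
            return False
        # Assumption: matching either except_* list is enough to spare the table.
        spared = _any_match(self.except_schemas, schema) or _any_match(self.except_tables, table)
        return not spared


def should_sample(
    enabled: bool, include: list[Rule], exclude: list[Rule], catalog: str, schema: str, table: str
) -> bool:
    if not enabled:
        return False  # 1. master switch
    if include and not any(r.selects(catalog, schema, table) for r in include):
        return False  # 2. an allowlist exists and no include rule matched
    # 3. exclude wins; exclude rules are OR-ed together
    return not any(r.excludes(catalog, schema, table) for r in exclude)
```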
---

## Relationship to introspection-scope

Sampling rules are evaluated **only for objects that are introspected**.

- If a schema is excluded by `introspection-scope`, its tables are not introspected and therefore cannot be sampled.
- `sampling` only controls whether sample rows are collected for tables that are otherwise discovered and introspected.

---

## Examples

### Disable sampling entirely

```yaml
sampling:
  enabled: false
```

Result:
- Metadata introspection still runs (subject to `introspection-scope`).
- No sample rows are collected from any table.

---

### Exclude a schema everywhere

```yaml
sampling:
  scope:
    exclude:
      - schemas: [hr]
```

Result:
- Tables in `hr` are not sampled.
- Tables in other schemas remain eligible for sampling.

---

### Allowlist only a subset of tables

```yaml
sampling:
  scope:
    include:
      - schemas: [analytics]
        tables: ["revenue_*"]
```

Result:
- Only `analytics.revenue_*` tables are sampled.
- Everything else is not sampled.

---

### Exclude sensitive tables but sample everything else

```yaml
sampling:
  scope:
    exclude:
      - schemas: [shop]
        tables: [customers]
      - schemas: [hr]
        tables: [employees, active_employees]
```

Result:
- `shop.customers`, `hr.employees`, and `hr.active_employees` are not sampled.
- All other tables remain eligible for sampling.

---

### Exclude an entire schema, but allow a few safe tables

```yaml
sampling:
  scope:
    exclude:
      - schemas: [hr]
        except_tables: [departments, orders_by_employee]
```

Result:
- In `hr`, only `departments` and `orders_by_employee` are sampled.
- All other `hr` tables are not sampled.
- Tables in other schemas remain eligible for sampling.
@@ -16,6 +16,7 @@
 from databao_context_engine.plugins.databases.database_chunker import build_database_chunks
 from databao_context_engine.plugins.databases.databases_types import DatabaseIntrospectionResult
 from databao_context_engine.plugins.databases.introspection_scope import IntrospectionScope
+from databao_context_engine.plugins.databases.sampling_scope import SamplingConfig


 class BaseDatabaseConfigFile(BaseModel, AbstractConfigFile):
@@ -25,6 +26,9 @@ class BaseDatabaseConfigFile(BaseModel, AbstractConfigFile):
     introspection_scope: Annotated[
         IntrospectionScope | None, ConfigPropertyAnnotation(ignored_for_config_wizard=True)
     ] = Field(default=None, alias="introspection-scope")
+    sampling: Annotated[SamplingConfig | None, ConfigPropertyAnnotation(ignored_for_config_wizard=True)] = Field(
+        default=None
+    )


 T = TypeVar("T", bound="BaseDatabaseConfigFile")
22 changes: 18 additions & 4 deletions src/databao_context_engine/plugins/databases/base_introspector.py
@@ -13,6 +13,8 @@
 )
 from databao_context_engine.plugins.databases.introspection_scope import IntrospectionScope
 from databao_context_engine.plugins.databases.introspection_scope_matcher import IntrospectionScopeMatcher
+from databao_context_engine.plugins.databases.sampling_scope import SamplingConfig
+from databao_context_engine.plugins.databases.sampling_scope_matcher import SamplingScopeMatcher

 logger = logging.getLogger(__name__)

@@ -21,7 +23,17 @@ class SupportsIntrospectionScope(Protocol):
     introspection_scope: IntrospectionScope | None


-T = TypeVar("T", bound="SupportsIntrospectionScope")
+class SupportsSamplingScope(Protocol):
+    sampling: SamplingConfig | None
+
+
+class SupportsDatabaseScopes(SupportsIntrospectionScope, SupportsSamplingScope, Protocol):
+    """Marker protocol for configs usable with BaseIntrospector."""
+
+    pass
+
+
+T = TypeVar("T", bound="SupportsDatabaseScopes")


 class BaseIntrospector(Generic[T], ABC):
@@ -60,11 +72,13 @@ def introspect_database(self, file_config: T) -> DatabaseIntrospectionResult:
             if not introspected_schemas:
                 continue

+            sampling_matcher = SamplingScopeMatcher(file_config.sampling, ignored_schemas=self._ignored_schemas())
             for schema in introspected_schemas:
                 for table in schema.tables:
-                    table.samples = self._collect_samples_for_table(
-                        catalog_connection, catalog, schema.name, table.name
-                    )
+                    if sampling_matcher.should_sample(catalog, schema.name, table.name):
+                        table.samples = self._collect_samples_for_table(
+                            catalog_connection, catalog, schema.name, table.name
+                        )

             introspected_catalogs.append(DatabaseCatalog(name=catalog, schemas=introspected_schemas))
         return DatabaseIntrospectionResult(catalogs=introspected_catalogs)
99 changes: 99 additions & 0 deletions src/databao_context_engine/plugins/databases/sampling_scope.py
@@ -0,0 +1,99 @@
from __future__ import annotations

from typing import Any

from pydantic import BaseModel, ConfigDict, field_validator, model_validator


def _normalize_str_or_list(v: Any) -> Any:
    if v is None:
        return None
    if isinstance(v, str):
        return [v]
    return v


class SamplingIncludeRule(BaseModel):
    """Allowlist selector for sampling.

    Attributes:
        catalog: optional glob pattern
        schemas: optional list of glob patterns (string also accepted and normalized to a list)
        tables: optional list of glob patterns (string also accepted and normalized to a list)

    A rule must specify at least one of: catalog, schemas, tables.
    """

    model_config = ConfigDict(extra="forbid")

    catalog: str | None = None
    schemas: list[str] | None = None
    tables: list[str] | None = None

    @field_validator("schemas", "tables", mode="before")
    @classmethod
    def _normalize_lists(cls, v: Any) -> Any:
        return _normalize_str_or_list(v)

    @model_validator(mode="after")
    def _validate_rule(self) -> SamplingIncludeRule:
        if self.catalog is None and self.schemas is None and self.tables is None:
            raise ValueError("Sampling include rule must specify at least one of: catalog, schemas, tables")
        return self


class SamplingExcludeRule(BaseModel):
    """Denylist selector for sampling.

    Attributes:
        catalog: optional glob pattern
        schemas: optional list of glob patterns (string also accepted)
        tables: optional list of glob patterns (string also accepted)
        except_schemas: optional list of glob patterns (string also accepted)
        except_tables: optional list of glob patterns (string also accepted)

    If a target matches the rule but also matches an except_* selector, it is NOT excluded by this rule.
    """

    model_config = ConfigDict(extra="forbid")

    catalog: str | None = None
    schemas: list[str] | None = None
    tables: list[str] | None = None

    except_schemas: list[str] | None = None
    except_tables: list[str] | None = None

    @field_validator("schemas", "tables", "except_schemas", "except_tables", mode="before")
    @classmethod
    def _normalize_lists(cls, v: Any) -> Any:
        return _normalize_str_or_list(v)

    @model_validator(mode="after")
    def _validate_rule(self) -> SamplingExcludeRule:
        if self.catalog is None and self.schemas is None and self.tables is None:
            raise ValueError("Sampling exclude rule must specify at least one of: catalog, schemas, tables")
        return self


class SamplingScope(BaseModel):
    """Include/exclude rule set for sampling."""

    model_config = ConfigDict(extra="forbid")

    include: list[SamplingIncludeRule] = []
    exclude: list[SamplingExcludeRule] = []


class SamplingConfig(BaseModel):
    """Sampling configuration.

    Attributes:
        enabled: master switch. If False, sampling is disabled entirely.
        scope: include/exclude rules controlling which tables get sampled.
    """

    model_config = ConfigDict(extra="forbid")

    enabled: bool = True
    scope: SamplingScope | None = None