Introduce data attribute group in semantic-conventions for governance and security #3644
Ayushi-12 wants to merge 4 commits into open-telemetry:main
This PR contains changes to area(s) that do not have an active SIG/project and will be auto-closed:
Such changes may be rejected or put on hold until a new SIG/project is established. Please refer to the Semantic Convention Areas
1. **Low Cardinality**: To avoid performance degradation in metrics backends (like Prometheus), do not use unique IDs (e.g., `user_id`) in `data.*` attributes.
2. **No Actual Data**: Never place actual PII (e.g., an email address) inside these attributes. They are for metadata only.
3. **Mandatory Review**: Any new value added to `data.category` must be approved by the Data Governance committee to ensure consistent reporting.
Is a "Data Governance committee" being formed in the OpenTelemetry Project, or does this refer to something else? https://opentelemetry.io/community/members/ lists "Governance Committee" and "Technical Committee" but not yet that.
I think this section can be removed. I meant it more as a guideline for users who might want to add custom values to the `data.category` field, but it is confusing when it references a "committee".
Proposing a mandatory review as a guideline also seems counterintuitive.
`data.category` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
Is there an existing taxonomy we can/should use here? Maybe more generally does this need to be enumerated now?
One of my concerns is that either `internal` or `public` would seem to apply to all data, so one of them MUST be used at all times. That leaves no room for any other category, since this is a string and not a `[]string`. There's also overlap between `pii` and `financial` or `health`.
I agree with your point. So far in my research, I have not found any specific taxonomy that is clean and mutually exclusive.
Should we remove the enum for now and let community usage drive the addition of enum values
(while continuing the research for standardized terminology)?
Microsoft has lists of sensitivity labels and information types for use with SQL Server, at https://github.com/Azure-Samples/sql-data-classification/blob/main/sql_information_protection_default.json, referenced from https://learn.microsoft.com/en-us/sql/relational-databases/security/sql-data-discovery-and-classification?view=sql-server-ver17. I don't know if the same taxonomy is used in other Microsoft products or even by other vendors.
Their Tabular Data Stream 7.4 protocol is able to return this information from SQL Server to the client along with a result set; see [MS-TDS] section 2.2.7.5 DATACLASSIFICATION. This makes me wonder whether these attributes could be of any use in SQL Server database client spans. Because these attributes are defined as plain strings rather than arrays, I suppose the instrumentation would have to choose only the most sensitive label and the most secret information type if different columns of a result set have different labels.
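The "choose only the most sensitive label" step could look roughly like the sketch below. The ordering in `SENSITIVITY_ORDER` and the helper name `most_sensitive` are hypothetical; apart from `restricted`, which appears in the PR's example, none of the label values come from the PR or from SQL Server's own classification lists.

```python
# Hypothetical ranking, least to most sensitive. Only "restricted" appears in the
# PR's example; the other values and their order are assumptions for illustration.
SENSITIVITY_ORDER = ["public", "internal", "confidential", "restricted"]


def most_sensitive(labels):
    """Return the highest-ranked known label, or None if no label is recognized."""
    known = [label for label in labels if label in SENSITIVITY_ORDER]
    if not known:
        return None
    return max(known, key=SENSITIVITY_ORDER.index)


# Per-column labels as an instrumentation might receive them with a result set.
column_labels = ["internal", "restricted", "public"]

span_attributes = {}
label = most_sensitive(column_labels)
if label is not None:
    # A single string attribute can carry only one value, so the rest are dropped.
    span_attributes["data.sensitivity"] = label
```

Collapsing to one value loses the per-column detail, which is the trade-off the comment above points out for string-typed attributes.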
`data.sensitivity` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
I'm not sure this should be a formal enumeration, for some of the same reasons mentioned about data.category. I'd expect most organizations to have their own data classification schemes that may or may not align with this and that may be more coarsely- or finely-grained.
Ack. I can remove the enum from both these fields for now. We can come back to it with more data.
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
```bash
# Broadly tagging the service resource in the collection pipeline
export OTEL_RESOURCE_ATTRIBUTES="data.sensitivity=restricted,data.category=financial"
```
How does this map to declarative config?
I imagine it would be like

```yaml
file_format: "1.0-rc.1"
resource:
  attributes_list: "data.sensitivity=restricted,data.category=financial"
```

or

```yaml
file_format: "1.0-rc.1"
resource:
  attributes:
    - name: data.sensitivity
      value: restricted
    - name: data.category
      value: financial
```

https://github.com/open-telemetry/opentelemetry-configuration/blob/v1.0.0-rc.1/examples/kitchen-sink.yaml shows an example of both.
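To make the correspondence between the two forms concrete, here is a minimal sketch that splits a comma-separated `key=value` string into name/value entries. The function name `attributes_list_to_entries` is hypothetical, and the sketch is deliberately naive: it does no percent-decoding or other escaping, so it is not a substitute for an SDK's own parsing.

```python
def attributes_list_to_entries(attributes_list):
    """Split 'k1=v1,k2=v2' into [{'name': 'k1', 'value': 'v1'}, ...].

    Naive illustration only: no percent-decoding, no validation.
    """
    entries = []
    for pair in attributes_list.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled commas
        name, _, value = pair.partition("=")
        entries.append({"name": name.strip(), "value": value.strip()})
    return entries


entries = attributes_list_to_entries(
    "data.sensitivity=restricted,data.category=financial"
)
```

Under those assumptions, `entries` holds one name/value pair per attribute, in the same order as the string.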
thompson-tomo left a comment
It is hard to envision how the resource attribute scenario could be used reliably. Also, resource entries should be associated with an entity.
### 2. OpenTelemetry Resource Attributes

When a service starts, it can be configured to broadcast its data handling capabilities or the sensitivity of its primary datastore via environment variables.
I am not following here. This talks about broadcasting its capabilities (plural), so shouldn't the resource attribute be an array, so that a service can report that it can handle PII, health data, etc.? Tools could then raise alerts if the attribute on a telemetry signal (e.g., a span) is not a registered capability.
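The alerting idea in this comment could be sketched as follows, assuming the resource-level `data.category` were an array (which is the reviewer's suggestion, not what the PR currently defines). The function name `undeclared_category` and the plain-dict attribute shapes are hypothetical stand-ins for whatever a real tool would read.

```python
def undeclared_category(resource_attributes, span_attributes):
    """Return the span's data.category if the resource did not declare it, else None.

    Assumes the resource attribute is a list of categories and the span
    attribute is a single string; both shapes are assumptions for illustration.
    """
    declared = set(resource_attributes.get("data.category", []))
    observed = span_attributes.get("data.category")
    if observed is None or observed in declared:
        return None
    return observed


resource = {"data.category": ["pii", "health"]}

ok_span = {"data.category": "pii"}
suspicious_span = {"data.category": "financial"}

alert = undeclared_category(resource, suspicious_span)
```

A tool could raise an alert whenever the result is not `None`; here the `financial` span would be flagged while the `pii` span would not.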
The data store use case will be difficult because a client may use multiple data stores. The scenario I see as workable is one where the data store (e.g., a database) reports its own sensitivity and we can trace from client to server.
Fixes #
Changes
This change introduces the 'data' attribute group to OpenTelemetry semantic conventions to support data governance and security use cases.
Key changes:
Included a non-normative guide with usage examples for Kubernetes, Resource Attributes, and Context Propagation (Baggage).
These attributes enable automated security workflows like dynamic data redaction, compliance mapping (GDPR/PCI), and sensitive data movement monitoring.
Prototypes (in draft):
open-telemetry/opentelemetry-demo#3215
open-telemetry/opentelemetry-demo#3210
Supporting documents:
Introduce "data" attribute group in OTEL
Important
Pull request acceptance is subject to the triage process as described in Issue and PR Triage Management.
PRs that do not follow the guidance above may be automatically rejected and closed.
Merge requirement checklist
[chore]