
Conversation

@borisarzentar
Contributor

@borisarzentar borisarzentar commented Oct 12, 2025

Description

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.


Note

Adds coverage to ensure deletion helpers only remove targeted records.

  • Adds tests for delete_data_related_nodes and delete_data_related_edges to confirm only entries with matching data_id are deleted
  • Adds tests for delete_dataset_related_nodes and delete_dataset_related_edges to confirm only entries with matching dataset_id are deleted
  • Verifies no-op behavior when tables are empty
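
A minimal sketch of what one of these tests could look like (pytest-asyncio, an async SQLAlchemy session fixture, and a seeding helper are assumed; the table name and import path are illustrative, not taken from the repository):

import pytest
from sqlalchemy import text

# Assumed import path for the helper under test
from cognee.modules.graph.methods import delete_data_related_nodes


@pytest.mark.asyncio
async def test_delete_data_related_nodes_removes_only_matching_rows(session, seeded_node_data_ids):
    # seeded_node_data_ids is an assumed fixture returning (target_data_id, other_data_id)
    target_data_id, other_data_id = seeded_node_data_ids

    await delete_data_related_nodes(target_data_id, session=session)

    # Rows for the other data_id must remain untouched ("graph_nodes" is a placeholder table name)
    remaining = await session.execute(
        text("SELECT COUNT(*) FROM graph_nodes WHERE data_id = :data_id"),
        {"data_id": str(other_data_id)},
    )
    assert remaining.scalar_one() > 0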

Written by Cursor Bugbot for commit fae9869. This will update automatically on new commits.

Summary by CodeRabbit

  • Tests
    • Added unit tests for graph data deletion operations, verifying correct removal of target data-related nodes and edges while preserving unrelated data.
    • Added test coverage for dataset-specific deletion operations.
    • Added tests for edge case scenarios including operations on empty tables.


@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Oct 12, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Two unit test files are added to cover graph cleanup utilities. The tests verify that delete functions for nodes and edges correctly remove only specified data or datasets while leaving others intact, and handle empty tables without exceptions.

Changes

Cohort / File(s) Summary
Graph deletion unit tests
cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py, cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py
Added test suites for data/dataset node and edge deletion: verify target data removed while others remain, and validate no-op behavior on empty tables using async SQLAlchemy sessions and SQL COUNT queries.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • pazone
  • hajdul88
  • siillee

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description lacks a clear human-generated summary in the Description section; it relies on AI-generated Cursor summary and leaves critical sections incomplete (Acceptance Criteria is missing entirely). Provide a human-written Description section explaining the changes, add Acceptance Criteria with test verification instructions, and ensure all required template sections are completed.
Docstring Coverage ⚠️ Warning Docstring coverage is 34.09% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'feat: implement full delete feature' is vague and overly broad; it lacks specificity about what delete feature is being implemented. Consider using a more specific title that describes the primary change, such as 'feat: add unit tests for delete_data and delete_dataset helpers' or 'feat: add comprehensive deletion tests for graph nodes and edges'.
✨ Finishing touches
  • 📝 Generate docstrings


@borisarzentar borisarzentar requested a review from siillee October 15, 2025 13:17
@dexters1
Collaborator

If user1 (who is the owner of the dataset) adds Data to a dataset, can User2, who has delete permission, delete this Data? In a multi-user setting.

- name: Install Package
if: ${{ !contains(matrix.os, 'windows-latest') }}
run: |
cd dist
Collaborator

I will put general comments here:

  • examples/low_level/pipeline.py is failing (I think it needs an update)
  • mcp server delete is failing (probably just param handling)

What we discussed in the meeting:
  • Authentication is not handled (Igor will probably write that too)
  • Custom edges are not handled

Since it's changing a lot, I would merge this after the release so that we have more time to test, plus a migration is needed for SaaS. This will break delete for existing users. Maybe we need to support the old delete somehow (I don't know what the solution is here).

Collaborator

We agreed that after these are fixed, we test the interface features (delete_all, dataset, data) with edge cases.

@hajdul88
Collaborator

In general, this changes a lot of things and, as it is, can break cognee delete for existing users/cognee setups.

@Vasilije1990
Contributor

This PR will not be merged until it is backward compatible and all issues raised by Igor and Laszlo have been addressed @borisarzentar

The PR is too big and touches too many things without a good reason.

I would suggest slimming it down to the base core delete + backward compatibility and opening separate PRs for the other changes.

@borisarzentar
Contributor Author

If user1 (who is the owner of the dataset) adds Data to a dataset, can User2, who has delete permission, delete this Data? In a multi-user setting.

Added a test for that.

Collaborator

@hajdul88 hajdul88 left a comment

I only tested test_delete_default_graph and found major discrepancies. I will list them ordered by importance (highest to lowest):

  1. Cognify doesn't embed custom edges anymore. They are missing from the EdgeType_relationship_name vector collection. (Checked dev and it's correct there.)
  2. When I ingest 2 files and delete one of them, edges like is_a, made_from, is_part_of are getting deleted from the EdgeType_relationship_name collection although they shouldn't be.
  3. When I call datasets.delete_data(dataset_id, added_data.id, user), nothing gets deleted from the vector collections (except the edges that shouldn't be deleted).

Additional comments/questions:

How do we make sure that the new delete is backward compatible with the old solution? And what happens in the following case?

  • User has a built cognee system (KG + vector stores populated)
  • I ingest one document that contains nodes that are already in the KG + vector stores
  • I delete the document I ingested
  • In this case I believe old parts of the KG will be dropped as well, no?

For now I only tested test_delete_default_graph.py, so basic delete. Let's discuss before we proceed with the review.

Collaborator

@hajdul88 hajdul88 left a comment

Started to check it again but tests are failing now + there is a conflict

@borisarzentar
Contributor Author

Started to check it again but tests are failing now + there is a conflict

Yeah, the text was too long in one of the tests, so the Edge table failed. Changed that now and resolved the conflict. Everything else basically stayed the same.

@hajdul88
Collaborator

hajdul88 commented Nov 7, 2025

I think CI is still failing, and the delete tests are the ones failing now.


# revision identifiers, used by Alembic.
revision: str = "84e5d08260d6"
down_revision: Union[str, None] = "211ab850ef3d"
Collaborator

@dexters1 dexters1 Nov 10, 2025

This is just a reminder for everyone: there is a chance the down revision might change due to other PRs and will need to be manually updated (before we merge).

@gitguardian

gitguardian bot commented Nov 13, 2025

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend revoking them. While these secrets were previously flagged, we no longer have a reference to the specific commits where they were detected. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.



@@ -0,0 +1,119 @@
import os
Collaborator

@hajdul88 hajdul88 Nov 13, 2025

I think this test is not part of the CI, or at least I didn't see it called. I tried to run it locally, but it also failed for me at line 61.

If we don't need it, let's delete it; if we do, then let's put it into the CI.

Contributor Author

Maybe I forgot to add it, will check

Collaborator

But I tried it with neo4j and pgvector without backend access control

Contributor Author

Added it, let's see what it says. I ran it locally once and it passed.

Collaborator

@hajdul88 hajdul88 Nov 13, 2025

Btw, why is this failing without backend access control (neo4j, pg)? The default graph test is basically failing as well.

Collaborator

This is still happening in the new PR review round.

initial_node_ids = set([node[0] for node in initial_nodes])

user = await get_default_user()
await datasets.delete_data(dataset_id, johns_data_id, user) # type: ignore
Collaborator

I ran this example and deleted the first file, so John's data. After that I inspected the nodes and edges rel tables. Nodes seem to be correct; however, the edges table contains edges from John's data. In my graph I have 11 edges, but the edges rel table has 14. So I think the ledger for edges is not updated correctly.

Collaborator

@hajdul88 hajdul88 Nov 13, 2025

If I think about it, this can later cause a problem where we don't delete edges because of these phantom edges that stayed there.

Collaborator

@hajdul88 hajdul88 Nov 14, 2025

This is still happening in the new review round

Contributor Author

Tried this and followed the data in the databases; everything seems fine now. Tried with defaults and with pgvector/neo4j.

@@ -0,0 +1,382 @@
import os
Collaborator

Without running this test, I did the following experiment to test backward compatibility:

  1. Switched to dev branch
  2. added and cognified the following sentence: "Bmw is a german carmanufacturer"
  3. Inspected the graph -- BMW--produced in--germany -- is_a--country
  4. Switched to the new delete branch
  5. added and cognified the following sentence: "Germany is located next to the netherlands"
  6. Inspected the graph --Germany--located_next_to--Netherlands (germany node is shared between the two documents)
  7. Deleted the latest sentence about germany and netherlands
  8. The Germany node got deleted too; however, it should have stayed there.

I think the current solution is not backward compatible in this case, as it deleted a node that was part of the previously ingested knowledge graph.

Collaborator

This is still not correct in the new review round

Contributor

@pazone pazone left a comment

I found a bug: when we create a DS with 2 files and delete it, the DS is deleted, but the files remain hanging in the database.

@pazone
Contributor

pazone commented Nov 20, 2025

As I understand it, the GitGuardian / GitGuardian Security Checks failure is a false positive.

@borisarzentar
Contributor Author

I found a bug: when we create a DS with 2 files and delete it, the DS is deleted, but the files remain hanging in the database.

Checked this and looked at the files, everything is removed now.

@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot


await add_data_points(
data_points=graph_nodes, custom_edges=graph_edges, embed_triplets=embed_triplets
)
await add_data_points( data_points=graph_nodes, custom_edges=context, embed_triplets=embed_triplets)

Context passed to wrong parameter in add_data_points call

The context dictionary is incorrectly passed to the custom_edges parameter instead of the context parameter. The add_data_points function expects context as its second positional parameter for user/dataset/data tracking, but here it's passed as custom_edges=context. This means context will be None inside the function (breaking node/edge upserts), and custom_edges will receive a dict instead of a list (breaking edge processing logic).
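
A hedged sketch of the corrected call, based on the parameter roles described above (edge list to custom_edges, tracking dict to context); this is an assumption about the intended fix, not a confirmed patch:

await add_data_points(
    data_points=graph_nodes,
    custom_edges=graph_edges,   # the edge list, not the context dict
    context=context,            # user/dataset/data tracking info
    embed_triplets=embed_triplets,
)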


raise CliCommandException(
"No user ID provided for '--all' deletion. Please specify using --user-id param."
)
await cognee_datasets.delete_all(user_id=args.user_id)

Wrong parameter name passed to delete_all method

The call cognee_datasets.delete_all(user_id=args.user_id) passes a keyword argument user_id, but the delete_all method signature is async def delete_all(user: Optional[User] = None). This will raise a TypeError at runtime because user_id is not a valid parameter. The method expects a User object, not a user ID.
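
One possible fix, sketched under the assumption that the get_user helper used elsewhere in the CLI is available here to resolve a user ID into a User object:

# Resolve the ID to a User object first; delete_all expects user, not user_id
user = await get_user(args.user_id)
await cognee_datasets.delete_all(user=user)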


user = await get_default_user()

dataset = await get_dataset(user.id, dataset_id)
dataset = await get_authorized_dataset(user.id, dataset_id)

User ID passed instead of User object

The has_data method calls get_authorized_dataset(user.id, dataset_id), passing user.id (a UUID) as the first argument. However, get_authorized_dataset expects a User object as its first parameter, not a UUID. This is inconsistent with other calls in the same file (lines 40 and 64) which correctly pass user. This will cause the authorization check to fail or behave incorrectly.
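
The corrected call would presumably mirror the other call sites in the same file, passing the User object itself (a sketch of the fix described above):

# Pass the User object, consistent with the other get_authorized_dataset calls in this file
dataset = await get_authorized_dataset(user, dataset_id)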



if len(edges_to_save) > 0:
await graph_engine.add_edges(edges_to_save)


Missing required context parameter in add_data_points

The call await add_data_points(data_points=rule_list.rules) is missing the required context parameter. The updated add_data_points function signature has context as a required positional parameter with no default value. This will cause a TypeError at runtime. The context variable is available in the function scope and needs to be passed.
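
A minimal sketch of the fix, assuming the context dict already in scope carries the user/dataset/data entries the new signature expects:

# Forward the in-scope context so the required parameter is satisfied
await add_data_points(data_points=rule_list.rules, context=context)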

Additional Locations (1)


user_id=context["user"].id,
dataset_id=context["dataset"].id,
data_id=context["data"].id,
)

Missing tenant_id parameter in upsert_edges call

The call to upsert_edges is missing the required tenant_id parameter. The function signature requires tenant_id: UUID as the second positional parameter, but only user_id, dataset_id, and data_id are passed. This differs from the correct call pattern in cognee/tasks/codingagents/coding_rule_associations.py which properly includes tenant_id=context["user"].tenant_id.
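
A sketch of the call with the missing argument added, following the pattern the comment cites from coding_rule_associations.py (the name of the first positional argument is assumed):

await upsert_edges(
    edges,  # hypothetical name for the edge list passed at this call site
    tenant_id=context["user"].tenant_id,
    user_id=context["user"].id,
    dataset_id=context["dataset"].id,
    data_id=context["data"].id,
)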


user_id=user.id,
dataset_id=context["dataset"].id,
data_id=context["data"].id,
)

Tuples passed to upsert_nodes instead of DataPoints

The nodes variable is a list of tuples (str, dict) built on lines 111-119, but the new code passes it to upsert_nodes(nodes, ...) which expects List[DataPoint]. Inside upsert_nodes, the code accesses node.id, node.type, and calls DataPoint.get_embeddable_property_names(node), all of which will fail because tuples don't have these attributes. This will raise AttributeError at runtime when the classifier runs with a user context.


user = await get_default_user()
datasets = await get_datasets_by_name("main_dataset", user_id=user.id)
dataset = datasets[0]
added_data = await get_last_added_data(dataset.id)

Missing null check for dataset lookup causes crash

The code calls get_datasets_by_name("main_dataset", ...) and immediately accesses datasets[0] without checking if the list is empty. If the "main_dataset" doesn't exist, datasets will be an empty list and datasets[0] will raise an IndexError, crashing the interaction save process. A check for an empty result is needed before accessing the first element.
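
A defensive sketch; the error handling shown (raising ValueError) is illustrative and may not match the project's conventions:

datasets = await get_datasets_by_name("main_dataset", user_id=user.id)
if not datasets:
    # Fail explicitly instead of crashing on datasets[0]
    raise ValueError("Dataset 'main_dataset' not found; cannot save interaction.")
dataset = datasets[0]
added_data = await get_last_added_data(dataset.id)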


await self.cognee.datasets.delete_data(
dataset_id=dataset_id,
data_id=data_id,
user=user,

Missing return value in delete method direct mode

The delete method signature specifies a return type of Dict[str, Any]. The API mode branch correctly returns response.json(), but the direct mode else branch calls datasets.delete_data() without returning anything. This causes the function to implicitly return None instead of the expected dictionary, which could cause callers expecting a result to fail.
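
One possible shape of the fix; the returned payload here is an assumption for illustration, not the API's documented response format:

# Direct mode: return an explicit dict so the Dict[str, Any] return type holds
await self.cognee.datasets.delete_data(
    dataset_id=dataset_id,
    data_id=data_id,
    user=user,
)
return {"status": "deleted", "data_id": str(data_id), "dataset_id": str(dataset_id)}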



legacy_edges = (await session.scalars(query)).all()

legacy_edge_names = set([edge.creator_function.split(".")[1] for edge in legacy_edges])

Index access without dot check may raise IndexError

The code splits edge.creator_function by "." and accesses index [1]. If any creator_function value in the legacy ledger doesn't contain a period (e.g., a malformed or corrupted entry), .split(".") returns a single-element list and [1] raises an IndexError. This could crash deletion operations when processing legacy data with unexpected formats.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 20

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
cognee/infrastructure/databases/vector/vector_db_interface.py (1)

141-141: Update the docstring to match the new signature.

The docstring still references Union[List[str], list[str]], but the signature now accepts List[UUID]. This creates confusion for developers using this interface.

🔎 Proposed fix
-            - data_point_ids (Union[List[str], list[str]]): A list of IDs of the data points to delete.
+            - data_point_ids (List[UUID]): A list of UUIDs of the data points to delete.
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

288-293: Fix type signature inconsistency: delete_data_points should accept list[str] to match retrieve method.

The method currently accepts list[UUID] (lines 288-293), but the retrieve method in the same adapter accepts list[str] (line 207). Since LanceDB internally stores IDs as strings (line 184: id=str(properties["id"])), both methods should use the same type. The string interpolation f"id = '{data_point_id}'" will work correctly with UUID objects via __str__(), but the type signature should be consistent with retrieve.

Additionally, note that this deviates from the vector_db_interface.py contract which specifies list[UUID] for delete_data_points. Other adapters also have type signature mismatches: ChromaDB and NeptuneAnalytics use list[str] while PGVectorAdapter correctly uses list[UUID]. Consider aligning all adapters to the interface specification.

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (1)

388-396: Critical: Inconsistent delete_data_points signatures violate VectorDBInterface contract.

The VectorDBInterface defines delete_data_points(self, collection_name: str, data_point_ids: List[UUID]) at line 132, but three adapter implementations do not comply:

  • ChromaDBAdapter (line 514): list[str] instead of List[UUID]
  • NeptuneAnalyticsAdapter (line 350): list[str] instead of List[UUID]

Additionally, all current callers in the codebase (delete_data_nodes_and_edges.py, legacy_delete.py, delete_dataset_nodes_and_edges.py) pass [str(node.slug) for node in ...] — converting UUIDs to strings. This works with ChromaDB and NeptuneAnalytics but will fail at runtime with PGVectorAdapter and LanceDBAdapter, which correctly implement the UUID signature. The interface docstring (line 141) also contradicts the signature, stating Union[List[str], list[str]] instead of List[UUID].

All implementations and callers must be updated to consistently use List[UUID] as defined in the interface signature.
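
A sketch of the aligned adapter signature and caller side, following the interface contract quoted above (the adapter class name and vector_engine variable are placeholders; node.slug on the caller side is taken from the comment):

from typing import List
from uuid import UUID

class SomeVectorAdapter:  # placeholder adapter name
    async def delete_data_points(self, collection_name: str, data_point_ids: List[UUID]) -> None:
        # Match the VectorDBInterface contract exactly (internals omitted)
        ...

# Caller side: pass UUIDs directly instead of converting them to strings, e.g.
# await vector_engine.delete_data_points(collection_name, [node.slug for node in nodes])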

cognee/modules/graph/legacy/GraphRelationshipLedger.py (1)

31-40: Critical pre-existing bug: to_json references non-existent attributes.

The to_json method references self.parent_id and self.child_id (lines 34-35), but the actual model fields are source_node_id and destination_node_id (lines 16-17). This will cause an AttributeError at runtime.

While this bug was not introduced in this PR, it should be fixed for correctness.

🔎 Proposed fix
     def to_json(self) -> dict:
         return {
             "id": str(self.id),
-            "source_node_id": str(self.parent_id),
-            "destination_node_id": str(self.child_id),
+            "source_node_id": str(self.source_node_id),
+            "destination_node_id": str(self.destination_node_id),
             "creator_function": self.creator_function,
             "created_at": self.created_at.isoformat(),
             "deleted_at": self.deleted_at.isoformat() if self.deleted_at else None,
             "user_id": str(self.user_id),
         }
cognee/modules/data/methods/create_dataset.py (1)

13-38: Remove redundant commit; let the @with_async_session decorator handle session lifecycle.

The decorator commits the session after the function returns when it creates one (line 20 of with_async_session.py). The explicit await session.commit() at line 36 results in a double commit when the decorator manages the session. While SQLAlchemy treats the second commit as a no-op, the explicit commit is redundant and should be removed.

This pattern appears elsewhere in similar functions (e.g., cognee/modules/notebooks/methods/create_notebook.py:24).

examples/low_level/product_recommendation.py (1)

64-107: Add docstring and error handling for data access.

This function lacks documentation and performs unsafe data access that could raise errors.

📝 Suggested improvements
 def ingest_customers(data):
+    """
+    Ingest customer data and build Customer, Product, and Preference data points.
+    
+    Args:
+        data: List containing Data objects with customers attribute.
+    
+    Returns:
+        List[Customer]: List of Customer data points with relationships.
+    
+    Raises:
+        IndexError: If data list is empty.
+        KeyError: If required customer/product/preference fields are missing.
+    """
+    if not data:
+        raise ValueError("Data list cannot be empty")
+    
     customers_data_points = {}
     products_data_points = {}
     preferences_data_points = {}
 
     for customer in data[0].customers:
+        # Validate required fields
+        if "id" not in customer or "name" not in customer:
+            raise KeyError(f"Customer missing required fields: {customer}")
+            
         new_customer = Customer(
             id=uuid5(NAMESPACE_OID, customer["id"]),
             name=customer["name"],

As per coding guidelines, undocumented functions are assumed incomplete and should have docstrings. Additionally, the direct access to data[0] and dictionary keys without validation could cause runtime errors in production.

cognee/modules/data/models/Data.py (1)

41-47: Reconsider the eager loading strategy for the datasets relationship.

Changing to lazy="selectin" causes all related Dataset objects to be eagerly loaded whenever a Data instance is accessed. This is problematic because:

  • Many Data queries do not access datasets (e.g., get_pipeline_run_metrics.py performs aggregates, ingest_data.py only checks IDs)
  • File operations and single-record lookups don't require dataset information
  • The to_json() method has dataset serialization commented out, indicating datasets aren't always needed
  • The reverse relationship (Dataset.data) uses lazy="noload", creating an intentional asymmetry that suggests datasets should not be eagerly loaded from the Data side

Functions like get_dataset_data() and get_last_added_data() that explicitly .join(Data.datasets) will benefit from this, but most other Data queries will incur unnecessary loading overhead.

cognee-mcp/src/cognee_client.py (1)

199-199: Remove unused mode parameter.

The mode parameter is declared in the function signature but is never used in either API or direct mode. This creates a misleading API surface and breaks backward compatibility since callers may still pass this parameter expecting it to have an effect.

🔎 Proposed fix
-    async def delete(self, data_id: UUID, dataset_id: UUID, mode: str = "soft") -> Dict[str, Any]:
+    async def delete(self, data_id: UUID, dataset_id: UUID) -> Dict[str, Any]:
cognee-mcp/src/codingagents/coding_rule_associations.py (1)

95-125: Missing validation for required context keys may cause runtime errors.

The function accesses context["user"].id, context["dataset"].id, and context["data"].id without validation. If any key is missing or has a None value, this will raise KeyError or AttributeError at runtime.

Additionally, edges appear to be persisted twice: once via graph_engine.add_edges (line 119) and again via upsert_edges (lines 120-125). Please verify this dual persistence is intentional.

🔎 Proposed fix for context validation
 async def add_rule_associations(data: str, rules_nodeset_name: str, context: Dict):
+    required_keys = ["user", "dataset", "data"]
+    for key in required_keys:
+        if key not in context or context[key] is None:
+            raise ValueError(f"Missing required context key: {key}")
+
     graph_engine = await get_graph_engine()
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py (1)

424-438: Mock reassignment shadows the patched mock.

The mock_get_deletion_counts parameter is provided by the @patch decorator, but lines 428-429 create a new AsyncMock that shadows it without being used. This means the patch isn't properly configured.

🔎 Proposed fix
     @patch("cognee.cli.commands.delete_command.get_deletion_counts")
     @patch("cognee.cli.commands.delete_command.fmt.confirm")
     def test_delete_confirmation_keyboard_interrupt(self, mock_confirm, mock_get_deletion_counts):
         """Test delete command when user interrupts confirmation"""
-        mock_get_deletion_counts = AsyncMock()
-        mock_get_deletion_counts.return_value = DeletionCountsPreview()
+        mock_get_deletion_counts.return_value = DeletionCountsPreview()
 
         command = DeleteCommand()
♻️ Duplicate comments (2)
alembic/versions/84e5d08260d6_replace_graph_ledger_table_with_nodes_.py (1)

16-20: Reminder: Verify down_revision before merge.

As noted in a previous review comment, the down_revision value may need updating if other PRs are merged first that introduce new migrations.

cognee/tests/test_delete_default_graph_with_legacy_data_1.py (1)

139-141: Test function should follow test_* naming convention.

Same issue as the other test file - rename main to test_delete_default_graph_with_legacy_data_1 for pytest autodiscovery.

🟠 Major comments (28)
cognee/tests/utils/extract_relationships.py-6-6 (1)

6-6: Avoid mutable default argument for cache.

Using cache: dict = {} as a default argument is a well-known Python anti-pattern. The same dictionary instance is shared across all calls to the function, which can cause unexpected behavior if the cache is not explicitly reset between test runs.

If the caching behavior is intentional, consider using None as the default and initializing inside the function, or document the shared-cache semantics clearly.

🔎 Proposed fix
-def extract_relationships(document_chunk: DocumentChunk, graph: KnowledgeGraph, cache: dict = {}):
+def extract_relationships(document_chunk: DocumentChunk, graph: KnowledgeGraph, cache: dict = None):
+    if cache is None:
+        cache = {}
     relationships = []
cognee/tests/utils/filter_overlapping_entities.py-1-26 (1)

1-26: Add docstring to document the function's behavior.

Per the coding guidelines, undocumented function definitions are assumed incomplete. Please add a docstring explaining the purpose, parameters, and return value structure.

🔎 Suggested docstring
 def filter_overlapping_entities(*entity_groups):
+    """
+    Partition entities across groups into overlapping and non-overlapping sets.
+    
+    Args:
+        *entity_groups: Variable number of entity group lists to analyze.
+    
+    Returns:
+        tuple: (overlapping_entities, *per_group_filtered_lists) where
+            overlapping_entities contains entities present in multiple groups,
+            and each per_group list contains entities unique to that group.
+    """
     entity_count = {}

As per coding guidelines, undocumented functions are assumed incomplete.

cognee/tests/utils/filter_overlapping_relationships.py-4-33 (1)

4-33: Add docstring to document the function's behavior and relationship structure.

Per the coding guidelines, undocumented function definitions are assumed incomplete. Additionally, the function assumes relationships are indexable sequences with specific positions (start at [0], type at [1], end at [2]), which should be documented.

🔎 Suggested docstring
 def filter_overlapping_relationships(*relationship_groups):
+    """
+    Partition relationships across groups into overlapping and non-overlapping sets.
+    
+    Args:
+        *relationship_groups: Variable number of relationship group lists to analyze.
+            Each relationship is expected to be a sequence with:
+            [0] = start node, [1] = relationship type, [2] = end node.
+    
+    Returns:
+        tuple: (overlapping_relationships, *per_group_filtered_lists) where
+            overlapping_relationships contains relationships present in multiple groups,
+            and each per_group list contains relationships unique to that group.
+    """
     relationship_count = {}

As per coding guidelines, undocumented functions are assumed incomplete.

cognee/tests/utils/extract_entities.py-6-45 (1)

6-45: Add docstring to document function behavior.

As per coding guidelines, undocumented function definitions are assumed incomplete. Please add a docstring explaining the function's purpose, parameters (especially the cache behavior), and return value.

🔎 Suggested docstring
 def extract_entities(graph: KnowledgeGraph, cache: dict | None = None):
+    """
+    Extract Entity and EntityType objects from a KnowledgeGraph.
+    
+    Args:
+        graph: KnowledgeGraph containing nodes to extract
+        cache: Optional dict to reuse previously created Entity/EntityType objects.
+               If None, a new cache is created.
+    
+    Returns:
+        List containing Entity objects followed by EntityType objects.
+    """
     entities = []
     entity_types = []

Based on coding guidelines, undocumented Python function definitions should include docstrings.

Committable suggestion skipped: line range outside the PR's diff.

cognee/modules/data/methods/get_last_added_data.py-8-22 (1)

8-22: Add docstring for the function.

The function is missing a docstring. Per coding guidelines, undocumented function definitions are considered incomplete.

🔎 Proposed docstring
 async def get_last_added_data(dataset_id: UUID) -> Optional[Data]:
+    """
+    Retrieve the most recently created Data item associated with the given dataset.
+    
+    Args:
+        dataset_id: UUID of the dataset to query.
+    
+    Returns:
+        The most recent Data object for the dataset, or None if no data exists.
+    """
     db_engine = get_relational_engine()

As per coding guidelines, documenting functions is essential for maintainability.

examples/python/graphiti_example.py-43-46 (1)

43-46: Add bounds checking for dataset list access.

Line 44 accesses datasets[0] without verifying the list is non-empty. If no datasets exist after the add operations, this will raise an IndexError.

🔎 Proposed fix
     datasets = await get_datasets(user.id)
-    dataset_data = await get_dataset_data(datasets[0].id)  # type: ignore
-
-    pipeline = run_pipeline(tasks, data=dataset_data, datasets=[datasets[0].id], user=user)
+    if not datasets:
+        raise ValueError("No datasets found. Please ensure data was added successfully.")
+    
+    dataset_data = await get_dataset_data(datasets[0].id)  # type: ignore
+    pipeline = run_pipeline(tasks, data=dataset_data, datasets=[datasets[0].id], user=user)
cognee-mcp/src/server.py-289-307 (1)

289-307: Add error handling for dataset retrieval and data access.

Lines 290-292 retrieve datasets and access datasets[0] without verifying:

  1. The datasets list is non-empty
  2. get_last_added_data returns a valid data item

If the dataset lookup fails or returns an empty list, or if no data was added, this will cause an IndexError or pass None to add_rule_associations, potentially causing downstream failures.

🔎 Proposed fix
                 user = await get_default_user()
                 datasets = await get_datasets_by_name("main_dataset", user_id=user.id)
-                dataset = datasets[0]
-                added_data = await get_last_added_data(dataset.id)
+                
+                if not datasets:
+                    logger.warning("No main_dataset found, skipping rule associations")
+                else:
+                    dataset = datasets[0]
+                    added_data = await get_last_added_data(dataset.id)
+                    
+                    if not added_data:
+                        logger.warning("No data added to dataset, skipping rule associations")
+                    else:
+                        await add_rule_associations(
+                            data=data,
+                            rules_nodeset_name="coding_agent_rules",
+                            context={
+                                "user": user,
+                                "dataset": dataset,
+                                "data": added_data,
+                            },
+                        )
+                        logger.info("Associated rules generated from interaction data.")
 
                 logger.info("Save interaction process finished.")
 
                 # Rule associations only work in direct mode
                 if not cognee_client.use_api:
-                    logger.info("Generating associated rules from interaction data.")
-                    await add_rule_associations(
-                        data=data,
-                        rules_nodeset_name="coding_agent_rules",
-                        context={
-                            "user": user,
-                            "dataset": dataset,
-                            "data": added_data,
-                        },
-                    )
-                    logger.info("Associated rules generated from interaction data.")
+                    # Rule association logic moved above
+                    pass
                 else:

Note: The conditional structure may need adjustment depending on whether rule associations should only run in non-API mode or should run whenever datasets/data are available.

Committable suggestion skipped: line range outside the PR's diff.

cognee/tests/test_delete_dataset_neo4j.py-206-209 (1)

206-209: Use generate_edge_id(edge[2]) to generate proper edge vector IDs instead of edge[0].

Line 206 collects edge[0] (source_node_id) to query edge vectors, but edge vectors are indexed by IDs generated using generate_edge_id() (see cognee/init.py line 273). Using only the source node ID will not match the stored vector item identifiers. The correct approach is shown in assert_edges_vector_index_not_present.py, which uses generate_edge_id(relationship[2]).

Update to: query_edge_ids = [generate_edge_id(edge[2]) for edge in initial_edges]

Add import: from cognee.modules.engine.utils import generate_edge_id

cognee/modules/graph/legacy/has_nodes_in_legacy_ledger.py-15-50 (1)

15-50: Add docstring to document function behavior and parameters.

This function lacks a docstring explaining its purpose, parameters, return value, and the conditional ACL logic. As per coding guidelines, undocumented function definitions are assumed incomplete.

📝 Suggested docstring
 @with_async_session
 async def has_nodes_in_legacy_ledger(nodes: List[Node], session: AsyncSession):
+    """
+    Check whether given nodes exist in the legacy graph relationship ledger.
+    
+    Queries the GraphRelationshipLedger for node records where source_node_id equals
+    destination_node_id (self-referential entries indicating node existence) and the
+    node has not been deleted.
+    
+    Args:
+        nodes: List of Node objects to check for existence.
+        session: AsyncSession for database queries (injected by decorator).
+    
+    Returns:
+        List[bool]: Boolean list parallel to input nodes indicating existence.
+        If backend access control is enabled, existence is confirmed against the graph engine.
+        Otherwise, existence is determined solely by presence in the ledger.
+    """
     node_ids = [node.slug for node in nodes]

Based on coding guidelines, undocumented Python functions should have docstrings added.

cognee/modules/graph/legacy/has_nodes_in_legacy_ledger.py-52-65 (1)

52-65: Add docstring and implement error handling for async graph engine call.

This helper function lacks documentation (violating coding guidelines for undocumented functions) and does not handle potential errors from the async graph_engine.get_nodes() call, which could fail silently.

📝 Suggested docstring and error handling
 async def confirm_nodes_in_graph(
     legacy_nodes: List[Tuple[str, UUID]],
 ):
+    """Confirm that legacy nodes exist in the graph engine.
+
+    Args:
+        legacy_nodes: List of tuples (node_label, node_id) from the legacy ledger.
+
+    Returns:
+        Set[UUID]: Set of node IDs confirmed to exist in the graph.
+    """
     graph_engine = await get_graph_engine()
 
-    graph_nodes = await graph_engine.get_nodes([str(node[1]) for node in legacy_nodes])
+    try:
+        graph_nodes = await graph_engine.get_nodes([str(node[1]) for node in legacy_nodes])
+    except Exception as error:
+        raise RuntimeError(f"Failed to retrieve nodes from graph engine: {error}") from error
+
     graph_nodes_by_id = {node["id"]: node for node in graph_nodes}

Per coding guidelines, all undocumented function definitions should include a docstring summary. Additionally, async operations that query external services should explicitly handle failures rather than allowing exceptions to propagate uncaught.

cognee/modules/graph/utils/copy_edge.py-1-7 (1)

1-7: Add type hints and docstring; consider validation.

Per coding guidelines:

  1. Undocumented function definitions are considered incomplete—add a docstring.
  2. Prefer explicit, structured error handling.

The function lacks type hints (unclear what obj and overrides should be) and has no docstring. Additionally, it assumes obj is an ORM-mapped instance without validation, which could raise AttributeError if obj.__table__ doesn't exist.

📝 Suggested improvements
+from typing import Any, Dict, TypeVar
+
+T = TypeVar('T')
+
-def copy_edge(obj, overrides):
+def copy_edge(obj: T, overrides: Dict[str, Any]) -> T:
+    """
+    Create a copy of an ORM object with specified field overrides.
+    
+    Args:
+        obj: ORM-mapped object to copy (must have __table__ attribute).
+        overrides: Dictionary of column names and values to override.
+    
+    Returns:
+        New instance of the same class with merged data.
+    
+    Raises:
+        AttributeError: If obj is not an ORM-mapped instance.
+        TypeError: If overrides contain invalid column names.
+    """
     cls = obj.__class__
+    
+    if not hasattr(obj, '__table__'):
+        raise AttributeError(f"Object of type {type(obj).__name__} is not ORM-mapped")
 
     data = {c.name: getattr(obj, c.name) for c in cls.__table__.columns if not c.primary_key}
     data.update(overrides)
 
     return cls(**data)
cognee/modules/graph/methods/delete_dataset_related_nodes.py-9-13 (2)

9-13: Inefficient deletion pattern - delete directly without fetching.

The current implementation fetches all Node records into memory just to extract their IDs for deletion. For datasets with many nodes, this is wasteful. You can delete directly using the dataset_id predicate.

🔎 Proposed fix to delete nodes directly
 @with_async_session
 async def delete_dataset_related_nodes(dataset_id: UUID, session: AsyncSession):
-    nodes = (await session.scalars(select(Node).where(Node.dataset_id == dataset_id))).all()
-
-    await session.execute(delete(Node).where(Node.id.in_([node.id for node in nodes])))
+    await session.execute(delete(Node).where(Node.dataset_id == dataset_id))

9-13: Add a docstring to document the function.

Per the coding guidelines, undocumented function definitions are considered incomplete. Please add a docstring describing the purpose, parameters, and behavior of this function.

🔎 Proposed docstring
 @with_async_session
 async def delete_dataset_related_nodes(dataset_id: UUID, session: AsyncSession):
+    """
+    Delete all nodes associated with a given dataset.
+
+    Args:
+        dataset_id: UUID of the dataset whose nodes should be deleted
+        session: Database session (injected by decorator if not provided)
+    """
     nodes = (await session.scalars(select(Node).where(Node.dataset_id == dataset_id))).all()

Based on coding guidelines, undocumented functions should include docstrings for completeness.

cognee/tasks/codingagents/coding_rule_associations.py-130-137 (1)

130-137: Guard against KeyError when accessing context dictionary.

Line 130 accesses context["data"] without first checking if the "data" key exists. If the context dictionary doesn't contain this key, a KeyError will be raised.

🔎 Proposed fix
-        if context and hasattr(context["data"], "id"):
+        if context and "data" in context and hasattr(context["data"], "id"):
             await upsert_edges(
cognee/tests/utils/assert_nodes_vector_index_present.py-11-13 (1)

11-13: Add bounds checking for index_fields access.

Line 13 accesses node_metadata["index_fields"][0] without verifying that the list is non-empty. This will raise an IndexError if index_fields is an empty list.

🔎 Proposed fix
     for data_point in data_points:
         node_metadata = data_point.metadata or {}
+        index_fields = node_metadata.get("index_fields", [])
+        if not index_fields:
+            raise ValueError(f"DataPoint {data_point.id} has no index_fields defined")
-        collection_name = data_point.type + "_" + node_metadata["index_fields"][0]
+        collection_name = data_point.type + "_" + index_fields[0]
cognee/modules/graph/methods/delete_data_related_nodes.py-10-13 (1)

10-13: Optimize deletion to use a single query.

The current implementation fetches all matching nodes into memory (line 11) and then deletes them using an IN clause (line 13). This is inefficient and could consume excessive memory for large datasets. A direct DELETE statement would be faster and more memory-efficient.

🔎 Proposed refactor
 @with_async_session
 async def delete_data_related_nodes(data_id: UUID, session: AsyncSession):
-    nodes = (await session.scalars(select(Node).where(Node.data_id == data_id))).all()
-
-    await session.execute(delete(Node).where(Node.id.in_([node.id for node in nodes])))
+    await session.execute(delete(Node).where(Node.data_id == data_id))
cognee/modules/pipelines/operations/log_pipeline_run_complete.py-8-10 (1)

8-10: Add a docstring to document the function purpose.

This function lacks a docstring, which per project guidelines indicates incomplete implementation. Add a brief docstring explaining what the function does (e.g., logs the completion of a pipeline run with results to the database).

Also note: this is a new function, not a type change. All callers already pass UUID objects for pipeline_id (generated via generate_pipeline_id()), so there are no breaking changes.

cognee/tasks/storage/add_data_points.py-2-3 (1)

2-3: Remove duplicate import.

Line 3 duplicates the typing imports from line 2. Consolidate these into a single import statement.

🔎 Proposed fix
-from typing import Any, Dict, List, Optional
-from typing import List, Dict, Optional
+from typing import Any, Dict, List, Optional
cognee/modules/graph/methods/delete_dataset_related_edges.py-9-13 (1)

9-13: Variable naming error and inefficient query pattern.

  1. Variable is named nodes but contains Edge objects (line 11)
  2. The two-step approach (fetch then delete by IDs) is inefficient. A single DELETE statement would be more performant, especially for large datasets.
🔎 Proposed fix
 @with_async_session
-async def delete_dataset_related_edges(dataset_id: UUID, session: AsyncSession):
-    nodes = (await session.scalars(select(Edge).where(Edge.dataset_id == dataset_id))).all()
-
-    await session.execute(delete(Edge).where(Edge.id.in_([node.id for node in nodes])))
+async def delete_dataset_related_edges(dataset_id: UUID, session: AsyncSession) -> None:
+    """Delete all edges belonging to the specified dataset."""
+    await session.execute(delete(Edge).where(Edge.dataset_id == dataset_id))
cognee/modules/pipelines/operations/run_tasks.py-67-71 (1)

67-71: Handle case where dataset is not found.

If dataset_id doesn't correspond to an existing dataset, session.get() returns None, and subsequent dataset.id access will raise AttributeError. Consider adding explicit error handling.

🔎 Proposed fix
     async with get_relational_engine().get_async_session() as session:
         from cognee.modules.data.models import Dataset

         dataset = await session.get(Dataset, dataset_id)
+        if dataset is None:
+            raise ValueError(f"Dataset with id '{dataset_id}' not found")
cognee/cli/commands/delete_command.py-137-138 (1)

137-138: Unreachable code block.

The condition on line 137 (hasattr(args, "dataset_id") and hasattr(args, "data_id")) is unreachable because line 110's elif already handles hasattr(args, "dataset_id").

This branch needs to be restructured to handle data deletion separately, or the conditions need reordering.

cognee/cli/commands/delete_command.py-126-134 (1)

126-134: Contradictory logic: requires user_id then checks if not provided.

Line 127 raises an exception if user_id is not provided, but line 131 checks if not args.user_id to use default user. This logic is unreachable.

-                        if not hasattr(args, "user_id"):
-                            raise CliCommandException(
-                                "No user ID provided for deletion. Please specify using --user-id param."
-                            )
-
-                        if not args.user_id:
-                            user = await get_default_user()
-                        else:
-                            user = await get_user(args.user_id)
+                        user = await get_user(args.user_id) if args.user_id else await get_default_user()
cognee/api/v1/datasets/datasets.py-45-51 (1)

45-51: Type inconsistency and incorrect argument order in has_data.

  1. dataset_id is typed as str but other methods use UUID.
  2. Line 49 passes user.id instead of user to get_authorized_dataset, which expects a User object as the first argument (see relevant snippet from get_authorized_dataset.py).
     @staticmethod
-    async def has_data(dataset_id: str, user: Optional[User] = None) -> bool:
+    async def has_data(dataset_id: UUID, user: Optional[User] = None) -> bool:
         if not user:
             user = await get_default_user()

-        dataset = await get_authorized_dataset(user.id, dataset_id)
+        dataset = await get_authorized_dataset(user, dataset_id)

         return await has_dataset_data(dataset.id)
cognee/api/v1/datasets/datasets.py-40-42 (1)

40-42: Missing null check for dataset in list_data.

If get_authorized_dataset returns None (user lacks permission), line 42 will raise AttributeError when accessing dataset.id.

         dataset = await get_authorized_dataset(user, dataset_id)

+        if not dataset:
+            raise UnauthorizedDataAccessError(f"Dataset {dataset_id} not accessible.")
+
         return await get_dataset_data(dataset.id)

Committable suggestion skipped: line range outside the PR's diff.

cognee/api/v1/delete/__init__.py-13-14 (1)

13-14: The mode parameter is not forwarded to datasets.delete_data.

The function accepts a mode parameter but doesn't pass it to the underlying datasets.delete_data call. This means callers using mode="hard" will silently get the default behavior instead.

🔎 Suggested fix
 async def delete(data_id: UUID, dataset_id: UUID, mode: str = "soft", user: Optional[User] = None):
-    await datasets.delete_data(data_id=data_id, dataset_id=dataset_id, user=user)
+    await datasets.delete_data(dataset_id=dataset_id, data_id=data_id, user=user, mode=mode)
cognee/modules/graph/methods/legacy_delete.py-11-11 (1)

11-11: Production code imports from test utilities.

Importing get_contains_edge_text from cognee.tests.utils in production code violates module boundaries. Test utilities should not be dependencies of production modules. Consider moving this helper to a shared utility module under cognee/modules/ or cognee/shared/.

🔎 Suggested refactor

Move get_contains_edge_text to a shared module:

-from cognee.tests.utils.get_contains_edge_text import get_contains_edge_text
+from cognee.shared.utils import get_contains_edge_text

Then create cognee/shared/utils.py (or similar) with the helper function, and have the test utilities import from there instead.

Committable suggestion skipped: line range outside the PR's diff.

cognee/tests/test_delete_default_graph_with_legacy_data_2.py-116-118 (1)

116-118: Test function should follow test_* naming convention.

The function is named main but should be named with a test_ prefix for pytest to discover it automatically (e.g., test_delete_default_graph_with_legacy_data_2). As per coding guidelines, test functions should follow the test_*.py naming pattern.

🔎 Suggested fix
 @pytest.mark.asyncio
 @patch.object(LLMGateway, "acreate_structured_output", new_callable=AsyncMock)
-async def main(mock_create_structured_output: AsyncMock):
+async def test_delete_default_graph_with_legacy_data_2(mock_create_structured_output: AsyncMock):

Also update the __main__ block:

 if __name__ == "__main__":
     import asyncio
 
-    asyncio.run(main())
+    asyncio.run(test_delete_default_graph_with_legacy_data_2())
cognee/modules/graph/methods/legacy_delete.py-80-92 (1)

80-92: Variable shadowing: inner node shadows outer loop variable.

The list comprehension at line 89 unpacks (__, edge, node) which shadows the node variable from the outer loop at line 77. This makes the code confusing and error-prone.

🔎 Suggested fix - rename inner variable
                 if key == "chunks":
                     chunk_connections = await graph_db.get_connections(node_id)
                     deleted_node_ids.extend(
                         [
                             str(
                                 generate_edge_id(
-                                    get_contains_edge_text(node["name"], node["description"])
+                                    get_contains_edge_text(connected_node["name"], connected_node["description"])
                                 )
                             )
-                            for (__, edge, node) in chunk_connections
+                            for (__, edge, connected_node) in chunk_connections
                             if "relationship_name: contains;" in edge["relationship_name"]
                         ]
                     )
🟡 Minor comments (27)
cognee-starter-kit/src/pipelines/low_level.py-207-210 (1)

207-210: Potential IndexError if data list is empty.

The function accesses data[0] without checking if the list is non-empty. Consider adding a guard or documenting the precondition.

🔎 Suggested fix
 def ingest_payloads(data: List[Data]) -> list[Company]:
     """Ingest payloads and build company nodes."""
+    if not data:
+        raise ValueError("data list must contain at least one element")
     companies = build_companies(data[0])
     return companies

Alternatively, if only a single Data object is expected, simplify the signature:

-def ingest_payloads(data: List[Data]) -> list[Company]:
+def ingest_payloads(data: Data) -> list[Company]:
     """Ingest payloads and build company nodes."""
-    companies = build_companies(data[0])
+    companies = build_companies(data)
     return companies
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py-91-92 (1)

91-92: Remove duplicate condition check.

Line 91 duplicates the condition already checked at line 89. The nested if isinstance(dataset, str): is redundant since we're already inside a block that only executes when dataset is a string.

🔎 Proposed fix
     # If dataset is given as name make sure the dataset is created first
     if isinstance(dataset, str):
         async with db_engine.get_async_session() as session:
-            if isinstance(dataset, str):
-                dataset = await create_authorized_dataset(dataset, user)
+            dataset = await create_authorized_dataset(dataset, user)
cognee/tests/test_delete_custom_graph.py-129-130 (1)

129-130: Fix misleading comment: this is the first deletion, not second.

The comment says "Second data deletion" but this is the first call to delete_data.

Proposed fix
-    # Second data deletion
+    # First data deletion
     await datasets.delete_data(dataset.id, data1.id, user)
cognee/tests/test_delete_custom_graph.py-144-147 (1)

144-147: Fix assertion message: refers to John instead of Jane.

The assertion message incorrectly states "John and Company A" when checking Jane's edge to Company B.

Proposed fix
     assert (
         f"{str(generate_node_id('Jane'))}_works_for_{str(generate_node_id('Company B'))}"
         in edges_by_ids
-    ), "Edge between John and Company A not present in the graph."
+    ), "Edge between Jane and Company B not present in the graph."
cognee/tests/test_delete_custom_graph.py-120-127 (1)

120-127: Fix copy-paste errors in assertion messages.

All three edge assertion messages incorrectly state "John and Company A" even when checking Jane's edges or John's connection to Company B.

Proposed fix
     assert (
         f"{str(generate_node_id('John'))}_works_for_{str(generate_node_id('Company B'))}"
         in edges_by_ids
-    ), "Edge between John and Company A not present in the graph."
+    ), "Edge between John and Company B not present in the graph."
     assert (
         f"{str(generate_node_id('Jane'))}_works_for_{str(generate_node_id('Company B'))}"
         in edges_by_ids
-    ), "Edge between John and Company A not present in the graph."
+    ), "Edge between Jane and Company B not present in the graph."
cognee/modules/graph/legacy/has_edges_in_legacy_ledger.py-27-29 (1)

27-29: Guard against creator_function without a dot separator.

Line 29 assumes creator_function always contains a dot (e.g., "add_data_points.works_for"). If a legacy entry lacks this format, split(".")[1] will raise an IndexError.

Proposed fix
-    legacy_edge_names = set([edge.creator_function.split(".")[1] for edge in legacy_edges])
+    legacy_edge_names = set()
+    for edge in legacy_edges:
+        parts = edge.creator_function.split(".", 1)
+        if len(parts) > 1:
+            legacy_edge_names.add(parts[1])
cognee/tests/test_delete_dataset_neo4j.py-174-181 (1)

174-181: Dead code: rebuilding the same structure from initial_nodes instead of current nodes.

This block iterates over initial_nodes (same as lines 151-157) and builds after_delete_nodes_by_vector_collection, but this variable is never used afterward. This appears to be a copy-paste error or leftover code.

Proposed fix: Remove unused code block
     after_first_delete_node_ids = set([node[0] for node in nodes])
 
-    after_delete_nodes_by_vector_collection = {}
-    for node in initial_nodes:
-        node_data = node[1]
-        node_metadata = json.loads(node_data["metadata"])
-        collection_name = node_data["type"] + "_" + node_metadata["index_fields"][0]
-        if collection_name not in after_delete_nodes_by_vector_collection:
-            after_delete_nodes_by_vector_collection[collection_name] = []
-        after_delete_nodes_by_vector_collection[collection_name].append(node)
-
     vector_engine = get_vector_engine()
cognee/tests/test_delete_custom_graph.py-22-22 (1)

22-22: Add @pytest.mark.asyncio decorator for async test discovery.

The main function is async but lacks the @pytest.mark.asyncio decorator. Without it, pytest won't discover and run this test correctly in async mode. Per coding guidelines, use pytest.mark.asyncio for async tests.

Proposed fix
+import pytest
+
+
+@pytest.mark.asyncio
 async def main():
cognee/modules/data/methods/get_deletion_counts.py-82-82 (1)

82-82: Fix typo in comment.

The comment contains a redundant word: "data data_entries" should be "data entries".

🔎 Proposed fix
-                # Count all data data_entries across all of the user's datasets
+                # Count all data entries across all of the user's datasets
cognee/tests/test_delete_default_graph_non_mocked.py-89-100 (1)

89-100: Remove duplicate node grouping logic.

Lines 89-100 repeat the same grouping logic as lines 58-70, iterating over initial_nodes again instead of using the post-deletion nodes. This appears to be a bug: the variable after_delete_nodes_by_vector_collection is populated but never used, and it is populated with the same data as initial_nodes_by_vector_collection.

If the intent was to group nodes remaining after the first deletion, the loop should iterate over nodes (from line 77), not initial_nodes.

🔎 Proposed fix (if grouping post-deletion nodes is intended)
     after_delete_nodes_by_vector_collection = {}
-    for node in initial_nodes:
+    for node in nodes:
         node_data = node[1]
         node_metadata = (
             node_data["metadata"]
             if type(node_data["metadata"]) is dict
             else json.loads(node_data["metadata"])
         )
         collection_name = node_data["type"] + "_" + node_metadata["index_fields"][0]
         if collection_name not in after_delete_nodes_by_vector_collection:
             after_delete_nodes_by_vector_collection[collection_name] = []
         after_delete_nodes_by_vector_collection[collection_name].append(node)

Alternatively, if this grouping is not needed, consider removing lines 89-100 entirely since after_delete_nodes_by_vector_collection is never used.

cognee/tasks/temporal_awareness/build_graph_with_temporal_awareness.py-19-20 (1)

19-20: Add missing docstring and fix blocking I/O operations in async context.

  • Lines 19-20: file.read() is synchronous I/O that will block the event loop. Use await asyncio.to_thread(file.read) to handle both sync (LocalFileStorage) and async (S3FileStorage) backends without blocking, consistent with the pattern used in cognee/shared/cache.py (a sketch follows after this list).
  • Line 35: datetime.now() lacks timezone information, affecting consistency across environments. Use datetime.now(timezone.utc) for portable timestamps.
  • Missing docstring for the async function. Add a brief summary describing the function's purpose and parameters per coding guidelines.
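
For illustration, a minimal sketch of the non-blocking pattern described above (the function name and file handle are placeholders, not the module's actual code):

import asyncio
from datetime import datetime, timezone


async def read_source_text(file) -> tuple:
    """Read a synchronous file handle without blocking the event loop."""
    # Offload the blocking read to a worker thread; an async storage backend
    # could be awaited directly instead. `file` is a hypothetical open handle.
    text = await asyncio.to_thread(file.read)

    # Timezone-aware timestamp for consistent behavior across environments.
    created_at = datetime.now(timezone.utc)
    return text, created_at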
distributed/workers/graph_saving_worker.py-98-98 (1)

98-98: Remove commented-out upsert calls—they reference undefined variables and are dead code.

The commented-out calls to upsert_nodes and upsert_edges on lines 98 and 111 reference undefined variables (user, dataset, data) that are never available in the graph_saving_worker() function scope. This worker processes a queue containing only nodes and edges—no metadata context.

Metadata persistence is correctly handled elsewhere in the workflow. In cognee/tasks/storage/add_data_points.py, where metadata is available, upsert_nodes() and upsert_edges() are called after graph_engine.add_nodes() to store user/dataset/data associations in the relational database. The graph_saving_worker has no access to this context and shouldn't attempt it.

Delete these commented-out lines.

cognee/tests/utils/assert_graph_edges_present.py-13-16 (1)

13-16: Edge ID construction may fail with underscore-containing relationship names.

The edge identifier is built using underscore separators (lines 13 and 16). If a relationship name contains underscores, this could lead to ambiguous or incorrect matching.

Consider using a delimiter that's guaranteed not to appear in relationship names, or use a tuple-based comparison instead of string concatenation:

# Alternative approach using tuples
edge_ids = set([(str(edge[0]), edge[2], str(edge[1])) for edge in edges])

for relationship in relationships:
    relationship_tuple = (str(relationship[0]), relationship[2], str(relationship[1]))
    assert relationship_tuple in edge_ids, f"Edge {relationship_tuple} not present in the graph."
cognee/tests/utils/assert_nodes_vector_index_present.py-25-28 (1)

25-28: Handle missing 'text' key in vector_item.payload safely.

Line 27 accesses vector_item.payload['text'] in the assertion message, which will raise a KeyError if the payload doesn't contain a 'text' key.

🔎 Proposed fix
         for vector_item in vector_items:
             assert str(vector_item.id) in query_data_point_ids, (
-                f"{vector_item.payload['text']} is not present in the vector store."
+                f"Item {vector_item.id} (text: {vector_item.payload.get('text', 'N/A')}) is not present in the vector store."
             )
cognee-mcp/src/cognee_client.py-228-232 (1)

228-232: Add missing return statement for consistency.

The direct mode path doesn't return a value, while the API mode returns a JSON response (line 221). This creates inconsistent behavior between the two modes.

🔎 Proposed fix
             with redirect_stdout(sys.stderr):
                 user = await get_default_user()
                 await self.cognee.datasets.delete_data(
                     dataset_id=dataset_id,
                     data_id=data_id,
                     user=user,
                 )
+                return {"status": "success", "message": "Data deleted successfully"}
cognee/modules/graph/methods/upsert_nodes.py-56-58 (1)

56-58: Remove redundant session.commit() call.

The @with_async_session decorator already commits the session when it creates one (see with_async_session.py lines 17-19). The explicit commit here is redundant when the decorator provides the session, and could cause issues if an external session is passed.

🔎 Proposed fix
     await session.execute(upsert_statement)
-
-    await session.commit()
cognee/modules/graph/methods/upsert_edges.py-30-56 (1)

30-56: Add documentation explaining edge deduplication semantics.

The composite ID for edges (lines 37-45) intentionally excludes data_id, unlike upsert_nodes which includes it. This creates different deduplication scopes: edges deduplicate globally within a dataset (based on source node, relationship, and destination node), while nodes are scoped per data item. The deduplicate_nodes_and_edges function confirms this pattern—edge keys use only edge[0] + edge[2] + edge[1], not data_id.

While this design choice appears intentional, the function lacks documentation explaining why edges and nodes have asymmetric deduplication semantics. Add a docstring clarifying this design decision and the rationale for global edge deduplication across data items.
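
A possible docstring along these lines (the signature is approximated from the call site in extract_graph_from_data.py; the wording is illustrative, not the module's current text):

async def upsert_edges(edges, tenant_id, user_id, dataset_id, data_id, session=None):
    """Insert edge ledger rows, skipping rows whose composite id already exists.

    Deduplication scope: unlike upsert_nodes, the composite id deliberately
    excludes data_id, so edges deduplicate globally within a dataset
    (source node, relationship name, destination node), while nodes are
    scoped per data item. This mirrors the keys used by
    deduplicate_nodes_and_edges.
    """
    ...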

cognee/api/v1/datasets/routers/get_datasets_router.py-267-267 (1)

267-267: Update endpoint docstring to match actual HTTP error codes returned by delete_data.

The delete_data method properly handles errors and raises UnauthorizedDataAccessError for authorization and data validation failures. However, this exception is mapped to 401 Unauthorized, not 404 as documented in the endpoint's docstring. Update the docstring to accurately reflect the actual error responses:

  • 401 Unauthorized: User doesn't have access to the dataset or data item doesn't belong to the dataset
  • 500 Internal Server Error: Error during deletion
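
For example, the endpoint docstring could spell the responses out roughly like this (hypothetical wording):

async def delete_data_endpoint(dataset_id, data_id):
    """Delete a single data item from a dataset.

    Error responses:
        401 Unauthorized: the user cannot access the dataset, or the data item
            does not belong to it (UnauthorizedDataAccessError).
        500 Internal Server Error: the deletion itself failed.
    """
    ...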
cognee/modules/graph/methods/delete_dataset_nodes_and_edges.py-21-21 (1)

21-21: Unused parameter: user_id.

The user_id parameter is declared but never referenced in the function body. Remove it from the function signature and update the caller in cognee/api/v1/datasets/datasets.py (line 71).

Additionally, add a docstring to document the function's purpose.
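
A sketch of the trimmed signature with a short docstring (types assumed from the surrounding module):

from uuid import UUID

from sqlalchemy.ext.asyncio import AsyncSession


async def delete_dataset_nodes_and_edges(dataset_id: UUID, session: AsyncSession) -> None:
    """Delete all Node and Edge ledger rows that belong to the given dataset."""
    ...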

cognee/tests/utils/assert_edges_vector_index_present.py-10-21 (1)

10-21: Type hint declares 4-element tuple but function receives 3-element tuples.

The type hint on line 10 specifies Tuple[UUID, UUID, str, Dict] (4 elements), but all callers pass 3-element tuples (source_id, target_id, relationship_name). The function only accesses indices 0-2, and the Dict node data is passed as a separate parameter. Update the type hint to Tuple[UUID, UUID, str] to match the actual data structure.

Additionally, the function lacks a docstring. Consider adding a brief summary of what it does (e.g., formats a relationship tuple into an edge ID mapping, handling special case for "contains" relationships).
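
For instance, the corrected hint could look like this (the function name is a placeholder; only the tuple shape and docstring matter):

from typing import Dict, Tuple
from uuid import UUID


def build_edge_id_mapping(relationship: Tuple[UUID, UUID, str], node: Dict) -> Dict:
    """Format a (source_id, target_id, relationship_name) tuple into an edge id
    mapping, handling the special case of "contains" relationships."""
    ...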

cognee/tests/test_delete_two_users_same_dataset.py-61-69 (1)

61-69: Inconsistent directory naming with file name.

The test directories are named test_delete_two_users_with_legacy_data but the file is named test_delete_two_users_same_dataset.py. This inconsistency could cause confusion and potential file conflicts with test_delete_two_users_with_legacy_data.py if it exists.

     data_directory_path = os.path.join(
-        pathlib.Path(__file__).parent, ".data_storage/test_delete_two_users_with_legacy_data"
+        pathlib.Path(__file__).parent, ".data_storage/test_delete_two_users_same_dataset"
     )
     cognee.config.data_root_directory(data_directory_path)

     cognee_directory_path = os.path.join(
         pathlib.Path(__file__).parent,
-        ".cognee_system/test_delete_two_users_with_legacy_data",
+        ".cognee_system/test_delete_two_users_same_dataset",
     )
cognee/api/v1/datasets/datasets.py-114-115 (1)

114-115: The mode parameter is ignored.

The mode parameter is documented for backwards compatibility but line 115 always passes "soft" to legacy_delete, ignoring the actual mode value.

         if not await has_data_related_nodes(dataset_id, data_id):
-            await legacy_delete(data, "soft")
+            await legacy_delete(data, mode)

If intentionally hardcoded to "soft" for safety, consider removing the parameter or updating the docstring to clarify.

cognee/tests/test_delete_dataset_kuzu.py-20-22 (1)

20-22: Rename test function to follow pytest naming conventions.

The function is named main but decorated with @pytest.mark.asyncio. Pytest discovers tests by looking for functions prefixed with test_. This function will not be discovered by pytest's test collection.

 @pytest.mark.asyncio
 @patch.object(LLMGateway, "acreate_structured_output", new_callable=AsyncMock)
-async def main(mock_create_structured_output: AsyncMock):
+async def test_delete_dataset_kuzu(mock_create_structured_output: AsyncMock):
cognee/api/v1/datasets/datasets.py-107-110 (1)

107-110: Unreachable condition and inefficient list comprehension.

Line 109's if not data check is unreachable because if data is None or falsy, the function already returns on line 105. Additionally, use a generator expression with any() instead of a list.

         data_datasets = data.datasets

-        if not data or not any([dataset.id == dataset_id for dataset in data_datasets]):
+        if not any(ds.id == dataset_id for ds in data_datasets):
             raise UnauthorizedDataAccessError(f"Data {data_id} not accessible.")
cognee/tests/test_delete_two_users_same_dataset.py-56-58 (1)

56-58: Rename test function to follow pytest naming conventions.

Same issue as in test_delete_dataset_kuzu.py - the function main won't be discovered by pytest. Rename to test_delete_two_users_same_dataset.

 @pytest.mark.asyncio
 @patch.object(LLMGateway, "acreate_structured_output", new_callable=AsyncMock)
-async def main(mock_create_structured_output: AsyncMock):
+async def test_delete_two_users_same_dataset(mock_create_structured_output: AsyncMock):
cognee/modules/graph/methods/legacy_delete.py-30-33 (1)

30-33: Potential AttributeError if metadata field lacks a default value.

Accessing subclass.model_fields["metadata"].default.get(...) assumes the metadata field exists and has a dict as its default. If a DataPoint subclass defines metadata without a default or with None, this will raise an AttributeError.

🔎 Suggested defensive check
 for subclass in subclasses:
-    index_fields = subclass.model_fields["metadata"].default.get("index_fields", [])
+    metadata_field = subclass.model_fields.get("metadata")
+    if metadata_field is None or metadata_field.default is None:
+        continue
+    index_fields = metadata_field.default.get("index_fields", [])
     for field_name in index_fields:
         vector_collections.append(f"{subclass.__name__}_{field_name}")
cognee/tasks/graph/extract_graph_from_data.py-100-109 (1)

100-109: Missing validation for required context keys.

When user exists in context, the code assumes context["dataset"] and context["data"] also exist. If these keys are missing, a KeyError will be raised at runtime.

🔎 Suggested defensive check
         user = context["user"] if "user" in context else None
 
-        if user:
+        if user and "dataset" in context and "data" in context:
             await upsert_edges(
                 graph_edges,
                 tenant_id=user.tenant_id,
                 user_id=user.id,
                 dataset_id=context["dataset"].id,
                 data_id=context["data"].id,
             )

Comment on lines 367 to 396
  test-deletion-on-default-graph:
    name: Delete default graph data test ${{ matrix.python-version }} on ubuntu-22.04
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        python-version: ${{ fromJSON(inputs.supported-python-versions) }}
      fail-fast: false
    steps:
      - name: Check out
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Cognee Setup
        uses: ./.github/actions/cognee_setup
        with:
          python-version: ${{ matrix.python-version }}

      - name: Run deletion on default graph
        env:
          ENV: 'dev'
          LLM_MODEL: ${{ secrets.LLM_MODEL }}
          LLM_ENDPOINT: ${{ secrets.LLM_ENDPOINT }}
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
          LLM_API_VERSION: ${{ secrets.LLM_API_VERSION }}
          EMBEDDING_MODEL: ${{ secrets.EMBEDDING_MODEL }}
          EMBEDDING_ENDPOINT: ${{ secrets.EMBEDDING_ENDPOINT }}
          EMBEDDING_API_KEY: ${{ secrets.EMBEDDING_API_KEY }}
          EMBEDDING_API_VERSION: ${{ secrets.EMBEDDING_API_VERSION }}
        run: uv run python ./cognee/tests/test_delete_default_graph.py

⚠️ Potential issue | 🔴 Critical

Matrix references undefined workflow input.

The matrix strategy at line 372 references inputs.supported-python-versions, but this input is not defined in the workflow's inputs section (lines 8-11). This will cause the workflow to fail when triggered.

🔧 Proposed fix

Either add the missing input definition:

 on:
   workflow_call:
     inputs:
       python-version:
         required: false
         type: string
         default: '3.11.x'
+      supported-python-versions:
+        required: false
+        type: string
+        default: '["3.11.x", "3.12.x"]'

Or if only testing a single version, remove the matrix:

-    strategy:
-      matrix:
-        python-version: ${{ fromJSON(inputs.supported-python-versions) }}
-      fail-fast: false
     steps:
       - name: Check out
         uses: actions/checkout@v4
@@ -380,7 +377,7 @@
       - name: Cognee Setup
         uses: ./.github/actions/cognee_setup
         with:
-          python-version: ${{ matrix.python-version }}
+          python-version: ${{ inputs.python-version }}

This same issue affects test jobs at lines 398-427 (test-deletion-on-default-graph-non-mocked) and lines 429-458 (test-deletion-on-custom-graph).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  test-deletion-on-default-graph:
    name: Delete default graph data test ${{ matrix.python-version }} on ubuntu-22.04
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        python-version: ${{ fromJSON(inputs.supported-python-versions) }}
      fail-fast: false
    steps:
      - name: Check out
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Cognee Setup
        uses: ./.github/actions/cognee_setup
        with:
          python-version: ${{ matrix.python-version }}
      - name: Run deletion on default graph
        env:
          ENV: 'dev'
          LLM_MODEL: ${{ secrets.LLM_MODEL }}
          LLM_ENDPOINT: ${{ secrets.LLM_ENDPOINT }}
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
          LLM_API_VERSION: ${{ secrets.LLM_API_VERSION }}
          EMBEDDING_MODEL: ${{ secrets.EMBEDDING_MODEL }}
          EMBEDDING_ENDPOINT: ${{ secrets.EMBEDDING_ENDPOINT }}
          EMBEDDING_API_KEY: ${{ secrets.EMBEDDING_API_KEY }}
          EMBEDDING_API_VERSION: ${{ secrets.EMBEDDING_API_VERSION }}
        run: uv run python ./cognee/tests/test_delete_default_graph.py
  test-deletion-on-default-graph:
    name: Delete default graph data test ${{ inputs.python-version }} on ubuntu-22.04
    runs-on: ubuntu-22.04
    steps:
      - name: Check out
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Cognee Setup
        uses: ./.github/actions/cognee_setup
        with:
          python-version: ${{ inputs.python-version }}
      - name: Run deletion on default graph
        env:
          ENV: 'dev'
          LLM_MODEL: ${{ secrets.LLM_MODEL }}
          LLM_ENDPOINT: ${{ secrets.LLM_ENDPOINT }}
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
          LLM_API_VERSION: ${{ secrets.LLM_API_VERSION }}
          EMBEDDING_MODEL: ${{ secrets.EMBEDDING_MODEL }}
          EMBEDDING_ENDPOINT: ${{ secrets.EMBEDDING_ENDPOINT }}
          EMBEDDING_API_KEY: ${{ secrets.EMBEDDING_API_KEY }}
          EMBEDDING_API_VERSION: ${{ secrets.EMBEDDING_API_VERSION }}
        run: uv run python ./cognee/tests/test_delete_default_graph.py
🧰 Tools
🪛 actionlint (1.7.9)

372-372: property "supported-python-versions" is not defined in object type {python-version: string}

(expression)

🤖 Prompt for AI Agents
In .github/workflows/e2e_tests.yml around lines 367 to 396 (and similarly the
jobs at 398-427 and 429-458), the matrix references
inputs.supported-python-versions which is not defined in the workflow inputs;
either add a top-level workflow input named supported-python-versions containing
a JSON array of versions (e.g. '["3.10","3.11"]') so
fromJSON(inputs.supported-python-versions) resolves, or replace the matrix
reference with an explicit matrix list or a single python-version value in each
job so the matrix does not depend on an undefined input; update all three
affected jobs consistently.

Comment on lines 40 to 51
if not any(
    [
        getattr(args, "dataset_name", None),
        getattr(args, "dataset_id", None),
        getattr(args, "data_id", None),
        getattr(args, "all", False) and getattr(args, "user_id", None),
    ]
):
    fmt.error(
        "Please specify what to delete: --dataset-name, --dataset-id, --data-id, --user-id, or --all"
    )
    return

⚠️ Potential issue | 🔴 Critical

Arguments --dataset-id and --data-id are referenced but not configured.

The validation checks for dataset_id and data_id (lines 43-44), but these arguments are not added in configure_parser. This will cause getattr to return None and the conditions will never be satisfied.

Add the missing argument definitions in configure_parser:

     def configure_parser(self, parser: argparse.ArgumentParser) -> None:
         parser.add_argument("--dataset-name", "-d", help="Specific dataset to delete")
+        parser.add_argument("--dataset-id", help="Dataset ID to delete")
+        parser.add_argument("--data-id", help="Specific data item ID to delete")
         parser.add_argument("--user-id", "-u", help="User ID to delete data for")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/cli/commands/delete_command.py around lines 40 to 51, the validation
references args.dataset_id and args.data_id but those options are not defined in
configure_parser, so getattr will always return None; add argument definitions
for --dataset-id and --data-id in configure_parser (using dest names dataset_id
and data_id respectively), set appropriate types (e.g., int or str to match
storage IDs), include clear help text, and ensure they are optional (default
None) so the validation works as intended; also verify names exactly match the
attributes used in the validation and update any related help/examples if
needed.

Comment on lines 104 to 109
if getattr(args, "all", False):
    if not hasattr(args, "user_id"):
        raise CliCommandException(
            "No user ID provided for '--all' deletion. Please specify using --user-id param."
        )
    await cognee_datasets.delete_all(user_id=args.user_id)

⚠️ Potential issue | 🔴 Critical

Incorrect API call: delete_all expects user, not user_id.

cognee_datasets.delete_all signature is async def delete_all(user: Optional[User] = None) but it's called with user_id=args.user_id. This will cause a TypeError at runtime.

                     if getattr(args, "all", False):
                         if not hasattr(args, "user_id"):
                             raise CliCommandException(
                                 "No user ID provided for '--all' deletion. Please specify using --user-id param."
                             )
-                        await cognee_datasets.delete_all(user_id=args.user_id)
+                        user = await get_user(args.user_id) if args.user_id else await get_default_user()
+                        await cognee_datasets.delete_all(user=user)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cognee/cli/commands/delete_command.py around lines 104-109, the call passes
user_id to delete_all but the function signature expects a User object (user:
Optional[User]). Replace the positional user_id argument with a proper User
object: resolve the user from the ID (e.g. user = await
cognee_datasets.get_user(args.user_id) or the appropriate user-fetch helper) and
call await cognee_datasets.delete_all(user=user); if the lookup returns None
raise a CliCommandException indicating the user was not found.

Comment on lines 10 to 16
@with_async_session
async def get_dataset_related_edges(dataset_id: UUID, session: AsyncSession):
    return (
        await session.scalars(
            select(Edge).where(Edge.dataset_id == dataset_id).distinct(Edge.relationship_name)
        )
    ).all()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verification scripts (condensed from the collapsed log): repository searches and file inspections covered get_dataset_related_edges.py and its sibling graph methods, the Edge model, the relational database configuration (RelationalConfig), the with_async_session decorator, and the CI workflows, and checked for existing callers and tests of this query. The conclusion follows.


distinct(Edge.relationship_name) will fail on SQLite, the default database.

SQLAlchemy 2.0+ translates .distinct(column) to PostgreSQL's DISTINCT ON (column) syntax. Since the default db_provider is SQLite (per RelationalConfig), this code will raise a runtime error in production. Use .distinct() without arguments (standard SQL) or refactor using group_by for portability across all supported databases.

Additionally, add a return type annotation and docstring per coding guidelines:

Suggested fix
 @with_async_session
-async def get_dataset_related_edges(dataset_id: UUID, session: AsyncSession):
+async def get_dataset_related_edges(dataset_id: UUID, session: AsyncSession) -> list[Edge]:
+    """Retrieve all Edge records for the given dataset."""
     return (
         await session.scalars(
-            select(Edge).where(Edge.dataset_id == dataset_id).distinct(Edge.relationship_name)
+            select(Edge).where(Edge.dataset_id == dataset_id).distinct()
         )
     ).all()
🤖 Prompt for AI Agents
In cognee/modules/graph/methods/get_dataset_related_edges.py around lines 10 to
16, the query uses distinct(Edge.relationship_name) which emits
PostgreSQL-specific DISTINCT ON and will fail on the default SQLite provider;
change the query to use .distinct() (no args) or refactor to use
group_by(Edge.relationship_name) so it is portable across DBs, and add a return
type annotation (e.g., -> list[Edge] or Sequence[Edge]) and a short docstring
describing the function, params, and return value per coding guidelines.
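
If only the distinct relationship names are actually needed, a portable variant could be sketched as follows (the Edge import path is assumed; not verified against the codebase):

from uuid import UUID

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from cognee.modules.graph.models.Edge import Edge  # assumed import path


async def get_dataset_relationship_names(dataset_id: UUID, session: AsyncSession) -> list:
    """Return distinct relationship names for a dataset using portable SQL (GROUP BY)."""
    result = await session.scalars(
        select(Edge.relationship_name)
        .where(Edge.dataset_id == dataset_id)
        .group_by(Edge.relationship_name)
    )
    return list(result.all())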

Comment on lines 28 to 54
    upsert_statement = (
        insert(Node)
        .values(
            [
                {
                    "id": uuid5(
                        NAMESPACE_OID,
                        str(tenant_id)
                        + str(user_id)
                        + str(dataset_id)
                        + str(data_id)
                        + str(node.id),
                    ),
                    "slug": node.id,
                    "user_id": user_id,
                    "data_id": data_id,
                    "dataset_id": dataset_id,
                    "type": node.type,
                    "indexed_fields": DataPoint.get_embeddable_property_names(node),
                    "label": getattr(node, "label", getattr(node, "name", str(node.id))),
                    "attributes": jsonable_encoder(node),
                }
                for node in nodes
            ]
        )
        .on_conflict_do_nothing(index_elements=["id"])
    )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verification scripts (condensed from the collapsed log): inspections covered upsert_nodes.py and upsert_edges.py, the relational engine configuration (create_relational_engine.py configures both sqlite+aiosqlite and postgresql+asyncpg), existing on_conflict_do_nothing usages, and a search for empty-list guards and related tests. The conclusion follows.


Add guard clause for empty list and use database-agnostic conflict handling.

Two issues:

  1. Empty list handling: If nodes is empty, this generates invalid SQL. Add an early return guard.

  2. Database portability: Code uses from sqlalchemy.dialects.postgresql import insert with .on_conflict_do_nothing(), which only works with PostgreSQL. The codebase supports both SQLite (sqlite+aiosqlite) and PostgreSQL (postgresql+asyncpg) as configured in create_relational_engine.py. SQLite requires different conflict-handling syntax (e.g., INSERT OR IGNORE).

For portability, use SQLAlchemy's generic insert() from sqlalchemy and apply dialect-specific conflict handling conditionally, or validate that all environments use PostgreSQL.

Proposed fix for empty list
+    if not nodes:
+        return
+
     upsert_statement = (
         insert(Node)
         .values(
🤖 Prompt for AI Agents
In cognee/modules/graph/methods/upsert_nodes.py around lines 28-54, add an early
guard that returns immediately when the incoming nodes list is empty to avoid
generating invalid SQL; then replace the PostgreSQL-only dialect insert usage
with SQLAlchemy's generic insert() and apply dialect-specific conflict handling:
detect the engine/connection dialect (e.g., engine.dialect.name or
connection.dialect.name) and if it is 'postgresql' use the PostgreSQL conflict
API (on_conflict_do_nothing), otherwise for SQLite apply the appropriate generic
approach (e.g., INSERT OR IGNORE via a prefix/prefix_with or use SQLite-specific
conflict clause) so the upsert works across configured backends; keep the rest
of the values/values-building logic the same and only branch when composing the
final statement.
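
A rough sketch of that dialect-aware composition (build_node_row, the Node model, and the dialect-detection point are assumptions for illustration, not the project's actual helpers):

from sqlalchemy import insert
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.ext.asyncio import AsyncSession


async def upsert_nodes_portable(nodes: list, session: AsyncSession) -> None:
    """Insert node ledger rows while tolerating duplicates on SQLite and PostgreSQL."""
    if not nodes:
        # Guard: .values([]) would otherwise render invalid SQL.
        return

    rows = [build_node_row(node) for node in nodes]  # build_node_row is hypothetical

    # Node is assumed to be the SQLAlchemy ledger model used by upsert_nodes.
    if session.bind.dialect.name == "postgresql":
        statement = pg_insert(Node).values(rows).on_conflict_do_nothing(index_elements=["id"])
    else:
        # SQLite path: INSERT OR IGNORE via a statement prefix.
        statement = insert(Node).values(rows).prefix_with("OR IGNORE")

    await session.execute(statement)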

Comment on lines 22 to 24
            assert False, (
                f"Edge '{relationship_name}' still present between '{source_node['name'] if 'node' in source_node else source_node['id']}' and '{destination_node['name'] if 'node' in destination_node else destination_node['id']}' in graph database."
            )

⚠️ Potential issue | 🔴 Critical

Potential logic error in node name resolution.

The condition if 'node' in source_node (and for destination_node) appears incorrect. The code likely intends to check if 'name' in source_node to determine whether to use the name field or fall back to id.

🔎 Proposed fix
             assert False, (
-                f"Edge '{relationship_name}' still present between '{source_node['name'] if 'node' in source_node else source_node['id']}' and '{destination_node['name'] if 'node' in destination_node else destination_node['id']}' in graph database."
+                f"Edge '{relationship_name}' still present between '{source_node['name'] if 'name' in source_node else source_node['id']}' and '{destination_node['name'] if 'name' in destination_node else destination_node['id']}' in graph database."
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
            assert False, (
                f"Edge '{relationship_name}' still present between '{source_node['name'] if 'node' in source_node else source_node['id']}' and '{destination_node['name'] if 'node' in destination_node else destination_node['id']}' in graph database."
            )
            assert False, (
                f"Edge '{relationship_name}' still present between '{source_node['name'] if 'name' in source_node else source_node['id']}' and '{destination_node['name'] if 'name' in destination_node else destination_node['id']}' in graph database."
            )
🤖 Prompt for AI Agents
In cognee/tests/utils/assert_graph_edges_not_present.py around lines 22 to 24,
the logic uses "if 'node' in source_node" (and similarly for destination_node)
when resolving the display name, which is incorrect; change the condition to
check for the presence of the "name" key (i.e., use "if 'name' in source_node"
and "if 'name' in destination_node") or, simpler and safer, use the dict
.get('name', id) pattern to select source_node.get('name', source_node['id'])
and destination_node.get('name', destination_node['id']) so the message shows
the name when available and falls back to id.

from cognee.shared.data_models import KnowledgeGraph


def extract_entities(graph: KnowledgeGraph, cache: dict = {}):

⚠️ Potential issue | 🔴 Critical

Mutable default argument causes shared state across function calls.

The default cache: dict = {} creates a single shared dictionary that persists across all calls to extract_entities. This is a well-known Python anti-pattern that can lead to unexpected behavior and hard-to-debug issues.

🔎 Proposed fix using None as default
-def extract_entities(graph: KnowledgeGraph, cache: dict = {}):
+def extract_entities(graph: KnowledgeGraph, cache: dict | None = None):
+    if cache is None:
+        cache = {}
     entities = []
     entity_types = []
🤖 Prompt for AI Agents
In cognee/tests/utils/extract_entities.py around line 6, the function signature
uses a mutable default argument `cache: dict = {}` which creates shared state
across calls; change the signature to accept None instead (e.g., `cache:
Optional[dict] = None`) and inside the function initialize `cache = {}` when
it's None so each call gets a fresh dictionary; keep the type annotation
consistent (import Optional if needed) and ensure subsequent code uses the local
`cache` variable.

Comment on lines 13 to 17
            relationship = (
                generate_edge_id(edge.source_node_id),
                generate_edge_id(edge.target_node_id),
                edge.relationship_name,
            )

⚠️ Potential issue | 🔴 Critical

Use generate_node_id for node identifiers, not generate_edge_id.

edge.source_node_id and edge.target_node_id are node identifiers, not edge identifiers. The code on lines 25-26 correctly uses generate_node_id() for nodes, but here generate_edge_id() is used inconsistently. This will produce different UUIDs for the same node depending on context, likely breaking relationship integrity.

🔎 Proposed fix
         if edge_id not in cache:
             relationship = (
-                generate_edge_id(edge.source_node_id),
-                generate_edge_id(edge.target_node_id),
+                generate_node_id(edge.source_node_id),
+                generate_node_id(edge.target_node_id),
                 edge.relationship_name,
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
            relationship = (
                generate_edge_id(edge.source_node_id),
                generate_edge_id(edge.target_node_id),
                edge.relationship_name,
            )
            relationship = (
                generate_node_id(edge.source_node_id),
                generate_node_id(edge.target_node_id),
                edge.relationship_name,
            )
🤖 Prompt for AI Agents
In cognee/tests/utils/extract_relationships.py around lines 13 to 17, the tuple
uses generate_edge_id() for edge.source_node_id and edge.target_node_id but
these are node identifiers; replace generate_edge_id(...) with
generate_node_id(...) for both source and target so node IDs are generated
consistently (matching the usage on lines ~25-26) and relationship integrity is
preserved.

from cognee.tasks.summarization.models import TextSummary


def extract_summary(document_chunk: DocumentChunk, summary=SummarizedContent) -> TextSummary:

⚠️ Potential issue | 🔴 Critical

Fix the type annotation syntax error.

The parameter summary=SummarizedContent uses assignment syntax (=) instead of type annotation syntax (:). This sets the default value to the class itself rather than properly annotating the expected type. When called without a second argument, summary.summary on line 10 will fail because SummarizedContent (the class) has no summary attribute.

🔎 Proposed fix
-def extract_summary(document_chunk: DocumentChunk, summary=SummarizedContent) -> TextSummary:
+def extract_summary(document_chunk: DocumentChunk, summary: SummarizedContent) -> TextSummary:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def extract_summary(document_chunk: DocumentChunk, summary=SummarizedContent) -> TextSummary:
def extract_summary(document_chunk: DocumentChunk, summary: SummarizedContent) -> TextSummary:
🤖 Prompt for AI Agents
In cognee/tests/utils/extract_summary.py around line 7, the function parameter
uses assignment syntax "summary=SummarizedContent" which sets the default to the
class instead of annotating the type; replace the "=" with ":" to make it a
proper type annotation (summary: SummarizedContent). If a default value is
actually required, instead use an explicit default like "summary:
Optional[SummarizedContent] = None" and update the function body to handle the
None case accordingly.

Comment on lines 44 to 46
class Data(BaseModel):
    id: Optional[UUID] = uuid4()
    payload: Dict[str, List]

⚠️ Potential issue | 🔴 Critical

Critical: Default id is shared across all instances.

Using id: Optional[UUID] = uuid4() evaluates uuid4() once at class definition time. All Data instances created without an explicit id will share the same UUID, breaking uniqueness.

Use Field(default_factory=...) to generate a new UUID per instance.

Proposed fix
+from pydantic import BaseModel, Field
+
 class Data(BaseModel):
-    id: Optional[UUID] = uuid4()
+    id: Optional[UUID] = Field(default_factory=uuid4)
     payload: Dict[str, List]
🤖 Prompt for AI Agents
In examples/low_level/pipeline.py around lines 44 to 46, the Data model sets id:
Optional[UUID] = uuid4() which calls uuid4() at import time and shares the same
UUID across all instances; replace the default with a per-instance factory by
using Field(default_factory=uuid4) (ensure Field is imported from pydantic) so
each Data() created without an explicit id gets a unique UUID.

@Vasilije1990 Vasilije1990 force-pushed the feature/cog-3014-refactor-delete-feature branch from 2ff8224 to fae9869 on January 2, 2026 at 13:04

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py (1)

16-17: Consider adding docstrings to test functions.

While the function names are descriptive, adding brief docstrings would document the test intent and expected behavior more explicitly.

Example docstring format
 @pytest.mark.asyncio
 async def test_delete_dataset_related_nodes_removes_only_target_dataset():
+    """Verify delete_dataset_related_nodes removes only nodes from the target dataset."""
     await create_db_and_tables()
 @pytest.mark.asyncio
 async def test_delete_dataset_related_edges_removes_only_target_dataset():
+    """Verify delete_dataset_related_edges removes only edges from the target dataset."""
     await create_db_and_tables()
 @pytest.mark.asyncio
 async def test_delete_dataset_related_nodes_edges_noop_on_empty_tables():
+    """Verify deletion functions handle empty tables without raising exceptions."""
     await create_db_and_tables()

As per coding guidelines, undocumented function definitions are assumed incomplete.

Also applies to: 80-81, 159-160

cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py (1)

16-17: Consider adding docstrings to test functions.

While the function names are descriptive, adding brief docstrings would document the test intent and expected behavior more explicitly.

Example docstring format
 @pytest.mark.asyncio
 async def test_delete_data_related_nodes_removes_only_target_data():
+    """Verify delete_data_related_nodes removes only nodes from the target data_id."""
     await create_db_and_tables()
 @pytest.mark.asyncio
 async def test_delete_data_related_edges_removes_only_target_data():
+    """Verify delete_data_related_edges removes only edges from the target data_id."""
     await create_db_and_tables()
 @pytest.mark.asyncio
 async def test_delete_data_related_nodes_edges_noop_on_empty_tables():
+    """Verify deletion functions handle empty tables without raising exceptions."""
     await create_db_and_tables()

As per coding guidelines, undocumented function definitions are assumed incomplete.

Also applies to: 76-77, 151-152

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff8224 and fae9869.

📒 Files selected for processing (2)
  • cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py
  • cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py
  • cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py
  • cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py
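For reference, a minimal sketch of the style points above (4-space indentation, snake_case functions, PascalCase classes, a docstring on each definition, and explicit error handling). The names are invented for illustration and are not part of the PR:

class CleanupError(Exception):
    """Raised when a cleanup step cannot complete."""


def count_other_rows(rows, target_id):
    """Count rows whose dataset_id differs from target_id, failing loudly on bad input."""
    try:
        return sum(1 for row in rows if row["dataset_id"] != target_id)
    except KeyError as error:
        raise CleanupError(f"Row is missing expected key: {error}") from error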
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py
  • cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py
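A short sketch of that convention; the get_logger name is an assumption about what cognee.shared.logging_utils exposes, so verify against the module before copying:

from cognee.shared.logging_utils import get_logger  # assumed helper name

logger = get_logger(__name__)


def remove_stale_entries(entries):
    """Log and drop entries marked as stale (illustrative only)."""
    kept = [entry for entry in entries if not entry.get("stale")]
    logger.info("Removed %d stale entries", len(entries) - len(kept))
    return kept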
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py
  • cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py
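A bare-bones skeleton that satisfies both rules, assuming pytest and pytest-asyncio are installed; the file would sit under cognee/tests/unit/ and be named test_<topic>.py:

import asyncio

import pytest


@pytest.mark.asyncio
async def test_placeholder_behavior():
    """Stand-in async test; replace the sleep with the code under test."""
    await asyncio.sleep(0)
    assert True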
🧠 Learnings (2)
📓 Common learnings
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 251
File: cognee/tests/infrastructure/databases/test_index_graph_edges.py:0-0
Timestamp: 2024-12-04T18:37:55.092Z
Learning: In the `index_graph_edges` function, both graph engine and vector engine initialization failures are handled within the same try-except block, so a single test covers both cases.
📚 Learning: 2024-11-13T16:06:32.576Z
Learnt from: hajdul88
Repo: topoteretes/cognee PR: 196
File: cognee/modules/graph/cognee_graph/CogneeGraph.py:32-38
Timestamp: 2024-11-13T16:06:32.576Z
Learning: In `CogneeGraph.py`, within the `CogneeGraph` class, it's intentional to add skeleton edges in both the `add_edge` method and the `project_graph_from_db` method to ensure that edges are added to the graph and to the nodes.

Applied to files:

  • cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test Pipeline Caching
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test adding of label for data in Cognee
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Linting
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: CLI Tests / CLI Unit Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (2)
cognee/tests/unit/graph/test_delete_dataset_related_nodes_edges.py (1)

1-169: Well-structured tests with comprehensive coverage.

The test suite effectively validates dataset-scoped deletion behavior:

  • Selective deletion (only target dataset removed)
  • Non-interference (other datasets remain intact)
  • Empty-table handling (no exceptions on empty state)

The test logic is correct, assertions are appropriate, and the data setup comprehensively covers the scenarios.
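For readers skimming the review, a self-contained illustration of the selective-deletion pattern these tests assert. Plain sqlite3 stands in for the project's database layer and for the delete_dataset_related_nodes / delete_dataset_related_edges helpers, and the table and column names are invented for the sketch:

import sqlite3


def selective_delete_illustration():
    """Delete rows for one dataset_id and confirm other datasets survive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id TEXT, dataset_id TEXT)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?)",
        [("n1", "dataset-a"), ("n2", "dataset-b"), ("n3", "dataset-b")],
    )
    conn.execute("DELETE FROM nodes WHERE dataset_id = ?", ("dataset-b",))
    remaining = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    assert remaining == 1  # only the unrelated dataset's node is left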

cognee/tests/unit/graph/test_delete_data_related_nodes_edges.py (1)

1-161: Well-structured tests with comprehensive coverage.

The test suite effectively validates data-scoped deletion behavior:

  • Selective deletion (only target data removed)
  • Non-interference (other data remains intact)
  • Empty-table handling (no exceptions on empty state)

The test logic is correct, assertions are appropriate, and the structure mirrors the dataset-scoped tests, providing consistent coverage across both deletion scopes.
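The empty-table case follows the same illustrative shape (again sqlite3 as a stand-in; the real tests invoke cognee's deletion helpers):

import sqlite3


def noop_on_empty_table_illustration():
    """Deleting by a non-existent data_id from an empty table should not raise."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id TEXT, data_id TEXT)")
    conn.execute("DELETE FROM nodes WHERE data_id = ?", ("missing",))
    assert conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0] == 0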
