Add custom Python Directory support for Index by JingsongLi · Pull Request #632 · quickwit-oss/tantivy-py

JingsongLi · 2026-03-31T10:55:48Z

Allow users to pass a custom directory object to the Index constructor via the directory parameter, enabling index storage backed by any Python-implemented backend (e.g., cloud object store, database, or in-memory dict).

Allow users to pass a custom directory object to the Index constructor via the `directory` parameter, enabling index storage backed by any Python-implemented backend (e.g., cloud object store, database, or in-memory dict).

cjrh · 2026-03-31T18:29:16Z

Thanks for this, I will go through it in the next few days.

cjrh · 2026-04-18T01:10:57Z



+class Directory(Protocol):
+    """Protocol that custom directory objects must implement."""


Docstrings should be on the struct or struct methods, not in the pyi file

cjrh · 2026-04-18T01:47:17Z

+            io_error: Arc::new(io::Error::other(e.to_string())),
+            filepath: path.to_path_buf(),
+        })?;
+


This appears to be converting any kind of error into an IO error.

This could break the locking behaviour in tantivy. Tantivy acquires directory locks by calling open_write on a lock file and retrying if it gets back OpenWriteError::FileAlreadyExists. Look at tantivy's own source try_acquire_lock in ~/.cargo/registry/src/.../tantivy-0.25.0/src/directory/directory.rs:78-86:

let mut write = directory.open_write(filepath).map_err(|e| match e { OpenWriteError::FileAlreadyExists(_) => TryAcquireLockError::FileExists, // retry OpenWriteError::IoError { io_error, .. } => TryAcquireLockError::IoError(io_error), // bail })?;

The retry loop only triggers on FileAlreadyExists. IoError is treated as fatal and propagated
up immediately. Blocking locks retry 100 times at 100ms intervals (10s timeout) but this only happens
when they see FileAlreadyExists.

This PR's own test InMemoryDirectory.open_write already raises FileExistsError which is correct, but but the rust code here swallows it into IoError, so tantivy's retry logic never fires. A second concurrent writer will fail immediately instead of waiting for the first to release and retrying.

An idea for a fix might look something this:

.map_err(|e: PyErr| Python::attach(|py| { if e.is_instance_of::<pyo3::exceptions::PyFileExistsError>(py) { OpenWriteError::FileAlreadyExists(path.to_path_buf().into()) } else { OpenWriteError::IoError { io_error: Arc::new(io::Error::other(e.to_string())), filepath: path.to_path_buf(), } } }))?;

I am not sure. Please think about it carefully.

I think this consideration should also definitely be added to the docs. Maybe something like "open_write must raise FileExistsError when the path is already open for writing (required for lock contention)."

cjrh · 2026-04-18T01:51:52Z

+        .map_err(|e: PyErr| DeleteError::IoError {
+            io_error: Arc::new(io::Error::other(e.to_string())),
+            filepath: path.to_path_buf(),
+        })?;


Every error is being mapped to IoError here. This is a a somewhat similar issue to my comment in open_write, but not exactly the same.

The issue here is that tantivy already has a DeleteError::FileDoesNotExist variant, but there is no way to surface that here. And I think it could be necessary because tantivy's
own docstring on Directory::delete (tantivy's directory.rs:127-132) says: "Removing a nonexistent
file, returns a DeleteError::FileDoesNotExist." Callers may branch on this variant (e.g., the
ManagedDirectory garbage collector cares whether a delete failed because the file was already
gone vs. because of a real I/O fault). If every error collapses to IoError, legitimate
"already deleted" outcomes look like failures.

The get_file_handle impl here is doing things more correctly, maybe some of that code can be reused here?

Like using the helper is_file_not_found in the same way.

cjrh · 2026-04-18T01:53:29Z

+        .map_err(|e: PyErr| OpenReadError::IoError {
+            io_error: Arc::new(io::Error::other(e.to_string())),
+            filepath: path.to_path_buf(),
+        })


I think maybe exists should return Ok(false) on FileNotFoundError rather
than an OpenReadError::IoError. Otherwise why return a bool?

cjrh · 2026-04-18T02:00:02Z

+                        OpenReadError::FileDoesNotExist(path.to_path_buf())
+                    } else {
+                        OpenReadError::IoError {
+                            io_error: Arc::new(io::Error::other(e.to_string())),


The e.to_string() is happening in a bunch of places in this file, but it causes the python traceback to be lost, including chained exceptions.

Here's an idea for a newtype err that can be used to preserve the traces. This would go at the top of the file:

/// Captures a Python exception's type, message, and traceback as a formatted /// string at construction time, so the full stack survives being wrapped in /// `io::Error` and crossing back to Rust error types. /// /// The traceback is rendered eagerly while the GIL is held — `Display` itself /// does no Python work and is safe to call from any context. #[derive(Debug)] struct PyErrContext(String); impl PyErrContext { fn new(py: Python<'_>, err: &PyErr) -> Self { let mut msg = err.to_string(); // "ExceptionType: message" if let Some(tb) = err.traceback(py) { if let Ok(tb_str) = tb.format() { msg.push('\n'); msg.push_str(&tb_str); } } Self(msg) } } impl fmt::Display for PyErrContext { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { f.write_str(&self.0) } } impl std::error::Error for PyErrContext {}

And then here at all the code locations where it currently does e.to_string(), you could do this instead:

let data: Vec<u8> = Python::attach(|py| { self.py_object .call_method1(py, "get_file_handle", (&path_str,)) .and_then(|result| result.extract::<Vec<u8>>(py)) .map_err(|e| { if is_file_not_found(py, &e) { OpenReadError::FileDoesNotExist(path.to_path_buf()) } else { OpenReadError::IoError { // io_error: Arc::new(io::Error::other(e.to_string())), // OLD io_error: Arc::new(io::Error::other(PyErrContext::new(py, &e))), filepath: path.to_path_buf(), } } }) })?;

You should also add tests with error cases - so you can check that exceptions are handled correctly.

cjrh · 2026-04-18T02:01:40Z

+
+    fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()> {
+        let path_str = path.to_string_lossy().to_string();
+        let data = data.to_vec();


I think this might be redundant.

cjrh · 2026-04-18T02:01:56Z

+
+impl Write for PyWritePtr {
+    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
+        let data = buf.to_vec();


Maybe redundant?

cjrh · 2026-04-18T02:03:06Z

+        let len = data.len();
+
+        Python::attach(|py| {
+            let py_bytes = PyBytes::new(py, &data);


If you remove the to_vec, this can become PyBytes::new(py, data), the slice should go straight through.

cjrh · 2026-04-18T02:15:46Z

        schema: &Schema,
        path: Option<&str>,
        reuse: bool,
+        directory: Option<Py<PyAny>>,


One of the tricky issues is that the Python GIL (for the non-freethreaded build) is going to limit the concurrency for segment writers. Users will call index.writer(mem, num_cpus) and be surprised to discover that they still only get performances of about 1 cpu. This is a consquence of exposing the raw Directory methods directly in the Python interface.

The following is a hypothetical argument to explain the consequences of exposing the raw Directory interface. Imagine if, instead of Directory, we exposed a different kind of interface, say a "Blob Store":

class BlobStore(Protocol): def read(self, key: str) -> bytes: ... def write(self, key: str, data: bytes) -> None: ... def delete(self, key: str) -> None: ... def exists(self, key: str) -> bool: ...

So the idea would be that the caller can provide either a path or an instance of BlobStore to the index construction method.

Inside the rust implementation we actually implemented all the detailed Directory methods to be compatible with the internal tantivy trait.

There is a big difference between the potentical concurrency of our BlobStore interface, above, and the PyDirectory. Because inside the impl of our BlobStore, we can release the GIL for everything get gets done inside the BlobStore.write(...)method, because it's all rust, and the internal behaviour of tantivy's concurrency goes at full speed. But forPyDirectory, it doesn't work because all of the 8 fine-grained methods on the PyDirectory` class will all compete for the GIL during normal usage. Several of them will need to be used during writes, and they will contend with each other to acquire the GIL lock.

Does that make sense?

On the one hand, it seems attractive to expose the raw Directory interface, but on the other hand we get unfortunate collisions between the python and rust systems like this. It seems important enough that if we wanted to continue with this PR, we'd have to clamp the num_cpus parameter to 1 in the case of a custom directory parameter.

cjrh · 2026-04-18T02:17:41Z

+            self._writers.pop(writer_id, None)
+
+
+class TestPyDirectory:


We'll need to check all the error paths in the directory code because it's a fundamental interface. Currently it looks like these tests are all happy path only?

cjrh

I left a lot of feedback. It's a complex PR and it makes me quite nervous to add this new interface to tantivy-py. I will be on the hook to maintain this code in the future.

However, I ackknowledge the value of the idea, to make abstract the storage location of the data.

How important is it to your use-case to mimic the Directory object exactly, versus a simpler storage abstraction like a BlobStore:

  class BlobStore(Protocol):                                                                    
      def read(self, key: str) -> bytes: ...                                                    
      def write(self, key: str, data: bytes) -> None: ...
      def delete(self, key: str) -> None: ...                                                   
      def exists(self, key: str) -> bool: ...

?

JingsongLi · 2026-04-26T12:37:41Z

Hi @cjrh , thank you very much for your feedback! Our specific use case involves using Python to read Tantivy indices. Since our indices are stored on object storage (S3), we would like to access them directly via remote access; otherwise, the data transfer costs for downloading them would be prohibitively expensive.

BlobStore looks good to me! In fact, current requirements call for only a single read method interface.

The code used is here, FYI.
https://github.com/apache/paimon/blob/master/paimon-python/pypaimon/globalindex/tantivy/tantivy_full_text_global_index_reader.py

cjrh · 2026-04-26T13:44:39Z

@JingsongLi Is it ok with you if I add the BlobStore object in a separate PR? Or do you specifically want to do it yourself?

JingsongLi · 2026-04-29T14:07:32Z

@JingsongLi Is it ok with you if I add the BlobStore object in a separate PR? Or do you specifically want to do it yourself?

Thanks! It would be even better if you could come to add the BlobStore~ (Sorry for the slow response. I am currently on vacation)

JingsongLi added 3 commits March 31, 2026 18:54

Add custom Python Directory support for Index

373de62

Allow users to pass a custom directory object to the Index constructor via the `directory` parameter, enabling index storage backed by any Python-implemented backend (e.g., cloud object store, database, or in-memory dict).

Fix formats

3c32f26

Fix lint

d5d653c

JingsongLi added 2 commits April 1, 2026 09:12

Add close callback to drop

4e7cce0

Add close to protocol

a797c8c

cjrh reviewed Apr 18, 2026

View reviewed changes

cjrh requested changes Apr 18, 2026

View reviewed changes



		class Directory(Protocol):
		"""Protocol that custom directory objects must implement."""

Conversation

JingsongLi commented Mar 31, 2026

Uh oh!

cjrh commented Mar 31, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cjrh Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cjrh left a comment

Choose a reason for hiding this comment

Uh oh!

JingsongLi commented Apr 26, 2026

Uh oh!

cjrh commented Apr 26, 2026

Uh oh!

JingsongLi commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cjrh Apr 18, 2026 •

edited

Loading

JingsongLi commented Apr 29, 2026 •

edited

Loading