Skip to content

Add custom Python Directory support for Index#632

Open
JingsongLi wants to merge 5 commits intoquickwit-oss:masterfrom
JingsongLi:support_directory
Open

Add custom Python Directory support for Index#632
JingsongLi wants to merge 5 commits intoquickwit-oss:masterfrom
JingsongLi:support_directory

Conversation

@JingsongLi
Copy link
Copy Markdown

Allow users to pass a custom directory object to the Index constructor via the directory parameter, enabling index storage backed by any Python-implemented backend (e.g., cloud object store, database, or in-memory dict).

Allow users to pass a custom directory object to the Index constructor
via the `directory` parameter, enabling index storage backed by any
Python-implemented backend (e.g., cloud object store, database, or
in-memory dict).
@cjrh
Copy link
Copy Markdown
Collaborator

cjrh commented Mar 31, 2026

Thanks for this, I will go through it in the next few days.

Comment thread tantivy/tantivy.pyi


class Directory(Protocol):
"""Protocol that custom directory objects must implement."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstrings should be on the struct or struct methods, not in the pyi file

Comment thread src/directory.rs
io_error: Arc::new(io::Error::other(e.to_string())),
filepath: path.to_path_buf(),
})?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be converting any kind of error into an IO error.

This could break the locking behaviour in tantivy. Tantivy acquires directory locks by calling open_write on a lock file and retrying if it gets back OpenWriteError::FileAlreadyExists. Look at tantivy's own source try_acquire_lock in ~/.cargo/registry/src/.../tantivy-0.25.0/src/directory/directory.rs:78-86:

  let mut write = directory.open_write(filepath).map_err(|e| match e {
      OpenWriteError::FileAlreadyExists(_) => TryAcquireLockError::FileExists,  // retry        
      OpenWriteError::IoError { io_error, .. } => TryAcquireLockError::IoError(io_error),  //   
  bail                                                                                          
  })?;   

The retry loop only triggers on FileAlreadyExists. IoError is treated as fatal and propagated
up immediately. Blocking locks retry 100 times at 100ms intervals (10s timeout) but this only happens
when they see FileAlreadyExists.

This PR's own test InMemoryDirectory.open_write already raises FileExistsError which is correct, but but the rust code here swallows it into IoError, so tantivy's retry logic never fires. A second concurrent writer will fail immediately instead of waiting for the first to release and retrying.

An idea for a fix might look something this:

  .map_err(|e: PyErr| Python::attach(|py| {                                                     
      if e.is_instance_of::<pyo3::exceptions::PyFileExistsError>(py) {                          
          OpenWriteError::FileAlreadyExists(path.to_path_buf().into())                          
      } else {                                                                                  
          OpenWriteError::IoError {                                                             
              io_error: Arc::new(io::Error::other(e.to_string())),
              filepath: path.to_path_buf(),                                                     
          }
      }                                                                                         
  }))?;  

I am not sure. Please think about it carefully.

I think this consideration should also definitely be added to the docs. Maybe something like "open_write must raise FileExistsError when the path is already open for writing (required for lock contention)."

Comment thread src/directory.rs
Comment on lines +83 to +86
.map_err(|e: PyErr| DeleteError::IoError {
io_error: Arc::new(io::Error::other(e.to_string())),
filepath: path.to_path_buf(),
})?;
Copy link
Copy Markdown
Collaborator

@cjrh cjrh Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every error is being mapped to IoError here. This is a a somewhat similar issue to my comment in open_write, but not exactly the same.

The issue here is that tantivy already has a DeleteError::FileDoesNotExist variant, but there is no way to surface that here. And I think it could be necessary because tantivy's
own docstring on Directory::delete (tantivy's directory.rs:127-132) says: "Removing a nonexistent
file, returns a DeleteError::FileDoesNotExist." Callers may branch on this variant (e.g., the
ManagedDirectory garbage collector cares whether a delete failed because the file was already
gone vs. because of a real I/O fault). If every error collapses to IoError, legitimate
"already deleted" outcomes look like failures.

The get_file_handle impl here is doing things more correctly, maybe some of that code can be reused here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like using the helper is_file_not_found in the same way.

Comment thread src/directory.rs
Comment on lines +98 to +101
.map_err(|e: PyErr| OpenReadError::IoError {
io_error: Arc::new(io::Error::other(e.to_string())),
filepath: path.to_path_buf(),
})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe exists should return Ok(false) on FileNotFoundError rather
than an OpenReadError::IoError. Otherwise why return a bool?

Comment thread src/directory.rs
OpenReadError::FileDoesNotExist(path.to_path_buf())
} else {
OpenReadError::IoError {
io_error: Arc::new(io::Error::other(e.to_string())),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The e.to_string() is happening in a bunch of places in this file, but it causes the python traceback to be lost, including chained exceptions.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an idea for a newtype err that can be used to preserve the traces. This would go at the top of the file:

/// Captures a Python exception's type, message, and traceback as a formatted                 
/// string at construction time, so the full stack survives being wrapped in
/// `io::Error` and crossing back to Rust error types.                                        
///                                                                                           
/// The traceback is rendered eagerly while the GIL is held — `Display` itself                
/// does no Python work and is safe to call from any context.                                 
#[derive(Debug)]                                                                              
struct PyErrContext(String);                                                                  
                                                                                              
impl PyErrContext {             
    fn new(py: Python<'_>, err: &PyErr) -> Self {                                             
        let mut msg = err.to_string(); // "ExceptionType: message"                            
        if let Some(tb) = err.traceback(py) {
            if let Ok(tb_str) = tb.format() {                                                 
                msg.push('\n');                                                               
                msg.push_str(&tb_str);
            }                                                                                 
        }                                                  
        Self(msg)
    }
}

impl fmt::Display for PyErrContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }                                                                                       
}                                                                                             
                                
impl std::error::Error for PyErrContext {}        

And then here at all the code locations where it currently does e.to_string(), you could do this instead:

let data: Vec<u8> = Python::attach(|py| {
    self.py_object
        .call_method1(py, "get_file_handle", (&path_str,))
        .and_then(|result| result.extract::<Vec<u8>>(py))
        .map_err(|e| {
            if is_file_not_found(py, &e) {
                OpenReadError::FileDoesNotExist(path.to_path_buf())
            } else {
                OpenReadError::IoError {
                    // io_error: Arc::new(io::Error::other(e.to_string())),     // OLD
                    io_error: Arc::new(io::Error::other(PyErrContext::new(py, &e))),
                    filepath: path.to_path_buf(),
                }
            }
        })
})?;

You should also add tests with error cases - so you can check that exceptions are handled correctly.

Comment thread src/directory.rs

fn atomic_write(&self, path: &Path, data: &[u8]) -> io::Result<()> {
let path_str = path.to_string_lossy().to_string();
let data = data.to_vec();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be redundant.

Comment thread src/directory.rs

impl Write for PyWritePtr {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
let data = buf.to_vec();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe redundant?

Comment thread src/directory.rs
let len = data.len();

Python::attach(|py| {
let py_bytes = PyBytes::new(py, &data);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove the to_vec, this can become PyBytes::new(py, data), the slice should go straight through.

Comment thread src/index.rs
schema: &Schema,
path: Option<&str>,
reuse: bool,
directory: Option<Py<PyAny>>,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the tricky issues is that the Python GIL (for the non-freethreaded build) is going to limit the concurrency for segment writers. Users will call index.writer(mem, num_cpus) and be surprised to discover that they still only get performances of about 1 cpu. This is a consquence of exposing the raw Directory methods directly in the Python interface.

The following is a hypothetical argument to explain the consequences of exposing the raw Directory interface. Imagine if, instead of Directory, we exposed a different kind of interface, say a "Blob Store":

  class BlobStore(Protocol):                                                                    
      def read(self, key: str) -> bytes: ...                                                    
      def write(self, key: str, data: bytes) -> None: ...
      def delete(self, key: str) -> None: ...                                                   
      def exists(self, key: str) -> bool: ...  

So the idea would be that the caller can provide either a path or an instance of BlobStore to the index construction method.

Inside the rust implementation we actually implemented all the detailed Directory methods to be compatible with the internal tantivy trait.

There is a big difference between the potentical concurrency of our BlobStore interface, above, and the PyDirectory. Because inside the impl of our BlobStore, we can release the GIL for everything get gets done inside the BlobStore.write(...)method, because it's all rust, and the internal behaviour of tantivy's concurrency goes at full speed. But forPyDirectory, it doesn't work because all of the 8 fine-grained methods on the PyDirectory` class will all compete for the GIL during normal usage. Several of them will need to be used during writes, and they will contend with each other to acquire the GIL lock.

Does that make sense?

On the one hand, it seems attractive to expose the raw Directory interface, but on the other hand we get unfortunate collisions between the python and rust systems like this. It seems important enough that if we wanted to continue with this PR, we'd have to clamp the num_cpus parameter to 1 in the case of a custom directory parameter.

Comment thread tests/test_directory.py
self._writers.pop(writer_id, None)


class TestPyDirectory:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to check all the error paths in the directory code because it's a fundamental interface. Currently it looks like these tests are all happy path only?

Copy link
Copy Markdown
Collaborator

@cjrh cjrh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a lot of feedback. It's a complex PR and it makes me quite nervous to add this new interface to tantivy-py. I will be on the hook to maintain this code in the future.

However, I ackknowledge the value of the idea, to make abstract the storage location of the data.

How important is it to your use-case to mimic the Directory object exactly, versus a simpler storage abstraction like a BlobStore:

  class BlobStore(Protocol):                                                                    
      def read(self, key: str) -> bytes: ...                                                    
      def write(self, key: str, data: bytes) -> None: ...
      def delete(self, key: str) -> None: ...                                                   
      def exists(self, key: str) -> bool: ... 

?

@JingsongLi
Copy link
Copy Markdown
Author

Hi @cjrh , thank you very much for your feedback! Our specific use case involves using Python to read Tantivy indices. Since our indices are stored on object storage (S3), we would like to access them directly via remote access; otherwise, the data transfer costs for downloading them would be prohibitively expensive.

BlobStore looks good to me! In fact, current requirements call for only a single read method interface.

The code used is here, FYI.
https://github.com/apache/paimon/blob/master/paimon-python/pypaimon/globalindex/tantivy/tantivy_full_text_global_index_reader.py

@cjrh
Copy link
Copy Markdown
Collaborator

cjrh commented Apr 26, 2026

@JingsongLi Is it ok with you if I add the BlobStore object in a separate PR? Or do you specifically want to do it yourself?

@JingsongLi
Copy link
Copy Markdown
Author

JingsongLi commented Apr 29, 2026

@JingsongLi Is it ok with you if I add the BlobStore object in a separate PR? Or do you specifically want to do it yourself?

Thanks! It would be even better if you could come to add the BlobStore~ (Sorry for the slow response. I am currently on vacation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants