Commit b08121a

Add documentation about how to use catalog and schema providers
1 parent 55dd215 commit b08121a

2 files changed: +58 -0 lines changed

docs/source/user-guide/data-sources.rst

Lines changed: 56 additions & 0 deletions
@@ -185,3 +185,59 @@ the interface as describe in the :ref:`Custom Table Provider <io_custom_table_pr
section. This is an advanced topic, but a
`user example <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_
is provided in the DataFusion repository.

Catalog
=======

A common technique for organizing tables is a three-level hierarchy. DataFusion
supports this form of organization using the :py:class:`~datafusion.catalog.Catalog`,
:py:class:`~datafusion.catalog.Schema`, and :py:class:`~datafusion.catalog.Table` classes. By default,
a :py:class:`~datafusion.context.SessionContext` comes with a single Catalog and a single Schema
with the names ``datafusion`` and ``default``, respectively.

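To make the three-level hierarchy concrete, here is a minimal toy model that uses plain dictionaries in place of DataFusion's classes. It is not DataFusion's implementation. The helper names ``resolve`` and ``lookup`` are hypothetical, and the rule that shorter references fall back to the default catalog and schema is an assumption based on the defaults described above.

```python
# Toy model of the three-level hierarchy: catalog -> schema -> table.
# Plain dictionaries stand in for DataFusion's Catalog, Schema, and Table.
DEFAULT_CATALOG = "datafusion"
DEFAULT_SCHEMA = "default"

catalogs = {
    DEFAULT_CATALOG: {DEFAULT_SCHEMA: {"my_table": "table in the default schema"}},
    "my_catalog_name": {"my_schema_name": {"my_table": "table in a custom schema"}},
}


def resolve(reference: str) -> tuple[str, str, str]:
    """Expand a 1-, 2-, or 3-part table reference to (catalog, schema, table)."""
    parts = reference.split(".")
    if len(parts) == 1:
        # Assumption: a bare table name uses the default catalog and schema.
        return (DEFAULT_CATALOG, DEFAULT_SCHEMA, parts[0])
    if len(parts) == 2:
        # Assumption: "schema.table" uses the default catalog.
        return (DEFAULT_CATALOG, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"expected at most three name parts, got {reference!r}")


def lookup(reference: str) -> str:
    """Walk the nested dictionaries to find the table a reference points at."""
    catalog, schema, table = resolve(reference)
    return catalogs[catalog][schema][table]
```

In this model, ``lookup("my_table")`` and ``lookup("my_catalog_name.my_schema_name.my_table")`` reach two different tables that happen to share a name, which is the main benefit of the hierarchy.
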
The default implementation keeps the catalog and schema in memory. You can also add
further in-memory catalogs and schemas, as in the following
example:

.. code-block:: python

    from datafusion.catalog import Catalog, Schema

    my_catalog = Catalog.memory_catalog()
    my_schema = Schema.memory_schema()

    my_catalog.register_schema("my_schema_name", my_schema)

    # ctx is an existing SessionContext
    ctx.register_catalog("my_catalog_name", my_catalog)

You could then register tables in ``my_schema`` and access them either through the DataFrame
API or via SQL commands such as ``"SELECT * from my_catalog_name.my_schema_name.my_table"``.

User Defined Catalog and Schema
-------------------------------

If the in-memory catalogs are insufficient for your needs, there are two approaches you can take
to implement a custom catalog and/or schema. The discussion below describes how to
implement these for a Catalog, but the approach for a Schema is nearly
identical.

DataFusion supports Catalogs written in either Rust or Python. If you write a Catalog in Rust,
you will need to export it as a Python library via PyO3. There is a complete example of a
catalog implemented this way in the
`examples folder <https://github.com/apache/datafusion-python/tree/main/examples/>`_
of our repository. Writing catalog providers in Rust can typically lead to significant
performance improvements over the Python-based approach.

To implement a Catalog in Python, you will need to inherit from the abstract base class
:py:class:`~datafusion.catalog.CatalogProvider`. There are examples in the
`unit tests <https://github.com/apache/datafusion-python/tree/main/python/tests>`_ of
implementing a basic Catalog in Python where we simply keep a dictionary of the
registered Schemas.

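The dictionary-based approach can be sketched roughly as follows. To keep the sketch runnable without DataFusion installed, it defines a local, hypothetical stand-in base class (``CatalogProviderStandIn``) instead of importing :py:class:`~datafusion.catalog.CatalogProvider`. The method names ``schema_names``, ``schema``, ``register_schema``, and ``deregister_schema`` are assumptions about the interface, so check the real base class and the unit tests before relying on them.

```python
from abc import ABC, abstractmethod
from typing import Optional


class CatalogProviderStandIn(ABC):
    """Hypothetical stand-in for datafusion.catalog.CatalogProvider.

    The method names below are assumptions, not a confirmed interface.
    """

    @abstractmethod
    def schema_names(self) -> set[str]:
        """Return the names of all registered schemas."""

    @abstractmethod
    def schema(self, name: str) -> Optional[object]:
        """Return the schema registered under ``name``, or None."""


class DictCatalog(CatalogProviderStandIn):
    """A basic catalog that keeps its schemas in a dictionary."""

    def __init__(self) -> None:
        self._schemas: dict[str, object] = {}

    def schema_names(self) -> set[str]:
        return set(self._schemas)

    def schema(self, name: str) -> Optional[object]:
        return self._schemas.get(name)

    def register_schema(self, name: str, schema: object) -> None:
        self._schemas[name] = schema

    def deregister_schema(self, name: str, cascade: bool = True) -> None:
        # ``cascade`` mirrors the assumed signature; this sketch ignores it.
        self._schemas.pop(name, None)


catalog = DictCatalog()
catalog.register_schema("my_schema_name", object())
```

In real code you would replace the stand-in with the library's base class and store actual schema objects rather than the placeholder ``object()``.
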
One important note for developers is that when a Catalog is defined in Python, there are
two different ways of accessing it. First, we register the catalog with a Rust
wrapper. This allows any Rust-based code to call the Python functions as necessary.
Second, if the user accesses the Catalog via the Python API, we identify this and return
the original Python object that implements the Catalog. This is an important distinction
for developers because we do *not* return a Python wrapper around the Rust wrapper of the
original Python object.
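
The distinction can be illustrated with a small toy model. Every class here (``RustWrapperStandIn``, ``ContextStandIn``, ``MyCatalog``) is hypothetical and written in pure Python; none of it is DataFusion code. It only shows the behaviour described above: registration wraps the object, and lookup through the Python API returns the original object.

```python
class RustWrapperStandIn:
    """Hypothetical stand-in for the Rust-side wrapper around a Python catalog."""

    def __init__(self, py_catalog: object) -> None:
        self.py_catalog = py_catalog

    def schema_names(self) -> set[str]:
        # Rust-based callers reach the Python implementation through the wrapper.
        return self.py_catalog.schema_names()


class ContextStandIn:
    """Hypothetical stand-in for the registration and lookup behaviour."""

    def __init__(self) -> None:
        self._catalogs: dict[str, RustWrapperStandIn] = {}

    def register_catalog(self, name: str, py_catalog: object) -> None:
        # Registration stores the catalog behind a wrapper.
        self._catalogs[name] = RustWrapperStandIn(py_catalog)

    def catalog(self, name: str) -> object:
        # Lookup unwraps: the caller gets the original Python object back,
        # not a wrapper around the wrapper.
        return self._catalogs[name].py_catalog


class MyCatalog:
    """Hypothetical user-defined catalog with one schema name."""

    def schema_names(self) -> set[str]:
        return {"my_schema_name"}


ctx = ContextStandIn()
original = MyCatalog()
ctx.register_catalog("my_catalog_name", original)
```

Under this model, ``ctx.catalog("my_catalog_name") is original`` holds, so any extra attributes or methods on your own class remain reachable after registration.
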

python/datafusion/catalog.py

Lines changed: 2 additions & 0 deletions
@@ -35,7 +35,9 @@
 __all__ = [
     "Catalog",
+    "CatalogProvider",
     "Schema",
+    "SchemaProvider",
     "Table",
 ]