Commit b8427d3

AlenkaF and rok authored
GH-34956: [Docs][Python] Add to docs the usage of the FixedShapeTensorType (#34957)
### Rationale for this change

This PR adds examples of the use of `FixedShapeTensorType` to the PyArrow user guide.
Should be reviewed and merged after #34883 is done.

* Closes: #34956

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
1 parent 07642fd commit b8427d3

1 file changed: +160 -0 lines changed

docs/source/python/extending_types.rst

Lines changed: 160 additions & 0 deletions
@@ -357,3 +357,163 @@ pandas ``ExtensionArray``. This method should have the following signature::

This way, you can control the conversion of a pyarrow ``Array`` of your pyarrow
extension type to a pandas ``ExtensionArray`` that can be stored in a DataFrame.


Canonical extension types
~~~~~~~~~~~~~~~~~~~~~~~~~

You can find the official list of canonical extension types in the
:ref:`format_canonical_extensions` section. Here we add examples on how to
use them in pyarrow.

Fixed shape tensor
""""""""""""""""""

To create an array of tensors with equal shape (a fixed shape tensor array), we
first need to define a fixed shape tensor extension type with a value type
and a shape:

.. code-block:: python

   >>> import pyarrow as pa
   >>> tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2))

Then we need a storage array of :func:`pyarrow.list_` type, where ``value_type``
is the fixed shape tensor value type and the list size is the product of the
``tensor_type`` shape elements. We can then create an array of tensors with the
``pa.ExtensionArray.from_storage()`` method:

.. code-block:: python

   >>> arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
   >>> storage = pa.array(arr, pa.list_(pa.int32(), 4))
   >>> tensor_array = pa.ExtensionArray.from_storage(tensor_type, storage)
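
The relationship between the tensor shape and the storage list size can be
checked without pyarrow. The following is a minimal sketch in plain Python; the
helper name ``storage_list_size`` is hypothetical and not part of pyarrow.

```python
import math


def storage_list_size(shape):
    """Return how many values one tensor of `shape` occupies in storage.

    Hypothetical helper for illustration only; not a pyarrow function.
    """
    return math.prod(shape)


shape = (2, 2)
arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]

# The list size is the product of the shape elements: 2 * 2 = 4.
list_size = storage_list_size(shape)

# Every storage row must hold exactly one flattened tensor.
rows_ok = all(len(row) == list_size for row in arr)
print(list_size, rows_ok)
```

A row of any other length would not fit a fixed size list of this size, so a
check like this can be a cheap guard before building the storage array.
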

We can also create another array of tensors with a different value type:

.. code-block:: python

   >>> tensor_type_2 = pa.fixed_shape_tensor(pa.float32(), (2, 2))
   >>> storage_2 = pa.array(arr, pa.list_(pa.float32(), 4))
   >>> tensor_array_2 = pa.ExtensionArray.from_storage(tensor_type_2, storage_2)

Extension arrays can be used as columns in ``pyarrow.Table`` or
``pyarrow.RecordBatch``:

.. code-block:: python

   >>> data = [
   ...     pa.array([1, 2, 3]),
   ...     pa.array(['foo', 'bar', None]),
   ...     pa.array([True, None, True]),
   ...     tensor_array,
   ...     tensor_array_2
   ... ]
   >>> my_schema = pa.schema([('f0', pa.int8()),
   ...                        ('f1', pa.string()),
   ...                        ('f2', pa.bool_()),
   ...                        ('tensors_int', tensor_type),
   ...                        ('tensors_float', tensor_type_2)])
   >>> table = pa.Table.from_arrays(data, schema=my_schema)
   >>> table
   pyarrow.Table
   f0: int8
   f1: string
   f2: bool
   tensors_int: extension<arrow.fixed_size_tensor>
   tensors_float: extension<arrow.fixed_size_tensor>
   ----
   f0: [[1,2,3]]
   f1: [["foo","bar",null]]
   f2: [[true,null,true]]
   tensors_int: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]
   tensors_float: [[[1,2,3,4],[10,20,30,40],[100,200,300,400]]]

We can also convert a tensor array to a single multi-dimensional numpy ndarray.
In the conversion, the length of the Arrow array becomes the first dimension
of the numpy ndarray:

.. code-block:: python

   >>> numpy_tensor = tensor_array_2.to_numpy_ndarray()
   >>> numpy_tensor
   array([[[  1.,   2.],
           [  3.,   4.]],
          [[ 10.,  20.],
           [ 30.,  40.]],
          [[100., 200.],
           [300., 400.]]])
   >>> numpy_tensor.shape
   (3, 2, 2)
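
For a trivial permutation, the result shown above corresponds to a row-major
reshape of the flat storage rows. The following numpy-only sketch illustrates
that relationship. It assumes the row-major layout seen in the output above and
does not call pyarrow.

```python
import numpy as np

arr = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
shape = (2, 2)

# One row per tensor, each row holding the flattened tensor values.
flat = np.array(arr, dtype=np.float32)

# Prepend the array length to the tensor shape and reshape in row-major order.
reshaped = flat.reshape((len(arr),) + shape)
print(reshaped.shape)
```

The second storage row ``[10, 20, 30, 40]`` ends up as the ``(2, 2)`` tensor at
index 1 of the first dimension.
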

.. note::

   Both optional parameters, ``permutation`` and ``dim_names``, are meant to provide the user
   with the information about the logical layout of the data compared to the physical layout.

   The conversion to numpy ndarray is only possible for trivial permutations (``None`` or
   ``[0, 1, ... N-1]`` where ``N`` is the number of tensor dimensions).

Conversely, we can convert a numpy ndarray to a fixed shape tensor array:

.. code-block:: python

   >>> pa.FixedShapeTensorArray.from_numpy_ndarray(numpy_tensor)
   <pyarrow.lib.FixedShapeTensorArray object at ...>
   [
     [
       1,
       2,
       3,
       4
     ],
     [
       10,
       20,
       30,
       40
     ],
     [
       100,
       200,
       300,
       400
     ]
   ]

In this conversion, the first dimension of the ndarray becomes the length of the pyarrow
extension array. In the example, an ndarray of shape ``(3, 2, 2)`` becomes an Arrow array of
length 3 with tensor elements of shape ``(2, 2)``.

.. code-block:: python

   >>> # ndarray of shape (3, 2, 2)
   >>> numpy_tensor.shape
   (3, 2, 2)

   >>> # arrow array of length 3 with tensor elements of shape (2, 2)
   >>> pyarrow_tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(numpy_tensor)
   >>> len(pyarrow_tensor_array)
   3
   >>> pyarrow_tensor_array.type.shape
   [2, 2]

The extension type can also have ``permutation`` and ``dim_names`` defined. For
example:

.. code-block:: python

   >>> tensor_type = pa.fixed_shape_tensor(pa.float64(), [2, 2, 3], permutation=[0, 2, 1])

or

.. code-block:: python

   >>> tensor_type = pa.fixed_shape_tensor(pa.bool_(), [2, 2, 3], dim_names=['C', 'H', 'W'])

for ``NCHW`` format where:

* N: number of images, which in our case is the length of the array and is always
  the first dimension
* C: number of channels of the image
* H: height of the image
* W: width of the image
