Commit 675ff3c

committed
WIP: more editing work
1 parent 38b394c commit 675ff3c

File tree

3 files changed

+126
-64
lines changed

docs/source/design/containers.rst

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
Containers Design Choices
2+
=========================
3+
4+
``Callable`` vs ``class``
5+
-------------------------
6+
7+
In the mathematical formulation we model the data source as a function; however,
8+
in the implementation this has been promoted to a full class with two methods
9+
of ``query`` and ``describe``.
10+
11+
The justification for this is:
12+
13+
1. We anticipate the need to update the data held by the container, so it will
14+
likely be backed by instances of classes in practice
15+
2. We will want the containers to provide static type and shape information
16+
about themselves. In principle this could be carried in the signature of the
17+
function, but Python's built-in type system is too dynamic for this to be
18+
practical.
19+
20+
A `types.SimpleNamespace` with the correct names is compatible with this API.
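A minimal sketch of the ``SimpleNamespace`` remark, assuming ``query`` takes a coordinate transform and a size and returns a ``(data, cache_key)`` pair as in the signature quoted in ``design/index.rst``. The field names, the format returned by ``describe``, and the constant cache key are illustrative assumptions, not the project's API.

```python
# SimpleNamespace lives in the standard-library ``types`` module.
from types import SimpleNamespace

# Hypothetical static data; a real container could hold anything.
_x = [0.0, 1.0, 2.0]
_y = [0.0, 1.0, 4.0]


def _query(coord_transform, size):
    # Ignore the view information and always return everything.
    # The second element is the cache key; the data never changes,
    # so a constant key is sufficient for this toy case.
    return {"x": _x, "y": _y}, "static"


def _describe():
    # Assumed description format: field name -> (shape, type name).
    return {"x": ((len(_x),), "float"), "y": ((len(_y),), "float")}


# Any object exposing these two names duck-types as a container.
container = SimpleNamespace(query=_query, describe=_describe)

data, cache_key = container.query(coord_transform=None, size=(640, 480))
```

Nothing here requires a class definition, which is the point of the remark: the API is defined by the two attribute names, not by inheritance.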
21+
22+
23+
``obj.__call__`` and ``obj.describe`` would make the data feel more like a
24+
function, however if someone wanted to implement this with a function rather
25+
than a class it would require putting a callable as an attribute on a callable.
26+
This is technically allowed in Python, but a bit weird.
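For comparison, a small sketch of the rejected alternative: a plain function standing in for ``obj.__call__`` with a second callable attached as its ``describe`` attribute. The names, the arguments, and the description format are hypothetical.

```python
def line_data(coord_transform=None, size=(1, 1)):
    """Play the role of ``obj.__call__``: return (data, cache_key)."""
    return {"x": [0, 1], "y": [1, 0]}, 0


def _describe_line_data():
    # Hypothetical description format, for illustration only.
    return {"x": "int", "y": "int"}


# Functions are ordinary objects, so attaching another callable works,
# but the link between the two is easy to miss when reading the code.
line_data.describe = _describe_line_data

data, key = line_data()
fields = line_data.describe()
```

This runs, but the ``describe`` attribute is invisible in the function's signature, which is part of why the two-method spelling reads more naturally.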
27+
28+
Caching design
29+
--------------
30+
31+
.. note::
32+
33+
There are two hard problems in computer science:
34+
35+
1. naming things
36+
2. cache invalidation
37+
3. off-by-one bugs
38+
39+
Because we are adding a layer of indirection to the data access it is no longer
40+
generally true that "getting the data" is cheap nor that any layer of the
41+
system has all of the information required to know if cached values are still
42+
valid.

docs/source/design.rst renamed to docs/source/design/index.rst

Lines changed: 83 additions & 63 deletions
@@ -3,6 +3,10 @@
33
========
44

55

6+
Introduction
7+
============
8+
9+
610
When a Matplotlib :obj:`~matplotlib.artist.Artist` object is rendered via the
711
`~matplotlib.artist.Artist.draw` method the following steps happen (in spirit
812
but maybe not exactly in code):
@@ -12,41 +16,49 @@ but maybe not exactly in code):
1216
3. convert the unit-less data from user-space to rendering-space
1317
4. call the backend rendering functions
1418

15-
..
16-
If we were to call these steps :math:`f_1` through :math:`f_4` this can be expressed as (taking
17-
great liberties with the mathematical notation):
19+
If we were to call these steps :math:`f_1` through :math:`f_4` this can be expressed as (taking
20+
great liberties with the mathematical notation):
21+
22+
.. math::
1823
19-
.. math::
24+
R = f_4(f_3(f_2(f_1())))
2025
21-
R = f_4(f_3(f_2(f_1())))
26+
or if you prefer
2227

23-
or if you prefer
28+
.. math::
2429
25-
.. math::
30+
R = (f_4 \circ f_3 \circ f_2 \circ f_1)()
2631
27-
R = (f_4 \circ f_3 \circ f_2 \circ f_1)()
32+
If we can do this for one ``Artist``, we can build up more complex
33+
visualizations via composition by rendering multiple ``Artist`` instances to the
34+
same target.
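The chain of four steps can be sketched in plain Python. The four functions below are hypothetical stand-ins (centimetres for the units, a 200 pixel wide target, strings instead of real backend calls) and only illustrate that the nested and composed spellings agree.

```python
from functools import reduce


def compose(*funcs):
    """Return a zero-argument callable equal to f_n(...f_2(f_1()))."""
    first, *rest = funcs

    def composed():
        # Feed the output of each step into the next one.
        return reduce(lambda value, func: func(value), rest, first())

    return composed


def f1():
    # Step 1: get the data (hypothetically in centimetres).
    return [0, 50, 100]


def f2(data):
    # Step 2: strip units, here by converting to unit-less metres.
    return [v / 100 for v in data]


def f3(data):
    # Step 3: map user space to rendering space (assumed 200 px wide).
    return [v * 200 for v in data]


def f4(data):
    # Step 4: stand-in for the backend rendering calls.
    return [f"point at {px:.0f}px" for px in data]


nested = f4(f3(f2(f1())))
composed = compose(f1, f2, f3, f4)()
```

Both ``nested`` and ``composed`` hold the same list of three "draw commands".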
2835

29-
It is reasonable that if we can do this for one ``Artist``, we can build up
30-
more complex visualizations by rendering multiple ``Artist`` to the same
31-
target.
36+
We can understand the :obj:`~matplotlib.artist.Artist.draw` methods to be
37+
extensively `curried <https://en.wikipedia.org/wiki/Currying>`__ versions of
38+
these function chains. By wrapping the functions in objects we can modify the
39+
bound arguments to the functions. However, the clear structure is frequently
40+
elided or obscured in the Matplotlib code base and there is an artificial
41+
distinction between "data" and "style" inputs.
3242

33-
However, this clear structure is frequently elided and obscured in the
34-
Matplotlib code base: Step 3 is only present for *x* and *y* like data
35-
(encapsulated in the `~matplotlib.transforms.TransformNode` objects) and color
36-
mapped data (encapsulated in the `.matplotlib.colors.ScalarMappable` family of
37-
classes); the application of Step 2 is inconsistent (both in actual application
38-
and when it is applied) between artists; each ``Artist`` stores its data in
39-
its own way (typically as numpy arrays).
43+
For example, mapping from "user data" to "rendering data" (Step 3) is only done
44+
at draw-time for *x* / *y* like data (encapsulated in the
45+
`~matplotlib.transforms.TransformNode` objects) and color mapped data
46+
(encapsulated in the `~matplotlib.cm.ScalarMappable` family of classes).
47+
If users need to do any other mapping between their data and Matplotlib's
48+
rendering space, it must be done in user code and the results passed into
49+
Matplotlib. The application of unit conversion (Step 2) is inconsistent (both
50+
in actual application and when it is applied) between artists. This is a
51+
particular difficulty for ``Artists`` parameterized by deltas (e.g. *height*
52+
and *width* for a Rectangle) where the order of unit conversion and computing
53+
the absolute bounding box can be fraught. Finally, each ``Artist`` stores its
54+
data in its own way (typically as materialized numpy arrays) which makes it
55+
difficult to update artists in a uniform way.
4056

41-
With this view, we can understand the `~matplotlib.artist.Artist.draw` methods
42-
to be very extensively `curried <https://en.wikipedia.org/wiki/Currying>`__
43-
version of these function chains where the objects allow us to modify the
44-
arguments to the functions and the re-run them.
4557

4658
The goal of this work is to bring this structure more to the foreground in the
4759
internal structure of Matplotlib. By exposing this inherent structure in the
4860
architecture of Matplotlib the library will be easier to reason about and
49-
easier to extend by injecting custom logic at each of the steps.
61+
easier to extend.
5062

5163
A paper with the formal mathematical description of these ideas is in
5264
preparation.
@@ -57,11 +69,12 @@ Data pipeline
5769
Get the data (Step 1)
5870
---------------------
5971

60-
In this context "data" is post any data-to-data transformations or aggregation
61-
steps. There is already extensive tooling and literature around that aspect
62-
which we do not need to recreate. By completely decoupling the aggregations
63-
pipeline from the visualization process we are able to both simplify and
64-
generalize the software.
72+
.. note ::
73+
74+
In this context "data" is post any data-to-data transformation or aggregation
75+
steps. Because this proposal holds a function, rather than materialized
76+
arrays, we can defer actually executing the data pipeline until draw time,
77+
but Matplotlib does not need any visibility into what this pipeline is.
6578
6679
Currently, almost all ``Artist`` classes store the data they are representing
6780
as attributes on the instances as realized `numpy.array` [#]_ objects. On one
@@ -72,33 +85,29 @@ for *this* ``Artist``, you can query or update the data without recreating the
7285
we understand ``self.x[:]`` as ``self.x.__getitem__(slice())`` which is the
7386
function call in step 1.
7487

75-
However, this method of storing the data has several drawbacks.
76-
77-
In most cases the data attributes on an ``Artist`` are closely linked -- the
78-
*x* and *y* on a `~matplotlib.lines.Line2D` must be the same length -- and by
79-
storing them separately it is possible for them to become inconsistent in ways
80-
that noticed until draw time [#]_. With the rise of more structured data, such
81-
as ``pandas.DataFrame`` and ``xarray.Dataset`` users are more frequently having
82-
their data is coherent objects rather than individual arrays. Currently
88+
However, this method of storing the data has several drawbacks. In most cases
89+
the data attributes on an ``Artist`` are closely linked -- the *x* and *y* on a
90+
`~matplotlib.lines.Line2D` must be the same length -- and by storing them
91+
separately it is possible for them to become inconsistent in ways that are not noticed
92+
until draw time [#]_. With the rise of more structured data types, such as
93+
`pandas.DataFrame` and `xarray.core.dataset.Dataset`, users are likely to have
94+
their data in coherent objects rather than as individual arrays. Currently
8395
Matplotlib requires that these structures be decomposed, losing the
84-
association between the individual arrays.
85-
86-
An goal of this project is to bring support for draw-time resampling to every
87-
Matplotlib ``Artist``. Further, because the data is stored as materialized
88-
``numpy`` arrays, we must decide before draw time what the correct sampling of
89-
the data is. Projects like `grave <https://networkx.ors g/grave/>`__ that wrap
90-
richer objects or `mpl-modest-image
96+
association between the individual arrays. Further, because the data is stored
97+
as materialized ``numpy`` arrays, we must decide before draw time what the
98+
correct sampling of the data is. Projects like `grave <https://networkx.org/grave/>`__
99+
that wrap richer objects or `mpl-modest-image
91100
<https://github.com/ChrisBeaumont/mpl-modest-image>`__, `datashader
92101
<https://datashader.org/getting_started/Interactivity.html#native-support-for-matplotlib>`__,
93102
and `mpl-scatter-density <https://github.com/astrofrog/mpl-scatter-density>`__
94103
that dynamically re-sample the data do exist, but they have only seen limited
95104
adoption.
96105

97-
This is a proposal to add a level of indirection the data storage -- via a
98-
(so-called) `~data_prototype.containers.DataContainer` -- rather than directly
99-
as individual numpy arrays on the ``Artist`` instances. The primary method on
100-
these objects is the `~data_prototype.containers.DataContainer.query` method
101-
which has the signature ::
106+
The first structural change of this proposal is to add a layer of indirection
107+
-- via a (so-called) `~data_prototype.containers.DataContainer` -- to the data
108+
storage and access. The primary method on these objects is the
109+
`~data_prototype.containers.DataContainer.query` method with the signature
110+
::
102111

103112
def query(
104113
self,
@@ -107,7 +116,7 @@ which has the signature ::
107116
size: Tuple[int, int],
108117
) -> Tuple[Dict[str, Any], Union[str, int]]:
109118

110-
The query is passed in:
119+
The query is passed:
111120

112121
- A *coord_transform* from "Axes fraction" to "data" (using Matplotlib's names
113122
for the `coordinate systems
@@ -119,9 +128,9 @@ The query is passed in:
119128
It will return:
120129

121130
- A mapping of strings to things that are coercible (with the help of the
122-
functions is steps 2 and 3) to a numpy array or types understandable by the
131+
functions in Steps 2 and 3) to a numpy array or types understandable by the
123132
backends.
124-
- A key that can be used for caching
133+
- A key that can be used for caching by the caller
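A minimal sketch of a container that defers sampling until ``query`` is called. It assumes the signature is ``query(self, coord_transform, size)``, that *size* is ``(width, height)`` in pixels, and that *coord_transform* can be modelled as a plain callable from an Axes fraction to a data *x* value; the diff elides part of the signature, so all three are assumptions, and ``FuncContainer`` is a hypothetical name.

```python
import math


class FuncContainer:
    """Hypothetical container that samples ``func`` once per pixel column."""

    def __init__(self, func):
        self._func = func

    def query(self, coord_transform, size):
        width, _height = size
        n = max(width, 2)  # need at least the two end points
        # One sample per horizontal pixel, spanning Axes fraction 0..1.
        fractions = [i / (n - 1) for i in range(n)]
        xs = [coord_transform(f) for f in fractions]
        ys = [self._func(x) for x in xs]
        # Assuming an affine transform, the end points and the sample
        # count fully determine the returned data, so they form the key.
        cache_key = f"{xs[0]!r}:{xs[-1]!r}:{n}"
        return {"x": xs, "y": ys}, cache_key


container = FuncContainer(math.sin)


def view(frac):
    # Assumed view limits: data x runs from 0 to pi.
    return 0.0 + frac * math.pi


data, key = container.query(view, (5, 4))
_, key_again = container.query(view, (5, 4))
_, key_wider = container.query(view, (9, 4))
```

Repeating the query with the same view yields the same key, while a wider target yields a new key and a denser sampling, without the caller ever inspecting the data.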
125134

126135
This function will be called at draw time by the ``Artist`` to get the data to
127136
be drawn. In the simplest cases
@@ -153,9 +162,8 @@ return aligned data to the ``Artist``.
153162
There is still some ambiguity as to what should be put in the data. For
154163
example with `~matplotlib.lines.Line2D` it is clear that the *x* and *y* data
155164
should be pulled from the ``DataContainer``, but things like *color* and
156-
*linewidth* are ambiguous. A later section will make the case that it should be
157-
possible, but maybe not required, that these values be accessible in the data
158-
context.
165+
*linewidth* are ambiguous. It should be possible, but maybe not required, that
166+
these values be derived from the data returned by the ``DataContainer``.
159167

160168
An additional task that the ``DataContainer`` can do is to describe the type,
161169
shape, fields, and topology of the data it contains. For example a
@@ -170,6 +178,7 @@ all of this still needs to be developed. There is a
170178
`~data_prototype.containers.DataContainer.describe` method, however it is the
171179
most provisional part of the current design.
172180

181+
This does not address how the ``DataContainer`` objects are generated in practice.
173182

174183
Unit conversion (Step 2)
175184
------------------------
@@ -209,7 +218,7 @@ values), representation conversions (like named colors to RGB values), mapping
209218
strings to a set of objects (like named markershape), to parameterized type
210219
conversion (like colormapping). Although Matplotlib is currently doing all of
211220
these conversions, the user really only has control of the position and
212-
colormapping (on `~matplotlib.colors.ScalarMappable` sub-classes). The next
221+
colormapping (on `~matplotlib.cm.ScalarMappable` sub-classes). The next
213222
thing that this design allows is for user defined functions to be passed for
214223
any of the relevant data fields.
215224

@@ -237,14 +246,14 @@ Caching
237246
A key to keeping this implementation efficient is to be able to cache when we
238247
have to re-compute values. Internally current Matplotlib has a number of
239248
ad-hoc caches, such as in ``ScalarMappable`` and ``Line2D``. Going down the
240-
route of hashing all of the data is not a sustainable path (in the case even
241-
modestly sized data the time to hash the data will quickly out-strip any
242-
possible time savings doing the cache lookup!). The proposed ``query`` method
243-
returns a cache key that it generates to the caller. The exact details of how
244-
to generate that key are left to the ``DataContainer`` implementation, but if
245-
the returned data changed, then the cache key must change. The cache key
246-
should be computed from a combination of the ``DataContainers`` internal state,
247-
the coordinate transformation and size passed in.
249+
route of hashing all of the data is not a sustainable path (even with modestly
250+
sized data the time to hash the data will quickly outstrip any possible time
251+
savings doing the cache lookup!). The proposed ``query`` method returns a
252+
cache key that it generates to the caller. The exact details of how to
253+
generate that key are left to the ``DataContainer`` implementation, but if the
254+
returned data changed, then the cache key must change. The cache key should be
255+
computed from a combination of the ``DataContainer``'s internal state and the arguments
256+
passed to ``query``.
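One way a caller could use the returned key, sketched under assumptions: ``VersionedContainer`` and ``CachingCaller`` are hypothetical names, the container bumps a version counter on every update instead of hashing its data, and the cache is an unbounded dict, which a real implementation would likely want to limit.

```python
class VersionedContainer:
    """Hypothetical container whose key tracks its state and arguments."""

    def __init__(self, y):
        self._y = list(y)
        self._version = 0

    def set_y(self, y):
        self._y = list(y)
        self._version += 1  # changed data must produce a changed key

    def query(self, coord_transform, size):
        width = max(1, size[0])
        # Crude decimation: return at most ``width`` points.
        step = max(1, len(self._y) // width)
        data = {"y": self._y[::step][:width]}
        # Built from internal state plus the argument that affects the
        # result; the data itself is never hashed.
        return data, f"v{self._version}-w{width}"


class CachingCaller:
    """Stand-in for an Artist that caches an expensive later step."""

    def __init__(self, container, expensive):
        self._container = container
        self._expensive = expensive
        self._cache = {}  # unbounded for brevity

    def get(self, coord_transform, size):
        data, key = self._container.query(coord_transform, size)
        if key not in self._cache:
            self._cache[key] = self._expensive(data)
        return self._cache[key]


calls = []


def to_pixels(data):
    calls.append(1)  # record each real computation
    return [10 * v for v in data["y"]]


container = VersionedContainer([1, 2, 3, 4])
caller = CachingCaller(container, to_pixels)
first = caller.get(None, (2, 1))   # computed
second = caller.get(None, (2, 1))  # served from the cache
container.set_y([5, 6, 7, 8])
third = caller.get(None, (2, 1))   # key changed, recomputed
```

The expensive function runs twice for three requests; the cost of checking validity is one string comparison rather than a hash over the data.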
248257

249258
The choice to return the data and cache key in one step, rather than be a two
250259
step process is driven by simplicity and because the cache key is computed
@@ -260,6 +269,17 @@ layers to keep. Currently only the results of Step 3 are cached, but we may
260269
want to additionally cache intermediate results after Step 2. The caching from
261270
Step 1 is likely best left to the ``DataContainer`` instances.
262271

272+
Detailed design notes
273+
=====================
274+
275+
276+
.. toctree::
277+
:maxdepth: 2
278+
279+
containers
280+
281+
282+
263283
.. [#] Not strictly true, in some cases we also store the values in the data in
264284
the container it came in with which may not be a `numpy.array`.
265285
.. [#] For example `matplotlib.lines.Line2D.set_xdata` and

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ Design
1717
.. toctree::
1818
:maxdepth: 2
1919

20-
design.rst
20+
design/index
2121

2222

2323
Examples
