Introduction
============

When a Matplotlib :obj:`~matplotlib.artist.Artist` object is rendered via the
`~matplotlib.artist.Artist.draw` method the following steps happen (in spirit,
but maybe not exactly in code):

1. get the data
2. convert the data from unit-full to unit-less data
3. convert the unit-less data from user-space to rendering-space
4. call the backend rendering functions

If we were to call these steps :math:`f_1` through :math:`f_4` this can be
expressed as (taking great liberties with the mathematical notation):

.. math::

   R = f_4(f_3(f_2(f_1())))

or if you prefer

.. math::

   R = (f_4 \circ f_3 \circ f_2 \circ f_1)()

If we can do this for one ``Artist``, we can build up more complex
visualizations via composition by rendering multiple ``Artist`` objects to the
same target.

We can understand the :obj:`~matplotlib.artist.Artist.draw` methods to be
extensively `curried <https://en.wikipedia.org/wiki/Currying>`__ versions of
these function chains.  By wrapping the functions in objects we can modify the
bound arguments to the functions.  However, this clear structure is frequently
elided or obscured in the Matplotlib code base, and there is an artificial
distinction between "data" and "style" inputs.
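
As a toy sketch (the step functions below are illustrative stand-ins, not
Matplotlib internals), the chain can be written as plain Python callables that
are composed and only evaluated when the composed function is called::

    from functools import reduce

    # Illustrative stand-ins for the four steps; real Artists implement these
    # very differently and with many more arguments.
    def get_data():             # Step 1: produce the user-supplied data
        return {"x": [0, 1, 2], "y": [10, 20, 15]}

    def to_unitless(data):      # Step 2: strip units (identity here)
        return data

    def to_render_space(data):  # Step 3: user-space -> rendering-space
        return {k: [v * 2.0 for v in vals] for k, vals in data.items()}

    def render(data):           # Step 4: hand the values to the backend
        print("backend receives:", data)

    def compose(*funcs):
        """compose(f4, f3, f2, f1)() is equivalent to f4(f3(f2(f1())))."""
        return reduce(lambda f, g: lambda *args: f(g(*args)), funcs)

    draw = compose(render, to_render_space, to_unitless, get_data)
    draw()  # R = (f_4 o f_3 o f_2 o f_1)()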

For example, mapping from "user data" to "rendering data" (Step 3) is only
done at draw time for *x* / *y* like data (encapsulated in the
`~matplotlib.transforms.TransformNode` objects) and color-mapped data
(encapsulated in the `~matplotlib.cm.ScalarMappable` family of classes).  If
users need to do any other mapping between their data and Matplotlib's
rendering space, it must be done in user code and the results passed into
Matplotlib.  The application of unit conversion (Step 2) is inconsistent
between artists, both in how and in when it is applied.  This is a particular
difficulty for ``Artists`` parameterized by deltas (e.g. *height* and *width*
for a ``Rectangle``) where the order of unit conversion and computing the
absolute bounding box can be fraught.  Finally, each ``Artist`` stores its
data in its own way (typically as materialized numpy arrays), which makes it
difficult to update artists in a uniform way.
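
As a toy illustration of the delta problem (the nonlinear ``convert`` function
below is a stand-in, not Matplotlib's unit machinery), converting the delta
directly and converting the endpoints then differencing give different
answers::

    import math

    # Hypothetical nonlinear unit conversion, e.g. data that lands on a log scale.
    def convert(value):
        return math.log10(value)

    x, width = 10.0, 90.0  # a rectangle anchored at x with a width in data units

    # Order A: compute the absolute bounding box first, then convert its edges.
    converted_extent = convert(x + width) - convert(x)  # log10(100) - log10(10) = 1.0

    # Order B: convert the delta directly.
    converted_width = convert(width)                    # log10(90) ~= 1.95

    print(converted_extent, converted_width)  # the two orderings disagree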

The goal of this work is to bring this structure more to the foreground in the
internal structure of Matplotlib.  By exposing this inherent structure in the
architecture of Matplotlib, the library will be easier to reason about and
easier to extend.

A paper with the formal mathematical description of these ideas is in
preparation.

Data pipeline
=============

Get the data (Step 1)
---------------------

.. note::

   In this context "data" is post any data-to-data transformation or
   aggregation steps.  Because this proposal holds a function, rather than
   materialized arrays, we can defer actually executing the data pipeline
   until draw time, but Matplotlib does not need any visibility into what
   that pipeline is.
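
As a purely illustrative sketch (the ``DeferredData`` holder below is
hypothetical, not part of this proposal's API), storing a callable rather than
materialized arrays is what makes this deferral possible::

    import numpy as np

    class DeferredData:
        """Hypothetical holder that runs a user pipeline only when asked."""

        def __init__(self, pipeline):
            # *pipeline* is any zero-argument callable returning the arrays to draw.
            self._pipeline = pipeline

        def get(self):
            # Executed at draw time; Matplotlib needs no view into the pipeline.
            return self._pipeline()

    # The user's data-to-data work (aggregation, filtering, ...) stays in user code.
    raw = np.random.default_rng(0).normal(size=10_000)
    deferred = DeferredData(
        lambda: {"x": np.arange(100), "y": np.histogram(raw, bins=100)[0]}
    )

    data = deferred.get()  # the histogram is only computed here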

Currently, almost all ``Artist`` classes store the data they are representing
as attributes on the instances as realized `numpy.array` [#]_ objects.  On one
hand this is convenient: for *this* ``Artist``, you can query or update the
data without recreating the whole ``Artist``, and we understand ``self.x[:]``
as ``self.x.__getitem__(slice())``, which is the function call in step 1.

However, this method of storing the data has several drawbacks.  In most cases
the data attributes on an ``Artist`` are closely linked -- the *x* and *y* on
a `~matplotlib.lines.Line2D` must be the same length -- and by storing them
separately it is possible for them to become inconsistent in ways that are not
noticed until draw time [#]_.  With the rise of more structured data types,
such as `pandas.DataFrame` and `xarray.core.dataset.Dataset`, users are likely
to have their data in coherent objects rather than as individual arrays.
Currently Matplotlib requires that these structures be decomposed, losing the
association between the individual arrays.  Further, because the data is
stored as materialized ``numpy`` arrays, we must decide before draw time what
the correct sampling of the data is.  Projects like `grave
<https://networkx.org/grave/>`__ that wrap richer objects or `mpl-modest-image
<https://github.com/ChrisBeaumont/mpl-modest-image>`__, `datashader
<https://datashader.org/getting_started/Interactivity.html#native-support-for-matplotlib>`__,
and `mpl-scatter-density <https://github.com/astrofrog/mpl-scatter-density>`__
that dynamically re-sample the data do exist, but they have only seen limited
adoption.

The first structural change of this proposal is to add a layer of indirection
-- via a (so-called) `~data_prototype.containers.DataContainer` -- to the data
storage and access.  The primary method on these objects is the
`~data_prototype.containers.DataContainer.query` method with the signature::

    def query(
        self,
        ...
        size: Tuple[int, int],
    ) -> Tuple[Dict[str, Any], Union[str, int]]:

The query is passed:

- A *coord_transform* from "Axes fraction" to "data" (using Matplotlib's names
  for the coordinate systems)

It will return:

- A mapping of strings to things that are coercible (with the help of the
  functions in Steps 2 and 3) to a numpy array or types understandable by the
  backends.
- A key that can be used for caching by the caller.
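
As a sketch of the simplest case (the ``FixedArrayContainer`` below is
illustrative, not a required implementation), a container can simply hand back
arrays it already holds, ignoring the transform and size::

    import uuid
    from typing import Any, Dict, Tuple, Union

    import numpy as np

    class FixedArrayContainer:
        """Illustrative container returning pre-materialized arrays unchanged."""

        def __init__(self, **data):
            self._data = {name: np.asarray(values) for name, values in data.items()}
            # The data never changes, so one key generated up front is enough.
            self._cache_key = str(uuid.uuid4())

        def query(
            self,
            coord_transform,
            size: Tuple[int, int],
        ) -> Tuple[Dict[str, Any], Union[str, int]]:
            # A richer container could use *coord_transform* and *size* to window
            # or down-sample; here the full arrays are always returned.
            return dict(self._data), self._cache_key

    container = FixedArrayContainer(x=np.linspace(0, 1, 100), y=np.random.rand(100))
    data, cache_key = container.query(coord_transform=None, size=(640, 480))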

This function will be called at draw time by the ``Artist`` to get the data to
be drawn.

There is still some ambiguity as to what should be put in the data.  For
example, with `~matplotlib.lines.Line2D` it is clear that the *x* and *y* data
should be pulled from the ``DataContainer``, but things like *color* and
*linewidth* are ambiguous.  It should be possible, but maybe not required,
that these values be derived from the data returned by the ``DataContainer``.

An additional task that the ``DataContainer`` can do is to describe the type,
shape, fields, and topology of the data it contains.  All of this still needs
to be developed.  There is a
`~data_prototype.containers.DataContainer.describe` method; however, it is the
most provisional part of the current design.  This does not address how the
``DataContainer`` objects are generated in practice.

Unit conversion (Step 2)
------------------------

Such conversions range from representation conversions (like named colors to
RGB values) and mapping strings to sets of objects (like named marker shapes)
to parameterized type conversions (like colormapping).  Although Matplotlib is
currently doing all of these conversions, the user really only has control of
the position and colormapping (on `~matplotlib.cm.ScalarMappable`
sub-classes).  The next thing that this design allows is for user defined
functions to be passed for any of the relevant data fields.
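
One possible shape for this (the ``converters`` mapping below is hypothetical,
not the settled API) is a user-supplied callable per field, applied between
Steps 2 and 3::

    import numpy as np

    # Hypothetical per-field conversion functions supplied by the user; each one
    # maps "user data" for that field into rendering-ready values.
    converters = {
        "x": lambda x: np.asarray(x, dtype=float),
        "y": lambda y: np.log10(np.asarray(y, dtype=float)),  # user-defined scaling
        "color": lambda c: np.clip(np.asarray(c, dtype=float), 0.0, 1.0),
    }

    def convert_fields(data, converters):
        """Apply the user's conversion function to each field it covers."""
        return {k: converters.get(k, lambda v: v)(v) for k, v in data.items()}

    data = {"x": [1, 2, 3], "y": [10, 100, 1000], "color": [0.2, 1.7, -0.3]}
    print(convert_fields(data, converters))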

Caching
-------

A key to keeping this implementation efficient is to be able to cache when we
have to re-compute values.  Internally, current Matplotlib has a number of
ad-hoc caches, such as in ``ScalarMappable`` and ``Line2D``.  Going down the
route of hashing all of the data is not a sustainable path (even with modestly
sized data the time to hash the data will quickly out-strip any possible time
savings from doing the cache lookup!).  The proposed ``query`` method returns
a cache key that it generates to the caller.  The exact details of how to
generate that key are left to the ``DataContainer`` implementation, but if the
returned data changed, then the cache key must change.  The cache key should
be computed from a combination of the ``DataContainer``'s internal state and
the arguments passed to ``query``.
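
For illustration (the container and key scheme below are hypothetical, not the
prototype's implementation), a key derived from the container's internal state
plus the ``query`` arguments lets the caller cache downstream work::

    from typing import Any, Dict, Tuple, Union

    class MutableArrayContainer:
        """Hypothetical container whose key changes whenever its data changes."""

        def __init__(self, **data):
            self._data = dict(data)
            self._generation = 0  # bumped on every mutation of the stored data

        def update(self, **new_data):
            self._data.update(new_data)
            self._generation += 1

        def query(self, coord_transform, size) -> Tuple[Dict[str, Any], Union[str, int]]:
            # Internal state plus the query arguments determine the key.
            key = f"{id(self)}-{self._generation}-{coord_transform}-{size}"
            return dict(self._data), key

    # Caller-side caching: reuse expensive downstream work while the key repeats.
    _cache: Dict[str, Any] = {}

    def draw(container, coord_transform, size):
        data, key = container.query(coord_transform, size)
        if key not in _cache:
            # Stand-in for the Step 2/3 work that we want to avoid repeating.
            _cache[key] = {k: [v * 2 for v in vals] for k, vals in data.items()}
        return _cache[key]

    c = MutableArrayContainer(x=[0, 1, 2], y=[3, 4, 5])
    draw(c, None, (640, 480))  # computed and cached
    draw(c, None, (640, 480))  # cache hit: same key
    c.update(y=[6, 7, 8])
    draw(c, None, (640, 480))  # key changed, recomputed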

The choice to return the data and cache key in one step, rather than as a
two-step process, is driven by simplicity and because the cache key is
computed as part of the query.

Currently only the results of Step 3 are cached, but we may want to
additionally cache intermediate results after Step 2.  The caching from Step 1
is likely best left to the ``DataContainer`` instances.

Detailed design notes
=====================

.. toctree::
   :maxdepth: 2

   containers

.. [#] Not strictly true: in some cases we also store the values of the data
   in the container it came in with, which may not be a `numpy.array`.
.. [#] For example `matplotlib.lines.Line2D.set_xdata` and