Commit 2e1b695

[REP] Roll out "strict mode" for Ray Data (#29)
Signed-off-by: Eric Liang <[email protected]>
1 parent 1952b1d commit 2e1b695

1 file changed

+98
-0
lines changed

reps/2023-04-27-data-strict-mode.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
1+
# Roll out "strict mode" for Ray Data
2+
3+
## Summary
4+
5+
Make a (breaking) API change to always require data schemas in Ray Data, dropping support for standalone Python objects. In addition to unification and simplicity benefits, this aligns the Ray Data API more closely with industry-standard distributed data APIs like Apache Spark and with emerging standards for machine learning datasets like HuggingFace.
6+
7+
### General Motivation
8+
9+
This REP proposes rolling out a breaking API change to Ray Data, termed "strict mode". In strict mode, support for standalone Python objects is dropped. This means that instead of directly storing, e.g., a Python `Tuple[str, int]` instance in Ray Data, users will have to either give each field a name (i.e., `{foo: str, bar: int}`) or use a named object-type field (i.e., `{foo: object}`). In addition, strict mode replaces the "default" batch format with "numpy" as the default. This means that most users only need to be aware of the `Dict[str, Any]` (non-batched data records) and `Dict[str, np.ndarray]` (batched data) types when working with Ray Data.
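
As a rough illustration of the two record shapes described above, the sketch below converts standalone tuples into dictionaries using plain Python only. The field names `foo` and `bar` come from the example above; the helper `to_record` is hypothetical and is not part of the Ray Data API.

```python
from typing import Any, Dict, Sequence, Tuple


def to_record(item: Tuple[Any, ...], field_names: Sequence[str]) -> Dict[str, Any]:
    """Give each tuple field a name (hypothetical helper, not a Ray Data API)."""
    if len(item) != len(field_names):
        raise ValueError(f"expected {len(field_names)} fields, got {len(item)}")
    return dict(zip(field_names, item))


legacy_items = [("a", 1), ("b", 2)]

# Option 1: name every field, i.e. {foo: str, bar: int}.
named = [to_record(item, ["foo", "bar"]) for item in legacy_items]

# Option 2: keep the object opaque under one named field, i.e. {foo: object}.
wrapped = [{"foo": item} for item in legacy_items]
```

Either list of dictionaries has the `Dict[str, Any]` record shape that strict mode expects; which option fits better depends on whether downstream code needs the individual fields as columns.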
10+
11+
The motivation for this change is to cut down on the number of alternative representations users have to be aware of in Ray Data, which complicate the docs and examples and add to new-user confusion.
12+
For reference, this is the main PR originally introducing strict mode: https://github.com/ray-project/ray/pull/34336
13+
14+
**Full list of changes**
15+
- All read APIs return structured data, never standalone Python objects.
16+
- Standalone Python objects are prohibited from being returned from `map` / `map_batches`.
17+
- Standalone NumPy arrays are prohibited from being returned from `map` / `map_batches`.
18+
- There is no longer any special interpretation of a single-column schema containing just `__value__` as a column.
19+
- The default batch format is "numpy" instead of "default" (pandas).
20+
- `schema()` returns a unified `Schema` class instead of `Union[pyarrow.lib.Schema, type]`.
21+
22+
**Datasource behavior changes**
23+
- `range_tensor`: create "data" col instead of `__value__`
24+
- `from_numpy` / `from_numpy_refs`: create "data" col instead of using `__value__`
25+
- `from_items`: create "item" col instead of using Python objects
26+
- `range`: create "id" column instead of using Python objects
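
To make the new row shapes concrete, here is a small plain-Python sketch that emulates what a row looks like for each datasource listed above. The column names are taken from that list; the `STRICT_MODE_COLUMNS` table and the `emulate_rows` helper are hypothetical and exist only for illustration, since they do not call Ray.

```python
from typing import Any, Dict, Iterable, List

# Column names copied from the list above (illustrative lookup table only).
STRICT_MODE_COLUMNS = {
    "range": "id",
    "from_items": "item",
    "from_numpy": "data",
    "from_numpy_refs": "data",
    "range_tensor": "data",
}


def emulate_rows(api: str, values: Iterable[Any]) -> List[Dict[str, Any]]:
    """Wrap each value in the single named column that `api` would create."""
    column = STRICT_MODE_COLUMNS[api]
    return [{column: value} for value in values]


range_rows = emulate_rows("range", range(3))
item_rows = emulate_rows("from_items", ["x", "y"])
```

Code that previously indexed rows positionally or treated them as bare values would now read the named column instead, for example `row["id"]` for `range`.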
27+
28+
The change itself has been well received in user testing, so the remainder of this REP will focus on the rollout strategy.
29+
30+
### Should this change be within `ray` or outside?
31+
Within the main `ray` project. The changes are made to Ray Data.
32+
33+
## Stewardship
34+
### Required Reviewers
35+
The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers.
36+
37+
@amogkam, @c21
38+
39+
### Shepherd of the Proposal (should be a senior committer)
40+
To make the review process more productive, the owner of each proposal should identify a **shepherd** (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review.
41+
42+
@pcmoritz
43+
44+
## Rollout Plan
45+
46+
### Impact of Changes
47+
48+
The proposed change mainly impacts users who work with in-memory data objects and image datasets. These users will get an error when trying to load data without a schema (e.g., ``StrictModeError: Error validating <data_item>: standalone Python objects are not allowed in strict mode. Please wrap the item in a dictionary like `{data: <data_item>}`. For more details and how to disable strict mode, visit DOC_URL_HERE.``).
49+
50+
### Notification
51+
52+
The main method of notification will be the ``StrictModeError`` exception raised when the user tries to create disallowed data types. The exception will link to documentation on how to upgrade / disable strict mode.
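
A minimal sketch of the kind of check that could raise this exception is shown below. It assumes that any non-mapping item counts as a "standalone Python object". The `StrictModeError` class here is a local stand-in defined for the example rather than imported from Ray, `validate_item` is a hypothetical name, and the documentation link from the real message is left out.

```python
from typing import Any, Mapping


class StrictModeError(ValueError):
    """Local stand-in for the exception named in this REP (not imported from Ray)."""


def validate_item(item: Any) -> None:
    """Reject items that are not dict-like records (assumed rule for this sketch)."""
    if not isinstance(item, Mapping):
        raise StrictModeError(
            f"Error validating {item!r}: standalone Python objects are not "
            "allowed in strict mode. Please wrap the item in a dictionary "
            f"like `{{data: {item!r}}}`."
        )


validate_item({"data": 42})  # A named record passes silently.

try:
    validate_item(42)  # A bare integer is rejected.
except StrictModeError as exc:
    message = str(exc)
```

Raising at the point where the item is created keeps the error close to the user code that needs to change.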
53+
54+
We will also add a warning banner (for a couple of releases) on the first import of Ray Data that notifies users of this change.
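
One way such a once-per-process banner could be implemented is sketched below with the standard `warnings` module. The function name, the state dictionary, and the banner text are all placeholders chosen for this example; the REP does not specify the actual wording or mechanism.

```python
import warnings

# Hypothetical module-level state; records whether the banner was already shown.
_banner_state = {"shown": False}


def maybe_show_strict_mode_banner() -> bool:
    """Emit the notice at most once per process; return True if it was emitted."""
    if _banner_state["shown"]:
        return False
    _banner_state["shown"] = True
    warnings.warn(
        "Placeholder banner text: strict mode is now enabled by default in Ray Data.",
        UserWarning,
        stacklevel=2,
    )
    return True
```

Calling the function from the package's import path would show the notice on first import only, and removing the call after a couple of releases retires the banner.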
55+
56+
### Timeline
57+
58+
- Ray 2.5: Enable strict mode by default, with the above notification plan.
59+
- Ray 2.6: No changes.
60+
- Ray 2.7 or after: Enforce strict mode always, and remove code for supporting the legacy code paths.
61+
62+
## Examples
63+
64+
### Before
65+
```python
import ray

ds = ray.data.range(5)
# -> Datastream(num_blocks=1, schema=<class 'int'>)

ds.take()[0]
# -> 0

ds.take_batch()
# -> [0, 1, 2, 3, 4]

ds.map_batches(lambda b: b * 2).take_batch()  # b is coerced to pd.DataFrame
# -> pd.DataFrame({"id": [0, 2, 4, 6, 8]})
```
78+
79+
### After
80+
```python
import ray

ds = ray.data.range(5)
# -> Datastream(num_blocks=1, schema={id: int64})

ds.take()[0]
# -> {"id": 0}

ds.take_batch()
# -> {"id": np.array([0, 1, 2, 3, 4])}

ds.map_batches(lambda b: {"id": b["id"] * 2}).take_batch()  # b is Dict[str, np.ndarray]
# -> {"id": np.array([0, 2, 4, 6, 8])}
```
93+
94+
Note that in the "after" code, the datastream always has a fixed schema, and the batch type is consistently a dict of numpy arrays.
95+
96+
## Test Plan and Acceptance Criteria
97+
98+
The master branch will have strict mode on by default. There will be a suite that tests basic functionality with strict mode off, to avoid regressions.
