Replies: 4 comments 4 replies
-
@kislerdm thank you for this detailed vision for taking SuperDuperDB to the data mesh. There seem to be some clear takeaways here:
Some comments on this:
-
BLUF: the key takeaways:
-
I love the idea of introducing RFCs and doing it in a public way like this, fantastic work @kislerdm. A lot of what you wrote really got me thinking, and out of my comfort zone at times; I really appreciate this. Thanks for taking the time to write this! 🙏 I don't seem to be able to add comments to the various sections (the only drawback that I can see with using GitHub discussions), so I've added a bunch of comments ordered by section below. Most of the comments are mainly meant as food for thought: ways that perhaps the RFC could be refined, but hopefully with enough wiggle room that there are multiple ways to address them.
- Context
- Proposed Solutions
- Data Processing Ways
- SDDB cluster orchestration
- Technical decisions
- General
-
I think the
Obviously 1. is the crucial argument here, but I think it's probably the case that we have more internal experience with one of them. Beyond the short term (which I know this document is targeting) I'm not sure which is the best option to go for. The only strategy I really have is that we try to list the arguments 'for' and 'against' each package, and make a decision based on that. That is why I think it's so important that the rows in your original table reflect the correct information (see my previous comment for some rows I do not believe to be correct).
I agree that this is a design choice. My personal approach for now is to stick with it.
Sorry, I just realised that the link I posted earlier to the relevant comment on this is broken, oops! Here is the correct link. Consulting, Funded Development, and Support (to name a few) are all other alternatives IMO.
I admit I don't have the experience to confidently say we should go with one or the other. My current reasons in favour of REST are:
Perhaps - I would need more details! But I like the fact that we are considering SMEs. For me, our core users are single devs and small organisations; these are the people I want to support and enable over the coming months.
-
Table of contents
Context
Following the Data Mesh approach, stream-aligned teams will provide domain data as a product. Such a transition poses a
bottleneck because the teams who own the domain logic may lack extensive expertise in analytics and machine learning.
To address this challenge, SuperDuperDB offers a system that enables stream-aligned teams to deliver data as a product
by applying advanced analytics and machine learning to the business domain's data directly, without a dependency on cold
data storage, a.k.a. a data lake.
SuperDuperDB solves three classes of problems:
Problem Statement
This document addresses SuperDuperDB's production readiness for successful integration into a running system.
Production readiness requirements
The SuperDuperDB production readiness strategies are outlined to satisfy the following requirements.
Goals
The document addresses the design of the SuperDuperDB system.
Non-Goals
The document does not cover implementation details such as priorities and timelines.
Proposed solutions
Assumptions
The following assumptions are made to satisfy the readiness requirements:
coupling to SDDB, nor does it depend on SDDB.
as a k8s custom resource.
modification, and to prevent PII leaks.
Topology
SuperDuperDB is a stateful application with three inbound dependencies and one outbound dependency.
Inbound dependencies:
Outbound dependency:
Data processing ways
SuperDuperDB operates in two ways for data processing, batch and streaming:
Batch processing is executed asynchronously, or synchronously, upon submission of a job to process a batch of data.
The job is submitted following the request-response protocol, with the request carrying the job configuration, and
the response carrying the job's metadata or its results.
Streaming is executed asynchronously after the SDDB system consumes a pub/sub event with the job submission details.
The job result is published back to a dedicated topic/stream.
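The two processing paths can be contrasted in a minimal sketch. All names here (`run_job`, `submit_batch`, `on_event`, the config fields, and the in-process `Queue` standing in for a pub/sub topic) are illustrative assumptions, not the actual SDDB interfaces:

```python
"""Sketch of the two data-processing paths: batch (request-response)
and streaming (pub/sub). Names are hypothetical, for illustration only."""
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class JobResult:
    job_id: str
    metadata: dict = field(default_factory=dict)


def run_job(config: dict) -> JobResult:
    # Placeholder for the actual model training / prediction work.
    return JobResult(job_id=config["job_id"], metadata={"status": "done"})


def submit_batch(config: dict) -> JobResult:
    """Batch: request-response; the response carries the job's metadata."""
    return run_job(config)


def on_event(event: dict, results_topic: Queue) -> None:
    """Streaming: consume a pub/sub event, publish the result to a topic."""
    results_topic.put(run_job(event))
```

In a real deployment the `Queue` would be a Kafka topic or similar stream, and `submit_batch` would sit behind the request-response API.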
Batch processing
This mode is used for model training and batch prediction.
There are two scenarios to trigger a job's execution:
The diagrams below illustrate the execution flow.
In-database call
Programmatic call
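The two trigger scenarios might be contrasted in code roughly as follows. The `Scheduler`, `Database`, and `Client` classes are purely hypothetical stand-ins for the components in the diagrams:

```python
"""Sketch of the two batch-job trigger scenarios: an in-database call
(a query whose side effect submits a job) versus a direct programmatic
call. All class and method names are illustrative assumptions."""
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    source: str  # which trigger path created the job


class Scheduler:
    """Accepts job submissions from either trigger path."""
    def __init__(self) -> None:
        self.jobs: list[Job] = []

    def submit(self, job_id: str, source: str) -> Job:
        job = Job(job_id, source)
        self.jobs.append(job)
        return job


class Database:
    """In-database call: executing a query submits the job as a side effect."""
    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler

    def execute(self, query: str) -> Job:
        # e.g. a SELECT-with-PREDICT style query triggers a prediction job
        return self.scheduler.submit(f"q-{len(self.scheduler.jobs)}", "in-database")


class Client:
    """Programmatic call: the job is submitted directly via the API."""
    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler

    def submit_job(self, job_id: str) -> Job:
        return self.scheduler.submit(job_id, "programmatic")
```

Both paths converge on the same scheduler, which matches the execution flow shown in the diagrams.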
Streaming
This mode is used for real-time predictions.
Architecture design

The diagram's code: here. The diagram highlights three types of teams which are expected to interact with the SDDB solution. It illustrates
dependencies between SuperDuperDB and the running systems owned by the stream-aligned teams.
Container diagram

The diagram's code: here. The container diagram illustrates the following critical points:
and model configurations.
SDDB cluster orchestration
Flyte's focus is on orchestration
Flyte: union.ai
Flyte is focused more on facilitating infrastructure topics, while Dask is focused more on facilitating the definition of
analytics tasks. Ray sits somewhere in between; moreover, it provides interfaces to integrate with both Dask and Flyte.
SDDB design
Components
Architecturally, the SDDB runner includes five main components:
- IAM: the authN/Z component to ensure a zero-trust policy;
- Compute: to define and configure the model logic;
- Storage: the interface to communicate with the data source;
- SerDe: data format definition;
- Observability: observability and model performance metrics.

Technical decisions
Communication protocol between the client and SDDB
- Strong contract guarantees
- Contract maintenance
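A strong contract guarantee could, for instance, be enforced by versioning the payload schema and validating it on both sides of the client-SDDB boundary. A minimal sketch, assuming a hypothetical versioned JSON contract; the endpoint, field names, and `JobRequest` type are illustrative, not the actual SDDB API:

```python
"""Sketch of a versioned request contract for job submission, e.g. the
body of POST /api/v1/jobs. All names are hypothetical assumptions."""
import json
from dataclasses import asdict, dataclass

API_VERSION = "v1"  # pinning the version keeps the contract explicit


@dataclass
class JobRequest:
    model: str
    collection: str
    mode: str  # "train" or "predict"

    def validate(self) -> None:
        if self.mode not in ("train", "predict"):
            raise ValueError(f"unknown mode: {self.mode!r}")


def encode(req: JobRequest) -> str:
    """Client side: validate, then serialise with the contract version."""
    req.validate()
    return json.dumps({"version": API_VERSION, **asdict(req)})


def decode(payload: str) -> JobRequest:
    """Server side: reject unknown contract versions before deserialising."""
    data = json.loads(payload)
    if data.pop("version", None) != API_VERSION:
        raise ValueError("unsupported contract version")
    return JobRequest(**data)
```

Maintaining the contract then reduces to evolving the schema under an explicit version, with both encoder and decoder rejecting payloads they do not understand.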
Questions
Integration checklist
One time:
Recurring:
Implementation
TBD
Open questions
Glossary
References