Replies: 4 comments 4 replies
-
@kislerdm thank you for this detailed vision for taking SuperDuperDB to the data mesh. There seem to be some clear takeaways here:
Some comments on this:
-
BLUF: the key takeaways:
-
I love the idea of introducing RFCs and doing it in a public way like this, fantastic work @kislerdm. A lot of what you wrote really got me thinking, and out of my comfort zone at times; I really appreciate this. Thanks for taking the time to write this! 🙏 I don't seem to be able to add comments to the various sections (the only drawback that I can see with using GitHub discussions), so I've added a bunch of comments ordered by section below. Most of the comments are mainly meant as food for thought: ways that perhaps the RFC could be refined, but hopefully with enough wiggle room that there are multiple ways to address them.
- Context
- Proposed Solutions
- Data Processing Ways
- SDDB cluster orchestration
- Technical decisions
- General
-
I think the
Obviously 1. is the crucial argument here, but I think it's probably the case that we have more internal experience with one of them. Beyond the short term (which I know this document is targeting) I'm not sure which is the best option to go for. The only strategy I really have is that we try to list the arguments 'for' and 'against' each package, and make a decision based on that. That is why I think it's so important that the rows in your original table reflect the correct information (see my previous comment for some rows I do not believe to be correct).
I agree that this is a design choice. My personal approach for now is to stick with it.
Sorry, I just realised that the link I posted earlier to the relevant comment on this is broken, oops! Here is the correct link. Consulting, Funded Development, and Support (to name a few) are all other alternatives IMO.
I admit I don't have the experience to confidently say we should go with one or the other. My current reasons in favour of REST are:
Perhaps - I would need more details! But I like the fact that we are considering SMEs. For me, our core users are single devs and small organisations; these are the people I want to support and enable over the coming months.
-
Table of contents
Context
Following the Data Mesh approach, stream-aligned teams will provide domain data as a product. Such a transition poses a
bottleneck because the teams who own the domain logic may lack extensive expertise in analytics and machine learning.
To address this challenge, SuperDuperDB offers a system that enables stream-aligned teams to deliver data as a product
by applying advanced analytics and machine learning to the business domain's data directly, without a dependency on cold
data storage, a.k.a. a data lake.
SuperDuperDB solves three classes of problems:
Problem Statement
This document addresses SuperDuperDB's production readiness for successful integration into a running system.
Production readiness requirements
The SuperDuperDB production readiness strategies are outlined to satisfy the following requirements.
Goals
The document addresses the design of the SuperDuperDB system.
Non-Goals
The document does not cover implementation details such as priorities and timelines.
Proposed solutions
Assumptions
The following assumptions are made to satisfy the readiness requirements:
coupling to SDDB, nor does it depend on SDDB.
as a k8s custom resource.
modification, and to prevent PII leaks.
Topology
SuperDuperDB is a stateful application with three inbound dependencies and one outbound dependency.
Inbound dependencies:
Outbound dependency:
Data processing ways
SuperDuperDB operates in two ways for data processing, batch and streaming:
Batch processing is executed asynchronously, or synchronously, upon submission of a job to process a batch of data.
The job is submitted following the request-response protocol, with the request carrying the job configuration, and
the response carrying the job's metadata or its results.
Streaming is executed asynchronously after the SDDB system consumes a pub/sub event with the job submission details.
The job result is published back to a dedicated topic/stream.
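The two processing paths can be contrasted in a minimal sketch. All names here (`run_job`, `submit_batch`, `on_event`, the config fields, and the in-process `Queue` standing in for a pub/sub topic) are illustrative assumptions, not the actual SDDB interfaces:

```python
"""Sketch of the two data-processing paths: batch (request-response)
and streaming (pub/sub). Names are hypothetical, for illustration only."""
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class JobResult:
    job_id: str
    metadata: dict = field(default_factory=dict)


def run_job(config: dict) -> JobResult:
    # Placeholder for the actual model training / prediction work.
    return JobResult(job_id=config["job_id"], metadata={"status": "done"})


def submit_batch(config: dict) -> JobResult:
    """Batch: request-response; the response carries the job's metadata."""
    return run_job(config)


def on_event(event: dict, results_topic: Queue) -> None:
    """Streaming: consume a pub/sub event, publish the result to a topic."""
    results_topic.put(run_job(event))
```

In a real deployment the `Queue` would be a Kafka topic or similar stream, and `submit_batch` would sit behind the request-response API.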
Batch processing
This mode is used for model training and batch prediction.
There are two scenarios to trigger a job's execution:
The diagrams below illustrate the execution flow.
In-database call
Programmatic call
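The two trigger scenarios might be contrasted in code roughly as follows. The `Scheduler`, `Database`, and `Client` classes are purely hypothetical stand-ins for the components in the diagrams:

```python
"""Sketch of the two batch-job trigger scenarios: an in-database call
(a query whose side effect submits a job) versus a direct programmatic
call. All class and method names are illustrative assumptions."""
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    source: str  # which trigger path created the job


class Scheduler:
    """Accepts job submissions from either trigger path."""
    def __init__(self) -> None:
        self.jobs: list[Job] = []

    def submit(self, job_id: str, source: str) -> Job:
        job = Job(job_id, source)
        self.jobs.append(job)
        return job


class Database:
    """In-database call: executing a query submits the job as a side effect."""
    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler

    def execute(self, query: str) -> Job:
        # e.g. a SELECT-with-PREDICT style query triggers a prediction job
        return self.scheduler.submit(f"q-{len(self.scheduler.jobs)}", "in-database")


class Client:
    """Programmatic call: the job is submitted directly via the API."""
    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler

    def submit_job(self, job_id: str) -> Job:
        return self.scheduler.submit(job_id, "programmatic")
```

Both paths converge on the same scheduler, which matches the execution flow shown in the diagrams.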
Streaming
This mode is used for real-time predictions.
Architecture design

The diagram's code: here. The diagram highlights three types of teams which are expected to interact with the SDDB solution. It illustrates
dependencies between SuperDuperDB and the running systems owned by the stream-aligned teams.
Container diagram

The diagram's code: here. The container diagram illustrates the following critical points:
and model configurations.
SDDB cluster orchestration
Flyte's focus is on orchestration
Flyte: union.ai
Flyte is focused more on facilitating infrastructure topics, while Dask is focused more on facilitating the definition of
analytics tasks. Ray sits somewhere in between; moreover, it provides interfaces to integrate with both Dask and Flyte.
SDDB design
Components
Architecturally, the SDDB runner includes five main components:
- IAM: the authN/Z component to ensure a zero-trust policy;
- Compute: to define and configure the model logic;
- Storage: the interface to communicate with the data source;
- SerDe: data format definition;
- Observability: observability and model performance metrics.

Technical decisions
Communication protocol between the client and SDDB
- Strong contract guarantees
- Contract maintenance
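A strong contract guarantee could, for instance, be enforced by versioning the payload schema and validating it on both sides of the client-SDDB boundary. A minimal sketch, assuming a hypothetical versioned JSON contract; the endpoint, field names, and `JobRequest` type are illustrative, not the actual SDDB API:

```python
"""Sketch of a versioned request contract for job submission, e.g. the
body of POST /api/v1/jobs. All names are hypothetical assumptions."""
import json
from dataclasses import asdict, dataclass

API_VERSION = "v1"  # pinning the version keeps the contract explicit


@dataclass
class JobRequest:
    model: str
    collection: str
    mode: str  # "train" or "predict"

    def validate(self) -> None:
        if self.mode not in ("train", "predict"):
            raise ValueError(f"unknown mode: {self.mode!r}")


def encode(req: JobRequest) -> str:
    """Client side: validate, then serialise with the contract version."""
    req.validate()
    return json.dumps({"version": API_VERSION, **asdict(req)})


def decode(payload: str) -> JobRequest:
    """Server side: reject unknown contract versions before deserialising."""
    data = json.loads(payload)
    if data.pop("version", None) != API_VERSION:
        raise ValueError("unsupported contract version")
    return JobRequest(**data)
```

Maintaining the contract then reduces to evolving the schema under an explicit version, with both encoder and decoder rejecting payloads they do not understand.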
Questions
Integration checklist
One time:
Recurring:
Implementation
TBD
Open questions
Glossary
References