Skip to content

TODO in Metafile.get_class() (storage/model/metafile.py) #501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: 2.0
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 23 additions & 0 deletions deltacat/storage/model/metafile.py
Original file line number Diff line number Diff line change
Expand Up @@ -517,6 +517,29 @@ class type to instantiate.
# is brittle to renames. On the other hand, this implementation does
# not require any marker fields to be persisted, and a regression
# will be quickly detected by test_metafile.io or other unit tests

# (sahil) Updated implementation
# The simplest and most reliable way to go about this is to add an
# additional "_type" key to all (serialized) metafiles. Then, there is
# no need to "infer" what class we are dealing with. If space is a
# concern, I think this approach adds just a couple of bytes to every
# metafile. There are less simple ways we can implement "_type" such as
# using an low-bit integer/enum to identify the class instead of a string.
# Implementhing this would involve updating several files with a new field
# and updating any serialization logic to account for the new field.

# Sample new implementation:
# if "_type" in serialized_dict:
# type_code = serialized_dict["_type"]
# if type_code in SOME_MAPPING:
# return SOME_MAPPING[type_code]
# else:
# raise an error
# else:
# raise and error
Comment on lines +531 to +539
Copy link
Member

@pdames pdames Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the proposal - it's definitely an improvement on what's currently here. Space optimality of metafiles isn't the biggest concern right now, since we generally expect things like S3 put/get latency to outweigh the cost of downloading the bytes as long as they all stay in the KiB size range.

With that being said, I think I'd still prefer to just use an integer-based enum here to not waste much more space than we need to when serializing with msgpack. This will probably be most relevant for Deltas, which will commonly be present in the millions or billions per table, so extra bytes will start to add up here (which reminds me of another important TODO to store the largest part of the Delta - the Manifest - in a separate file whenever it's in-memory size is larger than, say, 128 KiB).

WDYT about an enum like this that we add to storage/model/types.py:

class MetafileType(int, Enum):
    NAMESPACE = 1
    TABLE = 2
    TABLE_VERSION = 3
    STREAM = 4
    PARTITION = 5
    DELTA = 6
    
    # usage examples
    >>> MetafileType(1)
    <MetafileType.NAMESPACE: 1>
    >>> MetafileType(1).value
    1
    >>> MetafileType(1).name
    'NAMESPACE'
    >>> MetafileType.NAMESPACE
    <MetafileType.NAMESPACE: 1>



# OLD IMPL
if serialized_dict.__contains__("tableLocator"):
return deltacat.storage.model.table.Table
elif serialized_dict.__contains__("namespaceLocator"):
Expand Down