[IMP] snippets: move all work from parent to mp workers #137
Conversation
Force-pushed from 3cb6112 to 2c980fa
ok, I stopped the madness. If anyone wants to review ... :-)
Force-pushed from 2c980fa to bd55b91
Force-pushed from bd55b91 to 3a69c90
This looks mostly good.
Force-pushed from 3a69c90 to 9ba22c9
Force-pushed from 9ba22c9 to a697852
Some suggestions about the row processing method. Untested!
Regarding the test, I'm OK with it, but you could add a comment to clarify the extra steps in the test.
Force-pushed from b6888a6 to 117114b
All comments resolved.
cc: @KangOl
Force-pushed from 117114b to b8eebed
𓃠 ?
🐈 ?
Force-pushed from 0ee95a6 to 3a916ac
@nseinlet concerning our earlier coffee-machine meeting: wdyt about the fixup just pushed?
Force-pushed from 3a916ac to 9405018
Rebased, conflicts resolved.
Force-pushed from 9405018 to fe1f381
Today I improved the comments about backwards compatibility, rebased, and squashed the fixups. Do I get a review? And/or a 🐈⬛?
src/util/snippets.py (Outdated)
def __call__(self, query):
    # backwards compatibility: caller passes rows in the "query" argument and expects us to return converted rows
    if not (self.dbname and self.update_query and isinstance(query, str)):
        return self._convert_row(query)
We can make this simpler: if dbname is set, we know it is the new implementation.
Suggested change:

def __call__(self, row_or_query):
    # backwards compatibility: caller passes rows in the "query" argument and expects us to return converted rows
    if not self.dbname:
        return self._convert_row(row_or_query)
I find this merged implementation unnecessary. The two interfaces cannot be used transparently by the same client code because we need the extra parameters in the new version. In this case I think good old OOP may be cleaner. For example, rename this class to ConvertorNG and code it for the query variant. Then make Convertor a subclass that keeps the old behavior.
class ConvertorNG:
    def __init__(self, converters, callback, dbname, update_query):
        ...

    def __call__(self, query):
        # filter out rows that do not have an id, instead of filtering by None
        ...

    def _convert_row(self, row):
        # do not return None if there is no id; use the original impl
        ...


class Convertor(ConvertorNG):
    """Deprecated, kept for compatibility; use ConvertorNG instead."""

    def __init__(self, converters, callback):
        super().__init__(converters, callback, None, None)

    def __call__(self, row):
        return self._convert_row(row)
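
For illustration, a minimal usage sketch of the two interfaces as outlined above. The converter, callback, table, and query/row values are hypothetical; only the calling convention matters here.

# hypothetical stubs; the real converter/callback signatures may differ
converters = {"body": lambda html: html}
callback = print

# old interface: the caller fetches the rows itself and invokes the convertor per row
legacy = Convertor(converters, callback)
for row in [(1, "<p>a</p>"), (2, "<p>b</p>")]:
    legacy(row)

# new interface: each worker receives a query text and does the DB work itself
ng = ConvertorNG(converters, callback, dbname="mydb", update_query="UPDATE my_table SET body = %s WHERE id = %s")
ng("SELECT id, body FROM my_table WHERE id BETWEEN 1 AND 1000")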
I took the suggestion to make it simpler. Sub-classing may be perfectly clean, but it is also a lot of additional loc. Do we want that?
> I took the suggestion to make it simpler. Sub-classing may be perfectly clean, but it is also a lot of additional loc. Do we want that?
TL;DR: I'm OK with both variants.
It's not about LOC, it's about clarity in the implementation. If somebody out there is using the old interface, we don't break them; we will always use the new one, including in this PR. IMO it isn't worth having a double approach internally if we would never use the other alternative. I'm still OK with this variant of the code, but having to read and understand the "compatibility code" the next time we need to debug is unnecessary if we can just move it out of the way.
IMO, having a dual-use parameter query_or_row only makes sense if we will use both variants. Since we are not, I'd suggest a new class, keeping the old one for compatibility. Note that we already do that in many other places, just as function aliases; here we are modifying a callable class, hence the subclass suggestion.
That being said, do not change it just for me. Let's ping @KangOl for a final review :)
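
For reference, a quick sketch of the function-alias deprecation pattern mentioned above; the names are hypothetical and not actual util functions. The subclass suggestion is the same idea applied to a callable class.

import warnings


def do_convert_ng(cr, table, columns):
    """Hypothetical new implementation working on query texts."""
    ...


def do_convert(cr, table, columns):
    """Deprecated alias kept for backward compatibility; use do_convert_ng()."""
    warnings.warn(
        "do_convert() is deprecated, use do_convert_ng() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return do_convert_ng(cr, table, columns)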
Force-pushed from 5532679 to aca366e
In `convert_html_columns()`, we select 100MiB worth of DB tuples and pass them to a ProcessPoolExecutor together with a converter callable. So far, the converter returns all tuples, changed or unchanged, together with the information whether it has changed something. All of this is returned through IPC to the parent process. In the parent process, however, the caller only acts on the changed tuples; the rest is ignored. In any scenario I've seen, only a small proportion of the input tuples is actually changed, meaning that a large proportion is returned through IPC unnecessarily.

What makes it worse is that processing of the converted results in the parent process is often slower than the conversion itself, leading to two effects:
1) The results of all workers sit in the parent process's memory, possibly leading to MemoryError (upg-2021031)
2) The parallel processing is serialized on the feedback, defeating a large part of the intended performance gains

To improve this, this commit
- moves all work into the workers: not just the conversion filter, but also the DB query as well as the DB updates
- by doing so, reduces the amount of data passed via IPC to just the query texts
- by doing so, distributes the data held in memory across all worker processes
- reduces the chunk size by one order of magnitude, which means
  - a lot less memory used at a time
  - a much better distribution of "to-be-changed" rows when these rows are clustered in the table

All in all, in my test case, this
- reduces the maximum process size in memory to 300MiB for all processes, compared to formerly >2GiB (and MemoryError) in the parent process
- reduces the runtime from 17 minutes to less than 2 minutes
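
For illustration only, not the code of this PR: a minimal sketch of the pattern the commit message describes, where the parent process only prepares query texts and each worker opens its own DB connection, fetches and converts its chunk, and writes the updates back, so only a small summary travels over IPC. The connection string, table, chunking scheme, and convert_row() are all hypothetical, and psycopg2/PostgreSQL is an assumption.

import concurrent.futures as cf

import psycopg2  # assumption: the target DB is PostgreSQL

DSN = "dbname=mydb"  # hypothetical connection string
CHUNK = 1000         # hypothetical chunk size


def convert_row(html):
    # hypothetical converter: returns (changed, new_html)
    new_html = html.replace("old-snippet", "new-snippet")
    return new_html != html, new_html


def process_chunk(query):
    """Run one chunk entirely inside the worker: SELECT, convert, UPDATE."""
    changed = 0
    with psycopg2.connect(DSN) as conn, conn.cursor() as cr:
        cr.execute(query)  # e.g. SELECT id, body FROM my_table ... LIMIT/OFFSET
        for row_id, html in cr.fetchall():
            was_changed, new_html = convert_row(html)
            if was_changed:
                cr.execute(
                    "UPDATE my_table SET body = %s WHERE id = %s",  # hypothetical table
                    (new_html, row_id),
                )
                changed += 1
    # only a small summary goes back through IPC, not the rows themselves
    return changed


if __name__ == "__main__":
    # the parent only prepares query texts, one per chunk
    queries = [
        f"SELECT id, body FROM my_table ORDER BY id LIMIT {CHUNK} OFFSET {offset}"
        for offset in range(0, 10 * CHUNK, CHUNK)
    ]
    with cf.ProcessPoolExecutor() as executor:
        total = sum(executor.map(process_chunk, queries))
    print(f"{total} rows updated")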
Force-pushed from aca366e to 34f4ec9