[FIX] spreadsheet: batch process spreadsheet_revision.commands
#284
base: master
Conversation
Good work! :)
src/util/spreadsheet/misc.py
Outdated
def iter_commands(cr, like_all=(), like_any=()):
    if not (bool(like_all) ^ bool(like_any)):
        raise ValueError("Please specify `like_all` or `like_any`, not both")
    cr.execute(
    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
Using a context manager you do not need to close it explicitly.
-    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
+    with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:
That said, this is just in the name of a more pythonic implementation. IOW: imo you can keep your current version, if you like it better.
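A minimal sketch of the two styles being compared, assuming `pg.named_cursor` is the util helper used in this PR (a psycopg2 server-side cursor that also works as a context manager, as the suggestion above implies); the import path, query and BATCH_SIZE value are illustrative and not taken from the PR:

from odoo.upgrade.util import pg  # import path assumed for this sketch

BATCH_SIZE = 100  # illustrative value

def explicit_close(cr):
    # manual variant: the named cursor must be closed by hand
    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
    try:
        ncr.execute("SELECT id, commands FROM spreadsheet_revision")
        for revision_id, commands in ncr:
            ...
    finally:
        ncr.close()

def context_manager(cr):
    # context-manager variant: the cursor is closed automatically on exit
    with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:
        ncr.execute("SELECT id, commands FROM spreadsheet_revision")
        for revision_id, commands in ncr:
            ...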
Force-pushed from 3752d09 to 327a6f6
src/util/spreadsheet/misc.py
Outdated
SELECT id,
       commands
  FROM spreadsheet_revision
 WHERE commands LIKE {}(%s::text[])
Keep the formatting.
SELECT id,
       commands
  FROM spreadsheet_revision
 WHERE commands LIKE {}(%s::text[])
src/util/spreadsheet/misc.py
Outdated
if "commands" not in data_loaded: | ||
continue | ||
data_old = json.dumps(data_loaded, sort_keys=True) | ||
with pg.named_cursor(cr, itersize=1) as ncr: |
You can either leave the default `itersize`, or optimize the value to something that would work depending on the data. Another alternative is to use `fetchmany` directly.
-    with pg.named_cursor(cr, itersize=1) as ncr:
+    with pg.named_cursor(cr) as ncr:
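A hedged sketch of the `fetchmany` alternative mentioned above, applied to the same server-side cursor; the batch size, query and function name are illustrative and not taken from the PR:

FETCH_SIZE = 100  # illustrative batch size

def iter_revisions(cr):
    # rely on an explicit fetchmany() instead of itersize-driven iteration
    with pg.named_cursor(cr) as ncr:
        ncr.execute("SELECT id, commands FROM spreadsheet_revision")
        while True:
            rows = ncr.fetchmany(FETCH_SIZE)
            if not rows:
                break
            for revision_id, commands in rows:
                yield revision_id, commands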
Some dbs have `spreadsheet_revision` records with over 10 million characters in `commands`. If the number of records is high, this leads to memory errors. We distribute them into buckets of at most `memory_cap` size and use a named cursor to process them bucket by bucket. Commands larger than `memory_cap` each get a bucket of their own.
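A plain-Python illustration of the bucketing rule described above (the PR itself assigns buckets in SQL); `rows` is assumed to be a list of `(id, length_of_commands)` pairs:

def distribute(rows, memory_cap):
    # greedily fill buckets up to memory_cap; an oversized record ends up alone
    buckets, current, current_size = [], [], 0
    for rev_id, length in sorted(rows, key=lambda r: r[1]):
        if current and current_size + length > memory_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(rev_id)
        current_size += length
    if current:
        buckets.append(current)
    return buckets

# with memory_cap=10, sizes 4, 5 and 12 yield two buckets: ["a", "b"] and ["c"]
print(distribute([("a", 4), ("b", 5), ("c", 12)], memory_cap=10))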
Force-pushed from 327a6f6 to 508732d
start_bucket AS (
    SELECT 1 AS bucket
),
ordered_rows AS (
    SELECT id,
           LENGTH(commands) AS length,
           ROW_NUMBER() OVER (ORDER BY LENGTH(commands), id) AS rownum
      FROM spreadsheet_revision
     WHERE commands LIKE {}(%s::text[])
),
assign AS (
    SELECT o.id AS id,
           o.length,
           o.rownum,
           sb.bucket AS bucket,
           o.length AS sum
      FROM ordered_rows o, start_bucket sb
     WHERE o.rownum = 1

     UNION ALL

    SELECT o.id AS id,
           o.length,
           o.rownum,
           CASE
               WHEN a.sum + o.length > {memory_cap} THEN a.bucket + 1
               ELSE a.bucket
           END AS bucket,
           CASE
               WHEN a.sum + o.length > {memory_cap} THEN o.length
               ELSE a.sum + o.length
           END AS sum
      FROM assign a
      JOIN ordered_rows o
        ON o.rownum = a.rownum + 1
)
SELECT count(*), ARRAY_AGG(id)
  FROM assign
 GROUP BY bucket
 ORDER BY bucket;
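A hedged sketch of how the per-bucket id arrays produced by the query above might be consumed, using the same `pg.named_cursor` helper; `bucket_query` stands for the statement above with its placeholders resolved, and `process_commands` is a hypothetical callback:

def process_in_buckets(cr, bucket_query, patterns):
    with pg.named_cursor(cr) as ncr:
        ncr.execute(bucket_query, [patterns])
        for _count, ids in ncr:  # one row per bucket: count(*), ARRAY_AGG(id)
            cr.execute(
                "SELECT id, commands FROM spreadsheet_revision WHERE id = ANY(%s)",
                [ids],
            )
            for revision_id, commands in cr.fetchall():
                process_commands(revision_id, commands)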
I think this can be simplified. You only need, for each `id`, the number of the bucket it would land in. In case one record's length is longer than a whole bucket, you want to keep that record alone in its bucket, to avoid some smaller records making the group take even more space.
You can return the ids and commands already grouped per bucket. That way you can iterate over the named cursor one by one.
Example query: (/!\ warning: untested /!\)
buckets AS (
    SELECT id,
           commands,
           div(SUM(length(commands)) OVER (ORDER BY id), %(bucket_size)s) AS num,
           length(commands) > %(bucket_size)s AS alone
      FROM spreadsheet_revision
     WHERE commands LIKE {}(%(filters)s::text[])
     ORDER BY id
)
SELECT ARRAY_AGG(id ORDER BY id),
       ARRAY_AGG(commands ORDER BY id)
  FROM buckets
 GROUP BY num, alone
then
for ids, datas in ncr:  # you can iterate one by one here
    for id, data in zip(ids, datas):
        ...
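An untested, self-contained sketch of how the suggestion above could be wired up, mirroring the reviewer's example query with the `{}` placeholder resolved to `ANY`; `pg.named_cursor` is the util helper used in the PR, the remaining names are illustrative:

BUCKET_QUERY = """
    WITH buckets AS (
        SELECT id,
               commands,
               div(SUM(length(commands)) OVER (ORDER BY id), %(bucket_size)s) AS num,
               length(commands) > %(bucket_size)s AS alone
          FROM spreadsheet_revision
         WHERE commands LIKE ANY(%(filters)s::text[])
    )
    SELECT ARRAY_AGG(id ORDER BY id),
           ARRAY_AGG(commands ORDER BY id)
      FROM buckets
     GROUP BY num, alone
"""

def iter_commands_bucketed(cr, patterns, bucket_size):
    with pg.named_cursor(cr) as ncr:
        ncr.execute(BUCKET_QUERY, {"filters": list(patterns), "bucket_size": bucket_size})
        for ids, datas in ncr:  # each fetched row is one whole bucket
            for revision_id, data in zip(ids, datas):
                yield revision_id, data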
[FIX] spreadsheet: batch process spreadsheet_revision.commands

Some dbs have `spreadsheet_revision` records with over 10 million characters in `commands`. If the number of records is high, this leads to memory errors here. We distribute them into buckets of at most `memory_cap` size and use a named cursor to process them bucket by bucket. Commands larger than `memory_cap` each get a bucket of their own.

Fixes upg-2899961