Description
After updating to ProtoBuf 1.0.0 (#124), I found that summaries are not logged correctly to TensorBoard.
Some of them do get logged, but some don't. I suspect that some summaries are fine, but once one of them is logged incorrectly, the file (or TensorBoard) stops registering the ones that follow.
I prepared a minimal reproducing example by updating the Flux example to the new Flux API (the existing example uses a deprecated API):
nomadbl@fc9ba3e
During logging I observe the following error message:
[2023-07-01T23:12:43Z WARN rustboard_core::run] Read error in ./content/log/events.out.tfevents.1.68825314069885e9.lior-HP-Pavilion-Laptop-15-cs3xxx: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x85987b32), want: MaskedCrc(0x00000000) }))
After some googling, I can only speculate that this has something to do with multiprocessing: the file may be getting written by multiple instances of the logger in different threads.
So far I have tried (without success) to fix it under that assumption by specifying that the logger should lock the file:
src/TBLogger.jl, line 119:
file = open(fpath, "w"; lock=true)
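For what it's worth, `lock=true` is already the default for Julia's `open`, and it protects each individual I/O call rather than a sequence of calls. If one record is emitted through several `write` calls from several tasks, the pieces could still interleave. Below is a minimal sketch of serializing whole records behind a single lock, assuming that interleaving is really the cause (unverified). `LockedWriter` and `write_record` are hypothetical names for illustration, not part of TensorBoardLogger.jl.

```julia
# Hypothetical wrapper: one lock shared by every task that writes to `io`.
struct LockedWriter{T<:IO}
    io::T
    lk::ReentrantLock
end
LockedWriter(io::IO) = LockedWriter(io, ReentrantLock())

# Write all chunks of one record while holding the lock, so that chunks
# coming from different tasks cannot interleave in the output.
function write_record(w::LockedWriter, chunks::Vector{UInt8}...)
    lock(w.lk) do
        for c in chunks
            write(w.io, c)
        end
    end
    return nothing
end
```

If the corruption persists with writes serialized like this, the multi-threading hypothesis becomes less likely and the record contents themselves are the next thing to check.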
Any other ideas or insights are welcome. I'll try to isolate the issue using the above-mentioned reproducing code.
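To narrow down which record breaks the file, one option might be to scan the events file directly. The sketch below assumes the standard TFRecord framing used by TensorBoard event files: an 8-byte little-endian length, a 4-byte masked CRC32C of those length bytes, the payload, and a 4-byte masked CRC32C of the payload. It counts the intact records before the first failure. The helper names (`masked_crc`, `le_uint`, `check_tfrecords`, `write_tfrecord`) are made up for this illustration and are not part of TensorBoardLogger.jl.

```julia
using CRC32c  # Julia standard library

# TFRecord "masked" CRC32C: rotate right by 15 bits, then add a constant.
# UInt32 arithmetic wraps on overflow, which is what the format expects.
function masked_crc(bytes::Vector{UInt8})
    crc = crc32c(bytes)
    return ((crc >> 15) | (crc << 17)) + 0xa282ead8
end

# Decode a little-endian unsigned integer of type T from raw bytes.
le_uint(::Type{T}, bytes::Vector{UInt8}) where {T} = ltoh(reinterpret(T, bytes)[1])

# Walk the records in `io` and return (n_ok, problem): `problem` is `nothing`
# if every record checks out, otherwise a Symbol naming the first failure.
function check_tfrecords(io::IO)
    n_ok = 0
    while !eof(io)
        len_bytes = read(io, 8)
        length(len_bytes) == 8 || return (n_ok, :truncated_length)
        crc_bytes = read(io, 4)
        length(crc_bytes) == 4 || return (n_ok, :truncated_length_crc)
        le_uint(UInt32, crc_bytes) == masked_crc(len_bytes) || return (n_ok, :bad_length_crc)

        len = Int(le_uint(UInt64, len_bytes))
        data = read(io, len)
        length(data) == len || return (n_ok, :truncated_data)
        crc_bytes = read(io, 4)
        length(crc_bytes) == 4 || return (n_ok, :truncated_data_crc)
        le_uint(UInt32, crc_bytes) == masked_crc(data) || return (n_ok, :bad_data_crc)

        n_ok += 1
    end
    return (n_ok, nothing)
end

# Convenience method for a file path.
check_tfrecords(path::AbstractString) = open(check_tfrecords, path)

# Reference writer with the same framing, handy for testing the checker.
function write_tfrecord(io::IO, data::Vector{UInt8})
    len_bytes = collect(reinterpret(UInt8, [htol(UInt64(length(data)))]))
    write(io, len_bytes, htol(masked_crc(len_bytes)), data, htol(masked_crc(data)))
end
```

Calling `check_tfrecords` on the events file from the warning would give the number of good records before the first bad one, which should map back to the order in which the summaries were logged. A `:bad_length_crc` result corresponds to the `BadLengthCrc` case in the warning.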