-
Notifications
You must be signed in to change notification settings - Fork 708
fix(cam_hal): throttle WARN spam so cam_task reduces it's stack footprint #761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix(cam_hal): throttle WARN spam so cam_task reduces it's stack footprint #761
Conversation
– prevents stack-overflow in cam_task The old cam_take() used recursion to retry if * no JPEG EOI was found or * a NULL frame pointer was returned (GDMA glitch on ESP32-S3). Under heavy loss conditions this could overflow the cam_task stack and reboot the whole system. * Re-wrote cam_take() as a single while-loop – stack depth is now constant and independent of retry count. * Added strict timeout tracking (`remaining = timeout - elapsed`); function can never block longer than the caller’s timeout. * ESP32-S3 only * capped GDMA reset storms to 3 attempts (`MAX_GDMA_RESETS`) * logs one “giving up” warning, then yields (`vTaskDelay(1)`) to avoid busy-spin after hardware is wedged. * Non-S3 targets * emit early `ESP_LOGW` when a NULL frame ever appears, then yield one tick per loop to prevent CPU thrash. * Maintained existing JPEG-EOI trimming and YUV→GRAY memcpy paths; behaviour unchanged on successful frames. * Inline comment links to esp32-camera commit 984999f / issue espressif#620 for future context.
I can't build now — getting this error:
I'm using ESPHome 2025.7.0-dev
|
Woops :D |
74e41d5
to
4092742
Compare
Background ---------- • Every bad frame produced an ESP_LOGW(): - “Unexpected NULL frame …” (non-S3) - “NO-EOI – JPEG end marker missing” • Each ESP_LOGW() invokes vsnprintf() once ⇒ ~300 B of stack for that call. What’s changed -------------- * First occurrence of each warning: ESP_DRAM_LOGW_ONCE() → zero stack after the first print. * Subsequent occurrences: count with a uint16_t and emit a **literal** ESP_LOGW() only every 100th event (~ 60 B stack usage). * Counters auto-reset at 10 000 to avoid wraparound. * On ESP32-S3 the “Giving up GDMA reset…” message is also moved to ESP_DRAM_LOGW_ONCE() to avoid stack usage. Effects ------- * Occasional diagnostics are still available: first warning and then a summary every 100 events.
4092742
to
c439f08
Compare
Sorry, my brain mixed up the name with the behavior, that it will only fire "once" ;) It should be fixed now under the tag I think I need a break :) |
//currently this is used only for YUV to GRAYSCALE | ||
|
||
if (warn_eoi_miss_cnt == 0) { | ||
ESP_DRAM_LOGW(TAG, "NO-EOI – JPEG end marker missing"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this use might give issues on older cores ESP_DRAM_LOGW
. You have correctly used ESP_LOGW
in the other cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which compatibility does this project need maintain?
Looks like it got worse — now I’m getting the old error too:
And something new as well:
|
Well, looks switching to literal messages wasn't really successful, it seems to still use 300 B of stack space, due to running through ESPHome’s logger helper. However, it now really catches on a different location, because I did not expect a logging flood from But in the end, it may just be a good decision to increase the stack space, what do you think @me-no-dev? |
@turenkomv I've reduced the log spam of cam_verify_jpeg_soi() in PR #760 and create a tag to include the additional changes: But I think we fixed probably most if not all "real" issues with the stack now. And just having the issue that all that log messages don't really fit in such a busy operation in the stack. Please try it anyway, I want to see what's crashing now, so we may be able to reduce the stack usage further. In the end, I think we need to configure just a larger stack usage. Overall, I think I have identified what the fundamental issue is, why your device isn't "working" right now. But I like to take the opportunity to make sure error handling is working better, instead of fixing the root cause why it's not fetching the images properly. Hope you understand. :) |
|
I increased the stack size using this config:
|
@turenkomv thanks! It's too warm to work right now, will look on these later. |
@turenkomv sorry for the delay, the heat took me a bit out of commission. So the increase of CONFIG_CAMERA_TASK_STACK_SIZE worked. It's now crashing in a different task, the frame buffer one, which is probably related to the heavy logging there, we could add a patch, so the log spam is tamed there too. However we could also try to temporarily increase the stack size of this task as well:
|
I had similar stability issues and resolved them by increasing the stack size of the framebuffer_task here Now I'm getting logs like this:
From what I understand, the number of cam_hal: EV-EOF-OVF messages strongly depends on the resolution. After reducing it to 640x480, those messages almost completely disappeared, and only the warnings/errors above remain. |
Perfect, thanks. This means that the software is at least stable and does not trash the stack in any way. It just uses too much stack to print all those messages and handle the errors. I'll prepare a fix to hopefully reduce a bit stack usage by not printing a message for every single frame in such a scenario. I got only an old ESP32S here with one of these OV2640 sensors (IIRC). These do work stable until 1600x1200, just EXTREMELY slow, like one frame every 6 seconds or so. Can you test until which resolution this is now working stable, if you configure an FPS of 1? |
I've just ordered one of these m5 units with an PY260 sensor, but it will take a couple of weeks usually to get delivered. So it would be great if you could give me further feedback. :) Overall the ESPHome approach needs to be reworked I think, as it does not use double buffering which slows things down massively and the memory management in the ESP32-camera library isn't really optimal either at first glance. It should be possible to speed things up. I've also noticed that my ESP32 crashed yesterday after running it absolutely stable for month, after it received the ESP32-camera library update 2.0.16 (esphome still uses 2.0.15). Which probably means there's a bug in the new 2.0.16 version which results in crashes. I have to look into that as well. |
@turenkomv I took a look at the m5 demo code, which uses a patched version of the esp32-camera as driver. Well, it's poorly documented and kinda hacked together, without even specifying which version of esp32-camera was modified. But I think I can make it work, with some iterations of testing and add support for the camera to esp32-camera directly, which they should have done from the beginning instead of making a fork. |
So far, I haven’t been able to find any configuration that works. |
Why did it work for @Fexiven using the standard driver version? |
The demo code - which is like 11 month old - modifies a driver which was only merged like 7 month ago: the mega ccm driver. How? I don't know. So from the diff it looks to me like the driver is there, they just used different settings for it. But maybe my assumption based on that diff is wrong, I can't check which driver is actually loading, since I don't have the hardware yet. But it looks to me like they are using the mega com driver to read out the hardware and made modifications to it. They also modified stuff outside of the mega com driver, which looks on first glance that it's an adaption to larger files produced by the sensor. But since it's not clear which version of esp32-camera exactly was taken and modified, it's hard to tell what was a modification and what is just a version mismatch between the version of esp32-camera. I've attached the diff from the version 2.0.14 to the the demo library: My suspicion from the beginning was, that the buffer is to small to fit the images. And here's the diff for creating the buffer in the file:
Note that I was unable to locate any datasheet for a "PY260". So I suspect that's just a marketing term for a different sensor which was modified, say with better optics or something like that? I don't know. If someone can supply me a link to a datasheet, that would be helpful. |
You mean you don't get any picture, but regardless what you configure it's now no longer crashing? |
Correct — I haven’t received any image output so far. The logs consistently show:
Between those, there are varying numbers of |
I tried increasing the timeout here: I also attempted to increase the buffer size via the following config: CONFIG_CAMERA_JPEG_MODE_FRAME_SIZE_AUTO: "n"
CONFIG_CAMERA_JPEG_MODE_FRAME_SIZE: "5038848" |
Yeah, another option is, that the camera is just not configured to output JPEGs and instead will return something else. That's why trying to detect JPEG start and end fails. I will dig a bit deeper into the code. |
Description
Precondition: Merge of #758
Background
• Every bad frame produced an ESP_LOGW():
• Each ESP_LOGW() invokes vsnprintf() once ⇒ ~300 B of stack for that call.
What’s changed
ESP_DRAM_LOGW_ONCE() → zero stack after the first print.
count with a uint16_t and emit a literal ESP_LOGW() only
every 100th event (~ 60 B stack usage).
ESP_DRAM_LOGW_ONCE() to avoid stack usage.
Effects
summary every 100 events.
Related
Thanks to @turenkomv who reported this issue here.
Testing
@turenkomv would you be so kind and would test this on your hardware? Tag is
fixes_for_m5_0.4
, which is the currently used version in ESPHome with all four patches on top.Checklist
Before submitting a Pull Request, please ensure the following: