Skip to content

Commit b09b301

Browse files
committed
[NO-TICKET] Attempt to fix flaky profiler spec under test-memcheck
**What does this PR do?** This PR attempts to fix a flaky profiler spec seen in https://github.com/DataDog/dd-trace-rb/actions/runs/15284503936/job/42991363261: ``` bundle exec rake compile spec:profiling:memcheck ... Failures: 1) Datadog::Profiling::Scheduler#stop when exporter has data to flush flushes the data and stops the loop Failure/Error: expect(scheduler.stop).to be true expected true got false # ./spec/datadog/profiling/scheduler_spec.rb:299:in 'block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:270:in 'block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:152:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/webmock-3.25.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run' ``` The flaky spec was introduced in #4679. The intention of the `expect(scheduler.stop).to be true` assertion is that the scheduler stops cleanly without having to rely on the forced termination timeout. Without the changes in that PR, this spec will fail because the scheduler will keep on running until it's forced to terminate. I not reproduce the issue locally. So for the spec to be failing it means that the forced termination still had to kick in... I've been staring at the code for a while and I don't see any races or bugs that might have caused this spec to fail sometimes. What did occur to me is that this spec has only been seen failing inside `test-memcheck`'s run with valgrind. And what does running an app inside valgrind cause? It causes the app to go way way slower than normal (and to be truly single-threaded -- even at the native code level). So perhaps the regular 1 second timeout, which is more than enough time in regular situations to give the background thread time to work and exit cleanly is not enough in a slow CI that is further slowed by valgrind? To hopefully fix this, I'm raising the timeout from its default of 1 to 10 seconds. This doesn't change the intention of the original spec -- this spec will still correctly fail if the code added in the above PR is commented out, and require the forced termination after 10 seconds. **Motivation:** Zero profiling flaky specs! **Additional Notes:** I've also tweaked the test to avoid leaking the scheduler thread when the test fails. In CI we saw two failures but I strongly suspect the second one... ``` 2) Datadog::Profiling::Scheduler#perform when enabled when perform fails calls the on_failure_proc and logs the error Failure/Error: expect(Datadog.logger).to receive(:warn).with(/Profiling::Scheduler thread error/) (#<Datadog::Core::Logger:0x000000002ebd95b0 @Level=1, @progname="datadog", @default_formatter=#<Logger::Formatter:0x000000002ec10560 @datetime_format=nil>, @Formatter=nil, @logdev=#<Logger::LogDevice:0x000000002c098ea0 @shift_period_suffix=nil, @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<IO:<STDOUT>>, @binmode=false, @reraise_write_errors=[], @skip_header=false, @mon_data=#<Monitor:0x000000002ec10510>, @mon_data_owner_object_id=13560>, @level_override={}>).warn(/Profiling::Scheduler thread error/) expected: 1 time with arguments: (/Profiling::Scheduler thread error/) received: 2 times with arguments: (/Profiling::Scheduler thread error/) # ./spec/datadog/profiling/scheduler_spec.rb:100:in 'block (5 levels) in <top (required)>' # ./spec/spec_helper.rb:270:in 'block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:152:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/webmock-3.25.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run' ``` was just a side-effect of the first spec failing and leaking the thread. **How to test the change?** Validate that CI is still green, and hope this is the correct fix for the flakiness :(
1 parent ff8ae5b commit b09b301

File tree

1 file changed

+7
-1
lines changed

1 file changed

+7
-1
lines changed

spec/datadog/profiling/scheduler_spec.rb

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -284,6 +284,10 @@
284284
allow(exporter).to receive(:flush).and_return(flush)
285285
end
286286

287+
after do
288+
scheduler.stop(true) if instance_variable_defined?(:@stopped) && !@stopped
289+
end
290+
287291
# This test validates the behavior of the @stop_requested flag.
288292
#
289293
# Specifically, the looping behavior we get from the core helpers will keep on trying to flush
@@ -296,7 +300,9 @@
296300
scheduler.start
297301
wait_for { scheduler.run_loop? }.to be true
298302

299-
expect(scheduler.stop).to be true
303+
@stopped = false
304+
expect(scheduler.stop(false, 10)).to be true
305+
@stopped = true
300306
end
301307
end
302308
end

0 commit comments

Comments
 (0)