[NO-TICKET] Attempt to fix flaky profiler spec under test-memcheck

ivoanjo · ivoanjo · commit b09b301b8018 · 2025-05-28T11:34:28.000+01:00
**What does this PR do?** This PR attempts to fix a flaky profiler spec seen in https://github.com/DataDog/dd-trace-rb/actions/runs/15284503936/job/42991363261: ``` bundle exec rake compile spec:profiling:memcheck ... Failures: 1) Datadog::Profiling::Scheduler#stop when exporter has data to flush flushes the data and stops the loop Failure/Error: expect(scheduler.stop).to be true expected true got false # ./spec/datadog/profiling/scheduler_spec.rb:299:in 'block (4 levels) in <top (required)>' # ./spec/spec_helper.rb:270:in 'block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:152:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/webmock-3.25.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run' ``` The flaky spec was introduced in #4679. The intention of the `expect(scheduler.stop).to be true` assertion is that the scheduler stops cleanly without having to rely on the forced termination timeout. Without the changes in that PR, this spec will fail because the scheduler will keep on running until it's forced to terminate. I not reproduce the issue locally. So for the spec to be failing it means that the forced termination still had to kick in... I've been staring at the code for a while and I don't see any races or bugs that might have caused this spec to fail sometimes. What did occur to me is that this spec has only been seen failing inside `test-memcheck`'s run with valgrind. And what does running an app inside valgrind cause? It causes the app to go way way slower than normal (and to be truly single-threaded -- even at the native code level). So perhaps the regular 1 second timeout, which is more than enough time in regular situations to give the background thread time to work and exit cleanly is not enough in a slow CI that is further slowed by valgrind? To hopefully fix this, I'm raising the timeout from its default of 1 to 10 seconds. This doesn't change the intention of the original spec -- this spec will still correctly fail if the code added in the above PR is commented out, and require the forced termination after 10 seconds. **Motivation:** Zero profiling flaky specs! **Additional Notes:** I've also tweaked the test to avoid leaking the scheduler thread when the test fails. In CI we saw two failures but I strongly suspect the second one... ``` 2) Datadog::Profiling::Scheduler#perform when enabled when perform fails calls the on_failure_proc and logs the error Failure/Error: expect(Datadog.logger).to receive(:warn).with(/Profiling::Scheduler thread error/) (#<Datadog::Core::Logger:0x000000002ebd95b0 @Level=1, @progname="datadog", @default_formatter=#<Logger::Formatter:0x000000002ec10560 @datetime_format=nil>, @Formatter=nil, @logdev=#<Logger::LogDevice:0x000000002c098ea0 @shift_period_suffix=nil, @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<IO:<STDOUT>>, @binmode=false, @reraise_write_errors=[], @skip_header=false, @mon_data=#<Monitor:0x000000002ec10510>, @mon_data_owner_object_id=13560>, @level_override={}>).warn(/Profiling::Scheduler thread error/) expected: 1 time with arguments: (/Profiling::Scheduler thread error/) received: 2 times with arguments: (/Profiling::Scheduler thread error/) # ./spec/datadog/profiling/scheduler_spec.rb:100:in 'block (5 levels) in <top (required)>' # ./spec/spec_helper.rb:270:in 'block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:152:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/webmock-3.25.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>' # ./spec/support/execute_in_fork.rb:32:in 'ForkableExample#run' ``` was just a side-effect of the first spec failing and leaking the thread. **How to test the change?** Validate that CI is still green, and hope this is the correct fix for the flakiness :(
diff --git a/spec/datadog/profiling/scheduler_spec.rb b/spec/datadog/profiling/scheduler_spec.rb
@@ -284,6 +284,10 @@
         allow(exporter).to receive(:flush).and_return(flush)
       end
 
+      after do
+        scheduler.stop(true) if instance_variable_defined?(:@stopped) && !@stopped
+      end
+
       # This test validates the behavior of the @stop_requested flag.
       #
       # Specifically, the looping behavior we get from the core helpers will keep on trying to flush
@@ -296,7 +300,9 @@
         scheduler.start
         wait_for { scheduler.run_loop? }.to be true
 
-        expect(scheduler.stop).to be true
+        @stopped = false
+        expect(scheduler.stop(false, 10)).to be true
+        @stopped = true
       end
     end
   end