It is a data race (undefined
behavior)
when a thread writes to memory that another thread is reading or writing.
std::mutex and std::atomic can be used to guard against data races.
Compilers such as gcc and clang can instrument the binary to detect data
races with the flag
-fsanitize=thread.
In Bitcoin Core it can be set via ./configure --with-sanitizers=thread.
However, data races may only happen intermittently and detection is not
guaranteed, even if the program is fed the same inputs.
In Bitcoin Core, validation code is protected by the cs_main global
recursive mutex.
<MacroFake> 2. Which two threads cause the data race fixed in this pull request? What is their purpose? Refer to the traceback in the pull request description.
<amirreza97> So why these two starts at the same time? Isn't the first thread (reading from blk???.dat files) precondition for running the main thread?
<larryruane> modifying the chain tip (correctly, with cs_main): ActivateBestChain calls ActivateBestChainSetp calls ConnectTip calls SetTip ... reading the chain tip (without lock): AppInitMain calls ActiveTip calls Tip
<larryruane> "Why is cs_main not held?" -- is it anything more than that it's an oversight? Ah I think I see, those first 3 methods you linked to don't scquire the lock
<MacroFake> michaelfolkson: If a mutex is acquired by a different thread, it can not be acquired by this thread. However, if the thread *wants* to acquire it, it will wait until it can
<lightlike> is there still a "race", just not one with UB? After the fix, loadblk still does its thing and possible changes the ActiveTip() several times, and at some point the main thread takes the cs_main and calls RPCNotifyBlockChange() with whatever is the current tip at this point - could that vary from run to run?
<larryruane> ignore if this is too detailed, but "... it will wait until it can" -- does it spin-wait, or sleep-wait? Or spin for a short time and then "give up" and sleep (let other threads run)?
<MacroFake> larryruane: I am not familiar with low level primitives, but I'd guess that the thread will be put to sleep and woken when the lock is ready to be picked up
<sipa> IIRC the basic idea is to use atomic CPU instructions to quickly check if the lock is contended or not, and if it is, make a kernel call that gets woken up when needed
<michaelfolkson> Were people aware of this data race already (if they built with the thread sanitizer) or did you discover it through doing something specific MacroFake?
<MacroFake> michaelfolkson: Data races are had to reproduce, especially if they only happen once per process lifetime (init), so I try to keep an eye on intermittent failures
<paul_c> Is it impossible to detect that issue with a thread sanitizer that detects data races because it can't automatically find bugs in the program or suggest ways to fix the data-races found?
<MacroFake> larryruane: Yes, so getting the lock and getting thing_1, then releasing the lock, then taking it again and getting thing_2 is not UB (a data-race detectable by a sanitizer)
<lightlike> I think in this case there would be no UB, just a rest response that would consist of parts that don't fit together (because the parts correspond to different chain tips)
<MacroFake> larryruane: The lambda is called in two different places, but if the variables are set via capturing, then only one location would need to change
<larryruane> MacroFake: can you tell us why `::cs_main` instead of just `cs_main`? I know the double-colon means global scope, but is there a danger that some class will implement a `cs_main` variable? (Or do they already?)
<MacroFake> larryruane: It doesn't matter fro ::cs_main, but for other globals that do not start with our g_ prefix (style guide), it could cause issues