libs)

May 26, 2021

https://github.com/bitcoin/bitcoin/pull/22006

Host: 0xB10C - PR author: 0xB10C

The PR branch HEAD was a076eb6 at the time of this review club meeting.

Notes

Context

User-Space, Statically Defined Tracing (USDT) allows peeking into runtime internals at statically defined tracepoints of user-space applications (Bitcoin Core in our case).
Build support and TRACEx macros based on systemtaps sys/sdt.h for USDT were merged in PR19866.
Issue #20981 discusses potential tracepoints.

Tracepoints and tracing scripts

We can hook into the tracepoints with tracing scripts via the Linux kernel.
If we don’t hook into the tracepoints, then they are NOPs and have little to no performance impact.
The tracepoints can pass data back to the tracing script, which contains the tracing logic, e.g. to collect statistics, print or visualize data, give alerts, etc.
Tracepoints need to be somewhat generic to allow for reusability in different tracing scripts, but they also need a clear use case. There is no need to plaster the code with (unused) tracepoints.
There are currently two main tools for writing USDT scripts and other tools are under development:
- bpftrace for one-liners and short scripts
- BPF compiler collection (BCC), e.g. with the Python wrapper for complex scripts

eBPF under the hood

Hooking into these tracepoints works via a technology called eBPF. Think of it as a small virtual machine (VM) in your Linux kernel where you can run sandboxed eBPF programs (even if there is a problem with your eBPF program, it can’t crash or otherwise harm your kernel).
The tracing scripts compile and then load eBPF bytecode into this VM. When attached to a tracepoint, the eBPF program is called with the arguments passed to the tracepoint.
Based on your use case, the eBPF program can, for example, filter the data or pass it along to the tracing logic in the tracing script. The eBFP VM is quite limited. For example, it has a stack size of 512 bytes.

                ┌──────────────────┐            ┌──────────────┐
                │ tracing script   │            │ bitcoind     │
                │==================│      2.    │==============│
                │  eBPF  │ tracing │      hooks │              │
                │  code  │ logic   │      into┌─┤►tracepoint 1─┼───┐ 3.
                └────┬───┴──▲──────┘          ├─┤►tracepoint 2 │   │ pass args
            1.       │      │ 4.              │ │ ...          │   │ to eBPF
    User    compiles │      │ pass data to    │ └──────────────┘   │ program
    Space    & loads │      │ tracing script  │                    │
    ─────────────────┼──────┼─────────────────┼────────────────────┼───
    Kernel           │      │                 │                    │
    Space       ┌──┬─▼──────┴─────────────────┴────────────┐       │
                │  │  eBPF program                         │◄──────┘
                │  └───────────────────────────────────────┤
                │ eBPF kernel Virtual Machine (sandboxed)  │
                └──────────────────────────────────────────┘

1. The tracing script compiles the eBPF code and loads the eBFP program into a kernel VM
2. The eBPF program hooks into one or more tracepoints
3. When the tracepoint is called, the arguments are passed to the eBPF program
4. The eBPF program processes the arguments and returns data to the tracing script

More information on eBPF:

Examples

The PR includes examples and documentation on how to run them.
For building Bitcoin Core with USDT support, you need the sys/sdt.h headers (when present, USDT support is automatically compiled in). On Debian-like systems you can install the package systemtap-sdt-dev (this is not yet documented in the PR).
As an exercise for reviewers: You can try to build Bitcoin Core with USDT support, list the available tracepoints (see doc/tracing.md), try out the example scripts (see contrib/tracing.md), and even add a custom tracepoint and tracing script.

Questions

What is the difference between USDT and using an uprobe to trace Bitcoin Core? What do they have in common? Why do we add tracepoints if we can just use uprobes? (Hint: see this comment and the following ones in PR19866.)
Why shouldn’t we do any “expensive” operations just to pass extra data into tracepoints? What are examples of such expensive operations?
Why are root privileges required to run tracing scripts?
What is eBPF (aka BPF), and how do we utilize it for USDT?
For debugging and monitoring of the peer-to-peer code, it can be useful to log the raw P2P message bytes. Is this possible with USDT and eBPF? What are limiting factors? Why?
Discussion: Should USDT be supported in Bitcoin Core release builds?
Discussion: Do you have ideas for places where static tracepoints make sense? See issue #20981 for inspiration.
Discussion: Can the tracepoints be automatically tested? Could they even help in functional testing?

Meeting Log

19:00

<b10c> #startmeeting

19:00

<jnewbery> hi!

19:00

<b10c> Hi! Welcome to the first Bitcoin Core Review Club on Libera!

19:00

<josibake> hi

19:00

<michaelfolkson> Historic

19:00

<michaelfolkson> hi

19:00

<b10c> Today, we'll be talking about User-Space, Statically Defined Tracing (USDT) for Bitcoin Core in the context of https://github.com/bitcoin/bitcoin/pull/22006

19:00

<svav> hi

19:00

<lightlike> hi

19:00

<emzy> hi

19:00

<b10c> Notes and questions are on https://bitcoincore.reviews/22006

19:00

<schmidty> hi

19:00

<b10c> Feel free to say hi to let everyone know you're here :)

19:01

<raj> hi.

19:01

<FelixWeis> hi

19:02

<b10c> Who had a chance to review the PR (y/n)?

19:02

<raj> y

19:02

<emzy> y

19:02

<josibake> n

19:03

<michaelfolkson> y

19:03

<lightlike> y

19:03

<b10c> Also interesting for me: Who was able to build bitcoind with USDT support, listed tracepoints, ran an example, and who experimented with their own tracepoints and scripts?

19:03

<glozow> hi

19:03

<b10c> Quick {build/list/example/own tracepoint}

19:03

<raj> {build/list/example}

19:04

<emzy> I was able to run all the examples.

19:04

<michaelfolkson> n (just conceptual)

19:04

<lightlike> {build/list/example}

19:04

<raj> bpftrace didn't worked for me. will make a review comment on that. Python files worked.

19:04

<emzy> {build/list/example}

19:05

<b10c> great! Lets start with the questions

19:05

<b10c> Can someone explain the difference between USDT and using an uprobe to trace Bitcoin Core? What do they have in common? Why do we add tracepoints if we can just use uprobes?

19:05

<b10c> ray: yes, please comment! that's important feedback

19:05

<b10c> raj*

19:07

<michaelfolkson> I've never used uprobe so it is hard to answer :)

19:07

<jnewbery> I'm most familiar with uprobes being attached to function calls. Are USDTs more flexible in where the developer can define them?

19:07

<michaelfolkson> I've read what people have said about how they compare

19:08

<michaelfolkson> Flexibility, more visibility?

19:08

<b10c> uprobes are great as they don't require any code changes to hook into functions

19:08

<lightlike> i read that usdt is static (requires compilation) and uprobe dynamic, but i have never used uprobe

19:09

<b10c> we don't need _static_ tracepoints for them

19:09

<b10c> lightlike: right!

19:10

<b10c> jnewbery: correct too! there are uprobes for function calls and uretprobes for function returns, but you can't hook into a specific code branch

19:10

<jonatack> hi

19:11

<michaelfolkson> lightlike: uprobe doesn't require compilation?

19:12

<michaelfolkson> Oh I think I know what you mean. It doesn't require compilation of additional e.g. USDT code

19:12

<lightlike> yes, no extra compilation for inserting the probes.

19:12

<b10c> and static tracepoints allow us to write scripts that will (hopefully) still work in 6 months as we are targeting a semi stable tracepoint API. Functions likely change over time

19:13

<b10c> next Q: When adding tracepoints, why shouldn’t we do any “expensive” operations just to pass extra data into tracepoints? What are examples of such expensive operations?

19:14

<glozow> performance? iiuc the tracepoints always process the args even if nothings hooked into them?

19:14

<michaelfolkson> Because that will bear a cost on those who aren't interested in the tracepoints?

19:15

<jnewbery> b10c: presumably we don't want to do anything that allocates, which could include string manipulations

19:16

<LarryRuane> Acquiring locks would be expensive (?)

19:16

<jb55> hi

19:17

<b10c> yes! if we do something 'expensive' just for the tracepoints, the this likely degrades performance for users not interested in the tracepoints

19:17

<svav> Delays introduced by a debugger might cause the program to change its behavior drastically, or perhaps fail

19:17

<michaelfolkson> It gets grey between tracing, logging and debugging for me. I would say acquiring locks that weren't acquired in the original code would be debugging?

19:18

<jb55> yes especially in tight loops, ideally we would only ever pass simple values and pointers to data structures

19:18

<jb55> so that it would basically become a noop when not attached

19:18

<b10c> right! 'expensive' could be string manipulation, serialization of transactions and blocks, locks

19:19

<sipa> (just chiming in, know nothing about USDT) what is the performance overhead of a tracepoint (with nothing attached to it)?

19:19

<b10c> svav: didn't think about this. Ideally the eBPF program doesn't cause long delays

19:19

<lightlike> I don't understand this completely: Wouldn't the code in the TRACE6 parts ignored by the compiler if the user is not interested in tracepoints and doesn't activate them?

19:20

<jb55> sipa: afaik, it just inserts a noop instructions that the kernel can hook jumps into

19:20

<sipa> so the cost is almost 0

19:20

<LarryRuane> It's probably okay, for this purpose, to read variables that are normally lock-protected, right? You might get inconsistent data, but that's probably acceptable here. (?)

19:21

<sipa> even in super tight loops, a nop instruction is like what... 1/4th of a cycle?

19:21

<willcl_ark> Hello!

19:21

<jb55> the point was, if you're formatting data for the trace argument, that would be wasted cycles

19:21

<b10c> lightlike: if you compile a bitcoind without USDT support, then nothing is different. If you compile with USDT support and don't hook into it then there is an extra NOP

19:23

<b10c> sipa: haven't looked at this closely, but we ideally don't add tracepoints where performance is really important

19:23

<b10c> e.g. tight loops

19:23

<sipa> jb55: is there a way to make it do the formatting inside the attached code?

19:24

<sipa> obviously you shouldn't go do extra work in the program itself just so it's available for a tracepoint that's likely unused

19:24

<michaelfolkson> And so far only three tracepoints have been added (net and validation)

19:24

<jb55> sipa: yes with bcc you can do the formatting within the kernel/ebpf vm, but it's extra work and you can't use simple tools like bpftrace (but they could add that feature over time)

19:25

<b10c> next Q: Why are root privileges required to run tracing scripts that hook into the tracepoints?

19:26

<michaelfolkson> Because we are loading programs in the kernel VM...?

19:26

<glozow> is this the tracing script that asks the kernel to load eBPF code?

19:26

<jb55> I think there is an ebpf capability so you don't need root, I haven't looked into it though

19:26

<jb55> might be a new linux thingie

19:27

<FelixWeis> does eBPF tracing work in docker for mac?

19:28

<jb55> I believe someone mentioned they tested it on mac and it works, not sure about docker.

19:28

<emzy> You can run bitcoind as a user. You only need root (or the ebpf capability) for the tracing scripts

19:28

<b10c> michaelfolkson: yes! has it any other advantages that we require root privileges to hook into the tracepoints

19:28

<jb55> also windows announced they are adding ebpf support recently

19:28

<b10c> emzy: correct! don't run bitcoind as root

19:28

<jb55> emzy: right, good clarification

19:28

<b10c> glozow: yes

19:29

<b10c> jb55: that would be interesting to take a look at, especially if we want to automatically test the tracepoints

19:29

<b10c> the non-root eBPF

19:30

<jb55> b10c: agreed

19:30

<michaelfolkson> FelixWeis: I would guess it wouldn't work in a container if you need access to the kernel. Doesn't Docker use the host OS kernel?

19:31

<michaelfolkson> A Linux container on top of the MacOS kernel? Might be talking rubbish...

19:31

<b10c> one advantage that the root-permission requirement has is that no other programs on our system can hook into bitcoind

19:31

<FelixWeis> it uses a linux vm which then has namespace support to do docker stuff.

19:32

<FelixWeis> jb55: I did'nt think eBPF works on macos nativly. isnt this a linux kernel thing?

19:33

<b10c> What is eBPF, and how do we utilize it for USDT?

19:33

<jb55> FelixWeis: it has a dtrace history, the macros expand to dtrace macros on macos.

100

19:33

<b10c> FelixWeis: IIRC jonasschnelli was able to build bitcoind with USDT support on macOS, but not sure if he could hook into anything

101

19:34

<michaelfolkson> I started on the free excerpt of the book :) e=extended but now eBPF is just discussed as BPF

102

19:34

<jb55> b10c: oh that makes more sense. but yeah I still haven't tested the dtrace stuff on macos

103

19:35

<michaelfolkson> BPF = Berkeley Packet Filter, improved packet capture tools

104

19:35

<b10c> right! BPF is e.g. used in wireshark or tcp dump and low latency firewalls

105

19:36

<b10c> eBPF is an extension (no often known as BPF too) that allows to do more than just networking stuff

106

19:37

<michaelfolkson> It filters packets before they are seen by applications like tcpdump

107

19:37

<jb55> there are more than just USDTs, you can dynamically trace any function within the codebase with uprobes (function enter) and uretprobes (function return). really handy for tracing executing codepaths.

108

19:37

<b10c> it's basically a technology to run kernel space programs that can't harm the kernel if they e.g. crash

109

19:37

<jb55> this works with any program on linux and you don't need any special build support

110

19:38

<glozow> very cool

111

19:38

<jb55> you can even trace functions within the linux kernel. fun stuff.

112

19:39

<michaelfolkson> Sounds awesome for Linux kernel development. Intuitively I wouldn't think it would be as useful for applications as evidently is

113

19:39

<michaelfolkson> *but evidently is

114

19:40

<b10c> Some of you might work on P2P code and want to use USDT to debug a new P2P message: Can USDT and eBPF be used to e.g. log the raw P2P mesage bytes?

115

19:40

<b10c> What are the limitations?

116

19:41

<michaelfolkson> Max allocation size of 32kb, P2P messages are larger than that

117

19:43

<b10c> correct. With bpftrace scripts we are even limited to 512 bytes (that's the eBPF VM stack size)

118

19:43

<lightlike> are there situations when it would make sense to use USDTs or uprobes instead of good old printf-debugging when developing something?

119

19:43

<emzy> Splitting them up would be against the no expensive operations in tracepoints.

120

19:44

<jb55> b10c: but would you be passing that much data on the stack? wouldn't you just have a pointer to the data?

121

19:44

<michaelfolkson> So for that reason and it not being on MacOS (and possibly other reasons other too) P2P message logging would still be used e.g. https://bitcoincore.reviews/19509

122

19:44

<b10c> with some tricks we can allocate up to 32kb in BCC scripts

123

19:47

<b10c> jb55 yes. IIRC you wouldn't be able to e.g. print the pointer data in bpftrace scripts as the printf string itself is limited to something <512 byte

124

19:47

<glozow> basic question. can we time stuff more precisely with this?

125

19:47

<michaelfolkson> And many/most Bitcoin P2P messages are smaller than 32kb...?

126

19:48

<b10c> I haven't tried passing pointers to BCC python scripts, not sure if they can read from other userspace processes

127

19:49

<b10c> michaelfolkson: correct, most messages are smaller than 32kb and we can handle them without problems

128

19:49

<jb55> glozow: yes, that's one of the connectblock examples. but keep in mind plugging in a tracepoint does have performance implications. but if you're fine with comparing differences between traced IBDs it works great.

129

19:50

<b10c> the contrib/tracing/log_raw_p2p_msgs.py example logs raw P2P messags as hex and prints a warning if the message was larger than 32kb and might be cut-off

130

19:51

<b10c> glozow: depends, but generally yes

131

19:52

<michaelfolkson> Did you consider other possible tracepoints b10c as the first demos? Why choose these 3 specific ones?

132

19:52

<FelixWeis> this could be useful for sites like statoshi.info

133

19:53

<b10c> I wanted to keep the PR slim and see if we can get general consensus that USDT is something useful for Bitcoin Core

134

19:53

<FelixWeis> right now, lopp maintains a fork with more lines of code to do stuff like function-timings.

135

19:53

<michaelfolkson> I think you've said it shows the limitations as well as the benefits

136

19:53

<jb55> the connectblock one was just me wanting to time IBDs more accurately. The p2p was one of the motivating reasons for me to add eBPF support, to potentially avoid ad-hoc logging everywhere (even if that's not possible right now for portability reasons)

137

19:54

<glozow> can we use it to benchmark something more granular than IBD, e.g. script checking? or measure script cache hit rate?

138

19:54

<b10c> jb55 actually came up with the first three tracepoints :) I picked them up as they do show what benefits USDT can have

139

19:55

<b10c> glozow: yes!

140

19:55

<michaelfolkson> FelixWeis: e.g. https://statoshi.info/d/000000003/function-timings?orgId=1

141

19:55

<jb55> some others that were suggested: more accurate coincache memory/perf tracking. any others that ya'll think of that might be useful, feel free to suggest!

142

19:55

<b10c> Other ideas for potential tracepoints?

143

19:55

<FelixWeis> michaelfolkson: exactly

144

19:56

<b10c> we've been discussing this in https://github.com/bitcoin/bitcoin/issues/20981 btw

145

19:56

<b10c> feel free to add your ideas there :)

146

19:56

<michaelfolkson> wumpus "traces for dis- and connections of peers would be useful too"

147

19:56

<jb55> laanwj talked about looking at the traces added to jlopp/statoshi to get some ideas, if we could hook prometheus into any core node at runtime, that would be dope.

148

19:56

<FelixWeis> its one thing to do bench_bitcoin, and annother to do timings on production system nodes

149

19:57

<jnewbery> jb55: eBPF seems like a good complimentary technique to message dumping, rather than a substitution for it

150

19:57

<b10c> Discussion: Do you think Bitcoin Core release build should include USDT support?

151

19:57

<jb55> b10c: yes as long as we make sure all tracepoint args don't have any perf implications

152

19:58

<glozow> are users interested in the tracing? :O

153

19:58

<jb55> it will be most useful when tracepoints are available to node maintainers instead of just devs (ie the prometheus example)

154

19:58

<michaelfolkson> Yes but presumably every trace point added has to have the performance discussion?

155

19:58

<emzy> Are there any security/privacy concerns when USDT support compiled in?

156

19:59

<b10c> jnewbery: agree, especially with the 32kb limit (if we can't find a way to read pointers)

157

19:59

<sipa> if it's truly just a nop instruction (and no additional work for data used by the tracdpoint), performance should be almost always irrelevant

158

20:00

<jnewbery> I may be wrong, but I see eBPF being most useful for understanding global application performance. Logging/message dumping still has a place for understanding message flows for particular peers, etc.

159

20:00

<jb55> emzy: plugging into the tracepoints requires root, so I suspect not? unless you attach a rogue script or something

160

20:00

<b10c> emzy: as long as root-permissions are required (at which point you can do anything you'd like to do anyway as an attacker) I don't see any

161

20:00

<michaelfolkson> And you only need root privileges if you load the programs, you don't have to

162

20:00

<b10c> #endmeeting