Tracing: first tracepoints and documentation on User-Space, Statically Defined Tracing (USDT) (utils/log/libs)

https://github.com/bitcoin/bitcoin/pulls/22006

Host: 0xB10C  -  PR author: 0xB10C

The PR branch HEAD was a076eb6 at the time of this review club meeting.

Notes

Context

  • User-Space, Statically Defined Tracing (USDT) allows peeking into runtime internals at statically defined tracepoints of user-space applications (Bitcoin Core in our case).

  • Build support and TRACEx macros based on systemtaps sys/sdt.h for USDT were merged in PR19866.

  • Issue #20981 discusses potential tracepoints.

Tracepoints and tracing scripts

  • We can hook into the tracepoints with tracing scripts via the Linux kernel.

  • If we don’t hook into the tracepoints, then they are NOPs and have little to no performance impact.

  • The tracepoints can pass data back to the tracing script, which contains the tracing logic, e.g. to collect statistics, print or visualize data, give alerts, etc.

  • Tracepoints need to be somewhat generic to allow for reusability in different tracing scripts, but they also need a clear use case. There is no need to plaster the code with (unused) tracepoints.

  • There are currently two main tools for writing USDT scripts and other tools are under development:

eBPF under the hood

  • Hooking into these tracepoints works via a technology called eBPF. Think of it as a small virtual machine (VM) in your Linux kernel where you can run sandboxed eBPF programs (even if there is a problem with your eBPF program, it can’t crash or otherwise harm your kernel).

  • The tracing scripts compile and then load eBPF bytecode into this VM. When attached to a tracepoint, the eBPF program is called with the arguments passed to the tracepoint.

  • Based on your use case, the eBPF program can, for example, filter the data or pass it along to the tracing logic in the tracing script. The eBFP VM is quite limited. For example, it has a stack size of 512 bytes.

                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚ tracing script   β”‚            β”‚ bitcoind     β”‚
                β”‚==================β”‚      2.    β”‚==============β”‚
                β”‚  eBPF  β”‚ tracing β”‚      hooks β”‚              β”‚
                β”‚  code  β”‚ logic   β”‚      intoβ”Œβ”€β”€β–Ίtracepoint 1─┼───┐ 3.
                β””β”€β”€β”€β”€β”¬β”€β”€β”€β”΄β”€β”€β–²β”€β”€β”€β”€β”€β”€β”˜          β”œβ”€β”€β–Ίtracepoint 2 β”‚   β”‚ pass args
            1.       β”‚      β”‚ 4.              β”‚ β”‚ ...          β”‚   β”‚ to eBPF
    User    compiles β”‚      β”‚ pass data to    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚ program
    Space    & loads β”‚      β”‚ tracing script  β”‚                    β”‚
    ─────────────────┼──────┼─────────────────┼────────────────────┼───
    Kernel           β”‚      β”‚                 β”‚                    β”‚
    Space       β”Œβ”€β”€β”¬β”€β–Όβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
                β”‚  β”‚  eBPF program                         β”‚β—„β”€β”€β”€β”€β”€β”€β”˜
                β”‚  └────────────────────────────────────────
                β”‚ eBPF kernel Virtual Machine (sandboxed)  β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1. The tracing script compiles the eBPF code and loads the eBFP program into a kernel VM
2. The eBPF program hooks into one or more tracepoints
3. When the tracepoint is called, the arguments are passed to the eBPF program
4. The eBPF program processes the arguments and returns data to the tracing script

More information on eBPF:

Examples

  • The PR includes examples and documentation on how to run them.

  • For building Bitcoin Core with USDT support, you need the sys/sdt.h headers (when present, USDT support is automatically compiled in). On Debian-like systems you can install the package systemtap-sdt-dev (this is not yet documented in the PR).

  • As an exercise for reviewers: You can try to build Bitcoin Core with USDT support, list the available tracepoints (see doc/tracing.md), try out the example scripts (see contrib/tracing.md), and even add a custom tracepoint and tracing script.

Questions

  1. What is the difference between USDT and using an uprobe to trace Bitcoin Core? What do they have in common? Why do we add tracepoints if we can just use uprobes? (Hint: see this comment and the following ones in PR19866.)

  2. Why shouldn’t we do any β€œexpensive” operations just to pass extra data into tracepoints? What are examples of such expensive operations?

  3. Why are root privileges required to run tracing scripts?

  4. What is eBPF (aka BPF), and how do we utilize it for USDT?

  5. For debugging and monitoring of the peer-to-peer code, it can be useful to log the raw P2P message bytes. Is this possible with USDT and eBPF? What are limiting factors? Why?

  6. Discussion: Should USDT be supported in Bitcoin Core release builds?

  7. Discussion: Do you have ideas for places where static tracepoints make sense? See issue #20981 for inspiration.

  8. Discussion: Can the tracepoints be automatically tested? Could they even help in functional testing?

Meeting Log

  119:00 <b10c> #startmeeting
  219:00 <jnewbery> hi!
  319:00 <b10c> Hi! Welcome to the first Bitcoin Core Review Club on Libera!
  419:00 <josibake> hi
  519:00 <michaelfolkson> Historic
  619:00 <michaelfolkson> hi
  719:00 <b10c> Today, we'll be talking about User-Space, Statically Defined Tracing (USDT) for Bitcoin Core in the context of https://github.com/bitcoin/bitcoin/pull/22006
  819:00 <svav> hi
  919:00 <lightlike> hi
 1019:00 <emzy> hi
 1119:00 <b10c> Notes and questions are on https://bitcoincore.reviews/22006
 1219:00 <schmidty> hi
 1319:00 <b10c> Feel free to say hi to let everyone know you're here :)
 1419:01 <raj> hi.
 1519:01 <FelixWeis> hi
 1619:02 <b10c> Who had a chance to review the PR (y/n)?
 1719:02 <raj> y
 1819:02 <emzy> y
 1919:02 <josibake> n
 2019:03 <michaelfolkson> y
 2119:03 <lightlike> y
 2219:03 <b10c> Also interesting for me: Who was able to build bitcoind with USDT support, listed tracepoints, ran an example, and who experimented with their own tracepoints and scripts?
 2319:03 <glozow> hi
 2419:03 <b10c> Quick {build/list/example/own tracepoint}
 2519:03 <raj> {build/list/example}
 2619:04 <emzy> I was able to run all the examples.
 2719:04 <michaelfolkson> n (just conceptual)
 2819:04 <lightlike> {build/list/example}
 2919:04 <raj> bpftrace didn't worked for me. will make a review comment on that. Python files worked.
 3019:04 <emzy> {build/list/example}
 3119:05 <b10c> great! Lets start with the questions
 3219:05 <b10c> Can someone explain the difference between USDT and using an uprobe to trace Bitcoin Core? What do they have in common? Why do we add tracepoints if we can just use uprobes?
 3319:05 <b10c> ray: yes, please comment! that's important feedback
 3419:05 <b10c> raj*
 3519:07 <michaelfolkson> I've never used uprobe so it is hard to answer :)
 3619:07 <jnewbery> I'm most familiar with uprobes being attached to function calls. Are USDTs more flexible in where the developer can define them?
 3719:07 <michaelfolkson> I've read what people have said about how they compare
 3819:08 <michaelfolkson> Flexibility, more visibility?
 3919:08 <b10c> uprobes are great as they don't require any code changes to hook into functions
 4019:08 <lightlike> i read that usdt is static (requires compilation) and uprobe dynamic, but i have never used uprobe
 4119:09 <b10c> we don't need _static_ tracepoints for them
 4219:09 <b10c> lightlike: right!
 4319:10 <b10c> jnewbery: correct too! there are uprobes for function calls and uretprobes for function returns, but you can't hook into a specific code branch
 4419:10 <jonatack> hi
 4519:11 <michaelfolkson> lightlike: uprobe doesn't require compilation?
 4619:12 <michaelfolkson> Oh I think I know what you mean. It doesn't require compilation of additional e.g. USDT code
 4719:12 <lightlike> yes, no extra compilation for inserting the probes.
 4819:12 <b10c> and static tracepoints allow us to write scripts that will (hopefully) still work in 6 months as we are targeting a semi stable tracepoint API. Functions likely change over time
 4919:13 <b10c> next Q: When adding tracepoints, why shouldn’t we do any β€œexpensive” operations just to pass extra data into tracepoints? What are examples of such expensive operations?
 5019:14 <glozow> performance? iiuc the tracepoints always process the args even if nothings hooked into them?
 5119:14 <michaelfolkson> Because that will bear a cost on those who aren't interested in the tracepoints?
 5219:15 <jnewbery> b10c: presumably we don't want to do anything that allocates, which could include string manipulations
 5319:16 <LarryRuane> Acquiring locks would be expensive (?)
 5419:16 <jb55> hi
 5519:17 <b10c> yes! if we do something 'expensive' just for the tracepoints, the this likely degrades performance for users not interested in the tracepoints
 5619:17 <svav> Delays introduced by a debugger might cause the program to change its behavior drastically, or perhaps fail
 5719:17 <michaelfolkson> It gets grey between tracing, logging and debugging for me. I would say acquiring locks that weren't acquired in the original code would be debugging?
 5819:18 <jb55> yes especially in tight loops, ideally we would only ever pass simple values and pointers to data structures
 5919:18 <jb55> so that it would basically become a noop when not attached
 6019:18 <b10c> right! 'expensive' could be string manipulation, serialization of transactions and blocks, locks
 6119:19 <sipa> (just chiming in, know nothing about USDT) what is the performance overhead of a tracepoint (with nothing attached to it)?
 6219:19 <b10c> svav: didn't think about this. Ideally the eBPF program doesn't cause long delays
 6319:19 <lightlike> I don't understand this completely: Wouldn't the code in the TRACE6 parts ignored by the compiler if the user is not interested in tracepoints and doesn't activate them?
 6419:20 <jb55> sipa: afaik, it just inserts a noop instructions that the kernel can hook jumps into
 6519:20 <sipa> so the cost is almost 0
 6619:20 <LarryRuane> It's probably okay, for this purpose, to read variables that are normally lock-protected, right? You might get inconsistent data, but that's probably acceptable here. (?)
 6719:21 <sipa> even in super tight loops, a nop instruction is like what... 1/4th of a cycle?
 6819:21 <willcl_ark> Hello!
 6919:21 <jb55> the point was, if you're formatting data for the trace argument, that would be wasted cycles
 7019:21 <b10c> lightlike: if you compile a bitcoind without USDT support, then nothing is different. If you compile with USDT support and don't hook into it then there is an extra NOP
 7119:23 <b10c> sipa: haven't looked at this closely, but we ideally don't add tracepoints where performance is really important
 7219:23 <b10c> e.g. tight loops
 7319:23 <sipa> jb55: is there a way to make it do the formatting inside the attached code?
 7419:24 <sipa> obviously you shouldn't go do extra work in the program itself just so it's available for a tracepoint that's likely unused
 7519:24 <michaelfolkson> And so far only three tracepoints have been added (net and validation)
 7619:24 <jb55> sipa: yes with bcc you can do the formatting within the kernel/ebpf vm, but it's extra work and you can't use simple tools like bpftrace (but they could add that feature over time)
 7719:25 <b10c> next Q: Why are root privileges required to run tracing scripts that hook into the tracepoints?
 7819:26 <michaelfolkson> Because we are loading programs in the kernel VM...?
 7919:26 <glozow> is this the tracing script that asks the kernel to load eBPF code?
 8019:26 <jb55> I think there is an ebpf capability so you don't need root, I haven't looked into it though
 8119:26 <jb55> might be a new linux thingie
 8219:27 <FelixWeis> does eBPF tracing work in docker for mac?
 8319:28 <jb55> I believe someone mentioned they tested it on mac and it works, not sure about docker.
 8419:28 <emzy> You can run bitcoind as a user. You only need root (or the ebpf capability) for the tracing scripts
 8519:28 <b10c> michaelfolkson: yes! has it any other advantages that we require root privileges to hook into the tracepoints
 8619:28 <jb55> also windows announced they are adding ebpf support recently
 8719:28 <b10c> emzy: correct! don't run bitcoind as root
 8819:28 <jb55> emzy: right, good clarification
 8919:28 <b10c> glozow: yes
 9019:29 <b10c> jb55: that would be interesting to take a look at, especially if we want to automatically test the tracepoints
 9119:29 <b10c> the non-root eBPF
 9219:30 <jb55> b10c: agreed
 9319:30 <michaelfolkson> FelixWeis: I would guess it wouldn't work in a container if you need access to the kernel. Doesn't Docker use the host OS kernel?
 9419:31 <michaelfolkson> A Linux container on top of the MacOS kernel? Might be talking rubbish...
 9519:31 <b10c> one advantage that the root-permission requirement has is that no other programs on our system can hook into bitcoind
 9619:31 <FelixWeis> it uses a linux vm which then has namespace support to do docker stuff.
 9719:32 <FelixWeis> jb55: I did'nt think eBPF works on macos nativly. isnt this a linux kernel thing?
 9819:33 <b10c> What is eBPF, and how do we utilize it for USDT?
 9919:33 <jb55> FelixWeis: it has a dtrace history, the macros expand to dtrace macros on macos.
10019:33 <b10c> FelixWeis: IIRC jonasschnelli was able to build bitcoind with USDT support on macOS, but not sure if he could hook into anything
10119:34 <michaelfolkson> I started on the free excerpt of the book :) e=extended but now eBPF is just discussed as BPF
10219:34 <jb55> b10c: oh that makes more sense. but yeah I still haven't tested the dtrace stuff on macos
10319:35 <michaelfolkson> BPF = Berkeley Packet Filter, improved packet capture tools
10419:35 <b10c> right! BPF is e.g. used in wireshark or tcp dump and low latency firewalls
10519:36 <b10c> eBPF is an extension (no often known as BPF too) that allows to do more than just networking stuff
10619:37 <michaelfolkson> It filters packets before they are seen by applications like tcpdump
10719:37 <jb55> there are more than just USDTs, you can dynamically trace any function within the codebase with uprobes (function enter) and uretprobes (function return). really handy for tracing executing codepaths.
10819:37 <b10c> it's basically a technology to run kernel space programs that can't harm the kernel if they e.g. crash
10919:37 <jb55> this works with any program on linux and you don't need any special build support
11019:38 <glozow> very cool
11119:38 <jb55> you can even trace functions within the linux kernel. fun stuff.
11219:39 <michaelfolkson> Sounds awesome for Linux kernel development. Intuitively I wouldn't think it would be as useful for applications as evidently is
11319:39 <michaelfolkson> *but evidently is
11419:40 <b10c> Some of you might work on P2P code and want to use USDT to debug a new P2P message: Can USDT and eBPF be used to e.g. log the raw P2P mesage bytes?
11519:40 <b10c> What are the limitations?
11619:41 <michaelfolkson> Max allocation size of 32kb, P2P messages are larger than that
11719:43 <b10c> correct. With bpftrace scripts we are even limited to 512 bytes (that's the eBPF VM stack size)
11819:43 <lightlike> are there situations when it would make sense to use USDTs or uprobes instead of good old printf-debugging when developing something?
11919:43 <emzy> Splitting them up would be against the no expensive operations in tracepoints.
12019:44 <jb55> b10c: but would you be passing that much data on the stack? wouldn't you just have a pointer to the data?
12119:44 <michaelfolkson> So for that reason and it not being on MacOS (and possibly other reasons other too) P2P message logging would still be used e.g. https://bitcoincore.reviews/19509
12219:44 <b10c> with some tricks we can allocate up to 32kb in BCC scripts
12319:47 <b10c> jb55 yes. IIRC you wouldn't be able to e.g. print the pointer data in bpftrace scripts as the printf string itself is limited to something <512 byte
12419:47 <glozow> basic question. can we time stuff more precisely with this?
12519:47 <michaelfolkson> And many/most Bitcoin P2P messages are smaller than 32kb...?
12619:48 <b10c> I haven't tried passing pointers to BCC python scripts, not sure if they can read from other userspace processes
12719:49 <b10c> michaelfolkson: correct, most messages are smaller than 32kb and we can handle them without problems
12819:49 <jb55> glozow: yes, that's one of the connectblock examples. but keep in mind plugging in a tracepoint does have performance implications. but if you're fine with comparing differences between traced IBDs it works great.
12919:50 <b10c> the contrib/tracing/log_raw_p2p_msgs.py example logs raw P2P messags as hex and prints a warning if the message was larger than 32kb and might be cut-off
13019:51 <b10c> glozow: depends, but generally yes
13119:52 <michaelfolkson> Did you consider other possible tracepoints b10c as the first demos? Why choose these 3 specific ones?
13219:52 <FelixWeis> this could be useful for sites like statoshi.info
13319:53 <b10c> I wanted to keep the PR slim and see if we can get general consensus that USDT is something useful for Bitcoin Core
13419:53 <FelixWeis> right now, lopp maintains a fork with more lines of code to do stuff like function-timings.
13519:53 <michaelfolkson> I think you've said it shows the limitations as well as the benefits
13619:53 <jb55> the connectblock one was just me wanting to time IBDs more accurately. The p2p was one of the motivating reasons for me to add eBPF support, to potentially avoid ad-hoc logging everywhere (even if that's not possible right now for portability reasons)
13719:54 <glozow> can we use it to benchmark something more granular than IBD, e.g. script checking? or measure script cache hit rate?
13819:54 <b10c> jb55 actually came up with the first three tracepoints :) I picked them up as they do show what benefits USDT can have
13919:55 <b10c> glozow: yes!
14019:55 <michaelfolkson> FelixWeis: e.g. https://statoshi.info/d/000000003/function-timings?orgId=1
14119:55 <jb55> some others that were suggested: more accurate coincache memory/perf tracking. any others that ya'll think of that might be useful, feel free to suggest!
14219:55 <b10c> Other ideas for potential tracepoints?
14319:55 <FelixWeis> michaelfolkson: exactly
14419:56 <b10c> we've been discussing this in https://github.com/bitcoin/bitcoin/issues/20981 btw
14519:56 <b10c> feel free to add your ideas there :)
14619:56 <michaelfolkson> wumpus "traces for dis- and connections of peers would be useful too"
14719:56 <jb55> laanwj talked about looking at the traces added to jlopp/statoshi to get some ideas, if we could hook prometheus into any core node at runtime, that would be dope.
14819:56 <FelixWeis> its one thing to do bench_bitcoin, and annother to do timings on production system nodes
14919:57 <jnewbery> jb55: eBPF seems like a good complimentary technique to message dumping, rather than a substitution for it
15019:57 <b10c> Discussion: Do you think Bitcoin Core release build should include USDT support?
15119:57 <jb55> b10c: yes as long as we make sure all tracepoint args don't have any perf implications
15219:58 <glozow> are users interested in the tracing? :O
15319:58 <jb55> it will be most useful when tracepoints are available to node maintainers instead of just devs (ie the prometheus example)
15419:58 <michaelfolkson> Yes but presumably every trace point added has to have the performance discussion?
15519:58 <emzy> Are there any security/privacy concerns when USDT support compiled in?
15619:59 <b10c> jnewbery: agree, especially with the 32kb limit (if we can't find a way to read pointers)
15719:59 <sipa> if it's truly just a nop instruction (and no additional work for data used by the tracdpoint), performance should be almost always irrelevant
15820:00 <jnewbery> I may be wrong, but I see eBPF being most useful for understanding global application performance. Logging/message dumping still has a place for understanding message flows for particular peers, etc.
15920:00 <jb55> emzy: plugging into the tracepoints requires root, so I suspect not? unless you attach a rogue script or something
16020:00 <b10c> emzy: as long as root-permissions are required (at which point you can do anything you'd like to do anyway as an attacker) I don't see any
16120:00 <michaelfolkson> And you only need root privileges if you load the programs, you don't have to
16220:00 <b10c> #endmeeting