Per-peer Message Logging (p2p, utils/log/libs)

https://github.com/bitcoin/bitcoin/pulls/19509

Host: troygiorshev  -  PR author: troygiorshev

The PR branch HEAD was 4fc2e3906 at the time of this review club meeting.

Notes

  • PR 19509 is a feature PR, adding per-peer message logging to Bitcoin Core.

  • Nodes on the Bitcoin network communicate using a variety of messages. This PR makes it possible to log these messages, in chronological order and on a per-peer basis.

  • The purpose of this feature is twofold: it should assist people in debugging their node, and it should help people who are unfamiliar with the Bitcoin network to get a feel of how everything works.

  • This feature is locked behind a debug-only -logmessages option. When enabled, it writes every message the node receives to a file, in a folder structure described in the PR. It uses our CAutoFile here.

  • The log parser lives in a new folder, at contrib/message-logging/message-logging-parser.py. As discussed here it takes a list of files to parse, and optionally an output argument.

  • LogMessage takes the newly introduced Span, as discussed in a previous PR review club.

  • The Add LogMessage commit accounts for the increase in file descriptor usage. This is due to be improved in Issue 18911.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (You’re always encouraged to put your PR review on GitHub, even after the PR has been merged.)

  2. How did you test this PR? Was setting up and running everything straightforward enough? As a reminder, the test runner has a --nocleanup option.

  3. Did you verify that file descriptor accounting was being done correctly in the Add LogMessage commit? What is the soft file descriptor limit on your system? What about the hard limit?

  4. The Clean PushMessage and ProcessMessages commit is a cleanup. Is this justified or is this just noise? When should cleanups like this be done?

  5. The Add Message Logging Test commit adds a test. What is being tested? What isn’t being tested? Should this be expanded? Did you read the out-of-tree build discussion here?

  6. Both jnewbery (comment) and practicalswift (comment) proposed extensions to this message logging. What are they? Do you agree that these are a good idea? Are there any extensions that you would like to see?

  7. Bonus: For those of you really interested in python, here’s a great opportunity to explore the oddities of pathlib! How does Path.cwd() / Path(logpath) always return an absolute path, even when logpath itself is already an absolute path?

Meeting Log

  117:00 <troygiorshev> #startmeeting
  217:00 <troygiorshev> hellooo
  317:00 <commodore> hi hi
  417:00 <nehan> hi
  517:00 <jkczyz> hi
  617:00 <jnewbery> hi
  717:00 <gzhao408> heyooooo
  817:00 <lightlike> hi
  917:00 <troygiorshev> welcome to review club everyone! today we'll be talking about a feature pr, adding per-peer message logging to core!
 1017:00 <michaelfolkson> hi
 1117:01 <troygiorshev> notes and questions in the usual spot: https://bitcoincore.reviews/19509.html
 1217:01 <ajonas> hi
 1317:01 <amiti> hi
 1417:01 <troygiorshev> who's had a chance to review the pr? (y/n)
 1517:01 <amiti> y
 1617:01 <felixweis> Hi
 1717:01 <michaelfolkson> y
 1817:01 <jkczyz> n
 1917:02 <jnewbery> y
 2017:02 <lightlike> y
 2117:02 <gzhao408> y
 2217:02 <troygiorshev> awesome!
 2317:02 <troygiorshev> would anyone like to summarize 19509?
 2417:03 <troygiorshev> (for anyone lurking :D )
 2517:04 <michaelfolkson> Anyone new want to try to summarize? I thought your descriptions were very good troygiorshev, commit messages too
 2617:04 <ecurrencyhodler> It's a way to log p2p messages between nodes in Bitcoin Core.
 2717:05 <troygiorshev> michaelfolkson: thanks!
 2817:05 <troygiorshev> ecurrencyhodler: yup, that's it!
 2917:05 <commodore> as well as code to produce json outputs for the logs
 3017:05 <lightlike> Adding an option to dump all p2p messages our node sends/receives into a binary file, with some tools to parse this data.
 3117:05 <troygiorshev> commodore lightlike: great points as well
 3217:06 <ariard> hi
 3317:06 <troygiorshev> it's not a ton more than what it says on the tin! important for feature PRs to be simple and well defined
 3417:06 <troygiorshev> ok, first question, how did you test this PR? Was setting up and running everything straightforward enough?
 3517:07 <ecurrencyhodler> I thought Bitcoin Core already kept track of p2p message logs through -printtoconsole. Is the difference that this would just be recorded to a file?
 3617:08 <amiti> my experience was smooth! I recompiled and ran my node locally with `logmessages` enabled, saw the new files in the data dir & fed them into the parser to interpret them. it was super cool!
 3717:09 <amiti> some other things I did to test involved adding logging to the p2p_message_logging test to see what messages are being recorded
 3817:09 <troygiorshev> ecurrencyhodler: good catch on that config option. It's not quite the same though. printtoconsole is used to print the _debug log_ to console (as opposed to file). This PR is adding much more. Not just debugging, it logs all of the p2p messages that your node sends back and forth, with their details in a mostly human readable form. this isn't currently being done anywhere
 3917:09 <jnewbery> ecurrencyhodler: -printtoconsole prints the debug log to the console, not the p2p messages
 4017:10 <ecurrencyhodler> very cool ty
 4117:11 <jnewbery> another thing you could do to test would be to add -logmessages to the config for a functional test, then run with --nocleanup, and check the logging files that are left in the test directory
 4217:12 <troygiorshev> amiti: good point bringing up p2p_message_logging.py, it's becoming a very multipurpose test!
 4317:12 <jnewbery> I think you could just add "-logmessages" to the list here: https://github.com/bitcoin/bitcoin/blob/c6532fa6c14d3cea5941cff9411937db9a8d5f1a/test/functional/test_framework/test_node.py#L95-L102
 4417:13 <lightlike> though if "-net" debugging is enabled, most p2p messages get an entry in the log - so I would think that this is mostly useful for testing the low-level "net" stuff (as opposed to the net_processing higher-level logic)?
 4517:14 <jnewbery> lightlike: exactly. You should usually be able to trace control flow using the debug logs. This new message logging would help if you need to know the exact messages being sent/received.
 4617:15 <troygiorshev> lightlike: especially for beginners, or for people debugging their node, they might want to see *exactly* what's going on
 4717:16 <troygiorshev> next question: Did you verify that file descriptor accounting was being done correctly in the "Add LogMessage" commit? What is the soft file descriptor limit on your system? What about the hard limit?
 4817:16 <troygiorshev> if you didn't find this, try `ulimit -a` and `ulimit -aH` :)
 4917:19 <lightlike> for me, it's 4096 "open files". Is this considerably lower on some systems?
 5017:19 <gzhao408> Verified to the best of my ability :P my limits say 256 and "unlimited"
 5117:19 <gzhao408> I have a question about why NUM_FDS_MESSAGE_LOGGING is just 1? Do we only allocate one file descriptor for all peer message logging, and we open/close every time we log so we can reuse it every time? (I could be misunderstanding how the file descriptors are used here, I’m assuming we need 1 per open file).
 5217:20 <troygiorshev> gzhao408: that's a great question!
 5317:20 <jnewbery> mine are 1024 and ~1 million
 5417:21 <troygiorshev> gzhao408: you have it exactly right, because we open and close every time we log, we can reuse our single file descriptor each time!
 5517:22 <troygiorshev> on my system I get the following:
 5617:22 <troygiorshev> ulimit -n: 1024
 5717:22 <troygiorshev> ulimit -nH: 4096
 5817:22 <troygiorshev> ulimit -x: unlimited
 5917:23 <troygiorshev> The first two are especially important for a single process. If a process tries to take more than its soft limit (-n) of file descriptors, the OS will kill it!
 6017:23 <jonatack> hi
 6117:24 <troygiorshev> Any process is allowed to increse their soft limit (-n) up to but not past their hard limit (-nH)
 6217:24 <troygiorshev> and -x is the total number available to all processes combined
 6317:25 <troygiorshev> I can guess that gzhao408 is probably on a Mac, because they're the only modern system that has such a low limit!
 6417:25 <troygiorshev> lightlike: so, yeah, it is :)
 6517:25 <gzhao408> Does using more file descriptors help with performance? e.g. I'm guessing that writes are buffered, and you flush on close (depends on implementation) so I was curious if you had considered using 1 per peer or something? Or how you came to the conclusion of just 1?
 6617:25 <commodore> soft: 256, hard unlimited
 6717:26 <gzhao408> troygiorshev: u caught me :cold_sweat
 6817:26 <commodore> ulimit -S -n; ulimit -H -n #macos
 6917:26 <lightlike> did I read the code correctly such that this means that we need one file descriptor per peer connection, so we will be able to connect one peer less after this?
 7017:26 <sipa> gzhao408: if you have fewer file descriptors available, bitcoin core will just refuse to create more connections
 7117:26 <sipa> it doesn't change performance
 7217:26 <gzhao408> sipa: so the fewer the better?
 7317:26 <sipa> what?
 7417:26 <troygiorshev> lightlike: see what sipa just said, yup
 7517:27 <troygiorshev> lightlike: IF we don't have enough available
 7617:27 <sipa> gzhao408: if you want many connections you need more file descriptors
 7717:27 <sipa> if you don't, you don't care
 7817:28 <troygiorshev> gzhao408: so having one (or maybe two) file descriptors per peer could more than triple our usage! It's possible this would bring it up to the hard limit on some systems and restrict the number of peers we could connect to
 7917:28 <gzhao408> Sorry I meant, I'm gathering that choosing to use 1 file descriptor is to minimize the file descriptor requirements
 8017:28 <felixweis> if you have low resources, you might want to reduce file descriptors if you run a pi webserver.
 8117:28 <troygiorshev> if anyone has found it fun diving into this file descriptor stuff, there's an issue open right now to clean it up! #18911
 8217:28 <sipa> gzhao408: right
 8317:29 <gzhao408> thanks sipa!
 8417:29 <lightlike> it seems that nMaxConnections is reduced even if we don't enable the new logging option - though we probably don't care about this?
 8517:29 <troygiorshev> anyone interested keep going on that, I'll throw the next question down
 8617:29 <troygiorshev> The "Add Message Logging Test" commit adds a test. What is being tested? What isn't being tested? Should this be expanded? Did you read the out-of-tree build discussion?
 8717:31 <michaelfolkson> So only the structure of the message, not deserializing or order of messages or message type
 8817:32 <troygiorshev> michaelfolkson: yup, did you see why it was done that way?
 8917:32 <amiti> I think it makes sense for it to be more of a validity check & not worry about the specific messages etc.
 9017:32 <jnewbery> lighlike: generally, we're not at the limit of permitted file descriptors
 9117:32 <michaelfolkson> By too brittle you mean not the best place to test it? Test will be too unreliable?
 9217:34 <troygiorshev> michaelfolkson: pretty much yup! if, say, the order of messages in our handshake changes, or the format of one changes even a little bit then the test would break! that's too much of a pain for everyone
 9317:34 <felixweis> is ":" not an issue on windows filesystem?
 9417:34 <felixweis> for the directory name ip:port
 9517:36 <troygiorshev> felixweis: it almost certainly is, good catch!
 9617:37 <troygiorshev> looks like nobody has tested it on windows yet, and while I expect most people running the debug options will be on linux or mac, this definitely needs to be checked and fixed
 9717:37 <ecurrencyhodler> Sorry for the tangent. But what is the value of a PR like this? Is it purely educational? Or is there a more functional purpose such as potentially identifying a bad acting node?
 9817:38 <sipa> or debugging
 9917:39 <troygiorshev> ecurrencyhodler: not a tangent at all. feature PRs have to be treated a little differently than refactor or bugfix prs.
10017:39 <troygiorshev> ecurrencyhodler: so that sort of question is a really useful one
10117:40 <ecurrencyhodler> okay whew haha
10217:40 <troygiorshev> i agree with both you and sipa, educational and debugging. though I wouldn't be surprised if someone uses it for something else some day!
10317:41 <jnewbery> I'm interested in being able to see exactly what traffic I'm receiving from peers on my node
10417:41 <jnewbery> particularly if I'm able to filter by version/subversion/etc
10517:41 <felixweis> yeah i've tried to decypher wireshark in the past
10617:42 <troygiorshev> felixweis: how did it go?
10717:42 <michaelfolkson> Do a lot of contributors use Wireshark currently? I have never used it on Core. One of the thousand things I'd like to have done but haven't yet
10817:42 <felixweis> too much hasstle
10917:42 <troygiorshev> michaelfolkson: do you think you would use this more?
11017:42 <jnewbery> I could do that using wireshark outside the software, but having a toggle in the application itself that just dumps application level data for the peers I'm interested in seems much more useful
11117:42 <ecurrencyhodler> I wonder if you could you also do some data analysis on the information that's dumped? Maybe clue people in to optimizing p2p network more?
11217:42 <felixweis> i've played around with this now. this gives me together with message-logging-parser.py is the MVP
11317:43 <jnewbery> felixweis: yes, wireshark is very powerful and customizable but is definitely a hassle to get what you want
11417:43 <troygiorshev> ecurrencyhodler: that's a great idea! I await your results :D
11517:43 <felixweis> since the messages are POST p2p encoding. e.g. application messages
11617:43 <ecurrencyhodler> hehe will do troy.
11717:44 <michaelfolkson> troygiorshev: Yeah certainly more accessible
11817:45 <troygiorshev> next question
11917:45 <troygiorshev> The "Clean PushMessage and ProcessMessages" commit is a cleanup. Is this justified or is this just noise? When should cleanups like this be done?
12017:47 <michaelfolkson> This was just style guide type stuff right?
12117:48 <amiti> I am +1 for cleanup commits that are separate and its made clear that its just a cleanup
12217:48 <troygiorshev> michaelfolkson: yup!
12317:48 <troygiorshev> amiti: glad to hear it
12417:49 <michaelfolkson> Yeah +1 don't see why not. I suppose only question is whether it should be its own PR? But I'd lean towards no
12517:49 <jnewbery> amiti: I agree. If you're touching most of a function and you're going to end up changing it to use the new code style, you might as well do it in the first commit and get it out of the way
12617:49 <amiti> also I've spent way too long debugging how the code broke when I just added logging, turned out it was one of those one-line-if-statements without brackets. so small stuff like this can matter.
12717:50 <sipa> ifs with missing braces is one of the (few) actual known examples of coding style that has caused real world bugs
12817:50 <troygiorshev> sipa: ah cool!
12917:50 <sipa> in openssl, of course
13017:50 <troygiorshev> 10 minutes left, last question!
13117:51 <troygiorshev> Both jnewbery and practicalswift proposed extensions to this message logging. What are they? Do you agree that these are a good idea? Are there any extensions that you would like to see?
13217:51 <troygiorshev> there are definitly no wrong answers to this one!
13317:52 <jnewbery> sipa: if it's the one I'm thinking of, I think it was apple's ssl implementation that had the if braces bug
13417:52 <troygiorshev> https://github.com/bitcoin/bitcoin/pull/19509#pullrequestreview-447894982
13517:52 <troygiorshev> https://github.com/bitcoin/bitcoin/pull/19509#pullrequestreview-448269477
13617:52 <michaelfolkson> Right so jnewbery only wants to log messages for a specific peer. To save space, make it cleaner I'm assuming
13717:52 <sipa> jnewbery: you're right!
13817:53 <ajonas> I love the idea of the building a fuzzing corpus
13917:54 <jnewbery> ajonas: I agree. It's a great idea
14017:54 <jnewbery> michaelfolkson: yes, if I want to debug interactions with an individual peer, I don't want to dump potentially 100s of MBs of traffic from other peers
14117:55 <jnewbery> especially if I want to leave this running for long periods
14217:55 <michaelfolkson> Makes sense. Both good suggestions
14317:55 <troygiorshev> anyone tried enabling this over an entire ibd? how large were the files?
14417:56 <felixweis> this works during IBD? could be interesting to visualize peer round-robin
14517:56 <jnewbery> it'd be >300GB since you'd be dumping the entire blockchain in net serialized format
14617:56 <sipa> one use case i have is that when investigating the effect of some P2P change, i want to look at the exact transcript... usually not at the level of seeing all the messages, but you do want to see it per peer; using debug.log is really annoying as peer ids get reused across restarts, for example
14717:57 <sipa> so having a file per peer solves that pretty elegantly
14817:57 <ajonas> I did IBD on testnet with it on
14917:58 <ajonas> the files were big, the python script just seemed to hang
15017:58 <sipa> if i can ask: what is the rationale for splitting received/sent messages up into two files?
15117:58 <felixweis> maybe a flag to limit to certain message types
15217:58 <sipa> (maybe this was addressed in the PR, just tell me so if i'm missing context)
15317:59 <jonatack> https://github.com/bitcoin/bitcoin/pull/19509#issuecomment-664374225
15417:59 <troygiorshev> sipa: 1: saves a byte per message, 2: hypothetically someone might want to just see received or sent
15517:59 <troygiorshev> one minute left!
15618:00 <jnewbery> it means we don't need to a field to indicate whether the message was outgoing or incoming. The parser can take any number of files as input, so you can use both together if you want the two-way transcript
15718:00 <troygiorshev> ok that's time! didn't get to the bonus question but anyone interested in python can find it themselves :)
15818:00 <felixweis> hm maybe interweaving send/receive isn't that bad because ofthen things are request/respongs (inv/getdata, ...)
15918:01 <sipa> troygiorshev: is there a way to interleave them again?
16018:01 <sipa> having to look at them separately would make it very painful to use
16118:01 <troygiorshev> sipa: yup! just giving both files to the parser interleaves them
16218:01 <troygiorshev> it's the intended use :)
16318:01 <sipa> ah great
16418:01 <lightlike> can we reconstruct the time order in which messages were processed in our node, even with the splitting in two files?
16518:01 <troygiorshev> thanks to everyone for coming and thanks for all of the great discussion!
16618:02 <jnewbery> lightlike: yes, each message has a timestamp of sent/received time
16718:02 <felixweis> thans troygiorshev
16818:02 <lightlike> thanks
16918:02 <jnewbery> so feeding them both into the parser will interleave them in order
17018:02 <michaelfolkson> Thanks troygiorshev!
17118:03 <jnewbery> the idea is that the parser is just one potential tool to analyze these files. If you want to do more quantitive analysis of the traffic, you could build a tool that parses/analyzes them in a different way.
17218:03 <troygiorshev> #endmeeting