Mitigate disk filling attacks by rate limiting LogPrintf (utils/log/libs)

https://github.com/bitcoin/bitcoin/pulls/21603

Host: jnewbery  -  PR author: dergoegge

The PR branch HEAD was 855d05e43 at the time of this review club meeting.

This week, we’ll look at two alternative PRs for rate-limiting logging to the disk:

Notes

  • Bitcoin Core can log events either to stdout or to a log file on disk (or both). There are various configuration options to control the logging, including:

    • -debuglogfile to specify the file to log to. Setting -nodebuglogfile disables logging to file entirely.
    • -debug to enable low-level debug logging. Can be specified as -debug=<comma-separated list of categories> to only log certain categories. Note that even with -nodebug, Bitcoin Core will still log important events to file by default.
    • -logips, -logtimestamps, -logthreadnames, -logsourcelocations and -logtimemicros all specify additional information that can be added to each log line.
    • -printtoconsole controls whether the logging is output to stdout as well as the log file.
  • We expect most users to run with default options. For logging, that means that important, high-level logs will be written to file, but low-level debugging logs will not.

  • A specific log in the code can be assigned to one of several different categories. Those categories include things like “net” for events in the networking and p2p code, “validation” for events in the validation code, and so on.

  • We use two macros for logging:

    • LogPrintf() logs unconditionally (unless -nodebuglogfile and -noprinttoconsole are set).
    • LogPrint() logs if the specified category is enabled. For example, LogPrint(BCLog::NET, "<log message>"); will only log if the ‘net’ category is enabled.
  • One possible class of denial of service attacks is for an adversary to cause a victim node to repeatedly write logs to disk. This would eventually cause the victim’s disk to fill up, and the node to fail.

  • For that reason, we want the unconditional LogPrintf() macro to be used sparingly, and only in places where an attacker can’t easily trigger a log being written.

  • More generally, it would be good if there was a mechanism that could eliminate this entire class of potential exploits by throttling logging whenever it exceeds a certain rate. Issue 21559 documents the potential exploit and suggested fix.

  • PRs 21603 and 21706 are two alternative approaches. One implements rate-limiting per log location, and one implements rate-limiting globally. Both were implemented by the same author, who is seeking conceptual/approach review.

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?

  2. Macros are generally discouraged in C++. Why do we use them for our logging routines?

  3. Try enabling logging for all categories and starting Bitcoin Core (eg with bitcoind -debug=all). Is the output useful? Are there specific categories which generate a lot of logs?

  4. The two approaches are to rate-limit logs per-location and globally. What are the advantages and disadvantages of those two approaches? Which do you think would result in more useful log output?

  5. Both PRs add a new LogPrintfWithoutRateLimiting() macro. What is this used for? Why is it important?

  6. Both PRs add new variables m_bytes_dropped and m_messages_dropped. How are these variables used? How would they be useful for a developer trying to debug an issue?

Meeting Log

  110:00 <jnewbery> #startmeeting
  210:00 <amiti> hi
  310:00 <dunxen> hi!
  410:00 <hiii> can you recommend some bitcoin dev and other crypto channels?
  510:00 <LarryRuane> hi
  610:00 <jnewbery> hi folks. Welcome to PR review club. Feel free to say hi to let everyone know you're here
  710:00 <dergoegge> hi
  810:00 <sriramdvt> hi
  910:00 <svav> Hi All
 1010:00 <jnewbery> hiii: hi! Welcome.
 1110:00 <lightlike> hi
 1210:00 <schmidty> Hi
 1310:00 <hiii> oh hi amiti.. arent u a mentor for summerofbitcoin?
 1410:00 <jnewbery> Is anyone else here for the first time?
 1510:01 <vasanth2[m]> me!
 1610:01 <michaelfolkson> hi
 1710:01 <prakash> hi
 1810:01 * hannibit hi
 1910:01 <jnewbery> vasanth2[m]: welcome :)
 2010:01 <emzy> hi
 2110:01 <vasanth2[m]> thank you jnewbery :D
 2210:01 <jnewbery> The notes and questions are in the normal place: https://bitcoincore.reviews/21603
 2310:02 <amiti> hiii: yup :salute:
 2410:02 <jnewbery> who had a chance to read the notes & questions, and review the PR? (y/n)?
 2510:02 <LarryRuane> y
 2610:02 <emzy> n
 2710:02 * hannibit n
 2810:02 <michaelfolkson> y
 2910:02 <amiti> brief review y
 3010:02 <dergoegge> y :D
 3110:02 <dunxen> y-ish
 3210:02 <hiii> i feel honored to chat with u, hi again amiti.. where can I ask about summerofbitcoin related questions? as in when is it starting, etc. If there is an offtopic channel please let me know
 3310:02 <svav> y - Read the notes, did not review the PR
 3410:03 <sriramdvt> n
 3510:03 <prakash> y
 3610:03 <sipa> hiii: not now, please, there is a meeting ongoing about a specific pull request
 3710:03 <jnewbery> hiii: a bit offtopic for now. Can you save it for after the meeting please?
 3810:03 <hiii> ok got it
 3910:03 <jnewbery> dergoegge authored the PRs, so I'm glad he also had time to review them :)
 4010:04 <jnewbery> I have a series of questions to guide the conversation, but feel free to jump in at any point and ask your own questions if anything is unclear. We're all here to learn, so questions are very welcome.
 4110:04 <jnewbery> ok, let's get started
 4210:04 <jnewbery> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?
 4310:04 <jnewbery> And also, can you briefly summarize what the PRs are trying to do?
 4410:05 <FelixWeis> y
 4510:05 <svav> Are trying to prevent a potential attack vector, the vector being the logging system and its overstimulation
 4610:06 <michaelfolkson> So Concept ACK and then Approach ACK is whether it should be global or not
 4710:06 <svav> If a node is made to log too much, it can fill up all disk space and bring down the node
 4810:06 <jnewbery> svav: right. Trying to prevent an attacker from filling our disk with logs
 4910:06 <emzy> Concept ACK. It implements rate limits for every logging category separately.
 5010:07 <LarryRuane> I'm not sure if I'm an approach ACK even ... are you all aware of an existing mitigation for excess logging? the logrotate command, and also see https://github.com/bitcoin/bitcoin/search?q=m_reopen_file
 5110:07 <dunxen> Concept ACK. The PRs add throttling for debug logging to prevent filling disk and DoS-ing our node
 5210:07 <jnewbery> How about people's review approach? Did anyone test this manually? Were you able to trigger the log rate-limiting?
 5310:07 <LarryRuane> what logrotate does is, in effect, gives you a sliding window of the latest log messages (configurable in size)
 5410:08 <michaelfolkson> LarryRuane: And deletes old logs?
 5510:08 <jnewbery> LarryRuane: very good point. Can you explain how it works?
 5610:08 <LarryRuane> when a log file reaches a certain size, logrotate can move it (safely) to another name, and recreate a new log file
 5710:08 <FelixWeis> mostly tried to fiddle with range-diff to understand the 2 different prs and their commonality
 5810:09 <LarryRuane> then logrotate can do something with it, like compress it, delete it, send it somewhere, etc (configurable)
 5910:09 <svav> In general, how big a problem do people think disk filling attacks could be at present? How was it established that this is a serious attack surface?
 6010:09 <jnewbery> The functionality that LarryRuane is refering to was added here: https://github.com/bitcoin/bitcoin/pull/917
 6110:09 <LarryRuane> downside of logrotate, possibly: i'm not sure if there's a windows version .. it's built into linux (and probably other unixs)
 6210:10 <sipa> LarryRuane: is it common/easy to use logrotate for rotating user logs (as opposed to system-level services)?
 6310:10 <lightlike> since this is not a bitcoin-specific attack vector: are there any famous examples where some kind of logging DOS attack was executed and caused damage?
 6410:10 <svav> Has anyone worked out how quickly this attack could bring a node down?
 6510:10 <jnewbery> svav: there have been quite a few logs that have been removed/changed to LogPrint(<category>) to prevent such a thing being possible
 6610:11 <jnewbery> this PR is a more general approach
 6710:11 <LarryRuane> sipa: I really don't know, there is a man page for logrotate, so that may indicate that it's not internal-linux only
 6810:11 <FelixWeis> when we do a log rotate, does it create a new debug.log? on macos its continuing to write to debug.log.1
 6910:11 <LarryRuane> i thought about writing a script in contrib, or documenting logrotate (specifically for bitcoind), that may be a good idea
 7010:12 <sipa> LarryRuane: i'm familiar with logrotate, and i have no doubt it can be configured for our purposes - my question if it's easy to do so
 7110:12 <jnewbery> I think logrotate is probably the best solution, but we can't expect all users to configure logrotate if they just want to start a bitcoin core node
 7210:12 <sipa> can we make bitcoind "subscribe" to logrotate, without needing system configuration?
 7310:12 <sipa> as in without needing the user to go edit a config file
 7410:12 <dergoegge> svav: https://github.com/bitcoin/bitcoin/pull/19995#issuecomment-697434667 praticalswift has some numbers here
 7510:12 <LarryRuane> i used to know a lot about logrotate, need to re-investigate
 7610:12 <michaelfolkson> Do people here actively monitor and manage the size of their logs? I can imagine they get pretty big on most verbose logging even without an attack
 7710:13 <sipa> how many people enable debug=net or other spammy log categories even
 7810:13 <jnewbery> sipa: that seems like a good question, but what would the defaults be? Archive files and only keep the x most recent?
 7910:13 <emzy> I see no one using logrotate for the GUI version of any software...
 8010:13 <sipa> emzy: indeed
 8110:13 <LarryRuane> what's interesting is, barring logging bugs (that these two PRs are addressing), the logging should be linear with the size of the blockchain (the UpdateTip messages)
 8210:14 <sipa> if you don't enable any debug logging this should be almost a non-issue
 8310:14 <jnewbery> I believe the motivation here is to eliminate the possibility of a non-category log (ie a logPrint(...) rather than a logPrintf(<category>, ...)) from completely filling up the disk
 8410:14 <LarryRuane> so if you have space for blockchain growth, you have space for logging (a small constant factor increase)
 8510:15 <jnewbery> I think if you're running with -debug=all or -debug=spammy_component, then you're already an advanced user and you should be able to manage you disk
 8610:15 <emzy> LarryRuane: I think that depends on logging settings.
 8710:15 <LarryRuane> yes, IIUC, these PRs are really to work around logging bugs
 8810:15 <LarryRuane> emzy: yes I meant with no extra debug logs turned on
 8910:16 <jnewbery> but if you just install and run bitcoind or bitcoin-qt and hit run, then we don't want a log to be able to fill your disk
 9010:16 <amiti> oh interesting, does the current logic in the PR apply to the logging categories?
 9110:16 <amiti> I was wondering about how categories were handled, net logs print a LOT
 9210:16 <jnewbery> I believe the logic applies to both, but the main motivation is for non-category logs
 9310:17 <amiti> yeah I saw the categories being counted & maintained (esp for the PR that locally supresses), but also looks like its bounded by the same rate across the board?
 9410:17 <jnewbery> I'm going to move on to the next question, but if you still have approach questions or comments, feel free to continue sharing them
 9510:17 <jnewbery> 2. Macros are generally discouraged in C++. Why do we use them for our logging routines?
 9610:17 <dergoegge> amiti: it only applies to non-category logs
 9710:17 <LarryRuane> jnewbery: yes, if you specifically enable categories, you're aware that logging could fill up the disk
 9810:17 <jnewbery> dergoegge: ah. Thanks!
 9910:17 <amiti> dergoegge: ohhh, ok
10010:17 <LarryRuane> jnewbery: filenames and line numbers!
10110:18 <FelixWeis> i believe macros are the only way to get file/fucntion/line number info ?
10210:18 <dergoegge> see here: https://github.com/bitcoin/bitcoin/blob/855d05e4377cd1eb902cd74c80056d59cb98b7b8/src/logging.h#L220
10310:18 <dergoegge> /* skip_disk_usage_rate_limiting */ true
10410:18 <jnewbery> oh interesting. That wasn't going to be my answer!
10510:18 <amiti> I found the PR history of this functionality interesting
10610:18 <dergoegge> people using -debug are assumed to be advanced and know the risks
10710:19 <amiti> looks like #17218 is what brought back a macro so we don't unnecessary evaluate expressions
10810:19 <jnewbery> dergoegge: +1
10910:19 <FelixWeis> should enabling debug= logs disable the rate limiting?
11010:19 <jnewbery> FelixWeis: there's a separate config option to disable rate limiting
11110:20 <LarryRuane> amiti: "so we don't unnecessary evaluate expressions" -- that makes sense, do we test that the logging expressions don't crash the system? I think possibly that the functional tests run with -debug
11210:20 <michaelfolkson> dergoegge: But people using -debug are a more attractive target :) Their logs are already huge (assuming they aren't cleaning disk regularly)
11310:20 <jnewbery> amiti: exactly. Good code archaeology! 17218 fixes a regression where logPrintf was changed to a function.
11410:21 <jnewbery> oh sorry, I mean LogPrint(), not LogPrintf()
11510:21 <LarryRuane> michaelfolkson: my impression is that people enable debug logs only for a limited time, to look into a specific problem
11610:21 <michaelfolkson> LarryRuane: Right, sounds reasonable
11710:22 <jnewbery> LarryRuane: The functional tests run with all debug logs *except* leveldb and libevent logs: https://github.com/bitcoin/bitcoin/blob/6bc1eca01b2f88e081e71b783b3d45287700f8a5/test/functional/test_framework/test_node.py#L101-L103
11810:22 <dergoegge> LarryRuane: i would assume the same
11910:23 <jnewbery> leveldb and libevent logs are very talkative!
12010:23 <FelixWeis> Im using debug=mempool in a project to get ATMP and replacement timings. If I didn't read the changelog when upgrading I might miss out on valuable statistics.
12110:23 <LarryRuane> jnewbery: that's wise, was going to suggest that if not already being done
12210:24 <svav> So we use a macro instead of a function for conditional logging so arguments are not evaluated when logging for the category is not enabled ... ?
12310:24 <jnewbery> svav: exactly right
12410:24 <jnewbery> Macros are essentially find-and-replace commands run by the preprocessor, LogPrint(<category>, ...) will expand to:
12510:24 <LarryRuane> I wonder if there's a way to have the compiler ensure that logging expressions don't have side-effects
12610:25 <michaelfolkson> leveldb was the worst! I can't even imagine the logs being that useful assuming a leveldb bug but maybe they are in some edge case
12710:25 <dergoegge> FelixWeis: category logs will never be dropped
12810:25 <jnewbery> if (enabled(<category>)) log(...)
12910:25 <FelixWeis> dergoegge: oh good
13010:25 <jnewbery> so if the category is disabled, then we won't evaluate the arguements to the LogPrint() macro
13110:26 <jnewbery> which might be nice if we had something like LogPrint(<category>, ReadThingFromDisk(), DoAnExpensiveCalculation(), ...)
13210:26 <jnewbery> ok, moving on to the next question
13310:26 <jnewbery> 3. Try enabling logging for all categories and starting Bitcoin Core (eg with bitcoind -debug=all). Is the output useful? Are there specific categories which generate a lot of logs?
13410:27 <LarryRuane> it is useful, net generates a lot
13510:27 <jnewbery> Did people manage to do this? How did you enjoy the bitcoind logs?
13610:27 <michaelfolkson> A fun read
13710:27 <LarryRuane> grep is your friend :)
13810:27 <dergoegge> usefulness depends on what you are looking for i guess, but yea lots of net logs
13910:28 <lightlike> in the functional tests, it's really helpful to understand what's going on
14010:28 <michaelfolkson> The net logs looked useful to me
14110:28 <jnewbery> good! There are lots of config options to make the logs more useful: -logips, -logtimestamps, -logthreadnames, -logsourcelocations and -logtimemicros
14210:28 <LarryRuane> also libevent .. can someone give a quick explanation of what libevent does? (if that's too much side-track, that's ok)
14310:29 <michaelfolkson> Though some of those net logs appear in the minimal log
14410:29 <jnewbery> using -logthreadnames and looking at the validationinterface callbacks being enqueued and then the callbacks being called is a really good exercise to see how validation processing happens and triggers events in different subsystems
14510:29 <jnewbery> LarryRuane: we use libevent for our RPC server. I think that's it currently.
14610:30 <michaelfolkson> What does libevent do specifically though?
14710:30 <jnewbery> I think there was talk of using some of its functionality for our p2p networking, but I don't really know the details of that
14810:31 <jnewbery> lightlike: I agree. I'd suggest running some functional tests with -nocleanup, and then running the combine logs parser with -c to colorize the different nodes. Gives you a good view of what the different nodes in the test are doing.
14910:32 <jnewbery> Next question: The two approaches are to rate-limit logs per-location and globally. What are the advantages and disadvantages of those two approaches? Which do you think would result in more useful log output?
15010:33 <michaelfolkson> Advantages of global - better cut off logs, better log experience in the case of cut off. Disadvantages of global - we generally want to avoid global due to security, tight coupling
15110:34 <emzy> I think this PR makes more sense than the global alternative, because global rate limiting would open another attack vector.
15210:34 <emzy> An attacker could trigger the global limit in one category and conceal another attack from the log.
15310:34 <svav> I don't know much about this, but I'd say if you'd do it locally, it might be better, because it won't affect the whole network
15410:35 <LarryRuane> no, global in this context means across the entire bitcoind node (the local node)
15510:35 <svav> Locally means only nodes affected by overstimulation of logging would have restrictions applied, so only nodes being impacted would be controlled
15610:35 <lightlike> one advantage of global: less bookkeeping necessary over all the different locations that might log.
15710:35 <michaelfolkson> So kind of user experience vs security conversation. Depends how non-trivial the security concern is
15810:35 <jnewbery> svav: you're misunderstanding how I'm using 'local' and 'global'. 'local' means "only rate-limit the loggin from this line of code". 'global' means "rate-limit logs from across the whole program"
15910:35 <LarryRuane> i was expecting the global PR (21706) to be simpler, but it seems about the same complexity
16010:36 <svav> jnewbery: thanks for clarifying
16110:37 <LarryRuane> overall, I'm surprised how complex both of these PRs are (over 200 lines of diff), given what they do .. but maybe that's just how it has to be
16210:37 <jnewbery> LarryRuane: much of that is new tests
16310:37 <LarryRuane> ah, good point
16410:38 <jnewbery> dergoegge added the second (global) implementation based on a review comment I made in the original PR: https://github.com/bitcoin/bitcoin/pull/19995#pullrequestreview-505540213 (thanks dergoegge!)
16510:39 <LarryRuane> maybe a simpler approach, that might be good enough, is to keep track of only the most recent logged message, and if it's repeated (soon), don't log it ... and after some timeout, log "last message repeated 200 times" or whatever
16610:39 <LarryRuane> but then i guess if each log message includes some specific data, like a hash, that wouldn't help
16710:39 <jnewbery> I'll explain why I suggested global log-shedding. If we only drop logs from a single location, we end up with a log file that looks normal, but some logs are missing. The only way to know that is to search for a very specific "start dropping logs" or "stop dropping logs" log.
16810:40 <jnewbery> whereas if you drop logs globally, it's much more obvious that they're being dropped
16910:40 <LarryRuane> that makes sense to me
17010:41 <jnewbery> I think partial logging is quite dangerous. I've been tripped up many times debugging issues when the logs are incomplete and I've been sent in the wrong direction. For example, what would you conclude if there was the following code:
17110:41 <jnewbery> log(1);
17210:41 <jnewbery> if (condition) {
17310:41 <jnewbery> do_thing();
17410:41 <jnewbery> log(2);
17510:41 <jnewbery> }
17610:41 <jnewbery> log(3)
17710:41 <jnewbery> and then you saw this log:
17810:41 <jnewbery> cat log.txt
17910:41 <jnewbery> > 1 3
18010:41 <LarryRuane> !condition :)
18110:42 <jnewbery> LarryRuane: right
18210:42 <dergoegge> but if 2 was suppressed the logs would say so
18310:42 <dergoegge> but you might have to search for it
18410:42 <jnewbery> dergoegge: maybe much further up in the log file
18510:42 <jnewbery> and if you weren't aware of that feature you wouldn't know to look
18610:43 <dergoegge> thats true, could be very annoying
18710:44 <jnewbery> I don't think it's a huge issue, but it's something I've tripped over before. Thanks to dergoegge for implementing both approaches
18810:44 <jnewbery> Any other thoughts on global vs local, or shall we move on?
18910:44 <dergoegge> i dont have a strong preference and am happy to maintain either
19010:44 <michaelfolkson> So how to pick which one? Gut feel is that global just about edges it
19110:45 <jnewbery> I think if no-one has a very strong preference, then the person implementing it gets to choose. I'm an approach ACK on either
19210:46 <jnewbery> ok, next question. Both PRs add a new LogPrintfWithoutRateLimiting() macro. What is this used for? Why is it important?
19310:47 <dergoegge> jnewbery: that sounds fair, i will pick one of the two in the upcoming days based on the reviews
19410:48 <dergoegge> LogPrintfWithoutRateLimiting() is only used once in validation.cpp for "new best=<hash>" logs
19510:48 <lightlike> to exclude messages from the rate limting that are written a lot during IBD and cannot be abused as an attack vector (because they require miner's work)
19610:48 <dergoegge> lightlike: well said +1
19710:49 <jnewbery> lightlike dergoegge: right!
19810:49 <jnewbery> it'd be a shame to start rate-limiting a really important log during initial sync
19910:50 <jnewbery> 6. Both PRs add new variables m_bytes_dropped and m_messages_dropped. How are these variables used? How would they be useful for a developer trying to debug an issue?
20010:52 <dergoegge> they are used to keep track of how many messages and how many bytes were dropped. we can use this to decide if logging should be resumed or print a tally at the end of a suppression period
20110:53 <jnewbery> dergoegge: you're right!
20210:53 <dergoegge> only the non-global approach uses them to decide if logging should be resumed
20310:53 <jnewbery> you've done a very thorough review :)
20410:53 <dergoegge> :D
20510:53 <jnewbery> ok, any final questions before we wrap up?
20610:53 <prakash> Hi..I am new here. I have a question on global vs local. Is global dropping all Logprintf messages if the rate limit kicks in ? If so, would'nt an attacker be able to disable logging altogether before an actual attack?
20710:55 <jnewbery> prakash: that's exactly what global log-shedding would do. However, I don't think attackers are generally worried about their victim's logs.
20810:55 <jnewbery> but it's a good thought!
20910:55 <LarryRuane> A general question came up for me during review of these PRs ... if we add a new function, is the convention to be before or after the first (maybe only) call point?
21010:56 <emzy> jnewbery: in general I dissagree. Not sure if in Bitcoin it is the case.
21110:56 <jnewbery> LarryRuane: I don't understand the question. What is before or after the call point?
21210:56 <LarryRuane> is "bottom-up" (low level functions followed by higher-level), or "top-down" (higher level followed by lower level functions)?
21310:57 <jnewbery> emzy: maybe you're right. I just can't think of a scenario where someone exploiting a weakness in Bitcoin Core would also need to suppress logs.
21410:57 <emzy> jnewbery: Right. But there could be one.
21510:57 <prakash> jnewbery: I was implying that the attacker can hide what he is doing. Yes. It may be unrealistic scenario. Thanks for clarifying :)
21610:58 <jnewbery> emzy prakash: you may be right!
21710:58 <lightlike> LarryRuane: do you mean the placement of the code for the function in the source code?
21810:59 <prakash> but i was able to print repeated messages logged in about incorrect rpc password : "ThreadRPCServer incorrect password attempt from 127.0.0.1" ...so technically it would be easy to turn off logging.
21910:59 <prakash> but this would be a targeted attack vector. not network wide attack.
22010:59 <LarryRuane> yes ... this PR adds ComputeQuotaUsageStats(), and it's called from BCLog::Logger::LogPrintStr() ... so is it preferred to add the new function before or after the function that's calling it?
22110:59 <michaelfolkson> prakash: Is there no limit on rpc password attempts?
22211:00 <jnewbery> prakash: right. The rpc port shouldn't be open to the public internet, but it's a very good point.
22311:00 <prakash> i dont see a limit in logging
22411:00 <prakash> ohh. i see.
22511:00 <dergoegge> prakash: the limit is 1MB per hour so that would have to be a lot of failed attempts
22611:00 <jnewbery> ok, I've got to run now, but obviously feel free to continue the discussion!
22711:00 <jnewbery> #endmeeting