libs)

Jun 16, 2021

https://github.com/bitcoin/bitcoin/pull/21603

Host: jnewbery - PR author: dergoegge

The PR branch HEAD was 855d05e43 at the time of this review club meeting.

This week, we’ll look at two alternative PRs for rate-limiting logging to the disk:

Notes

Bitcoin Core can log events either to stdout or to a log file on disk (or both). There are various configuration options to control the logging, including:
- -debuglogfile to specify the file to log to. Setting -nodebuglogfile disables logging to file entirely.
- -debug to enable low-level debug logging. Can be specified as -debug=<comma-separated list of categories> to only log certain categories. Note that even with -nodebug, Bitcoin Core will still log important events to file by default.
- -logips, -logtimestamps, -logthreadnames, -logsourcelocations and -logtimemicros all specify additional information that can be added to each log line.
- -printtoconsole controls whether the logging is output to stdout as well as the log file.
We expect most users to run with default options. For logging, that means that important, high-level logs will be written to file, but low-level debugging logs will not.
A specific log in the code can be assigned to one of several different categories. Those categories include things like “net” for events in the networking and p2p code, “validation” for events in the validation code, and so on.
We use two macros for logging:
- LogPrintf() logs unconditionally (unless -nodebuglogfile and -noprinttoconsole are set).
- LogPrint() logs if the specified category is enabled. For example, LogPrint(BCLog::NET, "<log message>"); will only log if the ‘net’ category is enabled.
One possible class of denial of service attacks is for an adversary to cause a victim node to repeatedly write logs to disk. This would eventually cause the victim’s disk to fill up, and the node to fail.
For that reason, we want the unconditional LogPrintf() macro to be used sparingly, and only in places where an attacker can’t easily trigger a log being written.
More generally, it would be good if there was a mechanism that could eliminate this entire class of potential exploits by throttling logging whenever it exceeds a certain rate. Issue 21559 documents the potential exploit and suggested fix.
PRs 21603 and 21706 are two alternative approaches. One implements rate-limiting per log location, and one implements rate-limiting globally. Both were implemented by the same author, who is seeking conceptual/approach review.

Questions

Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?
Macros are generally discouraged in C++. Why do we use them for our logging routines?
Try enabling logging for all categories and starting Bitcoin Core (eg with bitcoind -debug=all). Is the output useful? Are there specific categories which generate a lot of logs?
The two approaches are to rate-limit logs per-location and globally. What are the advantages and disadvantages of those two approaches? Which do you think would result in more useful log output?
Both PRs add a new LogPrintfWithoutRateLimiting() macro. What is this used for? Why is it important?
Both PRs add new variables m_bytes_dropped and m_messages_dropped. How are these variables used? How would they be useful for a developer trying to debug an issue?

Meeting Log

10:00

<jnewbery> #startmeeting

10:00

<amiti> hi

10:00

<dunxen> hi!

10:00

<hiii> can you recommend some bitcoin dev and other crypto channels?

10:00

<LarryRuane> hi

10:00

<jnewbery> hi folks. Welcome to PR review club. Feel free to say hi to let everyone know you're here

10:00

<dergoegge> hi

10:00

<sriramdvt> hi

10:00

<svav> Hi All

10:00

<jnewbery> hiii: hi! Welcome.

10:00

<lightlike> hi

10:00

<schmidty> Hi

10:00

<hiii> oh hi amiti.. arent u a mentor for summerofbitcoin?

10:00

<jnewbery> Is anyone else here for the first time?

10:01

<vasanth2[m]> me!

10:01

<michaelfolkson> hi

10:01

<prakash> hi

10:01

* hannibit hi

10:01

<jnewbery> vasanth2[m]: welcome :)

10:01

<emzy> hi

10:01

<vasanth2[m]> thank you jnewbery :D

10:01

<jnewbery> The notes and questions are in the normal place: https://bitcoincore.reviews/21603

10:02

<amiti> hiii: yup :salute:

10:02

<jnewbery> who had a chance to read the notes & questions, and review the PR? (y/n)?

10:02

<LarryRuane> y

10:02

<emzy> n

10:02

* hannibit n

10:02

<michaelfolkson> y

10:02

<amiti> brief review y

10:02

<dergoegge> y :D

10:02

<dunxen> y-ish

10:02

<hiii> i feel honored to chat with u, hi again amiti.. where can I ask about summerofbitcoin related questions? as in when is it starting, etc. If there is an offtopic channel please let me know

10:02

<svav> y - Read the notes, did not review the PR

10:03

<sriramdvt> n

10:03

<prakash> y

10:03

<sipa> hiii: not now, please, there is a meeting ongoing about a specific pull request

10:03

<jnewbery> hiii: a bit offtopic for now. Can you save it for after the meeting please?

10:03

<hiii> ok got it

10:03

<jnewbery> dergoegge authored the PRs, so I'm glad he also had time to review them :)

10:04

<jnewbery> I have a series of questions to guide the conversation, but feel free to jump in at any point and ask your own questions if anything is unclear. We're all here to learn, so questions are very welcome.

10:04

<jnewbery> ok, let's get started

10:04

<jnewbery> Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? What was your review approach?

10:04

<jnewbery> And also, can you briefly summarize what the PRs are trying to do?

10:05

<FelixWeis> y

10:05

<svav> Are trying to prevent a potential attack vector, the vector being the logging system and its overstimulation

10:06

<michaelfolkson> So Concept ACK and then Approach ACK is whether it should be global or not

10:06

<svav> If a node is made to log too much, it can fill up all disk space and bring down the node

10:06

<jnewbery> svav: right. Trying to prevent an attacker from filling our disk with logs

10:06

<emzy> Concept ACK. It implements rate limits for every logging category separately.

10:07

<LarryRuane> I'm not sure if I'm an approach ACK even ... are you all aware of an existing mitigation for excess logging? the logrotate command, and also see https://github.com/bitcoin/bitcoin/search?q=m_reopen_file

10:07

<dunxen> Concept ACK. The PRs add throttling for debug logging to prevent filling disk and DoS-ing our node

10:07

<jnewbery> How about people's review approach? Did anyone test this manually? Were you able to trigger the log rate-limiting?

10:07

<LarryRuane> what logrotate does is, in effect, gives you a sliding window of the latest log messages (configurable in size)

10:08

<michaelfolkson> LarryRuane: And deletes old logs?

10:08

<jnewbery> LarryRuane: very good point. Can you explain how it works?

10:08

<LarryRuane> when a log file reaches a certain size, logrotate can move it (safely) to another name, and recreate a new log file

10:08

<FelixWeis> mostly tried to fiddle with range-diff to understand the 2 different prs and their commonality

10:09

<LarryRuane> then logrotate can do something with it, like compress it, delete it, send it somewhere, etc (configurable)

10:09

<svav> In general, how big a problem do people think disk filling attacks could be at present? How was it established that this is a serious attack surface?

10:09

<jnewbery> The functionality that LarryRuane is refering to was added here: https://github.com/bitcoin/bitcoin/pull/917

10:09

<LarryRuane> downside of logrotate, possibly: i'm not sure if there's a windows version .. it's built into linux (and probably other unixs)

10:10

<sipa> LarryRuane: is it common/easy to use logrotate for rotating user logs (as opposed to system-level services)?

10:10

<lightlike> since this is not a bitcoin-specific attack vector: are there any famous examples where some kind of logging DOS attack was executed and caused damage?

10:10

<svav> Has anyone worked out how quickly this attack could bring a node down?

10:10

<jnewbery> svav: there have been quite a few logs that have been removed/changed to LogPrint(<category>) to prevent such a thing being possible

10:11

<jnewbery> this PR is a more general approach

10:11

<LarryRuane> sipa: I really don't know, there is a man page for logrotate, so that may indicate that it's not internal-linux only

10:11

<FelixWeis> when we do a log rotate, does it create a new debug.log? on macos its continuing to write to debug.log.1

10:11

<LarryRuane> i thought about writing a script in contrib, or documenting logrotate (specifically for bitcoind), that may be a good idea

10:12

<sipa> LarryRuane: i'm familiar with logrotate, and i have no doubt it can be configured for our purposes - my question if it's easy to do so

10:12

<jnewbery> I think logrotate is probably the best solution, but we can't expect all users to configure logrotate if they just want to start a bitcoin core node

10:12

<sipa> can we make bitcoind "subscribe" to logrotate, without needing system configuration?

10:12

<sipa> as in without needing the user to go edit a config file

10:12

<dergoegge> svav: https://github.com/bitcoin/bitcoin/pull/19995#issuecomment-697434667 praticalswift has some numbers here

10:12

<LarryRuane> i used to know a lot about logrotate, need to re-investigate

10:12

<michaelfolkson> Do people here actively monitor and manage the size of their logs? I can imagine they get pretty big on most verbose logging even without an attack

10:13

<sipa> how many people enable debug=net or other spammy log categories even

10:13

<jnewbery> sipa: that seems like a good question, but what would the defaults be? Archive files and only keep the x most recent?

10:13

<emzy> I see no one using logrotate for the GUI version of any software...

10:13

<sipa> emzy: indeed

10:13

<LarryRuane> what's interesting is, barring logging bugs (that these two PRs are addressing), the logging should be linear with the size of the blockchain (the UpdateTip messages)

10:14

<sipa> if you don't enable any debug logging this should be almost a non-issue

10:14

<jnewbery> I believe the motivation here is to eliminate the possibility of a non-category log (ie a logPrint(...) rather than a logPrintf(<category>, ...)) from completely filling up the disk

10:14

<LarryRuane> so if you have space for blockchain growth, you have space for logging (a small constant factor increase)

10:15

<jnewbery> I think if you're running with -debug=all or -debug=spammy_component, then you're already an advanced user and you should be able to manage you disk

10:15

<emzy> LarryRuane: I think that depends on logging settings.

10:15

<LarryRuane> yes, IIUC, these PRs are really to work around logging bugs

10:15

<LarryRuane> emzy: yes I meant with no extra debug logs turned on

10:16

<jnewbery> but if you just install and run bitcoind or bitcoin-qt and hit run, then we don't want a log to be able to fill your disk

10:16

<amiti> oh interesting, does the current logic in the PR apply to the logging categories?

10:16

<amiti> I was wondering about how categories were handled, net logs print a LOT

10:16

<jnewbery> I believe the logic applies to both, but the main motivation is for non-category logs

10:17

<amiti> yeah I saw the categories being counted & maintained (esp for the PR that locally supresses), but also looks like its bounded by the same rate across the board?

10:17

<jnewbery> I'm going to move on to the next question, but if you still have approach questions or comments, feel free to continue sharing them

10:17

<jnewbery> 2. Macros are generally discouraged in C++. Why do we use them for our logging routines?

10:17

<dergoegge> amiti: it only applies to non-category logs

10:17

<LarryRuane> jnewbery: yes, if you specifically enable categories, you're aware that logging could fill up the disk

10:17

<jnewbery> dergoegge: ah. Thanks!

10:17

<amiti> dergoegge: ohhh, ok

100

10:17

<LarryRuane> jnewbery: filenames and line numbers!

101

10:18

<FelixWeis> i believe macros are the only way to get file/fucntion/line number info ?

102

10:18

<dergoegge> see here: https://github.com/bitcoin/bitcoin/blob/855d05e4377cd1eb902cd74c80056d59cb98b7b8/src/logging.h#L220

103

10:18

<dergoegge> /* skip_disk_usage_rate_limiting */ true

104

10:18

<jnewbery> oh interesting. That wasn't going to be my answer!

105

10:18

<amiti> I found the PR history of this functionality interesting

106

10:18

<dergoegge> people using -debug are assumed to be advanced and know the risks

107

10:19

<amiti> looks like #17218 is what brought back a macro so we don't unnecessary evaluate expressions

108

10:19

<jnewbery> dergoegge: +1

109

10:19

<FelixWeis> should enabling debug= logs disable the rate limiting?

110

10:19

<jnewbery> FelixWeis: there's a separate config option to disable rate limiting

111

10:20

<LarryRuane> amiti: "so we don't unnecessary evaluate expressions" -- that makes sense, do we test that the logging expressions don't crash the system? I think possibly that the functional tests run with -debug

112

10:20

<michaelfolkson> dergoegge: But people using -debug are a more attractive target :) Their logs are already huge (assuming they aren't cleaning disk regularly)

113

10:20

<jnewbery> amiti: exactly. Good code archaeology! 17218 fixes a regression where logPrintf was changed to a function.

114

10:21

<jnewbery> oh sorry, I mean LogPrint(), not LogPrintf()

115

10:21

<LarryRuane> michaelfolkson: my impression is that people enable debug logs only for a limited time, to look into a specific problem

116

10:21

<michaelfolkson> LarryRuane: Right, sounds reasonable

117

10:22

<jnewbery> LarryRuane: The functional tests run with all debug logs *except* leveldb and libevent logs: https://github.com/bitcoin/bitcoin/blob/6bc1eca01b2f88e081e71b783b3d45287700f8a5/test/functional/test_framework/test_node.py#L101-L103

118

10:22

<dergoegge> LarryRuane: i would assume the same

119

10:23

<jnewbery> leveldb and libevent logs are very talkative!

120

10:23

<FelixWeis> Im using debug=mempool in a project to get ATMP and replacement timings. If I didn't read the changelog when upgrading I might miss out on valuable statistics.

121

10:23

<LarryRuane> jnewbery: that's wise, was going to suggest that if not already being done

122

10:24

<svav> So we use a macro instead of a function for conditional logging so arguments are not evaluated when logging for the category is not enabled ... ?

123

10:24

<jnewbery> svav: exactly right

124

10:24

<jnewbery> Macros are essentially find-and-replace commands run by the preprocessor, LogPrint(<category>, ...) will expand to:

125

10:24

<LarryRuane> I wonder if there's a way to have the compiler ensure that logging expressions don't have side-effects

126

10:25

<michaelfolkson> leveldb was the worst! I can't even imagine the logs being that useful assuming a leveldb bug but maybe they are in some edge case

127

10:25

<dergoegge> FelixWeis: category logs will never be dropped

128

10:25

<jnewbery> if (enabled(<category>)) log(...)

129

10:25

<FelixWeis> dergoegge: oh good

130

10:25

<jnewbery> so if the category is disabled, then we won't evaluate the arguements to the LogPrint() macro

131

10:26

<jnewbery> which might be nice if we had something like LogPrint(<category>, ReadThingFromDisk(), DoAnExpensiveCalculation(), ...)

132

10:26

<jnewbery> ok, moving on to the next question

133

10:26

<jnewbery> 3. Try enabling logging for all categories and starting Bitcoin Core (eg with bitcoind -debug=all). Is the output useful? Are there specific categories which generate a lot of logs?

134

10:27

<LarryRuane> it is useful, net generates a lot

135

10:27

<jnewbery> Did people manage to do this? How did you enjoy the bitcoind logs?

136

10:27

<michaelfolkson> A fun read

137

10:27

<LarryRuane> grep is your friend :)

138

10:27

<dergoegge> usefulness depends on what you are looking for i guess, but yea lots of net logs

139

10:28

<lightlike> in the functional tests, it's really helpful to understand what's going on

140

10:28

<michaelfolkson> The net logs looked useful to me

141

10:28

<jnewbery> good! There are lots of config options to make the logs more useful: -logips, -logtimestamps, -logthreadnames, -logsourcelocations and -logtimemicros

142

10:28

<LarryRuane> also libevent .. can someone give a quick explanation of what libevent does? (if that's too much side-track, that's ok)

143

10:29

<michaelfolkson> Though some of those net logs appear in the minimal log

144

10:29

<jnewbery> using -logthreadnames and looking at the validationinterface callbacks being enqueued and then the callbacks being called is a really good exercise to see how validation processing happens and triggers events in different subsystems

145

10:29

<jnewbery> LarryRuane: we use libevent for our RPC server. I think that's it currently.

146

10:30

<michaelfolkson> What does libevent do specifically though?

147

10:30

<jnewbery> I think there was talk of using some of its functionality for our p2p networking, but I don't really know the details of that

148

10:31

<jnewbery> lightlike: I agree. I'd suggest running some functional tests with -nocleanup, and then running the combine logs parser with -c to colorize the different nodes. Gives you a good view of what the different nodes in the test are doing.

149

10:32

<jnewbery> Next question: The two approaches are to rate-limit logs per-location and globally. What are the advantages and disadvantages of those two approaches? Which do you think would result in more useful log output?

150

10:33

<michaelfolkson> Advantages of global - better cut off logs, better log experience in the case of cut off. Disadvantages of global - we generally want to avoid global due to security, tight coupling

151

10:34

<emzy> I think this PR makes more sense than the global alternative, because global rate limiting would open another attack vector.

152

10:34

<emzy> An attacker could trigger the global limit in one category and conceal another attack from the log.

153

10:34

<svav> I don't know much about this, but I'd say if you'd do it locally, it might be better, because it won't affect the whole network

154

10:35

<LarryRuane> no, global in this context means across the entire bitcoind node (the local node)

155

10:35

<svav> Locally means only nodes affected by overstimulation of logging would have restrictions applied, so only nodes being impacted would be controlled

156

10:35

<lightlike> one advantage of global: less bookkeeping necessary over all the different locations that might log.

157

10:35

<michaelfolkson> So kind of user experience vs security conversation. Depends how non-trivial the security concern is

158

10:35

<jnewbery> svav: you're misunderstanding how I'm using 'local' and 'global'. 'local' means "only rate-limit the loggin from this line of code". 'global' means "rate-limit logs from across the whole program"

159

10:35

<LarryRuane> i was expecting the global PR (21706) to be simpler, but it seems about the same complexity

160

10:36

<svav> jnewbery: thanks for clarifying

161

10:37

<LarryRuane> overall, I'm surprised how complex both of these PRs are (over 200 lines of diff), given what they do .. but maybe that's just how it has to be

162

10:37

<jnewbery> LarryRuane: much of that is new tests

163

10:37

<LarryRuane> ah, good point

164

10:38

<jnewbery> dergoegge added the second (global) implementation based on a review comment I made in the original PR: https://github.com/bitcoin/bitcoin/pull/19995#pullrequestreview-505540213 (thanks dergoegge!)

165

10:39

<LarryRuane> maybe a simpler approach, that might be good enough, is to keep track of only the most recent logged message, and if it's repeated (soon), don't log it ... and after some timeout, log "last message repeated 200 times" or whatever

166

10:39

<LarryRuane> but then i guess if each log message includes some specific data, like a hash, that wouldn't help

167

10:39

<jnewbery> I'll explain why I suggested global log-shedding. If we only drop logs from a single location, we end up with a log file that looks normal, but some logs are missing. The only way to know that is to search for a very specific "start dropping logs" or "stop dropping logs" log.

168

10:40

<jnewbery> whereas if you drop logs globally, it's much more obvious that they're being dropped

169

10:40

<LarryRuane> that makes sense to me

170

10:41

<jnewbery> I think partial logging is quite dangerous. I've been tripped up many times debugging issues when the logs are incomplete and I've been sent in the wrong direction. For example, what would you conclude if there was the following code:

171

10:41

<jnewbery> log(1);

172

10:41

<jnewbery> if (condition) {

173

10:41

<jnewbery> do_thing();

174

10:41

<jnewbery> log(2);

175

10:41

<jnewbery> }

176

10:41

<jnewbery> log(3)

177

10:41

<jnewbery> and then you saw this log:

178

10:41

<jnewbery> cat log.txt

179

10:41

<jnewbery> > 1 3

180

10:41

<LarryRuane> !condition :)

181

10:42

<jnewbery> LarryRuane: right

182

10:42

<dergoegge> but if 2 was suppressed the logs would say so

183

10:42

<dergoegge> but you might have to search for it

184

10:42

<jnewbery> dergoegge: maybe much further up in the log file

185

10:42

<jnewbery> and if you weren't aware of that feature you wouldn't know to look

186

10:43

<dergoegge> thats true, could be very annoying

187

10:44

<jnewbery> I don't think it's a huge issue, but it's something I've tripped over before. Thanks to dergoegge for implementing both approaches

188

10:44

<jnewbery> Any other thoughts on global vs local, or shall we move on?

189

10:44

<dergoegge> i dont have a strong preference and am happy to maintain either

190

10:44

<michaelfolkson> So how to pick which one? Gut feel is that global just about edges it

191

10:45

<jnewbery> I think if no-one has a very strong preference, then the person implementing it gets to choose. I'm an approach ACK on either

192

10:46

<jnewbery> ok, next question. Both PRs add a new LogPrintfWithoutRateLimiting() macro. What is this used for? Why is it important?

193

10:47

<dergoegge> jnewbery: that sounds fair, i will pick one of the two in the upcoming days based on the reviews

194

10:48

<dergoegge> LogPrintfWithoutRateLimiting() is only used once in validation.cpp for "new best=<hash>" logs

195

10:48

<lightlike> to exclude messages from the rate limting that are written a lot during IBD and cannot be abused as an attack vector (because they require miner's work)

196

10:48

<dergoegge> lightlike: well said +1

197

10:49

<jnewbery> lightlike dergoegge: right!

198

10:49

<jnewbery> it'd be a shame to start rate-limiting a really important log during initial sync

199

10:50

<jnewbery> 6. Both PRs add new variables m_bytes_dropped and m_messages_dropped. How are these variables used? How would they be useful for a developer trying to debug an issue?

200

10:52

<dergoegge> they are used to keep track of how many messages and how many bytes were dropped. we can use this to decide if logging should be resumed or print a tally at the end of a suppression period

201

10:53

<jnewbery> dergoegge: you're right!

202

10:53

<dergoegge> only the non-global approach uses them to decide if logging should be resumed

203

10:53

<jnewbery> you've done a very thorough review :)

204

10:53

<dergoegge> :D

205

10:53

<jnewbery> ok, any final questions before we wrap up?

206

10:53

<prakash> Hi..I am new here. I have a question on global vs local. Is global dropping all Logprintf messages if the rate limit kicks in ? If so, would'nt an attacker be able to disable logging altogether before an actual attack?

207

10:55

<jnewbery> prakash: that's exactly what global log-shedding would do. However, I don't think attackers are generally worried about their victim's logs.

208

10:55

<jnewbery> but it's a good thought!

209

10:55

<LarryRuane> A general question came up for me during review of these PRs ... if we add a new function, is the convention to be before or after the first (maybe only) call point?

210

10:56

<emzy> jnewbery: in general I dissagree. Not sure if in Bitcoin it is the case.

211

10:56

<jnewbery> LarryRuane: I don't understand the question. What is before or after the call point?

212

10:56

<LarryRuane> is "bottom-up" (low level functions followed by higher-level), or "top-down" (higher level followed by lower level functions)?

213

10:57

<jnewbery> emzy: maybe you're right. I just can't think of a scenario where someone exploiting a weakness in Bitcoin Core would also need to suppress logs.

214

10:57

<emzy> jnewbery: Right. But there could be one.

215

10:57

<prakash> jnewbery: I was implying that the attacker can hide what he is doing. Yes. It may be unrealistic scenario. Thanks for clarifying :)

216

10:58

<jnewbery> emzy prakash: you may be right!

217

10:58

<lightlike> LarryRuane: do you mean the placement of the code for the function in the source code?

218

10:59

<prakash> but i was able to print repeated messages logged in about incorrect rpc password : "ThreadRPCServer incorrect password attempt from 127.0.0.1" ...so technically it would be easy to turn off logging.

219

10:59

<prakash> but this would be a targeted attack vector. not network wide attack.

220

10:59

<LarryRuane> yes ... this PR adds ComputeQuotaUsageStats(), and it's called from BCLog::Logger::LogPrintStr() ... so is it preferred to add the new function before or after the function that's calling it?

221

10:59

<michaelfolkson> prakash: Is there no limit on rpc password attempts?

222

11:00

<jnewbery> prakash: right. The rpc port shouldn't be open to the public internet, but it's a very good point.

223

11:00

<prakash> i dont see a limit in logging

224

11:00

<prakash> ohh. i see.