#15741 Batch write imported stuff in importmulti (wallet)

https://github.com/bitcoin/bitcoin/15741

Host: ariard

Meeting log

10:58 < jnewbery> Reminder that this week's meeting starts in two hours. We'll be looking at #15741
10:58 < jnewbery> https://bitcoin-core-review-club.github.io/
10:59 < jnewbery> ariard is going to host the meeting this week
11:11 < hrofu> first PR review for me, should be fun
12:07 < jnewbery> hrofu: welcome!
12:33 < jonatack_> anyone want to share useful bash one-liners for reviewing?
12:33 < jonatack_> here's one i use for compiling, though it could be smarter:
12:34 < jonatack_> ./autogen.sh ; export BDB_PREFIX='<path-to-/db4>' ; ./configure BDB_LIBS="-L${BDB_PREFIX}/lib -ldb_cxx-4.8" BDB_CFLAGS="-I${BDB_PREFIX}/include" --enable-lcov --enable-gprof ; compiledb make -j"$(($(nproc)+1))"
12:35 < jonatack_> set as a bash alias
12:35 < jonatack_> for linux
12:35 < pinheadmz> in general, when reviewing a new PR, what are some of your strategies? read first? download first? I notice this PR doesnt have tests. I usually liek to start by reading tests because they are just simpler to understadn
12:37 < jonatack_> i git pull, open the pr branch, launch the compilation, and while that's running open it up and read it in local dev
12:43 < jonatack_> if anyone is looking for a recent PR to review that's not too long, there's https://github.com/bitcoin/bitcoin/pull/15996 that could use review
12:44 < jb55> looks like #15741 was merged...
12:45 < jonatack_> there has been progress on the last two PR review club PRs too...
12:45 < jonatack_> https://github.com/bitcoin/bitcoin/pull/15450 is active again
12:46 < jonatack_> and https://github.com/bitcoin/bitcoin/pull/15834
12:54 < jonatack_> pinheadmz: harding posted a good writeup on his review process during the first meeting https://gist.github.com/harding/168f82e7986a1befb0e785957fb600dd
12:55 < pinheadmz> jonatack_: tnx!
12:55 < jonatack_> pinheadmz: and my messy WIP beginner study notes are here: https://github.com/jonatack/bitcoin-development/blob/master/how-to-review-bitcoin-core-prs.txt
12:56 < jnewbery> let's discuss review/test techniques during the meeting, so people don't miss out
12:56 < pinheadmz> ah great resource too, thanks again
12:56 < jonatack_> jnewbery: sure. didn't want to derail.
12:56 < jnewbery> it's definitely on-topic!
12:57 < jnewbery> we can split the meeting today - spend some time on the PR and some on more general review techniques
12:57 < jonatack_> nice
12:57 < jnewbery> A couple of you have asked for more general advice
12:58 < ariard> yes, would be great to have more feedback on review techniques!
12:58 < sosthene> Hi all
13:00 < jnewbery> hi!
13:00 < kanzure> hi
13:00 < lightlike> hello
13:00 < ariard> hi!
13:00 < b10c> Hi!
13:01 < merehap> Hi!
13:01 < fanquake> hi
13:01 < amiti> hi
13:01 < jonatack_> hi!
13:01 < achow101> heyo
13:01 < ariard> jnewbery: starting?
13:01 < dmkathayat_> Hi!
13:01  * jb55 waves
13:01 < michaelfolkson> Hey everyone
13:02 < jnewbery> ok, we'll talk about a couple of items today. One is PR #15741, led by ariard. I see we have the PR author here. Then we'll cover general review tips and techniques
13:02 < handcart> https://github.com/bitcoin/bitcoin/pull/15741
13:02 < jnewbery> let's start with the PR. You've got the floor, ariard!
13:02 < ariard> ok so 15741 was about a performance improvement on importmulti in case of many items
13:03 < ariard> on the process, I tried first to identify which kind of PR it was
13:03 < ariard> doc, code style, buf fix, new feature, test addition
13:03 < handcart> https://bitcoincore.org/en/doc/0.16.0/rpc/wallet/importmulti/
13:04 < ariard> because IMO knowing this fact is going to guide how you read commits first time, how much time you will need for review and which kind of tests needed
13:04 < ariard> then I asked to myself if scope was well-defined (because sometimes it's a refactor but in fact cover a wallet change)
13:05 < ariard> after that I read each commit a first time (just to have a general idea of how changes related to each other)
13:05 < ariard> read again each commit, check if function match their comment description, if there is new data structure, how to reduce complexity of them, ...
13:06 < ariard> if it's user facing commit, have an idea and how to test it manually
13:06 < ariard> then on the PR content, the problem solved there is how to optimize disk access while importing item
13:07 < ariard> Basically, if I get it well, core wallet has in-memory berkeley database, and before to PR, it was dumping them on-disk when object reference was out-of-scope
13:08 < ariard> IMO, I think it wasn't that much a problem before introduction of ranged descriptors where each of them, once expanded is gonna be a new item
13:08 < ariard> Thanks to PR, on-disk writes are now batched every 1000 database write
13:09 < handcart> "importing 2x 1000 keys seems about 6 times faster."
13:09 < ariard> Wallet main class are WalletBatch (an abstract wrapper), BerkeleyBatch and BerkeleyDatabase on top of Berkekeley Database API
13:10 < jb55> my 10000 key import went from 8 mintues to 3 seconds xD
13:10 < jnewbery> description of the wallet database classes is here: https://github.com/bitcoin/bitcoin/blob/c7cfd20a77ce57d200b3b9e5e0dfb0d63818abdc/src/wallet/walletdb.h#L21
13:10 < achow101> it was a problem if you wanted to import several thousand individual items with importmulti. just that making that command is annoying so no one actually tried. with ranged descriptors, importing such a large number is very easy
13:10 < ariard> Main modifications where in method WriteIC of WalletBatch to Flush every 1000
13:11 < ariard> I agree, and that why I only tested with ranged descriptors, having a thousand individual items if there is that much users with this number of items
13:11 < ariard> (maybe large exchanges)
13:12 < lightlike> do you know a good resource to learn more about ranged descriptors? I didn't find anything good by a quick google search.
13:12 < ariard> what else ? some memory exhaustion where removed from UpgradeKeyMetadata as there is no more a risk to have too large data structure in memory
13:12 < jnewbery> In my experience, exchanges don't use the Bitcoin core wallet
13:12 < jb55> this is a big deal now because you can dump output descriptors with HWI to track your hw wallet key balances without third party vendors. before it was too annoying because importmulti was so slow
13:12 < jb55> https://github.com/bitcoin/bitcoin/blob/master/doc/descriptors.md
13:12 < jnewbery> lightlike: https://github.com/bitcoin/bitcoin/blob/c7cfd20a77ce57d200b3b9e5e0dfb0d63818abdc/doc/descriptors.md
13:12 < jnewbery> thanks jb55. Beat me to it!
13:12 < jb55> sorry :D
13:13 < lightlike> jb55, jnewbery: thanks!
13:13 < ariard> given that PR was already reviewed by a lot of people, I only manually test importmulti with ranged descriptors on my laptop to see performance diff between PR and master
13:13 < achow101> ariard: The UpgradeKeyMetadata change was to unify the usage of batching. there shouldn't be any memory exhaustion issues there
13:13 < ariard> and that kinda of all I ave to say!
13:13 < jnewbery> thanks for the summary ariard
13:14 < jnewbery> I had a question. How should we test this? Is there any way to add automated tests?
13:14 < ariard> achow101: ah and the comment was something like "avoid creating overlarge in-memory batches" what is the issue with having too large in-memory batches ?
13:15 < achow101> ariard: that was the 1000 item limit which was kept.
13:16 < achow101> jnewbery: since this is a performance change, I don't think there really is a way to add automated tests. there isn't necessarily a "correct" or "incorrect" to test for
13:16 < jb55> yeah I was about to type the same thing, more of a benchmark suite thing right?
13:16 < ariard> achow101: okay do you try with a different cap like 100 or 500 ? (thinking that's easy to test too I should have)
13:16 < ariard> *did
13:17 < michaelfolkson> Also some basic questions. Why would you import so many pubkeys? Can't you just import a HD seed?
13:17 < achow101> ariard: no, but I did try larger numbers. I think it hit the point of diminishing returns very quickly. I tried 10000 and there was basically no difference
13:18 < jnewbery> achow101: yes, I agree it's difficult to automate any testing for performance
13:18 < achow101> michaelfolkson: you can't import xpubs and BIP 39 seeds to Bitcoin Core. importing a seed is also not possible for watch only wallets
13:18 < achow101> this is effectively importing an xpub
13:19 < jnewbery> We do have benchmarks, but they're more microbenchmarks (eg how fast does this sha function run?)
13:19 < jnewbery> There are a few notes on profiling bitcoind in https://github.com/bitcoin/bitcoin/blob/c7cfd20a77ce57d200b3b9e5e0dfb0d63818abdc/test/README.md and https://github.com/bitcoin/bitcoin/blob/c7cfd20a77ce57d200b3b9e5e0dfb0d63818abdc/test/functional/README.md
13:20 < jnewbery> so if you want to test performance changes manually, you can use perf and get a profile of where cpu time is being spent
13:21 < achow101> the way I tested is listed at the top of the PR. I used the time command and just timed the same importmulti command on the master branch and on the pr branch
13:23 < jonatack_> achow101: wdyt, would testing that importmulti import writes are happening in batches rather than atomically be feasible and useful?
13:23 < jnewbery> That's a really good habit to get into when writing PRs: giving reviewers tips on how to review and test the changes
13:24 < jonatack_> yes, "how to review this" is super helpful
13:24 < sosthene> I'm still not sure about the use case for importmulti sorry, it is like importing an xpub to make a watch-only wallet? Are there other situations we need to import so many keys?
13:24 < lightlike> I was wondering whether there is a way to test that there was no functional impact of the PR, and it is just faster. Naively, like comparing wallets created by the import before and after.
13:24 < achow101> jonatack_: I haven't the faintest idea how you would even check for that
13:24 < michaelfolkson> Cool, thanks <achow101>. And before batching pubkeys was introduced, they were written to the database individually. That means each one was flushed to disk individually? So next to no memory requirements and the database being closed and reopened each for each write. Have I got it right?
13:25 < achow101> michaelfolkson: yeah, pretty mush. it's also a lot more than just pubkeys. Scripts, metadata, etc.
13:26 < achow101> lightlike: test/functional/wallet_importmulti.py are the tests for importmulti itself and they're pretty comprehensive.
13:27 < achow101> sosthene: yes, it is largely used for importing things for watch only wallets. for me, I'm using it with the HWI tool in order to use my hardware wallet with Bitcoin Core
13:27 < jb55> ^
13:28 < jnewbery> lightlike: the wallet files are bdb, so if you want to compare contents, you'll need a tool to open them
13:28 < jb55> so you can imagine now with a gui tool + HWI users could use trezor directly with their full node. last thing we need is hw wallet PSBT support for spending
13:29 < jnewbery> I have https://github.com/jnewbery/bitcointools which can crack open a wallet.dat file. It's probably not complete (PRs welcome!)
13:29 < jb55> something like: hey core I want to spend this watch-only output, give me a PSBT. I then give that PSBT to my hw wallet for signing.
13:29 < lightlike> jnewbery: yes i know, i tried the dumpwallet rpc but that didnt really work well.
13:29 < jnewbery> so you could run wallet_importmulti.py --nocleanup before the changes, copy the wallet.dat files, then do the same after the changes and compare the wallet.dat files
13:30 < achow101> my goto has been using the db_dump utility provided by bdb and just dumping and then diff'ing the output of the wallet files. probably not the best method though...
13:30 < jnewbery> dumpwallet might be easier - you could add a dumpwallet call to the end of the test
13:30 < jnewbery> achow101: is ordering static with db_dump?
13:30 < jnewbery> (I've never used it)
13:31 < achow101> dunno. the order might have been an artifact of using the same import command though
13:32 < michaelfolkson> It was commented (Sjors) that this moves complexity out of the RPC codebase. Why does this PR simplify the RPC codebase?
13:32 < amiti> I have a question about what happened with this conversation thread: https://github.com/bitcoin/bitcoin/pull/15741#discussion_r272755099
13:32 < michaelfolkson> Thanks for those Bitcoin tools <jnewbery> I wasn't aware of that.
13:32 < amiti> Was a specific reason to force the write identified? It seems like the merged code didn’t include it. What was the final thought process here?
13:33 < achow101> michaelfolkson: part of the pr was moving a bunch of the import stuff from importmulti's handler into the CWallet class. so imports are handled by the wallet and not the rpc. the batching is also handled by WalletBatch instead of by the caller (e.g. rpc) as was done before
13:35 < jnewbery> Generally speaking we want to the RPC code to be as thin a layer as possible. bitcoind has other interfaces (GUI, REST, ZMQ), so the less code is in the interface modules, the more can be reused
13:36 < jonatack_> achow101 wrote "As soon as a batch goes out of scope, every single update is committed to the file. If you make 1500 updates, the first 1000 will be written after you do the 1000th write operation, and then the last 500 will be written when the batch object is deleted."
13:36 < jonatack_> maybe something there could be tested idk
13:36 < jonatack_> (not to be critical, this is a great improvement)
13:37 < achow101> amiti: in the line after, SetWalletFlag is called which itself does a write to the database. however we only want to do that after we are sure that all of the keymetadata has been upgraded so we force a write to ensure that the updated metadata is written before setting that flag
13:39 < michaelfolkson> <jonatack_> You mean write a test to ensure that items are actually flushed to disk after 1000 writes?
13:41 < jonatack_> TBH i didn't review this PR properly so nvm if it's unhelpful
13:42 < michaelfolkson> I struggle to draw the line between writing a test where it is obvious from the code that it does what it should and deciding commented code is sufficient with no test
13:43 < jnewbery> achow101: > in the line after, SetWalletFlag is called...
13:44 < jnewbery> what line? Can you give a link? I'm struggling to follow (since the original thread was on a commit that doesn't exist anymore)
13:44 < achow101> jnewbery: https://github.com/bitcoin/bitcoin/blob/master/src/wallet/wallet.cpp#L398
13:44 < jnewbery> thanks!
13:45 < achow101> michaelfolkson: generally, if it is testable, then there should be a test. even if the code is obviously correct, that could change or something else could change which cauess functionality to break
13:45 < jnewbery> ok, I expect there are more questions, but to make sure we don't overrun, let's move on to more general tips for test/review.
13:45 < achow101> and comments go out of date which can be a problem
13:45 < jnewbery> Thanks for presenting, ariard, and thanks for being here to answer questions, achow101!
13:46 < jonatack_> * clapping *
13:46 < achow101> np
13:46 < michaelfolkson> Cool, thanks guys
13:46 < sosthene> thanks!
13:46 < jnewbery> If you have more questions about this or the wallet in general, feel free to ask in #bitcoin-core-dev
13:46 < jnewbery> *about this PR or the wallet in general
13:47 < jnewbery> ok, onto general tips/techniques. I can give an overview of my workflow, or we can go straight into questions
13:47 < jnewbery> whichever you all prefer
13:47 < jb55> sure let's hear your workflow
13:47 < michaelfolkson> I vote overview
13:47 < sosthene> agree
13:47 < jonatack_> yes
13:48 < ariard> yes overview
13:48 < jnewbery> ok, well first off, I'll always download the PR branch to my machine so I can build and review locally. I don't use the github webpage to review, just to leave comments
13:49 < jnewbery> I've got a short script that checks out the PR branch and queries the github API to add a comment to that branch locally
13:49 < jnewbery> that just makes it easier when I have a bunch of PRs checked out locally that I can run a `git branch` command and see what they are
13:50 < jnewbery> once I have the branch locally, I'll set off a build in a VM while I look through the changes
13:50 < jnewbery> first, I run something like `git log --oneline upstream/master..`
13:50 < jnewbery> that gives me a list of all the commits in the PR branch, one per line
13:51 < jnewbery> and then I use a one-liner:
13:51 < jnewbery> for commit in `git log master..HEAD --oneline | cut -d' ' -f1 | tac`; do git log -1 $commit; git difftool ${commit}{^,} --dir-diff; done
13:51 < jnewbery> which I have saved as git-review
13:52 < jnewbery> that steps through the commits one-by-one, printing the commit log to the console then opening my difftool program
13:52 < jnewbery> I'll look at the diff, and when I quit the difftool program, git-review will step forward to the next commit
13:52 < jnewbery> First run through, I'll just skim everything, reading the commit logs and looking at the overall changes, so I get an idea of what the PR is doing
13:53 < jnewbery> Then I'll go through again, but look at each commit in more detail, reviewing every line in detail
13:53 < jnewbery> hmm, what else?
13:54 < jnewbery> I make extensive use of the functional tests by adding `import pdb; pdb.set_trace()` break points, and then manually running RPC commands on the nodes under test
13:54 < jb55> jnewbery: do you pull the PR description from the api or just use the website?
13:54 < jnewbery> I pull the PR description from the github API and then use it to label the branch
13:54 < jb55> label in what sense?
13:54 < jnewbery> Here's what my `git branch` output looks like:
13:54 < jnewbery> → gb
13:55 < jnewbery> master                                fe47ae168 upstream/master                                                                          -- 
13:55 < jnewbery>  pr10102                               3440513a4                                                                                          -- [ryanofsky] [experimental] Multiprocess bitcoin - https://github.com/bitcoin/bitcoin/pull/10102
13:55 < jnewbery>  pr10823                               03fa5a1b4                                                                                          -- [greenaddress] Allow all mempool txs to be replaced after a configurable timeout (default 6h) - https://github.com/bitcoin/bitcoin/pull/10823
13:55 < jnewbery>  pr12360                               ab740a047                                                                                          -- [jnewbery] Bury bip9 deployments - https://github.com/bitcoin/bitcoin/pull/12360
13:55 < jnewbery>  pr13756                               66f3e9780                                                                                          -- [kallewoof] wallet: "avoid_reuse" wallet flag for improved privacy - https://github.com/bitcoin/bitcoin/pull/13756
13:55 < jnewbery> ...
13:55 < jnewbery> There's an attribute of a git branch called `description` that you can update manually
13:55 < jb55> whoa
13:55 < jb55> I did not know that
13:56 < fanquake> jnewbery what OS VM are you building in? using depends?
13:56 < jb55> I personally use https://gist.github.com/piscisaureus/3342247 + git checkout -b pr${PR} refs/pull/origin/${PR}
13:56 < fanquake> Running make check and the functional (--extended?) test after compiling?
13:56 < jnewbery> my `gb` doesn't map onto `git branch` exactly. I do some formatting on the output so it includes the description
13:57 < jnewbery> fanquake: I'm using vagrant, just the standard ubuntu bionic image
13:57 < jnewbery> I have a vagrant file that installs all the dependencies. I can share that if people want it
13:58 < michaelfolkson> Yes please
13:58 < jnewbery> Yes, I'll run make check and the functional tests. Not usually extended because I'm too impatient
13:59 < michaelfolkson> You're running tests using Travis
13:59 < michaelfolkson> ?
13:59 < jnewbery> I, like most developers, am pretty embarrassed about my workflow. It's just a bunch of stuff that's built up over the years and I'm sure there's plenty of space for optimizations
13:59 < jnewbery> Yes, I have travis set up on jnewbery/bitcoin, so whenever I push a branch to my github repo it'll get built in travis
14:00 < jnewbery> Although there's not much reason to look at that for reviewing, since the branch is already built in bitcoin/bitcoin travis
14:00 < jnewbery> Here's my vagrant config: https://github.com/jnewbery/btc-dev YMMV
14:01 < jnewbery> That's time. Sorry for the monologue!
14:01 < jnewbery> We can do more general techniques next week
14:01 < michaelfolkson> That was great, thank you. I'll look over this and may have some questions next week
14:01 < jonatack__> Thank you!
14:01 < jnewbery> I'm still looking for suggestions for future PRs to cover and volunteers to lead discussion
14:02 < merehap> Thanks!
14:02 < jonatack__> ariard: great job
14:02 < jnewbery> Thanks everyone, and thanks again ariard and achow101!
14:02 < sosthene> Thanks, I need to digest all the links sent today!
14:02 < b10c> Thanks!
14:02 < lightlike> thanks!
14:02 < jonatack__> achow101: thank you for coming
14:02 < michaelfolkson> <sosthene> Me too lol
14:03 < ariard> you're welcome, thanks for workflow jnewbery, a assumeutxo PR for a coming week ?
14:08 < jnewbery> ariard: the next assumeutxo PR is https://github.com/bitcoin/bitcoin/pull/15976 . It's a refactor but it touches consensus/validation code, so it needs careful review
14:09 < jnewbery> There's another good resoure on profiling here: https://github.com/bitcoin/bitcoin/pull/12649/files#diff-d47c2f6882badae58f3881a6ce0a430cR89
14:59 < jnewbery> I've put my git-pr and git-br tools up at https://gist.github.com/jnewbery/1d45a6b5e14b3f3fefe5942d0cc2608d. YMMV, no representations or warranties, expressed or implied, etc, etc