Span improvements (refactoring)

https://github.com/bitcoin/bitcoin/pull/18468

Host: sipa  -  PR author: sipa

The PR branch HEAD was 26acc8dd at the time of this review club meeting.

Note: The PR being discussed (#18468: Span Improvements) originally included commit Support conversion between Spans of compatible types that was merged as part of #18591: Add C++17 build to Travis. We will be discussing the changes in that commit as well, as they functionally belong together.

Notes

  • Today’s PR is the recently merged #18468: Span Improvements. It brings our Span type closer to the functionality of C++20’s proposed std::span and then demonstrates some of its uses by simplifying code that uses it in various places.

  • A Span can be thought of as a pair composed of a pointer to a data type together with a length, identifying a range of contiguous elements in memory. In many ways it acts like a vector, but it doesn’t own any data — which means no copying of the actual data is involved. It is an extremely lightweight object, but it does come with a cost: higher-level code is responsible for guaranteeing that the pointed-to data is still available.

  • std::span is a new data type that will likely be part of the upcoming C++20 standard (see the cppreference.com page on std::span). While Bitcoin Core won’t switch to that standard any time soon (we’re currently on C++11 and will be transitioning to C++17 over the next 2 releases), it is a remarkably useful and simple abstraction, which is why we have our own backported version that is compatible with C++11. Span isn’t quite as powerful as std::span, but today’s PR brings it a lot closer.

  • The gist of the changes is introducing implicit construction of Span objects from range-like objects (arrays, std::array, std::vector, prevector, std::string, etc.) and automatic conversion between Span of compatible member types. Implicit construction and conversion is a powerful but dangerous C++ feature that should be used cautiously. Most of the complexity in the code changes is in making sure these operations cannot be used in dangerous or unexpected ways.

  • Several other PRs have been merged and proposed that make use of Span. Looking over those may give an intuition for why this data type is so useful:

Questions

  1. Did you review the PR? Concept ACK, approach ACK, tested ACK, or NACK? (You’re always encouraged to put your PR review on GitHub, even after the PR has been merged.)

  2. Do you think Span is a useful abstraction? Can you think of more places where it could be used to simplify existing code?

  3. When reviewing the PR, did you compare with the proposed std::span interface? What differences did you notice?

  4. What condition is imposed on converting Span<T1> into a Span<T2>? Why is it useful to permit such conversion, and what are the risks in doing so unconditionally?

  5. Why is MakeSpan useful? Can’t it be replaced with just invoking the Span::Span constructor?

  6. What are some other examples of features from future C++ versions that have been backported as utility features in the Bitcoin Core codebase?

Meeting Log

  113:00 <jnewbery> #startmeeting
  213:00 <fjahr> hi
  313:00 <felixweis> hi
  413:00 <willcl_ark> hi
  513:00 <gimballock> hi
  613:00 <jnewbery> welcome to review club everyone! Feel free to say hi to let everyone know you're here.
  713:00 <hashbasher> 'ello
  813:00 <jnewbery> (or just lurk. lurkers welcome here too)
  913:00 <emzy> hi
 1013:00 <michaelfolkson> hi
 1113:01 <jnewbery> Today we're looking at PR 18468 (Span improvements), although I expect the discussion might touch on other aspects of Spans that aren't specifically in that PR.
 1213:01 <jnewbery> Notes and questions are in the usual place: https://bitcoincore.reviews/18468.html
 1313:01 <jkczyz> hi
 1413:01 <jnewbery> sipa has very kindly offered to host today. He also contributed all of our Span implementation in Bitcoin Core, so he should be able to answer at least some of our questions :)
 1513:01 <jnewbery> Over to you, sipa
 1613:01 <raj_> hi
 1713:01 <sipa> hi
 1813:02 <sipa> so, today is a bit of an unusual review club, as there is very little bitcoin-specific knowledge involved
 1913:02 <troygiorshev> hi
 2013:02 <nehan_> hi
 2113:02 <sipa> but that's well compensated by needing some nitty gritty details of C++ templates instead...
 2213:02 <sipa> so, who has had a chance to review the PR?
 2313:03 <raj_> y
 2413:03 <fjahr> y
 2513:03 <troygiorshev> n
 2613:03 <nehan_> n
 2713:03 <willcl_ark> y
 2813:03 <jkczyz> n - mostly lurking today but find the topic interesting
 2913:03 <emzy> y/n
 3013:03 <jnewbery> y
 3113:03 <felixweis> n still looking at the history of span
 3213:04 <sipa> perhaps some additional last-minute reference material: https://github.com/bitcoin/bitcoin/pull/19367
 3313:04 <sipa> inspired by things pointed out in the PR, as well as explaining some of its uses
 3413:05 <sipa> so the first real question: Do you think Span is a useful abstraction? Can you think of more places where it could be used to simplify existing code?
 3513:05 <amiti> I found 19367 very helpful. thanks for adding!
 3613:05 <fjahr> Definitely useful but I couldn't come up with good places to use it yet. Anything where we read a lot of data from disk I suspect.
 3713:06 <jnewbery> +1 19367 is very helpful
 3813:06 <nehan_> I'd like to understand when a span is better than a pointer or reference to the original datastructure
 3913:06 <michaelfolkson> Can we quickly first just confirm the motivation for Span? We are concerned with buffer overflows right? And including a size with a pointer ensures these don't occur?
 4013:06 <willcl_ark> I couldn't tell if Span was improving program performance, or just providing a friendlier interface for the programmer
 4113:07 <raj_> I am still tumbling over the usefulness part. Can someone explain how is it better than other collection types?
 4213:07 <jnewbery> willcl_ark: I had the same question. Is the motivation code clarity/safety or performance
 4313:07 <sipa> raj_: that one is easy to answer: a Span is not a collection!
 4413:07 <sipa> raj_: it is a reference to a range of elements, but it's cheap to copy/modify/pass around
 4513:08 <sipa> in a way, it's an alternative (in some cases) to always passing around a pointer/length pair, or a begin iterator/end iterator pair
 4613:08 <sipa> so when compared to that, i'd say it's a better interface, that's more readable and less prone to errors
 4713:08 <sipa> but as it's literally just encapsulating a pointer and a length, there is no performance difference to doing that instead
 4813:09 <jnewbery> right, it doesn't own the data, so it's up to the programmer to not shoot themself in the foot by doing something like using after free
 4913:09 <willcl_ark> ok, thanks
 5013:09 <nehan_> sipa: when is it important to have the length/end iterator?
 5113:09 <andrewtoth> so it is similar to a tuple in other languages?
 5213:09 <sipa> andrewtoth: no
 5313:09 <sipa> that'd be std::tuple
 5413:09 <nehan_> it reminds me a little of a slice in go
 5513:09 <andrewtoth> ahh thanks
 5613:09 <sipa> nehan_: i believe it is similar to that
 5713:10 <sipa> i've seen C++ codebases that had their own span like type named Slice
 5813:10 <nehan_> however it's super scary because updates to the original object might invalidate the span
 5913:10 <sipa> as a motivation, let me link to this: https://github.com/bitcoin/bitcoin/pull/19326
 6013:10 <jnewbery> it could also be called a view
 6113:11 <sipa> just scrolling through it, you can see dozens of places where we compute a begin and end iterator, and pass both into a function that hashes
 6213:11 <jnewbery> (in fact, I think the original c++ proposal was called array_view)
 6313:11 <raj_> +1 jnewbery view seems like a better name to express its nature. specially the scarry parts.
 6413:12 <sipa> there is a "const char" specific span like object too introduced in c++17, called string_view
 6513:12 <sipa> which is very similar, but has some string-like operations on it defined for convenience
 6613:13 <emzy> jnewbery: array_view builds a better image im my mind as slice or sapn. tnx.
 6713:13 <sipa> michaelfolkson: i wouldn't say that preventing buffer overdlows is the primary motivation... indirectly i expect that more readable code for working with ranges of objects reduces the chance of that, but i'd say readability is the goal
 6813:14 <willcl_ark> seems also (a little bit!) like a Python memoryview in that case, although not only for streams
 6913:14 <sipa> fjahr: unsure what disk access has to do with it, as spans always refer to ranges of elements *in memory*
 7013:15 <hashbasher> +1 nehan's question, when's it important to know the lenght/end iterator?
 7113:16 <emzy> span is very generic, for every object type?
 7213:16 <sipa> emzy: indeed
 7313:16 <michaelfolkson> sipa: Thanks. So is this use case of making the script interpreter independent from CScript possible without Span? Span just makes it easier due to more readable code?
 7413:16 <sipa> michaelfolkson: exactly
 7513:16 <michaelfolkson> Cool
 7613:16 <sipa> for example look at https://github.com/bitcoin/bitcoin/blob/master/src/script/interpreter.cpp#L1525L1558
 7713:17 <sipa> the script interpreter for witness scripts, it has to deal with a range of elements that are the actual stack, and then a final one that is the script
 7813:17 <willcl_ark> It certainly seems handy for the experienced programmer, but also seems to introduce some "hidden" pitfalls (which are described in #19367) that you might be more inclined to consider with a pointer/length pair.
 7913:17 <sipa> instead of code that does iterator arithmetic to keep track of everything
 8013:18 <willcl_ark> Does it improve performance at all avoiding computing begin/end iterators?
 8113:18 <fjahr> sipa: I was just thinking of at which points we might want to handle data efficiently and I thought it might be connect with data that gets saved to disk sometimes. But yes, it does not directly make the case to use a span.
 8213:18 <sipa> you build a span of all the elements, and then extract subspans
 8313:18 <sipa> hashbasher: say you have an array of 20 bytes, and want to hash them
 8413:18 <jnewbery> ok, here's an attempt at motivation. sipa, let me know if this is off:
 8513:19 <sipa> the typical way would be to call a hash function, which you pass a pointer to the beginning of the array, plus the length
 8613:19 <jnewbery> Often, we'll have data stored in continguous range-like contains (arrays, vactors, ...). There are functions that we want to call with that data, but those functions don't want to care about the specifics of the container, so currently they take a pointer and a size, which the caller needs to fish out of the container.
 8713:19 <jnewbery> With a Span, and implicit conversion from those containers to Span, the function can just take a Span, the caller can pass any of those range-like containers, and the conversion will take care of it for you.
 8813:19 <sipa> jnewbery: yeah, good to bring that up - it's a great use case but not the only one
 8913:19 <sipa> it's a way of writing functions that operate generically on whatever type of container your data is stored in
 9013:20 <sipa> and an anti-pattern that sometimes emerges is "oh, let's make the function only accept a vector, and if someone has it in some other form, they can construct a vector with a copy of the data" - which is obviously wasteful
 9113:21 <jnewbery> wasteful and wouldn't work if you wanted to do something non-const with the data
 9213:21 <sipa> with a span you'd write the function once, and it'll work when called with a vector, or an array, or an std::array, or a prevector, or a uint256, or a CPubKey (the latter two also being types that act as ranges bytes!)
 9313:22 <willcl_ark> OK that's definitely useful
 9413:22 <hashbasher> +1 makes much more sense to me now
 9513:22 <nehan_> sipa: why span instead of generics?
 9613:23 <sipa> nehan_: by generics you mean templates?
 9713:23 <nehan_> yes
 9813:23 <sipa> good question
 9913:23 <sipa> one is that it forces instantiation at compile time for every type you invoke it with
10013:24 <sipa> which may be worth it, but in many cases it's also not: if literally all your function needs is a range, there is no reason to instantiate it for vector and prevector and array and ...
10113:24 <sipa> another is that spans support range-like operations, like subspans
10213:24 <sipa> which is e.g. heavily relied on in the descriptor parsing code
10313:25 <sipa> which passes down smaller and smaller spans to the string to be parsed, as it gets dissected
10413:25 <jnewbery> the downside of instatiating for all types is marginally longer compile times and marginally larger binary size? Anything else?
10513:26 <sipa> you couldn't do that with templated functions - you'd still need to create copies any time you want to pass down a substring
10613:26 <sipa> jnewbery: yeah
10713:26 <sipa> also forcing everything to be in headers
10813:26 <jnewbery> ah yes
10913:26 <MarcoFalke> sipa: You could (in theory) put all template instantiations in the cpp file
11013:26 <MarcoFalke> But that requires knowing them upfront
11113:26 <sipa> MarcoFalke: yes, yuck :)
11213:27 <sipa> and as said, that doesn't cover all use cases
11313:28 <sipa> so perhaps it's reasonable to state it as: a more lightweight alternative to making functions templated over various input types, in case all they need is a range of elements... while simultaneously also supporting passing down sub-ranges
11413:28 <sipa> perhaps it's time for the next question?
11513:28 <sipa> When reviewing the PR, did you compare with the proposed std::span interface? What differences did you notice?
11613:29 <sipa> https://en.cppreference.com/w/cpp/container/span for reference
11713:29 <jnewbery> ashamed to say that I did not, even though that question was a pretty strong hint I probably should have
11813:29 <raj_> It doesn't have rbegin, rend and empty.
11913:29 <fjahr> Extent/size is not a template parameter and we don't have rbegin or rend
12013:29 <sipa> very good
12113:29 * sipa fishes for more
12213:30 <willcl_ark> based on some of the PR comments, it seems like Bitcoin::Span is a subset of std::span
12313:30 <sipa> indeed
12413:30 <sipa> let's have a look at the constructors
12513:30 <sipa> https://en.cppreference.com/w/cpp/container/span/span
12613:30 <sipa> in particular the first 2
12713:32 <sipa> they take iterators as arguments
12813:32 <jnewbery> we don't have a ctor from an iterator?
12913:32 <sipa> indeed
13013:32 <sipa> any guess why not?
13113:32 <jnewbery> Because it's hard to implement without std::address_of
13213:32 <sipa> as wumpus just discovered in https://github.com/bitcoin/bitcoin/pull/19373 it'd be pretty useful to have iterator-based constructors
13313:33 <jnewbery> according to a code comment I read somewhere :)
13413:33 <sipa> what would be the problem with just doing template<typename It> Span(It begin, It end) { return Span<...>(&*begin, &*end); }
13513:34 <instagibbs> guess: you're de-reffing out of bounds memory location in &*end?
13613:34 <sipa> instagibbs: bingo
13713:34 <sipa> end is likely an iterator to the end of an object, which may not be dereferencable
13813:35 <sipa> std::address_of is added in C++20 that lets you get a pointer to the element an iterator refers to, which deferencing it
13913:35 <MarcoFalke> Then, why wouldn't Span<...>(&*begin, end-begin) work ( assuming operator& is not overriden)?
14013:35 <sipa> MarcoFalke: what if begin is not dereferencable?
14113:36 <MarcoFalke> hm :)
14213:37 <sipa> you can start special casing the situation where begin==end, and then return a Span(nullptr, 0), and otherwise return Span(&*begin, end-begin)
14313:37 <sipa> but that leads to more pitfalls, as the span.data() that comes out would be wrong
14413:38 <sipa> anyone have more comments around this? or anything else we've been discussing?
14513:38 <willcl_ark> so the difficult case here is an empty array/vector?
14613:39 <sipa> willcl_ark: or a span at the end of one (e.g. if you have a vector<int> vec{1,2,3}, then span(vec.end(), vec.end()) is expected to be legal
14713:39 <willcl_ark> ah ok
14813:39 <raj_> why we are not doing rbegin and rend? or empty by checking if size=0?
14913:40 <michaelfolkson> I'm assuming if/when Core updates to C++ 20 these differences won't pose a problem? It will just be unusued functionality?
15013:40 <sipa> raj_: why not rbegin/rend? simply because we haven't encountered a use
15113:40 <troygiorshev> When would begin not be dereferenceable? naively that doesn't sound like a very useful iterator
15213:40 <sipa> troygiorshev: just gave an example Span<...>(vec.end(), vec.end()) is expected to be legal
15313:40 <sipa> and begin here is the vector's end
15413:41 <MarcoFalke> michaelfolkson: I'd presume when (and if) we upgrade to C++10, then span.h will get deleted and replaced by std::span
15513:41 <troygiorshev> sipa: ah thx
15613:41 <MarcoFalke> *C++20
15713:41 <hashbasher> sipa: can you list a few functional areas of bitcoin core that this change benefits?
15813:41 <sipa> michaelfolkson, MarcoFalke: exactly
15913:41 <michaelfolkson> MarcoFalke: And it would just slot in easily right?
16013:41 <sipa> hashbasher: i've given a list of PRs that make use of Span in the notes
16113:41 <sipa> hashbasher: probably best to look at those
16213:42 <MarcoFalke> michaelfolkson: jup, because Span claims to be a subset of std::span
16313:42 <hashbasher> oh ok, i saw that, didn't go through them
16413:42 <sipa> michaelfolkson: yes, it should pretty much be a drop in replacement
16513:42 <sipa> ok, next question
16613:42 <sipa> What condition is imposed on converting Span<T1> into a Span<T2>? Why is it useful to permit such conversion, and what are the risks in doing so unconditionally?
16713:42 <sipa> this is probably the most advanced C++ question in the series
16813:43 <sipa> but feel free to guess
16913:43 <jnewbery> std::is_convertible<T (*)[], C (*)[]
17013:43 <raj_> if the underlying types are convertible then we do it?
17113:43 <sipa> which means?
17213:43 <fjahr> if implicit conversion between the types is possible we can convert, not sure exactly what can go wrong
17313:43 <jnewbery> can you implicitly convert from a pointer to an array of type T to a pointer to an array of type C
17413:43 <sipa> not exactly
17513:43 <sipa> indeed!
17613:44 <sipa> from *an array* of T to *an array* of C
17713:44 <sipa> why does this matter? why not just permitting it when a pointer to T can be converted to a pointer to C?
17813:44 <jnewbery> important because that implicit conversion means that the objects are of the same size and so pointer arithmetic works as you'd expect
17913:44 <sipa> exactly
18013:44 <sipa> if you have a parent class A, and a child class B
18113:45 <sipa> then pointers to B can be converted to pointers to A
18213:45 <sipa> but A and B may not have the same size, so after converting, pointer arithmetic may end up in the middle of B's elements
18313:46 <sipa> but why do we want such conversion in the first place?
18413:46 <jnewbery> I spent a long time scratching my head over the template definitions of the constructors before seeing the comment a few lines below about "pointers to arrays" https://github.com/bitcoin/bitcoin/blob/205b87d2f6bd01285de50ba742e32e4ab1298b13/src/span.h#L53-L57
18513:46 <MarcoFalke> So array conversion implies implicit conversion?
18613:46 <sipa> MarcoFalke: yeah, this matches the criterion in std::span, which makes me confident there aren't edge cases left
18713:47 <sipa> if it's possible to treat an array of T as an array of C (e.g. if you have an array of ints, you can treat it as an array of const ints), then converting a Span of C to a Span of T is also possible
18813:47 <willcl_ark> well the code comment-given example of converting to const seems reasonable, but I'm not sure about converting to another child class
18913:48 <sipa> willcl_ark: less useful, but i believe it also permits adding a volatile qualifier
19013:49 <jnewbery> The commit log confused me: "This prevents constructing a Span<A> given two pointers into an array of B (where B is a subclass of A), at least without explicit cast to pointers to A." I think it is possible to construct Span<A> (as long as those pointers to arrays are implicitly convertible)
19113:49 <sipa> so if you have an array "B arr[5];"
19213:50 <sipa> then Span<A>(arr, arr+5) should not work
19313:50 <jnewbery> or is it literally just conversions between cv-qualifiers?
19413:50 <sipa> because if you have a function that accepts an array of A, you can't pass it an array of B
19513:50 <sipa> jnewbery: i haven't come across other useful examples, to be honest
19613:51 <MarcoFalke> Would "A arr[5]; Span<B>(arr, arr+5);" compile?
19713:51 <sipa> MarcoFalke: no
19813:51 <sipa> but B arr[5]; Span<A>((A*)arr, (A*)(arr+5)) will
19913:51 <jnewbery> ah. "Given two pointers into an array". I think I get it now
20013:52 <sipa> if you do the pointer casting yourself you bypass the protection (and get garbage)
20113:52 <MarcoFalke> oh. I guess it might be useful when we want to (e.g.) pass a span of validation interfaces
20213:52 <sipa> this is an interesting question, why std::span uses 'convertible array' as criterion instead of 'add cv qualifiers'. i don't know the answer
20313:53 <sipa> MarcoFalke: i'm not sure that'd work
20413:53 <MarcoFalke> Span<CValidationInterface>{(CValidationInterface*) wallet_array, (CValidationInterface*) (wallet_array+5)};
20513:53 <fjahr> sorry, was the answer to the risks that the elements could have a different size or was there something else? i think i misunderstood the question that there was something else
20613:53 <sipa> fjahr: indeed
20713:53 <fjahr> tyt
20813:53 <fjahr> ty
20913:54 <sipa> Why is MakeSpan useful? Can’t it be replaced with just invoking the Span::Span constructor?
21013:55 <sipa> (and as we're running out of time, a hint: why does C++20 not have an equivalent std::make_span)
21113:55 <sipa> i guess that's more an extra question than a hint, ignore me!
21213:56 <petterhs> 0,
21313:56 <michaelfolkson> It constructs a Span of the right type automatically
21413:56 <sipa> yep
21513:56 <MarcoFalke> but why can't the constructor do that?
21613:56 <willcl_ark> does ::Span not do that?
21713:57 <sipa> not in C++11
21813:57 <michaelfolkson> No idea why not in C++20. It isn't, I checked
21913:57 <sipa> you cannot construct a data type without specifying its template parameters
22013:57 <sipa> if you try Span(std::vector<int>{1,2,3}) it will complain about missing template
22113:58 <sipa> however, since C++17, there exists something called deduction guides
22213:58 <sipa> which are rules that a type can specify to figure out the template argument based on a constructor's arguments
22313:59 <jnewbery> https://en.cppreference.com/w/cpp/language/class_template_argument_deduction
22413:59 <sipa> so std::span, which is in C++20, uses those to avoid the need for a make_span
22513:59 <sipa> and there you can just replace all uses of MakeSpan with calling the constructor directly
22613:59 <sipa> without specifying the template argument
22713:59 <sipa> jnewbery: indeed
22813:59 <MarcoFalke> so on master the verbose version Span<int>(std::vector<int>{1,2,3}) would compile?
22913:59 <sipa> MarcoFalke: indeed
23014:00 <sipa> MakeSpan is just convenience to avoid needing to specify the <int>
23114:00 <sipa> ok, any very last minute comments/insights/questions?
23214:00 <willcl_ark> so it will make the scripted diff when we use std::span smaller :)
23314:01 <sipa> yeah
23414:01 <jnewbery> that was really informative. I learned a lot
23514:01 <sipa> nothing?
23614:01 <sipa> ok!
23714:02 <sipa> thank you all for attending
23814:02 <jnewbery> Thanks for hosting, sipa!
23914:02 <fjahr> sipa: thanks!
24014:02 <troygiorshev> thx sipa!
24114:02 <raj_> thanks sipa. Lot to digest..
24214:02 <felixweis> thanks sipa!
24314:02 <andrewtoth> thanks sipa!
24414:02 <michaelfolkson> Yeah thanks sipa, that was great
24514:02 <emzy> Thanks sipa!
24614:02 <MarcoFalke> thx. TIL about CTAD
24714:02 <gimballock> (y) thanks
24814:03 <willcl_ark> thanks Sipa!
24914:03 <jnewbery> #endmeeting
25014:03 <instagibbs> thanks sipa!