It’s been a little while, hasn’t it?
Anyway, this post is about C++ and how nothing ever goes wrong with it, ever. I
was building a map with a pair of strings as the key and a complex (and
confidential) value. For the purposes of this post, I’ll say the value is an
int
; the exact value type is irrelevant to this post, anyway.
One other caveat: this is performance-sensitive code, so I don’t want to make
unnecessary copies. The underlying key pieces will outlive the map, so I want
the keys to be std::string_view
, rather than std::string
. So my types look
like this:
struct ConfigItem {
std::string first_key;
std::string second_key;
// In the real project, this is a big value.
int value;
};
// Outlives the map.
std::vector<ConfigItem> items;
absl::flat_hash_map<std::pair<std::string_view, std::string_view>, const ConfigItem*> config_map;
I could make the key some named struct
, but since I was using Abseil’s
absl::flat_hash_map
, I would have to implement absl::Hash
for the type, and
I didn’t want to deal with all of that.
The problem arises when populating config_map
from items
. I thought the easy
thing to do was:
for (const auto& item : items) {
config_map[std::make_pair(item.first_key, item.second_key)] = &item;
}
Seems simple enough, right? Loop through items
, create a key, and the values
are pointers to the item. As long as items
lives longer than config_map
,
there won’t be any dangling references, right? Right?
Try to figure out what will go wrong if we iterate over the map like this:
for (const auto& [key, value] : config_map) {
// absl::Substitute doesn't seem to work in godbolt with Clang.
std::cout << '[' << key.first << ", " << key.second << "] = ("
<< value->first_key << ", " << value->second_key << ", "
<< value->value << ")" << std::endl;
}
If you missed it, you’re not alone.
Running it
If you run this, you’ll get an output like:
[@����04289383, ��46930886] = (config1_first_1804289383, config1_second_846930886, 1)
[config2_first, config2_second] = (config2_first, config2_second, 2)
(the order might be different, since the iteration order of
absl::flat_hash_map
can change).
There are a few weird things here:
- If the
std::string
fields are created directly from a literal, they seem to work just fine. - If the strings are allocated on the heap, then they get garbled.
- In either case, the string values in the
ConfigItem
struct seem to be fine.
The culprit
The issue is in the call to std::make_pair
. If you squint really hard at the
compiler output, you can see this:
...
call std::pair<std::__strip_reference_wrapper<std::decay<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::type>::__type, std::__strip_reference_wrapper<std::decay<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>::type>::__type> std::make_pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
jmp .LBB1_13
.LBB1_13: # in Loop: Header=BB1_11 Depth=1
lea rdi, [rbp - 472]
lea rsi, [rbp - 536]
call std::pair<std::basic_string_view<char, std::char_traits<char> >, std::basic_string_view<char, std::char_traits<char> > >::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, true>(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&&)
jmp .LBB1_14
...
The important thing to notice is the two calls, one to
std::make_pair<std::string, std::string>()
, and the other to the constructor
for std::pair<std::string_view, std::string_view>
. There’s a bunch of C++
template shenanigans clouding this up, but the gist of it is that it is
creating a std::pair<std::string, std::string>
(which creates a copy of the
strings) and then immediately makes a std::pair<std::string_view,
std::string_view>
out of that and inserts it into the map. The problem is
that the std::string_view
elements of the pair being inserted into the map point
at the temporary std::string
objects created by std::make_pair
, which are
immediately destroyed. So you have std::string_view
objects pointing at
invalid memory.
Breaking it down a bit more (and simplifying some of the compiler output as
best I can), the map-populating loop has (stripping the std::
prefixes) a
call
instruction:
call pair<string, string> make_pair<string const&, string const&>(string const&, string const&)
This is the call to std::make_pair
. Notice that it returns a
std::pair<std::string, std::string>
. Then, there is
a call
to
call pair<string_view, string_view>::pair<string, string, true>(pair<string,string>&&)
That’s the std::pair
constructor which takes a pair<string, string>
and
makes a pair<string_view, string_view>
. Then, it calls a big pile of
template nonsense to put the key and value into the map. I’m not going to try
to simplify the template parameters for this one:
call decltype (absl::container_internal::FlatHashMapPolicy<std::pair<std::basic_string_view...
There’s a bunch of indecipherable template shenanigans which I assume boils
down to “call the function which puts things in the map.” There’s an
operator[]
and an absl::hash_internal::Hash
in there, so that’s a safe
bet.
After that, there’s the problem call:
call pair<string, string>::~pair() [base object destructor]
That’s where we destroy the temporary pair we created in std::make_pair
.
Since the temporary std::pair
contains std::string
instances, their memory
is freed, so the std::string_view
objects now point to invalid memory.
Why does only one string_view
freak out?
But the keys config2_first
and config2_second
appear just fine. What
gives?
I believe it is due to the way they’re constructed. The strings for config1
care created with constructor #7, which takes a const basic_string& other
.
config2
strings use constructor #5, which takes const CharT* s
(in this
case, CharT
is just char
). The documentation says that constructor #5
makes a copy of the data, but it seems like somehow the std::string_view
ends up pointing at the string literal when we get around to printing, so the
key still prints something meaningful. This is where my C++ knowledge fails
me.
Fixing it
So how do we fix it? Easy: force std::make_pair
to create a
pair<string_view, string_view>
in the first place:
for (const auto& item : items) {
config_map[std::make_pair<std::string_view, std::string_view>(
item.first_key, item.second_key)] = &item;
}
That results in the assembly:
call string::operator string_view() const@PLT
... asm stuff
call string::operator string_view() const@PLT
... movs and leas
call pair<string_view, basic_string_view> make_pair<string_view, string_view>(string_view&&, string_view&&)
... assembly stuff
call decltype (absl::container_internal::FlatHashMapPolicy<std::pair<std::basic_string_view<...
In other words, it creates a string_view
, then creates another
string_view
, then calls std::make_pair
with std::string_view
arguments,
then puts an entry into the map.
When we run it, the keys for config1
are no longer garbled.
Comparison to Rust
Okay, okay, “rewriting it in Rust” is a meme, but Rust is supposed to prevent exactly this sort of thing. Here’s a quick-and-dirty attempt at rewriting it in Rust:
use std::collections::HashMap;
extern crate rand;
struct ConfigItem {
first_key: String,
second_key: String,
value: i32,
}
pub fn main() {
let config1_first = rand::random::<u8>();
let config1_second = rand::random::<u8>();
let items = vec![
ConfigItem {
first_key: format!("config1_first_{}", config1_first),
second_key: format!("config1_second_{}", config1_second),
value: 1,
},
ConfigItem {
first_key: "config2_first".to_string(),
second_key: "config2_second".to_string(),
value: 2,
},
];
let mut config_map: HashMap<(&str, &str), &ConfigItem> = HashMap::new();
for item in items.iter() {
// If we forget the & here, it won't compile.
config_map.insert((&item.first_key, &item.second_key), &item);
}
for ((first_key, second_key), value) in config_map {
println!(
"[{first_key}, {second_key}] = ({0}, {1}, {2})",
value.first_key, value.second_key, value.value
)
}
}
If we run that, then we get the expected results. Rust won’t even let us make
the mistake we did in C++, since it won’t try to convert the String
fields
into &str
tuple elements automatically. If you try to omit the &
when
doing config_map.insert()
, it will yell at you:
error[E0308]: mismatched types
--> src/main.rs:34:28
|
34 | config_map.insert((item.first_key, &item.second_key), &item);
| ^^^^^^^^^^^^^^
| |
| expected `&str`, found `String`
| help: consider borrowing here: `&item.first_key`
Conclusion
This whole episode confirms my suspicion that while writing C++, I just have to be nervous all the time. Seemingly-innocuous things like inserting into a map will silently read invalid memory. What’s worse is that I might have missed this problem entirely if that literal string optimization had taken effect in my unit tests. With all of the linters that we have for C++, it took running it in ASAN to uncover the problem and staring at template deductions to figure out why it was failing.
ChatGPT addendum
For fun, I tried asking ChatGPT. Here’s the dialog:
-
Me:
If I have two
std::string&
variables,k1
andk2
, is it okay to insert them into astd::unordered_map<std::pair<std::string_view, std::string_view>, int>
usingm[std::make_pair(k1, k2)] = 5
ChatGPT:
The code
m[std::make_pair(k1, k2)] = 5
will compile and run, but it may not produce the expected result.The reason is that
std::unordered_map
uses the hash value of its keys to store and retrieve values quickly. The hash function used forstd::pair<std::string_view, std::string_view>
relies on the hash functions of its components (std::string_view
in this case).However,
std::string
andstd::string_view
are different types.std::string_view
is a non-owning view of a sequence of characters, whilestd::string
owns the sequence and provides additional functionality.When you call
std::make_pair(k1, k2)
, you are creating a pair ofstd::string
s, notstd::string_view
s. This means that the hash function used bystd::unordered_map
will hash the entire contents of each std::string, which is inefficient and may produce collisions.To fix this issue, you can either use
std::string_view
as the key type of yourstd::unordered_map
, or createstd::string_view
s from yourstd::string
s before inserting them into the map. For example:std::unordered_map<std::pair<std::string_view, std::string_view>, int> m; m.emplace(std::make_pair(std::string_view(k1), std::string_view(k2)), 5);
This code creates
std::string_view
s fromk1
andk2
and uses them to construct the key of the map. This ensures that the hash function only hashes the underlying character sequences and not the entirestd::string
objects.Me:
Are there memory safety issues with my original version?
ChatGPT:
The original version
m[std::make_pair(k1, k2)] = 5
may not cause any memory safety issues, but it may cause performance issues and unexpected behavior.When you create a
std::pair<std::string, std::string> using std::make_pair(k1, k2)
, it creates two newstd::string
objects, which will allocate memory on the heap to store their contents. This may be inefficient if you only need to compare the contents of the two strings and not modify them.Additionally, if
k1
ork2
are modified after they have been inserted into thestd::unordered_map
, it may cause the hash value of the key to change and the key to no longer match its corresponding value. This may lead to unexpected behavior and bugs in your code.Using
std::string_view
instead ofstd::string
as the key type can avoid these issues, asstd::string_view
is a lightweight non-owning view of a sequence of characters and does not allocate any memory on the heap.Me:
std::make_pair
will resolve the pair type to create asstd::pair<std::string, std::string>
, which results in a temporary copy of the strings. Then, they get referenced when insertingstd::string_view
, resulting in pointers to invalid memory.ChatGPT:
You are correct, I apologize for my previous response.
std::make_pair(k1, k2)
creates a temporarystd::pair<std::string, std::string>
object, which contains copies of the contents ofk1
andk2
. Whenstd::string_view
objects are created from the temporarystd::string
objects, they reference invalid memory after the temporarystd::string
objects are destroyed.Therefore, using
std::make_pair(k1, k2)
to create the key of thestd::unordered_map
and then creatingstd::string_view
objects from it may cause undefined behavior.To avoid this issue, you can create
std::string_view
objects directly from thestd::string
objects and then use them to construct the key of thestd::unordered_map
. For example:std::unordered_map<std::pair<std::string_view, std::string_view>, int> m; auto k1_view = std::string_view(k1); auto k2_view = std::string_view(k2); m[std::make_pair(k1_view, k2_view)] = 5;
This way, the
std::string_view
objects reference the valid memory of thestd::string
objects, and the key of thestd::unordered_map
is constructed directly from thestd::string_view
objects.
All in all, not bad. It suggested a better version, but it didn't notice the undefined behavior until I pointed it out. My job is safe, for now.
No comments:
Post a Comment