Skip to content

Inventory read: report more fine-grained errors and retry on (some) failures#10595

Open
jgallagher wants to merge 2 commits into
mainfrom
john/inventory-read-better-errors
Open

Inventory read: report more fine-grained errors and retry on (some) failures#10595
jgallagher wants to merge 2 commits into
mainfrom
john/inventory-read-better-errors

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

We now return a "not found" error if attempting to read an inventory collection that doesn't exist or is deleting concurrently with our read. Fixes #8336:

% omdb db inventory collections show 3d7d9e30-611c-42db-8eb9-e0f1c7b60631
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:45975/omicron?sslmode=disable
note: database schema version matches expected (265.0.0)
Error: fetching collection 3d7d9e30-611c-42db-8eb9-e0f1c7b60631

Caused by:
    Not found: inventory collection not found: 3d7d9e30-611c-42db-8eb9-e0f1c7b60631

Adds a single retry to inventory_get_latest_collection() in case it's called concurrently with a delete; this should fix #10573.

…ailures

We now return a "not found" error if attempting to read an inventory
collection that doesn't exist or is deleting concurrently with our read.
Fixes #8336.

Adds a single retry to `inventory_get_latest_collection()` in case it's
called concurrently with a delete; this should fix #10573.
Comment on lines +2759 to +2760
"prev-collection-id" => %collection_id,
"prev-collection-err" => InlineErrorChain::new(&err),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nitpicky: i feel like we almost always use _ in structured log record fields, right?

Suggested change
"prev-collection-id" => %collection_id,
"prev-collection-err" => InlineErrorChain::new(&err),
"prev_collection_id" => %collection_id,
"prev_collection_err" => InlineErrorChain::new(&err),

opctx,
new_collection_id,
)
.await?,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if errors returned here ought to have an internal_context noting that we attempted to read that inventory after retrying a failure to read an older inventory collection?

Comment on lines +2815 to +2816
// This method has quite a lot of error handling. Most fall into one of
// three categories:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing all of this down <3

Comment on lines +2855 to +2863
async fn err_checking_for_deletion_impl<
F1: FnOnce() -> String,
F2: FnOnce() -> Error,
>(
conn: &async_bb8_diesel::Connection<DbConnection>,
id: CollectionUuid,
make_internal_error_message: F1,
make_notfound_error: F2,
) -> Error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nitpicky: this could maybe be written a bit more concisely as:

Suggested change
async fn err_checking_for_deletion_impl<
F1: FnOnce() -> String,
F2: FnOnce() -> Error,
>(
conn: &async_bb8_diesel::Connection<DbConnection>,
id: CollectionUuid,
make_internal_error_message: F1,
make_notfound_error: F2,
) -> Error {
async fn err_checking_for_deletion_impl(
conn: &async_bb8_diesel::Connection<DbConnection>,
id: CollectionUuid,
make_internal_error_message: impl FnOnce() -> String,
make_notfound_error: impl FnOnce() -> Error,
) -> Error {

since I don't think we actually need to refer to F1 or F2 anyplace.

Comment on lines +2883 to +2889
macro_rules! err_checking_for_deletion {
($($arg:tt)+) => {
err_checking_for_deletion_impl(
&conn, id, || format!($($arg)*), make_notfound_err,
).await
};
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unimportant style nit: i might have put the macro definition before the err_checking_for_deletion_impl function so that the code reads topohraphically as a "here is the thing you call, and then here is the implementation details of that"?

.collect::<Result<BTreeMap<_, _>, ()>>()
else {
return Err(err_checking_for_deletion!(
"missing baseboard that we should have fetched",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i realize that this is basically reimplementing the previous code without changing the error message, but it occurs to me that it might be nice to include the missing baseboard ID here and in subsequent similar errors --- if we ever actually saw that happen i can imagine wanting to be able to track it down while debugging?

Comment on lines +4858 to +4868
// We expect to find exactly 1 collection. 0 means the collection
// doesn't exist (also reasonable); any other number means we got
// multiple rows with the same primary key - not possible.
match collections.len() {
0 => Ok(None),
1 => Ok(Some(collections.into_iter().next().unwrap())),
n => Err(Error::internal_error(format!(
"found {n} collections with id {id}: \
this should be impossible!"
))),
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, is this pattern of using limit(2) and checking for duplicate primary keys, rather than just using limit(1), a thing we should be doing elsewhere? is there a reason we ought to be concerned about this sort of thing happening in the wild, or is this just extreme defensive programming?

i'm asking because in FM, we have a very similar fm_sitrep_read_metadata_on_conn method that doesn't handle duplicate pkeys, and i'm wondering if i ought to be.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I'm not sure! It looks like this dates back to when reading an inventory was first introduced (#4291) - @davepacheco might have an opinion? My current opinion is that this is an unnecessary (but pretty harmless, other than the confusion it produces when someone reads it) extreme defensive programming bit. If crdb allows duplicate primary keys, way too much is going to go off the rails for this one to matter much.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my thinking as well, which was why I asked. Well, that and my misreading the diff and thinking it was new in this branch. 😅

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pattern I used early on in Omicron (I'm surprised as late as this) and "extreme defensive programming" is a fair explanation. I had hoped at one point (very early on) that the DataStore's abstractions would just all do this without folks having to think about it everywhere (i.e., this kind of query would go through a common code path). I think that'd be worth it in an ideal world, but it's probably not worth having this confusing construction scattered all over the place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

helios/deploy failed in CI: collections.len() != 1 omdb db inventory collections show fails unhelpfully when collection is missing

3 participants