ctrl+f
I want to share a short tale about about an adventure with default & configuration properties, in the hopes I can prevent others for falling for the same issues my team did. This is a story about defaults-gone-wrong.
The Setup
A little bit of background is needed to have a comprehensive picture. This story revolves around our Product API
,
our Authorization API
, and the company’s Authorization Management System
or ams
for short.
The Product API
is how clients interact with product data, assuming they are authorized.
The company’s ams
is the source-of-truth on whether a client is authorized or not. The Authorization API
exists
to mediate interactions between our Product API and this source-of-truth.
The Authorization API
supports an in-memory cache of previously authorized clients, and a database fallback in case
the cache is empty and ams
is down.
So the auth flow is like this:
function isUserAuthorizedForResource(userID, resourceID) bool {
var isAuthorized bool
var err error
isAuthorized, err = inMemoryCache.getAuthorization(userID, resourceID)
if err == nil {
// user was in our cache, return their auth state
return isAuthorized
}
// user not in cache, check the auth system
isAuthorized, err = authSystem.getAuthorization(userID, resourceID)
if err == nil {
return isAuthorized
}
// could not get to auth system, fall back to database
isAuthorized, err = database.getAuthorization(userID, resourceID)
if err == nil {
return isAuthorized
}
// something really went wrong. cannot authorize user.
return false
}
The Problem
A client reached-out about an authorization error they were receiving. They had recently submitted a request to the
Authorization Management System
, ams
, to gain access to a particular API resource. But when attempting to access
said resource, they received an auth error.
Thinking over the list of possible explanations for the problem, I first checked our source-of-truth, the ams
. And yes, they did in fact have the correct authorization. So that’s not the problem.
Next, I replicated the request to our Authorization Server
. Bingo! Unauthorized.
The server’s in-memory cache must have their userID and available resources cached without the most recent addition. Simply clear the cache for that user, let it repopulate, and everything’s right with the world… Right?
Wrong.
Clearing the cache did not help. Is the ams
broken, reporting the correct authorization setup in the UI but not via the API?
Very unlikely.
That only leaves the database. It is periodically updated with userIDs and resources, and is only used when the in-memory
cache doesn’t have that userID + resource pair and the call to ams
fails. Which it doesn’t. So the database shouldn’t
be getting used.
Ah, but it is!
Debugging the issue locally revealed that the call to ams
would timeout after 10ms
, and would fallback to the database. Since the clients authorization change was so recent, it wasn’t captured in the database, so they’d get that
unauthorized error. But why is the timeout so short? Everyone on the team would know it should be at least 1s
,
possibly more.
The investigation
Finally, we get to the juicey parts. Why was the timeout to the ams
so short?
In our Product API
application, we specified an acceptable default for the timeout, like so:
func createAuthSystemClient(/* deps provided */) AuthSystem {
authSystem := AuthSystem{}
/* misc setup of authSystem ... etc */
authSystem.timeout = getPropertyOrDefault("auth-system.timeout", "5000")
return authSystem
}
So if the auth-system.timeout
property was not specified in a configuration file or environmental variable, it will
default to 5s
. Which it is obviously not doing.
Time to check the various configuration files, and any possible environmental variables. And as expected, I cannot
find anything. So it has to be an issue with the code that sets this default to 5s
, right?
But no matter how hard I looked, I could see no bug or fault with this getPropertyOrDefault
code. In desperation, I
ctrl+f
for the property.
Lo and behold, in plain site, auth-system.timeout = "10ms"
. It was encoded in the application’s embedded confirguation. This
configuration is supposed to be non-environmental, specifying core properties like the app version, license, etc. But for
some reason, someone decided to specify a timeout for the ams
.
Removing this code allowed the getPropertyOrDefault
to do its job, and use 5s
as the minimal timeout. After that,
everything worked.
Calls would succeed to the ams
, re-populate the in-memory cache with the correct authorization. Eventually the
fallback database will be udpated too.
All is right with the world.