Curtis La Graff

ctrl+f

I want to share a short tale about about an adventure with default & configuration properties, in the hopes I can prevent others for falling for the same issues my team did. This is a story about defaults-gone-wrong.

The Setup

A little bit of background is needed to have a comprehensive picture. This story revolves around our Product API, our Authorization API, and the company’s Authorization Management System or ams for short.

The Product API is how clients interact with product data, assuming they are authorized. The company’s ams is the source-of-truth on whether a client is authorized or not. The Authorization API exists to mediate interactions between our Product API and this source-of-truth.

The Authorization API supports an in-memory cache of previously authorized clients, and a database fallback in case the cache is empty and ams is down.

So the auth flow is like this:

function isUserAuthorizedForResource(userID, resourceID) bool {
  var isAuthorized bool
  var err error
  
  isAuthorized, err = inMemoryCache.getAuthorization(userID, resourceID)
  if err == nil {
    // user was in our cache, return their auth state
    return isAuthorized
  }
  
  // user not in cache, check the auth system
  
  isAuthorized, err = authSystem.getAuthorization(userID, resourceID)
  if err == nil {
    return isAuthorized
  }
  
  // could not get to auth system, fall back to database
  isAuthorized, err = database.getAuthorization(userID, resourceID)
  if err == nil {
    return isAuthorized
  }
  
  // something really went wrong. cannot authorize user.
  return false
}

The Problem

A client reached-out about an authorization error they were receiving. They had recently submitted a request to the Authorization Management System, ams, to gain access to a particular API resource. But when attempting to access said resource, they received an auth error.

Thinking over the list of possible explanations for the problem, I first checked our source-of-truth, the ams. And yes, they did in fact have the correct authorization. So that’s not the problem.

Next, I replicated the request to our Authorization Server. Bingo! Unauthorized.

The server’s in-memory cache must have their userID and available resources cached without the most recent addition. Simply clear the cache for that user, let it repopulate, and everything’s right with the world… Right?

Wrong.

Clearing the cache did not help. Is the ams broken, reporting the correct authorization setup in the UI but not via the API? Very unlikely.

That only leaves the database. It is periodically updated with userIDs and resources, and is only used when the in-memory cache doesn’t have that userID + resource pair and the call to ams fails. Which it doesn’t. So the database shouldn’t be getting used.

Ah, but it is!

Debugging the issue locally revealed that the call to ams would timeout after 10ms, and would fallback to the database. Since the clients authorization change was so recent, it wasn’t captured in the database, so they’d get that unauthorized error. But why is the timeout so short? Everyone on the team would know it should be at least 1s, possibly more.

The investigation

Finally, we get to the juicey parts. Why was the timeout to the ams so short?

In our Product API application, we specified an acceptable default for the timeout, like so:

func createAuthSystemClient(/* deps provided */) AuthSystem {
  authSystem := AuthSystem{}
  
  /* misc setup of authSystem ... etc */
  
  authSystem.timeout = getPropertyOrDefault("auth-system.timeout", "5000")
  
  return authSystem
}

So if the auth-system.timeout property was not specified in a configuration file or environmental variable, it will default to 5s. Which it is obviously not doing.

Time to check the various configuration files, and any possible environmental variables. And as expected, I cannot find anything. So it has to be an issue with the code that sets this default to 5s, right?

But no matter how hard I looked, I could see no bug or fault with this getPropertyOrDefault code. In desperation, I ctrl+f for the property.

Lo and behold, in plain site, auth-system.timeout = "10ms". It was encoded in the application’s embedded confirguation. This configuration is supposed to be non-environmental, specifying core properties like the app version, license, etc. But for some reason, someone decided to specify a timeout for the ams.

Removing this code allowed the getPropertyOrDefault to do its job, and use 5s as the minimal timeout. After that, everything worked.

Calls would succeed to the ams, re-populate the in-memory cache with the correct authorization. Eventually the fallback database will be udpated too.

All is right with the world.

Written on January 8, 2020
Email me feedback at: curtis@lagraff.me
or text me at: +1 (517) 672-2224