Digital identities in the wild

Over the years, I've worked across industries that use digital identities. That showed me how flexible the standards are and how tailored each implementation must be. Under the hood, there are some interesting differences between standard and implementation. It took a major incident at work for me to notice this slight but important variance.

To explore the difference I'm talking about, we will examine the standard, explore some use cases, and then dive into the implementation hurdles. Finally, I'll walk you through the outage I was part of at work.

Fundamentals #

Before I jump into some use cases, let's shine a light on the login process itself. From my bubble, the most common approaches towards authentication are based on OAuth and OpenID Connect.

The general idea is to bundle all account-related actions and information in one entity, the identity provider. This scenario has two sides: the identity provider and the client app integration. The identity provider does all the heavy lifting of validating the user’s identity. All other feature-specific applications only need a thin layer. Frontends initiate a login flow and receive a token to work with. Backends need the identity provider's public key (or certificate) to verify tokens.

Identity provider manages sessions #

When a user registers or logs in, the identity provider verifies the user. It initialises a session and issues a token set, which is used for all further requests and user-specific actions. A session is specific to a device, browser, or app, and it allows users to provide their credentials only once.

The identity provider is responsible for storing and managing all sessions. This includes creating new ones, issuing new token sets based on active sessions, terminating sessions, and deleting expired sessions.

From a bird's eye view, an identity provider owns and verifies the identity. The goal is for all other client apps to work with the provided tokens. It makes the integration simpler and enables single sign-on options.

Client integrations validate sessions #

The other side is the client app implementation. There are two patterns:

Token-only pattern: The identity provider issues signed, temporary tokens that are self-contained. The backend processes a request when it includes a valid token (e.g. it verifies the token signature and expiration date). It skips checking the token with the identity provider on every request. However, this backend still accepts valid tokens after the identity provider has terminated the session. Users can keep using the app until the already-signed token set expires.

Single-source pattern: Validate the token via the identity provider for every request. This catches all requests that carry a valid token but belong to a terminated session. However, it makes the identity provider a single point of failure: every request to a connected backend results in a call to the identity provider.
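To make the token-only check concrete, here is a minimal sketch of what a backend might do locally. It assumes a JWT-like token signed with a shared HMAC key, purely to keep the example self-contained; real OpenID Connect setups typically use asymmetric signatures and a maintained JWT library. The helper names `issue_token` and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Re-add the padding that the compact token format strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, key: bytes) -> bytes:
    return hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(claims: dict, key: bytes) -> str:
    """Identity provider side (hypothetical helper): sign the claims."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signature = _b64url(_sign(f"{header}.{payload}", key))
    return f"{header}.{payload}.{signature}"


def verify_token(token: str, key: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Backend side: check signature and expiry without calling the identity provider."""
    try:
        header, payload, signature = token.split(".")
        expected = _sign(f"{header}.{payload}", key)
        if not hmac.compare_digest(expected, _b64url_decode(signature)):
            return None
        claims = json.loads(_b64url_decode(payload))
    except ValueError:
        # Malformed token: wrong part count, bad base64, or bad JSON.
        return None
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None
    return claims
```

Note what the sketch cannot see: whether the session behind the token was terminated in the meantime. Until `exp` passes, the token keeps verifying.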

Most often, reality lies somewhere between those patterns. I don’t want a single point of failure, but sometimes, I want to enforce a truly active session. Therefore, I separate actions by criticality. The critical ones, often involving altering or deleting your user account, require an active session. You can implement an even stricter flow, such as requesting the user's password before these operations occur. For all other requests, it might be ok to rely on the client integration validating the signed token. This way, both worlds work together.
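Splitting by criticality could look roughly like the sketch below. The action names and the `authorize` function are made up for illustration; `session_is_active` stands in for whatever call asks the identity provider about the session.

```python
from typing import Callable

# Hypothetical set of actions that must never run on a terminated session.
CRITICAL_ACTIONS = {"delete_account", "change_password"}


def authorize(action: str, token_is_valid: bool,
              session_is_active: Callable[[], bool]) -> bool:
    """Decide whether a request may proceed.

    token_is_valid: result of the local signature and expiry check.
    session_is_active: callable that asks the identity provider; it is
    only invoked for critical actions to keep traffic low.
    """
    if not token_is_valid:
        return False
    if action in CRITICAL_ACTIONS:
        return session_is_active()
    return True
```

Passing the lookup as a callable keeps the expensive round trip lazy: ordinary requests never trigger it.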

With these concepts in mind, let's look at some use cases.

In Finance: Lock accounts immediately #

The first use case is in the financial industry. Think about a bank account. With the ability to use online banking, customers can sign in to a portal and transfer money.

The bank has to deal with abuse. Sometimes, bad things happen, and the bank has to lock an account. There's the requirement to disable an account within a second. It would be bad if a user was locked at the identity provider but still held a valid token. This person could clear the account balance before the termination takes effect for them.

Having to log in to the account more often is not seen as a problem. For some users, it might even feel "more secure".

In Entertainment: Users want to stay logged in #

In an entertainment app, users want to consume content. Anything that slows down their access to the content is an annoying distraction, which poses a risk to the entertainment company. Therefore, both parties are interested in logging in users once and keeping them logged in as long as possible.

The identity provider is often at the forefront and receives much of the traffic first. Many users join simultaneously, especially for live events in a certain region, causing a lot of traffic in a short window. The application landscape must be able to handle this.

When a user unsubscribes, it is accepted that there might still be a valid token, and the user can still watch content for some time (mostly minutes or hours). Of course, only as long as legal contracts allow it.

The standard handles various use cases #

The cool part about the standard is that it handles both use cases. Either way, I can use the standard and derive a suitable configuration.

For the financial-based use case:

The backend consults the identity provider for session validity. Session termination takes effect immediately.
Use short session lifespans. There is less risk of session takeover, and it is straightforward for the identity provider to manage.

The entertainment app:

Use long session lifespans. Users log in once, and the session covers the rest from now on.
Rely on token validation on the client to reduce overall traffic at the identity provider.
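The two configurations can be written down as data. The sketch below is only an illustration: `SessionPolicy` and its field names are invented, and the lifespans are placeholder values, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionPolicy:
    """Hypothetical knobs that separate the two use cases."""
    session_lifespan_seconds: int
    validate_with_identity_provider: bool


# Placeholder values chosen for illustration only.
FINANCE = SessionPolicy(
    session_lifespan_seconds=10 * 60,            # short sessions
    validate_with_identity_provider=True,        # termination takes effect immediately
)
ENTERTAINMENT = SessionPolicy(
    session_lifespan_seconds=90 * 24 * 60 * 60,  # long sessions
    validate_with_identity_provider=False,       # rely on local token validation
)
```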

The implementation is where it falls apart #

Let's look at the identity provider implementation a bit closer. Besides the user flows like login, the server has to manage all the sessions it issues. There is no one-size-fits-all model for a technical session management solution. So, decisions will have tradeoffs.

The entertainment app: Users might have many active devices but only use a small proportion. All these sessions will last for a long time, though. It's important to have a data store that scales to many sessions. Eventual consistency works fine. The optimisation is centred on fast session retrieval. Handling lots of traffic forces tradeoffs in flexibility.

The financial solution with short session lifespans: Sessions exist only while users actively use the portal, so the session count is likely manageable. When a user is terminated, the identity provider must find and delete all of that user's sessions immediately. This works better with a solution that offers more flexibility in retrieving the necessary sessions. Limits on scaling are fine.
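The flexibility the financial case needs is essentially a second lookup path: from a user to all of their sessions. A rough in-memory sketch follows, with invented names, and not modelled on any particular product.

```python
from collections import defaultdict
from typing import Optional


class SessionStore:
    """Toy session store with a per-user index (illustrative only)."""

    def __init__(self) -> None:
        self._user_by_session = {}              # session_id -> user_id
        self._sessions_by_user = defaultdict(set)  # user_id -> {session_id}

    def create(self, session_id: str, user_id: str) -> None:
        self._user_by_session[session_id] = user_id
        self._sessions_by_user[user_id].add(session_id)

    def get(self, session_id: str) -> Optional[str]:
        # Fast path that both use cases need: lookup by session ID.
        return self._user_by_session.get(session_id)

    def terminate_user(self, user_id: str) -> int:
        # Only possible because the per-user index is kept complete.
        session_ids = self._sessions_by_user.pop(user_id, set())
        for session_id in session_ids:
            del self._user_by_session[session_id]
        return len(session_ids)
```

The per-user index is the tradeoff: it has to cover every session to be trustworthy, which costs memory and write work that a pure key-value lookup by session ID avoids.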

So what's with the incident, then? #

I was working in a team that runs a big entertainment app. We used an open-source solution that handles various use cases, including financial-grade APIs. Therefore, the identity provider offers the feature to terminate all user sessions immediately. It uses a cache underneath (plus persistent storage), allowing quick and flexible session retrieval. There are certain scaling limits attached to it. For us, daily operations went fine. The server issued sessions, and the cache was scaled accordingly.

It went well up to the point where an unrelated issue forced the identity provider to restart. On startup, it loaded all persisted sessions into the cache but ran out of memory. This is where it all fell apart.

We discovered that terminating a user's sessions only works when all sessions are in the cache (remember the financial-grade requirement). Retrieving sessions just in time from the persisted storage works fine for everything else.

Our incident resolution was to start with an empty persisted storage. After the successful startup, we did a storage failover. That way, it can retrieve all persisted sessions on each request later. Only the feature to terminate all of a user's sessions wouldn't work (which we don't use anyway).
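The difference between the two loading strategies can be sketched as follows. This is not the code of the identity provider we used; `LazySessionCache`, the dict-based storage, and the LRU eviction are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import List, Optional


class LazySessionCache:
    """Bounded cache that loads sessions from storage only on demand."""

    def __init__(self, storage: dict, capacity: int) -> None:
        self._storage = storage        # stands in for the persistent storage
        self._capacity = capacity
        self._cache = OrderedDict()    # session_id -> session, in LRU order

    def get(self, session_id: str) -> Optional[dict]:
        if session_id in self._cache:
            self._cache.move_to_end(session_id)
            return self._cache[session_id]
        session = self._storage.get(session_id)
        if session is not None:
            self._cache[session_id] = session
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return session

    def cached_sessions_of(self, user_id: str) -> List[str]:
        # Sees only what happens to be cached, so this is incomplete
        # unless every session has been loaded.
        return [sid for sid, s in self._cache.items() if s["user_id"] == user_id]
```

With lazy loading, memory is bounded by `capacity` instead of by the number of persisted sessions, so a restart cannot exhaust it. The price is visible in `cached_sessions_of`: a query across all of a user's sessions cannot be answered from the cache alone.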

Let's say our mean time to recovery (MTTR) could've been better. Sometimes, we have to learn it the hard way. Indeed, we now have patches and multiple resiliency and fallback measures in place.

I can now scratch being on the title page of major newspapers from my bucket list, even though it wasn't listed there in the first place.

Conclusion #

Technical standards are great. They offer general concepts, architectures or abstractions for various use cases. We implement them across many sectors, even with contradictory requirements. Thanks to the configuration options, it works everywhere. It's fascinating how a standard, an abstract concept, can be applied to so many use cases.

Digital identities include user registration, login, password reset, etc. It seems to be a solved problem, a commodity. Every app has it, and the standard is there. Looking into the design and implementation, though, it might not be straightforward. Banking relies on short session lifespans and the ability to terminate any session on any device. A streaming provider is interested in logging in users once and keeping them logged in as long as possible. Comparing the two use cases, we see that they have contradictory requirements. But the standard handles both.

However, be cautious about open-source implementations. Technical decisions force tradeoffs. Ensure that ready-to-go solutions don't impose limitations on you because they support other use cases.