Background: I work at Block/Square, on the team that owns (but didn't build) our internal Feature Flag system, and I also have a lot of experience using LaunchDarkly.
I like the idea of caching locally, although k8s makes that a bit more difficult since containers are typically ephemeral. People will use feature flags for things that they shouldn't, so eventually "falling back to default values" will cause production problems. One thing you can do to help with this is run proxies closer to your services. For example, LaunchDarkly has an open source "Relay".
Local evaluation seems to be pretty standard at this point, although I'd argue that delivering flag definitions is (relatively) easy. One of the real value-adds of a product like LaunchDarkly is all the things they can do when your applications send evaluation data upstream: unused flags, only-ever-evaluated-to-the-default flags, only-ever-evaluated-to-one-outcome flags, etc.
One best practice that I'd love to see spread (in our codebases too) is always naming the full feature flag directly in code, as a string (not a constant). I'd argue the same practice should be taken with metrics names.
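To make that concrete, here's a tiny Go sketch (the `Flags` interface is hypothetical, not any particular SDK). The point is that the full flag key sits at the call site as a string literal, so a plain grep across every repo finds every usage:

    // Hypothetical evaluation interface; the interesting part is the call site.
    type Flags interface {
        Bool(key string, defaultValue bool) bool
    }

    func renderCheckout(f Flags) {
        // Preferred: the full flag key appears inline, as a string literal.
        if f.Bool("checkout.new-payment-flow", false) {
            // ... new flow
        }
        // Harder to trace: f.Bool(newPaymentFlowKey, false) hides the key
        // behind a constant that may be named differently in each codebase.
    }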
One of the most useful things to know (but seldom communicated clearly near landing pages) is a basic sketch of the architecture. It's necessary to know how things will behave if there is trouble. For instance: our internal system uses ZK to store (protobuf) flag definitions, and applications set watches to be notified of changes. LaunchDarkly clients download all flags[1] in the project on connection, then stream changes.
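To give a flavor of the "applications set watches" part, here's roughly what that loop looks like with the go-zookeeper client (the path and the cache-update function are made up; the real layout is more involved):

    import (
        "time"

        "github.com/go-zookeeper/zk"
    )

    // updateLocalFlagCache is a placeholder for "parse the protobuf definitions
    // and swap them into the in-memory cache".
    func updateLocalFlagCache(data []byte) { _ = data }

    func watchFlagDefinitions(servers []string, path string) error {
        conn, _, err := zk.Connect(servers, 10*time.Second)
        if err != nil {
            return err
        }
        defer conn.Close()
        for {
            // GetW returns the current data plus a one-shot channel that fires
            // when the znode changes; re-arm the watch on every iteration.
            data, _, events, err := conn.GetW(path)
            if err != nil {
                return err
            }
            updateLocalFlagCache(data)
            <-events // block until the next change
        }
    }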
If I were going to build a feature flag system, I would ensure that there is a global, incrementing counter that is updated every time any change is made, and make it a fundamental aspect of the design. That way, clients can cache what they've seen, and easily fetch only necessary updates. You could also imagine annotating that generation ID into W3C Baggage, and passing it through the microservices call graph to ensure evaluation at a consistent point in time (clients would need to cache history for a minute or two, of course).
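A rough sketch of the client side of that design (every name here is hypothetical): track the last generation applied, ask the server only for changes after it, and pin the generation into the W3C `baggage` header on outgoing calls so downstream services can evaluate against the same point in time.

    import (
        "fmt"
        "net/http"
    )

    type FlagDefinition struct{ /* rules, variations, ... */ }

    type Change struct {
        Key        string
        Definition FlagDefinition
    }

    // FlagStore is the imagined server API: "everything after generation N".
    type FlagStore interface {
        ChangesSince(gen int64) (changes []Change, latest int64, err error)
    }

    type FlagCache struct {
        generation int64
        flags      map[string]FlagDefinition
    }

    func NewFlagCache() *FlagCache {
        return &FlagCache{flags: make(map[string]FlagDefinition)}
    }

    // Sync fetches and applies only the changes made after c.generation.
    func (c *FlagCache) Sync(store FlagStore) error {
        changes, latest, err := store.ChangesSince(c.generation)
        if err != nil {
            return err
        }
        for _, ch := range changes {
            c.flags[ch.Key] = ch.Definition
        }
        c.generation = latest
        return nil
    }

    // Pin the evaluation point in time for downstream services via W3C Baggage.
    func pinGeneration(req *http.Request, gen int64) {
        req.Header.Set("baggage", fmt.Sprintf("flag-generation=%d", gen))
    }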
One other dimension in which feature flag services vary is the complexity of the rules they let you evaluate. Our internal system has a mini expression language (probably overkill). LaunchDarkly's arguably better system gives you an ordered set of rules within which conditions are ANDed together. Both allow you to pass in arbitrary contexts of key/value pairs. Many open source solutions (Unleash, last I checked some time ago) are more limited: some don't let you vary on inputs at all, and some only support a small set of prescribed attributes.
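To illustrate the "ordered rules, with conditions ANDed inside each rule" shape (a generic sketch, not LaunchDarkly's actual data model): evaluation walks the rules in order and serves the first rule whose clauses all match the supplied context.

    // Generic sketch of ordered-rule evaluation over an arbitrary key/value context.
    type Clause struct {
        Attribute string   // e.g. "country"
        Op        string   // e.g. "in"
        Values    []string // e.g. ["CA", "US"]
    }

    type Rule struct {
        Clauses   []Clause // ANDed together
        Variation string   // value served if every clause matches
    }

    func clauseMatches(c Clause, ctx map[string]string) bool {
        v, ok := ctx[c.Attribute]
        if !ok {
            return false
        }
        // Only "in" is sketched here; real systems support many operators.
        for _, want := range c.Values {
            if v == want {
                return true
            }
        }
        return false
    }

    func evaluate(rules []Rule, ctx map[string]string, fallback string) string {
        for _, r := range rules {
            matched := true
            for _, c := range r.Clauses {
                if !clauseMatches(c, ctx) {
                    matched = false
                    break
                }
            }
            if matched {
                return r.Variation // first matching rule wins
            }
        }
        return fallback
    }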
I think the time is ripe for an open standard client API for feature flags. Standardizing the communication mechanisms would be constricting, but there's no reason we couldn't create something analogous to (or even part of) the OpenTelemetry client SDK for feature flags. If you are seriously interested in collaborating on that, please get in touch. (I'm "zellyn" just about everywhere)
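To be concrete about what I mean by "standard client API": something shaped roughly like this (entirely hypothetical, just the shape), which any vendor's SDK or an in-house backend could sit behind, the way OpenTelemetry separates its API from its exporters.

    // Hypothetical vendor-neutral evaluation API. Concrete providers
    // (LaunchDarkly, an internal service, a static file, ...) plug in behind it.
    type EvalContext map[string]interface{}

    type Client interface {
        BoolFlag(key string, ctx EvalContext, defaultValue bool) bool
        StringFlag(key string, ctx EvalContext, defaultValue string) string
        IntFlag(key string, ctx EvalContext, defaultValue int) int
        // Close flushes any buffered evaluation telemetry (the upstream data
        // mentioned above: which flags were evaluated, to which values).
        Close() error
    }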
[1] Yes, this causes problems if you have too many flags in one project. They have a pretty nice filtering solution that's almost fully ready.
One more update. I spent a little time the other day trying to find all the feature flag products I could. I'm sure I missed a ton. Let me know in the comments!
Here's my first draft of the questions you'd want to ask about any given solution:
Questionnaire
- Does it seem to be primarily proprietary, primarily open-source, or “open core” (parts open source, enterprise features proprietary)?
- If it’s open core or open source with a service offering, can you run it completely on your own for free?
- Does it look “serious/mature”?
  - Lots of language SDKs?
  - High-profile, high-scale users?
- Can you do rules with arbitrary attributes, or is it just on/off or on/off with overrides?
  - Can it do complex rules?
- How many language SDKs (one, a few, lots)?
- Do feature flags appear to be the primary purpose of this company/project?
  - If not, does it look like feature flags are a first-class offering, or an afterthought / checkbox-filler? (e.g. split.io started out in experimentation, and then later introduced free feature flag functionality. I think it’s a first-class feature now.)
- Does it allow approval workflows?
- What is the basic architecture?
  - Are flags evaluated in-memory, locally? (Hopefully!)
  - Is there a relay/proxy you can run in your own environment?
  - How are changes propagated?
    - Polling?
    - Streaming?
  - Does each app retrieve/stream all the flags in a project, or just the ones it uses?
  - What happens if their website goes down?
- Do they do experiments too?
  - As a first-class offering?
- Are there ACLs and groups/roles?
  - Can they be synced from your own source of truth?
- Do they have a solution for mobile and web apps?
  - If so, what is the pricing model?
  - Do they have a mobile relay type product you can run yourself?
- What is the pricing model?
  - Per developer?
  - Per end-user? MAU?
I would have thought so. But Flagsmith apparently does primarily server-side eval. And even OpenFeature has `flagd`, which I guess is a sidecar, so a sort of hybrid approach.
And LaunchDarkly's Big Segments fetch segment inclusion data live from Redis (although I believe they then cache it for a while).
I more or less know all the answers for LaunchDarkly (except pricing details), and for the internal feature flag service we're deprecating, but I haven't gone through and answered it for all the other offerings. It would be time-consuming, but very useful.
Also, undoubtedly contentious. If you want an amusing read, go check out LaunchDarkly's "comparison with Split" page and Split's "comparison with LaunchDarkly" page. It's especially funny when they make the exact same evaluations, but in reverse.
> One best practice that I'd love to see spread (in our codebases too) is always naming the full feature flag directly in code, as a string (not a constant).
Can you elaborate on this? As a programmer, I would think that using something like a constant would help us find references and ensure all usage of the flag is removed when the constant is removed.
One of the most common things you want to do for a feature flag or metric name is ask, "Where is this used in code?". (LaunchDarkly even has a product feature that does this, called "Code References".) I suppose one layer of indirection (into a constant) doesn't hurt too much, although it certainly makes things a little trickier.
The bigger problem is when the code constructs metric and flag names programmatically:
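Something like this (an illustrative Go sketch; the helpers and variables are made up, not from any real codebase):

    // The full metric/flag name never appears anywhere in the source.
    metricName := fmt.Sprintf("payments.%s.%s.latency", region, endpoint)
    metrics.Timing(metricName, elapsed)

    flagKey := "rollout-" + strings.ToLower(teamName) + "-" + featureName
    enabled := flagClient.Bool(flagKey, false)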
That kind of thing makes it very hard to find references to metrics or flags. Sometimes it's impossible, or close to impossible, to remove, but it's worth trying hard.
Not OP, but multiple code bases may refer to the same flag by a different constant. Having a single string that can be searched across all repos in an organization is quite handy to find all places where it's referenced.
Especially when you have different languages with different rules: `MY_FEATURE_FLAG` and `kMyFeatureFlag` and `@MyFeatureFlag` might all be reasonable names for what is defined as `"my_feature_flag"` in the configuration.
Using just the string-recognizable name everywhere is...better.
If you create your own service to evaluate a bunch of feature flags for a given user/client/device/location/whatever and return the results, for use in mobile clients (everyone does this), PLEASE *make sure the client enumerates the list of flags it wants*. It's very tempting to just keep that list server-side, and send all the flags (much simpler requests, right?), but you will have to keep serving all those flags for all eternity because you'll never know which deployed versions of your app require which flags, and which can be removed.
Well, it seems to be a common theme to build a server that uses the flag eval _server_ SDK to evaluate a bunch of flags and then pass them back to the client.
For example, a client may call myserver.com/mobile-flags?merchant=abcdef&device=123456&os=ios&os_version=15.2&app_version=6.1 and the server will pass back:
    flag1: true
    flag2: 39
    flag3: false
    flag4: green
There's a reason this pattern is so common: LaunchDarkly has a mobile client SDK, for example, but they charge by MAU, which would be untenable, so folks tend to write a proxy for the mobile apps to call. If the client (as in my example above) doesn't specify which flags it wants, then the metrics are missing, whether you're using a commercial product or your own: it'll simply tell you that all the flags got used. (Of course, you could be collecting metrics from the client apps.)
But based on our experience, you'd be better off having the mobile client pass in an explicit list of desired flags, which will give accurate metrics.
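Concretely, the only change from the example above is that the client owns its flag list (a Go sketch; the URL, parameter names, and identifiers are still made up):

    // The mobile client asks for exactly the flags this app version uses.
    wanted := []string{"checkout.new-payment-flow", "search.ranker-v2", "ui.dark-mode"}
    u := fmt.Sprintf(
        "https://myserver.com/mobile-flags?device=%s&app_version=%s&flags=%s",
        deviceID, appVersion, strings.Join(wanted, ","),
    )
    resp, err := http.Get(u)
    // ... decode the {flag: value} response. Flags absent from `wanted` are never
    // served, so server-side metrics reflect what shipped clients actually use.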
> I'd argue that delivering flag definitions is (relatively) easy.
I'd argue that coming up with good UI that nudges developers towards safe behavior, as well as useful and appropriate guard rails -- in other words, using the feature flag UI to reduce likelihood of breakage -- is difficult, and one of the major value propositions of feature flag services.