Feature Flags: A Complete Guide to Implementing Dynamic Application Control

Rashdi Chowdhury — Tue, 22 Oct 2024 17:47:00 +0000

TL;DR

Feature flags let us turn code on and off for specific users without redeploying our application.

We use dark launches and staged rollouts to minimize risk. We start with 5% of users, then move to 25%, 50%, and finally 100%.

We track one primary outcome and a few guardrails. We set clear stop and rollback rules before we start.

Every flag needs an owner, purpose, expiry date, and cleanup plan. This keeps flag debt from piling up.

We pair flags with ethics considerations like privacy, accessibility, and non-manipulative growth practices.

Key definitions (one line each)

Feature flag (toggle): A runtime switch that controls who sees a feature by percentage or rule.

Dark launch: Ship the code behind a flag at 0% exposure to test in production safely.

Staged rollout: Increase exposure in steps like 5%→25%→50%→100%.

Kill switch: A flag that instantly disables a risky component when problems occur.

Targeting rule: Conditions that decide exposure based on factors like plan, region, or device type.

Guardrail metric: A “do no harm” signal that tracks error rates, accessibility defects, or complaint rates.

Experiment flag: A flag that splits traffic for A/B tests with sticky assignment to users.
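To make "percentage" and "sticky assignment" concrete, here is a minimal sketch of hash-based bucketing. It is not how any particular vendor implements it. The function names and the choice of SHA-256 are our own assumptions for illustration.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> float:
    """Map a (flag, user) pair to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    # The first 8 hex characters form a 32-bit integer; scale it to a percentage.
    return int(digest[:8], 16) / 0x100000000 * 100


def is_exposed(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    """True when the user's bucket falls below the current rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user ID, a user exposed at 5% stays exposed at 25% and beyond, and 0% exposes nobody.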

Why PMs should care

Flags turn launches into controlled experiments. We learn faster and reduce risk without waiting for deployments.

Poor governance creates flag debt and confusing user experiences. We need clear owners, metrics, guardrails, and cleanup processes to avoid compliance headaches.

Step-by-step playbook

1) Choose the right flag type

We pick flag types based on what we need. Release flags are for normal staged rollouts in continuous delivery.

Experiment flags help when we want causal evidence through A/B testing. Ops flags work as kill switches for high-risk systems like payments.

Permission flags gate features by user plan or role. We document permission flags carefully to avoid logic sprawl.
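As an illustration of an ops flag acting as a kill switch, here is a small sketch. The flag name payments_newprocessor_ops, the charge function, and the processor callables are all hypothetical. The flags dictionary stands in for whatever flag client we actually use.

```python
from typing import Callable, Mapping

# Hypothetical flag name; not taken from any real system.
KILL_SWITCH_FLAG = "payments_newprocessor_ops"


def charge(
    amount_cents: int,
    flags: Mapping[str, bool],
    new_processor: Callable[[int], str],
    legacy_processor: Callable[[int], str],
) -> str:
    """Route to the risky path only while the ops flag is explicitly on."""
    # A missing flag defaults to off, so a config outage falls back to the safe path.
    if flags.get(KILL_SWITCH_FLAG, False):
        return new_processor(amount_cents)
    return legacy_processor(amount_cents)
```

Defaulting to the legacy path when the flag is missing is a design choice: the risky component has to be switched on deliberately.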

2) Define success and safety

We set one primary metric that should move. For example, activation increases by 2 percentage points in 14 days.

We establish 3-5 guardrails like error rate, crash-free percentage, and latency. Clear stop rules guide rollback decisions.

If error rate exceeds 0.5% for 10 minutes, we roll back immediately.
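A stop rule like this one can be written down as code so nobody has to interpret it during an incident. The sketch below assumes we sample the error rate once per minute. StopRule and should_roll_back are illustrative names, not part of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopRule:
    threshold: float        # e.g. 0.005 for a 0.5% error rate
    sustained_minutes: int  # e.g. 10


def should_roll_back(per_minute_error_rates: list[float], rule: StopRule) -> bool:
    """True when every one of the most recent samples exceeds the threshold."""
    if len(per_minute_error_rates) < rule.sustained_minutes:
        return False  # Not enough data yet to call the breach sustained.
    recent = per_minute_error_rates[-rule.sustained_minutes:]
    return all(rate > rule.threshold for rate in recent)
```

With StopRule(0.005, 10), ten consecutive one-minute samples above 0.5% trigger a rollback. A single sample at or below the threshold resets the streak.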

3) Pick the rollout path

Our feature rollout follows this path: dark launch at 0% → internal only → 5% → 25% → 50% → 100%.

We start with low-risk cohorts like new users or one region. We keep a 5-10% hold-back group as control when measuring lift in progressive delivery.

4) Instrument before exposure

We log exposure events with user ID, flag name, variant, and timestamp. Success and guardrail events use the same user ID for accurate tracking.

We check that dashboards and alerts work at 0%, before any user is exposed. This helps us avoid measurement issues later.
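Here is a rough sketch of what an exposure event and the join to outcomes might look like. The field names follow the list above. The ExposureEvent class and conversion_by_variant function are assumptions for illustration, and real pipelines would usually do this join in a warehouse query.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExposureEvent:
    user_id: str
    flag_name: str
    variant: str
    timestamp: datetime


def conversion_by_variant(
    exposures: list[ExposureEvent], converted_user_ids: set[str]
) -> dict[str, float]:
    """Join exposures to outcomes on user_id and return the rate per variant."""
    users_by_variant: dict[str, set[str]] = {}
    for event in exposures:
        # A set deduplicates users who were exposed more than once.
        users_by_variant.setdefault(event.variant, set()).add(event.user_id)
    return {
        variant: len(users & converted_user_ids) / len(users)
        for variant, users in users_by_variant.items()
    }
```

Only users with an exposure event enter the denominator, which is why the success events must carry the same user ID.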

5) Launch and monitor per step

We verify exposures, primary metrics, guardrails, and support tickets at each step. Each step runs long enough for stable signals.

Error rates stabilize in hours. User behavior changes take days. We move forward only when all checks pass.

6) Decide quickly

We scale up when outcomes improve and guardrails hold. If results are neutral, we tweak copy or placement and try again.

We roll back right away if stop rules trigger. No debates during emergencies.
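The decision logic in this step is simple enough to sketch as a function. The three inputs and the returned labels are our own naming. How each input is computed depends on the metrics and stop rules defined earlier.

```python
def next_action(
    stop_rule_triggered: bool, guardrails_ok: bool, primary_improved: bool
) -> str:
    """Return 'roll_back', 'scale', or 'iterate' for the current rollout step."""
    if stop_rule_triggered:
        return "roll_back"  # Stop rules win over everything else.
    if guardrails_ok and primary_improved:
        return "scale"
    # Neutral results or a shaky guardrail: hold exposure, adjust, and retry.
    return "iterate"
```

Checking the stop rule first keeps a strong primary metric from overriding a safety breach.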

7) Clean up and record

We remove flags within two sprints after reaching 100%. If features become permanent policy, we move rules into permissions and delete the flag.

We jot down what shipped, metrics, and decisions in a short record.

8) Govern your flags

We keep a flag registry with name, owner, purpose, creation date, and cleanup ticket links. Naming conventions help us stay organized.

We use formats like area_feature_purpose such as onboarding_streaks_release. Monthly registry reviews help us retire stale feature flags.
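A registry entry and the two checks behind a monthly review could look roughly like this. The FlagRecord fields mirror the ones listed above plus the expiry date from the TL;DR. The allowed purpose suffixes are an assumption based on the four flag types in step 1.

```python
import re
from dataclasses import dataclass
from datetime import date

# area_feature_purpose, with purpose limited to assumed flag-type suffixes.
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9]+_(release|experiment|ops|permission)$")


@dataclass(frozen=True)
class FlagRecord:
    name: str
    owner: str
    purpose: str
    created: date
    expires: date
    cleanup_ticket: str


def has_valid_name(record: FlagRecord) -> bool:
    """True when the flag name follows the area_feature_purpose convention."""
    return NAME_PATTERN.match(record.name) is not None


def stale_flags(registry: list[FlagRecord], today: date) -> list[str]:
    """Names of flags whose expiry date has already passed."""
    return [record.name for record in registry if record.expires < today]
```

Running stale_flags against the registry before each review gives the list of flags to retire or re-justify.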

Trade-offs at a glance

Choice	Pros	Cons	Use when
Staged rollout	Limits risk; real-world learning	Slower to 100%; config overhead	Most product features
Dark launch	Test infra safely	No behavioral data yet	Risky dependencies / perf
Big-bang	Simple	High blast radius; low diagnosability	Tiny, reversible tweaks only
A/B via flag	Causal read	Needs power & clean assignment	Pricing, onboarding, UX bets

Each deployment method balances speed against deployment risk.

Metrics you must watch

Track one primary outcome per feature flag. Focus on conversion, activation, time-to-X, or retention metrics.

Monitor technical guardrails like error rates, crash-free percentages, and p95 latency. Watch for CPU and memory spikes that signal trouble.

Check user trust metrics like complaint rates from exposed users, refund rates, and accessibility defects.

Track rollout operations through exposure data by cohort, step duration, and variant traffic splits.

Keep dashboards simple with three panels: Outcome charts, Guardrails monitoring, and Rollout tracking.

Worked example: how big should each step be?

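We need enough exposures at each rollout step to estimate the conversion rate reliably. A standard margin-of-error formula gives the number of exposures we need per variant.

Let’s say we expect a conversion rate of about 5% (p = 0.05) and want to estimate it within ±0.5 percentage points (ME = 0.005) at 95% confidence, which is where the 1.96 comes from. We use this formula: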

n ≈ p(1−p) × (1.96 / ME)²

First, we calculate 1.96 divided by 0.005, which equals 392. When we square this, we get 153,664.

Next, we find p(1−p): 0.05 × 0.95 = 0.0475.

We multiply these together: 0.0475 × 153,664 ≈ 7,299 exposures per variant.

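Now timing. If we have 10,000 visitors per day and a 25% exposure rate, we get 2,500 exposures daily for the exposed variant.

At that rate, a 25% step takes about 2.9 days (7,299 ÷ 2,500) to reach our target sample size. Smaller steps take proportionally longer: at 5% exposure, the same target needs about 14.6 days.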
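The same arithmetic can be wrapped in two small helpers so we can rerun it for other baselines and traffic levels. The function names are ours. Note that sample_size rounds up, so it returns 7,300 rather than the 7,299 shown above.

```python
import math


def sample_size(p: float, margin: float, z: float = 1.96) -> int:
    """Exposures needed per variant: n = p(1 - p) * (z / ME)^2, rounded up."""
    return math.ceil(p * (1 - p) * (z / margin) ** 2)


def days_for_step(n: int, daily_visitors: int, exposure_rate: float) -> float:
    """Days needed to collect n exposures at a given exposure rate."""
    return n / (daily_visitors * exposure_rate)


n = sample_size(p=0.05, margin=0.005)
days_at_25 = days_for_step(n, daily_visitors=10_000, exposure_rate=0.25)
days_at_5 = days_for_step(n, daily_visitors=10_000, exposure_rate=0.05)
```

Halving the margin of error quadruples the required sample, so tightening precision is the expensive lever here.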

Realistic examples

A B2B audit export feature used LaunchDarkly with a release flag and kill switch. We started with a 0% dark launch, then rolled to enterprise customers at 10%, 50%, and 100%.

The primary goal was reducing audit ticket rates by 30%. We set guardrails for export errors under 0.5%, complaints under 0.3%, and latency under 60 seconds.

At 50% rollout, we saw a 27% ticket reduction while keeping all guardrails green.

For a fitness app’s streak feature, we ran A/B tests using OpenFeature. We targeted iOS users at 10%, 25%, then 50% over seven-day periods.

Our goal was increasing D7 workout rates by 2 percentage points. We monitored accessibility defects and crash rates as guardrails.

The test delivered 2.6 percentage points improvement, so we expanded to Android with a 5% control group.

Targeting rules: do’s and don’ts

Do:

Target stable IDs like logged-in users, not just devices
Start with low-risk groups such as new users or single locations
Document all rules in the PRD using clear language

Don’t:

Target users based on health data or protected classes
Build complex nested rules that teams cannot review
Include staff or bots in our performance metrics
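A flat rule that a reviewer can read in one pass might be sketched like this. UserContext, TargetingRule, and the is_staff and is_bot fields are hypothetical. A real flag platform would express the same idea in its own rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    user_id: str  # Stable logged-in ID, not a device ID.
    plan: str
    region: str
    is_staff: bool = False
    is_bot: bool = False


@dataclass(frozen=True)
class TargetingRule:
    plans: frozenset[str]
    regions: frozenset[str]

    def matches(self, user: UserContext) -> bool:
        """One flat AND of two membership checks; no nesting to review."""
        return user.plan in self.plans and user.region in self.regions


def counts_toward_metrics(user: UserContext) -> bool:
    """Staff and bots may see the feature but stay out of the metrics."""
    return not (user.is_staff or user.is_bot)
```

Keeping exposure (matches) separate from measurement (counts_toward_metrics) lets staff test the feature without polluting the results.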

Privacy, accessibility, and ethics (non-negotiable)

We protect user privacy by not sharing personal data with vendors. We use pseudonymous IDs and set clear data retention limits.

Users can delete their information when needed. Our accessibility testing includes screen readers and keyboard navigation.

We check color contrast and proper labels. We track accessibility issues as key metrics.

We practice fair growth through clear consent and honest pricing. Users can easily cancel or reverse changes.

We never withhold safety fixes from any user group. If these standards fail, we pause or roll back right away.

Pitfalls & better alternatives

Flag debt happens when old toggles pile up in our codebase. We set expiry dates when we create flags and auto-create a removal ticket due within two sprints of hitting 100%, which keeps our repositories clean.

Config drift across environments causes bugs. We version flag configs and promote them like code through our GitHub workflows.

No clear ownership leads to abandoned flags. Every flag needs an owner, purpose, metrics, and stop rules in our registry.

Measuring views instead of exposure gives wrong data. We log exposure events and join them with outcomes for accurate results.

Skipping dark launches for infrastructure changes creates risk. We always run 0% tests on queues, caches, and services under real load first.

Jumping straight to 100% rollout skips safety checks. We use at least three steps and hold each phase until the data is stable.

Mini FAQ

How many flags should we maintain? Keep only the flags we can actively manage. We need a registry that tracks the owner, purpose, and expiry date for each flag. We should retire flags promptly when they’re no longer needed.

Who controls rollout timing? Product managers and engineering teams share this responsibility. Product managers set the metrics and guardrails. Engineering teams ensure reliability and maintain the kill switch.

Should we buy or build flag tooling? We should buy existing tools for speed and governance features. Only build custom solutions if we have strict data residency requirements or need custom routing. We must include total cost of ownership and on-call expenses in our decision.

How long should we keep control groups? We should keep 5-10% of users in a hold-back control group for 1-2 weeks after the final rollout step, so the feature effectively reaches 90-95% of users during that window. This helps us catch any regressions before retiring the flag completely.

Copy-ready rollout checklist

Before 0% dark launch:

Set up success and guardrail events with dashboards.
Arm all alerts.
Test the kill switch to make sure it works.
Write down the rollback playbook.
Get the privacy and accessibility review sign-offs.

At each rollout step (5% / 25% / 50%):

Check exposures and outcome trends.
Review guardrails and support tickets.
Wait until signals look stable before going further.
Record a quick decision: scale, iterate, or roll back.

After reaching 100%:

Keep a 5-10% control group for 1-2 weeks.
Delete the feature flag and targeting rules.
Log an ADR (architecture decision record) with the full results.

We use feature flags to ship quietly and learn fast.

Clear metrics matter, as do tight guardrails, named owners, and disciplined cleanup.

Start small. Keep an eye on the right numbers.

And honestly, always make sure you can reverse a launch—moving fast shouldn’t mean breaking what matters.
