Jari Ängeslevä: “More than just evaluating, new frameworks must give clear signals: scale, redesign, or stop new services and policies”

This interview, conducted for the GTF Governance Handbook, currently in preparation by the GTF Governance Lab Geneva team, covers the important improvements suggested for the initial version of the Pathways of Progress methodology.

Jari P. Ängeslevä for GTF Government Tomorrow Summit Governance Lab Geneva

Jari P. Ängeslevä

A leading expert in public‑sector productivity and digital transformation. His current work focuses on turning pilots into measurable, scalable operational change through data, interoperability, and AI, with a strong emphasis on decision discipline, scale‑or‑kill governance, and outcomes over optics.

GTF: During the Lab discussion, it was suggested that, on top of pure evaluation, governments need “a framework to force hard decisions.” What does the initial, pre-Lab version of the Pathways of Progress framework concretely need that it currently lacks?


Jari P. Ängeslevä: If we take that reframing seriously, the framework has to stop being just an evaluative lens and become a gating system. In other words, it should reliably produce Scale / Redesign / Stop outcomes at specific moments. What it currently lacks are four concrete elements:

First, pre‑decided trade‑off rules. When dimensions conflict, the framework must say what wins. For example, in rule‑of‑law systems, equality‑under‑law has to override personalization unless legislation explicitly allows individualized treatment. Otherwise, tensions remain rhetorical rather than actionable.

Second, minimum thresholds (“good enough”). Each dimension needs explicit pass/fail thresholds. If those thresholds aren’t met, scaling is not an option.

Third, explicit kill criteria. I am talking about clear conditions that override sunk costs, such as: “not demonstrably better than the offline alternative,” “creates second‑ or third‑order harms without funded mitigation,” or “dependency risk is unchanged despite sovereignty claims.”

Fourth, a binding governance hook. The framework has to live inside procurement approvals, budget tranche releases, and renewals. If it sits outside those processes, it will remain advisory and will certainly be absorbed by inertia.

In practice, the decision logic could be very simple and revolve around five questions:

  1. Does it work end‑to‑end?

  2. Is it better than the alternative (including offline)?

  3. What breaks elsewhere if it succeeds?

  4. What dependencies does it create?

  5. What stops if this succeeds?

Each gate leads to one of three outcomes: Scale, Redesign, or Stop. There shouldn’t be a “continue exploring” option.
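
As a rough illustration of the decision logic above, a minimal Python sketch might look like the following. The field names and the mapping from answers to outcomes are assumptions made for illustration: here the three kill criteria (questions 2, 3, and 4) lead to Stop, and a failure on question 1 or 5 leads to Redesign. The framework itself does not prescribe this exact mapping.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Exactly three outcomes; there is deliberately no "continue exploring".
    SCALE = "Scale"
    REDESIGN = "Redesign"
    STOP = "Stop"


@dataclass(frozen=True)
class GateEvidence:
    # Hypothetical field names, one per question in the five-point test.
    works_end_to_end: bool            # Q1
    better_than_alternative: bool     # Q2, including the offline alternative
    harms_mitigation_funded: bool     # Q3, True if nothing breaks or mitigation is funded
    dependency_risk_acceptable: bool  # Q4
    something_stops: bool             # Q5, an existing activity is retired


def decide_gate(evidence: GateEvidence) -> Outcome:
    """Map the five answers to one outcome (assumed mapping, see lead-in)."""
    # Kill criteria override sunk costs.
    kill_criteria_hit = not (
        evidence.better_than_alternative
        and evidence.harms_mitigation_funded
        and evidence.dependency_risk_acceptable
    )
    if kill_criteria_hit:
        return Outcome.STOP
    if evidence.works_end_to_end and evidence.something_stops:
        return Outcome.SCALE
    # Passes the kill criteria but is not ready: send it back for redesign.
    return Outcome.REDESIGN
```

A pilot that passes every question except “what stops if this succeeds?” would come back as Redesign under these assumptions, which matches the point that layering is not progress.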

GTF: If the five‑point test you are suggesting becomes a flagship tool, who is it for, when is it used, and what needs to change?

JA: If published, the primary user is not “policy professionals” but senior decision‑makers (ministers, mayors, agency heads) who can stop or scale initiatives. These are the people who manage budgets and have their finger on the kill switch. The secondary user is the service owner, who must justify continuation and renewal.

There are three moments in the lifetime of a new service or policy when their decisions are crucial. First, before procurement, to block “digitize the broken process” initiatives. Then, after a pilot and before scale, to prevent pilot purgatory. Finally, at annual renewal, to stop quietly failing services from continuing by default.

To make it land, we must add a simple scoring logic (for example, colors—red/amber/green—or simply pass/fail) so it drives an outcome. We must also ask one forcing question: “What will stop if this succeeds?” If nothing stops, the result is layering, not progress.

And another incredibly important change: all methodological jargon has to go. Keep the tool to one page, with one concrete example per question. Its strength is that it can be used in under a minute. If it is complex and written in bureaucratic lingo, it will most likely never work as intended.

GTF: During the Governance Lab Geneva meeting, one participant suggested we view the State across four layers: sovereign functions; services to citizens; mixed services such as health or education; and economic functions, from regulation to fiscal policy. What is your take on this from a service‑evaluation angle?

JA: The four‑level typology would be most powerful as the backbone of the evaluation chapter of your future Handbook, because it prevents applying one universal service logic to everything—a danger very well underscored by the Lab meeting. I would structure it as a combination of Non‑negotiables, Acceptable trade‑offs, and Metrics—with an example for each level to ensure better understanding.

  • Sovereign functions (defense, justice).
    Non‑negotiables: legality, security, resilience, auditability.
    Metrics: continuity under stress, integrity incidents, time‑to‑restore, traceability.
    Example: services can be seamless for users, but accountability and audit trails must always be explicit.

  • Social/citizen services (benefits, permits).
    Non‑negotiables: accessibility, language, transparency of decisions.
    Metrics: time‑to‑decision, appeal/overturn rates, drop‑off rates, inclusion.
    Example: proactively identifying eligible citizens is progress; focusing only on fraud control is not.

  • Mixed services (health, education).
    Non‑negotiables: safety, outcome quality, human oversight.
    Metrics: waiting times, avoidable demand, staff time recovered, outcome measures.
    Example: AI that reduces documentation burden and improves triage is legitimate; AI that replaces judgment without guardrails is not.

  • Economic functions (regulation, fiscal policy).
    Non‑negotiables: predictability, market trust, anti‑corruption.
    Metrics: compliance costs, time‑to‑comply, enforcement quality, volatility of policy signals.
    Example: fast regulation that increases uncertainty is worse than slower regulation that is stable.
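
One way to make the typology above operational is to hold it as data and check non‑negotiables before looking at any metric. The sketch below is a minimal illustration in Python. The layer keys, the function name, and the rule that missing evidence counts as a failure are assumptions, not part of the typology itself.

```python
# The four layers, with the non-negotiables and metrics listed above.
LAYERS = {
    "sovereign": {
        "non_negotiables": ["legality", "security", "resilience", "auditability"],
        "metrics": ["continuity under stress", "integrity incidents",
                    "time-to-restore", "traceability"],
    },
    "citizen_services": {
        "non_negotiables": ["accessibility", "language", "transparency of decisions"],
        "metrics": ["time-to-decision", "appeal/overturn rates",
                    "drop-off rates", "inclusion"],
    },
    "mixed": {
        "non_negotiables": ["safety", "outcome quality", "human oversight"],
        "metrics": ["waiting times", "avoidable demand",
                    "staff time recovered", "outcome measures"],
    },
    "economic": {
        "non_negotiables": ["predictability", "market trust", "anti-corruption"],
        "metrics": ["compliance costs", "time-to-comply",
                    "enforcement quality", "volatility of policy signals"],
    },
}


def failed_non_negotiables(layer: str, assessment: dict) -> list:
    """Return the non-negotiables of a layer that the assessment does not satisfy.

    Assumption: a non-negotiable with no evidence in `assessment` counts as failed.
    """
    return [
        item
        for item in LAYERS[layer]["non_negotiables"]
        if not assessment.get(item, False)
    ]


# Usage: a sovereign-function service with no evidence of auditability.
missing = failed_non_negotiables(
    "sovereign", {"legality": True, "security": True, "resilience": True}
)
# missing == ["auditability"]
```

Keeping the layers as separate entries means a citizen‑service metric such as drop‑off rate is never applied to a sovereign function by default.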

GTF: The “what should stop” question is one of the strongest points in your feedback. What are some concrete categories of public-sector work that the framework should be willing to recommend stopping, even where there are sunk costs and political careers attached?

JA: If the framework can’t recommend stopping anything, it becomes ceremonial. To not let that happen—and we must absolutely not let that happen!—I suggest breaking the key “targets for demolition” down into three categories.

The first is what I would call “digital skin on a broken process.” It is essentially front‑end digitization with unchanged back‑end logic, something we see far too often. This produces a faster broken process, not an improvement.

The second category is sovereignty theatre procurement. We all know “sovereign” solutions that leave dependency risk unchanged (e.g., foreign control planes, updates, or legal exposure). If dependency doesn’t change, the claim should be challenged.

Finally, we must think about metric‑vanity programmes. These are initiatives optimized for a single visible metric (coverage, number of services, PR) with unmanaged second- and third-order effects and no clear owner for overall outcomes. Definitely a good target for destruction!

GTF: What would scale‑or‑kill gates look like for government pilots?

JA: To avoid pilot purgatory, gates should be calendar‑based and budget‑based, with funding released only on evidence:

  • Gate 0 (Week 0): Baseline locked, success metrics defined, and “what stops if successful” agreed.

  • Gate 1 (Week 6–8): Feasibility — works end‑to‑end in one environment, fallback tested.

  • Gate 2 (Month 3): Net improvement versus offline baseline demonstrated. If not, default outcome is stop or redesign.

  • Gate 3 (Month 6): Second‑ and third‑order effects quantified and mitigation funded. If harms exceed mitigation capacity, stop scaling.

  • Gate 4 (Month 9–12): Scale readiness — operations, training, procurement in place, and unit cost decreases with scale. If unit cost rises, it’s bespoke bureaucracy in software form.
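
The gate schedule above can be sketched as a small data structure in which funding stops at the first gate without evidence. This is a minimal Python illustration. The one‑tranche‑per‑gate rule, the class name, and the function names are assumptions, since the interview only says that funding is released on evidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    number: int
    due: str
    evidence: str


# Gates as described above; the evidence text is abbreviated.
GATES = [
    Gate(0, "Week 0", "baseline locked, success metrics defined, what stops agreed"),
    Gate(1, "Week 6-8", "works end-to-end in one environment, fallback tested"),
    Gate(2, "Month 3", "net improvement versus offline baseline"),
    Gate(3, "Month 6", "second- and third-order effects quantified, mitigation funded"),
    Gate(4, "Month 9-12", "scale readiness, unit cost decreases with scale"),
]


def first_blocked_gate(evidence_met: dict) -> Optional[Gate]:
    """Return the first gate whose evidence is missing, or None if all pass.

    `evidence_met` maps gate number to True/False; a missing entry counts as
    not met, so a pilot cannot continue by default.
    """
    for gate in GATES:
        if not evidence_met.get(gate.number, False):
            return gate
    return None


def tranches_released(evidence_met: dict) -> int:
    """Assumed rule: one budget tranche per gate passed, in order."""
    blocked = first_blocked_gate(evidence_met)
    # Gate numbers run 0..4, so the blocked gate's number equals gates passed.
    return len(GATES) if blocked is None else blocked.number


# Usage: evidence exists for Gates 0, 1, and 3, but not for Gate 2.
status = {0: True, 1: True, 2: False, 3: True}
blocked = first_blocked_gate(status)  # Gate 2, due at Month 3
released = tranches_released(status)  # 2 tranches, for Gates 0 and 1
```

Because gates are checked in order, later evidence cannot compensate for a failed earlier gate, which is the point of tying funding to the calendar.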

GTF: Can you give some examples of technical patterns that unlock proactive, cross‑service evaluation?

JA: Two patterns that turn technology from a risk factor into an enabler are the much‑discussed event‑driven life‑event architecture and policy‑as‑code with audit‑first decisioning.

The former, event‑driven life‑event architecture, allows administrations to model key life events (job loss, childbirth, illness, and so on) as events that services subscribe to. This makes second‑order effects across agencies visible and measurable instead of anecdotal.
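
A minimal in‑memory sketch of the life‑event pattern, in Python, might look like this. The class name, the event and service names, and the handlers are hypothetical, and a real deployment would use a message broker rather than a Python object.

```python
from collections import defaultdict


class LifeEventBus:
    """Toy publish/subscribe bus for life events (illustrative only)."""

    def __init__(self):
        # event type -> list of (service name, handler) pairs
        self._subscribers = defaultdict(list)
        # One record per published event, so cross-agency effects are visible.
        self.trace = []

    def subscribe(self, event_type, service, handler):
        self._subscribers[event_type].append((service, handler))

    def publish(self, event_type, citizen_id):
        # Every subscribed service reacts; reactions are collected in order.
        reactions = [
            (service, handler(citizen_id))
            for service, handler in self._subscribers[event_type]
        ]
        self.trace.append(
            {"event": event_type, "citizen": citizen_id, "reactions": reactions}
        )
        return reactions


# Usage: two hypothetical agencies subscribe to the same life event.
bus = LifeEventBus()
bus.subscribe("job_loss", "unemployment_benefits", lambda cid: "open claim")
bus.subscribe("job_loss", "health_insurance", lambda cid: "review coverage")
reactions = bus.publish("job_loss", "citizen-001")
```

The `trace` list is what makes second‑order effects measurable here: each event records which services reacted and how.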

The latter, policy‑as‑code, encodes eligibility rules and thresholds as versioned policy artifacts and requires every automated decision to produce a human‑readable audit log. This enables explainability, consistency across agencies, and faster policy change without rewriting systems.
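
As a hedged illustration of policy‑as‑code with an audit record, the Python sketch below keeps the threshold in a versioned policy object and returns a human‑readable explanation with every decision. The policy name, the income ceiling, and the figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # A versioned policy artifact; the threshold lives in data, not in code.
    name: str
    version: str
    income_ceiling: float  # hypothetical eligibility threshold


def decide_eligibility(policy: Policy, applicant_id: str, monthly_income: float) -> dict:
    """Apply the policy and return the decision together with its audit record."""
    eligible = monthly_income <= policy.income_ceiling
    comparison = "at or below" if eligible else "above"
    return {
        "applicant_id": applicant_id,
        "policy": policy.name,
        "policy_version": policy.version,
        "eligible": eligible,
        "explanation": (
            f"Monthly income {monthly_income:.2f} is {comparison} the ceiling "
            f"{policy.income_ceiling:.2f} set by {policy.name} v{policy.version}."
        ),
    }


# Usage: a policy change is a new version, not a system rewrite.
V1 = Policy("housing_support", "1.0", 1500.0)
V2 = Policy("housing_support", "2.0", 1800.0)
before = decide_eligibility(V1, "applicant-001", 1600.0)
after = decide_eligibility(V2, "applicant-001", 1600.0)
```

The same applicant is ineligible under version 1.0 and eligible under version 2.0, and each record states which version produced the decision.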

GTF Content Team

Government Tomorrow Forum content team
