Architecture Overview

What is Subscription Vending?

ITL Subscription Vending is a FastAPI microservice that automatically provisions new Azure subscriptions the moment they are created. Rather than requiring manual post-creation steps, the service hooks into Azure Event Grid and runs a deterministic provisioning workflow for every new subscription.


High-level flow

Azure Portal / IaC
     │
     │  creates subscription via
     │  Microsoft.Subscription/aliases/write
     ▼
Azure Event Grid
  (Microsoft.Resources.Subscriptions topic)
     │
     │  HTTP POST  aeg-event-type: Notification
     ▼
ITL Subscription Vending  (POST /webhook/)
     │
     ├─ 0. Read subscription tags
     ├─ 1. Move to management group
     ├─ 2. Attach ITL Foundation Initiative (via Authorization service)
     ├─ 3. Create RBAC role assignments
     ├─ 4. Assign default Azure Policies
     ├─ 5. Create cost-budget alert  (if itl-budget tag present)
     ├─ 6. Publish outbound notification event  (if VENDING_EVENT_GRID_TOPIC_ENDPOINT set)
     ├─ [Custom steps — auto-discovered from extensions/, topologically ordered]
     └─ [Lifecycle events — STARTED / COMPLETED / SUCCEEDED / FAILED]

Internal architecture

The service is designed as a Microkernel / Plugin pipeline engine. The core framework is fixed; all provisioning logic is pluggable.

handlers/                  driving adapters (FastAPI routers)
     │
infrastructure/queue/      retry strategy: none / queue / dead_letter
     │
workflow/engine.py         WorkflowEngine — built-in steps 1–6 + toposort
     │
core/registry.py           step + gate registries
     │
core/base.py               BaseStep ABC — plugin contract
     │
core/context.py            StepContext, ProvisioningResult — pure dataclasses
     │
infrastructure/azure/      Azure SDK adapters
     │
core/events.py             lifecycle bus: STARTED / COMPLETED / SUCCEEDED / FAILED
     │
extensions/                auto-discovered plugins

Layer responsibilities

Layer / folder Responsibility
core/ Pure domain + framework skeleton: BaseStep ABC, StepContext, ProvisioningResult, Settings, step registry, lifecycle event bus, port contracts (protocols.py), exception hierarchy (exceptions.py). No business logic, no I/O.
schemas/ Pydantic HTTP surface contracts. Only imported by handlers/.
infrastructure/azure/ All Azure SDK calls. No workflow orchestration.
infrastructure/queue/ Retry strategy dispatcher and Azure Storage Queue client.
extensions/ Auto-discovered plugins. Each module self-registers at import. Only modules starting with __ are excluded from discovery.
handlers/ FastAPI routers, each as a sub-package (event_grid/, worker/, preflight/, replay/, mock/).
core/config.py All settings via Pydantic BaseSettings. Use get_settings() — never instantiate Settings() directly.
workflow/ Package: engine.py hosts WorkflowEngine; steps.py defines built-in steps 1–6.

Component breakdown

FastAPI application (src/subscription_vending/)

Module Responsibility
main.py Application factory; mounts routers; exposes /health; calls autodiscover() inside lifespan
core/config.py Pydantic-settings Settings class; get_settings() singleton (@lru_cache); loads VENDING_* env vars
core/context.py StepContext and ProvisioningResult — pure domain dataclasses with no I/O or framework dependencies
core/job.py ProvisioningJob dataclass — payload written to / read from the retry queue
core/enums.py RetryStrategy enum (none, queue, dead_letter)
core/base.py BaseStep ABC — base class for all custom provisioning steps
core/events.py Lifecycle event bus — LifecycleEvent, on(), emit()
core/registry.py Step and gate registries (_EXTRA_STEPS, _GATE_STEPS), register_step(), register_gate(), _toposort()
core/protocols.py Port contracts: ManagementGroupPort, RbacPort, PolicyPort, NotificationPort, TagReaderPort — all @runtime_checkable Protocol
core/exceptions.py Typed exception hierarchy: AppError → ProvisioningError (GateCheckFailed, StepFailed) \| AzureIntegrationError (ManagementGroupError, RbacError, PolicyError, NotificationError) \| ConfigurationError \| AuthorizationError
schemas/event_grid.py HTTP surface contracts: EventGridEvent, EventGridEventData, WebhookResponse, HealthResponse
workflow/engine.py WorkflowEngine class — run() method orchestrates gates + built-in steps + custom steps + lifecycle events.
workflow/steps.py Built-in provisioning steps 1–6, each decorated with @register_step
handlers/event_grid/ POST /webhook/ — receives Event Grid deliveries, validates SAS key, dispatches to WorkflowEngine
handlers/worker/ POST /worker/process-job — dequeues and processes a ProvisioningJob from Azure Storage Queue
handlers/preflight/ POST /webhook/preflight — dry-run plan: runs gates + simulates steps, no Azure mutations
handlers/replay/ POST /webhook/replay — manually re-triggers the workflow for a given subscription ID
handlers/mock/ POST /webhook/test — mock endpoint (enabled when VENDING_MOCK_MODE=true)
extensions/ Auto-discovered extension modules; each self-registers at import time via module-level code
extensions/webhook_notify.py Optional step: POST result as JSON to a plain HTTPS webhook (VENDING_WEBHOOK_URL)
extensions/api_notify.py Optional step: POST result as JSON to a REST API with Bearer token auth (VENDING_API_NOTIFY_URL)
extensions/servicenow_check.py Optional gate: validate a ServiceNow ticket before provisioning starts (VENDING_SNOW_INSTANCE)
extensions/servicenow_feedback.py Optional step: PATCH the ServiceNow ticket with the provisioning outcome (VENDING_SNOW_INSTANCE)
infrastructure/azure/management_groups.py Moves a subscription under a target management group
infrastructure/azure/notifications.py Publishes an outbound ITL.SubscriptionVending.SubscriptionProvisioned event to an Azure Event Grid Custom Topic (Step 6). Enabled only when VENDING_EVENT_GRID_TOPIC_ENDPOINT is set.
infrastructure/azure/rbac.py Creates initial RBAC role assignments on the subscription scope
infrastructure/azure/policy.py Assigns default Azure Policies; attaches the ITL Foundation Initiative via the Authorization service
infrastructure/azure/tags.py Reads subscription tags from Azure and converts them to a SubscriptionConfig dataclass
infrastructure/azure/credential.py Selects ClientSecretCredential or ManagedIdentityCredential based on settings
infrastructure/queue/dispatcher.py Routes a completed/failed job to the configured retry strategy
infrastructure/queue/azure_queue.py Azure Storage Queue client — enqueue and dequeue ProvisioningJob messages

Azure infrastructure (infra/)

The Bicep templates deploy:

Kubernetes (k8s/)

An alternative deployment path for teams running the service on AKS or any Kubernetes cluster:


Credential strategy

The service supports two authentication modes, selected automatically at startup:

Scenario Credential used
VENDING_AZURE_CLIENT_ID and VENDING_AZURE_CLIENT_SECRET are both set ClientSecretCredential (service principal)
Either variable is empty ManagedIdentityCredential

Managed Identity is the recommended option for Azure-hosted deployments.


External dependencies

Dependency Purpose
Authorization service (VENDING_AUTHORIZATION_SERVICE_URL) Attaches the ITL Foundation Policy Initiative to new subscriptions
Azure Management Groups API Moves subscriptions between management groups
Azure Authorization Management API Creates RBAC role assignments
Azure Resource Management API Assigns Azure Policy definitions
Azure Subscription API Reads subscription tags
Azure Consumption API Creates Cost Management budgets

Error handling

Every step in the provisioning workflow is wrapped in an independent try/except block. Failures in individual steps are logged and appended to ProvisioningResult.errors, but they do not abort the workflow — subsequent steps still execute. This ensures that a transient failure in one step (e.g. the Authorization service being temporarily unavailable) does not prevent management group placement or RBAC assignments.

Event Grid retries are also suppressed: the webhook always returns 200 OK so that Event Grid does not resend the event. Errors are surfaced exclusively through structured logs.