Please note: This post is part of a two-part series. This article focuses on the design and architecture of the solution, while the next part will be a deep dive into building the implementation.

Around seven years ago, when I started working as a developer, my mentality was purely: "Ship fast, break fast."

While I still partially agree with that sentiment, my perspective shifted significantly after I switched to consultancy around three years into my career. Consultancy landed me a project at a large corporate client where, for the first time, I encountered a separate Compliance Department.

I was initially fascinated by the granular detail of their work. However, that fascination quickly turned into frustration. Every week, I would hit the same deployment wall at least two or three times.

The problem

The compliance department enforced strict policies. While valid and necessary for security and governance, they became a bottleneck. My workflow for resolving these errors was repetitive and tedious:

1. Deploy resources.
2. Fail due to a policy error.
3. Investigate why the policy declined the changes.
4. Attempt a fix and redeploy.
5. Repeat 4-5 times until successful.

Because this process was so cumbersome, I had a realization: I’m a developer. I can automate this.

Design

Current situation

Before automating the solution, I needed to map out the manual process I was trying to replace. It looked straightforward but required manual intervention at every step. I needed a tool that could notify me instantly when a deployment failed, so I wouldn't have to manually monitor deployments or dig through logs.

That is where I discovered Azure Event Grid System Topics.

What are Event Grid System Topics?

To summarize, Azure Event Grid is a publish-subscribe service for message distribution. It allows services to send events (publish) and receive events (subscribe).

Source: Microsoft Learn

Publishers send these events to a Topic, an endpoint that collects the events from a source. A Subscriber (the event handler) receives them via a Subscription. For system topics, Event Grid pushes each event to the handler, so the handler does not have to poll a queue. You can add filters to these subscriptions so you only receive the events you actually care about.

A topic is called a System Topic when Azure services themselves (like your Azure Subscription or Resource Group) are the ones publishing the events. See https://learn.microsoft.com/en-us/azure/event-grid/system-topics for a full list.
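To make the topic, subscription, and filter vocabulary concrete, here is a minimal in-memory sketch. It does not use the Event Grid SDK; the Topic and Subscription classes and the included_types filter are hypothetical names chosen for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

Event = dict  # in this sketch an event is a plain dict with at least a "type" key


@dataclass
class Subscription:
    handler: Callable[[Event], None]  # the subscriber (event handler)
    included_types: set[str]          # filter: only these event types are delivered


@dataclass
class Topic:
    subscriptions: list[Subscription] = field(default_factory=list)

    def subscribe(self, handler: Callable[[Event], None], included_types: list[str]) -> None:
        self.subscriptions.append(Subscription(handler, set(included_types)))

    def publish(self, event: Event) -> None:
        # Push the event to every subscription whose filter matches its type.
        for sub in self.subscriptions:
            if event["type"] in sub.included_types:
                sub.handler(event)


received: list[Event] = []
topic = Topic()
topic.subscribe(received.append, ["Microsoft.Resources.ResourceWriteFailure"])
topic.publish({"type": "Microsoft.Resources.ResourceWriteSuccess", "id": "1"})
topic.publish({"type": "Microsoft.Resources.ResourceWriteFailure", "id": "2"})
print([e["id"] for e in received])  # only the failure event passes the filter
```

The handler never sees the success event, which is the same effect an event-type filter on a real subscription is meant to have.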

Automation design

Now that we understand the tools, let’s look at the automation architecture.

We can create a System Topic at the Subscription level. This topic exposes several events, including Microsoft.Resources.ResourceWriteFailure. This event is raised when a create or update operation on a resource fails, which is exactly what happens when a policy denies my changes.

When we look at the raw event payload, it looks something like this (Note: The structure below is a ResourceWriteSuccess example, but the schema adheres to the CloudEvents spec and is identical for failures):

[
    {
        "subject": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
        "topic": "/subscriptions/{subscription-id}",
        "type": "Microsoft.Resources.ResourceWriteSuccess",
        "time": "2018-07-19T18:38:04.6117357Z",
        "id": "4db48cba-50a2-455a-93b4-de41a3b5b7f6",
        "data": {
            "authorization": {
                "scope": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
                "action": "Microsoft.Storage/storageAccounts/write",
                "evidence": {
                    "role": "Subscription Admin"
                }
            },
            "claims": {},
            "correlationId": "{ID}",
            "resourceProvider": "Microsoft.Storage",
            "resourceUri": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
            "operationName": "Microsoft.Storage/storageAccounts/write",
            "status": "Succeeded",
            "subscriptionId": "{subscription-id}",
            "tenantId": "{tenant-id}"
        },
        "specversion": "1.0"
    }
]

This JSON gives us the event type, the data (what we were trying to do), and, most importantly, a correlationId.
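As a rough sketch of how a handler might pull those fields out of the payload, the snippet below uses only the standard library. The extract_failure helper and the trimmed-down sample payload are made up for illustration; the field names are taken from the sample above.

```python
from __future__ import annotations

import json

FAILURE_TYPE = "Microsoft.Resources.ResourceWriteFailure"


def extract_failure(event: dict) -> dict | None:
    """Return the fields we need from a failure event, or None for other event types."""
    if event.get("type") != FAILURE_TYPE:
        return None
    data = event.get("data", {})
    return {
        "correlation_id": data.get("correlationId"),
        "resource_uri": data.get("resourceUri"),
        "operation": data.get("operationName"),
        "subscription_id": data.get("subscriptionId"),
    }


# Made-up payload in the same shape as the sample above, reduced to the fields we read.
raw = json.dumps([{
    "type": FAILURE_TYPE,
    "data": {
        "correlationId": "00000000-0000-0000-0000-000000000000",
        "resourceUri": "/subscriptions/sub-id/resourcegroups/rg/providers/Microsoft.Storage/storageAccounts/demo",
        "operationName": "Microsoft.Storage/storageAccounts/write",
        "subscriptionId": "sub-id",
    },
}])

# The payload is a JSON array, so every event is checked and non-failures are dropped.
failures = [f for f in map(extract_failure, json.loads(raw)) if f is not None]
print(failures[0]["correlation_id"])
```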

Correlation IDs

There is a catch: the Event Grid event tells us that a failure happened, but it doesn't tell us why (which specific policy was violated).

Luckily, Azure logs everything. If we look at the Activity Logs of the subscription, we can see the full policy violation details. The magic key here is the CorrelationId. The ID in the Event Grid message matches the ID in the Activity Log.
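One way to look up those entries is the Activity Log REST API with an OData filter on the correlation ID. The sketch below only builds the request URL and makes no network call. The endpoint path, the api-version value, and the one-hour time window are my assumptions rather than something this article prescribes, so check them against the current Azure Monitor documentation before relying on them.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, quote, urlencode, urlsplit

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_activity_log_url(subscription_id: str, correlation_id: str,
                           event_time: datetime,
                           window: timedelta = timedelta(hours=1)) -> str:
    """Build a URL that lists Activity Log entries sharing the given correlation ID."""
    event_time = event_time.astimezone(timezone.utc)
    start = (event_time - window).strftime(TIMESTAMP_FORMAT)
    end = (event_time + window).strftime(TIMESTAMP_FORMAT)
    odata_filter = (
        f"eventTimestamp ge '{start}' and eventTimestamp le '{end}' "
        f"and correlationId eq '{correlation_id}'"
    )
    # Assumed API version; percent-encode spaces and quotes in the filter.
    query = urlencode({"api-version": "2015-04-01", "$filter": odata_filter},
                      quote_via=quote)
    return (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/providers/Microsoft.Insights/eventtypes/management/values?{query}"
    )


url = build_activity_log_url(
    "sub-id", "abc-123", datetime(2018, 7, 19, 18, 38, 4, tzinfo=timezone.utc)
)
# Decode the query again to show the filter that would be sent.
print(parse_qs(urlsplit(url).query)["$filter"][0])
```

A real call would also need a bearer token for Azure Resource Manager, which is left out here.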

The solution workflow

By combining these elements, we can build a fully automated loop:

1. Event Grid publishes a ResourceWriteFailure event.
2. An Azure Function receives the event and uses the CorrelationId to query the Azure Activity Log.
3. The function extracts the PolicyDefinitionId and AssignmentId from the log, and fetches the relevant policy definitions and assignments in Azure.
4. The function fetches all infrastructure as code (IaC) files from the GitHub repository.
5. All this data (policy, error, IaC code) is sent to Azure AI Foundry. The AI analyzes the conflict, generates a GitHub issue, and assigns it to GitHub Copilot.
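The middle of the workflow above can be sketched in a few lines. This is not the implementation from Part 2. It assumes that Activity Log entries carry policy details as a JSON-encoded string in properties.policies with policyDefinitionId and policyAssignmentId fields. The helpers extract_policy_refs and build_prompt, the sample data, and the main.bicep file name are all hypothetical.

```python
from __future__ import annotations

import json


def extract_policy_refs(log_entries: list[dict]) -> list[tuple[str, str]]:
    """Collect unique (policyDefinitionId, policyAssignmentId) pairs from log entries."""
    refs: list[tuple[str, str]] = []
    for entry in log_entries:
        # Assumption: policy details are a JSON-encoded string in properties.policies.
        raw = entry.get("properties", {}).get("policies")
        if not raw:
            continue
        for policy in json.loads(raw):
            ref = (policy.get("policyDefinitionId"), policy.get("policyAssignmentId"))
            if all(ref) and ref not in refs:
                refs.append(ref)
    return refs


def build_prompt(error: str, policy_refs: list[tuple[str, str]],
                 iac_files: dict[str, str]) -> str:
    """Bundle the error, policy references, and IaC sources into one text block."""
    lines = [f"Deployment error: {error}", "Violated policies:"]
    lines += [f"- definition={d} assignment={a}" for d, a in policy_refs]
    for path, content in iac_files.items():
        lines += [f"--- {path} ---", content]
    return "\n".join(lines)


# Made-up sample data for illustration only.
entries = [
    {"properties": {"policies": json.dumps([
        {"policyDefinitionId": "def-1", "policyAssignmentId": "assign-1"},
    ])}},
    {"properties": {}},  # entries without policy details are skipped
]
refs = extract_policy_refs(entries)
prompt = build_prompt("RequestDisallowedByPolicy", refs,
                      {"main.bicep": "// storage account definition"})
print(prompt)
```

Keeping the extraction and the prompt assembly as pure functions makes them easy to test without touching Azure, GitHub, or the model.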

What's next?

We now have a solid architectural blueprint. We know how to capture the error signal via Event Grid, how to bridge that signal to the detailed Activity Logs using the Correlation ID, and how we intend to use AI to generate the fix.

However, a design is only as good as its execution.

In Part 2 of this series, we will stop planning and start coding. I will walk you through the actual implementation, including:

Setting up the Event Grid System Topic in Azure.
Writing the Azure Function.
Integrating Azure AI Foundry to automatically generate a detailed GitHub issue for each policy violation and assign it to GitHub Copilot.

Stay tuned for the code deep dive!

Part 1 - Design: Automate Azure Policy Remediation directly in your repository with Event Grid System Topics
