Part 1 - Design: Automate Azure Policy Remediation directly in your repository with Event Grid System Topics
Hi, I’m Hans, a .NET developer from the Netherlands with over six years of experience, mostly working with cloud technologies. I’m certified in Azure (AZ-104, AZ-204, AZ-400, AZ-305, AI-102), CKAD, and PSM1, and I’m also a Microsoft Certified Trainer (MCT), so I enjoy sharing what I’ve learned along the way.
I’m interested in just about every new tech that comes my way, which keeps me constantly learning. It’s a bit of a blessing and a curse, but it keeps things interesting.
Outside of work, I’m really into strength training and try to hit the gym four times a week. It helps me stay focused and balanced. On this blog, I’ll be sharing my thoughts, tips, and lessons learned from my work in .NET and Azure, and hopefully, it’ll be useful to anyone on a similar path.
Please note: This post is part of a two-part series. This article focuses on the design and architecture of the solution, while the next part will be a deep dive into building the implementation.
Around seven years ago, when I started working as a developer, my mentality was purely: "Ship fast, break fast."
While I still partially agree with that sentiment, my perspective shifted significantly after I switched to consultancy around three years into my career. Consultancy landed me a project at a large corporate client where, for the first time, I encountered a separate Compliance Department.
I was initially fascinated by the granular detail of their work. However, that fascination quickly turned into frustration. Every week, I would hit the same deployment wall at least two or three times.

The problem
The compliance department enforced strict policies. While valid and necessary for security and governance, they became a bottleneck. My workflow for resolving these errors was repetitive and tedious:
Deploy resources.
Fail due to a policy error.
Investigate why the policy declined the changes.
Attempt a fix and redeploy.
Repeat 4-5 times until successful.
Because this process was so cumbersome, I had a realization: I’m a developer. I can automate this.
Design
Current situation

Before automating the solution, I needed to map out the manual process I was trying to replace. It looked straightforward but required manual intervention at every step. I needed a tool that could notify me instantly when a deployment failed, so I wouldn't have to manually monitor deployments or dig through logs.
That is where I discovered Azure Event Grid System Topics.
What are Event Grid System Topics?
To summarize, Azure Event Grid is a publish-subscribe service for message distribution. It allows services to send events (publish) and receive events (subscribe).

Source: Microsoft Learn
These events are sent to a Topic, an endpoint that publishers send their events to. A Subscriber (the event handler) listens to a topic via a Subscription, and Event Grid pushes matching events to it. You can add filters to these subscriptions so you only receive the events you actually care about.
A topic is called a System Topic when Azure services themselves (like your Azure Subscription or Resource Group) are the ones publishing the events (see https://learn.microsoft.com/en-us/azure/event-grid/system-topics for a full list).
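To make the filtering idea concrete, here is a minimal Python sketch of what an event-type filter does conceptually. It is an in-memory illustration only, not the Event Grid API: in Azure the filter is configured on the event subscription and applied by the service before delivery. The names `INCLUDED_EVENT_TYPES` and `matches_subscription` are hypothetical.

```python
# Hypothetical stand-in for a subscription filter: an allow-list of event types.
INCLUDED_EVENT_TYPES = frozenset({"Microsoft.Resources.ResourceWriteFailure"})


def matches_subscription(event, included_types=INCLUDED_EVENT_TYPES):
    """Return True when the event's type is on the allow-list."""
    return event.get("type") in included_types


# Three made-up events published to the same topic.
events = [
    {"type": "Microsoft.Resources.ResourceWriteSuccess"},
    {"type": "Microsoft.Resources.ResourceWriteFailure"},
    {"type": "Microsoft.Resources.ResourceDeleteSuccess"},
]

# Only events that pass the filter would reach the handler.
delivered = [event for event in events if matches_subscription(event)]
```

With this filter in place, the handler only ever sees failed writes, so it does not have to discard successful deployments itself.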

Automation design
Now that we understand the tools, let’s look at the automation architecture.
We can create a System Topic at the Subscription level. This topic exposes several events, including Microsoft.Resources.ResourceWriteFailure. This event is published whenever a resource create or update operation fails, which is exactly what happens when a policy denies my changes.
When we look at the raw event payload, it looks something like this (Note: The structure below is a ResourceWriteSuccess example, but the schema adheres to the CloudEvents spec and is identical for failures):
[
  {
    "subject": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
    "source": "/subscriptions/{subscription-id}",
    "type": "Microsoft.Resources.ResourceWriteSuccess",
    "time": "2018-07-19T18:38:04.6117357Z",
    "id": "4db48cba-50a2-455a-93b4-de41a3b5b7f6",
    "data": {
      "authorization": {
        "scope": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
        "action": "Microsoft.Storage/storageAccounts/write",
        "evidence": {
          "role": "Subscription Admin"
        }
      },
      "claims": {},
      "correlationId": "{ID}",
      "resourceProvider": "Microsoft.Storage",
      "resourceUri": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
      "operationName": "Microsoft.Storage/storageAccounts/write",
      "status": "Succeeded",
      "subscriptionId": "{subscription-id}",
      "tenantId": "{tenant-id}"
    },
    "specversion": "1.0"
  }
]
This JSON gives us the event type, the data (what we were trying to do), and, most importantly, a correlationId.
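As a rough illustration of reading that payload, the Python sketch below pulls the correlationId, resourceUri, and operationName out of each ResourceWriteFailure event in a batch. It uses only the standard library and assumes the field names shown in the sample payload; `WriteFailure`, `parse_write_failures`, and all sample values are made up for the example and are not part of any Azure SDK.

```python
import json
from dataclasses import dataclass

FAILURE_TYPE = "Microsoft.Resources.ResourceWriteFailure"


@dataclass(frozen=True)
class WriteFailure:
    """Hypothetical container for the fields the later steps need."""
    correlation_id: str
    resource_uri: str
    operation_name: str


def parse_write_failures(body):
    """Return one WriteFailure per ResourceWriteFailure event in the body."""
    events = json.loads(body)
    if isinstance(events, dict):  # a single event instead of a batch
        events = [events]
    failures = []
    for event in events:
        if event.get("type") != FAILURE_TYPE:
            continue  # ignore successes and unrelated event types
        data = event.get("data", {})
        failures.append(
            WriteFailure(
                correlation_id=data["correlationId"],
                resource_uri=data["resourceUri"],
                operation_name=data["operationName"],
            )
        )
    return failures


# Made-up sample event, trimmed to the fields used above.
sample = json.dumps([
    {
        "type": FAILURE_TYPE,
        "data": {
            "correlationId": "00000000-0000-0000-0000-000000000000",
            "resourceUri": "/subscriptions/{subscription-id}/resourcegroups/{resource-group}/providers/Microsoft.Storage/storageAccounts/{storage-name}",
            "operationName": "Microsoft.Storage/storageAccounts/write",
        },
    }
])
failures = parse_write_failures(sample)
```

Keeping the parsed result in a small immutable type means the rest of the pipeline never has to touch the raw JSON again.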
Correlation IDs
There is a catch: the Event Grid event tells us that a failure happened, but it doesn't tell us why (that is, which specific policy was violated).
Luckily, Azure logs everything. If we look at the Activity Logs of the subscription, we can see the full policy violation details. The magic key here is the CorrelationId. The ID in the Event Grid message matches the ID in the Activity Log.
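To sketch how that lookup could work, the Python snippet below builds a request URL for the Azure Monitor Activity Logs REST API, filtered on the correlation ID. It only constructs the URL; authentication and the HTTP call are left out. The API version, the 30-minute window around the event time, and the helper name `activity_log_url` are assumptions for this example, so check them against the current REST reference before relying on them.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote, urlencode

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2015-04-01"  # assumption: version used by the Activity Logs list operation


def activity_log_url(subscription_id, correlation_id, event_time,
                     window=timedelta(minutes=30)):
    """Build an Activity Log query URL for a single correlation ID.

    event_time must be a timezone-aware datetime; the query covers
    event_time minus/plus `window`, expressed in UTC.
    """
    utc_time = event_time.astimezone(timezone.utc)
    start = (utc_time - window).strftime("%Y-%m-%dT%H:%M:%SZ")
    end = (utc_time + window).strftime("%Y-%m-%dT%H:%M:%SZ")
    odata_filter = (
        f"eventTimestamp ge '{start}' and eventTimestamp le '{end}' "
        f"and correlationId eq '{correlation_id}'"
    )
    # quote_via=quote encodes spaces as %20; safe="$" keeps the "$filter" key readable.
    query = urlencode(
        {"api-version": API_VERSION, "$filter": odata_filter},
        quote_via=quote,
        safe="$",
    )
    return (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/providers/Microsoft.Insights/eventtypes/management/values?{query}"
    )


url = activity_log_url(
    "{subscription-id}",
    "00000000-0000-0000-0000-000000000000",
    datetime(2018, 7, 19, 18, 38, 4, tzinfo=timezone.utc),
)
```

The time window is there because the filter expects a timestamp range next to the correlation ID; the event's own time field is a natural center for it.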
The solution workflow
By combining these elements, we can build a fully automated loop:
Event Grid detects a ResourceWriteFailure.
An Azure Function receives the event and uses the CorrelationId to query the Azure Activity Log.
The Function extracts the PolicyDefinitionId and AssignmentId from the log, and fetches the relevant policy definitions and assignments in Azure.
The Function fetches all infrastructure as code files from the GitHub repository.
All this data (Policy, Error, IaC code) is sent to Azure AI Foundry. The AI analyzes the conflict, generates a GitHub issue, and assigns it to GitHub Copilot.
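The extraction step in the workflow above could look roughly like the Python sketch below. It rests on an assumption about the log record's shape: that the denied request's error is stored as a JSON string in `properties.statusMessage`, with policy details under `error.additionalInfo[].info`. Verify that against a real Activity Log entry from your own subscription. `extract_policy_ids` and the sample record are hypothetical.

```python
import json


def extract_policy_ids(record):
    """Return (policyDefinitionId, policyAssignmentId) pairs from one log record.

    Assumed shape: record["properties"]["statusMessage"] is a JSON string
    holding {"error": {"additionalInfo": [{"type": ..., "info": {...}}]}}.
    """
    raw = record.get("properties", {}).get("statusMessage")
    try:
        message = json.loads(raw)
    except (TypeError, ValueError):
        return []  # missing or non-JSON status message
    if not isinstance(message, dict):
        return []
    pairs = []
    for item in message.get("error", {}).get("additionalInfo", []):
        info = item.get("info", {})
        if item.get("type") == "PolicyViolation" and "policyDefinitionId" in info:
            pairs.append((info["policyDefinitionId"], info.get("policyAssignmentId")))
    return pairs


# Made-up record that follows the assumed shape.
sample_record = {
    "correlationId": "00000000-0000-0000-0000-000000000000",
    "properties": {
        "statusMessage": json.dumps({
            "error": {
                "code": "RequestDisallowedByPolicy",
                "additionalInfo": [{
                    "type": "PolicyViolation",
                    "info": {
                        "policyDefinitionId": "/providers/Microsoft.Authorization/policyDefinitions/{definition-name}",
                        "policyAssignmentId": "/subscriptions/{subscription-id}/providers/Microsoft.Authorization/policyAssignments/{assignment-name}",
                    },
                }],
            }
        })
    },
}
policy_ids = extract_policy_ids(sample_record)
```

Returning an empty list for records without policy details lets the Function skip failures that were not caused by a policy, such as quota or validation errors.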

What's next?
We now have a solid architectural blueprint. We know how to capture the error signal via Event Grid, how to bridge that signal to the detailed Activity Logs using the Correlation ID, and how we intend to use AI to generate the fix.
However, a design is only as good as its execution.
In Part 2 of this series, we will stop planning and start coding. I will walk you through the actual implementation, including:
Setting up the Event Grid System Topic in Azure.
Writing the Azure Function.
Integrating Azure AI Foundry to automatically generate a detailed GitHub issue for each policy violation and assign it to GitHub Copilot.
Stay tuned for the code deep dive!

