Agent P and the Case of the Crashing Cloud: Deploying AI SRE Agents for Root Cause Analysis

AI Cloud Operations

Introductory and overview

Perry the Platypus may appear to be just an ordinary semi-aquatic mammal, but beneath that fedora lies the greatest secret agent in Danville — and your cloud infrastructure's most overbooked defender. With villains multiplying faster than Dr. Doofenshmirtz can build -inators, Perry faces a crisis familiar to every engineering team: there are simply too many incidents, too many alerts, and not enough Agent P to go around. The solution? Perry does what any resourceful secret agent would do — he builds a squad of AI Site Reliability Engineer (SRE) agents, each purpose-built and GitHub Copilot-powered, to monitor the cloud, triage alerts, and hunt down root causes across distributed systems while he handles the missions only he can take. Just like Perry, today's engineering teams must scale their incident response beyond human bandwidth, and AI agents are the new field operatives making that possible.

In this session, we'll follow Perry as he designs, deploys, and coordinates his AI SRE agent network — each agent assigned a specialty: log analysis, trace correlation, anomaly detection, and post-mortem drafting. We'll walk through a real-world multi-service failure scenario where the AI agents triage the incident, surface the most likely root causes, and brief the human Squad in plain language so engineers can make fast, confident decisions rather than drowning in dashboards. Powered by GitHub Copilot, these agents don't just surface data — they reason about it, suggest next investigative steps, and even draft the blameless post-mortem before Perry's had his morning platypus kibble. This talk is equal parts mission briefing and practical blueprint, applicable to any cloud platform, and designed to send you home ready to build your own AI-assisted incident response operation.

Attendees Will Learn

- Why AI SRE agents are the next evolution in cloud incident response and root cause analysis
- How to design purpose-built AI agents for specific RCA tasks: log triage, trace analysis, anomaly detection, and post-mortem generation
- How GitHub Copilot powers agent reasoning to go beyond data retrieval and into actionable insight
- How to orchestrate a human-AI Squad model where engineers make decisions and agents do the heavy lifting
- Techniques for correlating logs, metrics, and distributed traces across any cloud platform
- How to distinguish symptoms from root causes using AI-assisted dependency mapping and timeline analysis
- Best practices for keeping humans in the loop without creating bottlenecks in high-pressure outages
- How to write a blameless post-mortem that actually prevents future incidents — with a little AI help
- Strategies for reducing mean time to resolution (MTTR) by scaling your response capability with AI agents

Sean Whitesell, President of SkyForge Consulting, Chief Cloud Architect, and President of Tulsa .NET User Group

SkyForge Consulting

Microsoft MVP

Sean is the President of SkyForge Consulting and a Microsoft MVP. He has been the President of the Tulsa .NET User Group since 2009. Sean has been programming and playing with electronics for over 20 years. He also holds multiple black belts in martial arts.
