Agent P and the Case of the Crashing Cloud: Deploying AI SRE Agents for Root Cause Analysis
In this session, we'll follow Perry as he designs, deploys, and coordinates his AI SRE agent network — each agent assigned a specialty: log analysis, trace correlation, anomaly detection, and post-mortem drafting. We'll walk through a real-world multi-service failure scenario where the AI agents triage the incident, surface the most likely root causes, and brief the human Squad in plain language so engineers can make fast, confident decisions rather than drowning in dashboards. Powered by GitHub Copilot, these agents don't just surface data — they reason about it, suggest next investigative steps, and even draft the blameless post-mortem before Perry's had his morning platypus kibble. This talk is equal parts mission briefing and practical blueprint, applicable to any cloud platform, and designed to send you home ready to build your own AI-assisted incident response operation.
Attendees Will Learn
- Why AI SRE agents are the next evolution in cloud incident response and root cause analysis
- How to design purpose-built AI agents for specific RCA tasks: log triage, trace analysis, anomaly detection, and post-mortem generation
- How GitHub Copilot powers agent reasoning to go beyond data retrieval and into actionable insight
- How to orchestrate a human-AI Squad model where engineers make decisions and agents do the heavy lifting
- Techniques for correlating logs, metrics, and distributed traces across any cloud platform
- How to distinguish symptoms from root causes using AI-assisted dependency mapping and timeline analysis
- Best practices for keeping humans in the loop without creating bottlenecks in high-pressure outages
- How to write a blameless post-mortem that actually prevents future incidents — with a little AI help
- Strategies for reducing mean time to resolution (MTTR) by scaling your response capability with AI agents
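One way to picture the symptom-versus-root-cause idea from the list above is a small sketch that combines a dependency map with a timeline of first anomalies. Everything here is a hypothetical illustration rather than the session's actual implementation: the service names, the timestamps, and the `rank_root_cause_candidates` helper are all assumptions. The heuristic treats a service as a likely symptom if one of its direct dependencies showed an anomaly at or before its own, and as a root cause candidate otherwise.

```python
from datetime import datetime

# Hypothetical dependency map: each service lists the services it calls directly.
DEPENDS_ON = {
    "web": ["checkout", "catalog"],
    "checkout": ["payments", "db"],
    "catalog": ["db"],
    "payments": [],
    "db": [],
}

# Hypothetical timeline: when each service first looked anomalous.
# Services with no anomaly (here, "payments") are simply absent.
FIRST_ANOMALY = {
    "web": datetime(2024, 1, 1, 9, 4),
    "checkout": datetime(2024, 1, 1, 9, 2),
    "catalog": datetime(2024, 1, 1, 9, 3),
    "db": datetime(2024, 1, 1, 9, 0),
}


def rank_root_cause_candidates(depends_on, first_anomaly):
    """Return anomalous services that no direct dependency can explain.

    A service counts as a symptom when at least one of its direct
    dependencies became anomalous at or before the service itself did.
    The remaining services are returned earliest anomaly first.
    """
    candidates = []
    for service, seen_at in first_anomaly.items():
        earlier_faulty_deps = [
            dep
            for dep in depends_on.get(service, [])
            if dep in first_anomaly and first_anomaly[dep] <= seen_at
        ]
        if not earlier_faulty_deps:
            candidates.append(service)
    return sorted(candidates, key=lambda name: first_anomaly[name])


if __name__ == "__main__":
    print(rank_root_cause_candidates(DEPENDS_ON, FIRST_ANOMALY))
```

This sketch only looks at direct dependencies and a single timestamp per service. A real agent would likely weigh transitive dependencies, deploy events, and signal confidence before briefing a human, who still makes the final call.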