Event: AI Engineer Summit 2025: Agents at Work
When: February 19 - 22, 2025
Where: New York
Focus: Practical implementation of AI agents in production environments
 
The 2025 AI Engineer Summit marked a pivotal transition from theoretical AI capabilities to practical implementation reality. Unprecedented advances in foundation models, reasoning capabilities, and infrastructure have created what some called "the perfect storm for AI agents," but let's be real – there's a massive gap between what these models can theoretically do and what they reliably deliver in production. The most successful implementations focus on specific domains with clear business value rather than general-purpose solutions, and speakers consistently reported that even advanced reasoning models achieve only 70-80% accuracy on complex real-world tasks. That sounds impressive until you realize it means at least every fifth request is completely wrong. The industry is entering a consolidation phase where implementation expertise, reliability engineering, and domain knowledge are becoming more valuable than raw model capabilities. We've got the rocket, and it (mostly) doesn't blow up, but to get to Mars, there are a hundred other things we need to figure out.
This is supposedly "the year of agents" (yeah, I know, we're all tired of hearing it already). Things still don't actually work reliably yet. That said, it's only February and we have 10 fast-moving months of AI development ahead of us!
There's a serious reality gap where models performing well in controlled settings regularly fail in complex real-world environments, with error rates of 15-20% being common – when did we abandon basic machine learning principles?
Evals matter more than ever – you need multi-dimensional evaluation frameworks that consider accuracy, cost, latency, and reliability rather than relying on benchmark performance that rarely translates to the real world.
Domain-specific models consistently outperform general-purpose solutions for enterprise use cases, crossing the threshold of sufficient reliability where general models still fall short.
Enterprise adoption requires addressing security, governance, and compliance concerns through multi-layered frameworks that most AI demos conveniently ignore.
The most future-proof systems are designed to scale with intelligence, improving automatically as underlying models get better rather than requiring constant reengineering.
Voice and multimodal agents represent the next frontier but come with massive hurdles for latency, reliability, and user experience that aren't worth chasing until the tech catches up.
Effective human-AI collaboration remains central to successful implementations, with the best systems amplifying human expertise rather than replacing it.
The conference featured an exceptionally diverse and high-quality set of perspectives spanning the complete AI implementation ecosystem – all the real ones were there. Enterprise practitioners included leaders from financial services (Bloomberg, Jane Street, Ramp, Method Financial), pharmaceuticals (Pfizer), media (Thomson Reuters), technology (LinkedIn, Datadog), and travel (Booking.com). Major AI labs were represented by speakers from OpenAI, Anthropic and Google Gemini, providing insights into frontier model capabilities. The venture capital perspective came from Grace Isford (Lux Capital) and Heath Black (SignalFire), while infrastructure specialists included Paul Gilbert (Arista Networks) and Don Bosco Durai (Privacera). Emerging startups showcasing specialized tools included Windsurf, BrightWave, Sierra, OpenPipe, Writer and Contextual AI. Academic and research perspectives came from Will Brown (Morgan Stanley), Sayash Kapoor (AI Snake Oil), and Stefania Druga (Google). This blend created a comprehensive view of the state of AI implementation across different domains and organizational contexts.
Everyone's talking about 2025 as "the perfect storm for AI agents" with converging advancements in reasoning models, test-time compute, engineering optimizations, hardware costs, and infrastructure investments. Yet as Grace Isford from Lux Capital put it, "we're seeing a lot of thunder, a lot of great momentum, but we haven't seen that lightning strike." This tension between theoretical capabilities and practical implementation challenges dominated every track.
The gap is real and it's massive. Sayash Kapoor (AI Snake Oil) highlighted how many headline-grabbing agent capabilities fail spectacularly in real-world scenarios. Waseem Alshikh (Writer) revealed that even leading reasoning models achieve only 81% combined accuracy on real-world financial scenarios, meaning "every hundred requests, 20 of them are just completely wrong." Anju Kambadur (Bloomberg) emphasized that in finance, "precision, comprehensiveness, speed, throughput, and availability" are non-negotiable requirements that current agent technologies don't consistently deliver. This pattern revealed a consistent truth across presentations: the gap between AI's capabilities in controlled settings and its performance in complex real-world environments remains substantial.
Think of it as getting to Mars – the LLM is the rocket, but to get to Mars, there are a hundred other things we need to figure out. We've got a rocket that (mostly) doesn't blow up, but that's it. Space is unforgiving, and so (to a lesser extent) are production environments.
The path forward emphasized systematic evaluation and incremental improvement. Kyle Corbitt (OpenPipe) demonstrated how Method Financial scaled to 500 million agent deployments by carefully measuring error rates, latency, and cost across different models before fine-tuning a small, reliable model optimized for their specific use case. Barry Zhang (Anthropic) advised a simple architecture focused on three components—environment, tools, and system prompt—that could be iteratively improved rather than trying to build complex systems from the start. Multiple speakers recommended starting with single-purpose agents before attempting multi-agent systems, focusing on specific high-value domains rather than attempting to build general-purpose solutions.
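To make the environment / tools / system prompt split concrete, here is a minimal sketch in Python. It is not code from the talk: `Agent`, `scripted_model`, the `read_file` tool, and the dictionary action format are all hypothetical names chosen for illustration, and the "model" is a scripted stub so the loop runs without any API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical action format: the "model" returns either
# {"tool": name, "arg": value} or {"final": answer}.
Action = Dict[str, str]


@dataclass
class Agent:
    system_prompt: str                        # what the agent is told to do
    tools: Dict[str, Callable[[str], str]]    # what the agent can call
    model: Callable[[List[str]], Action]      # stand-in for an LLM call
    history: List[str] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 5) -> str:
        self.history = [self.system_prompt, f"task: {task}"]
        for _ in range(max_steps):
            action = self.model(self.history)
            if "final" in action:
                return action["final"]
            tool = self.tools.get(action["tool"])
            if tool is None:
                observation = f"error: unknown tool {action['tool']}"
            else:
                observation = tool(action["arg"])
            self.history.append(f"observation: {observation}")
        return "stopped: step budget exhausted"


# The environment here is just an in-memory file system.
environment = {"notes.txt": "ship the eval suite first"}


def read_file(path: str) -> str:
    return environment.get(path, "error: no such file")


def scripted_model(history: List[str]) -> Action:
    # Stub policy: read the file once, then answer with what it saw.
    last = history[-1]
    if last.startswith("observation: "):
        return {"final": last[len("observation: "):]}
    return {"tool": "read_file", "arg": "notes.txt"}


agent = Agent(
    system_prompt="You answer questions using the tools provided.",
    tools={"read_file": read_file},
    model=scripted_model,
)
result = agent.run("What should we do first?")
print(result)
```

The point of keeping the three pieces separate is that each can be iterated on independently: swap the stub for a real model call, add a tool, or tighten the prompt without touching the loop.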
A clear consensus emerged that domain-specific approaches consistently outperform general-purpose solutions in enterprise settings. Douwe Kiela (Contextual AI) advocated explicitly for "specialization over AGI," arguing that commercial applications require focus on specific domains where AI can deliver measurable value. Jonathan Lowe demonstrated how Pfizer's GraphRAG approach incorporates domain-specific relationships in pharmaceutical data, outperforming standard approaches by understanding connections between entities like compounds, proteins, and diseases.
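As a rough illustration of why relationships help retrieval, here is a small Python sketch of graph-based context gathering: start from an entity in the query and walk its edges to collect connected facts. This is not Pfizer's implementation. The graph, the placeholder entity names (`compound_A`, `protein_X`, and so on), and the `related_facts` helper are all invented for the example.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy knowledge graph with placeholder entities: (relation, neighbor) per node.
Graph = Dict[str, List[Tuple[str, str]]]
graph: Graph = {
    "compound_A": [("inhibits", "protein_X")],
    "protein_X": [("implicated_in", "disease_Y")],
    "disease_Y": [],
    "compound_B": [("binds", "protein_Z")],
    "protein_Z": [],
}


def related_facts(graph: Graph, start: str, max_hops: int = 2) -> List[str]:
    """Breadth-first walk collecting facts within max_hops of the start entity."""
    facts: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            facts.append(f"{node} {relation} {neighbor}")
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts


context = related_facts(graph, "compound_A")
print(context)
```

A plain similarity search over text chunks mentioning `compound_A` would have no reason to surface `disease_Y`. Following the edges does, and the collected facts can then be passed to the model as context.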
Multiple speakers emphasized that proprietary data and domain knowledge provide the true competitive advantage in an era of rapidly commoditizing foundation models. Grace Isford dropped one of the conference bangers when she noted that "foundation models are the fastest depreciating asset class on the market right now," suggesting that organizations should focus less on model selection and more on developing proprietary data advantages. John Crepezzi described how Jane Street built custom models for working with OCaml code because off-the-shelf models were insufficient for their specialized domain.
I don't fully buy the "data is your moat" claim, but there's something to be said for domain-specific applications crossing the threshold of sufficiency where general models still fall short. Kyle Corbitt shared how Method Financial discovered that fine-tuned 8B parameter models outperformed much larger models for their specific use case while dramatically reducing costs and latency. Bruno Passos explained how Booking.com's AI coding tools leverage the context of their codebase to generate more relevant solutions than general-purpose tools. The bitter lesson still applies – over time, bigger general models with more compute will likely win, but in the near term, specialized models deliver the reliability enterprises need.
When did we all abandon basic machine learning principles? You can't even train a model without a test set, but somehow we collectively forgot that evals matter when it comes to generative AI, let alone AI agents! A consistent theme throughout both conference tracks was the critical importance of robust evaluation frameworks, security measures, and reliability engineering for production AI systems.
Aparna Dhinakaran (Arize) presented a comprehensive framework for evaluating AI agents at multiple levels, emphasizing that "evals aren't just at one layer of your trace" and that rigorous testing across all components is essential for production-ready systems. Sayash Kapoor criticized over-reliance on static benchmarks, noting that impressive benchmark performance "very rarely translates into the real world." He advocated for multi-dimensional metrics that consider not just accuracy but also cost and reliability, showing how these metrics revealed that Claude 3.5 performed as well as GPT-4o on some tasks at 1/10th the cost.
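Here is a minimal Python sketch of what a multi-dimensional comparison might look like in practice: aggregate accuracy, cost, and latency per model, then pick the cheapest model that clears explicit thresholds. The model names, the measurements, and the helpers `summarize` and `cheapest_acceptable` are made up for illustration and do not come from any speaker's framework.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Run:
    model: str
    correct: bool
    cost_usd: float
    latency_s: float


def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-model accuracy, mean cost, and mean latency."""
    by_model: Dict[str, List[Run]] = {}
    for r in runs:
        by_model.setdefault(r.model, []).append(r)
    return {
        name: {
            "accuracy": mean([1.0 if r.correct else 0.0 for r in rs]),
            "cost_usd": mean([r.cost_usd for r in rs]),
            "latency_s": mean([r.latency_s for r in rs]),
        }
        for name, rs in by_model.items()
    }


def cheapest_acceptable(
    summary: Dict[str, Dict[str, float]],
    min_accuracy: float,
    max_latency_s: float,
) -> Optional[str]:
    """Return the lowest-cost model meeting both thresholds, or None."""
    ok = [
        (stats["cost_usd"], name)
        for name, stats in summary.items()
        if stats["accuracy"] >= min_accuracy and stats["latency_s"] <= max_latency_s
    ]
    return min(ok)[1] if ok else None


# Illustrative, made-up measurements for two hypothetical models.
runs = [
    Run("large-model", True, 0.040, 2.0),
    Run("large-model", True, 0.040, 2.4),
    Run("large-model", True, 0.040, 2.2),
    Run("large-model", False, 0.040, 2.2),
    Run("small-tuned", True, 0.004, 0.5),
    Run("small-tuned", True, 0.004, 0.5),
    Run("small-tuned", True, 0.004, 0.5),
    Run("small-tuned", False, 0.004, 0.5),
]
summary = summarize(runs)
choice = cheapest_acceptable(summary, min_accuracy=0.75, max_latency_s=1.0)
print(summary)
print(choice)
```

With these invented numbers both models are equally accurate, so a leaderboard sorted on accuracy alone would call it a tie. The cost and latency columns are what separate them.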
Security concerns received significant attention, with Don Bosco Durai outlining a multi-layered approach to AI agent security addressing the unique vulnerabilities created by agents running in a single process. His framework consisted of pre-deployment evaluation, runtime enforcement, and continuous monitoring, addressing different aspects of security, safety, and compliance at each layer. Anju Kambadur explained that Bloomberg's agents undergo rigorous testing with "remediation workflows and circuit breakers" to catch errors before they impact financial data.
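The talk did not go into how Bloomberg's circuit breakers are built, so the following is only a generic sketch of the pattern in Python: validate every agent output at runtime and stop calling the agent after repeated failures. `CircuitBreaker`, `CircuitOpen`, `flaky_agent`, and `is_number` are hypothetical names for this example.

```python
from typing import Callable, List


class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the agent."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0  # consecutive validation failures

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(
        self,
        agent: Callable[[str], str],
        validate: Callable[[str], bool],
        request: str,
    ) -> str:
        if self.is_open:
            raise CircuitOpen("too many consecutive failures; route to a human")
        output = agent(request)
        if validate(output):
            self.failures = 0  # a good answer resets the count
            return output
        self.failures += 1
        raise ValueError(f"output rejected by validator: {output!r}")


def flaky_agent(request: str) -> str:
    return "n/a"  # stub that always produces an invalid answer


def is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


breaker = CircuitBreaker(max_failures=2)
outcomes: List[str] = []
for _ in range(3):
    try:
        breaker.call(flaky_agent, is_number, "Q3 revenue?")
    except CircuitOpen:
        outcomes.append("open")
    except ValueError:
        outcomes.append("rejected")
print(outcomes)
```

This sketch has no timed reset or half-open state. A production breaker would usually add one so the agent can be retried after a cool-down.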
Most speakers agreed that while 100% accuracy is unattainable, organizations need robust observability, attribution, and audit trails to handle cases where things go wrong. Mike Conover described how BrightWave's research agent includes detailed citation tracking and "receipts" so users can validate information sources. Diamond Bishop explained how Datadog's AI agents generate postmortems and maintain full visibility into their decision-making processes, making them accountable even when operating autonomously.
A forward-looking theme emerged around designing systems that automatically improve as underlying models get better. Rahul Sengottuvelu (Ramp) presented a framework distinguishing "classical compute" from "fuzzy compute" (neural networks), arguing that systems should maximize the latter since "if you did nothing, absolutely nothing... the big labs are still working, spending billions of dollars making those models better." He demonstrated an experimental email client where the LLM itself acted as the backend, rendering UI and handling user interactions without traditional software engineering.
This is where you need to catch the wave. That wave is barely a swell right now – it's way out there. It's not worth chasing yet. You've got to wait till that wave starts to crest, and then you catch it. But when you do, you're riding the momentum of exponential improvement in foundation models.
Will Brown (Morgan Stanley) discussed how reinforcement learning is becoming crucial for agent development, describing how models like DeepSeek's R1 demonstrate that "the long chain of thought... actually emerges as a byproduct" of training with appropriate reinforcement signals. The results of reinforcement learning are astounding – it's a completely intelligible methodology, you can totally get it, and that it works is borderline miraculous!
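For a feel of what a "reinforcement signal" can look like, here is a small Python sketch of a rubric-style reward function plus the group-relative normalization step that GRPO is commonly described as using: rewards for several samples of the same prompt are centered and scaled within the group. This is not Brown's code and does no training. The rubric, the `<answer>` tag format, and both function names are assumptions made for illustration, and the choice of population standard deviation is arbitrary (implementations differ).

```python
import re
from statistics import mean, pstdev
from typing import List


def rubric_reward(completion: str, expected: str) -> float:
    """Toy rubric: partial credit for format, more for a correct answer."""
    reward = 0.0
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        reward += 0.5  # followed the output format
        if match.group(1).strip() == expected:
            reward += 1.0  # got the right answer
    return reward


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for the prompt "What is 2 + 2?"
completions = [
    "<answer>4</answer>",
    "<answer>5</answer>",
    "the answer is 4",
    "<answer> 4 </answer>",
]
rewards = [rubric_reward(c, expected="4") for c in completions]
advantages = group_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

Samples that beat their group's average get a positive advantage and the rest get a negative one, which is the signal a policy update would then push toward or away from.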
John Crepezzi described Jane Street's "AID" (AI Developer) framework which provides a unified backend for multiple editor integrations, allowing them to swap in new models or context-building strategies without changing frontends. Similarly, Kevin Hou (Windsurf) explained how their agent architecture was designed to "scale with intelligence" so that "if the models get better, our product gets better," including removing chat interfaces in favor of pure agent interactions. These approaches collectively point toward a future where AI systems continuously improve through reinforcement learning rather than requiring constant human reengineering.
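The "swap the model without touching the frontend" idea can be sketched in a few lines of Python. This is not Jane Street's AID or Windsurf's architecture. `Backend`, `EchoModel`, `UpperModel`, `editor_frontend`, and `with_file_header` are hypothetical stand-ins that only show the shape of the indirection.

```python
from typing import Callable, Protocol


class Model(Protocol):
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UpperModel:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


class Backend:
    """Single entry point shared by every editor integration."""

    def __init__(self, model: Model, build_context: Callable[[str], str]):
        self.model = model
        self.build_context = build_context

    def handle(self, request: str) -> str:
        return self.model.complete(self.build_context(request))


def editor_frontend(backend: Backend, selection: str) -> str:
    # A frontend only knows about Backend.handle, never the model behind it.
    return backend.handle(selection)


def with_file_header(request: str) -> str:
    # Stand-in for a context-building strategy.
    return f"[file: main.ml] {request}"


backend = Backend(EchoModel(), with_file_header)
before = editor_frontend(backend, "let x = 1")
backend.model = UpperModel()  # swap the model; the frontend is unchanged
after = editor_frontend(backend, "let x = 1")
print(before)
print(after)
```

Because the frontend depends only on `handle`, a better model or a smarter context builder is a one-line change on the backend side.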
Voice AI emerged as an important frontier for agent development across multiple presentations, but let's be real – the latency issues with frontier models totally kill the voice agent experience. They're going to be great when run at full power, full bandwidth and locally, but the practical applications of remotely deployed voice models are still struggling.
Nick Karyotakis (SuperDial) outlined the challenges of building reliable voice agents, noting that modern approaches have shifted from "prescriptive to descriptive development." He described how SuperDial builds phone agents that handle insurance verification calls by traversing phone trees, extracting information, and escalating to humans when needed.
Zack Reneau-Wedeen (Sierra) shared how Sierra's voice assistants handle customer service for brands by creating a "responsive design" approach where "it's the same agent code" operating across different channels and modalities. He emphasized that for voice agents, "latency all of a sudden matters so much more" than with text interfaces, creating new engineering challenges.
Multimodal capabilities also received significant attention. Mukund Sridhar and Aarush Selvan (Google) presented Gemini Deep Research as a multimodal research agent that "can browse the web as much as it needs" to produce comprehensive answers. Karina Nguyen (OpenAI) discussed how multimodal capabilities in Canvas enable more natural collaboration between humans and AI on creative tasks, suggesting that "the kind of interface to AGI is blank canvas that kind of self-morphs into your intent."
This is a great example of the wrong end of the spectrum to chase right now. Yes, it's the next frontier, but there are big hurdles that make it impractical. Don't waste time and effort here – the tech has to bridge the gap further before we get there. Focus where the gap is narrow and closest to sufficiency – that's where you should experiment.
Despite the focus on automation, speakers consistently emphasized the importance of thoughtful human-AI collaboration. Colin Flaherty (Augment Code) described how the company's AI coding agent has written over 90% of its own codebase, but with human supervision at critical junctures. Diamond Bishop explained that Datadog's AI agents help human engineers by running investigations automatically but present results in ways that build trust and facilitate learning.
Karina Nguyen outlined an evolution from "models trained on RL and chain of thought using real-world tools" toward "co-innovators" that collaborate with humans on creative tasks. Kevin Hou demonstrated how Windsurf tracks a "unified timeline" of both human and AI actions, allowing their agent to understand what developers are doing and continue work seamlessly.
The conference closed with perspectives on democratizing AI engineering. Stefania Druga presented Cognimates, a platform that teaches children about AI by allowing them to train their own models and build applications. She emphasized that "kids are actually like little scientists" who can formulate and test hypotheses about how AI works, and that early exposure helps "demystify the intelligence" of AI systems. Multiple speakers touched on making agent development more accessible, with Will Brown sharing a simple reinforcement learning implementation that went viral because "it was one file of code... really simple... and it invited modification."
Hugging Face - GitHub for machine learning (mentioned by Isford)
Together AI - Open source AI cloud infrastructure (mentioned by Isford)
OpenPipe - Platform for building, training, and deploying fine-tuned open source models (presented by Corbitt)
Arize - AI observability platform for agent evaluation (presented by Dhinakaran)
Windsurf - Agentic editor with background understanding of development context (presented by Hou)
Pipecat - Open source voice AI orchestration framework by Daily (mentioned by Karyotakis)
Langfuse - Self-hostable observability platform for LLM applications (mentioned by Karyotakis)
DSPy - Framework for optimizing prompts through automation (mentioned by Brown)
TensorZero - Structured and typed LLM endpoints for production (mentioned by Karyotakis)
Cursor - Named the #1 AI tool that engineers love (from Frontier Feud)
PAIG (paig.ai) - Privacera's open-sourced solution for AI safety and security (presented by Durai)
Apache Ranger - Open-source data governance project for big data (mentioned by Durai)
FailSafe - Writer's evaluation framework for testing models (presented by Alshikh)
DeepSeek-R1 - DeepSeek's open source reasoning model inspired by OpenAI's o1 (discussed by Brown)
Llama 3.1 - Meta's open source language model (mentioned by multiple speakers)
GRPO Algorithm - Simple RL algorithm for fine-tuning LLMs (presented by Brown)
"Introducing Contextual Retrieval" - Post by Anthropic (mentioned by Bricken)
"Towards Monosemanticity" and "Scaling Monosemanticity" - Anthropic interpretability papers (mentioned by Bricken)
Rubric Engineering - Will Brown's approach to designing reward functions for LLM fine-tuning
gr.inc - Ross Taylor's project addressing reasoning gaps in open models (mentioned by Soumith Chintala)
Bits AI - Datadog's AI assistant for DevOps (presented by Bishop)
Contextual AI - Specialized RAG agents for enterprise use cases (presented by Kiela)
Writer - Domain-specific LLMs for enterprises (presented by Alshikh)
BrightWave - Knowledge agent for due diligence and financial research (presented by Conover)
Sierra - Voice AI platform for customer service and conversation automation (presented by Reneau-Wedeen)
Augment Code - Building AI coding agents (presented by Flaherty)
Arista Networks - Specialized networking infrastructure for AI data centers (presented by Gilbert)
Ultra Ethernet Consortium - Next-generation ethernet standard for AI workloads (mentioned by Gilbert)
Model Context Protocol - Anthropic's open source protocol for language models to interact with data sources
Cognimates - Platform for teaching children about AI through visual programming (presented by Druga)
AI Literacy - EU AI Act now requires AI literacy as part of regulation (mentioned by Druga)
Scratch - Visual programming environment for children with 100+ million users (mentioned by Druga)
Beacon - SignalFire's AI ML platform tracking 650M+ employees and 80M+ companies (presented by Black)