
AT&T Rewrites Model Orchestration, Cuts Costs by 90%
Context and chronology
AT&T confronted a throughput problem when internal usage climbed to roughly 8 billion tokens per day, forcing a rethink of where heavy compute runs. The company’s chief data officer, Andy Markus, led a shift away from funneling all tasks into large reasoning models toward a layered orchestration approach. Mr. Markus’s team assembled a multi-agent stack that places compact, task-focused workers beneath a controlling super-agent tier, prioritizing latency and cost per transaction. This architecture was integrated with Microsoft Azure and includes a graphical workflow builder for internal teams.
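The layered pattern the article describes can be sketched as a routing tier: a controlling "super-agent" dispatches each task to a cheap, task-focused worker and escalates to a large reasoning model only when no specialist fits. This is a minimal illustration under assumed names (Task, SuperAgent, the worker functions are all hypothetical), not AT&T's actual implementation.

```python
# Illustrative sketch of a layered orchestration tier: a super-agent
# routes each task to a compact, task-focused worker and escalates to
# a large reasoning model only when no specialist matches. All names
# here are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    kind: str       # e.g. "classify", "summarize", "plan"
    payload: str

def small_classifier(task: Task) -> str:
    return f"[small-model] classified: {task.payload}"

def small_summarizer(task: Task) -> str:
    return f"[small-model] summary: {task.payload[:20]}"

def large_reasoner(task: Task) -> str:
    return f"[large-model] reasoned about: {task.payload}"

class SuperAgent:
    """Dispatches tasks to cheap specialists; escalates the rest."""
    def __init__(self) -> None:
        self.workers: Dict[str, Callable[[Task], str]] = {
            "classify": small_classifier,
            "summarize": small_summarizer,
        }

    def run(self, task: Task) -> str:
        # Fall back to the expensive model only when no compact
        # worker is registered for this task kind.
        worker = self.workers.get(task.kind, large_reasoner)
        return worker(task)

agent = SuperAgent()
print(agent.run(Task("classify", "billing dispute")))
print(agent.run(Task("plan", "multi-step network upgrade")))
```

Keeping routine task kinds on compact workers is what drives the cost-per-transaction savings; the large model is reserved for the long tail.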
Design principles and operational trade-offs
Engineers chose interchangeable model components rather than committing to one monolithic model, allowing rapid substitution as capabilities evolve. The orchestration uses retrieval-enhanced methods and a vector-backed search layer to keep decision logic anchored in AT&T’s own data, with human oversight retained as a governance control. That combination trimmed response time and reduced inference spend, with reported savings up to 90% on select workloads. The team emphasizes measuring three core properties—accuracy, cost, and responsiveness—before promoting agentic automation into production.
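The three-property gate described above can be made concrete as a promotion check: a workflow only moves into production when accuracy, cost, and responsiveness all clear their thresholds. The thresholds and field names below are assumptions for illustration, not figures reported by AT&T.

```python
# Illustrative promotion gate for agentic workflows, based on the three
# properties named in the article: accuracy, cost, and responsiveness.
# Thresholds and field names are assumptions, not AT&T's actual values.
from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float          # fraction correct on a held-out eval set
    cost_per_call_usd: float # average inference spend per call
    p95_latency_ms: float    # 95th-percentile response time

def ready_for_production(r: EvalResult,
                         min_accuracy: float = 0.95,
                         max_cost_usd: float = 0.01,
                         max_latency_ms: float = 2000.0) -> bool:
    """Promote only if all three gates pass."""
    return (r.accuracy >= min_accuracy
            and r.cost_per_call_usd <= max_cost_usd
            and r.p95_latency_ms <= max_latency_ms)

print(ready_for_production(EvalResult(0.97, 0.004, 850.0)))  # all gates pass
print(ready_for_production(EvalResult(0.99, 0.05, 850.0)))   # cost gate fails
```

Gating on all three properties at once prevents a workflow that is accurate but slow or expensive from reaching production, which matches the trade-off discipline the team describes.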
Adoption, use cases, and measured outcomes
The workflow tool has reached more than 100,000 employees, and usage metrics show durable daily engagement for a majority of active users. Reported productivity uplifts on some tasks reached as high as 90%, while complex engineering flows are being decomposed into chains of smaller agents that correlate telemetry, file logs, and change histories. The company offers both a no-code visual path and a pro-code path driven by Python, with surprisingly high uptake of the low-code option even among technical participants. Operational design preserves audit trails, enforces role-based access, and keeps humans on the loop during multi-step handoffs.
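The decomposition of complex engineering flows into agent chains can be sketched as a pipeline where each small agent enriches a shared context with one signal (telemetry, file logs, change history) before handing off. The stage names and incident data below are hypothetical.

```python
# Minimal sketch of decomposing an engineering flow into a chain of
# small agents, each correlating one signal (telemetry, file logs,
# change history) into a shared context. Stage names and the sample
# incident data are hypothetical.
from typing import Callable, Dict, List

Context = Dict[str, str]

def telemetry_agent(ctx: Context) -> Context:
    ctx["telemetry"] = "cpu spike at 14:02"
    return ctx

def log_agent(ctx: Context) -> Context:
    ctx["logs"] = "OOM error in service-x"
    return ctx

def change_history_agent(ctx: Context) -> Context:
    ctx["changes"] = "config push at 13:58"
    return ctx

def run_chain(stages: List[Callable[[Context], Context]],
              ctx: Context) -> Context:
    """Run each agent in order; a human reviews the final context."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx

result = run_chain([telemetry_agent, log_agent, change_history_agent],
                   {"incident": "INC-1234"})
print(sorted(result.keys()))
```

Because each stage appends to the same context, the chain leaves a natural audit trail for the human-on-the-loop review the article mentions.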
Developer productivity and downstream effects
By treating coding as a series of function-specific archetypes, teams produce near-production-quality artifacts in far fewer iterations; one internal example cut what had been a six-week build to roughly twenty minutes. Mr. Markus frames this approach as "AI-fueled coding," where focused generation replaces iterative back-and-forth, compressing delivery timelines and increasing the velocity of production-grade outputs. The approach reduces costly context switching for engineers and enables nontechnical stakeholders to prototype solutions in plain language. Taken together, these elements form a repeatable pattern for large enterprises wrestling with scale, cost, and governance.
Source: VentureBeat.