"ECD — Software Translation"
02 — Translating Extreme Co-Design to Software Engineering
The question is not whether this applies to software. The question is what the silicon is.
The Translation Problem
Nvidia’s Extreme Co-Design is visibly physical. You can point to the connectors, measure the bandwidth, hold the cooling manifold. The co-design surfaces are literal interfaces between chips on a PCB.
Software has no PCB. Services communicate over HTTP. Types are text files. Boundaries are invisible — enforced by convention, or not enforced at all. This invisibility is exactly the problem.
The hardware world cannot accidentally skip co-design. If components use incompatible connectors, the system does not boot. The incompatibility is immediate, total, and impossible to ship around. Software engineers do not have this forcing function. Mismatched contracts between services surface as runtime bugs — after deployment, after the user is affected.
Extreme Co-Design for software is the discipline of making invisible surfaces explicit and designing them before implementing either side.
The Core Translation
The performance of a software platform is not determined by the quality of its individual services. It is determined by the quality of the surfaces where those services touch each other.
What “surfaces” means in software:
| Hardware Surface | Software Equivalent | What happens when not co-designed |
|---|---|---|
| Interconnect connector spec | Type contracts between services | Entity drift — same concept defined N ways, no canonical truth |
| Lane allocation | API port contracts | Services couple directly to internal details of each other |
| Power backplane | Deployment template | N different deployment systems for the same tier of service |
| SSO fabric | Auth tier contract | Separate auth silos, no SSO |
| Instrumentation bus | Observability conventions | No correlation IDs, failures invisible across service boundaries |
| Thermal envelope | Runtime constraints | Cold starts, timeouts, circuit breakers — undefined at design time |
Two Insights ECD Adds to Software Architecture
Insight 1 — The Temporal Dimension
Most architectural frameworks describe what a well-formed system looks like at rest. They tell you the correct structure. They do not tell you when the critical decisions need to happen.
ECD adds the temporal answer: co-design happens at surface definition time, not implementation time.
If the surface is wrong, no amount of excellent implementation recovers it. You can build the most elegant, well-tested service — if it receives data in a shape that conflicts with what the consumer expects, the system is broken. The bug is not in the code. It is in the surface that was never co-designed.
If the surface is right, mediocre implementation can be safely replaced. A weak implementation behind a well-defined surface can be refactored independently, tested in isolation, and deployed without affecting its consumers.
Applied to software: The type contract, the API schema, the event payload structure — these are the product. The service, the component, the database — these are implementation details. Most teams get this exactly backwards.
Insight 2 — The Organizational Dimension
The second insight is about people, not code.
At Nvidia, architects from every layer are in the same design sessions before any commitment decision. A surface decision cannot be made by one team alone, because the downstream impact lands on teams that were not in the room.
Applied to software: When two services need to interact, both teams need to be present when the surface is defined. What actually happens: Team A implements with an internal data model. Team B discovers it and adapts. Team B’s adaptation is a hidden dependency on Team A’s implementation, not their interface. Team A refactors internals. Team B breaks. This is not a testing problem. It is a co-design problem.
What the “Rack” Is in Software
Nvidia changed the unit of compute from the chip to the rack. The GB200 NVL72 is not 72 independent GPUs in an enclosure. It is one distributed system that happens to be implemented as 72 physical dies.
The equivalent decision in software: what is the unit of delivery?
Most teams default to the service. A Lambda function is done when it deploys. A React component is done when it renders.
Under ECD, the unit is the domain — a cluster of services, types, schemas, contracts, deployment artifacts, and observability instrumentation that form one coherent capability. A domain is “done” when all its layers are co-designed and green:
- Business logic implemented and tested
- Type contracts published to the shared types package
- API contract documented and validated
- Deployment template applied consistently
- Auth tier integrated correctly
- Observability instrumented with correlation IDs
- CI/CD pipeline active
Shipping a service without its observability layer is like shipping a function without its type signature — structurally incomplete.
What Does Not Translate
Physical irreversibility: A chip can’t be redesigned after tape-out. Software can always be patched. This means co-design discipline is entirely opt-in — no physical constraint forces it. The methodology must compensate with architecture gate tests and process rules.
Measurement fidelity: Hardware interfaces can be measured with picosecond precision. Software interfaces are measured by runtime behavior under production load, which is noisy and delayed. Software co-design requires explicit contracts (typed interfaces, schemas) to substitute for what hardware designers can measure directly. If the contract is not written down, it doesn’t exist.
Team scale: Nvidia has hundreds of interface architects. Most software teams have 5-50 engineers. The methodology must be proportionate — lightweight enough to adopt, rigorous enough to matter.
The Synthesis
ECD applied to software engineering:
- Name the surfaces. Every place where two components interact is a surface. Give it a name, a type, a schema.
- Define before implementing. Both sides of a surface must be present when it is defined. No unilateral surface decisions.
- Change the unit. Measure delivery at the domain level, not the service level. A domain is done when all its layers are co-designed and in production.
- Enforce structurally. Use architecture gate tests to make surface violations impossible to ship, not just wrong to write.
- Treat the shared surface as the product. The implementation on either side of a port is an implementation detail. The port is the product.