
LB
Published on 2025-09-20

When Unit Tests Meet Large Models

  • "LLM-written code passes first time?"

  • Don't believe the demo videos.

Over the past year we have shipped 3 official releases, 247 user stories and about 140k lines of code—almost all generated by a large language model.

Yet only 78% of the functions passed our automated gates on the first try. The remaining 22% failed: one-third on syntax, one-third on missing imports, and one-third on unit tests. In a fully unattended pipeline that 22% would have blocked the merge and killed the whole idea of AI-native delivery.

We therefore built a three-stage "rebirth" safeguard inside LBAI's internal CI:

1. Syntax

2. Semantics

3. Business baseline

After 30 days of production traffic the pass rate climbed to 99.7% with zero roll-backs and zero human clicks. Below is a plain-language tour of how each gate works, the real failure patterns we met, and the rebirth mechanism you can copy tomorrow.
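Before walking through each gate, here is the overall shape of the rebirth loop as a minimal Python sketch; the Gate class, run_pipeline and regenerate are illustrative names, not our production code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Gate:
    name: str
    check: Callable[[str], tuple[bool, str]]   # returns (passed, failure_log)
    retry_limit: int

def run_pipeline(code: str, gates: list[Gate],
                 regenerate: Callable[[str, str], str]) -> bool:
    """Push the code through each gate in order; on failure, feed the log back
    to the model and retry up to the gate's limit before escalating to a human."""
    for gate in gates:
        for attempt in range(gate.retry_limit + 1):
            passed, log = gate.check(code)
            if passed:
                break
            if attempt == gate.retry_limit:
                return False                   # out of rebirths: human review
            code = regenerate(code, log)       # "rebirth": model rewrites with the log as a hint
    return True
```

Each gate carries its own retry limit, matching the per-gate budgets listed below.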

Stage 1 – Syntax Gate: make the code compile

Runs on: every function
Avg. duration: ~2s
First-attempt failure: 1.1%
Retry limit: 1

What we do

- Spin up a minimal container that holds only the declared runtime dependencies—no "works on my laptop" surprises.

- Parse the file and try to import every module it mentions. Any syntax or import error aborts the pipeline immediately.

- Match the error message against a small set of high-frequency repair templates (undefined variable, missing import, wrong decorator order, etc.). The template feeds a corrective hint back to the model, which emits a new version within seconds (see the sketch after this list).
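A minimal sketch of that parse-import-match loop, assuming a Python codebase; REPAIR_TEMPLATES is an illustrative stand-in for our real template set.

```python
import ast
import importlib
import re

# Illustrative stand-ins for the high-frequency repair templates; the real set is larger.
REPAIR_TEMPLATES = {
    r"name '(\w+)' is not defined": "Define or import '{0}' before it is used.",
    r"No module named '([\w.]+)'": "Add the missing import for '{0}' or declare the dependency.",
}

def syntax_gate(source: str) -> tuple[bool, str]:
    """Return (passed, hint); on failure the hint is fed back to the model."""
    try:
        tree = ast.parse(source)                       # catches pure syntax errors
        for node in ast.walk(tree):                    # try to import every module the file mentions
            if isinstance(node, ast.Import):
                for alias in node.names:
                    importlib.import_module(alias.name)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                importlib.import_module(node.module)
    except Exception as exc:
        message = f"{type(exc).__name__}: {exc}"
        for pattern, hint in REPAIR_TEMPLATES.items():
            match = re.search(pattern, message)
            if match:
                return False, hint.format(*match.groups())
        return False, message                          # no template matched: return the raw error
    return True, ""
```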

Example

A missing helper wrapper caused a NameError. The matched template asked the model to add the import; the second version passed in 1.8s. Cost: a fraction of a cent.

The syntax gate keeps "compile-time" defects from ever reaching the main branch.

Stage 2 – Semantics Gate: unit tests & API contract

Runs on: every function
Avg. duration: ~9s
First-attempt failure: 0.8%
Retry limit: 3

What we do

- Run only the 5–20 unit tests that directly exercise the changed function—feedback inside ten seconds.

- Compare the live response of each endpoint with a previously recorded "golden" snapshot; any missing field or type shift fails the gate.

- On failure we package the log, the source and the test case into a prompt, asking the model to rewrite only the body while keeping the signature intact (see the sketch after this list). Most issues clear on the second attempt.
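A minimal sketch of the snapshot diff and the repair prompt, assuming a JSON API; diff_against_golden and build_repair_prompt are illustrative names, not our production helpers.

```python
from typing import Any

def diff_against_golden(live: dict[str, Any], golden: dict[str, Any]) -> list[str]:
    """Flag missing fields and type shifts against the recorded golden snapshot."""
    problems = []
    for field, expected in golden.items():
        if field not in live:
            problems.append(f"missing field: {field}")
        elif type(live[field]) is not type(expected):
            problems.append(
                f"type shift on {field}: "
                f"{type(expected).__name__} -> {type(live[field]).__name__}")
    return problems

def build_repair_prompt(source: str, test_case: str, log: str) -> str:
    """Package the failure evidence into a prompt asking for a body-only rewrite."""
    return (
        "The function below fails its unit tests.\n"
        f"--- source ---\n{source}\n"
        f"--- failing test ---\n{test_case}\n"
        f"--- log ---\n{log}\n"
        "Rewrite only the function body; keep the signature exactly as it is."
    )
```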

Example

A retry decorator used an exponential back-off that doubled too quickly. The test expected a 3-second total but measured 7s. After seeing the timing graph in the prompt, the model halved the multiplier and the next build passed.
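Purely as an illustration of how those numbers can line up (the decorator's actual base delay and retry count were not recorded here): with a 1-second base delay over three retries, a multiplier of 2 sleeps 7 seconds in total, and halving it brings the total down to the expected 3 seconds.

```python
def total_backoff(base_s: float, multiplier: float, retries: int) -> float:
    """Total sleep time for a geometric back-off: base, base*m, base*m^2, ..."""
    return sum(base_s * multiplier ** i for i in range(retries))

print(total_backoff(1.0, 2.0, 3))  # 1 + 2 + 4 = 7s, what the test measured
print(total_backoff(1.0, 1.0, 3))  # 1 + 1 + 1 = 3s, what the test expected
```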

Stage 3 – Business Baseline Gate: end-to-end journey

Runs on: every function
Avg. duration: ~45s
First-attempt failure: 0.3%
Retry limit: 3

What we do

- Drive the UI with an automated browser that logs in, checks out and queries records—hitting 30+ critical DOM nodes.

- Fail if P95 latency grows >10% over the recorded baseline or if memory/CPU crosses a budget line (sketched after this list).

- Return flame graphs and diff summaries to the model, requesting a fix that stays within resource limits. Up to three rebirths are allowed before human review.
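The latency and budget rules above reduce to a small comparison against the recorded baseline. Here is a minimal sketch; RunMetrics and the budget parameters are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    p95_latency_ms: float
    peak_memory_mb: float
    peak_cpu_pct: float

def baseline_gate(current: RunMetrics, baseline: RunMetrics,
                  memory_budget_mb: float, cpu_budget_pct: float) -> list[str]:
    """Return the list of budget violations; an empty list means the gate passes."""
    violations = []
    if current.p95_latency_ms > baseline.p95_latency_ms * 1.10:
        violations.append(
            f"P95 latency regressed >10%: "
            f"{baseline.p95_latency_ms:.0f}ms -> {current.p95_latency_ms:.0f}ms")
    if current.peak_memory_mb > memory_budget_mb:
        violations.append(
            f"memory over budget: {current.peak_memory_mb:.0f}MB > {memory_budget_mb:.0f}MB")
    if current.peak_cpu_pct > cpu_budget_pct:
        violations.append(
            f"CPU over budget: {current.peak_cpu_pct:.0f}% > {cpu_budget_pct:.0f}%")
    return violations
```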

Example

A new ORM loading strategy removed the N+1 problem but ballooned peak memory by 38%. With the budget attached to the prompt the model switched to a lighter fetch mode on the third try and passed.
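Our actual diff is internal, but the shape of such a swap is familiar. As a generic illustration only (SQLAlchemy, with made-up Order/Item models, not our codebase): a joined eager load avoids N+1 at the cost of wide, duplicated result rows, while selectinload trades that for one extra IN query and a much smaller peak footprint.

```python
from sqlalchemy import ForeignKey, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, mapped_column,
                            relationship, joinedload, selectinload)

class Base(DeclarativeBase):
    pass

class Order(Base):
    __tablename__ = "orders"
    id: Mapped[int] = mapped_column(primary_key=True)
    items: Mapped[list["Item"]] = relationship(back_populates="order")

class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(primary_key=True)
    order_id: Mapped[int] = mapped_column(ForeignKey("orders.id"))
    order: Mapped["Order"] = relationship(back_populates="items")

# Before: one joined query avoids N+1 but duplicates every Order column per item,
# which is where a peak-memory blow-up can come from on wide rows.
heavy = select(Order).options(joinedload(Order.items))

# After: selectinload still avoids N+1 (one extra IN query) while keeping result
# rows narrow, so peak memory stays closer to the baseline.
light = select(Order).options(selectinload(Order.items))
```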

30-day production numbers

- Function-level submissions: 14.2k

- Syntax failures: 1.1% → 0.17% after rebirth

- Semantic failures: 0.8% → 0.07% after rebirth

- Baseline failures: 0.3% → 0.02% after rebirth

- Final pass rate: 99.7%

- Unattended roll-backs: 0

In other words, only 3 out of every 1,000 AI-written functions need human eyes; the rest merge automatically.

Three ready-to-use gifts from LBAI

1. Function-level patch toolbox – slices source into hot-swappable units with one-second roll-back.

2. Syntax-repair micro-model – shipped with de-identified error samples for offline deployment.

3. One-click init scaffold – spawns a containerised triple-gate template that plugs into existing CI in ten minutes.

Closing thought

The gap between 78% and 99.7% is not twenty-two percentage points; it is the chasm between "demo toy" and "production asset".

Treat the large model as a probabilistic compiler, surround it with graded gates, evidence chains and cost caps, and probability turns into predictability.

At LBAI we are productising this pipeline so that every line of AI code becomes traceable, auditable and rollback-ready.

Grab our AI-Native CI/CD Whitepaper and let's make 99.7% the industry norm.