Reliability and Design Safety Factors

The difference between a system that delivers its business case and one that disappoints isn’t usually the technology. It’s whether the engineer who designed it understood series reliability, modeled recovery throughput, and sized buffers against 80th-percentile repair times — not means.

The 1.25–1.5× Peak Rule

Standard integrator practice: size conveyor, sortation, and automation systems at 1.25–1.5× the design peak hour throughput. Clients push back on this as padding. It’s not — it derives from three independently necessary compounding factors.

1. Forecast Error Premium (+10–15% above design peak)

Your design peak is based on a volume projection — typically the 95th-percentile hour from 2–3 years of historical data, or a growth projection. That projection has error. Promotional surges, carrier cut-off clustering, and unanticipated order patterns routinely push actual peak hours 10–15% above forecast. A system sized exactly at design peak will fail during its first real promotional event.

2. Recovery Throughput

If nameplate capacity exactly equals peak demand, the system can never recover from a backlog. A 5-minute stop at peak creates a queue; once the system restarts at 100% capacity, all of that capacity is consumed by incoming demand, so the surplus available to work down the queue is 0 units/hr and the backlog never clears.

Worked example:

System designed for 10,000 units/hr peak
15-minute stop creates a 2,500-unit backlog

At 1.5× capacity (15,000 units/hr):

Recovery rate above normal: 5,000 units/hr
Time to burn 2,500-unit backlog: 30 minutes

At 1.0× capacity (10,000 units/hr):

Recovery rate above normal: 0
Backlog never clears during the peak window
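The recovery arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming demand stays flat at the design peak during recovery; the function name `recovery_minutes` is illustrative, not from any integrator tool.

```python
def recovery_minutes(stop_min: float, demand_per_hr: float, capacity_per_hr: float) -> float:
    """Minutes needed to clear the backlog left by a stop while demand keeps arriving."""
    backlog = demand_per_hr * stop_min / 60.0      # units queued during the stop
    surplus = capacity_per_hr - demand_per_hr      # units/hr left over for catch-up
    if surplus <= 0:
        return float("inf")                        # no headroom: the backlog never clears
    return backlog / surplus * 60.0


# Same inputs as the worked example: 15-minute stop at 10,000 units/hr peak
for factor in (1.0, 1.25, 1.5):
    minutes = recovery_minutes(15, 10_000, 10_000 * factor)
    print(f"{factor:.2f}x capacity: {minutes:.0f} min to clear the backlog")
```

Under these assumptions, 1.25× capacity needs 60 minutes to clear the same backlog that 1.5× clears in 30.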

3. Long-Term Degradation

Conveyor belts wear. Drive motors heat-cycle. Belt tension drifts. Sort accuracy degrades as mechanical components fatigue. Without proactive maintenance, a system typically performs at 85–90% of nameplate within 5 years of continuous operation, so one sized at 1.0× peak can no longer meet its design peak. The 1.25–1.5× factor absorbs this real-world performance degradation over a 15–20 year operating life.

Series Reliability Math

A conveyor system — or any automation system built as a linear sequence of subsystems — is a series reliability system. If any single subsystem fails, the system stops.

$$A_{\text{system}} = \prod_{i=1}^{n} A_i$$

This compounds brutally:

Individual Availability	Subsystems	System Availability
99.5%	10	95.1%
99.5%	20	90.5%
99.5%	30	86.0%
99.8%	30	94.2%
99.9%	30	97.0%

A DC sort-to-ship system with 30 subsystems at 99.5% each runs at about 86% system uptime. That is roughly 1.1 hours of downtime per 8-hour shift, or about 5.6 hours per week on a single-shift, five-day schedule: some 56,000 units per week of lost throughput at 10,000 units/hr.
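The series product is easy to check numerically. The sketch below assumes every subsystem has the same availability and fails independently; `series_availability` is a hypothetical helper name.

```python
import math


def series_availability(availabilities):
    """System availability of a pure series chain: the product of the parts."""
    return math.prod(availabilities)


# (individual availability, number of subsystems) pairs from the table
for a, n in [(0.995, 10), (0.995, 20), (0.995, 30), (0.998, 30), (0.999, 30)]:
    system = series_availability([a] * n)
    downtime_hr = (1 - system) * 8          # expected downtime in an 8-hour shift
    print(f"{a:.1%} x {n:2d}: system {system:.1%}, {downtime_hr:.2f} hr down per shift")
```

Passing a list rather than a single value lets the same function handle a line with mixed subsystem availabilities.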

The only mitigations: extremely high individual component reliability, or breaking the series chain with buffers. Buffers convert a series system into a partially parallel one — upstream processes continue filling the buffer while a downstream segment is being repaired.

MTBF, MTTR, and Availability

$$\text{MTBF} = \frac{\text{Total operating time}}{\text{Number of failures}}$$

$$\text{MTTR} = \frac{\text{Total repair time}}{\text{Number of repair events}}$$

$$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

Example: Conveyor zone runs 6,000 hr/yr; 12 failures with 36 hr total downtime.

MTBF = 6,000 / 12 = 500 hr
MTTR = 36 / 12 = 3 hr
A = 500 / (500 + 3) = 99.4%
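The same calculation as a small function, as a sketch that assumes every failure produces exactly one repair event (as in the example above). The name `availability_from_log` is illustrative.

```python
def availability_from_log(operating_hr: float, failures: int, total_repair_hr: float):
    """Return (MTBF, MTTR, availability) from yearly totals."""
    mtbf = operating_hr / failures           # mean time between failures, hours
    mttr = total_repair_hr / failures        # mean time to repair, hours
    return mtbf, mttr, mtbf / (mtbf + mttr)


mtbf, mttr, a = availability_from_log(6_000, 12, 36)
print(f"MTBF = {mtbf:.0f} hr, MTTR = {mttr:.0f} hr, A = {a:.1%}")
```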

The MTTR Nuance Most Models Miss

MTTR in logistics automation is not just active repair time. Total elapsed time to restoration:

Component	Typical Duration
Alarm detection and response	5–15 min
Technician travel to fault	5–20 min
Diagnosis	10–30 min
Active repair	10–60 min
Testing and restart	5–15 min

Active repair is only 30–40% of total elapsed MTTR. If an integrator’s spec sheet shows MTTR = 30 min, clarify whether that’s active repair time or total elapsed time. The difference is a factor of 2–3×.

Your buffers must be sized against total elapsed MTTR — not active repair time.
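Summing the ranges in the table shows where the 30–40% figure comes from. This sketch simply adds the low ends and the high ends, which assumes the components do not overlap in time.

```python
# (low, high) durations in minutes, taken from the table above
components = {
    "alarm detection and response": (5, 15),
    "technician travel to fault": (5, 20),
    "diagnosis": (10, 30),
    "active repair": (10, 60),
    "testing and restart": (5, 15),
}

total_low = sum(low for low, _ in components.values())
total_high = sum(high for _, high in components.values())
active_low, active_high = components["active repair"]

print(f"Total elapsed MTTR: {total_low}-{total_high} min")
print(f"Active repair share: {active_low / total_low:.0%} to {active_high / total_high:.0%}")
```

The low-end and high-end shares work out to roughly 29% and 43%, in line with the 30–40% range quoted above.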

Buffer Sizing: Use the 80th-Percentile MTTR

The minimum buffer required to protect downstream operations from an upstream stop:

$$\text{Buffer capacity (units)} = \text{Upstream rate (units/min)} \times \text{MTTR}_{P80} \text{ (min)}$$

Why P80, not the mean: repair time distributions are right-skewed. Most repairs are quick (belt slip: 10 min), but a few are slow (motor replacement: 4 hr). Mean MTTR might be 45 min; P80 MTTR might be 90–120 min. Even a buffer sized for P80 is exhausted in the slowest 20% of repair events, and a buffer sized for the mean is exhausted in a larger share still. In a high-volume DC, that means a mean-sized buffer runs dry several times per week.
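To make the mean-versus-P80 gap concrete, here is a minimal sketch that sizes a buffer both ways from a repair log. The repair times and the 20 units/min zone rate are invented for illustration, not measured data, and the percentile uses the simple nearest-rank method (other percentile definitions give slightly different values).

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Smallest sample with at least pct% of all samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)    # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical total elapsed repair times in minutes (right-skewed on purpose)
repair_min = [10, 10, 12, 15, 20, 25, 30, 90, 120, 240]
upstream_rate_per_min = 20                        # hypothetical zone rate, units/min

mean_mttr = mean(repair_min)
p80_mttr = percentile_nearest_rank(repair_min, 80)

buffer_mean = upstream_rate_per_min * mean_mttr   # units, sized on the mean
buffer_p80 = upstream_rate_per_min * p80_mttr     # units, sized on P80

# Share of repair events that outlast each buffer
share_over_mean = sum(t > mean_mttr for t in repair_min) / len(repair_min)
share_over_p80 = sum(t > p80_mttr for t in repair_min) / len(repair_min)

print(f"Mean MTTR {mean_mttr:.1f} min -> {buffer_mean:.0f} units, exceeded in {share_over_mean:.0%} of repairs")
print(f"P80 MTTR {p80_mttr} min -> {buffer_p80} units, exceeded in {share_over_p80:.0%} of repairs")
```

With a real maintenance log, replace `repair_min` with total elapsed restoration times, not active repair times.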

Multi-Stage Buffer Placement

For a system A → B → C:

Buffer A→B: absorbs A’s downtime from B’s perspective. Size for A’s P80 MTTR × B’s throughput rate.
Buffer B→C: absorbs B’s downtime from C’s perspective. Size for B’s P80 MTTR × C’s throughput rate.
Place the largest buffer before the system bottleneck.
Place secondary buffers before the most unreliable stages (sorter, AS/RS crane, first-generation robotic cells).
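The first two sizing rules can be sketched for a chain of any length. The `Stage` class, its field names, and the stage figures below are hypothetical; each inter-stage buffer is sized as the upstream stage's P80 MTTR times the downstream stage's throughput rate, as described above.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    rate_per_min: float     # throughput of this stage, units/min
    p80_mttr_min: float     # P80 total elapsed repair time, minutes


def buffer_sizes(stages):
    """Buffer (units) between each adjacent pair: upstream P80 MTTR x downstream rate."""
    return {
        f"{up.name}->{down.name}": up.p80_mttr_min * down.rate_per_min
        for up, down in zip(stages, stages[1:])
    }


# Illustrative numbers only
line = [Stage("A", 20, 30), Stage("B", 18, 60), Stage("C", 18, 45)]
sizes = buffer_sizes(line)
for link, units in sizes.items():
    print(f"Buffer {link}: {units:.0f} units")
```

The sketch does not decide placement. Identifying the bottleneck and the least reliable stages still requires the throughput and failure data for the actual line.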

Spiral conveyors are the most cost-effective buffer medium for high-speed lines: 200–500 units of buffer capacity from a modest floor footprint.

Practical Uptime Benchmarks

System Type	Typical System Uptime
Simple conveyor sort (10 subsystems)	95–97%
Mid-complexity DC (20 subsystems)	90–95%
Full automation line (30+ subsystems)	85–92%
Regulated/pharma with redundancy	97–99%

A system specified at “97% uptime” without modeling the series architecture of 30 subsystems is a business case built on a false number. If each of those 30 subsystems needs to individually achieve 99.9% availability to produce 97% system uptime, that’s the specification to write into the contract — not “97% system uptime” without the component-level backing.
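Working backward from a system target to a component specification is the inverse of the series product. This sketch assumes all subsystems are held to the same availability; `required_subsystem_availability` is an illustrative name.

```python
def required_subsystem_availability(system_target: float, n_subsystems: int) -> float:
    """Per-subsystem availability needed for n identical series subsystems to hit the target."""
    return system_target ** (1.0 / n_subsystems)


for n in (10, 20, 30):
    required = required_subsystem_availability(0.97, n)
    print(f"{n} subsystems: each needs {required:.2%} for 97% system uptime")
```

For 30 subsystems the result is about 99.9%, matching the component-level figure quoted above.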

Storage Utilization: The 85% Rule

Above 85% storage fill, three problems emerge simultaneously:

Slot availability: WMS search time for compliant put-away locations grows, degrading put-away throughput during receiving windows.
Traffic congestion: Dense storage forces equipment to travel longer, detour around occupied locations, and wait for equipment clearances.
Dual-command pairing (AS/RS): Fewer free locations means the crane's dual-command pairing efficiency drops, forcing more single-command cycles.

The 85% rule is not a rule of thumb; it’s the operational ceiling above which measurable throughput degradation begins.

Source: 2.6-advanced-automation-design
