
Reliability and Design Safety Factors

The difference between a system that delivers its business case and one that disappoints isn’t usually the technology. It’s whether the engineer who designed it understood series reliability, modeled recovery throughput, and sized buffers against 80th-percentile repair times — not means.

Standard integrator practice: size conveyor, sortation, and automation systems at 1.25–1.5× the design peak hour throughput. Clients push back on this as padding. It’s not — it derives from three independently necessary compounding factors.

1. Forecast Error Premium (+10–15% above design peak)

Your design peak is based on a volume projection — typically the 95th-percentile hour from 2–3 years of historical data, or a growth projection. That projection has error. Promotional surges, carrier cut-off clustering, and unanticipated order patterns routinely push actual peak hours 10–15% above forecast. A system sized exactly at design peak will fail during its first real promotional event.

The second factor is recovery capacity. If usable capacity exactly equals peak demand, the system can never recover from a backlog. A 5-minute stop at peak creates a queue, and a system running at 100% of capacity afterward has 0 units/hr to spare above normal demand, so the backlog never clears.

Worked example:

  • System designed for 10,000 units/hr peak
  • 15-minute stop creates a 2,500-unit backlog

At 1.5× capacity (15,000 units/hr):

  • Recovery rate above normal: 5,000 units/hr
  • Time to burn 2,500-unit backlog: 30 minutes

At 1.0× capacity (10,000 units/hr):

  • Recovery rate above normal: 0
  • Backlog never clears during the peak window
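
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not a throughput model: the function name `recovery_minutes` is hypothetical, and it assumes demand keeps arriving at the design peak rate both during the stop and throughout recovery.

```python
import math

def recovery_minutes(demand_per_hr: float, capacity_per_hr: float, stop_minutes: float) -> float:
    """Minutes needed to clear the backlog left by a stop (math.inf if it never clears)."""
    backlog_units = demand_per_hr * stop_minutes / 60.0   # units queued during the stop
    surplus_per_hr = capacity_per_hr - demand_per_hr      # recovery rate above normal demand
    if surplus_per_hr <= 0:
        return math.inf                                   # no spare capacity: backlog persists
    return backlog_units / surplus_per_hr * 60.0

# Same inputs as the worked example: 10,000 units/hr peak, 15-minute stop.
for sizing_factor in (1.0, 1.25, 1.5):
    minutes = recovery_minutes(10_000, 10_000 * sizing_factor, 15)
    print(f"{sizing_factor:.2f}x capacity -> recovery time {minutes} min")
```

At 1.5× the function returns 30 minutes and at 1.0× it returns infinity, matching the figures above. At 1.25× the same stop takes 60 minutes to clear.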

The third factor is performance degradation. Conveyor belts wear. Drive motors heat-cycle. Belt tension drifts. Sort accuracy degrades as mechanical components fatigue. Without proactive maintenance, a system performs at 85–90% of nameplate within 5 years of continuous operation, so one sized at 1.0× peak ends up below peak demand. The 1.25–1.5× factor absorbs real-world performance degradation over a 15–20 year operating life.
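
To see what the degradation figure does to each sizing choice, the arithmetic can be sketched as below. The helper name `effective_headroom` is hypothetical, and the sketch assumes the 85–90% retained-capacity range applies equally to every sizing factor, which is a simplification.

```python
def effective_headroom(sizing_factor: float, retained_fraction: float) -> float:
    """Aged capacity expressed as a multiple of design peak throughput."""
    return sizing_factor * retained_fraction

# Retained capacity of 85-90% of nameplate, per the range quoted above.
for sizing_factor in (1.0, 1.25, 1.5):
    worst = effective_headroom(sizing_factor, 0.85)
    best = effective_headroom(sizing_factor, 0.90)
    verdict = "below design peak" if worst < 1.0 else "still above design peak"
    print(f"{sizing_factor:.2f}x sized: {worst:.3f}x to {best:.3f}x of peak ({verdict})")
```

Under these assumptions a 1.0× system falls to 0.85× of design peak in the worst case, while a 1.25× system retains only about 1.06×, which leaves little margin for the other two factors.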

A conveyor system — or any automation system built as a linear sequence of subsystems — is a series reliability system. If any single subsystem fails, the system stops.

$$A_{\text{system}} = \prod_{i=1}^{n} A_i$$

This compounds brutally:

| Individual Availability | Subsystems | System Availability |
|---|---|---|
| 99.5% | 10 | 95.1% |
| 99.5% | 20 | 90.5% |
| 99.5% | 30 | 86.0% |
| 99.8% | 30 | 94.2% |
| 99.9% | 30 | 97.0% |
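
The table values follow directly from the product formula. A minimal sketch (the function name `series_availability` is hypothetical) that accepts one availability per subsystem, so mixed values can be modeled as well as the uniform cases shown:

```python
import math

def series_availability(availabilities: list[float]) -> float:
    """System availability of a series chain: the product of subsystem availabilities."""
    return math.prod(availabilities)

# Reproduce the uniform cases: every subsystem has the same availability.
for component, count in [(0.995, 10), (0.995, 20), (0.995, 30), (0.998, 30), (0.999, 30)]:
    system = series_availability([component] * count)
    print(f"{component:.1%} x {count} subsystems -> {system:.1%}")
```

Passing a list rather than a single value makes it easy to test what happens when one weak subsystem, such as a sorter at 98%, sits in a chain of otherwise reliable components.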

A DC sort-to-ship system with 30 subsystems at 99.5% each has 86% system uptime. That is about 1.1 hours of downtime per 8-hour shift, or roughly 5.6 hours per week on a single-shift, five-day schedule: about 56,000 units per week of lost throughput at 10,000 units/hr.

The only mitigations: extremely high individual component reliability, or breaking the series chain with buffers. Buffers convert a series system into a partially parallel one — upstream processes continue filling the buffer while a downstream segment is being repaired.

$$\text{MTBF} = \frac{\text{Total operating time}}{\text{Number of failures}}$$

$$\text{MTTR} = \frac{\text{Total repair time}}{\text{Number of repair events}}$$

$$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

Example: Conveyor zone runs 6,000 hr/yr; 12 failures with 36 hr total downtime.

  • MTBF = 6,000 / 12 = 500 hr
  • MTTR = 36 / 12 = 3 hr
  • A = 500 / (500 + 3) = 99.4%
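
The same calculation as a short Python sketch. The function name is hypothetical, and it assumes every failure corresponds to exactly one repair event, as in the example above.

```python
def availability_from_log(operating_hours: float, failures: int, total_repair_hours: float):
    """Return (MTBF, MTTR, availability) from a simple failure log summary."""
    mtbf = operating_hours / failures        # mean time between failures, hours
    mttr = total_repair_hours / failures     # mean time to repair, hours
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability

mtbf, mttr, availability = availability_from_log(6_000, 12, 36)
print(f"MTBF = {mtbf:.0f} hr, MTTR = {mttr:.0f} hr, A = {availability:.1%}")
```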

MTTR in logistics automation is not just active repair time. Total elapsed time to restoration:

| Component | Typical Duration |
|---|---|
| Alarm detection and response | 5–15 min |
| Technician travel to fault | 5–20 min |
| Diagnosis | 10–30 min |
| Active repair | 10–60 min |
| Testing and restart | 5–15 min |

Active repair is only 30–40% of total elapsed MTTR. If an integrator’s spec sheet shows MTTR = 30 min, clarify whether that’s active repair time or total elapsed time. The difference is a factor of 2–3×.
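
Summing the component ranges shows where the 30–40% figure comes from. This is a rough sketch that pairs all the low ends together and all the high ends together, which is an assumption; real events mix short and long components.

```python
# Component durations in minutes (low, high), taken from the table above.
ELAPSED_STAGES_MIN = {
    "Alarm detection and response": (5, 15),
    "Technician travel to fault": (5, 20),
    "Diagnosis": (10, 30),
    "Active repair": (10, 60),
    "Testing and restart": (5, 15),
}

def elapsed_range(stages: dict) -> tuple:
    """Total elapsed MTTR if every component is at its low end, then at its high end."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high

def active_repair_share(stages: dict) -> tuple:
    """Active repair as a fraction of total elapsed time at the low and high ends."""
    total_low, total_high = elapsed_range(stages)
    repair_low, repair_high = stages["Active repair"]
    return repair_low / total_low, repair_high / total_high

low, high = elapsed_range(ELAPSED_STAGES_MIN)
share_low, share_high = active_repair_share(ELAPSED_STAGES_MIN)
print(f"Total elapsed: {low}-{high} min; active repair share: {share_low:.0%}-{share_high:.0%}")
```

With these ranges, total elapsed time runs from 35 to 140 minutes and active repair accounts for roughly 29% to 43% of it, in line with the share quoted above.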

Your buffers must be sized against total elapsed MTTR — not active repair time.

Buffer Sizing: Use the 80th-Percentile MTTR

The minimum buffer required to protect downstream operations from an upstream stop:

$$\text{Buffer capacity (units)} = \text{Upstream rate (units/min)} \times \text{MTTR}_{P80} \text{ (min)}$$

Why P80, not the mean: repair time distributions are right-skewed. Most repairs are quick (belt slip: 10 min), but a few are slow (motor replacement: 4 hr). Mean MTTR might be 45 min; P80 MTTR might be 90–120 min. A buffer sized for the mean is inadequate for 20% of repair events — in a high-volume DC, that means the buffer runs dry several times per week.
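
A minimal sketch of the difference between sizing on the mean and sizing on P80. The repair-time sample is hypothetical, chosen only so that its mean is 45 minutes; the nearest-rank percentile method and the function names are also assumptions, and a real study would use the site's own repair log.

```python
import math
import statistics

def percentile_nearest_rank(samples: list, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def buffer_units(rate_units_per_hr: float, stop_minutes: float) -> int:
    """Buffer capacity needed to cover a stop of the given length at the given flow rate."""
    return math.ceil(rate_units_per_hr * stop_minutes / 60)

# Hypothetical elapsed repair times in minutes (illustrative, not measured data).
repair_minutes = [10, 10, 15, 15, 20, 25, 30, 95, 110, 120]

mean_mttr = statistics.mean(repair_minutes)
p80_mttr = percentile_nearest_rank(repair_minutes, 80)

print("Mean MTTR:", mean_mttr, "min ->", buffer_units(10_000, mean_mttr), "units")
print("P80 MTTR: ", p80_mttr, "min ->", buffer_units(10_000, p80_mttr), "units")
```

On this sample the P80 repair time is 95 minutes against a mean of 45, so at 10,000 units/hr the P80 buffer is more than twice the size of the mean-based one.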

For a system A → B → C:

  • Buffer A→B: absorbs A’s downtime from B’s perspective. Size for A’s P80 MTTR × B’s throughput rate.
  • Buffer B→C: absorbs B’s downtime from C’s perspective. Size for B’s P80 MTTR × C’s throughput rate.
  • Place the largest buffer before the system bottleneck.
  • Place secondary buffers before the most unreliable stages (sorter, AS/RS crane, first-generation robotic cells).
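
The first two sizing rules above can be applied link by link along a chain. The sketch below uses hypothetical stage rates and P80 repair times purely for illustration; the `Stage` class and `link_buffers` function are not from any library.

```python
import math
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    rate_per_hr: float     # throughput rate of this stage, units/hr
    p80_mttr_min: float    # 80th-percentile total elapsed repair time, minutes

def link_buffers(stages: list) -> dict:
    """Buffer between each adjacent pair: upstream P80 MTTR x downstream throughput rate."""
    return {
        f"{up.name}->{down.name}": math.ceil(down.rate_per_hr * up.p80_mttr_min / 60)
        for up, down in zip(stages, stages[1:])
    }

# Hypothetical A -> B -> C line (all numbers are illustrative assumptions).
line = [
    Stage("A", rate_per_hr=10_000, p80_mttr_min=30),
    Stage("B", rate_per_hr=9_000, p80_mttr_min=60),
    Stage("C", rate_per_hr=9_500, p80_mttr_min=20),
]

for link, units in link_buffers(line).items():
    print(f"Buffer {link}: {units} units")
```

Here the buffer after B comes out largest because B has the longest P80 repair time, which is the pattern the placement rules above are meant to surface.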

Spiral conveyors are the most cost-effective buffer medium for high-speed lines: 200–500 units of buffer capacity from a modest floor footprint.

| System Type | Typical System Uptime |
|---|---|
| Simple conveyor sort (10 subsystems) | 95–97% |
| Mid-complexity DC (20 subsystems) | 90–95% |
| Full automation line (30+ subsystems) | 85–92% |
| Regulated/pharma with redundancy | 97–99% |

A system specified at “97% uptime” without modeling the series architecture of 30 subsystems is a business case built on a false number. If each of those 30 subsystems needs to individually achieve 99.9% availability to produce 97% system uptime, that’s the specification to write into the contract — not “97% system uptime” without the component-level backing.
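
Working backward from a system-level target to a component-level specification is the inverse of the series formula. A minimal sketch, assuming all subsystems are held to the same availability (the function name is hypothetical):

```python
def required_component_availability(system_target: float, n_subsystems: int) -> float:
    """Per-subsystem availability needed so that n equal subsystems in series hit the target."""
    return system_target ** (1.0 / n_subsystems)

for n in (10, 20, 30):
    needed = required_component_availability(0.97, n)
    print(f"97% system uptime with {n} subsystems needs {needed:.2%} per subsystem")
```

For 30 subsystems the result rounds to 99.90%, consistent with the contract figure discussed above.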

Above 85% storage fill, three problems emerge simultaneously:

  1. Slot availability: WMS search time for compliant put-away locations grows, degrading put-away throughput during receiving windows.
  2. Traffic congestion: Dense storage forces equipment to travel longer, detour around occupied locations, and wait for equipment clearances.
  3. Dual-command pairing (AS/RS): Fewer free locations means the crane’s dual-command pairing efficiency drops, so it runs more single-command cycles.

The 85% rule is not a rule of thumb; it’s the operational ceiling above which measurable throughput degradation begins.

Source: 2.6-advanced-automation-design
