⟦43e7f8c06⟧ Wang Wps File
Length: 24832 (0x6100)
Types: Wang Wps File
Notes: AIR CANADA PROPOSAL
Names: »2065A «
Derivation
└─⟦9a531dff6⟧ Bits:30006257 8" Wang WCS floppy, CR 0157A
└─ ⟦this⟧ »2065A «
WangText
CHAPTER 7
DOCUMENT III TECHNICAL PROPOSAL
Apr. 29, 1982
LIST OF CONTENTS
7. RMA
7.1 Introduction
7.2 RMA Analysis
7.3 Reliability Models and Block Diagrams
7.3.1 Reliability Models for PU's
7.4 Availability of a Single Node
7.5 Availability Access to destination
7.5.1 Availability Access to destination including alternative routing
7.5.2 Availability End User to VIA Host
7.6 EMH availability
7.7 NMH availability
7.8 Equipment Mean Time Between Failures (MTBF)
7.9 Equipment Maintainability (MTTR)
7 R̲E̲L̲I̲A̲B̲I̲L̲I̲T̲Y̲,̲M̲A̲I̲N̲T̲A̲I̲N̲A̲B̲I̲L̲I̲T̲Y̲ ̲A̲N̲D̲ ̲A̲V̲A̲I̲L̲A̲B̲I̲L̲I̲T̲Y̲ ̲A̲N̲A̲L̲Y̲S̲I̲S̲
This chapter provides the detailed analysis of the
reliability and maintainability provided by the proposed
equipment. Emphasis has been given to include the analysis
for the range covered by the proposed system architecture.
Furthermore, detailed information with respect to failure
rates and repair times is provided for the various
components and modules included in the architecture.
7.1 I̲N̲T̲R̲O̲D̲U̲C̲T̲I̲O̲N̲
The availability of the proposed equipment is very high,
due not only to the high reliability of the individual system
elements but also to the chosen CR80 computer
configuration, in which functionally like elements automatically
substitute for each other in case of failure. Overall system
availability has been calculated.
The high system availability has been achieved by use
of highly reliable modules, redundant processor units
and line termination units, and automatic reconfiguration
facilities. Care has been taken to ensure that single
point errors do not cause total system failure.
The reliability criteria imposed on the computer systems
have been evaluated and the proposed hardware/software
operational system analysed to determine the degree
of availability and data integrity provided. In this
chapter reliability is stated in numerical terms and
the detailed predictions derived from mathematical
models are presented.
The availability predictions are made in accordance
with system reliability models and block diagrams corresponding
to the proposed configuration. This procedure involves
the use of module level and processor unit level failure
rates, or MTBF (mean time between failures), and repair
times, or MTTR (mean time to repair); these factors
are used in conjunction with a realistic modelling
of the configuration to arrive at system level MTBF
and availability.
Tabulated results of the analysis are presented, including
the reliability factors: system MTBF and repair time
MTTR.
The basic elements of the proposed system architecture
are standard CR80 units. Reliability
and maintainability engineering was a significant factor
in guiding the development of the CR80.
The CR80 architecture is designed with a capability
to achieve a highly reliable computer system in a cost-effective
way. It provides a reliable set of services to the
users of the system, because it may be customised to
the actual availability requirements. The CR80 fault
tolerant computers are designed to avoid single point
errors in all critical system elements by providing
redundant paths, processor capabilities and power
supplies.
The architecture reflects the fact that the reliability
of peripheral devices is lower than that of the associated
CR80 device controllers. This applies equally well
to communication lines where modems are used as part
of the transmission media. Thus, the peripheral devices,
modems, communication lines, etc., impact the system
availability much more than the corresponding device
controllers.
To assure this very highly reliable product, several
criteria were also introduced on the module level:
* Extensive use of hi-rel, mil-spec components;
ICs are tested to the requirements of MIL-STD 883
level B or similar.
* All hardware is designed in accordance with the
general CR80 H/W design principles. These include
derating specifications, which greatly enhance
the reliability and reduce the sensitivity to parameter
variations.
* Critical modules feature a Built-In Test (BIT) capability
as well as a display of the main states of the
internal process by Light Emitting Diodes on the
module front plate. This greatly improves module
maintainability, as it provides debugging and
troubleshooting methods which reduce the repair time.
* A high quality production line, which includes
high quality soldering, inspection, burn-in and
an extensive automatic functional test.
7.2 R̲M̲A̲ ̲A̲n̲a̲l̲y̲s̲i̲s̲
This section provides information with respect to RMA
analysis of a system. It includes the detailed formulas
which apply as part of the RMA calculations.
The RMA analysis of a system provides information on
how much of the time the system provides a given set
of required functional capabilities, i.e. provides
operative availability. It shows how many times the
system is not operative during a given period and for
how long. A system may be operative even with one
or more elements of the total system down or taken
off-line for the purpose of repair and/or replacement
of defective modules/units. Note that this is operative
as seen by a user of the functional capabilities, not
as seen by maintenance personnel.
The basis for determining the system level availability
is an RMA model of serial and parallel system elements.
Each of these elements defines a specific subset of
the total system with a well defined state either functioning
or not. Serial elements refer to elements all of which
have to be available for that set to be available.
Parallel elements describe those sets where not all
elements need to be available, the number being determined
by the required service level or the redundancy provided.
The subsequent sections introduce the basic RMA building
blocks.
7.2.1 S̲e̲r̲i̲e̲s̲ ̲E̲l̲e̲m̲e̲n̲t̲
The mean time between failures of a series of n different
RMA elements is made up as follows:
$$\mathrm{MTBF}_S = \frac{10^6}{\lambda_S}$$
where the series failure rate is determined by
the sum of the failure rates of the elements:
$$\lambda_S = \lambda_1 + \lambda_2 + \dots + \lambda_i + \dots + \lambda_n$$
where $\lambda_i$ (LAMBDA) denotes the failure rate of the i'th
element, expressed in failures per 10^6 hours.
The availability of a system of n different serial
RMA elements is determined by:
$$A = A_1 \cdot A_2 \cdots A_i \cdots A_n$$
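The series rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the example figures are hypothetical, and failure rates are assumed to be expressed in failures per 10^6 hours, as the 10^6 factor in the MTBF formula suggests.

```python
def series_failure_rate(rates):
    """Failure rate of a series element: the sum of the element rates
    (assumed unit: failures per 10^6 hours)."""
    return sum(rates)


def series_mtbf(rates):
    """MTBF in hours of a series of elements with the given failure rates."""
    return 1e6 / series_failure_rate(rates)


def series_availability(availabilities):
    """Availability of a series element: the product of the element
    availabilities (each a fraction between 0 and 1)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical example values, not taken from the proposal.
example_rates = [100.0, 150.0, 250.0]
print(series_mtbf(example_rates))                      # 1e6 / 500
print(series_availability([0.999, 0.998, 0.9995]))
```

Because every series element must work, adding an element can only lower the MTBF and the availability of the chain.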
7.2.2 P̲a̲r̲a̲l̲l̲e̲l̲ ̲E̲l̲e̲m̲e̲n̲t̲s̲
When RMA elements are in parallel, it is required that
one or more of the parallel units are operative simultaneously
to obtain the required system performance. The actual
number of parallel units required is dependent on the
actual models. Assuming operational redundancy and
negligible recovery time, the calculation rules are:
a. M̲e̲a̲n̲ ̲T̲i̲m̲e̲ ̲B̲e̲t̲w̲e̲e̲n̲ ̲F̲a̲i̲l̲u̲r̲e̲
When the parallel elements have defined MTBF and
MTTR values the following rules apply:
1̲ ̲o̲f̲ ̲2̲ ̲e̲q̲u̲a̲l̲ ̲p̲a̲r̲a̲l̲l̲e̲l̲ ̲e̲l̲e̲m̲e̲n̲t̲s̲ ̲
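$$\mathrm{MTBF}_E = \frac{2 \cdot \mathrm{MTBF} \cdot \mathrm{MTTR} + \mathrm{MTBF}^2}{2 \cdot \mathrm{MTTR}}$$
or
$$\mathrm{MTBF}_E = \frac{\mathrm{MTBF}^2}{2 \cdot \mathrm{MTTR}} \quad \text{provided } \mathrm{MTTR} \ll \mathrm{MTBF}$$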
n̲ ̲o̲f̲ ̲n̲+̲1̲ ̲E̲q̲u̲a̲l̲ ̲P̲a̲r̲a̲l̲l̲e̲l̲ ̲E̲l̲e̲m̲e̲n̲t̲s̲
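$$\mathrm{MTBF}_E = \frac{(n+1) \cdot \mathrm{MTBF} \cdot \mathrm{MTTR} + \mathrm{MTBF}^2}{n(n+1) \cdot \mathrm{MTTR}}$$
or
$$\mathrm{MTBF}_E = \frac{\mathrm{MTBF}^2}{n(n+1) \cdot \mathrm{MTTR}} \quad \text{provided } (n+1) \cdot \mathrm{MTTR} \ll \mathrm{MTBF}$$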
b. M̲e̲a̲n̲ ̲T̲i̲m̲e̲ ̲T̲o̲ ̲R̲e̲p̲a̲i̲r̲
The element mean time to repair, $\mathrm{MTTR}_E$, corresponds
to the period where fewer than n of the n+1
units are available, i.e. the element is not
fully operative.
1̲ ̲o̲f̲ ̲2̲ ̲P̲a̲r̲a̲l̲l̲e̲l̲ ̲E̲l̲e̲m̲e̲n̲t̲s̲ ̲
$$\mathrm{MTTR}_E = \frac{\mathrm{MTTR}}{2}$$
n̲ ̲o̲f̲ ̲n̲ ̲+̲ ̲1̲ ̲P̲a̲r̲a̲l̲l̲e̲l̲ ̲E̲l̲e̲m̲e̲n̲t̲s̲
$$\mathrm{MTTR}_E = \frac{\mathrm{MTTR}}{2}$$
c. A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲
The availability corresponds to the ratio between
the MTBF and the total time, which is
equal to the sum of MTBF and MTTR for the element;
thus:
$$A_E = \frac{\mathrm{MTBF}_E}{\mathrm{MTBF}_E + \mathrm{MTTR}_E}$$
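The parallel rules a. to c. can be tried out with the sketch below. The function names are hypothetical. The unit figures (MTBF 1161 hours for the NCP, 1305 hours for the NSP, MTTR 30 min.) are those quoted in section 7.3.1, and the simplified form of the MTBF rule is used, on the assumption that the MTTR is much smaller than the MTBF.

```python
def parallel_mtbf(mtbf, mttr, n=1, approximate=False):
    """MTBF (hours) of an element that needs n of n+1 equal units."""
    if approximate:
        # Simplified rule, valid when (n + 1) * MTTR is much smaller than MTBF.
        return mtbf ** 2 / (n * (n + 1) * mttr)
    return ((n + 1) * mtbf * mttr + mtbf ** 2) / (n * (n + 1) * mttr)


def parallel_mttr(mttr):
    """Element MTTR (hours): half the unit MTTR, for 1 of 2 and n of n+1."""
    return mttr / 2


def availability(mtbf, mttr):
    """Availability as MTBF / (MTBF + MTTR), both in hours."""
    return mtbf / (mtbf + mttr)


# NCP, 1 of 2 units; NSP, 5 of 6 units; MTTR = 0.5 hours in both cases.
ncp_mtbf = parallel_mtbf(1161, 0.5, n=1, approximate=True)
nsp_mtbf = parallel_mtbf(1305, 0.5, n=5, approximate=True)

# Converted back to failures per 10^6 hours.
print(f"NCP 1 of 2: {1e6 / ncp_mtbf:.2f}")   # 0.74
print(f"NSP 5 of 6: {1e6 / nsp_mtbf:.2f}")   # 8.81
print(availability(ncp_mtbf, parallel_mttr(0.5)))
```

The two printed rates agree with the 0.74 and 8.81 shown for these elements in the single node model of section 7.4, which suggests, without confirming, that the simplified rule was the one applied there.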
7.3 R̲E̲L̲I̲A̲B̲I̲L̲I̲T̲Y̲ ̲M̲O̲D̲E̲L̲S̲ ̲A̲N̲D̲ ̲B̲L̲O̲C̲K̲ ̲ ̲D̲I̲A̲G̲R̲A̲M̲S̲
The computer system is partitioned into system elements
and the models used for reliability and availability
predictions show how the proposed equipment provides
the high degree of reliability required.
The system reliability characteristics for the system
are stated in numerical terms by mathematical models;
the supporting detailed predictions are presented in
this chapter. The system models are partitioned into
modular units and system elements that reflect the
redundancy of the configuration; they account for all
interconnections and switching points. The MTBF and
MTTR for the individual elements used in the calculations
were obtained from experience with similar equipment
on the NICS-TARE, FIKS and CAMPS programmes. The figures
quoted on peripheral equipment are based on data supplied
by the manufacturers.
The equipment has been partitioned and functions apportioned
so that system elements can have only two states -
operable or failed. System elements are essentially
stand-alone and free of chain failures.
Careful attention has been paid in the design to eliminate
series risk elements. Redundant units are repairable
without interruption of service. Maintenance and reconfiguration
is possible without compromising system performance.
The primary source selected for authenticated reliability
data and predictions is the MIL-HDBK-217. The failure
rate data are primarily obtained from experience from
previous programmes and continuously revised as part
of the maintenance programme on concurrent programmes.
The reliability models which apply to the proposed
configurations are identified in the figures shown
on the following pages.
7.3.1 R̲e̲l̲i̲a̲b̲i̲l̲i̲t̲y̲ ̲M̲o̲d̲e̲l̲s̲ ̲f̲o̲r̲ ̲P̲r̲o̲c̲e̲s̲s̲i̲n̲g̲ ̲E̲l̲e̲m̲e̲n̲t̲s̲ ̲
The reliability models, MTBF and availability predictions
for the Processing Units are shown in the figures below:
N̲o̲d̲a̲l̲ ̲S̲w̲i̲t̲c̲h̲ ̲P̲r̲o̲c̲e̲s̲s̲o̲r̲ ̲U̲n̲i̲t̲ ̲(̲N̲S̲P̲)̲
MTBF = 1305 Hours
MTTR = 30 min.
Avail = 99.962%
LAMBDA = 766
Fig. III 7.2.1.1
N̲o̲d̲a̲l̲ ̲C̲o̲n̲t̲r̲o̲l̲ ̲P̲r̲o̲c̲e̲s̲s̲o̲r̲ ̲(̲N̲C̲P̲)̲
The reliability model for the processing part of the
Nodal Control is shown below
N̲o̲d̲a̲l̲ ̲C̲o̲n̲t̲r̲o̲l̲ ̲P̲r̲o̲c̲e̲s̲s̲o̲r̲ ̲(̲N̲C̲P̲)̲
MTBF = 1161 Hours
MTTR = 30 min.
Avail = 99.957%
LAMBDA = 861
Fig. III 7.2.1.2
Network Management Processor (NMP)
MTBF = 1241 Hours
MTTR = 30 min
Avail = 99.960%
LAMBDA = 806
Fig. III 7.2.1.3
E̲l̲e̲c̲t̲r̲o̲n̲i̲c̲ ̲M̲a̲i̲l̲ ̲P̲r̲o̲c̲e̲s̲s̲o̲r̲
MTBF = 1453 Hours
MTTR = 30 min.
Avail = 99.995%
LAMBDA = 688
Fig. III 7.2.1.4
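The MTBF figures quoted for the four processing units can be cross-checked against the series rule of section 7.2.1. The sketch below assumes that the unlabelled figure given with each unit (766, 861, 806, 688) is its failure rate in failures per 10^6 hours; the dictionary and function names are hypothetical.

```python
# Assumed failure rates (failures per 10^6 hours) read from the figures.
FAILURE_RATES = {
    "Nodal Switch Processor Unit (NSP)": 766,
    "Nodal Control Processor (NCP)": 861,
    "Network Management Processor (NMP)": 806,
    "Electronic Mail Processor": 688,
}


def mtbf_hours(failure_rate):
    """MTBF in hours for a failure rate given per 10^6 hours."""
    return 1e6 / failure_rate


for name, rate in FAILURE_RATES.items():
    print(f"{name}: MTBF = {mtbf_hours(rate):.0f} Hours")
```

Under that assumption the computed values round to 1305, 1161, 1241 and 1453 hours, the MTBF figures quoted for the four units.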
7.4 A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲ ̲o̲f̲ ̲a̲ ̲S̲i̲n̲g̲l̲e̲ ̲N̲o̲d̲e̲
Shown below is the availability model for a single
node, which includes the dual NCC.
The following criteria are used in the calculations:
* The Nodal Switch LTU's are partitioned in groups
of 36 LTU's of which only 1 may have failed.
* A nodal switch processor will still work, even
if the V24 connection to the Nodal Control Processor
does not work.
Element            Redundancy   LAMBDA
NCP                1 of 2       0.74
CU_CP              -            1.59
NSP                5 of 6       8.81
Nodal LTU grp. 1   -            5.2
Nodal LTU grp. 2   -            5.2
Nodal LTU grp. 3   -            5.2
Nodal LTU grp. 4   -            5.2
MTBF = 31,308 Hours
MTTR = 30 Min.
Avail = 99.9984%
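As a cross-check, the single node MTBF follows from summing the element figures of the model as a series chain. The sketch assumes those figures are failure rates in failures per 10^6 hours and that the MTBF printed as 31.308 uses a point as thousands separator; the variable names are hypothetical.

```python
# Element failure rates read from the single node model
# (assumed unit: failures per 10^6 hours).
NODE_ELEMENT_RATES = {
    "NCP, 1 of 2": 0.74,
    "CU_CP": 1.59,
    "NSP, 5 of 6": 8.81,
    "Nodal LTU grp. 1": 5.2,
    "Nodal LTU grp. 2": 5.2,
    "Nodal LTU grp. 3": 5.2,
    "Nodal LTU grp. 4": 5.2,
}

# Series rule: the failure rates add, and MTBF = 10^6 / total rate.
total_rate = sum(NODE_ELEMENT_RATES.values())
node_mtbf = 1e6 / total_rate

print(f"Total failure rate = {total_rate:.2f}")
print(f"MTBF = {node_mtbf:.0f} Hours")
```

The total rate comes to 31.94 and the MTBF to about 31,309 hours, in line with the quoted figure.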
A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲ ̲f̲o̲r̲ ̲t̲h̲e̲ ̲N̲C̲C̲ ̲C̲U̲
Element          Branch        LAMBDA
CU Crate Assy    (series)      1.4
DISK CTRL        parallel 1    54.4
DISK             parallel 1    250
DISK CTRL        parallel 2    54.4
DISK             parallel 2    250
MTBF = 630,915 Hours
MTTR = 30 Min.
LAMBDA = 1.59
A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲ ̲f̲o̲r̲ ̲N̲S̲P̲ ̲ ̲L̲T̲U̲ ̲G̲r̲o̲u̲p̲
Element          LAMBDA
CU Crate Assy.   1.4
CU Crate Assy.   1.4
CU Crate Assy.   1.4
LTU              36.4
LIA-N            0.1
35 of 36
1
MTBF = 192,308 Hours
MTTR = 15 Min.
LAMBDA = 5.20
7.5 A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲ ̲A̲c̲c̲e̲s̲s̲ ̲t̲o̲ ̲d̲e̲s̲t̲i̲n̲a̲t̲i̲o̲n̲
Shown below is the reliability model for access point
to access point in the primary path.
MTBF = 9,777 Hours
MTTR = 30 Min.
Avail = 99.9949%
7.5.1 A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲ ̲a̲c̲c̲e̲s̲s̲ ̲t̲o̲ ̲d̲e̲s̲t̲i̲n̲a̲t̲i̲o̲n̲,̲ ̲i̲n̲c̲l̲u̲d̲i̲n̲g̲ ̲a̲l̲t̲e̲r̲n̲a̲t̲i̲v̲e̲
̲r̲o̲u̲t̲i̲n̲g̲ ̲
The availability of access from access point to destination
point is not improved by use of alternative routing.
This is due to the small size of the network: the two access
nodes and the two access LTU's still have to work,
and nearly all the unavailability is associated with
these four components.
7.5.2 A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲ ̲E̲n̲d̲ ̲U̲s̲e̲r̲ ̲t̲o̲ ̲V̲I̲A̲ ̲H̲o̲s̲t̲
The reliability model for an end user's access to the
VIA Host is shown below:
MTBF = 44,803 Hours
MTTR = 30 Min.
Avail = 99.9989%
Legend Non CR,
estimated
*) Note that the availability is calculated
with the 1991 Node Configuration
7.6 E̲l̲e̲c̲t̲r̲o̲n̲i̲c̲ ̲M̲a̲i̲l̲ ̲H̲o̲s̲t̲ ̲A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲
The Electronic Mail Host (EMH) availability model is
shown in the figure below:
MTBF = 249,513 Hours
MTTR = 30 Min.
A̲v̲a̲i̲l̲ ̲=̲ ̲ ̲9̲9̲.̲9̲9̲9̲8̲%̲
7.7 N̲e̲t̲w̲o̲r̲k̲ ̲M̲a̲n̲a̲g̲e̲m̲e̲n̲t̲ ̲H̲o̲s̲t̲ ̲A̲v̲a̲i̲l̲a̲b̲i̲l̲i̲t̲y̲
Shown below is the availability model for the Network
Management Host:
MTBF = 1129 Hours
MTTR = 30 Min.
Avail = 99.9557%
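The availability figures quoted in sections 7.4 to 7.7 can be recomputed from the stated MTBF and MTTR with the rule A = MTBF / (MTBF + MTTR). The sketch assumes that MTBF values printed with a point (for example 31.308) use it as a thousands separator; the names are hypothetical.

```python
MTTR_HOURS = 0.5  # 30 Min., as stated for each of these models

# MTBF in hours, read from sections 7.4 to 7.7 under the stated assumption.
MTBF_HOURS = {
    "Single node (7.4)": 31308,
    "Access to destination (7.5)": 9777,
    "End user to VIA Host (7.5.2)": 44803,
    "Electronic Mail Host (7.6)": 249513,
    "Network Management Host (7.7)": 1129,
}


def availability_percent(mtbf, mttr=MTTR_HOURS):
    """Availability in percent from MTBF and MTTR, both in hours."""
    return 100.0 * mtbf / (mtbf + mttr)


for name, mtbf in MTBF_HOURS.items():
    print(f"{name}: Avail = {availability_percent(mtbf):.4f}%")
```

With these inputs the results are 99.9984%, 99.9949%, 99.9989%, 99.9998% and 99.9557%, matching the quoted availabilities.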
7.8 E̲Q̲U̲I̲P̲M̲E̲N̲T̲ ̲M̲E̲A̲N̲ ̲T̲I̲M̲E̲ ̲B̲E̲T̲W̲E̲E̲N̲ ̲F̲A̲I̲L̲U̲R̲E̲S̲(̲M̲T̲B̲F̲)̲
The high reliability of the proposed equipment is achieved
through the use of equipment with proven failure rates, similar
to that supplied by Christian Rovsing for the NICS-TARE,
FIKS and CAMPS programmes.
Early in the design phase, a major objective for each
module is to achieve reliable performance. CR80 modules
make extensive use of carefully chosen components;
most of the IC's are tested to the requirement of MIL-STD
883 level B.
The failure rate (the inverse of MTBF) which
applies to system elements and modules is listed in
Table 7-8, entitled CR80 Reliability Factors.
The MTBF data has been derived from reliability data
maintained on the NICS-TARE and CAMPS and similar programmes.
Inherent MTBF values are in general derived from
reliability predictions carried out in accordance
with the U.S. MIL-HDBK-217, "Reliability Prediction of
Electronic Equipment". This document, adopted by Christian
Rovsing through their involvement with NICS-TARE, is
used extensively on current military and aerospace
programmes.
Failure rate data for terminal and peripheral equipment
is generally provided by the vendor in accordance with
the subcontract specifications.
R & M VALUES FOR MODULES AND PERIPHERALS
Table 7-8 (Cont'd)
7.9 E̲Q̲U̲I̲P̲M̲E̲N̲T̲ ̲M̲A̲I̲N̲T̲A̲I̲N̲A̲B̲I̲L̲I̲T̲Y̲ ̲(̲M̲T̲T̲R̲)̲
The proposed network is designed for ease of maintenance.
Each system is built of modules, each comprising a complete,
well-defined function. Replacement of modular units
results in minimum repair time. Software and firmware
diagnostic routines rapidly isolate faulty modules;
repair can then be performed by semi-skilled maintenance
personnel and usually without special tools.
The proposed network, composed of redundant CR80 elements,
meets the objective of ease of maintenance. All units
and system elements are of a modular construction so
that any defective module can be isolated and replaced
in a minimum amount of time.
In the design of the CR80, careful attention was given
to ease of maintenance without requiring special tools,
so that the maintenance could be performed by semi-skilled
maintenance personnel.
Fault detection and isolation to the system element,
in some cases module level, is inherent in the software
residing in the various processors. In peripheral devices,
the fault detection and isolation is accomplished by
a combination of on-line, software, built-in tests,
and operator observations.
Where correct functioning of the system is extremely
critical, the CR80 will have built-in, on-line diagnostic
programmes. Even though the CR80 is highly reliable,
failures can occur; use of the off-line diagnostics
minimises the downtime of a system.
An off-line diagnostics software package is employed
to ease the diagnosis in case of error. Normally,
this software package is stored on disc. After initiation,
the programme will test all modules forming the system
and print the name and address of the faulty module
on the operator's console. Once the faulty module has
been replaced, the CR80 is ready for operation again. The
operator might, if necessary, run the off-line diagnostics
programme once more to verify that the system is now
working without errors.
The command interpreter module of the diagnostic package
enables the operator to initiate any or all of the
test programmes for the specific subsystem off-line,
to assist in trouble shooting and to verify the repair.
Examples of modules tested are: LTU's, CPU and RAM
modules, etc.
The diagnostic package will also assist in fault isolation
of the peripherals. However, common and special test
equipment might have to be used to isolate the faulty
module.
The Mean-Time-To-Repair for the equipment is derived
from two sources. The first is actual experience data
on the equipment proposed for the front-end system.
The other source is from predictions generated in accordance
with MIL-HDBK-472 or similar documents. As an example,
the MTTR for the Disc Storage Unit was derived from
repair times measured by the supplier. The repair times
of other units were derived by a time-line analysis
of the tasks associated with fault detection, isolation,
repair, and verification. These repair times were weighted
by the MTBF of each module to derive the unit MTTR.
The calculation of the Mean-Time-To-Repair (MTTR) is
done by weighting the individual module repair times
by the MTBF of the individual module. The MTTRs of
the major CR80 equipments are presented in Table 7-8.
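The weighting described above can be sketched as follows. The text speaks of weighting by the MTBF of each module; this sketch assumes that means weighting each repair time by the module failure rate (the inverse of MTBF), so that modules which fail more often count more, which is the usual convention in maintainability prediction. The function name and the module figures are hypothetical and not taken from Table 7-8.

```python
def unit_mttr(modules):
    """Unit MTTR in hours from (failure_rate, repair_time_hours) pairs.

    Each repair time is weighted by the module failure rate, on the
    assumption that this is what weighting "by the MTBF" refers to.
    """
    total_rate = sum(rate for rate, _ in modules)
    weighted = sum(rate * repair_time for rate, repair_time in modules)
    return weighted / total_rate


# Hypothetical modules: (failures per 10^6 hours, repair time in hours).
example_modules = [(100, 0.25), (50, 0.5), (50, 1.0)]
print(unit_mttr(example_modules))
```

In this example the module with the highest failure rate has the shortest repair time, so the unit MTTR (0.5 hours) lies below the plain average of the three repair times.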
The predicted MTTR values are from experience with
modules of the NICS-TARE, FIKS and CAMPS programmes.
The predicted MTTR assumes that all tools, repair parts,
manpower, etc., required for maintenance are continuously
available.