IDCN - VOLUME II                                        SYS/83-11-18
TECHNICAL PROPOSAL
5     SYSTEM AVAILABILITY
5.1   GENERAL CONSIDERATIONS
5.2   RECOVERY PROCEDURES
5.3   FALLBACK PROCEDURES
5.4   RECOVERY TIMES
5.5   MEAN-TIME-BETWEEN-FAILURE (MTBF)
5.6   MEAN-TIME-TO-REPAIR (MTTR)
5.7   OVERALL SYSTEM AVAILABILITY
5  SYSTEM AVAILABILITY
The availability of the proposed equipment is very high, owing not only to the high reliability of the individual system elements but mainly to the chosen CR80 computer configuration, in which functionally identical elements automatically substitute for each other in case of failure.

The actual availability will be very close to 100%, due to the exceptional design of the CR80 configuration for the IDCN.
5.1  GENERAL CONSIDERATIONS
The high system availability has been achieved by the
use of highly reliable modules, redundant processor
units and automatic reconfiguration facilities. Care
has been taken to ensure that single point errors do
not cause total system failure.
The reliability criteria imposed on the computer systems
have been evaluated and the proposed hardware/software
operational system analysed to determine the degree
of availability and data integrity provided. In this chapter, reliability is stated in numerical terms, and the detailed predictions derived from mathematical models are presented.
The availability predictions are made in accordance with system reliability models and block diagrams corresponding to the proposed configuration. This procedure involves the use of module-level and processor-unit-level failure rates, or MTBF (mean time between failures), and MTTR (mean time to repair); these factors are used in conjunction with a realistic modeling of the configuration to arrive at the system-level MTBF and availability.
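For illustration only, the relationship between MTBF, MTTR and steady-state availability, and the way series elements combine, can be sketched as follows; the figures used in the example are assumed round numbers and are not taken from the tables in this chapter.

    # Illustrative sketch only: steady-state availability from MTBF and
    # MTTR, and its combination for elements connected in series.
    def availability(mtbf_hours, mttr_hours):
        # A = MTBF / (MTBF + MTTR)
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def series_availability(availabilities):
        # All series elements must be operable, so availabilities multiply.
        product = 1.0
        for a in availabilities:
            product *= a
        return product

    # Assumed example: a module with MTBF = 30000 h and MTTR = 0.5 h.
    a_module = availability(30000, 0.5)             # ~0.999983
    a_chain  = series_availability([a_module] * 3)  # three such modules in series
    print(a_module, a_chain)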
Tabulated results of the analysis are presented including
the reliability factors: system MTBF and MTTR.
The basic elements of the proposed system architecture
are composed of standard CR80 units. Reliability and
maintainability engineering was a significant factor
in guiding the development of the CR80.
The CR80 architecture is designed to achieve a highly reliable computer system in a cost-effective way. It provides a reliable set of services to the users of the system because it can be tailored to the actual availability requirements. The CR80 fault-tolerant computers are designed to avoid single points of failure in all critical system elements through redundant paths, multiprocessor capabilities and dual power supplies.
The architecture reflects the fact that the reliability
of peripheral devices is lower than that of the associated
CR80 device controllers. This applies equally well
to communication lines where modems are used as part
of the transmission media. Thus, the peripheral devices,
modems, communication lines, etc., impact the system
availability much more than the corresponding device
controllers.
To assure this very highly reliable product, several criteria were also introduced at the module level:

- An extensive use of hi-rel, mil-spec components; ICs are tested to the requirements of MIL-STD-883 level B or similar

- All hardware is designed in accordance with the general CR80 H/W design principles. These include derating specifications, which greatly enhance reliability and reduce sensitivity to parameter variations

- Critical modules feature a Built-In Test (BIT) capability as well as a display of the main states of the internal process by Light Emitting Diodes on the module front plate. This greatly improves module maintainability, as it provides debugging and troubleshooting methods which reduce the repair time

- A high-quality production line, which includes high-quality soldering, inspection, burn-in and an extensive automatic functional test

- Software reliability is another aspect incorporated in achieving high overall availability

- Data is replicated in order to increase system availability

- Automatic and manual facilities are provided to perform quick reconfiguration in case of errors

- Extensive maintenance and diagnostic (M & D) software can be used to minimize downtime.
5.2  RECOVERY PROCEDURES
Flexible variation in the size and structure of the CR80 system used for the IDCN is permitted by the unusual degree of hardware and software modularity. The hardware essentially consists of fast transfer buses joined to each other by adapters which allow units on one bus to access those on another. Dualization at the internal level and multiple redundancy at the system level provide a CR80 hardware architecture which is exploited by the XAMOS operating system and programs to survive operational failure of individual components.
Reliability, which is increasingly becoming of concern
in real-time and distributed network applications,
is achieved in the CR80 computer systems by applying
unique architectural concepts. The CR80 hardware/software architecture treats all multiprocessors as equal elements, none absolutely dedicated to a specific role. Fault tolerance and backup are achieved through a redundancy scheme without preassignment of system functions to specific processors. This is in marked contrast to
the more common rigid dualized configurations often
encountered in dedicated applications with on-line
master/slave arrangements, or off-line backup with
switchover facility.
All redundant equipment is under the control of a watchdog micro-computer, which constantly receives status information from all subsystems. This strategy ensures that all units are ready to operate if any reconfiguration is needed.
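The monitoring principle can be pictured with the short, purely illustrative sketch below; the subsystem names and functions used are hypothetical and do not describe the actual watchdog firmware.

    # Purely illustrative sketch of the watchdog principle: each subsystem
    # is polled for status, and a failed unit triggers an automatic
    # reconfiguration.  All names here are hypothetical.
    import time

    SUBSYSTEMS = ["PU-A", "PU-B", "CU", "LTU-1", "LTU-2"]

    def poll_subsystem(name):
        # Placeholder for reading one subsystem's status report.
        return {"name": name, "operable": True}

    def switch_to_standby(name):
        # Placeholder for an automatic reconfiguration action.
        print("reconfiguring: taking", name, "out of service; standby takes over")

    def watchdog_loop(cycles=3):
        for _ in range(cycles):              # bounded here; continuous in practice
            for name in SUBSYSTEMS:
                if not poll_subsystem(name)["operable"]:
                    switch_to_standby(name)
            time.sleep(1.0)                  # arbitrary poll interval

    watchdog_loop()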
The IDCN has been sized to deal with the required data
volumes by use of primary hardware only.
Performance degradation may result from the occurrence of a failure if it happens during peak load, because system resources are used to recover from errors.
As an example, consider the mirrored disc. If a head crash occurs on one of the discs, a fresh blank disc must be inserted, and all information must be copied from the non-failed disc to the fresh disc. This requires more disc activity than normal operational use, so it might affect performance levels during peak-load situations. The operator can of course choose to postpone disc restoration until after the peak load, but this is not recommended, because until restoration is complete the system is not able to recover from the next failure.
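The restoration step itself is simple in principle, as the purely illustrative sketch below shows; the block-level copy is hypothetical and stands in for the actual mirrored-disc restoration facility.

    # Purely illustrative sketch of mirrored-disc restoration: every block
    # is copied from the surviving disc to the replacement disc, which is
    # additional disc activity on top of normal operation.
    surviving_disc   = ["block-%d" % i for i in range(8)]   # stand-in data
    replacement_disc = [None] * len(surviving_disc)         # fresh blank disc

    def restore_mirror(source, target):
        for i, block in enumerate(source):
            target[i] = block        # each copy is an extra disc transfer
        return target

    restore_mirror(surviving_disc, replacement_disc)
    print(replacement_disc == surviving_disc)    # True: mirror re-established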
Similarly, when errors occur in one of the two processing units, the system can continue operation with reduced facilities in a graceful-degradation mode, but it must be realized that subsequent errors might be catastrophic. Various degradation strategies can be programmed in the watchdog, which initiates all automatic reconfigurations. The system operator may override this by enabling/disabling various devices, and he may also perform physical reconfiguration by removing/replacing the various hardware modules. This can be done without taking the power off the system.
In principle, users will be recovered automatically from hardware errors when these occur on a fully redundant system, but in some instances it may be necessary to ask a user to re-enter his last input transaction.
5.3  FALLBACK PROCEDURES
As described earlier, the CR80 configuration for IDCN
has been designed to provide maximum availability.
This means that several fallback procedures have been
implemented at the hardware and system software level.
Logical addressing is used throughout the system, which makes it possible to access the system from an alternative terminal or to print out on an alternative hardcopy device, subject to security constraints.

In addition to the standard fallback procedures implemented in hardware and system software, such as the mirrored disc concept, procedural fallback procedures may be implemented and enforced by the system.
5.4  RECOVERY TIMES
Recovery times are minimized throughout the system by using automatic recovery wherever possible. This approach eliminates operator reaction time, which is normally several orders of magnitude greater than that of automated procedures. The actual recovery times depend very much on the circumstances.

Reintroducing modules as part of restoring a failed system under system operator control will be dominated by operator reaction time, but good procedural rules and guidelines can minimize the time required.

The system operator can advise users of any planned reduction of system facilities.
5.5  MEAN-TIME-BETWEEN-FAILURE (MTBF)
The high reliability of the proposed equipment is achieved through the use of equipment with proven failure rates, similar to that supplied for other programs.

Early in the design phase, a major objective for each module is to achieve reliable performance. CR80 modules make extensive use of carefully chosen components; most of the ICs are tested to the requirements of MIL-STD-883 level B.

The failure rate, which is the inverse of the MTBF, is listed for the system elements and modules. The MTBF data has been derived from reliability data maintained on similar programs. Inherent MTBF values are in general derived from reliability predictions performed in accordance with U.S. MIL-HDBK-217, "Reliability Prediction of Electronic Equipment".

Failure rate data for terminal and peripheral equipment is generally provided by the vendor in accordance with the subcontract specifications.
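The FPMH column in the tables that follow is read here as failures per million hours, i.e. simply 10^6 divided by the MTBF in hours; the short check below is illustrative and uses two rows of the module table.

    # FPMH (failures per million hours) is the inverse of the MTBF
    # expressed per 10^6 hours: FPMH = 1e6 / MTBF(hours).
    def fpmh(mtbf_hours):
        return 1e6 / mtbf_hours

    # Cross-check against two rows of the module table below:
    print(round(fpmh(36500), 1))    # 8002 CPU, SCM -> 27.4
    print(round(fpmh(172400), 1))   # 8009 EPM      -> 5.8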
The MTBF and MTTR figures are supplied in the following table
for equipment which might be a part of the IDCN:
Module     Description                 MTBF              FPMH         MTTR
Item No.                               (hrs)                          (minutes)

8002       CPU, SCM                    36500             27.4         30
8003       CPU, CACHE                  26100             38.3         30
8009       EPM                         172400            5.8          30
8013       EPROM                       91700             10.9         30
8016       RAM 128K/64K                17000/29600       58.8/33.8    30
8020       MAP                         19400             51.6         30
8021       STI                         32800             30.5         30
8037       UNIVAC I/F                  33200             30.1         30
8039       IBM CH. I/F                 32400             30.9         30
8044       DISC CTRL DUAL/SINGLE       30200/39400       33.1/25.4    30
8045       TAPE CTRL 16K               35700             28.0         30
8046       DUAL PAR. CTRL              35700             28.0         30
8047       ST.FD.CTRL DUAL/SINGLE      55500/84700       16.8/11.5    30
8050       POWER SUPPLY                26800             37.3         30
8055       MBT                         285700            3.5          30
8059       MBE                         10000000          0.1          30
8066       LTU DUAL/SINGLE             27600/45000       36.9/22.2    30
8070       CSA                         769000            1.3          30
8071       MIA                         85500             11.7         30
8072       SBA                         90100             11.1         30
8073       TIA                         117600            8.5          30
8074       EPA                         256000            3.9          30
8078       IBA                         21600             46.2         30
8079       UIA                         15600             64.0         30
8081       CIA A & B                   71400             14.0         30
8082       LIA-N                       10000000          0.1          30
8083       LIA-S (Switch + Common)     534759/3571428    1.87/0.28    30
8084       DCA                         46900             21.3         30
8085       TCA                         128200            7.8          30
8086       PCA                         185200            5.4          30
8087       SFA                         10000000          0.1          30
8088       EIA A & B                   113600            8.8          30
8106       MAINS FILTER DISTRIBUTION   625000            1.6          30
8115       Minicrate                   26300             38.0         60
8125/PC    PU-CRATE                    200000            5.0          60
8124/AB    CU-CRATE                    703630            1.4          60
Peripheral  Description                       MTBF      FPMH     MTTR
Item No.                                      (hrs)              (minutes)

8300/---    DISC DRIVE, SMD, 40-300MB         4000      250.0    90
8301/---    DISC DRIVE, CMD (16-48)+16MB      4000      250.0    90
8302/---    DISC DRIVE, MMD, 12-80MB          8000      125.0    60
8307/---    FLOPPY DRIVE, dual/single sided   8000      125.0    30
8320/001    TAPE STATION, Pertec FT 8000      8000      125.0    60
8320/002    TAPE STATION, Pertec FT 5000      2500      400.0    60
5.6  MEAN-TIME-TO-REPAIR (MTTR)
The proposed system is designed for ease of maintenance.
The system is built of modules, each comprising a complete, well-defined function. Replacement of modular units results in minimum repair time. Software and firmware diagnostic routines rapidly isolate faulty modules; repair can then be performed by semi-skilled maintenance personnel, usually without special tools.
The proposed system, composed of redundant elements,
meets the objective of ease of maintenance. All units
and system elements are of a modular construction so
that any defective module can be isolated and replaced
in a minimum amount of time.
In the design of the System Elements, careful attention
was given to ease of maintenance without requiring
special tools, so that the maintenance could be performed
by semi-skilled maintenance personnel.
Fault detection and isolation to the system element level, and in some cases to module level, is inherent in the software residing in the various processors. In peripheral devices, fault detection and isolation is accomplished by a combination of on-line software, built-in test, and operator observations.
In cases where correct functioning of the system is extremely critical, the Processors will have built-in, on-line diagnostic programs. Even though the Processors are highly reliable, failures can occur; use of the off-line diagnostics minimizes the downtime for a system.
An off-line diagnostics software package can be employed to ease diagnosis in case of error. Normally, this software package is stored on disc. After initiation, the program will test all modules forming the system and print the name and address of the erroneous module on the operator's console. Having replaced the erroneous module, the Processor is ready for operation again. The operator may, if necessary, run the off-line diagnostics program once more to verify that the system is now working without errors.
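The flow of such an off-line test run can be pictured with the purely illustrative sketch below; the module list and test routine are hypothetical and do not represent the actual CR80 diagnostic package.

    # Purely illustrative sketch of an off-line diagnostic run: every
    # module is tested in turn, and any failing module is reported with
    # its name and address on the operator's console.  The module list
    # and test routine are hypothetical.
    MODULES = [
        ("CPU, CACHE", 0x0100),
        ("RAM 128K",   0x0200),
        ("LTU",        0x0300),
    ]

    def test_module(name, address):
        # Placeholder for the real module test; True means the test passed.
        return True

    def run_offline_diagnostics():
        failures = []
        for name, address in MODULES:
            if not test_module(name, address):
                print("FAILED: %s at address 0x%04X" % (name, address))
                failures.append((name, address))
        if not failures:
            print("All modules tested OK - ready for operation")
        return failures

    run_offline_diagnostics()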
The command interpreter module of the diagnostic package enables the operator to initiate any or all of the test programs for a specific subsystem off-line, to assist in troubleshooting and to verify a repair. Examples of modules tested are LTUs, CPU and RAM modules, etc.

The diagnostic package will also assist in fault isolation of the peripherals. However, common and special test equipment might have to be used to isolate the faulty module.
The Mean-Time-To-Repair for the equipment is derived
from two sources. The first is actual experience data
on the equipment proposed for the system. The other
source is from predictions generated in accordance
with MIL-HDBK-472 or similar documents. As an example,
the MTTR for the Disk Storage Unit was derived from
repair times measured by the supplier. The repair times
of other units were derived by a time-line analysis
of the tasks associated with fault detection, isolation,
repair, and verification. These repair times were weighted
by the MTBF of each module to derive the unit MTTR.
The calculation of the Mean-Time-To-Repair (MTTR) is done by weighting the individual module repair times by the MTBF of the individual module. The MTTRs of the major CR80 equipment are presented.
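One common form of such a weighting, in which each module's repair time is weighted by its failure rate (the inverse of its MTBF), is sketched below for illustration; the figures are assumed and the sketch is not the exact procedure used for the values in this proposal.

    # Generic sketch of a failure-rate-weighted unit MTTR: modules that
    # fail more often contribute more weight to the unit repair time.
    def unit_mttr(modules):
        # modules: list of (mtbf_hours, mttr_hours) pairs, one per module.
        total_rate = sum(1.0 / mtbf for mtbf, _ in modules)
        weighted   = sum((1.0 / mtbf) * mttr for mtbf, mttr in modules)
        return weighted / total_rate

    # Assumed example: two module types, each with a 0.5-hour repair time.
    print(unit_mttr([(36500, 0.5), (17000, 0.5)]))   # -> 0.5 hours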
The predicted MTTR values are from experience with
modules of other programs. The predicted MTTR assumes
that all tools, repair parts, manpower, etc., required
for maintenance are continuously available.
The following figure shows a typical fault isolation
and replacement sequence, when skilled people are used.
Figure 5.6-1  Typical Fault Isolation and Replacement Sequence
5.7  OVERALL SYSTEM AVAILABILITY
The IDCN system has been designed with the objective
of providing an extremely highly available system.
The computer system is partitioned into system elements, and the model used for reliability and availability prediction shows how the proposed equipment provides the high degree of reliability required.
The reliability characteristics for the system are
stated in numerical terms by a mathematical model.
The supporting detailed prediction is presented in this chapter. Figures and tables 5.7-1 to 5.7-5 show an extract of the model. For ease of calculation, an MTTR of 1 hour is used as standard, although 30 minutes is more realistic for nearly all modules (see the preceding tables). The system model is partitioned into modular
units and system elements that reflect the redundancy
of the configuration; it accounts for all interconnections
and switching points. The MTBF and MTTR for the individual
elements used in the calculations were obtained from
experience with similar equipment on other programs.
The equipment has been partitioned and functions apportioned,
so that system elements can have only two states -
operable or failed. System elements are essentially
stand-alone and free of chain failures.
Careful attention has been paid in the design to eliminate
series risk elements. Redundant units are repairable
without interruption of service. Maintenance and reconfiguration
are possible without compromising system performance.
The primary source selected for authenticated reliability data and predictions is MIL-HDBK-217. The failure rate data are primarily obtained from experience on previous programs and are continuously revised as part of the maintenance program on concurrent programs.
The reliability model which applies to the proposed configurations is identified in the following figures.

The model that has been calculated covers the basic operational system. To improve the availability of the minimal system and the Communication Handling system to an even higher degree, higher spare-part availability can be ensured for important modules, which can easily be introduced as part of a fallback procedure.
Figure 5.7-1  Reliability Model for IDCN Processor Unit
NO  ITEM DESCRIPTION  M of N  MTBF     Lambda    MTTR     Equiv Lambda  Equiv MTBF  Availa-
                      Req'd   (hours)  (fpm)     (hours)  (fpm)         (hours)     bility

 1  PU CRATE          1 of 1           5.0       1
 2  FAN               1 of 1           1.2       1
 3  MAIN SW           1 of 1           1.0       1
 4  PWR DISTR. PNL.   1 of 1           0.1       1
 5  CABLES            1 of 1           1.0       1
 6  PWR SUPP          2 of 2           2 x 37.3  1
 7  128K RAM          4 of 4           4 x 58.8  1
 8  CPU CACHE         2 of 2           2 x 38.3  1
 9  MAP               1 of 1           51.6      1
10  MIA               1 of 1           11.7      1
11  STI               1 of 1           30.5      1
12  TIA               2 of 2           2 x 8.5   1
13  MBT               4 of 4           4 x 3.5   1

    PU                1 of 1  1925     519.5
    PU                1 of 2                              0.54          185E4       0.99999
TABLE 5.7-2  RELIABILITY MODEL 1: PREDICTED RELIABILITY FOR PROCESSOR UNIT
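The 1-of-2 figures in Table 5.7-2 can be reproduced with the standard approximation for a repairable 1-out-of-2 pair; the sketch below is illustrative and is assumed rather than taken from the model itself, but its results agree with the tabulated values.

    # Reproducing the 1-of-2 PU line of Table 5.7-2 with the standard
    # approximation for a repairable 1-out-of-2 pair (assumed here).
    lambda_pu_fpm = 519.5                  # single PU failure rate (table value)
    mttr = 1.0                             # hours, the standard value used above

    mtbf_pu   = 1e6 / lambda_pu_fpm        # ~1925 hours, matches the 1-of-1 line
    mtbf_pair = mtbf_pu ** 2 / (2 * mttr)  # ~1.85e6 hours, i.e. "185E4"
    lambda_pair  = 1e6 / mtbf_pair         # ~0.54 fpm
    availability = mtbf_pair / (mtbf_pair + mttr)   # ~0.999999, i.e. >0.99999

    print(round(mtbf_pu), round(mtbf_pair), round(lambda_pair, 2), availability)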
Figure 5.7-3  Reliability Model 4B: Service to Individual External Channels
NO  ITEM DESCRIPTION  M of N  MTBF     Lambda    MTTR     Equiv Lambda  Equiv MTBF  Availa-
                      Req'd   (hours)  (fpm)     (hours)  (fpm)         (hours)     bility

 1  WATCHDOG          1 of 1           172.8     1        15.00
 2  PU                1 of 2           519.5     1        0.54
 3  CU+DISKS          1 of 1           4.8       1        4.76
 4  LTU               1 of 1           36.9      1        36.90
 5  LIA-N             1 of 1           0.1       1        0.10
 6  V.24 L/L          1 of 1           32.2      1        32.20

    ITEMS 1 - 6                                           89.50         11173       0.99991
TABLE 5.7-4  RELIABILITY MODEL 4B: PREDICTED RELIABILITY FOR SERVICE TO INDIVIDUAL EXTERNAL CHANNELS
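The bottom line of Table 5.7-4 follows from treating items 1-6 as a series chain of their equivalent failure rates; the short check below is illustrative and reproduces the tabulated totals.

    # Checking the "ITEMS 1 - 6" line of Table 5.7-4: equivalent failure
    # rates of a series chain add, and availability follows from the
    # resulting MTBF together with the 1-hour MTTR used in the model.
    equiv_fpm = [15.00, 0.54, 4.76, 36.90, 0.10, 32.20]   # items 1-6
    mttr = 1.0                                            # hours

    total_fpm    = sum(equiv_fpm)             # 89.50 fpm
    mtbf         = 1e6 / total_fpm            # ~11173 hours
    availability = mtbf / (mtbf + mttr)       # ~0.99991

    print(total_fpm, round(mtbf), round(availability, 5))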
Figure 5.7-5  Reliability Model for IDCN User Terminal Position
The equivalent calculated overall availability will be above 0.9999.
For safety reasons, the MTTR figures used in the calculations are very conservative, typically 30 minutes, but a much better result can be obtained when operators and maintenance people are carefully instructed and trained.