⟦a906f5108⟧ Wang Wps File
Length: 17427 (0x4413)
Types: Wang Wps File
Notes: CPS/SDS/001 SYSTEM DESIGN
Names: »0535A «
Derivation
└─⟦c2ca659c9⟧ Bits:30006002 8" Wang WCS floppy, CR 0037A
└─ ⟦this⟧ »0535A «
CPS/SDS/001
FH/810115
CAMPS SYSTEM DESIGN SPECIFICATION
CAMPS
T̲A̲B̲L̲E̲ ̲O̲F̲ ̲C̲O̲N̲T̲E̲N̲T̲S̲
4.11 ERROR AND BACKLOCK HANDLING
4.11.1 Error Processing Mechanisms
4.11.1.1 Error Reception/Reporting
4.11.1.2 Error Display/Printout
4.11.2 Error Detection and Localization
4.11.2.1 Error Detection/Localization Analysis
4.11.2.2 Error Detection by a PU
4.11.2.2.1 PU Hardware Error Detection
4.11.2.2.2 PU Firmware Error Detection
4.11.2.2.3 PU Software Error Detection
4.11.3 Error Fix-up
4.11.3.1 PU or IO Bus Error Fix-up
4.11.3.2 TDX System Error
4.11.3.3 Mirrored Disk System Error
4.11.3.4 Off-line or Floppy Disk System Error
4.11.3.5 LTU System Error
4.11.3.6 LTUX System Error
4.11.3.7 Watchdog System Error
4.11.3.8 Power Down
4.11.3.9 Hardware Resource Error
4.11.3.10 Security or Access Control Error
4.11.3.11 Software Resource Error
4.11.3.12 Miscellaneous
4.11.4 Backlock Handling
4.11.4.1 Deadlock
4.11.4.2 Overload
4.11.4.2.1 Queue Overload
4.11.4.2.2 Intermediate Storage
4.11.4.2.3 Short Term Storage
4.11 E̲R̲R̲O̲R̲ ̲A̲N̲D̲ ̲B̲A̲C̲K̲L̲O̲C̲K̲ ̲H̲A̲N̲D̲L̲I̲N̲G̲
This section addresses the processing of technical
errors, i.e. hardware errors and software errors
related to the use of system software. Errors found
during, e.g., ACP127 message format analysis, and
syntax errors in user input, are not treated here.
The section is divided into four subsections, which
describe:
a) The error processing mechanisms provided by the
Kernel
b) Error detection facilities and localization of
an erroneous module
c) Error types and corresponding error fix-up actions
d) Backlock handling facilities
The section only treats the occurrence of a single
error. Multiple errors may imply a total system error,
in which case a WARM2 start-up is to be executed. A
total system error may, however, be disastrous if both
disks involved are corrupted (e.g. due to head-landing).
In this case a DEAD2 start-up is to be executed.
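The choice of recovery action described above can be summarized as a small decision function. This is a minimal sketch, assuming the caller already knows whether a total system error exists and whether both disks are corrupted; the names `Recovery` and `select_recovery` are hypothetical and not part of CAMPS.

```python
from enum import Enum


class Recovery(Enum):
    """Hypothetical encoding of the recovery actions named in section 4.11."""
    LOCAL_FIX_UP = "single-error fix-up (section 4.11.3)"
    WARM2 = "WARM2 start-up"
    DEAD2 = "DEAD2 start-up"


def select_recovery(total_system_error: bool, both_disks_corrupted: bool) -> Recovery:
    """Choose a recovery action; how the two flags are derived is assumed here."""
    if not total_system_error:
        # A single error is fixed up by the action for its error type.
        return Recovery.LOCAL_FIX_UP
    if both_disks_corrupted:
        # Disk contents lost (e.g. head-landing on both disks).
        return Recovery.DEAD2
    # Total system error with intact disks.
    return Recovery.WARM2
```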
4.11.1 E̲r̲r̲o̲r̲ ̲P̲r̲o̲c̲e̲s̲s̲i̲n̲g̲ ̲M̲e̲c̲h̲a̲n̲i̲s̲m̲s̲
The Kernel contains a table which maps errors to error
types. A process may specify error types for which it
will take over the error handling. The takeover is
implemented via a procedure defined by the application
process, which is automatically invoked if an error
of a specified type occurs. The application error
fix-up procedure has access to extended error
information, which defines the error in detail (e.g.
down to hardware module level).
However, the parent of a process may inhibit a child
from specifying certain error types (e.g. security
errors).
Errors not handled by a process are passed to the
parent of the process, and the process is stopped.
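The takeover, inhibit and escalation rules above can be modelled as follows. This is a minimal sketch under stated assumptions: the error names in `ERROR_TYPE_TABLE`, the `Process` class and the function `signal_error` are all hypothetical illustrations, not the actual Kernel interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical excerpt of the Kernel's error -> error type table.
ERROR_TYPE_TABLE: Dict[str, str] = {
    "PARITY_ERROR": "HARDWARE",
    "ILLEGAL_INSTRUCTION": "HARDWARE",
    "ACCESS_VIOLATION": "SECURITY",
    "QUEUE_FULL": "SOFTWARE_RESOURCE",
}

Handler = Callable[[str, dict], None]  # (error, extended error information)


@dataclass
class Process:
    name: str
    parent: Optional["Process"] = None
    inhibited_types: Set[str] = field(default_factory=set)  # set by the parent
    handlers: Dict[str, Handler] = field(default_factory=dict)
    stopped: bool = False
    escalated: List[Tuple[str, str, dict]] = field(default_factory=list)

    def inhibit(self, child: "Process", error_type: str) -> None:
        """Parent forbids a child from taking over an error type."""
        if child.parent is not self:
            raise ValueError("only the parent may inhibit a child")
        child.inhibited_types.add(error_type)

    def take_over(self, error_type: str, handler: Handler) -> bool:
        """Register a fix-up procedure unless the type is inhibited."""
        if error_type in self.inhibited_types:
            return False
        self.handlers[error_type] = handler
        return True


def signal_error(process: Process, error: str, info: dict) -> None:
    """Invoke the process's fix-up procedure, or stop it and escalate."""
    error_type = ERROR_TYPE_TABLE.get(error, "MISCELLANEOUS")
    handler = process.handlers.get(error_type)
    if handler is not None:
        handler(error, info)  # procedure receives the extended error information
        return
    process.stopped = True
    if process.parent is not None:
        process.parent.escalated.append((process.name, error, info))
```

Usage: a parent inhibits `"SECURITY"` for a child; the child can still take over `"SOFTWARE_RESOURCE"` errors, but a security error stops the child and ends up in the parent's `escalated` list.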
4.11.1.1 E̲r̲r̲o̲r̲ ̲R̲e̲c̲e̲p̲t̲i̲o̲n̲/̲R̲e̲p̲o̲r̲t̲i̲n̲g̲
COPSY receives error reports subsequent to the detection
of an error from:
- On-line diagnostics programs
- Application software which has not specified an
error fix-up procedure
- PU firmware detected hardware errors
- Kernel detected software errors
- The watchdog, having detected a hardware error
Application processes receive error reports from the
system software which operates on lines, files, queues,
areas, etc. on their behalf:
- I/O system
- Message Monitor
- Queue Monitor
4.11.1.2 E̲r̲r̲o̲r̲ ̲D̲i̲s̲p̲l̲a̲y̲/̲P̲r̲i̲n̲t̲o̲u̲t̲
A process which handles an error locally reports
the result of the error fix-up to the SSC. The SSC
prints the report on the operator printer and, if
appropriate, updates the system status display on
the operator VDU.
If the watchdog fails, error reports are directed
to the supervisor report printer.
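The report routing just described can be sketched as a small class. This is an illustration only, assuming reports are plain strings and that the status display is not updated while the watchdog is down (the text does not say); `SSCReporting` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SSCReporting:
    """Hypothetical model of where the SSC sends fix-up reports."""
    watchdog_ok: bool = True
    operator_printer: List[str] = field(default_factory=list)
    status_display: List[str] = field(default_factory=list)
    supervisor_printer: List[str] = field(default_factory=list)

    def report(self, text: str, changes_status: bool = False) -> None:
        if not self.watchdog_ok:
            # Watchdog failed: reports go to the supervisor report printer.
            # Assumption: the status display is not updated in this case.
            self.supervisor_printer.append(text)
            return
        self.operator_printer.append(text)
        if changes_status:
            # "if appropriate": e.g. a unit has been taken out of service.
            self.status_display.append(text)
```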
4.11.2 E̲r̲r̲o̲r̲ ̲D̲e̲t̲e̲c̲t̲i̲o̲n̲ ̲a̲n̲d̲ ̲L̲o̲c̲a̲l̲i̲z̲a̲t̲i̲o̲n̲
This section is divided into two subsections, which
contain:
a) The principles of an analysis, to be carried out
during detailed design, of the error detection/localization
facilities provided by CAMPS to meet:
- Requirements for integrity of operation
- Error reporting requirements derived from the
MTTR requirements
b) A description of the specific PU error detection
facilities directly required by the SRS.
4.11.2.1 E̲r̲r̲o̲r̲ ̲D̲e̲t̲e̲c̲t̲i̲o̲n̲/̲L̲o̲c̲a̲l̲i̲z̲a̲t̲i̲o̲n̲ ̲A̲n̲a̲l̲y̲s̲i̲s̲
The CAMPS system will be broken down into hardware
and software subsystems. For each subsystem, the
analysis will describe how the subsystem reacts to
an internal error and how the error is detected, i.e.
by:
- Hardware traps (e.g. parity check)
- Firmware traps (e.g. LTU and TDX system protocols)
- Software traps (e.g. on-line diagnostics and validity
checks)
- Manual observation (e.g. some VDU errors, LED indication)
- Watchdog monitoring (e.g. power down)
Error detection is either direct (e.g. parity check
during memory access) or indirect (e.g. a CPU calculation
error may be detected via an illegal memory access).
Furthermore, an error may be caused by any of a number
of modules, and by either software or hardware. The
objective of the error isolation facilities is to
isolate an error to one of the groups defined in
section 4.11.3.
4.11.2.2 E̲r̲r̲o̲r̲ ̲D̲e̲t̲e̲c̲t̲i̲o̲n̲ ̲b̲y̲ ̲a̲ ̲P̲U̲
This section describes PU error detection facilities
specifically required in the SRS.
4.11.2.2.1 P̲U̲ ̲H̲a̲r̲d̲w̲a̲r̲e̲ ̲E̲r̲r̲o̲r̲ ̲D̲e̲t̲e̲c̲t̲i̲o̲n̲
a) Trapping of unassigned instructions. Execution
of any illegal code or bit pattern is detected
and results in an invocation of the Kernel.
b) The MAP module prevents programs from being able
to write in memory occupied by the operating system
or by other programs.
c) Instructions are separated into two classes, one
for privileged use and one for application use.
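The three hardware checks above can be illustrated with a toy model. This is a sketch, not the PU's actual behaviour: the opcode names, the `KernelTrap` exception and the page-ownership model of `MapModule` are hypothetical, and it assumes that a privileged instruction executed in application mode traps to the Kernel.

```python
from typing import Dict

# Hypothetical opcode sets for the two instruction classes (item c).
PRIVILEGED = {"SET_MAP", "START_IO", "HALT"}
APPLICATION = {"LOAD", "STORE", "ADD", "JUMP"}


class KernelTrap(Exception):
    """Stands in for the trap that transfers control to the Kernel."""


def check_instruction(opcode: str, privileged_mode: bool) -> None:
    if opcode not in PRIVILEGED and opcode not in APPLICATION:
        raise KernelTrap(f"unassigned instruction {opcode}")  # item a)
    if opcode in PRIVILEGED and not privileged_mode:
        raise KernelTrap(f"privileged instruction {opcode}")  # item c), assumed trap


class MapModule:
    """Toy MAP: every memory page has exactly one owner (item b)."""

    def __init__(self, page_owner: Dict[int, str]):
        self.page_owner = page_owner  # page number -> owning program or "OS"

    def check_write(self, program: str, page: int) -> None:
        if self.page_owner.get(page) != program:
            raise KernelTrap(f"{program} may not write page {page}")
```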
4.11.2.2.2 P̲U̲ ̲F̲i̲r̲m̲w̲a̲r̲e̲ ̲E̲r̲r̲o̲r̲ ̲D̲e̲t̲e̲c̲t̲i̲o̲n̲
The CPUs and the MAP contain a built-in test program
(BITE).
4.11.2.2.3 P̲U̲ ̲S̲o̲f̲t̲w̲a̲r̲e̲ ̲E̲r̲r̲o̲r̲ ̲D̲e̲t̲e̲c̲t̲i̲o̲n̲
a) At system start-up, all programs and data files
loaded into memory will carry a block parity
checksum to allow the detection of corrupted data.
b) The memory-resident read-only part of the system
software will be checked (via checksum)
periodically and on the supervisor's request.
c) High-level external line protocols (e.g. continuity
and self-addressed service messages).
d) Parameter validity checks when receiving data.
e) Security and access control as specified in section
4.9.
f) On-line diagnostics programs. The programs execute
as low-priority tasks or are based upon built-in
tests. The on-line diagnostics programs will detect
an error either before the error has had any effect
or afterwards. In the latter case, the on-line
diagnostics programs may prevent an avalanche of
corrupted messages or transactions (if the error
is not detected by other means).
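Items a) and b) can be illustrated with a per-block checksum that is recorded at load time and re-verified later. This is a minimal sketch, assuming a longitudinal (XOR) parity over 16-bit words and a hypothetical block size `BLOCK_WORDS`; the actual CAMPS checksum algorithm and block size are not given in this section.

```python
from typing import List

BLOCK_WORDS = 256  # hypothetical block size in 16-bit words


def block_checksums(words: List[int]) -> List[int]:
    """Longitudinal (XOR) parity of each block of 16-bit words."""
    sums = []
    for start in range(0, len(words), BLOCK_WORDS):
        parity = 0
        for w in words[start:start + BLOCK_WORDS]:
            parity ^= w & 0xFFFF
        sums.append(parity)
    return sums


def corrupted_blocks(words: List[int], reference: List[int]) -> List[int]:
    """Indices of blocks whose checksum differs from the load-time value."""
    current = block_checksums(words)
    return [i for i, (now, ref) in enumerate(zip(current, reference)) if now != ref]
```

Usage: compute `reference = block_checksums(image)` when the software is loaded, then call `corrupted_blocks(image, reference)` periodically or on the supervisor's request. XOR parity detects any single flipped bit but misses an even number of flips in the same bit position, which is one reason a stronger checksum might be chosen in detailed design.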
4.11.3 E̲r̲r̲o̲r̲ ̲F̲i̲x̲-̲U̲p̲
Errors are divided into groups, each defining an error
type for which a common error fix-up action exists.
The following hardware error types are foreseen:
- PU or IO BUS error
- TDX-BUS system (TDX-BUS + TDX-I/F + TDX-CTR) error
- Mirrored disk system (DISK-CTR + DISK DRIVE + VOLUME)
error
- Off-line or floppy disk system error
- LTU system (LTU + external line) error
- LTUX system (LTUX + BSM-X + terminal equipment)
error
- Watchdog or operator VDU or operator printer error
- Resource error (e.g. paper out)
- Power down
All error types except resource errors are non-recoverable.
This is because the FMS and THS provide a high-level
interface to peripherals which already masks recoverable
conditions, so an error reaching the fix-up level has
persisted despite that masking. For example, they:
- repeat operations (handling intermittent errors)
- allocate a new disk sector when a sector is bad
The following software error types are foreseen:
- Security or access error
- Resource error
- Miscellaneous
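The masking performed by the FMS, which explains why the remaining hardware errors are non-recoverable, can be sketched as a write routine that retries and then relocates a bad sector. This is an illustration only: `MAX_RETRIES`, `SectorWriter` and `NonRecoverableError` are hypothetical, and the real FMS policy is not specified here.

```python
from typing import Callable, Dict, List

MAX_RETRIES = 3  # hypothetical number of repetitions for intermittent errors


class NonRecoverableError(Exception):
    """Raised only when retries and sector reallocation are both exhausted."""


class SectorWriter:
    def __init__(self, write_fn: Callable[[int, bytes], None], spares: List[int]):
        self.write_fn = write_fn          # low-level write; raises OSError on failure
        self.spares = list(spares)        # free sectors available for reallocation
        self.remap: Dict[int, int] = {}   # bad logical sector -> replacement sector

    def write(self, sector: int, data: bytes) -> int:
        """Write with retries; relocate a persistently bad sector.

        Returns the physical sector that was finally written.
        """
        while True:
            physical = self.remap.get(sector, sector)
            for _ in range(MAX_RETRIES):
                try:
                    self.write_fn(physical, data)
                    return physical
                except OSError:
                    pass  # possibly intermittent: repeat the operation
            if not self.spares:
                raise NonRecoverableError(f"sector {sector}: no spare sector left")
            self.remap[sector] = self.spares.pop(0)  # bad sector: allocate a new one
```

Only when both mechanisms fail does an error surface to the error fix-up level of section 4.11.3.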
4.11.3.1 P̲U̲ ̲o̲r̲ ̲I̲O̲ ̲B̲U̲S̲ ̲E̲r̲r̲o̲r̲ ̲F̲i̲x̲-̲U̲p̲
A PU or IO BUS error will imply an ordered or an emergency
switch-over to the stand by PU, as described in section
4.3.1.4.
If the stand by PU is unavailable, the active PU is
disabled and a WARM2 start-up of the off-line PU can
be performed.
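The fix-up decision for a PU or IO BUS error can be sketched as follows. This is a hypothetical model: the `PU` record and `pu_error_fix_up` are illustrative names, whether the switch-over is ordered or emergency is passed in by the caller (section 4.3.1.4 defines that), and marking the failed PU unavailable is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PU:
    name: str
    available: bool = True
    active: bool = False


def pu_error_fix_up(active: PU, standby: PU, emergency: bool) -> str:
    """Fix-up for a PU or IO BUS error detected in the active PU."""
    active.active = False
    active.available = False  # assumption: the failed PU is withdrawn for repair
    if standby.available:
        standby.active = True
        kind = "emergency" if emergency else "ordered"
        return f"{kind} switch-over to {standby.name}"
    # No stand by PU: the system stops; a WARM2 start-up can be performed later.
    return "active PU disabled; WARM2 start-up can be performed"
```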
4.11.3.2 T̲D̲X̲ ̲S̲y̲s̲t̲e̲m̲ ̲E̲r̲r̲o̲r̲
A TDX system error is handled jointly by the Terminal
Handling System (THS) in the IOC and by the SSC. The
SSC switches LTUXs (via the watchdog) to the appropriate
TDX-BUS and updates the Configuration table. The THS
performs a TDX system switch-over transparently to
the applications (i.e. TEP and THP).
If the stand by TDX-BUS is in use for off-line operation,
a total system error exists and the PU is disabled.
After insertion of a TDX-BUS, a WARM2 start-up can be
performed.
4.11.3.3 M̲i̲r̲r̲o̲r̲e̲d̲ ̲D̲i̲s̲k̲ ̲S̲y̲s̲t̲e̲m̲ ̲E̲r̲r̲o̲r̲
An error in one of the mirrored disks is handled by
the FMS in the IOC. The FMS switches to the stand by
disk system transparently to the users. The SSC is
notified and updates the Configuration Table.
If both mirrored disks fail, a total system error exists
and the PUs will be disabled. A WARM2 start-up can
be executed when the disk system has been repaired.
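The mirrored disk behaviour can be sketched as a pair of in-memory disks where writes go to every working disk and reads use any working copy. This is a toy model, assuming dictionaries stand in for disk volumes and a list `config_log` stands in for the SSC's Configuration Table updates; `MirroredDisk` and `TotalSystemError` are hypothetical names.

```python
from typing import Dict, List, Set


class TotalSystemError(Exception):
    """Both mirrored disks have failed."""


class MirroredDisk:
    def __init__(self) -> None:
        self.disks: Dict[str, Dict[int, bytes]] = {"A": {}, "B": {}}
        self.failed: Set[str] = set()
        self.config_log: List[str] = []  # stands in for Configuration Table updates

    def fail(self, name: str) -> None:
        """Take a disk out of service; losing both is a total system error."""
        if name not in self.failed:
            self.failed.add(name)
            self.config_log.append(f"disk {name} out of service")  # SSC notified
        if len(self.failed) == len(self.disks):
            raise TotalSystemError("both mirrored disks failed")

    def write(self, sector: int, data: bytes) -> None:
        for name, disk in self.disks.items():
            if name not in self.failed:
                disk[sector] = data  # every write goes to all working disks

    def read(self, sector: int) -> bytes:
        for name, disk in self.disks.items():
            if name not in self.failed:
                return disk[sector]  # any working copy will do
        raise TotalSystemError("no disk available")
```

The switch to the remaining disk is transparent to the caller: `read` keeps returning data after one disk has failed.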
4.11.3.4 O̲f̲f̲-̲L̲i̲n̲e̲ ̲o̲r̲ ̲F̲l̲o̲p̲p̲y̲ ̲D̲i̲s̲k̲ ̲S̲y̲s̲t̲e̲m̲ ̲E̲r̲r̲o̲r̲
An error in the off-line or the floppy disk system
implies removal of the disk system in question. The
packages involved in the error fix-up are:
a) SSC and TEP during the following operations:
- back-up of system parameter file
- off-loading of messages
- trace information storage
- copying of modified software to off-line disk
b) SSC and TEP and SAR during:
- retrieval of off-loaded messages
c) SSC during:
- load of start-up data
- copying of modified software to the on-line
disks
- memory dump
d) SSP during:
- off-line operations
During on-line operation, the SSC de-assigns devices
and dismounts the mirrored and the floppy disk
volumes, whereas the TEP dismounts the off-line
disk. Also, the SSC updates the Configuration table.
4.11.3.5 L̲T̲U̲ ̲S̲y̲s̲t̲e̲m̲ ̲E̲r̲r̲o̲r̲
An LTU line error is handled by the THP and the SSC:
the THP closes the line activities, whereas the SSC
deletes the THP instance which handles the line. The
SSC also updates the Configuration table.
An LTU error affects up to two lines. For each line,
the error fix-up is as described above.
4.11.3.6 L̲T̲U̲X̲ ̲S̲y̲s̲t̲e̲m̲ ̲E̲r̲r̲o̲r̲
An error in the terminal equipment is handled by the
TEP and the SSC. The TEP cancels the ongoing transaction(s),
whereas the SSC deletes the TEP instance and updates
the Configuration Table.
An error in a TRC line is handled by the THP and the
SSC. The THP closes the line activities, whereas the
SSC deletes the THP instance and updates the Configuration
table.
A spare LTUX exists to enable a hardware patch of line
equipment. The SSC controls the reassignment of LTUX
lines to the spare LTUX. The reassignment is initiated
by an operator command. After terminal equipment has
been switched, users have to sign in.
An LTUX error affects the TEP or THP instances using
the LTUX. For each instance, the fix-up is as described
above.
A BSM error affects the TEP or THP instances using
the two LTUXs controlled by the BSM. For each
terminal instance, the fix-up is as described above.
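The way an LTUX or BSM error fans out into per-instance fix-ups can be sketched with a small configuration table. The unit names in `BSM_LTUXS` and `LTUX_INSTANCES` are hypothetical examples, and the returned strings merely describe the actions listed above.

```python
from typing import Dict, List, Tuple

# Hypothetical configuration: a BSM controls two LTUXs; each LTUX serves
# terminal (TEP) and TRC line (THP) instances.
BSM_LTUXS: Dict[str, List[str]] = {"BSM-1": ["LTUX-1", "LTUX-2"]}
LTUX_INSTANCES: Dict[str, List[Tuple[str, str]]] = {
    "LTUX-1": [("TEP", "VDU-11"), ("TEP", "VDU-12")],
    "LTUX-2": [("THP", "TRC-21")],
}


def fix_up_instance(kind: str, unit: str) -> str:
    """Per-instance fix-up for terminal equipment (TEP) or a TRC line (THP)."""
    action = "cancel transaction(s)" if kind == "TEP" else "close line activities"
    return f"{kind} {unit}: {action}; SSC deletes instance and updates Configuration Table"


def fix_up_unit(failed_unit: str) -> List[str]:
    """Expand an LTUX or BSM error into the per-instance fix-ups."""
    ltuxs = BSM_LTUXS.get(failed_unit, [failed_unit])  # a BSM error covers both LTUXs
    return [fix_up_instance(kind, unit)
            for ltux in ltuxs
            for kind, unit in LTUX_INSTANCES.get(ltux, [])]
```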
4.11.3.7 W̲a̲t̲c̲h̲d̲o̲g̲ ̲S̲y̲s̲t̲e̲m̲ ̲E̲r̲r̲o̲r̲
If the watchdog fails, the operator VDU and printer
can be connected directly to one of the PUs to allow
continued operator control.
When an active PU has no operator position, the
SSC directs error messages to the supervisor printer.
If the operator VDU fails, watchdog operations not
involving the VDU continue.
If the operator printer fails, the SSC directs the
error messages of the active PU to the supervisor printer.
The non-active PU will not be able to perform print-out.
The watchdog continues operations not involving the
printer.
4.11.3.8 P̲o̲w̲e̲r̲ ̲D̲o̲w̲n̲
If a power down is detected in the active PU, an emergency
switch-over to the stand by PU is automatically executed.
If a power down is detected in the non-active PU, that
PU is disabled.
The IO-crate contains two power supplies:
- one supply per IO BUS
- dualized supply per IO BUS device
A single power down has effects identical to a PU power
down.
A power down in a TDX crate implies that the two LTUXs
in the crate are taken out of service.
A power down in the watchdog implies that the watchdog
is taken out of service. Refer to section 4.11.3.7.
4.11.3.9 H̲a̲r̲d̲w̲a̲r̲e̲ ̲R̲e̲s̲o̲u̲r̲c̲e̲ ̲E̲r̲r̲o̲r̲
Hardware resource errors include:
- Paper out
- No more file space
This error type may be handled by an application. Further
discussion is deferred to detailed design.
4.11.3.10 S̲e̲c̲u̲r̲i̲t̲y̲ ̲o̲r̲ ̲A̲c̲c̲e̲s̲s̲ ̲C̲o̲n̲t̲r̲o̲l̲ ̲E̲r̲r̲o̲r̲
The reactions to an error detected during security
or access control are defined in section 4.9.
4.11.3.11 S̲o̲f̲t̲w̲a̲r̲e̲ ̲R̲e̲s̲o̲u̲r̲c̲e̲ ̲E̲r̲r̲o̲r̲
Software resource errors include:
- Queue full
- No more file control blocks (FCBs)
This type of error may be handled by an application,
which for example may:
- wait until the resource is free
- wait a specified period for the resource to become free
- stop input
Further details are deferred to detailed design. Refer
to section 4.11.4, where some overload situations are
treated.
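The three application reactions listed above map naturally onto a counting semaphore. This is a minimal sketch using Python's `threading.BoundedSemaphore`, assuming a queue resource with a fixed number of slots; `QueueSlots`, the policy strings and resuming input on release are hypothetical choices, not CAMPS interfaces.

```python
import threading
from typing import Optional


class QueueSlots:
    """Hypothetical bounded queue resource with three reactions to 'queue full'."""

    def __init__(self, capacity: int):
        self._free = threading.BoundedSemaphore(capacity)
        self.input_stopped = False

    def reserve(self, policy: str, timeout: Optional[float] = None) -> bool:
        if policy == "wait":
            return self._free.acquire()  # wait until the resource is free
        if policy == "wait_timeout":
            return self._free.acquire(timeout=timeout)  # wait a specified period
        if policy == "stop_input":
            if self._free.acquire(blocking=False):
                return True
            self.input_stopped = True  # stop accepting further input
            return False
        raise ValueError(f"unknown policy {policy!r}")

    def release(self) -> None:
        self._free.release()
        self.input_stopped = False  # assumption: input resumes once a slot frees
```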
4.11.3.12 M̲i̲s̲c̲e̲l̲l̲a̲n̲e̲o̲u̲s̲
This error type includes:
- Semantic error in input parameter
- Time out during process communication
- Backlock (refer to section 4.11.4)
These errors are caused by programming errors. Further
details will be provided during detailed design.
4.11.4 B̲a̲c̲k̲l̲o̲c̲k̲ ̲H̲a̲n̲d̲l̲i̲n̲g̲
Backlock handling refers to actions taken to:
- avoid system overload
- avoid deadlock
4.11.4.1 D̲e̲a̲d̲l̲o̲c̲k̲
Deadlock refers to a situation where processes each need
a number of shared resources to perform a function, and
each holds some of these resources while waiting for
resources held by another. If, for instance, processes
A and B both require a reader and a printer, and A has
reserved the reader and B the printer, then a deadlock
exists if neither A nor B relinquishes its reserved
resource.
It is a design aim to prevent deadlocks. However,
prevention introduces an overhead. During detailed
design it will be decided whether this overhead can be
tolerated, or whether supervisor commands shall instead
be defined to handle deadlocks which:
- are unlikely to occur, and
- would impose a considerable overhead to avoid
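One standard prevention technique, shown here only as an example of the kind of overhead being weighed (the section does not state which technique CAMPS will use), is to reserve resources in a fixed global order. In the reader/printer example, A and B then both try the reader first, so neither can hold the printer while waiting for the reader. `RESOURCE_ORDER` and `run_with` are hypothetical names.

```python
import threading
from typing import Callable, Dict, List

# Hypothetical fixed global order for shared resources.
RESOURCE_ORDER = ["reader", "printer"]
LOCKS: Dict[str, threading.Lock] = {name: threading.Lock() for name in RESOURCE_ORDER}


def run_with(resources: List[str], job: Callable[[], None],
             log: List[str], who: str) -> None:
    """Reserve all resources in the global order, run the job, then release."""
    ordered = sorted(resources, key=RESOURCE_ORDER.index)
    for name in ordered:
        LOCKS[name].acquire()
    try:
        job()
        log.append(who)
    finally:
        for name in reversed(ordered):
            LOCKS[name].release()
```

The cost is that a process may hold a resource longer than strictly needed, and every process must know all its resources in advance.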
4.11.4.2 O̲v̲e̲r̲l̲o̲a̲d̲
Overload will be prevented by the following general
method:
a) Issue a warning to the supervisor when a resource
"warning threshold" is exceeded (refer to figure
4.11.4.2-1). A new warning will not be issued until
the resource consumption has fallen below the
"warning enable threshold".
b) Provide the supervisor with commands which can
remedy the situation.
c) If the supervisor does not use these commands, an
automatic ordered close-down is performed when the
"critical threshold" is exceeded: input is inhibited
and the system is allowed to die out slowly.
During the start-up following an ordered close-down,
the supervisor is the first user allowed to sign in.
He is provided with status information describing the
consumption of various resources, and can use the
commands in b) to remove an overload situation prior
to start-up.
FIGURE 4.11.4.2-1 RESOURCE THRESHOLDS
(Resource consumption scale from 0% to 100%, with
thresholds in ascending order: warning enable threshold,
warning threshold, critical threshold.)
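The warning and close-down rules of items a) and c) amount to a threshold monitor with hysteresis. This is a minimal sketch; the threshold percentages and the `ResourceMonitor` name are hypothetical, since actual values are left to detailed design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceMonitor:
    """Warning/critical thresholds with hysteresis (figure 4.11.4.2-1)."""
    warning_enable: float = 60.0  # hypothetical percentages of consumption
    warning: float = 75.0
    critical: float = 90.0
    warning_armed: bool = True
    closing_down: bool = False

    def sample(self, consumption: float) -> List[str]:
        events = []
        if consumption < self.warning_enable:
            self.warning_armed = True  # re-enable warnings after falling below
        if consumption > self.warning and self.warning_armed:
            events.append("warning to supervisor")
            self.warning_armed = False  # suppress repeats until re-armed
        if consumption > self.critical and not self.closing_down:
            self.closing_down = True
            events.append("ordered close-down: input inhibited")
        return events
```

The gap between the warning threshold and the warning enable threshold prevents a stream of repeated warnings while consumption hovers around the warning level.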
Overload situations will be described during detailed
design. At this level the following overload situations
are foreseen:
- queues
- intermediate storage
- short term storage
4.11.4.2.1 Q̲u̲e̲u̲e̲ ̲O̲v̲e̲r̲l̲o̲a̲d̲
Queues are allowed to expand onto disk; however, the
thresholds of figure 4.11.4.2-1 apply. A warning is
also issued when a queue element has resided in a
queue longer than a specified time.
The supervisor will be given commands to move queues
to the MDCO.
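The residence-time warning for queue elements can be sketched by recording an enqueue timestamp with each element. This is an illustration, assuming a FIFO queue and a hypothetical `max_age` in seconds; `AgedQueue` is not a CAMPS name. The optional `now` arguments make the behaviour easy to test without waiting.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class AgedQueue:
    """FIFO queue that reports elements resident longer than max_age seconds."""

    def __init__(self, max_age: float):
        self.max_age = max_age
        self._items: Deque[Tuple[float, str]] = deque()

    def put(self, item: str, now: Optional[float] = None) -> None:
        self._items.append((time.monotonic() if now is None else now, item))

    def get(self) -> str:
        return self._items.popleft()[1]

    def overdue(self, now: Optional[float] = None) -> List[str]:
        """Elements that should trigger a residence-time warning."""
        now = time.monotonic() if now is None else now
        return [item for enqueued, item in self._items if now - enqueued > self.max_age]
```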
4.11.4.2.2 I̲n̲t̲e̲r̲m̲e̲d̲i̲a̲t̲e̲ ̲S̲t̲o̲r̲a̲g̲e̲
The general overload concept is followed. The supervisor
can offload messages/areas to the off-line disk.
4.11.4.2.3 S̲h̲o̲r̲t̲ ̲T̲e̲r̲m̲ ̲S̲t̲o̲r̲a̲g̲e̲
The general overload concept is followed. Also, a warning
is provided when a message has resided in the short
term storage longer than a specified period.
The supervisor will be given commands to delete old
messages.