⟦af2072e5f⟧ Wang Wps File
Length: 38319 (0x95af)
Types: Wang Wps File
Notes: FIX/1000/PSP/0038
Names: »5264A «
Derivation
└─⟦c5670ecfe⟧ Bits:30006140 8" Wang WCS floppy, CR 0516A
└─ ⟦this⟧ »5264A «
WangText
5264A/rt
FIX/1000/PSP/0038
APE/850529
FIKS SYSTEM
SPECIFICATION
FIKS
4.3.4 D̲u̲a̲l̲ ̲N̲O̲D̲E̲/̲M̲E̲D̲E̲ ̲D̲e̲s̲i̲g̲n̲
To make the operation of the FIKS system less dependent
on hardware failures, some of the most failure-prone
hardware elements have been duplicated. If the hardware
component in use (the ACTIVE) becomes erroneous, its
counterpart (the STANDBY) shall be ready to take over
the operations. To execute this SWITCHOVER of hardware
with minimum effort from the system operational management
and minimum inconvenience for the users of the FIKS
system, a special design handles these procedures. The
design has been implemented with the following
guidelines/requirements:
- No narrative message must be lost.
- The security operational modes must not be affected.
- Internal Node/MEDE routing information shall be
maintained.
- The Data Users must not be affected.
- The Node/MEDE shall be inoperable for a minimum of
time (less than 2 minutes).
4.3.4.1 D̲e̲s̲i̲g̲n̲ ̲O̲v̲e̲r̲v̲i̲e̲w̲
Fig. 4.3.4.1 shows a simplified Dual Node/MEDE
hardware configuration with special emphasis
on the redundant elements. It is noted that:
- The CR80-computer hardware including user- and
file processors and TDX-hosts are dualized. This
is in the following denoted as a BRANCH.
- The TDX-controllers are dualized.
- The system is equipped with a Dual Disk System.
This can be accessed from both of the branches.
In case of a fatal error in the active branch, i.e.
an error which makes it impossible for the branch to
continue operations, a SWITCHOVER from the active to
the standby branch has to be performed. The standby
branch has to be RECOVERed: it must be brought to the
point at which the former active branch failed. The
standby branch has been loaded in advance with all the
necessary software modules and is ready to start
executing instructions. The disk system and the
TDX-system can be used immediately. The vital thing
missing to start operations is the data held in the
CR80-memory of the former active branch. These data
are recovered by use of CHECKPOINTs. Checkpoints are
data records stored outside the CR80-memory that
define the states and substates of the system,
e.g.
- state of message being processed.
- state of trunk.
- state of terminal.
In the FIKS system the records are stored in the disk
system, from where they can be retrieved by the standby
branch. After the processes are started in the standby
branch, the checkpoints are processed in the RESTART
procedure to reestablish the data structures in the
CR80-memory as close as possible to the original content
in the former active branch. The operations can now
continue.
Figure 4.3.4.1 Dual Node/MEDE Hardware
4.3.4.2 W̲a̲t̲c̲h̲d̲o̲g̲
The Watchdog is a separate microcomputer with the
capability to switch a relay board and to communicate
with the elements it controls, namely:
- 2 x Node/MEDE branches
- 2 x 2 TDX Bus controllers.
In addition the Watchdog manages the use of the console,
ref. fig. 4.3.4.2.1.
W̲A̲T̲C̲H̲ ̲O̲F̲ ̲N̲o̲d̲e̲/̲M̲E̲D̲E̲-̲b̲r̲a̲n̲c̲h̲e̲s̲.̲
The supervision is based upon the following inputs:
- Answers (or missing answers) to requests for
'Alive Status Reports' sent to the operating
system (ESP) at regular intervals.
- Error reports sent from the FIKS application processes.
- Input from the system operator.
Normally one of the branches will be the ACTIVE and
the other the STANDBY. If the Watchdog senses a fatal
error in the active branch, based upon Alive Status
Reports and error reports, it shall arrange and manage
switchover of the branches, i.e. let the former standby
branch become the new active and see to it that the
former active branch stops the erroneous processing
(is CLOSED). To do this, the Watchdog has certain
actions at its disposal:
- Asking the system operator for permission to switch
over, with the console printout
"SWITCHOVER FROM P1 TO P2 ALLOWED?"
If the operator accepts or does not answer,
the switchover will be executed.
- Issuing commands to the Node/MEDE operating
system (ESP).
The commands are as follows:
CLOSE          A controlled termination of the
               operations in the branch shall be
               performed.
RECOVER        The branch shall prepare to take over
               the active operations and then do so.
STANDBY ON/    A report is sent to the active branch
STANDBY OFF    telling that the standby branch is
               available/not available for taking
               over operation. This information is
               passed on to the SCC.
Figure 4.3.4.2.1 Watchdog Interface Diagram
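The supervision of the branches described above can be sketched as follows. This is an illustration in Python only, not the Watchdog firmware. The function names and the threshold `max_missed` are hypothetical; the specification does not state how many missing Alive Status answers the Watchdog tolerates.

```python
from typing import List, Optional, Tuple

def fatal_error_sensed(missed_alive_answers: int,
                       fatal_report_received: bool,
                       max_missed: int = 3) -> bool:
    """True if the active branch must be considered failed.

    max_missed is an assumed threshold, not taken from the specification.
    """
    return fatal_report_received or missed_alive_answers >= max_missed

def switchover_allowed(operator_answer: Optional[str]) -> bool:
    """Evaluate the reply to "SWITCHOVER FROM P1 TO P2 ALLOWED?".

    None models a missing answer, which counts as acceptance.
    Any reply other than NO is treated as acceptance (assumption).
    """
    return operator_answer is None or operator_answer.strip().upper() != "NO"

def commands_to_issue(missed_alive_answers: int,
                      fatal_report_received: bool,
                      operator_answer: Optional[str]) -> List[Tuple[str, str]]:
    """Return (branch, command) pairs in the order they are issued."""
    if not fatal_error_sensed(missed_alive_answers, fatal_report_received):
        return []
    if not switchover_allowed(operator_answer):
        return []
    return [("active", "CLOSE"), ("standby", "RECOVER")]
```

A missing operator answer leads to the same command sequence as an explicit acceptance; only an explicit NO suppresses the switchover.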
C̲o̲n̲t̲r̲o̲l̲l̲i̲n̲g̲ ̲o̲f̲ ̲t̲h̲e̲ ̲T̲D̲X̲-̲c̲o̲n̲t̲r̲o̲l̲l̲e̲r̲s̲
The controlling is based upon a polling of the TDX-controllers.
If a polling indicates an error in the active controller
then (if possible) switchover to the standby is performed.
The system operator has the opportunity to initiate
the switchover by himself.
M̲a̲n̲a̲g̲e̲m̲e̲n̲t̲ ̲o̲f̲ ̲t̲h̲e̲ ̲C̲o̲n̲s̲o̲l̲e̲
The Watchdog has the task of managing the console
as a resource shared between the Watchdog itself and
the two Node/MEDE-branches, each of which may in turn
be connected to the console. By means of control key
strokes the operator selects which of the three units
the console shall communicate with:
CTRL/W: Watchdog monitor mode
CTRL/O: Transparent mode BRANCH ONE
CTRL/T: Transparent mode BRANCH TWO
The transparent modes are used when the console shall
act as system console for one of the branches, e.g. when
bootloading the system, doing offline diagnostics,
etc. In these modes the branches are not supervised
by the Watchdog. In the monitor mode the operator can
key in:
C: Ask for printout of current Watchdog status.
R: Switchover of the red TDX-controllers.
B: Switchover of the black TDX-controllers.
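The console handling above amounts to a small mode switch. The sketch below assumes that the control keys arrive as the usual ASCII control codes (CTRL/W = 0x17, CTRL/O = 0x0F, CTRL/T = 0x14) and that they are recognized in every mode; neither is stated in the specification. The mode names and action strings are hypothetical.

```python
# Assumed ASCII control codes for the mode selection keys.
MODE_KEYS = {
    "\x17": "WATCHDOG_MONITOR",        # CTRL/W
    "\x0f": "TRANSPARENT_BRANCH_ONE",  # CTRL/O
    "\x14": "TRANSPARENT_BRANCH_TWO",  # CTRL/T
}

# Commands available in Watchdog monitor mode.
MONITOR_COMMANDS = {
    "C": "print Watchdog status",
    "R": "switch over red TDX-controllers",
    "B": "switch over black TDX-controllers",
}

def handle_key(mode, key):
    """Return (new_mode, action) for one key stroke on the console."""
    if key in MODE_KEYS:
        return MODE_KEYS[key], None
    if mode == "WATCHDOG_MONITOR":
        # Unknown monitor keys yield no action.
        return mode, MONITOR_COMMANDS.get(key.upper())
    # In a transparent mode the key goes to the selected branch.
    return mode, "forward to branch"
```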
M̲a̲n̲u̲a̲l̲ ̲o̲p̲e̲r̲a̲t̲i̲o̲n̲ ̲o̲f̲ ̲t̲h̲e̲ ̲W̲a̲t̲c̲h̲d̲o̲g̲
In the Watchdog is placed a self-checking mechanism,
which starts a visual alarm on the relay board in case
of failure in the Watchdog-CPU (ref. fig. 4.3.4.2.2).
It is then possible to run the system in a manual mode
by manipulating the switches on the front panel. By
setting the switch to manual, one can select the wanted
controller and send "master clear" to the specified
branch.
O̲p̲e̲r̲a̲t̲i̲n̲g̲ ̲t̲h̲e̲ ̲s̲y̲s̲t̲e̲m̲ ̲w̲i̲t̲h̲o̲u̲t̲ ̲t̲h̲e̲ ̲W̲a̲t̲c̲h̲d̲o̲g̲
The operating system (ESP) is indifferent to where
it receives the Watchdog commands from. It is therefore
possible, by connecting the console directly to the
SCM-board in a branch, for a system operator to issue
commands to the branch just as if he were the Watchdog.
In this way the Node/MEDE can be operational even if
the Watchdog is missing.
Figure 4.3.4.2.2 Front Panel Layout
4.3.4.3 E̲S̲P̲ ̲S̲y̲s̲t̲e̲m̲
In the FIKS System the ESP System (ERROR SWITCHOVER
PROCESS) constitutes the FIKS System Operational Software.
The ESP has been designed to handle the following tasks:
- Interface to the Watchdog and the system operator,
ref. 4.3.4.2.
- CR80 Memory Management.
As the FIKS system software in principle is loaded
once and runs forever, there is no need for dynamic
allocation/deallocation of memory areas. The memory
management is therefore mostly concerned with utilizing
the memory in an optimal way (no gaps in memory).
The system maintainer specifies a strategy for
laying out the memory. In the initializing phase
(ref. sec. 4.3.4.4) the allocation of memory is
carefully logged. It is then left to the system
maintainer to judge whether the layout is suitable.
If not, he can change the memory layout.
- Supervision/Management of the Disk System, ref.
sec. 4.3.4.8.
- System Command Performance.
The system is controlled by issuing commands
to the ESP, which then executes them. The
commands may originate from different kinds
of sources:
- the Watchdog (ref. sec. 4.3.4.2)
- the operator
- items in a Job Control File, created and edited
offline. The commands in this file are read
and executed sequentially. A whole sequence of
commands can be executed in this manner with
a single command.
- sequences of commands. At moments where there
is no access to the disk system, Job Control
Files cannot be used. Instead command sequences
fetched from internal ESP-data are used.
- applications. Execution of a command can be
initiated by an application in the FIKS-system.
- System Command Performance
The commands to be executed are mainly:
- LOAD-commands
Those are used when modules (program, processes,
critical regions, etc) are to be loaded into
the CR80-memory.
- START/STOP/REMOVE process
- Commands concerning the disk system, e.g. FMS-user
ON/OFF, ASSIGN/DEASSIGN of devices, MOUNT/DEMOUNT
of volumes, UPDATE of volumes etc. (ref. sec.
4.3.4.8)
- Setting of system time (DTG)
- Watchdog commands.
- CLOSE/RECOVER system (ref. sec. 4.3.4.9). STANDBY
ON/OFF (ref. sec. 4.3.4.2)
- System Initialization Management. By receiving,
interpreting and executing commands issued
by the Watchdog/operator the ESP is able to perform
the different kinds of system initializations/changes
that may be needed (ref. 4.3.4.4 - System Initialization,
ref. 4.3.4.9 - Switchover of Branches).
- Background Management.
The loading and scheduling of background tasks
is left to the ESP. ref. sec. 4.3.4.5.
- System Error Handling.
Error cases reported by the application processes
are received by the ESP. It is then up to the ESP
to report these to the Watchdog/system operator
and if needed to take proper action upon the reports,
ref. sec. 4.3.4.6.
4.3.4.4 S̲y̲s̲t̲e̲m̲ ̲I̲n̲i̲t̲i̲a̲l̲i̲z̲a̲t̲i̲o̲n̲
When a branch has been "master cleared" the only active
process in the CR80-computer is the FIKS BOOT LOADER.
This is a PROM-resident program, specially implemented
to handle the security demands in the FIKS system.
In response to system operator input from the console,
this process can load a BOOT MODULE into the
CR80 memory of both the user- and file processor. The
boot modules contain the software modules
and configuration parameters needed to start up the
CR80 AMOS operating system.
The most important items in these boot modules are
listed in the following. This also serves as a list of
the CR80 standard software modules that are used
in the FIKS-system.
U̲s̲e̲r̲ ̲P̲r̲o̲c̲e̲s̲s̲o̲r̲ ̲B̲o̲o̲t̲ ̲M̲o̲d̲u̲l̲e̲
- CR80 AMOS MONITOR KERNEL.
This is the lowest level of the CR80 AMOS system.
The KERNEL implements processes, CPU management,
inter process communication and the lowest level
of I/O device handling, i.e. interrupt handling.
In the FIKS system a version is used which includes
CRITICAL REGIONS and has system data placed in
page 1 of the CR80-memory.
- The ROOT-module.
This is in the FIKS system equal to the ESP-system.
- Declaration of how many CPUs the system is configured
with (2 CPUs with names CPU000/CPU001) and the
time slice values used for the three possible priority
levels the processes may have.
- Declaration of how many processes, AMOS messages
and critical regions may exist in the system.
- CR80 AMOS I/O SYSTEM
The I/O system is a program module which implements
a set of procedures that interface the user to
the peripherals, i.e. in the FIKS system to the
CR80 File Management System and the CR80 TDX System.
- CR80 DMA LINK.
This process handles the data transfers between
the user- and file processor (user processor
version).
- CR80 TDX-DRIVER
The TDX-driver forms the interface between
the CR80 TDX HOST computer and the AMOS I/O System.
This module is not included in the boot module
but loaded later in the initializing phase.
F̲i̲l̲e̲ ̲P̲r̲o̲c̲e̲s̲s̲o̲r̲ ̲B̲o̲o̲t̲ ̲M̲o̲d̲u̲l̲e̲
- CR80 AMOS MONITOR KERNEL.
A version with no critical regions and which has
system data placed in page 0 is used.
- ROOT-module.
The standard AMOS ROOT module is used.
- Declarations concerning CPU-use.
One CPU with name CPU000 is used.
- Declaration of how many processes and AMOS messages
may exist.
- CR80 FILE MANAGEMENT SYSTEM.
This system forms the interface between the
I/O-system and the files placed in the disk system
(CDC- and FLOPPY-disks).
- CR80 CDC DRIVER.
This is the process that handles the interface
to the CDC-disks.
- CR80 FLOPPY DRIVER
This is the process that handles the interface
to the FLOPPY-diskettes.
- CR80 DMA LINK
This process handles the data transfers between
the user- and file processor (file processor version).
The boot modules are generated offline by means of
the CR80 AMOS SYSGEN utility program.
When the boot loading is finished, the ESP-process
is started and from then on the ESP-system is responsible
for the further initialization.
At Node/MEDE installations the system operator has
to tell the system which state (ACTIVE/STANDBY) the
branch is going to have and which branch it is (ONE/TWO),
ref. sec. 4.3.4.1.
The system time (DTG) must always be specified. As
it is very important that the DTG is correct, due
to its use as key-index in the HDB-system (ref. sec.
4.1.4), thorough checking of the DTG is performed. Besides
having correct format, it must not be less than the
DTG of the youngest message stored on HDB. If it is
much greater, this could be a mistake by the operator,
and he is warned, giving him the opportunity to reset
the DTG before processing is started.
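The DTG check can be illustrated with a minimal sketch. The DTG is modelled here as a Python `datetime`; the real DTG format is not described in this section. The limit `warn_gap` for "much greater" is an assumed value, and the result names are hypothetical.

```python
from datetime import datetime, timedelta

def check_dtg(entered: datetime,
              youngest_on_hdb: datetime,
              warn_gap: timedelta = timedelta(hours=24)) -> str:
    """Classify an operator-entered system time.

    warn_gap is an assumption: the specification only says "much greater".
    """
    if entered < youngest_on_hdb:
        # Would break the DTG key ordering in the HDB.
        return "REJECT"
    if entered - youngest_on_hdb > warn_gap:
        # Possibly a typing mistake: warn and let the operator re-enter.
        return "WARN"
    return "ACCEPT"
```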
The system is then initialized to be ACTIVE/STANDBY.
This is performed by issuing commands to the ESP.
The commands are placed in the Job Control Files with
names "ACTIVE/STANDBY". Those files determine the
FIKS CR80 software configuration (ref. sec.
4.3.4.3).
The files contain commands for:
- Loading and creating of all critical regions.
- Loading and initializing of all monitor procedures.
- Initializing of all FIKS data-areas (MTCB-, QACCESS-
and RDF-areas). This initialization is performed
by specially implemented processes, which are only
present in this phase (MTCB_INIT-, QACCESS_INIT-
and RDF_INIT-process).
- The CHECKPOINT-process (ref. sec. 4.3.4.7) is loaded.
In the ACTIVE-file it is also started, to be used
for recovery of those data that are independent
of SWITCHOVER (last used page number, message number,
etc.). This recovery is performed by the SYSCHP-process.
- The TDX-drivers are loaded. In the ACTIVE-file
they are also started and the TDX-system is initialized,
i.e. the LTUXs are loaded with configuration parameters.
- Then the rest of the FIKS application processes
are loaded. In the ACTIVE-file they are also started.
4.3.4.5 B̲a̲c̲k̲g̲r̲o̲u̲n̲d̲ ̲P̲r̲o̲c̲e̲s̲s̲i̲n̲g̲
Some tasks in the FIKS-system may be of low priority
and some tasks may only be activated periodically or
very seldom. If nothing else was done, these tasks
would each occupy their own part of the CR80-memory.
They might as well take turns using one designated
area and thereby share it. In this way the memory
is much better utilized. This scheme
is used in the FIKS Background Management.
O̲u̲t̲l̲i̲n̲e̲
A fixed amount of memory has been allocated for
background processing, one area for program and
one for process. To ensure successful background
processing, those areas must never be used by more
than one background task (BGT) at a time.
A BGT is said to be ACTIVATED/DEACTIVATED if processing
concerning the BGT is involved/not involved in those
areas. A deactivated BGT (BGT-A) may then be dumped
to disk and another deactivated BGT (BGT-B) loaded and
activated. After a while (background processing time
slice) BGT-B is deactivated and dumped to disk. Then
BGT-A is loaded and activated, etc.
Besides the BGT itself other processes may access the
areas. Those processes will be defined as DRIVERs.
Assuming that the concept for standard CR80 software
drivers is used, no DRIVER access is tantamount to
no AMOS answer/system answer/path answer being awaited
from a DRIVER (no outstanding IO-requests). Refer to CR80
AMOS Kernel, C33/302/PSP/0008. Processes may deliver
AMOS events to a BGT. This involves use of a message
buffer and some "event-registers" in the BGT-process
area. If these events are rerouted to areas outside
the BGT-areas while the BGT is deactivated, then the
following simple criteria define when a BGT is
deactivated:
A. The Kernel state of the BGT is STOPPED.
B. The Kernel state of the BGT is GOING_TO_BE_STOPPED
and some Kernel event is awaited.
C. There must not exist any outstanding IO-requests.
The BGT is then DEACTIVATED if
(A and C) or (B and C) is true.
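The deactivation criterion can be written out as a predicate. The sketch below is an illustration only; the parameter names and the string encoding of the Kernel states are hypothetical.

```python
def is_deactivated(kernel_state: str,
                   awaiting_kernel_event: bool,
                   outstanding_io_requests: int) -> bool:
    """Evaluate (A and C) or (B and C) for one background task."""
    a = kernel_state == "STOPPED"
    b = kernel_state == "GOING_TO_BE_STOPPED" and awaiting_kernel_event
    c = outstanding_io_requests == 0
    return (a and c) or (b and c)
```

Since C appears in both terms, the condition is equivalent to (A or B) and C: a BGT with any outstanding IO-request is never considered deactivated.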
The Kernel-module used in the FIKS system has been
modified to handle this scheme and to give notice to
the ESP-process when a BGT is DEACTIVATED or when it
requests processing (an awaited event occurs).
I̲m̲p̲l̲e̲m̲e̲n̲t̲a̲t̲i̲o̲n̲
The BGTs are initially loaded and started as ordinary
processes. After a certain time slice, or when BGT-A
gets deactivated (no processing demand), the ESP is
notified and the next BGT (BGT-B) shall be loaded and
activated. The following processing is performed:
- BGT-A is STOPPED.
(deactivation started).
- When BGT-A is deactivated, it is swapped out on
a disk file. Each BGT has its own distinct area
in this file. A disk volume with fast access is
used for this file.
- BGT-B is determined by using a "Background Schedule"
scheme similar to the concept used in the Kernel
multiprocess scheduling: three priorities and one
idle priority, used when none of the others are
active.
- BGT-B is swapped in from the disk file and activated.
- The next event concerning background processing
is awaited.
L̲i̲m̲i̲t̲a̲t̲i̲o̲n̲s̲
The design of FIKS Background Processing is such
that the designer of FIKS-applications need not
care too much about whether a task is loaded
as background task or not. Some considerations, however,
have to be taken into account:
- A BGT must not have outstanding IOs for long periods,
i.e. IO-operations must be expected
to finish within a reasonable time (2-3 seconds,
equal to the BGT time slice). A BGT with outstanding
IOs will obstruct the other BGTs. A warning
about 'BGPS DISORDER' will appear on the system
console as an error report. If the IO has still not
finished after three warnings, the
system treats it as a fatal error, involving
SWITCHOVER etc.
- The BGT must be of limited size.
The BGT-memory areas shall be valid for all BGTs.
- Use of processing resources (CPU-time, IOs, etc.)
shall be limited. The BGTs are loaded with the
lowest Kernel-priority and shall share the resources
with all other BGTs. A risk of bottlenecks may
arise. An IDLE-priority to be used for tasks with
"absolute" low priority (online diagnostic
programs) has been implemented.
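The 'BGPS DISORDER' rule in the first limitation can be sketched as a small counter. It is assumed here that the check is made each time the background time slice expires, and that the warning count is reset once the BGT has no outstanding IOs; the specification states neither. The class and result names are hypothetical.

```python
class DisorderMonitor:
    """Counts 'BGPS DISORDER' warnings for the currently active BGT."""

    MAX_WARNINGS = 3  # from the text: fatal if IO still pending after three warnings

    def __init__(self):
        self.warnings = 0

    def on_slice_expired(self, outstanding_ios: int) -> str:
        if outstanding_ios == 0:
            # The BGT can be deactivated and swapped out normally.
            self.warnings = 0
            return "SWAP"
        if self.warnings < self.MAX_WARNINGS:
            self.warnings += 1
            return "WARNING: BGPS DISORDER"
        # Still blocked after three warnings: treated as a fatal error.
        return "FATAL"
```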
4.3.4.6 E̲r̲r̲o̲r̲ ̲H̲a̲n̲d̲l̲i̲n̲g̲
Errors in the FIKS system can arise for different
reasons and be handled in different ways.
- Hardware errors.
If an error occurs in a vital hardware component,
the only thing that can be done to recover
is to switch over to a dualized counterpart, if any.
The error may also occur in a component of non-vital
importance (one terminal, one trunk). The erroneous
component can then be excluded from the system and
the message traffic concerned rerouted. The system
will still be operating but now in a reduced version.
- Resource errors.
Some resources (MTCBs, Queue elements, disk files,
etc.) are shared between the applications. Because
of the limited number of resources and the
random way they are reserved by the applications,
there is a certain probability that a lack of
resources arises. The way of solving this problem
is to wait for release of the resources and see
to it that those already reserved get released.
- Software errors.
In the developing and debugging phase of an application,
errors may arise due to errors in the design/code
or incorrect configuration of the system. The cause
shall be removed and restart of the system performed.
The errors are sensed by processes in the FIKS-system
and reported by use of standard CR80 system software
procedures to the ESP. The reports will contain:
- name of reporting process
- an error code stating the cause of the error.
- an error label, stating the stage in the processing
where the error occurred.
The above mentioned items shall suffice for unique
determination of all possible error cases. The reporting
process is stopped when it issues a report.
The ESP receives the report, analyses it and formats
a report to be presented to the system operator and
the Watchdog. They may then take action upon this.
Based upon the analysis the ESP also takes actions by itself.
This may involve restart of the logging process, which
then also may take some actions.
In the following, the layout of the error report and
the actions to be taken are outlined.
Error report items:
- Watchdog header.
This contains information to the Watchdog about
fatal/non fatal error in hardware equipment. The
Watchdog decision about switchover is based upon
this.
- Indication of time.
- Name of logging process.
- Error code.
- Error label
- Action
The actions are based upon an analysis of the error
codes and can be:
- IGNORED
The error report is used to pass information to
the system operator. This may be the result of
an online diagnostic test that has no influence
in other respects. The process is restarted.
- LOCAL_FIX_UP
The error is one of those that may be recovered
or is allowed to exist (resource errors/single
device out of use). The ESP restarts the process
and it is then up to the process to take further
actions.
- DISCARD_DISK
When using the FIKS Dual Disk system (ref. sec.
4.3.4.8) a hardware error in one of the two disks may
occur. This can be recovered by discarding the
erroneous disk unit. This is performed solely by
the ESP. The discovery and reporting of the error
is performed in a way that is transparent
to the process.
- SWITCHOVER
A fatal error (hardware/software) has occurred
in the CR80 computer. The Watchdog header will
contain this information. The Watchdog will then,
if possible, start a "Switchover of Branches",
ref. sec. 4.3.4.9.
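The mapping from an error report to one of the four actions can be sketched as a table lookup. The error codes in `ACTION_BY_CODE` are invented for illustration; the real FIKS error codes are not listed in this section. The report carries the three items named above (process name, error code, error label).

```python
from dataclasses import dataclass

# Hypothetical error codes; only the four action names come from the text.
ACTION_BY_CODE = {
    "DIAG_RESULT": "IGNORED",
    "NO_FREE_MTCB": "LOCAL_FIX_UP",
    "DISK_UNIT_FAILED": "DISCARD_DISK",
    "CPU_FATAL": "SWITCHOVER",
}

@dataclass
class ErrorReport:
    process: str  # name of the reporting process
    code: str     # cause of the error
    label: str    # stage in the processing where it occurred

def plan_actions(report: ErrorReport) -> dict:
    """Derive what the ESP does for one report (raises KeyError on unknown codes)."""
    action = ACTION_BY_CODE[report.code]
    return {
        "action": action,
        # IGNORED and LOCAL_FIX_UP both end with a restart of the process.
        "restart_process": action in ("IGNORED", "LOCAL_FIX_UP"),
        # DISCARD_DISK is handled by the ESP alone.
        "discard_disk": action == "DISCARD_DISK",
        # SWITCHOVER is signalled to the Watchdog in the report header.
        "fatal_in_watchdog_header": action == "SWITCHOVER",
    }
```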
4.3.4.7 C̲h̲e̲c̲k̲p̲o̲i̲n̲t̲i̲n̲g̲
A checkpoint represents a state (or part of a state)
of some data structure in the CR80-memory that shall
be reestablished in connection with "Switchover of
Branches", ref. sec. 4.3.4.9. The data structures that
are checkpointed and recovered in the FIKS system, and
thereby not lost in case of fatal error in the CR80-computer,
are:
- Terminal Control Blocks.
This will ensure that all information concerning
a terminal and its user will be available also
after Switchover. I.e. logged on/off users will
be logged on/off after Switchover and there will
be no violation against the security procedures
concerning terminal operations.
- Message Preparation Pool.
If the system breaks down in the middle of a message
preparation, the preparer does not need to key
in the whole message once again when Switchover
is finished.
- MTCBs and Queues.
From the moment a message is released until it is printed
on the receiver's terminal, it is checkpointed in
such a way that the message will not be lost, regardless
of which computer in the FIKS network crashes and at
which point in the message processing it occurs. This
is achieved by carefully checkpointing the Message
Control Block (MTCB) each time it indicates a
change in the message state, and checkpointing in
which queue the message is placed at any given
moment.
- Routing Tables.
All routing information not already kept in disk
files is checkpointed so that all routing/rerouting
of message traffic is in effect even after Switchover.
Checkpointing is performed by the application processes
by sending an AMOS message to the Checkpoint process,
which then formats the final checkpoint and writes it
to the disk.
The AMOS-message contains information about what kind
of checkpointing is to be executed. When
the checkpointing is finished, an AMOS answer is returned
to the application process and it can proceed with
the processing. In this way it is exactly controlled
when and in which sequence the checkpointing is performed.
Thereby inconsistency between what is checkpointed
and what has been processed is avoided.
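The request/answer discipline can be sketched as follows. The AMOS message exchange is modelled as a plain function call that returns only after the record has been written; the record layout, the state names and the `restore` helper are hypothetical.

```python
class CheckpointProcess:
    """Stands in for the Checkpoint process; `disk` models the checkpoint file."""

    def __init__(self):
        self.disk = []

    def request(self, kind, key, payload):
        # Format the final checkpoint and write it before answering.
        self.disk.append({"seq": len(self.disk) + 1, "kind": kind,
                          "key": key, "payload": dict(payload)})
        return "ANSWER"

def change_message_state(cp, mtcb, new_state):
    """Application side: checkpoint the MTCB, then proceed."""
    mtcb["state"] = new_state
    if cp.request("MTCB", mtcb["id"], mtcb) != "ANSWER":
        raise RuntimeError("checkpoint not confirmed")
    return mtcb

def restore(disk):
    """Restart side: replay the records in sequence; the last one per key wins."""
    state = {}
    for record in sorted(disk, key=lambda r: r["seq"]):
        state[(record["kind"], record["key"])] = record["payload"]
    return state
```

Because the application continues only after the answer, the order of the records on disk equals the order of the state changes, which is what allows a simple replay at restart.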
In the restart phase the checkpoints are retrieved
for building up the checkpointed data structures. As
much as possible of this recovery is performed on system
level, i.e. the application processes do not need to
care for this processing.
It shall be noted that the checkpointing is redundant
if Switchover does not occur. It adds to the overhead
processing. It is therefore desired not to checkpoint
more than needed. Because of this not all processing
is checkpointed. The processing/procedures that may
be easily repeated after Switchover are not reflected
in checkpoints. In this way a Switchover can also act
as a "clean up" procedure: all processing not concerned
with the data structures mentioned previously is cleared.
4.3.4.8 D̲u̲a̲l̲ ̲D̲i̲s̲k̲ ̲O̲p̲e̲r̲a̲t̲i̲o̲n̲s̲
The FIKS Dual Disk Hardware configuration is outlined
in figure 4.3.4.8-1.
It is noted:
- 2 disk units (DISK ONE, DISK TWO) are available,
one in each BRANCH.
- Both disk units can be accessed from both branches,
i.e. a disk unit is not allocated for use by
one particular branch.
- 1 floppy disk unit is available. This can only
be accessed from BRANCH ONE.
- A special "File Processor" is allocated to perform
disk operations.
- The File Processor is connected to the User Processor,
where the application processes are running, via
a DMA link.
The idea with this configuration is to make FIKS operations
more independent of hardware failures in the disk system.
The two disk units are intended to be copies of each
other. If one of them fails, then the other will still
be present to carry out the operations, but now alone.
To handle this design, the software design as outlined
in the following, has been implemented. Ref. fig. 4.3.4.8-2.
S̲t̲a̲n̲d̲a̲r̲d̲ ̲D̲i̲s̲k̲ ̲O̲p̲e̲r̲a̲t̲i̲o̲n̲s̲
When an application process placed in the user processor
requests a disk read/write operation on a logical file,
a command concerning this is sent via the IO-system
and DMA-driver to the File Management System (FMS)
in the file processor. The FMS translates the command
to disk-sector read/write commands. These are handed
over to the disk drivers, one for each disk unit. The
disk drivers perform the final interface to the disk
controllers. In this way disk sectors are transferred to/from
the CR80-memory (disk cache). The data transfer between
user and file processor is controlled by the FMS and
executed by the DMA-drivers. When the operation is
finished, the application is informed about completion
via the FMS, DMA-drivers and IO-system.
Figure 4.3.4.8-1 Dual Disk Hardware Configuration
Figure 4.3.4.8-2 Disk Software Configuration
D̲u̲a̲l̲ ̲D̲i̲s̲k̲ ̲O̲p̲e̲r̲a̲t̲i̲o̲n̲s̲
When the Disk System status is DUAL, both disk units
are available. Disk read operations are performed from
one of the units while disk write operations are performed
on both units. In case of hardware failure in one of
the units, this unit is DISCARDED. The remaining unit
is then used alone, i.e. for both read and write operations.
The Disk System status has become ONE/TWO, corresponding
to the unit left. Later on, when the erroneous disk
is repaired or exchanged, it can be included in the
disk system again by a DUALIZE_DISKS-procedure.
After START_DUALIZE all disk write operations are performed
on both units. Read operations are still performed
from the unit that has been in the system all the time (the
old one). Meanwhile all disk sectors are copied from
the old unit to the new one. When the copying
is finished, the units are identical. The
procedure is terminated with FINISH_DUALIZE, after
which the Disk Status is again DUAL. The whole procedure
is performed without having the Node/MEDE out
of operation at any moment. The dualize-procedure is
activated by starting the background task
"DUALIZE_DISKS".
Both branches can access the disks at the same time,
but only the ACTIVE branch is allowed to (and can)
do write operations on the disks.
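The DUAL/ONE/TWO behaviour, including discarding and dualizing, can be illustrated with a toy model in which each disk unit is a dictionary of sectors. It is assumed that reads in DUAL status go to unit ONE; the specification only says "one of the units". Class and method names are hypothetical, and the write restriction for the STANDBY branch is not modelled.

```python
class DualDisk:
    def __init__(self):
        self.units = {"ONE": {}, "TWO": {}}
        self.status = "DUAL"      # DUAL, ONE or TWO
        self.dualizing = False

    @staticmethod
    def _other(unit):
        return "TWO" if unit == "ONE" else "ONE"

    def _read_unit(self):
        # In DUAL status unit ONE is read (assumption); otherwise the unit left.
        return "TWO" if self.status == "TWO" else "ONE"

    def write(self, sector, data):
        both = self.status == "DUAL" or self.dualizing
        for unit in (("ONE", "TWO") if both else (self.status,)):
            self.units[unit][sector] = data

    def read(self, sector):
        return self.units[self._read_unit()][sector]

    def discard(self, failed):
        self.status = self._other(failed)
        self.dualizing = False

    def start_dualize(self):
        if self.status == "DUAL":
            raise RuntimeError("disk system is already DUAL")
        self.units[self._other(self.status)] = {}  # repaired or exchanged unit
        self.dualizing = True

    def copy_sectors(self):
        new = self._other(self.status)
        for sector, data in self.units[self.status].items():
            self.units[new][sector] = data

    def finish_dualize(self):
        self.dualizing = False
        self.status = "DUAL"
```

Writes issued between `start_dualize` and `finish_dualize` reach both units, so the copy can run concurrently with normal operation without the new unit falling behind.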
All dual disk operations are transparent to the application
processes, i.e. they do not need to care for the disk
status. This holds also for disk hardware failures
that can be recovered by discarding a unit. The error
is discovered in the IO-system, the ESP is notified
and discards the unit, before ok completion is returned
to the application ref. figure 4.3.4.8-2.
The Disk Status is checkpointed (ref. sec. 4.3.4.7)
each time a change in it occurs. The checkpoint is
used in the following procedure.
I̲n̲i̲t̲i̲a̲l̲i̲z̲i̲n̲g̲ ̲o̲f̲ ̲t̲h̲e̲ ̲D̲i̲s̲k̲ ̲S̲y̲s̲t̲e̲m̲
When the FIKS System is bootloaded a 'hardware' Disk
Status is obtained. This status tells which disk units
can be accessed from a hardware point of view. The
BRANCH/STATE is settled. On the basis of this and the checkpointed
Disk Status, the Disk Status to be used in the further
processing is determined. The status shall be the one
used last time the system was ACTIVE, or at least not
indicate use of a disk unit not in use last time. This
is assured by using the largest common Disk Status
of the three possible disk statuses: one checkpointed
on each disk unit and one hardware disk status. The
disk status is then checkpointed. If the branch is
going to be ACTIVE, permission to do disk writing
is given.
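One way to read "largest common Disk Status" is as a set intersection, with DUAL standing for both units. The sketch below follows that reading. The handling of a checkpoint that cannot be read (`None`) and the result "NONE" for an empty intersection are assumptions; the specification does not describe these cases.

```python
from typing import Optional

UNITS = {
    "DUAL": frozenset({"ONE", "TWO"}),
    "ONE": frozenset({"ONE"}),
    "TWO": frozenset({"TWO"}),
}
NAMES = {units: name for name, units in UNITS.items()}

def largest_common_status(hardware: str,
                          checkpoint_on_one: Optional[str],
                          checkpoint_on_two: Optional[str]) -> str:
    """Intersect the hardware status with the checkpointed statuses.

    A checkpoint given as None (unit not readable) is skipped (assumption).
    """
    common = UNITS[hardware]
    for status in (checkpoint_on_one, checkpoint_on_two):
        if status is not None:
            common = common & UNITS[status]
    return NAMES.get(common, "NONE")
```

For example, if unit TWO was discarded earlier, its own checkpoint may still say DUAL while the checkpoint on unit ONE says ONE; the intersection yields ONE, so the stale unit is not taken into use again.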
4.3.4.9 S̲w̲i̲t̲c̲h̲o̲v̲e̲r̲ ̲o̲f̲ ̲B̲r̲a̲n̲c̲h̲e̲s̲
Suppose that an entire dualized Node/MEDE configuration
as in figure 4.3.4.1 exists. One branch (BRANCH ONE)
is ACTIVE and the other branch (BRANCH TWO) is STANDBY.
The ACTIVE branch is performing all the operations
and the STANDBY branch is as ready as possible to take
over the operations. A fatal error occurs in the ACTIVE
branch. The STANDBY branch is then to take over the
operations. This happens as follows:
- The error is sensed by the Watchdog, either by
missing answers to the 'Alive Status Report' requests
(ref. sec. 4.3.4.2) sent to the ESP in the ACTIVE branch
or at reception of a fatal error report (ref. sec.
4.3.4.6) from the ACTIVE branch.
The Watchdog then knows it has to start the Switchover
procedure.
- The system operator is asked "SWITCHOVER FROM
P1 TO P2 ALLOWED?". The operator has 10 seconds
to decide. If he answers 'NO', all further
processing is cancelled. If 'YES' or the time expires,
the Switchover proceeds.
- Before the STANDBY branch can take over, possible
ongoing processing in the ACTIVE branch has to
be stopped. If both branches were executing active
operations at the same moment, then this would
cause severe confusion in the disk/TDX-system.
The Watchdog issues a CLOSE-command to the ACTIVE
branch and waits upon completion of this command.
- The ESP receives the command. It is very important
that CLOSE of the system is performed in a proper
manner. It is especially important that the processing
concerning access to the disk system is terminated
in such a way that the logical coherence is kept.
All application processes are terminated so that
the disk files being accessed are dismantled correctly.
This cleanup procedure is accomplished by use of
the Job Control File 'CLOSE'. When finished, the
ESP will take care that the disks/TDX-hosts used
are released. Completion of the
CLOSE-command is reported to the Watchdog.
- When the Watchdog receives CLOSE-completion or
the waiting time expires, the Watchdog will MASTERCLEAR
the former ACTIVE branch. In this way it is assured
that the branch will not in any way interfere with
the processing in the coming ACTIVE branch. The
Watchdog will then issue a RECOVER-command to the
STANDBY branch.
- The ESP receives the command and starts the RESTART-procedure.
First the disk system is initialized (ref. sec.
4.3.4.8). This will ensure that the same disk configuration
as used by the former ACTIVE branch also will be
used by the coming ACTIVE branch. A global semaphore,
telling all the application processes that they
have to do their part of the RESTART-procedure,
is set.
- The RECOVER of those CR80-memory data structures
that can be handled on system level is now performed.
To do that, the checkpoints stored by the former
ACTIVE branch are used. Special applications have
been developed for this purpose. The SYSCHP-process
recovers 'Terminal Control Blocks' and 'Message
Preparation Pool', the RECOVM-process recovers
the MTCBs and Queues, and the RECMES-process recovers
the messages that were in preparation at the
moment of Switchover (ref. sec. 4.3.4.7). The
task of resetting all the Message Preparation Files
not in use is left to the RESPDB-process.
- The TDX-system is initialized once again. All the
application processes, loaded earlier in the STANDBY
initializing phase, are started. They will then
do their part of the RECOVER-procedure.
- The execution of the total RESTART-procedure is
controlled by using the Job Control File 'RECOVER'.
The whole procedure takes about 2 minutes. This
is the time the Node/MEDE will be out of operation
in case of fatal errors. When the procedure is
finished a message will be sent to the SCC, telling
that 'Switchover' has occurred. The Watchdog is
told that the branch is now ACTIVE.
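The Watchdog's part of the sequence above can be summarised in a short
sketch. This is only an illustration of the ordering described in the
text, not FIKS code: the function name, the step strings and the two
parameters are assumptions introduced here. Only the 10-second operator
limit, the CLOSE, MASTERCLEAR and Recover commands, and the rule that a
timeout counts as consent come from the description.

```python
from typing import List, Optional

OPERATOR_TIMEOUT_S = 10  # from the text: the operator has 10 seconds to decide


def switchover_steps(operator_answer: Optional[str],
                     close_completed: bool) -> List[str]:
    """Return the ordered Watchdog actions for one Switchover attempt.

    operator_answer: 'YES', 'NO', or None when the 10 seconds expire.
    close_completed: True if the ACTIVE branch reported CLOSE-completion,
                     False if the Watchdog's wait for it expired.
    (Both parameters are hypothetical; they stand in for real events.)
    """
    steps = ["ask operator: SWITCHOVER FROM P1 TO P2 ALLOWED"]
    if operator_answer == "NO":
        # An explicit refusal cancels all further processing.
        steps.append("cancel: no further processing")
        return steps
    # 'YES' and an expired timer are treated the same way.
    steps.append("send CLOSE to ACTIVE branch")
    steps.append("CLOSE completed" if close_completed else "CLOSE wait expired")
    # The former ACTIVE branch is master-cleared in both cases, so it
    # cannot interfere with the coming ACTIVE branch.
    steps.append("MASTERCLEAR former ACTIVE branch")
    steps.append("send RECOVER to STANDBY branch")
    steps.append("STANDBY branch reports ACTIVE")
    return steps
```

Calling switchover_steps(None, False) models the worst case: the operator
does not answer and the ACTIVE branch never confirms the CLOSE, yet the
Switchover still runs through to the end.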
4.3.4.10 S̲y̲s̲t̲e̲m̲ ̲U̲t̲i̲l̲i̲t̲i̲e̲s̲
The FIKS operating system (the ESP) offers a wide range
of utilities to the staff (system operators, technicians,
programmers and designers) who maintain the system:
- Printout of various system states:
- The status of the processes.
- The disk system status, no. of read/write operations
performed/failed, etc.
- The TDX-system status, no. of transmissions/retransmissions/errors,
etc.
- The file processor status.
- The status of the critical regions.
The printouts are meant to be used for diagnostics,
resource consumption investigations, debugging,
etc.
- Error dumps.
In error cases, the whole CR80-memory or part of
it can be dumped to disk files, either by intervention
from a system operator or automatically,
initiated by a process error report.
- Floppy utilities.
Error dumps can be copied from
the disk files to floppy files while the FIKS system
is operating online. Minor updates of the software
configuration may also be carried out online. The
software modules to be updated are loaded from floppy
to the disk system and a SWITCHOVER-procedure is
performed. After this the updated modules are included
in the operating system.
- Online inspection utilities.
The CR80-memory and the disk files can be inspected
online, while the FIKS system is running. This
is very useful when debugging
and analyzing error cases. These utilities
are also meant to be used in test procedures for
updated/new software modules in the FIKS system.
The use of some of the utilities has a security
aspect. Therefore these utilities cannot be activated
unless a PASSWORD-procedure has been passed.
4.3.4.11 S̲y̲s̲t̲e̲m̲ ̲O̲f̲f̲l̲i̲n̲e̲ ̲S̲t̲a̲t̲u̲s̲
The maintenance of the FIKS software system is in principle
performed in an offline mode using the standard CR80
AMOS Terminal Operating System (TOS). The TOS gives
access to all needed functions:
- generations of programs
- editing of files
- patching of files
- copying of files
- disk system recovery
- etc.
The TOS is started by BOOT-loading the CR80-computer
with special boot modules.
Other offline states of the CR80-computer, also started
by using a special boot module, may exist:
- AMOS Master Clear Utilities.
(low level CR80-computer testing and analyzing).
- Disk Test Utilities
(special FIKS application)
- Backup of disk volumes.
Uncontrolled access to these offline states
would cause security problems. Therefore the boot modules,
which are the key to the states, cannot normally be
used. They have to be enabled first. This can only be
done from within the FIKS operating system by a system
operator who has passed a PASSWORD-procedure.
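The enabling rule for boot modules can be pictured with a small model.
This is a sketch under stated assumptions: the class, its methods and the
module names used in the example are hypothetical and do not describe the
actual FIKS mechanism. It only shows the rule given above, that a boot
module is unusable until an operator who has passed the
PASSWORD-procedure enables it.

```python
class BootModuleGate:
    """Hypothetical model: offline boot modules stay disabled until an
    operator who has passed the PASSWORD-procedure enables them."""

    def __init__(self, modules):
        # Every known boot module starts out disabled.
        self._enabled = {name: False for name in modules}

    def enable(self, name, operator_authenticated):
        """Enable one boot module; refuse if the password step was not passed."""
        if not operator_authenticated:
            raise PermissionError("PASSWORD-procedure not passed")
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def can_boot(self, name):
        """A module may be boot-loaded only after it has been enabled."""
        return self._enabled.get(name, False)
```

For example, a gate created with BootModuleGate(["TOS", "DISK_TEST"])
answers can_boot("TOS") with False until enable("TOS", True) has been
called; enabling one module leaves the others disabled.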