Rational/R1000s400/Logbook/2022
2022-12-22 It sounds like a soft packet
The "nopar" simulation keeled over:
====>> Environment Elaborator <<====
Elaborating subsystem: ENVIRONMENT_DEBUGGER
Elaborating subsystem: ABSTRACT_TYPES
Elaborating subsystem: MISCELLANEOUS
Elaborating subsystem: OS_UTILITIES
Elaborating subsystem: ELABORATOR_DATABASE
====>> Environment Log <<====
04:55:13 *** EEDB.Init Format_Error Exception = Constraint_Error (Array Index), from PC = #2ED808, #E8
*** Calling task (16#EFE8C04#) will be stopped in wait service
04:55:13 !!! Internal_diagnostic (trace) ** Start of trace **
04:55:13 !!! Internal_diagnostic (trace) Task: #EFEB004
04:55:14 !!! Internal_diagnostic (trace) Frame: 1, Pc = #92013, #64D
04:55:14 !!! Internal_diagnostic (trace) in rendezvous with #EFE8C04
04:55:14 !!! Internal_diagnostic (trace) Task: #EFE8C04
04:55:14 !!! Internal_diagnostic (trace) Frame: 1, Pc = #163C13, #151B
04:55:14 !!! Internal_diagnostic (trace) Frame: 2, Pc = #163C13, #1B52
04:55:14 !!! Internal_diagnostic (trace) Frame: 3, Pc = #163C13, #19B1
04:55:14 !!! Internal_diagnostic (trace) Frame: 4, Pc = #92813, #C8
04:55:15 !!! Internal_diagnostic (trace) Frame: 5, Pc = #92813, #D6
04:55:15 !!! Internal_diagnostic (trace) Frame: 6, Pc = #92413, #E4
04:55:15 !!! Internal_diagnostic (trace) Frame: 7, Pc = #92013, #CC
04:55:15 !!! Internal_diagnostic (trace) ** End of trace **
We can't say we are surprised, since that was a really long shot.
But there is also good news, really good news in fact: We finally cracked how to delete CMVC views which (think they) have a remote component:
design_implementation.set_target("RS6000_AIX_VADS", "!PROJECTS.…");
gets rid of the
*** The target key "" has not been registered.
diagnostic, and:
switches.set("rci.host_only=true", "!PROJECTS.….STATE.COMPILER_SWITCHES");
prevents attempts to contact long gone University of Haute Alsace machines to delete the remote views.
With this newfound power, aided by positive developments in the AutoArchaeologist which we have neglected to mention, we have deleted 44431 names, so we now have a set of disk-images which boot in 26 instead of 61 minutes.
This makes a huge difference both on the real hardware and for the emulator.
The AutoArchaeologist breakthrough is that we have cracked the format of the segmented-heap in which the Directory Daemon holds the entire file system name-space.
The segment in question is the only one with a value of 0x81 in the field we have named "tag", and when excavated it looks like this (NB: a very large & slow webpage):
https://datamuseum.dk/aa/r1k_backup/2b/2bea6d323.html
Simply having a list of the entire contents of the disk-images is an enormous help:
For one thing, it allowed us to construct lists of commands to be issued through the operator's terminal, using our little python script, leaving the machine to chew unattended for hours, instead of us having to attend to it manually every minute or two.
The other thing the list of filenames reveals is what an archaeological treasure these disk-images are:
!.MACHINE.RELEASE.ENVIRONMENT.D_12_5_0
!.MACHINE.RELEASE.ENVIRONMENT.D_12_6_5
!.MACHINE.RELEASE.ENVIRONMENT.D_12_7_3
!.MACHINE.RELEASE.ENVIRONMENT.D_12_7_4
And we find no less than 48 versions of !COMMANDS.SYSTEM_MAINTENANCE
Also revealed by the list: Students are messy :-)
2022-12-16 All we want for X-mas
The "nopar" simulation is at 1285 seconds, projected to hit the 1957 seconds where the first "main" run faltered on december 26th, which we call "2nd X-mas day" here in Denmark.
The amazing boot speedup we reported in the previous report was a fluke: Sometimes the system will emit the "username :" prompt on the operator console before the initialization is fully completed.
We have written a new python3 script which reads commands from a (unix) text-file, submits them via the operator's serial console and logs the communication with timestamps.
Using this script, two sequential boots, starting from the original PAM disk images, differed only 0.2% in duration. (3671 seconds vs. 3678 seconds.)
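For reference, a minimal sketch of what such a console-driver script can look like; this is our illustration only (assuming pyserial and a hypothetical /dev/ttyUSB0 device), not the actual script, which differs in detail:

# Sketch of a console-driver script: send commands from a text file to the
# operator console and log everything received with timestamps.
# Device name, baud rate and the 30 second "quiet" heuristic are assumptions.
import sys
import time

import serial   # pyserial

def drive_console(cmd_file, port="/dev/ttyUSB0", baud=9600, logfile="console.log"):
    ''' Submit commands one line at a time, logging the replies. '''
    ser = serial.Serial(port, baud, timeout=1)
    with open(cmd_file) as cmds, open(logfile, "a") as log:
        for line in cmds:
            ser.write(line.rstrip("\n").encode() + b"\r")
            quiet = 0
            while quiet < 30:                     # wait until the console goes quiet
                data = ser.readline()
                if data:
                    quiet = 0
                    log.write("%.3f %s\n" % (time.time(),
                                             data.decode(errors="replace").rstrip()))
                    log.flush()
                else:
                    quiet += 1

if __name__ == "__main__":
    drive_console(sys.argv[1])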
Booting the same image with only one MEM32 board is 8.5% slower (3989 seconds.) There is no difference during the "starting vm" phase, and most of the slowdown happens after INITIALIZE. This supports our theory that deinstalling some of the software may help.
Our first command script destroys some large archive files, removes !PROJECTS.KBA.SYSTEM (121022 blocks), runs Daemon.run("Daily") and shuts down.
That took 6 hours to run, including boot and shutdown. Having saved a snapshot of the resulting disk-image, we should never need to run it ever again.
The new snapshot boots 15% faster, in 3132 seconds, with most of the speedup being in the "starting vm" and EEDB phases.
2022-12-10 SITREP
After some experiments we are now running two jobs in parallel on the MacBook M2:
(The label on the right Y-axis should read "days")
In purple and green we see the "main" run which failed.
In cyan and orange we have precisely the same run as the one which failed on us, to find out if it fails the same way again.
In yellow and blue is the "nopar" branch, which runs 2.8 times faster, to find out if it fails the same way.
By early January we will have one of three outcomes:
A) "nopar" fails like "main."
B) "nopar" continues past the point of failure.
C) "nopar" fails some other way sooner.
It goes without saying that we are rooting for "B", and not unreasonably so, since the VME/ENP100, RTC and internal modem emulator code have all been changed significantly between the versions running "main" and "nopar".
In the meantime we have redoubled our efforts to reduce the boot-time on the real computer.
The boot time for the pristine image, as PAM brought the disks, is 37 minutes to launch the VM and 32 minutes from there until the terminals are enabled for login, 69 minutes in total.
Several hours of cmvc.destroy_view and Daemon.run("Disk") this Thursday gained us three minutes in VM startup, but a whopping 10 minutes from there on, so we are down to 54 minutes total.
(Not quite enough to make us restart the "nopar" run, but close: It simulates 15 minutes in 12 days.)
Our main hypothesis for the post-VM speedup is that by deleting views under !PROJECTS.KBA.SYSTEM, one of the largest subsystems, CMVC had less work to do during initialization.
We are also reading up on the boot/initialization sequence, hoping to be able to disable everything related to ethernet.
Since we are not emulating the ENP100, the best we can hope for from that code is a pile of angry error messages, after some generous timeouts.
At least that is what we saw on the real system when we removed the ENP100.
At worst our emulation of the missing ENP100 might lead to kernel panics, including potentially, the one which stopped our first long run.
2022-11-27 Options
We are of course disappointed that the emulator died after three months, but to be honest, we never expected it to get that far in the first place.
Now that it has stopped, we are thinking hard about what we do next, because a 3 month turn-around does not make trial & error a viable strategy: Actuarial tables give us, at best, 50 sequential experiments.
We have come up with this list of calamities we cannot rule out, in no specific order:
A) An inaccuracy in the redrawn schematics.
B) A deficiency in one of the SystemC component classes
C) A race condition due to imprecise emulation of component timing
D) Wrong responses to I/O requests sent to the IOP
E) Wrong Cluster Number or other EEPROM or NOVRAM content
F) The schematics do not match the hardware 100%
G) MacBook execution error, Cosmic rays, etc.
Their probabilities are different, but falsifying any one of them will take a lot of time and effort.
The emitted diagnostic is undoubtedly an important clue, but in a language it will take considerable time for us to master.
And while we ponder these options, the MacBook could be chugging away to get us more data, but what should we run on it ?
If we restart the exact same run, in three+ months we will learn if:
A) The problem is exactly reproducible, ie: it crashes, microcode trace RAMs are identical.
B) Approximately reproducible, ie: it crashes approximately the same, but microcode trace RAM tells different story.
C) Not reproducible, ie: The emulation fails in a different way.
D) Transient, ie: The emulation continues to login in 5-6 months.
If we start one of the optimized branches, we only learn something about this failure if the optimizations introduced no new problems, and of that we have almost zero confidence.
On the other hand, if new problems have been introduced, we will know much sooner than three months, and if not, we will learn about the current problem in approx 6 weeks instead of 3 months.
We have also not explored what the performance impact would be, if we ran two instances of the emulator in parallel.
So why not do both ?
We will - once this entry has been saved in the logbook.
So what are our options ?
One obvious and compelling option is to switch to a hardware based approach.
Find a suitable FPGA evaluation board, convert our netlist to VHDL, come up with component models in VHDL, and run tests in a matter of days and hours instead of months.
It is of course nowehere near as simple as that, but if we jumped on it, we might have the first test-result before the MacBook delivers the next one.
It is however not cheap.
We need a pretty good FPGA to fit all of the R1000, the M68K20 and five 8052 CPUs, and we need 32+ MB of RAM for the MEM32 board, external or internal to the FPGA.
That is of course assuming we can find an M68k20 model to use, and that we reverse engineer the RESHA board, because we only have preliminary schematics for that.
Partitioning the IOC board so the IOP can run on an external support processor is probably both faster and more feasible.
Also: None of us are VHDL sharks.
On the plus side: The fantastic diagnostic subsystem of the R1000 will help a lot.
Another option is to speed up how fast we can run tests on the software emulation.
We have looked at getting more CPU cores engaged, ideally one per board.
SystemC has threads, but they are notoriously slow because of the cross-thread locking they require, and in particular, they all seem to use the same central event-scheduling data-structures, so there is no real prospect of any gain by running a thread per board.
A more promising avenue would be to run each board in a separate UNIX process, and implement the front- & backplane in shared memory, using atomic instructions to (spin-)lock the boards each simulated 200ns clock period.
Such an implementation will run at the speed of the slowest board, and we currently see boards run 100-200 times slower than hardware when simulated individually.
If we assume such a model ends up running only 100 times slower than hardware, that means the five processes must synchronize every (200 ns * 100 = 20 µs), which is not unreasonable.
This is clearly worth looking into.
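As a sketch of the synchronization scheme (illustration only; the real implementation would be C++ processes spinning on atomics in shared memory, not Python): each board lives in its own process and all of them rendezvous at a barrier once per simulated 200 ns period.

# Lock-step illustration: one process per board, one barrier per 200 ns period.
import multiprocessing as mp

PERIOD_NS = 200
BOARDS = ["IOC", "FIU", "SEQ", "TYP", "VAL"]

def board(name, barrier, periods):
    for t in range(periods):
        # ... simulate this board's share of one 200 ns clock period ...
        barrier.wait()          # exchange front-/backplane state here
    print(name, "done at", periods * PERIOD_NS, "ns")

if __name__ == "__main__":
    barrier = mp.Barrier(len(BOARDS))
    procs = [mp.Process(target=board, args=(b, barrier, 1000)) for b in BOARDS]
    for p in procs:
        p.start()
    for p in procs:
        p.join()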
Finally, it is time we complete the support for snapshots, so that we can restart a three month run from T - N hours, instead of not getting $200 when we cross "Start". This has been in the design from the very start; until now we have just not needed it enough.
So much work to do, so little time...
2022-11-24 All good things
After 84 days & 15½ hours, and 1957 seconds of simulated time:
====>> Kernel.11.5.8 <<====
Kernel: Kernel assert failure detected by kernel debugger stub. System shutting down.
Exception: !Lrm.System.Assertion_Error, from PC = #F6413, #707
***************************************
Sequencer has detected a machine check.
The overall simulation ratio was 1/3736, not quite a second per hour, but close.
2022-11-24 Getting further
IMAGE.11.4.2D
CORE_EDITOR.11.6.2D
TOOLS.11.5.1D
OE_MECHANISMS.11.1.2D
OBJECT_EDITOR.11.6.1D
MAIL.DELTA
OS_COMMANDS.11.6.2D
====>> Kernel.11.5.8 <<====
Kernel:
2022-11-15 It keeps dripping
====>> Elaborator Database <<====
COMPILER_UTILITIES.11.51.0D
SEMANTICS.11.50.3D
R1000_DEPENDENT.11.51.0D
R1000_CHECKING.11.51.0D
R1000_CODE_GEN.11.51.0D
====>> Kernel.11.5.8 <<====
Kernel:
2022-11-13 Houston, this is not a problem
This morning on the simulated console:
DISK_CLEANER.11.1.3D
PARSER.11.50.1D
PRETTY_PRINTER.11.50.3D
DIRECTORY.11.4.6D
INPUT_OUTPUT.11.7.0D
====>> Environment Log <<====
00:01:06 !!! Product_Authorization Invalid for Work_Orders
00:01:06 !!! Product_Authorization Invalid for Cmvc
00:01:06 !!! Product_Authorization Invalid for Insight
00:01:07 !!! Product_Authorization Invalid for Rpc
00:01:07 !!! Product_Authorization Invalid for Tcp/Ip
00:01:07 !!! Product_Authorization Invalid for Rci
00:01:07 !!! Product_Authorization Invalid for X Interface
00:01:08 !!! Product_Authorization Invalid for Rs6000_Aix_Ibm
00:01:08 !!! Product_Authorization Invalid for Cmvc.Source_Control
00:01:08 !!! Product_Authorization Invalid for Rcf
00:01:09 !!! Product_Authorization Invalid for Testmate
00:01:09 !!! Product_Authorization Invalid for Lrm_Interface
00:01:09 !!! Product_Authorization Invalid for Fundamental Session
00:01:10 !!! Product_Authorization Invalid for Telnet
00:01:10 !!! Product_Authorization Invalid for Dtia
00:01:10 !!! Product_Authorization Invalid for X_Library
00:01:10 !!! Product_Authorization Invalid for Ftp
====>> Kernel.11.5.8 <<====
Kernel:
That is probably because the disk-image is from PAM's machine (cluster number 408459), whereas the IOC-EEPROM image, where the cluster number is stored, is from Terma's machine (cluster number 453305).
If we are lucky we get a login-prompt sooner this way, if lack of authorization eliminates these layered products from the workload.
2022-11-10 Loss of grid power
We lost grid power for 30 minutes today:
But the MacBook M2 coasted right through, and has added two more packages in the last three days:
BASIC_MANAGERS.11.3.0D
ADA_MANAGEMENT.11.50.4D
2022-11-07 Action!
Things are happening now:
the virtual memory system is up
====>> Kernel.11.5.8 <<====
Kernel: START_NETWORK_IO
Kernel: START_ENVIRONMENT
TRACE LEVEL: INFORMATIVE
====>> Environment Elaborator <<====
Elaborating subsystem: ENVIRONMENT_DEBUGGER
Elaborating subsystem: ABSTRACT_TYPES
Elaborating subsystem: MISCELLANEOUS
Elaborating subsystem: OS_UTILITIES
Elaborating subsystem: ELABORATOR_DATABASE
====>> Elaborator Database <<====
NETWORK.11.1.5D
OM_MECHANISMS.11.2.0D
This is from the MacBook M2, after emulating approx 1560 seconds of CPU time.
We have scanned all the documents we received from Grady Booch, and archived them in the Datamuseum.dk Bit Archive and are reading our way through them.
There has also been progress on the optimized branches of the emulator, but not enough to warrant a detailed update yet.
2022-10-22 Potato-Vacation
Week 42 used to be "potato-vacation" in Denmark, where kids were out of school to help get the potatoes harvested. These days we call it "autumn-vacation" where people close down their "summer-houses" or just stay inside and read.
We did a first quick read of the documents Grady sent us, and it is obvious that this is really going to further our understanding of the R1000 machine and software.
The HW-identical simulation is at 1223 seconds in 52 days and simulating.
We have also implemented "turbo download" which speeds up loading the microcode:
The purple and corresponding green-ish lines are the MacBook Pro (M2) running the HW-identical emulation.
The big drop at 3.8 seconds is when the microcode has been loaded and started.
The small peaks at 4.75 seconds are when the microcode has initialized and waits for the download of the segments specified in the configuration. The peak at 5 seconds is the pause to show the copyright message on the operator console.
The cyan and corresponding orange lines are the "megacomp4" branch, running on a Lenovo T14s. Only the first minor peak is visible. Notice that performance is approximately the same as the HW-identical run, but on a machine with only half the performance.
The yellow and corresponding blue-ish lines are with "turbo download" enabled, which saves a second of simulated time and about 15 minutes of real time.
The red and corresponding black lines are with both "turbo download" and "direct download". Now the microcode download is almost instant, and we are executing microcode in a matter of (real) minutes.
So what is "turbo download" ?
On the real machine, the IOP pours the microcode onto the DIAGBUS, interleaving "experiments" for multiple cards in order to speed up things.
But because SystemC is single-threaded, that parallelism does not happen, since only one DIPROC executes at a time.
"Turbo download" cheats. Instead of passing the received DIAGBUS data to the DIPROC and it's interrupt routine, the thread which subscribed to the "elastic" buffer, writes the experiment directly into the DIPROC RAM. Since each DIPROC has it's own elastic-thread, that runs in parallel, which shaves a second of emulated time.
"Direct download" takes this a step further, it recognizes specific experiments, for instance LOAD_CONTROL_STORE_200.FIU
, picks the microcode bits out, and sticks them directly into the SystemC components shared memory context, without involving the DIPROC and the thread executing the SystemC emulation at all.
Not only is that parallel, it is also much faster, as can be seen from the diagonal lines: That shaves one (real) hour off our test-runs.
2022-10-17 Packet from Grady
1101 seconds emulated in 46 days, still nothing new on the console.
This morning a packet arrived with a donation from Grady Booch:
Amongst the gems were:
All these documents will be scanned and made available in our BitArchive as soon as possible.
We have also experimented with a new branch where we start to remove "excess" functionality in order to gain speed and flexibility.
First to go were the parity-checks and ECC checks. Unless we explicitly want to trigger them, they will never happen, so there is no need for them in the "show-the-Environment" version of the simulator.
The next step is to also remove the diagnostic archipelago, both as a matter of speed, but also because it makes very concrete demands on the circuitry in order to function, demands which for instance prevent us from using dual-port components for the TYP+VAL register-files.
The first thing is to stop using the DIPROCs: So far we have implemented downloading of microcode, register-files and dispatch RAM by taking the data from the DIAGBUS and stuffing it directly into the shared memory context of the applicable components. It works nicely.
This then frees us from the exact layout of the diagnostic registers, for instance to rearrange bits in the microcode words to make for more and wider SystemC busses etc.
2022-10-09 A quarter of the way ?
921 seconds emulated in 936 hours, so we are about ¼ of the way to login.
The optimized schematics run at about 2100 (real) seconds per simulated second, 43% faster than the HW-identical simulation, and our current test-run has started initializing the virtual memory, so it looks correct-ish.
If we let it run, it will overtake the HW-identical simulation in a couple of months, call it mid-December, and get to the login-prompt three weeks sooner, early January rather than late January.
One side effect of both of these runs is that they give us a trace of which disk-sectors the kernel reads and writes during VM initialization; this will be valuable information when we resume trying to figure out the on-disk layout.
2022-10-01 Now talking to the emulator
733 seconds emulated in 740 hours, otherwise known as "a month".
A three day run with the optimized schematics, booted into "KERNEL mode" allowed us to issue kernel commands via the serial port for the first time:
CLI/CRASH MENU - options are:
   1 => enter CLI
   2 => make a CRASHDUMP tape
   3 => display CRASH INFO
   4 => Boot DDC configuration
   5 => Boot EEDB configuration
   6 => Boot STANDARD configuration
Enter option [enter CLI] : 4
[…]
Starting R1000 Environment - it now owns this console.
====>> Kernel.11.5.8 <<==== ERROR_LOG <<====
00:51:13 --- TCP_IP_Driver.Worker.Finalized
====>> Kernel.11.5.8 <<====
Kernel: SHOW_CONFIGURATION_BITS
IOP 0 POWER ON
CPU 0 POWER ON
OPERATOR MODE => AUTOMATIC
KERNEL DEBUGGER AUTO BOOT => TRUE
KERNEL AUTO BOOT => FALSE
EEDB AUTO BOOT => FALSE
KERNEL DEBUGGER WAIT ON CRASH => FALSE
KERNEL DEBUGGER DIALOUT ON CRASH => FALSE
DIAGNOSTIC MODEM CAN DIALOUT => FALSE
DIAGNOSTIC MODEM CAN ANSWER => TRUE
Processor revision => 2.0
IOP revision => 4.2.18
Kernel: SHOW_DISK_SUMMARY
DISK STATUS SUMMARY
             Q    IOP   Total   Total   Seek  Soft  Hard  Un     Total
Vol Unt     Len   Len   Reads   Writes  Errs  Ecc   Ecc   Recov  Errs
--------------------------------------------------------------------------
 --  0       --    --     --      --     --    --    --    --     --
 --  1       --    --     --      --     --    --    --    --     --
no disk IO in progress
Debugging information:
Ready_Volume mask => 0
Busy_Event_Page => ( 1023, DATA, 259, 193)
Volume_Offline_Event_Page => ( 1023, DATA, 259, 194)
Kernel:
2022-09-25 10 minutes in 25 days
We just crossed 600 seconds of emulated time, after running for 25 days, so the "end of January" prognosis still holds.
We have implemented enough of the Xecom XE1201 integrated modem that the {diagnostic modem: received DISCONNECT event} message no longer appears.
The optimized "megacomp4" version of the schematics are nearly 60% faster than the hardware-identical "main" branch.
2022-09-18 The emulation goes ever ever on…
424 seconds emulated in 419 hours, nothing new output on the console, as expected.
2022-09-13 Good R1000 news
The R1000 was still in a bad mood this afternoon:
R1000-400 IOC SELFTEST 1.3.2
  512 KB memory ...  * * * * * * *  FAILED
This provided an opportunity to try our new boot image, which provided us with the following hint:
Defect chips detected: H34
So not the slightly suspicious H40; the oscilloscope verified the all too well-known latch-up on H34, so little doubt remained about what was needed...
Since this is probably not the last time we have to do this, here is the procedure we followed:
Cut the chip away from its legs (much easier to de-solder the legs one at a time).
Cleaning the pads was not quite as easy, but succeeded:
Then solder in the replacement:
With a new H34 in place, the R1000 became happy again:
R1000-400 IOC SELFTEST 1.3.2
  512 KB memory ... [OK]
  Memory parity ... [OK]
  I/O bus control ... [OK]
  I/O bus map ... [OK]
  I/O bus map parity ... [OK]
  I/O bus transactions ... [OK]
  PIT ... [OK]
  Modem DUART channel ... [OK]
  Diagnostic DUART channel ... [OK]
  Clock / Calendar ... Warning: Calendar crystal out of spec! ... [OK]
Checking for RESHA board
  RESHA EEProm Interface ... [OK]
Downloading RESHA EEProm 0 - TEST
Downloading RESHA EEProm 1 - LANCE
Downloading RESHA EEProm 2 - DISK
Downloading RESHA EEProm 3 - TAPE
  DIAGNOSTIC MODEM ... DISABLED
  RESHA VME sub-tests ... [OK]
  LANCE chip Selftest ... [OK]
  RESHA DISK SCSI sub-tests ... [OK]
  RESHA TAPE SCSI sub-tests ... [OK]
  Local interrupts ... [OK]
  Illegal reference protection ... [OK]
  I/O bus parity ... [OK]
  I/O bus spurious interrupts ... [OK]
  Temperature sensors ... [OK]
  IOC diagnostic processor ... [OK]
  Power margining ... [OK]
  Clock margining ... [OK]
Selftest passed

Restarting R1000-400S September 15th, 1922 at 17:25:23

OPERATOR MODE MENU - options are:
   1 => Change BOOT/CRASH/MAINTENANCE options
   2 => Change IOP CONFIGURATION
   3 => Enable manual crash debugging (EXPERTS ONLY)
   4 => Boot IOP, prompting for tape or disk
   5 => Boot SYSTEM
Enter option [Boot SYSTEM] :
The Rational R1000-400 is fully functional again!
2022-09-13 Thank you for holding…
The MacBook Pro chews on, 13 days so far and 329 seconds of emulated time, a little better than "an hour per second".
So far about 11300 disk-accesses have taken place as part of the virtual memory startup.
Here is a plot of the simulation rate and accumulated time:
The jump at 110 seconds is where the virtual memory startup commences.
If we zoom in on the left side:
…we see the plot also contains a much shorter run using optimized schematics (and on the MacMini), which is about 33% faster.
We are almost through the list of trivial optimizations, and it looks like the first hard one should be the TYP and VAL register files, which are quite slow to emulate, because they have so many, and such fast, control signals.
Making the central RAM component of the register files dual-port will probably help, but then we may start to fail diagnostic experiments which rely on the specific topology of the busses.
There are also some medium-hard optimizations, for instance building 64 and 72 bit shift registers for MDR, WDR &c, and modifying the permutation tables in the DIPROC to make them more SystemC-busable.
2022-09-08 A really obscure bug
(Emulation still running, 186 hours in real time, 207 seconds emulated time, 5400 disk accesses)
We noticed a file called ENP100.M200 in the DFS filesystem; it contains a low-level debug/exerciser program for the ENP100 network processor, so we launched an IOC-CLI-only emulator and tried to run it, in order to get the IOP memory mapping correct, even though we have no plans to implement the ENP100 in the emulator.
When we tried the DOWNLOAD command, which downloads the ENP100 firmware, the IOC bailed out with a PASCAL error:
ENP100> download
PASCAL error #1 at location 00010808
Last called from location 00029480
Last called from location 00029702
Last called from location 0002F632
Last called from location 0003336C
Last called from location 00033454
Last called from location 000338CC
Abort : Run time error
PASCAL error #1
From ENP100
One part of the download is the IP-number to use, which is read from the DFS file TCP_IP_HOST_ID, which contains 192.5.60.20.
To convert the IP-number to a 32 bit integer, each quarter is converted to binary, and multiplied by a scaling factor, and this is where it goes wrong: 192 * 0x01000000 is 0xc0000000, but the multiplication routine used (at 0x1028c) sees that as a negative result and bails out.
Using a class-A IP-number, one that is less than 128.0.0.0, works fine.
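A minimal Python sketch of the failing conversion (our reconstruction, not the original PASCAL code) shows why a class-A address survives while 192.x.x.x trips the signed-overflow check:

# Reconstruction of the failing conversion for illustration only.
def ip_to_u32(dotted, bits=32):
    ''' Scale each part of a dotted-quad IP-number and sum, treating the
        result as a *signed* 32 bit integer, as the IOP code appears to. '''
    parts = [int(p) for p in dotted.split(".")]
    value = sum(p * f for p, f in zip(parts, (0x01000000, 0x10000, 0x100, 1)))
    if value >= 1 << (bits - 1):      # 192 * 0x01000000 = 0xC0000000 looks negative
        raise OverflowError("multiplication routine sees a negative result")
    return value

print(hex(ip_to_u32("89.5.60.20")))       # class-A address: fine
try:
    print(hex(ip_to_u32("192.5.60.20")))
except OverflowError as err:
    print("192.5.60.20:", err)            # the emulated IOP prints "PASCAL error #1"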
Neither https://www.rfc-editor.org/rfc/rfc1117 nor https://www.rfc-editor.org/rfc/rfc1166 lists any IP numbers (obviously) assigned to Rational, so we suspect they used an unofficial class-A network internally, probably 89/8 like everybody else.
2022-09-04 good and bad R1000 news
The good news is that the emulator still runs and has progressed to:
[…]
Starting R1000 Environment - it now owns this console.
{diagnostic modem: received DISCONNECT event}
====>> Kernel.11.5.8 <<====
Kernel: CHANGE_GHOST_LOGGING
WANT TRACING: FALSE
WANT LOGGING: FALSE
Kernel: START_VIRTUAL_MEMORY
ALLOW PAGE FAULTS: YES
====>> ERROR_LOG <<====
22:32:20 --- TCP_IP_Driver.Worker.Finalized
====>> CONFIGURATOR <<====
starting diagnosis of configuration
starting virtual memory system
And if it keeps running we can start to log in … ehh … sometime early next year.
Time to start optimizations in earnest now.
The bad news is that when we tried to start the real R1000 today, the IOC reported RAM errors.
When we repaired this IOC three years ago we thought the RAM chip in position H40 behaved slightly suspiciously, but we did not replace it. Maybe it has finally failed now ?
2022-08-31 Same place, a week later
We have started a run on the MacBook Pro, with the intent to let it run until something happens.
Something can either be a microcode halt, or boot progressing from:
Starting R1000 Environment - it now owns this console.
To the kernel signing in two or three minutes later:
====>> Kernel.11.5.8 <<====
Kernel: CHANGE_GHOST_LOGGING
[…]
That will take the better part of a week, since we are simulating the HW-identical "main" schematics.
2022-08-24 The debugging never ends
We're getting further and further:
Loading : KAB.11.0.1.MLOAD
Loading : KMI.11.0.0.MLOAD
Loading : KKDIO.11.0.3.MLOAD
Loading : KKD.11.0.0S.MLOAD
Loading : KK.11.5.9K.MLOAD
Loading : EEDB.11.2.0D.MLOAD
Loading : UOSU.11.3.0D.MLOAD
Loading : UED.10.0.0R.MLOAD
Loading : UM.11.1.5D.MLOAD
Loading : UAT.11.2.2D.MLOAD
851/1529 wired/total pages loaded.

The use of this system is subject to the software license terms and
conditions agreed upon between Rational and the Customer.

Copyright 1992 by Rational.

RESTRICTED RIGHTS LEGEND

Use, duplication, or disclosure by the Government is subject to
restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in
Technical Data and Computer Software clause at DoD FAR Supplement
252.227-7013.

Rational
3320 Scott Blvd.
Santa Clara, California 95054-3197

Starting R1000 Environment - it now owns this console.

***************************************
Sequencer has detected a machine check.

************************************************
Booting R1000 IOP after R1000 Halt or Machine Check detected
Boot Reason code = 0C, from PC 0001ADA2

Restarting R1000-400S August 24th, 1921 at 02:25:27
It looks like the µcode stops at location 0x204, which according to page 150 in the Knowledge Transfer Manual is 0204_HAVE_MULTI_BIT_MEMORY_ERROR.
Preliminary debugging indicates that this is indeed happening; next will be to find out why. A test run takes around 70 minutes to fail.
2022-08-23 Waiting for the cows to come home
Things take time, so until we have all the requisite test-runs in, we cannot be certain, but it looks like we pass all the self-tests, after implementing the missing bits of the R1000/IOP memory access in the "megacomp3" branch.
We have also tried to boot the Environment, and it looks like the first and only sector of KAB.11.0.1.MLOAD correctly gets read into IOP RAM at 0x40000; the R1000 is notified through the "response FIFO", DMA's the sector into R1000-land and signals completion through the "request FIFO", which interrupts the IOP, which services the interrupt.
But the interrupt remains raised, so it immediately re-services the interrupt, and after doing that 255 times, it reads a 0xffff value which trips a range-check.
Looking at the code, it is not obvious where the irq_lower() call should go, but it is clearly not there now.
2022-08-21 Kitting up for the next phase
Since we are nowhere near parity speed-wise, and since we get further and further into the selftest, some effort has gone into finding the best platform to run tests on.
Ideally it should have a fast CPU, and given the potential runtimes of months, backup power would be nice, even though Denmark has one of the most stable power-grids.
Having tried running the R1000 emulation on various machines we have access to, it appeared that Apple's M1 beat every other computer, hands down. As in: it runs the emulation almost twice as fast as a new "Lenovo T14s Gen 2" laptop.
We have therefore procured a MacBook Pro, which gives us a fast CPU with built in UPS. It has the newer M2 chip, but we see no statistically significant speed difference from the M1 chip.
Simulating the 24 seconds necessary to run the uDIAG test takes 32000 seconds, almost 9 hours, at an average emulation/hw ratio of 1300.
However this number is slightly misleading, as the microcode loading emulates much faster, yet still takes up ¼ of the emulated time; looking only at the microcode execution time, 16 seconds emulated in 29000 seconds, the performance drops to a ratio of 1800.
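The two ratios follow directly from the numbers above; a quick sanity check in Python:

total_ratio = 32000 / 24          # whole uDIAG run, incl. microcode loading
ucode_ratio = 29000 / 16          # microcode execution only
print(round(total_ratio), round(ucode_ratio))   # -> 1333 1812 (rounded to 1300 and 1800 above)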
Here is a plot as a function of the emulated time:
The first plateau at 1.4% of hardware speed is the microcode loading.
The repetitive square-wave pattern comes from the tests of each of the four cache-sets (A/B times Early/Late), and as can be seen, they take up the majority of the test run.
The FRU test currently fails with:
*** ERROR reported by P3URF:
    The IOC board got a memory parity error while the microcode was
    restoring the register file contents from IOC memory.
    Field replaceable units :
        IOC Board
*** P3UCODE failed ***
That is consistent with current thinking that the R1000 interface to the IOC RAM is our next hurdle.
Currently we use a "megacomponent" which implements all 512Kx26 RAM as a single SystemC class, using the IOP-emulation's RAM as backing store, and since the IOP-emulation does not maintain parity bits, the error makes sense.
We have yet to try to move the IOP RAM entirely into the SystemC space, by having the IOP perform memory cycles through the 68K20 SystemC "shell-component", but that is probably the end result, as that seems the only realistic way to keep the hardware-identical "main" branch of the schematics working - with its 72 chip IOP SRAM bank.
If we are lucky, we can identify limited memory ranges which the R1000 CPU accesses, and "cache" the rest, which likely includes all the actual 68K instructions, outside the SystemC model.
But first, we need to make that P3UCODE subtest fail faster.
2022-08-16 It seems to be working
We have now replicated the successful uDIAG run three times, so it is clearly not a fluke.
The connection from the R1000 CPU to the IOP's 512KByte RAM is not implemented, and we expected uDIAG to test that, but it seems not.
One of the runs was done on an Apple Mac-Mini with the M1 ARM CPU; it ran nearly twice as fast as the T14s laptop.
2022-08-14 Unexpectedly good news
CLI> x run_udiag
preparing to run the Confidence Test (uDIAG)
The long version stress tests the DRAMs but runs 2 minutes longer
Do you want to run the long version [N] ? n
Loading from file DIAG.M200_UCODE bound on November 15, 1989 13:02:00
Loading Register Files and Dispatch Rams ....  [OK]
Loading Control Store ............  [OK]
the Confidence test (uDIAG) passed
CLI>
Begin statistics
 61559.274716 s        Wall Clock Time
 23.083458220 s        SystemC simulation
 1/2666.8              SystemC Simulation ratio
 51981.890124900 s     IOC simulation
 51949.796301 s        IOC stopped
 40179247              IOC instructions
 63148.231 s           User time
 1631.137 s            System time
 109780                Max RSS
End statistics
This was the megacomp3 branch of the schematics, git rev 7d92f98c7415251d59fe.
2022-08-12 Indeed exciting news
[…]
CLEAR_DRIVE_HIT.M32
RESET.M32
LOAD_CONFIG.M32
Phase 3 passed
Diagnostic execution menu
   1 => Test the foreplane
   2 => Run all tests
   3 => Run all tests applicable to a given FRU (board)
   4 => Run a specific test
   0 => Return to main menu
Please enter option :
Using the un-optimized (ie: identical to HW) schematics, this took a bit over five days, 4187 times slower than the real machine.
2022-08-05 Maybe exciting news
We found another 8/0 misreading, and now the tests just keep running.
Until they stop, one way or another, we will not know what the status is, but it looks good-ish.
Update:
run_udiag has run for 28½ hours now, currently toodling around at microaddress 0x26a5…7, which according to the information in the Knowledge Transfer Manual (pdf pg 136) is in the MEM_TEST area.
At the same time, a "FRU" test has been running 37⅓ hours, and has gotten past P2ABUS and is currently chugging through P2MM.
It is incredibly boring and incredibly exciting at the same time :-)
Both these machines use the 'main' branch of the schematics, identical to the schematics in the binder we got from Terma.
We expect run_udiag to fail when it gets to SYS_IOC_TEST, because the M68K20's RAM is not the same as the SystemC model's RAM.
2022-08-02 It's 30% faster to ask a friend…
The video of Michael Druke's presentation from his july 5th visit is now up on YouTube:
But there were of course more questions to ask than we could get through in a single day.
There is a signal called BUZZ_OFF~ which pops up all over the place like this:
The net effect of this signal is to turn a lot of tri-state bus-drivers off during the first quarter ("Q1") of the machine-cycle, but not because another driver needs the bus, since they are also gated by the BUZZ_OFF~ signal.
So why then ?
As Mike explains in the presentation, there are no truly digital signals, they are all analog when you put the scope on them, and he explained in a later email that »The reason for (BUZZ_OFF) is suppressing noise glitches when the busses switch at the beginning of the cycle.«
That makes a lot of sense, once you think about it.
By always putting a bus into "Hi-Z" state between driven periods, the inputs will drain some of the charge away, and the voltage will drift towards the middle from whatever side the bus was driven.
Next time the bus is driven, the driver chips will have to do less work, and it totally eliminates any risk of "shoot-through" if one driver is slow to release while another is fast to drive.
(Are there books with this kind of big-computer HW-design wisdom ?)
Our emulation does use truly digital signals; it is not subject to ground-bounce, reflections, leakage, capacitance and all those pesky physical phenomena, so BUZZ_OFF~ is needlessly triggering a lot of components, twice every single machine-cycle - ten million times per simulated second.
Preliminary experiments indicate a 30% speedup without the BUZZ_OFF~ signal, but we need to run the full gamut of tests before we can be sure it is OK.
2022-07-31 P2FP and P2EVNT passes
In both cases trivial typos and misunderstandings.
Next up is P2ABUS which tests the address bus, which takes us into semi-murky territory, including the fidelity of our PAL-to-SystemC conversion.
On a recent tour of the museum, a guest asked why we use the "simulation / real" ratio as our performance metric, and the answer is that when the performance gap is on the order of thousands, percentages are not very communicative:
Machine | Branch | Ratio | Percentage | Performance |
---|---|---|---|---|
CI-server | main | 4173 | 0.024 | -99.976 % |
CI-server | megacomp2 | 2380 | 0.042 | -99.958 % |
T14s laptop | megacomp2 | 1142 | 0.088 | -99.912 % |
But we are getting closer to the magic threshold of "kHz instead of MHz".
2022-07-24 VAL: valeō
[…]
TEST_Z_CNTR_WALKING.VAL
Loading from file PHASE2_MULT_TEST.M200_UCODE bound on July 16, 1986 14:31:44
Loading Register Files and Dispatch Rams ....  [OK]
Loading Control Store   [OK]
TEST_MULTIPLIER.VAL
CLEAR_PARITY.VAL
LOAD_WCS_UIR.VAL
RESET.VAL
P2VAL passed
This also means that we are, to some limited extent, able to execute microcode.
2022-07-23 Et tu TYP?
With a similar workaround, the P2TYP test completes.
2022-07-22 Moving along
After some work on the disassembly of the .M200 IOP programs, specifically the P2VAL.M200 program, it transpired that the reason the "COUNTER OVERFLOW" test failed is that the P2VAL program busy-waits for the experiment to complete, and the simulated IOP runs too fast:
0002077c  PUSHTXT "TEST_LOOP_CNTR_OVERFLOW.VAL"
          [push other arguments for ExpLoad]
000207a0  JSR ExpLoad(PTR.L, PTR.L)
          [push other arguments for ExpXmit]
000207ae  JSR ExpXmit(EXP.L, NODE.B)
000207b6  MOVE.L #-5000,D7
          [push arguments for DiProcPing]
000207ca  JSR DiProcPing(adr.B, &status.B, &b80,B, &b40.B)
000207d2  ADDQ.L #0x1,D7
000207d4  BEQ 0x207de
          [check proper status]
000207dc  BNE 0x207bc
000207de  [...]
We need a proper fix for this, preferably something which does not involve slowing the DIAGBUS down all the time.
In the meantime, we can work around the problem by patching the constant -5000 from the CLI:
dfs patch P2VAL.M200 0x7b8 0xff 0xfe 0xec 0x78
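The patched bytes are simply a larger negative loop count for the MOVE.L immediate shown in the disassembly above; a quick Python check of the two's-complement values (our arithmetic, for illustration):

def s32(word):
    ''' Interpret a 32 bit word as a signed two's-complement integer. '''
    return word - (1 << 32) if word & (1 << 31) else word

print(s32(0xFFFFEC78))   # -5000, the original immediate
print(s32(0xFFFEEC78))   # -70536, the patched value: roughly 14x more busy-wait iterations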
That gets us to:
[…]
TEST_Z_CNTR_WALKING.VAL
Loading from file PHASE2_MULT_TEST.M200_UCODE bound on July 16, 1986 14:31:44
Loading Register Files and Dispatch Rams ....  [OK]
Loading Control Store   [OK]
TEST_MULTIPLIER.VAL
*** ERROR reported by P2VAL:
    An error in the multiplier logic was detected (P2VAL).
    Field replaceable units :
        VALUE Board
*** P2VAL failed ***
That can either be a problem with the multiplier circuit, which we have not seen activated until now, or failing microcode execution, which we have also not seen much of yet.
The multiplication circuit on the VAL board is quite complex; it takes up 7 full pages, because the 16x16=>32 multiplier had to be built out of four 8x8=>16 multiplier chips and 4-bit adders to combine their outputs.
2022-07-17 Lots of cleanup
With all boards passing unit-tests, the next step is to start to execute microcode, first diagnostic microcode and, when that works, the real thing.
Such a juncture is a good opportunity for a bit of cleanup, and this is currently ongoing.
Right now the FRU program errors out with:
Running FRU P2VAL
TEST_LOOP_CNTR_OVERFLOW.VAL*** ERROR reported by P2VAL:
    VAL LOOP COUNTER overflow does not work correctly (P2VAL).
Getting to the point of failure takes 5 hours on our fastest machine (at a 1/1300 speed ratio with all boards), but if we tell FRU to run P2VAL directly, it instead launches P2FP, which, after some unknown micro-instructions have executed, fails with a generic error message (see previous entry.)
2022-07-05 Mike Druke visits
Today Mike Druke and his wife finally came to visit us; this was yet another much anticipated event rudely postponed by Covid-19.
We showed Mike a running R1000 machine, in this case PAM's machine, but using the IOC board from the Terma machine. We also toured our little exhibition, for the occasion augmented with a Nova2 computer from the magazines, and on the way to lunch we stopped to demo our 50+ year old GIER computer.
In the afternoon Mike gave a wonderful talk about Rational, the people, the company, the ideas and the computers.
The video recording from Mike's talk will go into our bit-archive and be posted online, when the post-processing is finished.
Work on the emulator continues and has reached the major milestone where microcode is being executed:
Running FRU P2FP
Loading from file FPTEST.M200_UCODE bound on January 29, 1990 17:26:52
Loading Register Files and Dispatch Rams ....  [OK]
Loading Control Store   [OK]
*** ERROR reported by P2FP:
    ABORT -> uCODE DID NOT HALT AT CORRECT ADDRESS
Now we need to figure out what the diagnostic microcode was supposed to do and once we understand that, figure out why it did not.
2022-07-02 FRU and DFS hacking
Going forward, the FRU program is going to be our primary test-driver, and the emulation already passes phase-1, which seems to more or less consist of the same experiments as the TEST_$BOARD.EM scripts.
The first test which fails in phase-2 is the attempt to make the request-FIFO on the IOC generate an interrupt, and that is understandable, because that part of the SystemC code is not hooked up to the MC68K20 emulation.
But in order to get to that point the P2IOC test spends some hours on other experiments, because FRU expects all boards to be "plugged in", and that is still pretty slow.
That catapulted an old entry from the TODO list to the top, so now the emulation has a "dfs" cli command, which allows reading, writing, editing (with sed(1)) of files in the DFS filesystem, and a special subcommand "dfs neuter" to turn an experiment into a successful no-op.
With that in place, and with eight experiments neutered, it only takes a couple of minutes to get to the WRITE_REQUEST_QUEUE_FIFO.IOC experiment.
When run individually the P2FIU, P2SEQ, P2MEM, P2STOP, P2COND and P2CSA tests all seem to pass.
The P2TYP and P2VAL tests both fail on "LOOP COUNTER overflow does not work correctly", which sounds simple, and P2EVNT fails with "The signal for the GP_TIMER event is stuck low on the backplane" which may simply be because the IOP cannot read the STATUS register yet.
So all in all, no unexpected or unexpectedly bad news from FRU … yet.
2022-06-28 +83% more running R1000 computers
Today we transplanted the IOC and PSU from Terma's R1000 to PAM's R1000, slotted in a SCSI2SD and powered it up.
There were a fair number of intermediate steps, transport, adapting power-cables, swapping two PAL-chips that had gotten swapped after the readout etc. etc.
But the important thing is that it came up.
That means that we "just" need to get RAM working on one of the two spare IOCs we have, and one way or another, get a power-supply going, and then the world will have two running R1000 computers instead of just one.
2022-06-20 IOP fined for speeding
The error from the SEQ board transpired to be the IOP downloading data faster than the DIPROC could get them stuffed into the SystemC model.
Unlike when normal experiments are run, when downloading the IOP just blasts bytes down the DIAGBUS as fast as it can, and by interleaving downloads to multiple boards, for instance {SEQ, TYP, SEQ, VAL}…, the DIPROCs get enough time to do their thing.
If we had tied the 68K20 emulation, the DIAGBUS and the DIPROCs to the SystemC clock at all times, that would just work, but it would also be a lot slower.
So we cheat: The 68K20 emulation and the i8052 emulation of the DIPROCs run asynchronously to the SystemC model, only synchronizing when needed to perform a bus-transaction, and the DIAGBUS has infinite baud-rate.
Therefore we have added a [small hack] to delay DOWNLOAD commands from the IOP if the targeted DIPROC is still in RUNNING state.
Now the FPTEST starts running, and comes back with:
CLI> fptest
Loading from file FPTEST.M200_UCODE bound on January 29, 1990 17:26:52
Loading Register Files and Dispatch Rams ....  [OK]
Loading Control Store   [OK]
VAL bad FIU bits = FFFF_FFFF_FFFF_FFFF
TYP bad TYP bits = FFFF_FFFF_FFFF_FFFF
VAL bad VAL bits = FFFF_FFFF_FFFF_FFFF
TEST AGAIN [Y] ?
Which is an improvement.
However, it is not obvious to us that FPTEST is what we should be attempting now.
The FPTEST.CLI script contains:
x rdiag fptest;
That makes RDIAG.M200 interpret FPTEST.DIAG, which contains:
init_state;
push p2fp interactive;
And to us "p2" sounds a lot like "phase two".
There is another script to RDIAG called GENERIC.DIAG which looks like a comprehensive test:
init_state;
run all p1dcomm;
[#eq,[model],100] run p1sys; [end]
[#eq,[model],200] run p1ioc; [end]
run p1val;
run p1typ;
run p1seq;
run p1fiu;
run allmem p1mem;
run all p1sf;
init_state;
[#eq,[model],100] run p2ioa; [end]
[#eq,[model],200] run p2ioc; [end]
[#eq,[model],100] run p2sys; [end]
run p2val;
run p2typ;
run p2seq;
run p2fiu;
run allmem p2mem;
init_state;
run p2uadr;
run p2fp;
run p2evnt;
run p2stop;
run p2abus;
run p2csa;
run p2mm;
[#eq,[model],100] run p2sbus; [end]
run p2cond;
run all p2ucode;
run all p3rams;
run all p3ucode
Running that instead we get:
CLI> x rdiag generic
Running FRU P1DCOMM
Running FRU P1DCOMM
P1DCOMM Failed
The test that found the failure was P1DCOMM
ONE_BOARD_FAILED_HARD_RESET
Field replaceable units :
    Backplane / Backplane Connector
    All Memory Boards
Diagnostic Failed
That looks actionable...
2022-06-19 First attempt at FPTEST
With all boards passing their unit-tests, the next step is the FPTEST.
Until now the 68K20 emulator's only contact with the SystemC code has been through the asynchronous DIAGBUS, but one of the first things FPTEST does is to reset the R1000 processor, and therefore we had to implement the SystemC model of the 68K20, so it can initiate write cycles to DREG4 at address 0xfffffe00.
That got us this far:
CLI> fptest
Loading from file FPTEST.M200_UCODE bound on January 29, 1990 17:26:52
Loading Register Files and Dispatch Rams ....
Experiment error :
    Board      : Sequencer
    Experiment : LOAD_DISPATCH_RAMS_200.SEQ
    Status     : Error
Abort : Experiment error
Fatal experiment error.
From DBUSULOAD
Abort : Experiment error
Fatal experiment error.
From P2FP
CLI>
2022-06-04 All boards pass unit test
Fixing two timing problems in the simulation made the TEST_MEM32.EM test pass, and with that we have zeros in the entire right hand column in the table above.
2022-05-29 SEQ passes unit test
Have we mentioned zero vs eight confusion in the schematics yet ?
And with that, the emulated SEQ passes TEST_SEQ.EM.
Now we just need to track down the final problems with MEM32.
2022-05-21 Watching the grass grow
Spring has slowed down work on the R1000 Emulator, but some progress is being made.
The SEQ board is now down to only two failing subtests:
RESOLVE_RAM_(OFFSET_PART)_TEST FAILED
TOS_REGISTER_TEST_4 FAILED
or rather, all the other errors were phantom failures due to two colliding optimizations, one by Rational's engineers and one by us:
125c 93          |    |		MOVC	A,@A+DPTR
125d b4 ff f1    |    |		CJNE	A,#0xff,0x1251
1260 74 02       |t   |		MOV	A,#0x02
1262 f2          |    |		MOVX	@R0,A
1263 08          |    |		INC	R0
1264 02 05 1c    |    |		LJMP	EXECUTE
The above is a snippet of the DIPROC(1) code, the end of a loop used extensively on the SEQ board.
The Rational optimization is the instruction at 0x1262, which we think initiates a reset of the Diagnostic FSM.
Normally, the INC, LJMP and the instructions which pick up and decode the next bytecode-instruction would leave the FSM plenty of time to get things done, but since our emulated DIPROC executes all non-I/O instructions instantly (See: [[1]]), some of the SEQ testcases, notably LATCHED_STACK_BIT_1_FRU.SEQ, would fail.
The failure mode was that the bytecode expected to read a pattern like "ABAABB" from the hardware, but would get "CABAAB", which sent us on a wild goose-chase for non-existent clock-skew problems.
Have we mentioned before that one should never optimize until things actually work ?
2022-05-08 Slowly making way
As can be seen in the table above, the main DRAM array now works on the emulated MEM32 board.
It takes 48 hours to run that test, because the entire DRAM array is tested 16 times, very comprehensively:
TESTING TILE 4 - TILE_MEM32_DATA_STORE
DYNAMIC RAM DATA PATH TEST              PASSED
DYNAMIC RAMS ADDRESS TEST               PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 0     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 1     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 2     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 3     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 4     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 5     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 6     PASSED
DYNAMIC RAM ZERO TEST - LOCAL SET 7     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 0     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 1     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 2     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 3     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 4     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 5     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 6     PASSED
DYNAMIC RAM ONES TEST - LOCAL SET 7     PASSED
TILE 4 - TILE_MEM32_DATA_STORE PASSED
While "FAILURE" is printed five times on the console, there is actually only two failing experiments:
TESTING TILE 3 - TILE_MEM32_TAGSTORE
TAGSTORE SHORTS/STUCK-ATS TEST          PASSED
TAGSTORE ADDRESS PATTERN TEST           PASSED
TAGSTORE PARITY TEST1                   PASSED
TAGSTORE PARITY TEST2                   FAILED
FAILING EXPERIMENT IS : TEST_TAGSTORE_PARITY_2
TAGSTORE RAMS ZERO TEST                 PASSED
TAGSTORE RAMS ONES TEST                 PASSED
LRU UPDATE TEST                         FAILED
FAILING EXPERIMENT IS : TEST_LRU_UPDATE
TILE 3 - TILE_MEM32_TAGSTORE FAILED
Despite some effort, we have still not figured out what the problem is. We suspect a timing issue near or with the tag-RAM.
2022-04-16 A long overdue update
As can be seen in the table above, the simulated SEQ board is down to 12 FAILURE messages, and what the table does not show is that the MEM32 board simulation completes now, but takes more than 24 hours to do so, which makes the daily CI cron(8) job fail catastrophically.
The bug which has taken us almost a month to fix turned out to be the i8052 emulator's CPL C (Complement Carry) instruction not complementing, in a DIPROC bytecode-instruction we had not previously encountered: Calculate Even/Odd parity for a multi-byte word.
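To illustrate why a non-complementing carry flag kills exactly that bytecode-instruction, here is a toy Python model of a carry-flag-based parity loop; this is an illustration of the failure mode, not the actual DIPROC bytecode:

def word_parity(data, cpl_c_works=True):
    ''' Even/odd parity over a multi-byte word, accumulated in a carry-like flag. '''
    carry = 0
    for byte in data:
        for bit in range(8):
            if (byte >> bit) & 1:
                if cpl_c_works:
                    carry ^= 1      # the 'CPL C' step: complement the carry flag
                # with the emulator bug, the flag was left unchanged
    return carry

print(word_parity(b"\x12\x34\x56\x78"))         # correct parity (1: 13 set bits)
print(word_parity(b"\x12\x34\x56\x78", False))  # always 0 with the bug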
Along the way we have attended to much other stuff: tracing, python code for decoding scan-chains, "mega components" etc., and, notably, python-generated SystemC component models.
Initially all 12 thousand electrical networks in the simulated part of the system were sc_signal_resolved instances.
Sc_signal_resolved is the most general signal type in SystemC, having four possible levels, '0', '1', 'Z' and 'X' and allowing multiple 'writers', but it is therefore also the slowest.
Migrating to faster types, bool for single wire binary networks and uint%d_t for single-driver binary busses, requires component models for all the combinations we may encounter, and writing those by hand got old really fast.
For true Tri-state signals we will still need to use the sc_signal_resolved type, but a lot of Tri-state output chips are used as binary drivers, by tying their OE pin to ground, so relying on the type of a component to tell us what type its output has misses a lot of optimization opportunities.
And thus we now have Python "models" of components, which automatically produce adapted SystemC component models.
Here is an example of the 2149 SRAM model:
class SRAM2149(PartFactory):

    ''' 2149 CMOS Static RAM 1024 x 4 bit '''

    def state(self, file):
        file.fmt('''
		|	uint8_t ram[1024];
		|	bool writing;
		|''')

    def sensitive(self):
        for node in self.comp:
            if node.pin.name[0] != 'D' and not node.net.is_const():
                yield "PIN_" + node.pin.name

    def doit(self, file):
        ''' The meat of the doit() function '''

        super().doit(file)

        file.fmt('''
		|	unsigned adr = 0;
		|
		|	BUS_A_READ(adr);
		|	if (state->writing)
		|		BUS_DQ_READ(state->ram[adr]);
		|
		|''')

        if not self.comp.nodes["CS"].net.is_pd():
            file.fmt('''
		|	if (PIN_CS=>) {
		|		TRACE(<< "z");
		|		BUS_DQ_Z();
		|		next_trigger(PIN_CS.negedge_event());
		|		state->writing = false;
		|		return;
		|	}
		|''')

        file.fmt('''
		|
		|
		|	if (!PIN_WE=>) {
		|		BUS_DQ_Z();
		|		state->writing = true;
		|	} else {
		|		state->writing = false;
		|		BUS_DQ_WRITE(state->ram[adr]);
		|	}
		|	TRACE(
		|	    << " cs " << PIN_CS?
		|	    << " we " << PIN_WE?
		|	    << " a " << BUS_A_TRACE()
		|	    << " dq " << BUS_DQ_TRACE()
		|	    << " | "
		|	    << std::hex << adr
		|	    << " "
		|	    << std::hex << (unsigned)(state->ram[adr])
		|	);
		|''')
Notice how the code to put the output in high-impedance "3-state" mode is only produced if the chip's CS pin is not pulled down.
Note also that the code handles the address bus and data bus as a unit, by calling C++ macros generated by common python code. This allows the same component model to be used for wider "megacomp" variants of the components.
This is particularly important for the MEM32 board, which has 64(Type)+64(Value)+9(Ecc) DRAM chips in each of the two memory banks. The simulation runs much faster with just two "1MX64" and one "1MX9" components, than it does with 137 "1MX1" components in each bank.
This optimization is what disabused us of the notion that the CHECK_MEMORY_ONES.M32 experiment hung; it did not, it just took several hours to run - and it is run once for each of the eight "sets" of memory.
With the current 11 failures, the entire MEM32 test takes 140 seconds of simulated time, 7½ hours in our fastest "megacomp2" version of the schematics on our fastest machine.
However our "CI" machine is somewhat slower, and runs the un-optimized "main" version of the schematics, which means the next daily "CI" run is started before the previous one completed, and with them using the same filenames, they both crash.
So despite the world distracting us with actual work, travel, talks, social events, and notably the first ever opening of Datamuseum.dk for the public, we are still making good progress.
2022-03-06 Do not optimize until it works, unless …
It is very old wisdom in computing that it does not matter how fast you can make a program which does not work, and usually we stick firmly to that wisdom.
However, there are exceptions, and the R1000-emulator is one of them.
When the computer was designed, the abstract architecture had to be implemented with the available chips in the 74Sxx and later 74Fxx families of TTL chips, and there being no 64 bit buffers in those families, a buffer for one of the busses was decomposed into 8 parallel 8 bit busses, each running through a 74F245 chip, etc.
In hardware the 8 chips operate in parallel, in software, at least with SystemC, they are sequential, so there is a performance impact.
What is worse, there is a debugging impact as well, because instead of the trace-file telling us what the state of the 64 bits is, in a single line, it contains eight lines of 8 bits, in random order.
Therefore we operate with three branches in the R1000.HwDoc github repository: "main", "optimized" and "megacomp".
"Main" is the schematics as they are on paper. That is the branch reported in the table above.
"Optimized" is primarily deduplication of multi-buffered signals, that is signals where multiple outputs in parallel are required to drive all the inputs of that signal, a canonical example being the address lines of the DRAM chips on the MEM32 board.
Finally in "megacomp" we invent our own chips, like a 64 bit version of the 74F245, whereby we both improve the clarity of the schematics, and make the simulation run faster, almost twice as fast as "main" at this point.
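As an illustration of the idea, here is a minimal PartFactory-style sketch of such an invented chip, modelled on the SRAM2149 example in the 2022-04-16 entry above; the class name, pin names and bus macros are hypothetical, not the actual emulator code:

class XBUF64(PartFactory):

    ''' Hypothetical 64 bit wide 74F245-style buffer (sketch only) '''

    def sensitive(self):
        # re-evaluate when the output-enable pin or any input pin changes
        for node in self.comp:
            if node.pin.name[0] != 'Q' and not node.net.is_const():
                yield "PIN_" + node.pin.name

    def doit(self, file):
        ''' The meat of the doit() function '''

        super().doit(file)

        file.fmt('''
		|	uint64_t tmp;
		|
		|	if (PIN_OE=>) {
		|		BUS_Q_Z();
		|		next_trigger(PIN_OE.negedge_event());
		|		return;
		|	}
		|	BUS_I_READ(tmp);
		|	BUS_Q_WRITE(tmp);
		|''')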
Here is the same table as above, for the "megacomp" branch, and run on the fastest CPU currently available to this project:
Test | Wall Clock (s) | SystemC (s) | Ratio | Exp run | Exp fail |
---|---|---|---|---|---|
expmon_reset_all | 51.787 | 0.026151 | 1/1980.3 | 0 | 0 |
expmon_test_fiu | 1275.507 | 17.799928 | 1/71.7 | 95 | 0 |
expmon_test_ioc | 1018.086 | 11.231571 | 1/90.6 | 29 | 0 |
expmon_test_mem32 | 5331.993 | 30.000000 | 1/177.7 | 28 | 9 |
expmon_test_seq | 1183.407 | 13.081077 | 1/90.5 | 108 | 32 |
expmon_test_typ | 3629.642 | 7.468383 | 1/486.0 | 73 | 2 |
expmon_test_val | 3625.022 | 7.434761 | 1/487.6 | 66 | 0 |
novram | 69.302 | 0.035584 | 1/1947.6 | 0 | 0 |
Note that the megacomponents have caused one of the TYP tests to fail, so the old wisdom does apply after all. (The table shows two failures because both the individual test and the entire test-script report "FAILED" on the console.)
2022-03-05 We will not need to emulate the ENP-100
The R1000/s400 has two ethernet interfaces, one is on the internal IO/bus and can be directly accessed by the IOC and, presumably, the R1000 CPU, the other is on a VME daughter-board, mounted on the RESHA board.
Strangely enough, the TCP/IP protocol only seems to be supported on the latter, whereas the "direct" ethernet port is for use only in cluster configurations.
The VME board is an "ENP-100" from Communication Machinery Corp. of 125 Cremona Drive, Santa Barbara, CA 93117.
The board contains a full 12.5 MHz 68k20 computer, including boot-code EPROMs, 512K RAM, two serial ports and an Ethernet interface.
The firmware for this board is downloaded from the R1000 CPU, and implements a TCP/IP stack, including TELNET and FTP services.
Interestingly, the TCP/IP implementation ignores all packets with IP or TCP options, so no contemporary computers can talk with it until "modern" options are disabled.
We have no hardware documentation for the ENP-100 board, but we expect emulation is feasible, given enough time and effort.
Fortunately it seems the R1000 can boot without the ENP-100 board, it complains a bit, but it boots.
That takes emulation of the ENP-100 out of the critical path, and makes it optional to even do it.
2022-03-01 The fish will appreciate this
Below the R1000 two genuine and quite beefy Papst fans blow cooling air up through the card-cage.
For a machine which most likely will end up in a raised-floor computing room, it can be done no other way.
However, if the machine is housed anywhere else, an air-filter is required to not suck all sorts of crap into the electronics.
And of course, air-filters should be maintained, so we pulled out the fan-tray and found that the filter mat was rapidly deteriorating, literally falling apart.
Not being air-cooling specialists, we initially ordered normal fan-filters, the kind that looks like loose felt made of plastic, but the exhaust temperature on the top of the machine climbed to over 54°C.
So what was the original filter material, and where could we buy it?
It looks a lot like the material used on the front of the obscure but deservedly renowned concrete Rauna Njord speakers, designed by Bo Hansson in the early 1980s, and that material also fell to pieces after about a decade.
Surfing fora where vintage hifi-nerds have restored Rauna Njord speakers, we found "PPI 10 polyurethane foam" mentioned, and that transpires to be what water-filters for aquariums are made from.
A trip to the local aquarium shop got us a 50x50x3cm sheet of filter material, and the promise that the fish will really appreciate us buying it.
We cut a 12.5 cm wide strip and parted it lengthwise into two slices of roughly equal thickness, using two wooden strips as guides and a sharp bread-knife.
It is almost too bad that one cannot see the sporty blue color when it is mounted below the R1000:
The two small pieces in the middle are the largest fragment of the old air filter and an off-cut from the 17mm thick slice:
In the spirit of scientific inquiry, we measured the temperature with both thicknesses.
With the 17mm thick filter, the exhaust rose to above 52°C.
With the 13mm thick filter, it stabilized around 41°C.
That is a pretty convincing demonstration of the conventional wisdom, that axial fans should push air, not pull it.
So why are the filters on the "pull" side of the fans in the R1000, when the fan-tray is plenty deep for filters to be mounted on the "push" side?
Maybe this is an after-market modification, trying to convert an unfiltered "data-center fan-tray" into a filtered "office fan-tray" ?
2022-02-20 TEST_VAL.EM passes
We're making progress.
Now we are going to focus on TYP, where most of the failing tests have something to do with parity checking.
2022-02-12 TEST_FIU.EM passes
As can be seen on the status above, there are no longer any failures on the FIU board when running TEST_FIU.EM.
The speed in the table above is when simulating the unadulterated schematics.
Concurrent with fixing bugs we are working on two levels of optimized schematics, one where buffered signals are deduplicated, and one where we use "megacomponents", for instance 64 bit wide versions of 74F240 etc.
The "megacomp" version of FIU runs twice as fast, 1.4% of hardware speed.
2022-01-10 SystemC performance is weird
As often alluded to, the performance of a SystemC simulation is … not ideal … from a usability point of view, so we spend a lot of time thinking about it and measuring it, and it is not at all intuitive for software people like us.
Take the following (redrawn) sheet from the IOC schematic (click for full size)
This is the 2048x16 bit FIFO buffer through which the IOC sends replies to the R1000 CPU.
None of the tests in the "TEST_IOC.EM" file gets anywhere near this FIFO, yet the simulation runs 15% faster if this sheet is commented out, because this sheet uses a lot of free-running clocks:
1 * 2X~    @ 20 MHz    20 MHz
2 * H2.PHD @ 10 MHz    20 MHz
2 * H1E    @ 10 MHz    20 MHz
1 * H2E    @ 10 MHz    10 MHz
1 * Q1~    @  5 MHz     5 MHz
2 * Q2~    @  5 MHz    10 MHz
1 * Q3~    @  5 MHz     5 MHz
----------------------------------
Simulation load        90 MHz
Where the clocks feed into edge sensitive chips, for instance "Q1~" to "FOREG0" (left center), only one of the flanks needs to be simulated, but where they feed state sensitive gates, like "2X~" into "FFNAN0A" (near the top), the "FOO" class instance is called for both flanks, effectively doubling the frequency of the 10MHz clock signal.
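The 90 MHz figure is simply the sum of the per-clock event rates in the listing above, with state sensitive consumers counting both flanks; a quick check:

# (flanks simulated, clock frequency in MHz) for each free-running clock above
clocks = [(1, 20), (2, 10), (2, 10), (1, 10), (1, 5), (2, 5), (1, 5)]
print(sum(flanks * mhz for flanks, mhz in clocks), "MHz simulation load")  # 90 MHz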
To make matters even worse, there is an identical FIFO feeding requests the opposite way, from R1000 to IOC, on the next sheet.
And to really drive the point home, all the simulation runs will have to include the IOC board.
In SystemC a FIFO is one of the primitive objects, which could simulate these two pages much faster than this, but to do that we need enough of the machine simulated well enough to run the experiments which test the FIFOs.
Until then, we can save oceans of time by simply commenting these two FIFOs out.
2022-01-08 Making headway with FIU
We are making headway with the simulated FIU board; currently 19 tests fail: the 16 "Execute from WCS" tests and three parity-related tests. We hope the 16 WCS tests have a common cause.
On the FIU we have found the first test-case which depends on undocumented behaviour: TEST_ABUS_PARITY.FIU fails if OFFREG does not have even parity when the test starts.
Simulating the IOC and FIU boards, the simulation currently clocks in around 1/380 of hardware speed; if the TYP, VAL and SEQ boards are also simulated, speed drops to 1/3000 of hardware speed. Not bad, but certainly not good.
We have started playing with "mega-symbols", for instance 64 bit versions of the 74F240, 74F244 and 74F374. There is a speed advantage, but the major advantage right now is that debugging operates on the entire bus-width at the same time.