Rational/R1000s400/Logbook/2023
2023-11-23 IOC2 still being obstinate
Today we hooked the Logic Analyzer up to the UniBus on the backplane, trying to figure out why the second IOC board doesn't want to play.
We managed to pull out a trace that compares perfectly with the R1000 simulator up to the point where the real machine just freezes. After successfully reading and writing both single and multiple sectors, it freezes during a multi-sector write: the IOC board seems to wait for a disk-transfer interrupt that never comes.
We can pretty much rule out broken connections since full reads and writes do succeed.
Right now our suspicions are focused on some IOC UniBus transceivers that might have grown tired.
2023-11-17 CMC ENP source code on bitsavers ?
It is not the ENP-100 which the R1000 uses, but still interesting:
http://bitsavers.org/bits/CMC/ENP-10/
2023-11-16 Still not two running R1000's
Apologies for not updating the log for a long time, life keeps happening to us.
We are working on getting PAM's R1000 running again and status currently is:
Using the TERMA IOC, now with a couple of bad SRAM chips changed, the machine works.
Using the other IOC, with a LOT of SRAM chips changed, the disk SCSI bus hangs very early in the IOP boot:
Kernel program (0,1,2) [0] :
File system (0,1,2) [0] :
User program (0,1,2) [0] :

Initializing M400S I/O Processor Kernel 4_2_18

Disk 0 is ONLINE and WRITE ENABLED
Disk 1 is ONLINE and WRITE ENABLED
IOP Kernel is initialized

Initializing diagnostic file system ... [OK]

====================================================
Restarting system after White Button Reset or DC power fault
CrashSave has created tombstone file R1000_DUMP1.
[halt sent]

BREAK KEY MENU - options are:
    0 => RESTART system
    1 => IGNORE break key
    2 => REDISPLAY recent console output
    3 => Enter DEBUGGER
    4 => 8mm TAPE reset
Enter option : 3

Trapped into debugger.
RD0 000000FF  RD1 00002CA6  RD2 D2D2D2D2  RD3 D3D3D3D3
RD4 D4D4D4D4  RD5 D5D5D5D5  RD6 D6D6D6D6  RD7 D7D7D7D7
RA0 0000FC40  RA1 A1A1A1A1  RA2 A2A2A2A2  RA3 A3A3A3A3
RA4 A4A4A4A4  RA5 A5A5A5A5  RA6 A6A6A6A6  ISP 0000FA8C
PC  00002D0A  USP 0003FE46  ISP 0000FA8C  MSP FDFF742D
SR  2119      VBR 00000000  ICCR 00000001  ICAR DB359DB2
XSFC 1  XDFC 5
Obviously some SCSI transfers have happened by this point: KERNEL.0, FS.0 and PROGRAM.0 got loaded, but then something related to SCSI stops working, clearly signalled by the SCSI2SD's yellow activity light being stuck on.
The early disk accesses do not use interrupts to signal completion; instead the M68K spins on the status bit in the SCSI controller registers, so our current working hypothesis is that interrupt delivery from the RESHA via the UniBus to the IOC is failing.
We are not quite fluent in the IOC debugger yet, but the "D2D2D2D2", "D3D3D3D3" etc. register values strongly indicate that the CPU has not come far after waiting in the STOP'ed state for an interrupt, which would support that hypothesis.
All EEPROM based test cases work, including the SCSI-loopback test, which probably also does not use interrupts.
We are preparing two different approaches now:
A) Hacking KERNEL.0 on the disk-image to try to find out precisely where the 68020 is and what it expects to happen. Thanks to the SCSI2SD's "backdoor" USB connection, it will be quite easy for us to modify files in the DFS filesystem and run experiments that way.
We need to find a place in the kernel to stash our hacks, but fortunately that is something we can work out in the simulator, and with a bit of luck we can use the second serial port as debug-output channel.
B) Hooking up our HP16700 Logic Analyzer to the front- and back-planes to monitor the M68K signals and the IOC/RESHA UniBus interface, through which the SCSI traffic happens.
We also still have the third IOC board, but it will be some time before we get it back from hardware surgery.
Using "real" SCSI disks instead of the SCSI2SD makes no difference, except the ±12V supply we use is barely strong enough to get them started.
Speaking of PSU's, we recently found another COSEL 5V/200A PSU on eBay, so we will soon also have two "new" power supply modules for the two machines.
2023-11-13 Handwritten R1000 Assembler after all
It has been claimed that no R1000 Assembler has ever been handwritten, but we now have evidence to the contrary:
with System;
procedure Update_Os2000_Crc (Data : in out System.Byte_String);

-- This subprogram is used to generate the correct CRC needed within
-- an OS2000 execuatable module.  The body of this subprogram uses
-- R1000 Machine_Code for execution efficiency.  Because R1000 Machine_Code
-- is not available on customer machines this subprogram has been Comp.Loaded
-- into a different subprogram with the name Update_Crc which is actually
-- used during the conversion process.  The subprogram spec and its body
-- are provided for documentation purposes only.

with Bit_Operations;
with Machine_Code;

procedure Update_Os2000_Crc (Data : in out System.Byte_String) is

   subtype Accumulator is Long_Integer range 0 .. 2 ** 24 - 1;

   Seed         : constant Accumulator := 16#00FF_FFFF#;

   Lowest_Bit   : Long_Integer := 16#0000_0001#;
   Lowest_Byte  : Long_Integer := 16#0000_00FF#;
   Middle_Bytes : Long_Integer := 16#00FF_FF00#;
   Accum_Bytes  : Long_Integer := 16#00FF_FFFF#;
   Crc_If_Odd   : Long_Integer := 16#0080_0021#;

   Acc   : Accumulator := Seed;
   Index : Natural     := Data'First;

   procedure Update (First   : System.Byte;
                     Second  : System.Byte;
                     Updated : in out Accumulator) is
      use Machine_Code;
   begin
      Instruction'(Load_Top, At_Offset => -3);    -- Second byte
      Instruction'(Short_Literal, 8);             -- 8 bits
      Instruction'(Execute, (Discrete_Class,      --
                             Logical_Shift_Op));  -- left shift
      Instruction'(Load_Top, At_Offset => -3);    -- First byte
      Instruction'(Short_Literal, 16);            -- 16 bits
      Instruction'(Execute, (Discrete_Class,      --
                             Logical_Shift_Op));  -- left shift
      Instruction'(Execute, (Discrete_Class,      -- Word left shifted 8
                             Or_Op));             --
      Instruction'(Load_Top, At_Offset => -5);    -- Updated
      Instruction'(Execute, (Discrete_Class,      --
                             Xor_Op));            -- exclusive or
      Instruction'(Load, (Middle_Bytes'Level,     --
                          Middle_Bytes'Offset));  --
      Instruction'(Execute, (Discrete_Class,      --
                             And_Op));            -- Z
      Instruction'(Load_Top, At_Offset => -5);    -- Updated
      Instruction'(Load, (Lowest_Byte'Level,      --
                          Lowest_Byte'Offset));   --
      Instruction'(Execute, (Discrete_Class,      --
                             And_Op));            --
      Instruction'(Short_Literal, 16);            -- 16 bits
      Instruction'(Execute, (Discrete_Class,      --
                             Logical_Shift_Op));  -- left shift
      Instruction'(Load_Top, At_Offset => -1);    -- Z
      Instruction'(Short_Literal, -2);            -- 2 bits
      Instruction'(Execute, (Discrete_Class,      --
                             Logical_Shift_Op));  -- right shift
      Instruction'(Execute, (Discrete_Class,      -- Updated :=
                             Xor_Op));            --   Updated xor (Z >> 2)
      Instruction'(Load_Top, At_Offset => -1);    -- Z
      Instruction'(Short_Literal, -7);            -- 7 bits
      Instruction'(Execute, (Discrete_Class,      --
                             Logical_Shift_Op));  -- right shift
      Instruction'(Execute, (Discrete_Class,      -- Updated :=
                             Xor_Op));            --   Updated xor (Z >> 7)
      Instruction'(Load_Top, At_Offset => -1);    -- Z
      Instruction'(Execute,                       --
                   (Discrete_Class,               --
                    Count_Nonzero_Bits_Op));      --
      Instruction'(Load, (Lowest_Bit'Level,       --
                          Lowest_Bit'Offset));    --
      Instruction'(Execute, (Discrete_Class,      --
                             And_Op));            --
      Instruction'(Jump_Zero,                     --
                   Update_Accum'Location);        --
      Instruction'(Load, (Crc_If_Odd'Level,       --
                          Crc_If_Odd'Offset));    --
      Instruction'(Execute, (Discrete_Class,      --
                             Xor_Op));            --
      <<Update_Accum>>                            --
      Instruction'(Store, (Updated'Level,         -- Update out param
                           Updated'Offset));      --
   end Update;

begin
   for I in Integer range 1 .. (Data'Length - 3) / 2 loop
      Update (First   => Data (Index),
              Second  => Data (Index + 1),
              Updated => Acc);
      Index := Index + 2;
   end loop;

   Acc := Bit_Operations.Logical_Xor (Seed, Acc);

   for I in 0 .. 2 loop
      Data (Index + I) :=
         System.Byte (Bit_Operations.Extract (W      => Acc,
                                              Start  => 64 - 24 + (8 * I),
                                              Length => 8));
   end loop;
end Update_Os2000_Crc;
pragma Main;
(This is from some recent work on the AutoArchaeologist module which dissects R1000 Environment archives.)
2023-10-14 Emulation update
Progress has been slow recently, but here is the current status:
The "Optimized" schematics have been condensed to 83 sheets now.
It takes 43½ hours to get to the "PRETTY_PRINTER" "mostly fails" problem at 380 seconds of simulated time.
It takes 70¾ hours to get to the "OS_COMMANDS" "always fails" problem at 568 seconds of simulated time.
2023-10-05 Second IOC board is running
R1000-400 IOC SELFTEST 1.3.2
    512 KB memory ... [OK]
    Memory parity ... [OK]
    I/O bus control ... [OK]
    I/O bus map ... [OK]
    I/O bus map parity ... [OK]
    I/O bus transactions ... [OK]
    PIT ... [OK]
    Modem DUART channel ... [OK]
    Diagnostic DUART channel ... [OK]
    Clock / Calendar ... [OK]
    Checking for RESHA board
    -- Bench mode (ID 7) detected
    Skipping RESHA tests
    Local interrupts ... [OK]
    Illegal reference protection ... [OK]
    I/O bus parity ... [OK]
    I/O bus spurious interrupts ... [OK]
    Temperature sensors ... [OK]
    IOC diagnostic processor ... [OK]
    Power margining ... [OK]
    Clock margining ... [OK]
Selftest passed

Restarting R1000-400S January 1st, 1901 at 00:20:01

Logical tape drive 0 is an 8mm cartridge tape drive.
Logical tape drive 1 is declared non-existent.
Logical tape drive 2 is declared non-existent.
Logical tape drive 3 is declared non-existent.

Booting I/O Processor with Bootstrap version 0.4
Boot from (Tn or Dn) [D0] :
The problem turned out to be two other failed memory chips: H5 and H6. These two RAM-chips passed the following test: Set(A0), Clr(A0), Set(A0), A0++, Test (A0)==0x00000000, but failed in the next test: Clr(A0), Set(A0), Clr(A0), A0++, Test (A0)==0xFFFFFFFF.
The buffers and latches appear to be fine and the behavior we saw on the traces is more likely related to something else.
BIOS has now been updated to include the other test as part of the code that pin-points problematic RAM-chips.
In the meantime, the first board now has yet another failed RAM-chip, so we are yet to have two running R1000s...
2023-09-15 45h25m
Optimizations and "condensations" continue, the most recent run got to the PRETTY_PRINTER failure in 45 hours and 25 minutes, well under two days.
This latest round of work was on the Memory Monitor on the FIU board, where almost three pages of "noodle-logic" have been condensed into a single component.
2023-08-22 Two and a half days
Speedups continue, we can now fail (or pass) "PRETTY_PRINTER" in 60 hours, and we seem to be getting better and better at speeding things up, as we understand more of how the R1000 hardware actually works.
The run just started incorporates "condensations" of the microcode condition facilities.
The SEQ microcode has a seven-bit field which selects one of about a hundred boolean "conditions" to influence the execution of the current micro-instruction and the selection of the next one.
In hardware there are slightly fewer conditions, because some of the hundred are combinations, but they are all ready at all times, waiting for the tree of multiplexers to pick one of them, which is an incredibly wasteful way to go about it in software.
Most of the conditions come from the TYP and VAL boards, and this first round of optimization simply replaces the respective layers of multiplexers with smarter ones, which only wait for the designated single signal to change.
Eyeballing it, that seems to have increased performance by 5-7% and we expect this run to get to "PRETTY_PRINTER" in about 56 hours.
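A toy sketch of the difference, in Python rather than the emulator's SystemC and with made-up condition codes, just to show the shape of the optimization:

class EagerConditions:
    ''' Hardware-true: every condition is recomputed, then one is picked. '''
    def __init__(self, conditions):
        self.conditions = conditions          # dict: select code -> callable

    def evaluate(self, select):
        values = {code: fn() for code, fn in self.conditions.items()}
        return values[select]

class LazyConditions:
    ''' Optimized: only the condition the select field asks for is evaluated. '''
    def __init__(self, conditions):
        self.conditions = conditions

    def evaluate(self, select):
        return self.conditions[select]()

# Example with two of the many conditions, as plain callables.
conds = {0x11: lambda: True, 0x2b: lambda: False}
assert EagerConditions(conds).evaluate(0x2b) == LazyConditions(conds).evaluate(0x2b)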
Conditions come in three speed-grades: "Early", "Medium" and "Late", roughly corresponding to them being available before the end of clock-phases Q2, Q3 and Q4 respectively.
Ideally the SEQ circuitry should create a clock signal ahead of the respective deadline, so that all the other boards can simply wait on that single signal and if the selected condition is theirs, evaluate just that single condition.
We expect that to be harder to implement than it sounds.
2023-07-19 Three days and 30 minutes
The most recent test run hit "PRETTY_PRINTER" after 3 days and 30 minutes, which means that the emulator is now running at 7 kHz, compared to the hardware's 5 MHz.
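As a back-of-envelope check of that number (our own arithmetic, using the roughly 380 simulated seconds to PRETTY_PRINTER quoted in the 2023-10-14 entry above):

real_seconds = 3 * 86400 + 30 * 60               # 3 days and 30 minutes
sim_seconds = 380                                # approximate time to PRETTY_PRINTER
print(5_000_000 * sim_seconds / real_seconds)    # ≈ 7300 Hz, i.e. about 7 kHz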
This batch contains two good examples of the kind of optimizations we are applying right now.
The VAL board has a 16x16=32 multiplier which can shift both inputs and the output in steps of 16 bits:
In reading order the schematic pages contain (a quick sketch of the arithmetic follows this list):
- The A input latch and shifting
- The B input latch and shifting
- Four 8x8=16 multiplier chips
- The 16+16+16+16=32 adder circuitry
- Inverters
- Output latch and shifting
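As our own illustration (not project code), this is how four 8x8=16 partial products plus shifted adds compose a single 16x16=32 multiplication, which is what those schematic pages implement in hardware:

def mult16x16(a: int, b: int) -> int:
    ''' Compose a 16x16=32 multiply from four 8x8=16 partial products. '''
    ah, al = a >> 8, a & 0xFF        # split each 16-bit input into two bytes
    bh, bl = b >> 8, b & 0xFF
    partials = [ah * bh, ah * bl, al * bh, al * bl]
    return (partials[0] << 16) + ((partials[1] + partials[2]) << 8) + partials[3]

assert mult16x16(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF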
In the most recent commit (ddb1a52) it looks like this:
One single component, with a lot of pins, which we do not have to draw by hand thanks to some Python code:
#!/usr/bin/env python3

''' VAL complete multiplier '''

from chip import Chip, FChip, ChipSig

class XMULT(FChip):

    ''' VAL complete multiplier '''

    symbol_name = "XMULT"

    def __init__(self):
        super().__init__()

        self.sig_left(ChipSig("-->+", "A", 0, 63))
        self.sig_left(ChipSig("-->+", "B", 0, 63))

        self.sig_right(ChipSig("+===", "P", 0, 63))
        self.sig_right(ChipSig("o<--", "OE"))
        self.sig_right(ChipSig("+<--", "START"))
        self.sig_right(ChipSig("+<--", "SRC", 0, 3))
        self.sig_right(ChipSig("+<--", "Q2"))
        self.sig_right(ChipSig("+<--", "Q3"))
        self.sig_right(ChipSig("+<--", "DST", 0, 1))
        self.sig_right(ChipSig("+<--", "DGMD"))

        self.finish()

def register():
    yield XMULT()
All the internal "connections" of the multiplier circuitry are now C++ code instead of SystemC signals, which is faster because it means fewer SystemC component activations.
And we are way down in activations now:
   A/ucyc              A  total %  what
----------------------------------------------
[…]
   33.266    61352009473   0.0201  page mem32_30
   33.344    61495869051   0.0202  page typ_36
   35.064    64668166462   0.0212  page fiu_62
   35.791    66007948215   0.0216  page typ_30
   36.188    66741121990   0.0219  page typ_13
   36.381    67096398653   0.0220  page mem32_31
   36.480    67279825011   0.0221  page mem32_28
   38.061    70195472397   0.0230  page val_34
   42.146    77728314882   0.0255  page val_13
   43.160    79597747251   0.0261  page ioc_63
   75.030   138374828092   0.0454  page seq_79
  206.840   381468517924   0.1250  board ioc
  261.990   483179638340   0.1584  board val
  281.352   518888462242   0.1701  board mem32
  281.917   519931189782   0.1704  board typ
  302.839   558515919444   0.1831  board fiu
  305.354   563154292231   0.1846  board seq
==============================================
 1654.292  3050957778644   1.0000  Total
At 1654 SystemC activations per micro-cycle, even getting rid of a single NAND gate triggering on the fast "2X" clock is now measurable.
Incidentally, page 79 of the sequencer is a page full of clock-generation which has many such NAND gates.
But notice also the other component next to the multiplier: a buffer which takes the top 32 bits of the A-bus as input and only outputs the MSB.
The condition circuitry examines the sign of the A and B inputs to the ALU by looking at the two MSBs, "A_D0" and "B_D0".
Therefore the "A_D0" signal has one more connection than the "A_D1…A_D31" signals, and consequently the A-bus gets broken up into three SystemC signals (0, 1-31, 32-63) instead of just two (0-31, 32-63).
To make matters worse, the single "A_D0" signal becomes an "sc_signal_resolved", a smart but slow SystemC signal type which can take the values '0', '1', 'Z' and 'X', whereas the rest of the A-bus becomes two C++ "uint32_t" integers, where the tri-state 'Z' is avoided by converting the bus to a multiplexer.
By imposing that extra buffer, which "needlessly" inputs "A_D1…D31", the A-bus becomes just the two "uint32_t" signals, and we avoid the slow "sc_signal_resolved" as part of the A-bus, which is active twice in each micro-instruction: Once to read and once to write to the register file.
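A toy Python model of the cost difference (ours, not the emulator's code): the resolved signal must merge every driver's '0'/'1'/'Z' state bit by bit, whereas the multiplexer form just selects one plain integer:

def resolve_tristate(drivers):
    ''' drivers: equal-length strings over '0', '1' and 'Z'. '''
    out = []
    for bits in zip(*drivers):
        active = {b for b in bits if b != 'Z'}
        if not active:
            out.append('Z')          # nobody driving the line
        elif len(active) == 1:
            out.append(active.pop())
        else:
            out.append('X')          # bus contention
    return ''.join(out)

def mux_bus(drivers, select):
    ''' drivers: list of plain integers; select: the one enabled driver. '''
    return drivers[select]

assert resolve_tristate(["1ZZ0", "ZZZZ"]) == "1ZZ0"
assert mux_bus([0x1234, 0xFFFF], 0) == 0x1234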
2023-07-13 New kind of IOC error ?!
We are trying to repair a second IOC board in the hope that we can double the number of running R1000-400 machines.
HP-Elektronikmontage very kindly removed the 28 faulty chips we previously identified. We have replaced these chips with sockets and new chips, but in the meantime 2 more chips failed, so they were replaced a couple of weeks ago. Even though that particular memory test now passes, the next test 48 instructions later fails...
Today we carried out a bunch of measurements with logic analyzer and oscilloscope and managed to home in on what goes wrong. We managed to get a good trace at the exact time where the test fails, and as can be seen below, the D27 line misbehaves. The red trace tracks the output of RAM.DO27, the output from the RAM chips, and the blue trace the output of data buffer D27 (I14).
At the spot where the test fails, it seems that the output level of the RAM chip does not get quite low enough. Two important traces from the flip-flop at H23 (CLK, DO27) are needed to determine if the culprit is actually the RAM chip at H6, the flip-flop at H23 or the buffer at I14, but at least we now have a good lead!
The fault is different from what we found back in 2019, which seemed like a "latch-up" of some kind - this looks more like a weak pull-down on the memory chip.
Time for Summer vacation!
2023-07-08 Two at the same time ?!
The two tests currently running on the M2 have both passed "PRETTY_PRINTER" now:
Out of several dozen runs, fewer than a handful have passed "PRETTY_PRINTER" so far, so this is interesting.
The one on the right reached "PRETTY_PRINTER" in 3 days and 15 hours, at an average emulation rate of 1:850.
2023-07-01 N-1 races left
We found a race and fixing it made a difference:
The IOC board generates three control signals for the SLICE, DELAY and RTC timers which rise at the end of Q4, which is also the clock signal for those three timers.
There is probably a reason this is not marginal in hardware, but in simulation it failed consistently, because when the "Q4~" clock signal rose, so did the other signals, so they did not cause anything to happen.
IOC Experiments had already revealed trouble with two of them, and we had given two NAND gates 5ns extra propagation delay in order to pass those tests. But we had not spotted that the third signal, "CLEAR_RTS.S~", suffered from the same race condition.
The violet and cyan lines are runs prior to the fix, the yellow line is with the fix. The drop at 93 (yellow) and 107 (cyan) is where the "starting virtual memory system" message appears on the console.
The yellow run is also 3-4% faster than the cyan, and it looks like it might run faster than 1:1000.
2023-06-26 Turbo microcode loading
We have reimplemented the turbo microcode loading:
The purple line is with the DIPROCs loading the microcode, byte code by byte code, instruction by instruction, and it takes nine minutes of real time to simulate until the vertical drop at 2½s simulated time.
The cyan line is with the "turbo loader", it is almost instantaneous by comparison.
The way it works is that when the bytes arrive on the DIAGBUS, they are put into the simulated 8052 DIPROC's RAM, just as they would have been if we woke up the 8052 with a serial RX interrupt and let it do its thing.
Once the experiment download is finished, we calculate a Koopman32 checksum of the start address (the first byte) and the code part of the experiment, and call a per-board function to decide what to do.
If that function returns zero, the 8052 simulator is woken up, and the usual thing happens.
But that function can instead decide that, if the Koopman32 hash matches the one for the LOAD_CONTROL_STORE_200.SEQ experiment, it will stuff the 64 bits directly into the Sequencer's Writable Control Store RAM "chip" and return DIPROC_RESPONSE_DONE, in which case the 8052 emulator is not woken at all.
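A minimal sketch of that dispatch in Python, with made-up names, a made-up payload layout and a plain CRC-32 standing in for the real Koopman32; it only shows the shape of the decision, not the emulator's actual code:

import zlib

DIPROC_RESPONSE_DONE = 2                 # assumed value, for illustration only
LOAD_CS_200_HASH = 0x12345678            # stand-in for the precomputed hash

def koopman32(data: bytes) -> int:
    ''' Stand-in only: the emulator uses a Koopman CRC, plain crc32 here. '''
    return zlib.crc32(data)

class WritableControlStore:
    ''' Toy stand-in for the SEQ WCS RAM "chip". '''
    def __init__(self):
        self.ram = {}
    def store(self, address: int, uword: int):
        self.ram[address] = uword

SEQ_WCS = WritableControlStore()

def seq_download_hook(start_address: int, code: bytes) -> int:
    '''
    Per-board function called when an experiment download has completed.
    Returns 0 to wake the simulated 8052 as usual, or DIPROC_RESPONSE_DONE
    if the experiment was handled directly without waking it.
    '''
    if koopman32(bytes([start_address]) + code) != LOAD_CS_200_HASH:
        return 0                         # unknown experiment: business as usual
    # Recognized LOAD_CONTROL_STORE_200.SEQ: stuff the 64 bits straight
    # into the Writable Control Store instead of simulating the 8052.
    SEQ_WCS.store(code[0], int.from_bytes(code[1:9], "big"))
    return DIPROC_RESPONSE_DONE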
In the big picture it does not save much time, but it saves 9 minutes every time we run UDIAG, and we run that a lot.
2023-06-20 It always ends in tears...
The run mentioned in the previous entry failed in OS_COMMANDS, as expected, but it only took 6½ days this time.
The chances of getting past the PRETTY_PRINTER issue seem to be somewhere between 1 and 10%, but with simulations running in parallel we get two shots per week now, so trying to nail this down by brute-force logging seems a viable strategy.
There are three different events we can try to log, in order of increasing volume:
- Memory access
- macro instruction execution
- micro instruction execution
Strangely enough, the difficulty of getting good logging seems to be in the opposite order:
The SEQ board has a single register with the current micro address, from which it would be trivial to look up the specific micro instruction in the microcode file.
The SEQ board also has 8 registers containing the current memory word's worth of macro instructions, but because of multi-word instructions and other complications, it is not immediately obvious how to log "the current instruction", so we would probably have to dump all eight and the selector register. If we did so, it would probably not be hard to correlate the instruction sequence to the binary objects we have disassembled in the DataArchaeologist.
Finally, logging memory access is a weird, complicated business involving the "Memory Monitor" on the FIU board and the four banks of cache on the MEM32 board, but this would give us the segment names of reads and writes, from which we can infer not only the macro instruction fetches and probable execution, but also the data access patterns of that code.
From the documentation we get the impression that the MAR register on the FIU is the authoritative place for this, so we may try that first and see what we learn from it.
2023-06-18 More performance filing and an interesting run
Presently we create work-trees for each board except TYP+VAL, which are so identical we work on them together.
After filing away in one of these trees, and once it passes the EXPMON and RUN_UDIAG and FRU_PHASE3 tests, we attempt to boot it, and if it fails with the usual PRETTY_PRINTER assert, it gets integrated.
This allows us to keep several CI machines busy at the same time.
Right now a FIU-board source tree is attempting to boot, and it progressed past the PRETTY_PRINTER assert this morning.
That is probably just luck, our working theory is still that we have a race condition somewhere, but it will be interesting to see how far it gets.
2023-06-11 Make that 4 days 8 hours
More performance improvements have brought us down to 4 days 8 hours to trigger the "PRETTY_PRINTER" assert, and we are approaching the 1:1000 milestone, this run ran only 1126 times slower than the hardware.
2023-05-28 Now failure in 5 days is an option
We are still replacing 64Kx1 SRAM's on the other IOC boards, hoping to eventually get two or even three working IOC boards out of it.
Between the gardening and other museum-work, we are slowly but surely filing away at the optimizations, and we can now get to the "PRETTY_PRINTER" failure in less than five days (the orange line in the plot below.)
The diagonal lines in the plot are real time measured on the right hand Y-axis.
The mostly horizontal lines are "seconds simulated per real second" on the barely labeled left hand Y-axis.
(The missing labels are because the left Y-axis is now logarithmic; we find that more informative.)
The "tvwip2" and "fiuwip" are two independent branches, which will be integrated when their test-runs finish in a couple of days.
Between the two of them we should get another 6% speedup.
The "rd010" run (purple at the bottom) is a run of the HW-true schematics, on an old server, started 40 days ago. In a week or so it will probably crash with the PRETTY_PRINTER failure, but it will have proven that our automatic "3-state bus to multiplexer" code also works with the HW-true schematics.
2023-05-01 Still here, still burning CPU
Apologies for the lack of updates recently, spring is always busy in Denmark.
We are still burning CPU, trying to optimize things to run faster and figure out why the boot goes off the rails.
We still have two different and distinct failure points, PRETTY_PRINTER and OS_COMMANDS, but now we are able to reach them in 6 and 10 days respectively.
Our current working hypothesis is that we have a race condition in the simulation, but who knows ?
2023-04-06 Next stop is …
OBJECT_EDITOR.11.6.1D
MAIL.DELTA
OS_COMMANDS.11.6.2D
====>> Kernel.11.5.8 <<====
Kernel: Kernel assert failure detected by kernel debugger stub.  System shutting down.
Exception: !Lrm.System.Assertion_Error, from PC = #F6413, #707
This is the [original problem], but we reached it in 14 days instead of 16 days.
2023-04-05 Continued progress
INPUT_OUTPUT.11.7.0D
COMPILER_UTILITIES.11.51.0D
SEMANTICS.11.50.3D
R1000_DEPENDENT.11.51.0D
R1000_CHECKING.11.51.0D
R1000_CODE_GEN.11.51.0D
433 seconds and counting…
2023-04-03 It's hardly ever done that before…
One of the current runs (rm156) did something unexpected just now:
====>> Elaborator Database <<====
BASIC_MANAGERS.11.3.0D
ADA_MANAGEMENT.11.50.4D
DISK_CLEANER.11.1.3D
PARSER.11.50.1D
PRETTY_PRINTER.11.50.3D
DIRECTORY.11.4.6D
This is the first time any run has made it past PRETTY_PRINTER, and if it continues we will have a login-prompt in a couple of months.
The new thing in this run is some experimental code which converts 3-state busses to use a multiplexer instead, in order to use C++ unsigned variables instead of the very slow sc_logic "01XZ" resolution.
Not only does that give us a pretty strong clue where to look for the PRETTY_PRINTER issue, it also takes it off the critical path.
We hope the power-grid is stable for a couple of months, because this runs on the Mac Mini and thus without backup power.
2023-03-23 With a littlebig help from our friends
We are still working on getting the IOC board working for PAM's machine.
Over winter we received a lot of help from the local electronics company [HP Elektronikmontage], who kindly removed all the bad SRAM chips from the board for us:
We also asked them to remove the parity chips, because we are not certain if the IOC boot code RAM test would detect if they suffer from the same issue as the main-RAM chips.
The plan now is to mount sockets and mount the New-Old-Stock SRAM chips we have bought, and see what happens.
2023-03-15 And now for something entirely different…
Wayne Meretsky told us via email that the Oregon Software PASCAL compiler was used for the IOP programs, and after a number of not very productive searches we finally found that one of the founders of Oregon Software had put almost exactly the most interesting parts of the compiler up on GitHub:
Some good clues about the peephole-optimizations can be found here:
https://github.com/dhogaza/mc68000_pascal_2/blob/main/cross/sources/code.pas#L639
2023-03-10 No emulating after bedtime
Every so often, in late evening sessions, we would persistently run into this error:
PASCAL error #4 at location 000218D0
Last called from location 000220C8
Last called from location 0002251C
Last called from location 00023296
And it being late at night, the obvious thing was to go to bed and look at it with fresher eyes later … whereupon it then failed to happen.
We finally dived into the disassembled PASCAL code of LOADER.M200 and found the bug, which would, of course, have been painfully obvious in the tracefile had we not been so sleepy:
Calendar([14 03 08 03 16 13 11])
Calendar([14 03 08 03 16 13 12])
Calendar([14 03 08 03 16 13 13])
Calendar([14 03 08 03 16 14 57])
Calendar([14 03 08 03 17 47 21])
Calendar([14 03 08 03 24 18 28])
These are the timestamps read from the RTC chip by kernel system-call 0x13, and the fields are:
- Year-1 (binary)
- Month (BCD)
- Day (BCD)
- Weekday
- Hour (BCD)
- Minute (BCD)
- Second (BCD)
The RTC emulation had a bug relating to the values 0x18, 0x24 and 24.
This is now fixed, and we implemented the missing midnight increment of the day/wday/month fields as well.
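To illustrate the BCD confusion (our own example, not the emulator's code): decimal 24 is carried on the bus as the byte 0x24, so mixing up the two representations is an easy mistake to make:

def bcd_to_int(b: int) -> int:
    ''' Convert one BCD byte (e.g. 0x24) to its decimal value (24). '''
    return (b >> 4) * 10 + (b & 0x0F)

def int_to_bcd(n: int) -> int:
    ''' Convert a decimal value (24) to its BCD byte (0x24). '''
    return ((n // 10) << 4) | (n % 10)

assert bcd_to_int(0x24) == 24
assert int_to_bcd(18) == 0x18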
2023-02-26 License or not
Having established that the real machine does not assert when the cluster-ID is wrong and the licenses invalid, it is obvious to ask: Does the machine boot faster if those software products are unlicensed?
A quick run this morning confirms that it is so.
A licensed boot takes 1327 seconds, an unlicensed one only 1194 seconds.
It is the time between the "INITIALIZING" and "username:" console messages which is a bit more than two minutes shorter.
This is good to know, but not actionable until the emulator makes it all the way to that INITIALIZING message.
2023-02-25 Different. Still lousy, but different
Well…
ADA_MANAGEMENT.11.50.4D
DISK_CLEANER.11.1.3D
PARSER.11.50.1D
PRETTY_PRINTER.11.50.3D
====>> Environment Log <<====
09:58:39 !!! EEDB Assert_Failure Unable to elaborate PRETTY_PRINTER.11.50.3D: Unhandled exception - 16#84# at PC 00169C1300007FE0
====>> Elaborator Database <<====
Unable to elaborate PRETTY_PRINTER.11.50.3D: Unhandled exception - 16#84# at PC&
 00169C1300007FE0
EEDB: %
This run was on the "megacomp4" branch where we have attempted to optimize things, so the next logical experiment is to fall back to the "main" branch, which is our ground truth, using the "much-deleted" disk-images, and see what that does.
At least we will find out in a couple of weeks, rather than months.
We tried booting the real machine with a wrong cluster ID, and it produced the same »Product_Authorization Invalid for« messages but did not assert, so that is also a solid indication that something is still wrong.
2023-02-24 Good Omens
A certain part of the failing runs looks like this:
Elaborating subsystem: ELABORATOR_DATABASE
====>> Elaborator Database <<====
NETWORK.11.1.5D
OM_MECHANISMS.11.2.0D
BASIC_MANAGERS.11.3.0D
While on the real machine it looks like this:
Elaborating subsystem: ELABORATOR_DATABASE
====>> Elaborator Database <<====
NETWORK.11.1.5D
OM_MECHANISMS.11.2.0D
====>> Environment Log <<====
10:48:37 +++ Low_Level_Action.Manager Restoration_From_Archive
====>> Elaborator Database <<====
BASIC_MANAGERS.11.3.0D
And, as of this morning, so does it in the emulator when running with the cluster number from PAM's machine.
2023-02-22 Second long run failed same way
After the three-month-long run failed in November, we started the exact same run again to get an idea of how reproducible the failure-mode is, and this morning we have the answer: it failed in precisely the same way:
Kernel assert failure detected by kernel debugger stub.  System shutting down.
Exception: !Lrm.System.Assertion_Error, from PC = #F6413, #707
A week ago we got the exact same failure after only 16 days, using the new much-deleted disk-images, so we are very confident that this is something we will be able to track down.
Segment #F6413 is loaded in KK.11.5.9K.MLOAD, where it is commented as "BTREE_MANAGER", which sounds like very fundamental Ada code.
We are seven days into the run where we patched the cluster-id to match PAM's machine; it is not out of the "starting VM" phase yet, but getting there. In some days we will know if that solves the "Product_Authorization Invalid for […]" issue, and a couple of weeks from now we will know if it also solves the BTREE_MANAGER assert.
In the meantime, "mega-component" optimizations continue.
2023-02-14 Patching the cluster-id
We have started a new run identical to the 16 day run which just failed, except for one detail: We have patched the cluster-id of PAM's machine into the IOC_EEPROM with the following CLI command to the emulator:
cli firmware patch IOC_EEPROM 0x7e84 0x00 0x06 0x3b 0x8b
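As an aside, and purely our own reading: assuming those four bytes are the cluster-id stored big-endian (both the byte order and the interpretation are assumptions on our part), they decode like this:

cluster_id = int.from_bytes(bytes([0x00, 0x06, 0x3b, 0x8b]), "big")
print(hex(cluster_id))   # 0x63b8b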
Hopefully that fixes at least the Product_authorization issues.
2023-02-14 Failing five times faster
The bad news:
The much-deleted-diskimage + megacomp4 run failed after 16 days.
The good news:
It failed in precisely the same way (see below) as the three-month run on the full diskimage did back in November 2022.
The product authorization failures are undoubtedly a very strong clue, we just don't know what they mean.
The code segment which fails, #F6413, is 1008659.SEG.
In the .MLOAD file it has the comment -- BTREE_MANAGER and the instruction at #707 is:
0707 0257 Execute Discrete,Raise
Now we need to find out why…
DISK_CLEANER.11.1.3D
PARSER.11.50.1D
PRETTY_PRINTER.11.50.3D
DIRECTORY.11.4.6D
INPUT_OUTPUT.11.7.0D
====>> Environment Log <<====
05:49:07 !!! Product_Authorization Invalid for Work_Orders
05:49:07 !!! Product_Authorization Invalid for Cmvc
05:49:08 !!! Product_Authorization Invalid for Rci
05:49:08 !!! Product_Authorization Invalid for Tcp/Ip
05:49:08 !!! Product_Authorization Invalid for Rpc
05:49:09 !!! Product_Authorization Invalid for Insight
05:49:09 !!! Product_Authorization Invalid for Rs6000_Aix_Ibm
05:49:09 !!! Product_Authorization Invalid for X Interface
05:49:09 !!! Product_Authorization Invalid for Cmvc.Source_Control
05:49:10 !!! Product_Authorization Invalid for Rcf
05:49:10 !!! Product_Authorization Invalid for Testmate
05:49:10 !!! Product_Authorization Invalid for Fundamental Session
05:49:11 !!! Product_Authorization Invalid for Lrm_Interface
05:49:11 !!! Product_Authorization Invalid for Telnet
05:49:11 !!! Product_Authorization Invalid for Ftp
05:49:11 !!! Product_Authorization Invalid for X_Library
05:49:12 !!! Product_Authorization Invalid for Dtia
====>> Elaborator Database <<====
COMPILER_UTILITIES.11.51.0D
SEMANTICS.11.50.3D
R1000_DEPENDENT.11.51.0D
R1000_CHECKING.11.51.0D
R1000_CODE_GEN.11.51.0D
IMAGE.11.4.2D
CORE_EDITOR.11.6.2D
TOOLS.11.5.1D
OE_MECHANISMS.11.1.2D
OBJECT_EDITOR.11.6.1D
MAIL.DELTA
OS_COMMANDS.11.6.2D
====>> Kernel.11.5.8 <<====
Kernel: Kernel assert failure detected by kernel debugger stub.  System shutting down.
Exception: !Lrm.System.Assertion_Error, from PC = #F6413, #707
2023-02-07 Getting somewhere 8x faster
Thanks to the much-stuff-deleted-diskimages, we can now get to the code after the VM initialization in 9 days, instead of 72 days:
rr001 (cyan&green) is main branch and original disk-images
rr007 (purple&orange) is megacomp4 branch and reduced disk-images
2023-01-28 Belated Happy New-year!
(Apologies for the lack of updates this last month, a bacterial pneumonia stole a couple of weeks.)
We have finally found the secret to making the R1000, real or emulated, boot faster: The nightly daemon run.
PAM told us that the way to get faster boot was to "delete stuff, in particular ADA objects".
We have tried that a number of times, but to no avail: The boot time stubbornly took one hour and some minutes.
Then some random experiments made us run so far out of disk-space that we had to run the nightly daemon run to proceed.
That took 4+ hours, and then the next boot was 10 minutes faster - Secret sauce detected!
With the pristine disk-image it takes 2076 seconds to get from "starting virtual memory system" to "the virtual memory system is up".
After a two-week campaign of sessions of wanton destruction, each concluding with daemon.run("daily"), we have reduced that to 284 seconds.
At the current speed, that should make the emulation of the "main" branch boot 70 days faster: 58 days instead of 128 days.
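A back-of-envelope check of that claim (our own arithmetic; the implied "main" branch slowdown of roughly 1:3400 is inferred from these numbers, not measured):

saved_sim_seconds = 2076 - 284            # 1792 s of simulated time saved
assumed_slowdown = 3400                   # approximate "main" branch ratio
print(saved_sim_seconds * assumed_slowdown / 86400)   # ≈ 70 wall-clock days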