Rational/R1000s400/Logbook/2021
2021-12-31 Final update from 2021
We have started emulating the FIU board too and it is not a total disaster: "TEST_FIU.EM" reports 24 tests passing and 69 tests failing.
Happy New Year!
2021-12-30 From hardware to software
The clock signals are the most important electrical signals in any synchronous logic circuitry, and it is evident from the schematics that a lot of care went into designing good clock circuits. One aspect of this is that the number of chips any particular TTL gate can drive is limited, and therefore duplicated drivers are a very prominent feature of the R1000 schematics:
The 74F37 NAND buffer is designed for driving signals and there 115 of them in the R1000 with 434 of 460 gates are used.
In the hardware computer, such duplicated buffering costs parts, PCB area and power.
In software the duplication costs time, because both of the two NAND gates must be simulated, and for no good reason really, because a single software gate can drive an infinite number of inputs.
Running the "TEST_IOC.EM" through EXPMON in the simulator, causes SystemC to activate the chip classes 26,181,186,645 times and a full 30% of those activations relate to producing and driving the clock signals, because the base clocks run all the time and they run at high frequencies.
So we went through the IOC schematic and de-duplicated all the obvious double buffering, signals named "mumble.B[1-5]" were merged with "mumble.B0" and that reduced it to 23.2 billion activations, a 12% saving.
But there really is no need to produce these clocks with discrete gates, so we created a new component called "CLKGEN" which magically outputs precisely the clocks needed in a R1000 computer, that brought it down to 17.6 billion activations, ⅔ of the original.
A second round over the schematics to perform a "reduction in strength" to move the clock-gating before fan-out in a couple of logic trees brought it further down to 15.3 billion, 58.5% of the original, for the first time making the emulator less than 200 times slower than the hardware.
This is precisely the way we hope to speed up the simulator: Replacing the hardware-way with the software-way, module by module, function by function.
Here are the three clock generation pages, before and after:
In hardware, the IOC board distributes three clock signals, two of which are double buffered, across the backplane to the other cards, which each derive the same 8 "free-running clocks" from them.
In software we will distribute all the clocks from the IOC to the other cards, because we can add more pins to the backplane for free, but each chip we must simulate costs time.
2021-12-29 A day at the races
One of the test-failures we had to solve, involves the three timers available to the microcode on the IOC board, specifically the "TEST_COUNTER_DATA.IOC" experiment failed because two of the three counters (on IOC page 56) did not behave as expected.
The relevant circuitry, slightly rearranged, looks like this:
The signal to load the counter "LOAD_DELAY.S~" is gated by the "CLK.2XE~.B2" clock, as shown on the "timing diagram", and the 74F579 chip reacts to the positive flank of the "Q4~.B1" signal, which is coincident with the rising flank on "CLK.2XE~.B2".
SystemC does a very smart "delta/update" trick, but that does unfortunately save us in this case, because multiple delta/update cycles happen in this moment, and therefore the simulation does not work. Or rather, it might or might not work, depending on the non-deterministic order of things in SystemC's internal data structures: In some cases the F579 sees the clock-flank before the F30 cuts the "LOAD_DELAY.S~" signal, in some cases it happens the other way around.
It does however work in hardware, with plenty of margin, because of this detail in the 74F579 datasheet:
Whatever value we want a 74F579 to use, it must be in place 9.5 nanoseconds before the clock signal goes positive but does not need to be held after the clock flank.
Implementing "Give me the value 9.5 nanoseconds ago" in SystemC can be done, but only at a rather high cost in simulation overhead. It can be done a lot cheaper if one cheats and hardcodes the clock-rate in the F579 class, but for general reasons of sanity, we have decided to make the fix plainly visible:
The output of the "DELAY" components is the same as the input, but delayed an arbitrary 5 nanoseconds, that way there is no longer a race.
2021-12-27 Speeding things up
At the bottom of the previous entry, you will find this line:
13093.544349 s Wall Clock Time
Testing the IOC board in the emulator took 6½ hour, because the simulation runs more than 200 times slower than the real thing and it had to simulate almost a minute of time.
Fortunately we can do better than that.
Right now the simulator looks like this:
68K20 C "Musashi" | | DIAGBUS C | | 8051 DIPROC C & SystemC | | DFSM SystemC | | IOC BOARD SystemC
All these parts are driven from the same master crystal in the real computer, and everything runs fast enough that there were no reason to optimize any of this too hard.
But in our case the SystemC stuff runs very slowly, so any and all amount of time spent entirely in the 68K20 or 8051 gets dragged out as well, for no good reason, and both the 68K20 and the 8051 do significant work.
The EXPMON program has an internal script interpreter which is not very performant when the 68K20 CPU only clocks at around 33 kHz.
The 8051, which is not a fast CPU to begin with, each machine cycle being 12 clock cycles long, uses a lot of instructions on "corner-turning" to load of the DUMMY and UIR serial registers.
So the order of the day was "clock decoupling", so that each of the three major parts can run as fast as their simulation code allows.
Of course they cannot just free-wheel, their mutual communication and expectations must keep working, so these three interfaces required attention:
- The 68K20 cannot send faster on the DIAGBUS than the 8051 can receive.
- The 68K20 uses the PIT timer to decide of experiments have run too long or DIPROCs have failed to answer.
- At least one place in the DIPROC firmware, data is fed into the Diagnostic Finite State Machine with no flow-control.
We may have to add more interfaces later, for instance when the R1000 CPU starts interacting with the IOC via the FIFO/SHM circuitry but we will burn that bridge when we get to it.
The 'elastic' buffer used to implement the DIAGBUS, as the name implies, offers flow-control, so the DIPROC simulator can just read from the DIAGBUS as fast as possible, even if that is slower than the 68K20 transmits.
That would have been even easier to implement, if the firmware did not cheat and read the 'SBUF' register twice to calculate a checksum, but it is no biggie:
/* * The DIPROC firmware reads SBUF twice, at both 0x649 and * 0x64d, so we cannot use the SCON.RI bit for flow-control. * Instead, wait if: * Interrupted and not spinning in 0x646 or 0x6e3 * or if SCON.RI is already set * of if SCON.REN is not set */
The addresses will obviously have to be tweaked for the DIPROC2 firmware on the MEM32 board.
The second one was even easier: Feed the PIT clock from the SystemC simulation, done.
The last one took some manual work. Obviously any 8051 instruction which does I/O to the DFSM must run at SystemC speed, but a few other instructions must as well, for instance:
122a f2 MOVX @R0,A 122b 41 2d AJMP 0x122d 122d e5 a0 MOV A,P2
It is pretty obvious that the AJMP is only there to put time between the MOVX which starts the DFSM and reading the result it produces.
As far as I can tell, that is an optimization targeting precisely the slow "corner-turning".
With those three interfaces taken care of, the 68K20 and 8051 emulators was let loose, which dropped the wall clock time for the same 'TEST_IOC' run to 55 minutes, doing only 11.5 seconds of simulated IOC board time.
The SystemC still runs at 1/284 real speed, but now we can work on that without needlessly wasting 5½ hour per test run.
2021-12-25 IOC does not fail tests
CLI> x expmon 1 32MB MEMORY BOARDS IN PROCESSOR - TOTAL OF 32 MEGABYTES. EM> TEST_IOC BEGINNING OF IOC BOARD TESTING TESTING IOC BOARD TILE 1 - IOC_DIPROC_TEST RESET TEST PASSED TRIGGER SCOPE TEST PASSED […] TRACE RAM SIMPLE DATA TEST PASSED TRACE RAM ADDRESS TEST PASSED TRACE RAM CELL TEST PASSED END OF IOC BOARD TESTING IOC BOARD PASSED EM> 49.626989500 s simulated 47.509434200 s SystemC simulation 13093.544349 s Wall Clock Time 40.283670 s IOC stopped 14717443 IOC instructions 1/263.84 Simulation ratio 13226.043 s User time 37.425 s System time 52188 Max RSS
… and they just launched the James Webb telescope, not a bad X-mas.
2021-12-05 Hacking the MEM32 DIPROC: Habemus Firmware
With a way to execute code from external memory, the remaining task was to write 8051 code to exfiltrate the firmware from the internal ROM.
With the code-protection enabled, the straigtforward use of
MOVC A,@A+DPTR
does not work, so we need to find that instruction somewhere in the internal ROM and make it do the deed for us.
Instead of actual memory, we are using a tiny ARM processor, which is connected to the 8051's port-0, port-2, CLOCK, ALE and PSEN pins. This allows us to not only give the 8051 the instructions we want, it also allows us to monitor what it does as a result, to the extent something shows up on the external memory interface.
Under the assumption that the MEM32 DIPROC's firmware shares a lot of code with the older DIPROC's we did a scan to locate instructions which give themselves away easily.
If, for instance from the external code we CALL to an address in the internal ROM which happens to contain a 0x22 byte, the 8051 will execute that as a "RET" instruction and return straight back to our code.
(Of course, it would also do that if it hit any trivial instruction, say "CLC C" followed by "RET", but the ARM counts the number of clock- cycles, so we easily tell those apart: Only the fastest returns are direct.)
Similarly, if we hit a 0x02 byte the 8051 will interpret that as a "LJMP" instruction, and if the next byte is larger than or equal to 0x20, the jump is into external memory, and the ARM processor will capture the address.
If we load up the DPTR register with 0x5555, and the A register with 0x55 before the call, any instruction using "@A+DPTR" on code memory, will access location 0x55aa, which, again, the ARM processor detects and reports.
However, for most the addresses, the 8051 goes on to somewhere in internal memory and does not reappear on the external memory interface, so after a timeout the probe fails.
This probing is not fast.
For each location we want to probe, we have to reset the DIPROC and go through the download-experiment-which-corrupts-the-stack gyrations.
To make matters worse, the DIPROC is seriously underclocked to make sure the ARM processor can keep up with the external memory interface.
Combining the necessary with the convenient, we run the DIPROC at 1200 * 64 = 76.8kHz clock-rate, which makes the DIAGBUS run at a standard rate of 1200 bps.
The result of the survey was that the very convenient subroutine at address 0x1393 in the older DIPROC, which we hoped to use, was nowhere to be found in the MEM32 DIPROC's firmware. This does make sense, that routine was used for the short scan-chains, and the MEM32 has none of those.
So time to study the "@A+DPTR" locations we had found.
The "JMP @A+DPTR" instructions were of no use, and most of the "MOVC @A+DPTR" did not return via a "RET(I)" instruction and the ones which did, had mangled the byte they read, before returning.
At this point we would have been stuck if there were not a single 0x93 byte in the internal firmware, because without no "MOVC A,@A+DPTR" instructions, we would have no way to read the internal ROM, fortunately there were plenty of those.
So it was time to roll out the heavy artillery.
Zero the A register, set the address of the location to be read in DPTR, send the first byte of a DIAGBUS "Upload" command, execute a calibrated number of "NOP" instructions and call the "MOVC @A+DPTR" instruction.
With a bit of luck, the 8051 will get a DIAGBUS receive interrupt at just the right moment, where the MOVC instruction has completed the read, but before the next instruction(s) had a chance to mangle it.
The serial interrupt handler very conveniently pushes the A register onto the stack, so some number of clock cycles after the last external memory fetch, we can trust the serial interrupt has happened, and that the 8051 is spinning in the interrupt handler, waiting for the pointer and length bytes to arrive on the DIAGBUS.
At this point the stack contains the return address to our code, the return address for the serial interrupt and the pushed A register, so sending 0x00 and 0x10 for the pointer and length bytes, we get something like this back on the DIAGBUS:
c0 00 0f cf 0f 06 46 55 45 70 93 06 98 6b 07 55 55 50 ----- ----- -- | | +--- Byte we want | +-------- Interrupted instruction +-------------- Next address in external memory
Getting that just right took some calibration, and still in 20% of the attempts the interrupt comes too early, in 38% of the attempts it comes too late, but 42% of the time it works.
So now we also have the firmware for the MEM32 DIPROC.
2021-12-03 Hacking the MEM32 DIPROC: First flag captured
It didn't go quite the way I explained in the previous entry: I had forgotten that the 8052 only allows access to Special Function Registers via direct memory access, and the EXP instructions only use indirect memory access.
But I found another way: When a EXP subroutine is called, the return address is pushed on the stack, and the stack lives below the EXP work-area and grows up, so doing a specific number of calls, and then PAUSE will put the stack pointer in the EXP work-area, and then we can over-write it with a download, and make the 8051 return from the serial interrupt routine to an address of our choice.
This experiment does that:
10 1b PC_: .CODE EXPERIMENT 11 00 R1_: .DATA 0x0 12 00 R2_: .DATA 0x0 13 00 R3_: .DATA 0x0 14 00 R4_: .DATA 0x0 15 00 R5_: .DATA 0x0 16 00 R6_: .DATA 0x0 17 00 R7_: .DATA 0x0 18 1b .CODE EXPERIMENT 19 00 .DATA 0x0 1a 0a .DATA 0xa 1b EXPERIMENT: 1b 86 1a DEC 0x1a 1d 51 00 00 19 23 BEQ.W #0x0000,0x19,0x23 22 18 CALL EXPERIMENT 23 60 PAUSE
After that, we send these bytes in the diag_bus:
0x1a5 ANY DOWNLOAD 0x002 payload length 0x031 payload byte #0 0x073 payload byte #1 0x0a6 checksum (0x02 + 0x31 + 0x73 = 0x(1)a6)
The 8052 returns out of the mask-ROM:
IOC.ioc_64.DPROC Instr 0x06aa 32: |{90 03 b0 1}|RETI|pop(@0x11) -> 0x73|pop(@0x10) -> 0x31|nPC 0x7331/0 IOC.ioc_64.DPROC OUT OF PROGRAM at 0x7331
0wned!
2021-12-02 Hacking the MEM32 DIPROC for Fun & Profit
I have been pondering ways to defeat the code-protection on the MEM32 DIPROC and I think I have found a way to do it with EXP instructions.
The EXP instructions perform no range-checking, so the 8051 SFR (Special Function Registers) can be modified by downloaded EXP instructions.
The Stack Pointer is a SFR on the 8051, so first part of the attack goes like this is:
1. Move stack pointer into the experiment download area
2. Get the 8051 to push a return address
3. Modify the return address
4. Get the 8051 to return through the modified return address to external code memory
5. Supply external-memory instructions to read out the internal ROM.
The most accesible use of the return is the interrupt handler which receives bytes from the DIAGBUS, because it performs the entire transaction while in the interrupt handler, before executing RETI.
So with a bit more detail the attack looks like this:
Download & execute EXP instructions to:
store 0x30 into SFR 0x81 PAUSE
The use of "PAUSE" is important, because it does not reset the stack pointer, like the normal END instruction does:
016b ; -------------------------------------------------------------------------------------- 016b ; PAUSE - |0 1 1 0 0 0 0|m| 016b ; -------------------------------------------------------------------------------------- 016b INS_PAUSE: 016b 75 04 03 |u | MOV diag_status,#0x03 016e 21 7d |!} | AJMP 0x17d ; Flow J 0x17d […] 0175 ; -------------------------------------------------------------------------------------- 0175 ; END >R |0 1 0 1 1 1 0|m| 0175 ; -------------------------------------------------------------------------------------- 0175 INS_END: 0175 75 04 01 |u | MOV diag_status,#0x01 0178 75 81 06 |u | MOV SP,#0x06 017b c2 8a | | CLR TCON.IT1 017d e5 04 | | MOV A,diag_status 017f b4 06 fb | | CJNE A,#0x06,0x17d ; Flow J cc=NE 0x17d 0182 a1 1c | | AJMP EXECUTE ; Flow J 0x51c
Then download another set of "EXP instructions" which overwrite the stack, so that instead of returning to the loop at 0x017d…0x017f, the 8051 returns into external memory.
However, we are not done, because the code protection bits also prevent a "MOVC" instruction in external memory from reading the internal code ROM, so we have to find a MOVC instruction in the internal ROM to do the dirty work for us.
The 0xDA "PERMUTE" EXP instruction uses a number of MOVC instructions to read the permutation tables, one of them is in the subroutine starting at 0x13d8 in the normal DIPROC. However, they all have the job of mangling things based on the table they find in code memory, so we need to deduce the "table" they read from CODE memory from the mangling they do.
That part is still "for further study" as the CCITT would say.
2021-11-28 7 down, 7 to go
We are now down to only 7 tests failing on the simulated IOC board:
DELAY TIMER MACRO EVENT TEST SLICE TIMER MACRO EVENT TEST EXIT MICRO EVENT TEST TRACE RAM SIMPLE DATA TEST TRACE RAM ADDRESS TEST TRACE RAM CELL TEST
Rooting out these bugs is slow going: First we need to disassemble the experiment - to the extent that we currently can - then we need to find out what the DFSM (Diagnostic Finite State Machine) steps it uses do, and finally we can try to find out why they do not.
The main cause of trouble continues to be misreadings of the schematics, to the extent that it is always the first working hypothesis that one or more of (0|8, I|T, M|N|H, B|D|S) were confused.
Fixing the ECC was slightly tricky because seven of the nine checkbits are a parity over a "random" selection of the 128 bits in the memory word:
TYP:0 TYP:63 VAL:0 VAL:63 CHECKBIT ECCG16 --------------------------------++++++++++++++++++++++++++++++++ --------------------------------++++++++++++++++++++++++++++++++ +-------- ECCG17 ++++++++++++++++++++++++++++++++-------------------------------- --------------------------------++++++++++++++++++++++++++++++++ -+------- ECCG28 ++++++++-----++---+++-+--+++--+-++++++++-----++---+++-+--+++--+- ++++++++-------+-----+-+--+-+-+----++---+-+-+--++++---++++-+---- --+------ ECCG29 +++++++++++++-++++---+-------+--+++++++++++++-++++---+-------+-- -+------++++++++-------++----++-----+++++--+-++----+---+--++---+ ---+----- ECCG44 ++-----+++++++++--++--+++---+-++++-----+++++++++--++--+++-----++ ---+---+----+++---++++--++++------------++++++++---++++------++- ----+---- ECCG45 -----++++--+---++++++++++++----+-----++++--+---++++++++++++-+--+ --+--++----+--+-++++++++-----+---++----+--------++++++++----+--+ -----+--- ECCG46 +-++--+-++--+---++----+-+++++++++-++--+-++--+---++----+-++++++++ ----+-+--++------++---+-+++++++++-++--+--+----++--+-+----++-++-- ------+-- ECCG61 -++-++----+-++-++-+-++-++-++++---++-++----+-++-++-+-++-++-++++-- +---++-++-+-+--++---+-+--+-----+++---+----+--+---+---+--++++++++ -------+- ECCG62 ---++----+++-++--+-+++-+-+-+++++---++----+++-++--+-+++-+-+-+++++ ++++----++-+-+--++-+-------++--+++++++++-+-++---+-------+-----+- --------+
I'm far from an ECC specialist, but I do know the crucial feature of this pattern is that any two columns differ from all the other ones in three places or more, (Hamming-distance >= 3), that way there will be exactly one legal "syndrome" which is closest to any of the illegal ones produced by a single bit error.
I wrote a bit of python code which finds all the "ECCG" chips in the SystemC code, extracts the connections between them and calculates the above map, and then finally calculate the minimum hamming distance and complain if it is less than three. The list of which signals went into the codes which had hamming distance of only two would be the main culprits and as usual it ended up being a 0|8 confusion.
Since the 74F280 chip only has 9 inputs, it tales a tree of them to produce the above result:
TYP:0 TYP:63 VAL:0 VAL:63 CHECKBIT ECCG00 ++++++++-------------------------------------------------------- ---------------------------------------------------------------- --------- ECCG01 --------++++++++------------------------------------------------ ---------------------------------------------------------------- --------- ECCG04 ----------------++++++++---------------------------------------- ---------------------------------------------------------------- --------- ECCG05 ------------------------++++++++-------------------------------- ---------------------------------------------------------------- --------- ECCG08 --------------------------------++++++++------------------------ ---------------------------------------------------------------- --------- ECCG09 ----------------------------------------++++++++---------------- ---------------------------------------------------------------- --------- ECCG12 ------------------------------------------------++++++++-------- ---------------------------------------------------------------- --------- ECCG13 --------------------------------------------------------++++++++ ---------------------------------------------------------------- --------- ECCG02 ---------------------------------------------------------------- ++++++++-------------------------------------------------------- --------- ECCG03 ---------------------------------------------------------------- --------++++++++------------------------------------------------ --------- ECCG06 ---------------------------------------------------------------- ----------------++++++++---------------------------------------- --------- ECCG07 ---------------------------------------------------------------- ------------------------++++++++-------------------------------- --------- ECCG10 ---------------------------------------------------------------- --------------------------------++++++++------------------------ --------- ECCG11 ---------------------------------------------------------------- ----------------------------------------++++++++---------------- --------- ECCG14 ---------------------------------------------------------------- ------------------------------------------------++++++++-------- --------- ECCG15 ---------------------------------------------------------------- --------------------------------------------------------++++++++ ---------
These sixteen chips calculate a per-byte parity, and similar circuits are present on TYP, VAL and FIU boards, to protect the transfer across the frontplane from one board to another.
Then another 38 chips complete the pattern:
TYP:0 TYP:63 VAL:0 VAL:63 CHECKBIT ECCG18 ---------------------------------------------------------------- ---------------+-----+-+--+-+-+----++--------------------------- --------- ECCG19 ---------------------------------------------------------------- -+---------------------++----++-----+++------------------------- --------- ECCG20 ---------------------------------------------------------------- ----------------------------------------+-+-+--++++---+--------- --------- ECCG21 ---------------------------------------------------------------- ---------------------------------------++--+-++----+---+--+----- --------- ECCG22 -------------++---++-------------------------------------------- -------------------------------------------------------+++-+---- --------- ECCG23 --------+++++-+------------------------------------------------- -----------------------------------------------------------+---+ --------- ECCG24 --------------------+-+--+++--+--------------++----------------- ---------------------------------------------------------------- --------- ECCG25 ---------------+++---+-------+----------+++--------------------- ---------------------------------------------------------------- --------- ECCG26 --------------------------------------------------+++-+--+++--+- ---------------------------------------------------------------- --------- ECCG27 -------------------------------------------++-++++---+-------+-- ---------------------------------------------------------------- --------- ECCG41 ---------------------------------------+----------++--+++-----++ ---------------------------------------------------------------- --------- ECCG42 ----------------------------------------+--+---+--------+++-+--+ ---------------------------------------------------------------- --------- ECCG30 ---------------------------------------------------------------- ---+---+----+++---+++------------------------------------------- --------- ECCG31 ---------------------------------------------------------------- ----+-+--++------++---+---------+------------------------------- --------- ECCG32 ---------------------------------------------------------------- ---------------------+--++++-----------------------+++---------- --------- ECCG33 ---------------------------------------------------------------- --+--++----+--+--------------+---++----------------------------- --------- ECCG34 ---------------------------------------------------------------- ----------------------------------++--+--+----++--+-+----------- --------- ECCG35 ++-----+----------++-------------------------------------------- ------------------------------------------------------+------++- --------- ECCG36 -----++++--+---------------------------------------------------- ---------------------------------------+--------------------+--+ --------- ECCG37 +-++--+--------------------------------------------------------- ---------------------------------------------------------++-++-- --------- ECCG38 ----------------------+++---+-++++------------------------------ ---------------------------------------------------------------- --------- ECCG39 ---------------+--------+++----+-----+++------------------------ ---------------------------------------------------------------- --------- ECCG40 --------++--+---++----+---------+-+----------------------------- ---------------------------------------------------------------- --------- ECCG43 -----------------------------------+--+-++--+---++----+--------- ---------------------------------------------------------------- --------- ECCG47 ---------------------------------------------------------------- +---++-++-+-+--+------------------------------------------------ --------- ECCG48 ---------------------------------------------------------------- ----------------+---+-+--+-----+++---+-------------------------- --------- ECCG49 ---------------------------------------------------------------- ++++----++-+-+-------------------------------------------------- --------- ECCG50 ---------------------------------------------------------------- ----------------++-+-------++--+---------+-+-------------------- --------- ECCG51 -++-++---------------------------------------------------------- ------------------------------------------+--+---+---+---------- --------- ECCG52 ---++----++----------------------------------------------------- --------------------------------------------+---+-------+-----+- --------- ECCG53 ----------+-++-++-+-++------------------------------------------ ---------------------------------------------------------------- --------- ECCG54 -----------+-++--+-+++-+---------------------------------------- ---------------------------------------------------------------- --------- ECCG55 -----------------------++-++++---++----------------------------- ---------------------------------------------------------------- --------- ECCG56 -------------------------+-+++++---++--------------------------- ---------------------------------------------------------------- --------- ECCG57 ------------------------------------++----+-++-++-+------------- ---------------------------------------------------------------- --------- ECCG58 -----------------------------------------+++-++--+-++----------- ---------------------------------------------------------------- --------- ECCG59 ----------------------------------------------------++-++-++++-- ---------------------------------------------------------------- --------- ECCG60 -----------------------------------------------------+-+-+-+++++ ---------------------------------------------------------------- ---------
In Hamming's original 1950 article, Detecting and Error Correcting Codes the check-bits were arranged so that their numerical value gave the position of the flipped bit.
But this is not a classical Hamming-syndrome.
I suspect the R1000 designers figured out they could save some chips by creating a "syndrome" which recycled some of the sixteen parity bits, and therefore the syndrome of the check-bits are run through a small PROM chip to convert it to a bit position:
One output bit tells if this an "invalid syndrome", 374 of the 512 syndromes consist of just that bit.
If the "invalid syndrome" bit is zero, the other seven bits identify which one of the 128 bits should be flipped.
That leaves 10 syndromes which have the invalid syndrome bit set, but which also presents something in the other seven bits:
0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x7f
Could they be "magic markers" which confirm to diagnostic tests that they got the expected wrong syndrome ?
2021-11-24 IOC page 28 reconstructed
Page 28 is missing in our binder of schematics.
The first thing I did with that binder, almost a decade ago, was to scan it, and page 28 was missing, so it was missing when we got it.
From the titles of the surrounding pages it would appear to be related to memory, somehow:
By cross-referencing the the netlist out of KiCad and the "chipmap" listing the name of each position on the PCB and what was in it, four 74F280 chips were totally unaccounted for, and the KiCad "Electrical Rules Check" complained about nothing feeding the data input of the RAM chips storing the parity, so tonight I went into the lab and "beeped" out the connections of those four chips, and now we have a "hypothesized page 28", with a explanatory note about where the fiction comes from:
2021-11-20 KiCad Schematics online
Redrawing the schematics in KiCad is essentially complete, so I have put the result online in github:
https://github.com/Datamuseum-DK/R1000.HwDoc
RESHA needs to be redrawn, but I think we are one or more sheets shy, and the ones we have look very preliminary, so that is on the "later" list.
IOC lacks page 28, we simply do not have that page. I suspect it was affected by an ECO and never got put back in the binder.
2021-11-09 No luck reading MEM32 DIPROC firmware
I tried reading the firmware out of the MEM32 DIPROC tonight, using the [External Address Hack] but no luck: The 8752 chip clams up and executes no instructions at all when the `EA` input is pulled low.
Intel added three successive "lock bits" to their MCS51 microcontrollers in order to be able to keep the microcode from being copied, the third one is the one which simply disables external code execution ability.
As far as I can tell, the only known way to read out the firmware from a chip so protected, is to decap it and hit the security fuses with a laser, while being very careful not to hit the EPROM bits. Given that we have only three chips, one in each of our three working boards, that is totally out of the question.
One of our MEM32 boards have a much lower serial number than the two others, and a slightly different text on the sticker on the DIPROC, so there is a very small chance, that it may be possible to read the firmware out of that one. That board is in remote storage now, so that will have to wait.
Another unlikely avenue could be to exercise the DIPROC while in the board, via the DIAGBUS, using the `*.M32` files in the DFS filesystem, while capturing the activity on all its pins with a logic analyzer, and then reconstruct the firmware from those traces. If the firmware shares enough code with the DIPROC on the other boards, this just might be feasible, given enough time and sufficient motivation, but we are into "stranded on a desert island" territory there.
Being unable to emulate this firmware obviously takes the "Diagnostic Archipelago" out of the picture as far as the MEM32 board goes, but I do not think that is a show-stopper for us.
Compared to the other boards, the MEM32 board is quite simple, and has no microcode of its own, but gets commanded from part of the FIU boards microcode, via 20 signals on the backplane.
But obviously a complication we could do without.
2021-11-02 Almost 50/50 isn't bad
Almost 20 years ago, at BSDCON2002, John R. Mashey, already then a legend, gave us young geeks a reprise of a already then 20 year old talk, Software Army on the March, and ever since I have wished that tank was captured on video, because it has some of the best wisdom I have ever gotten about software development, in all its forms and shapes.
One particular kind of software development, is where you know where you want to go, but not how to, or even if it is possible to get there.
Frederick P. Brooks tells us to always build a prototype and throw it away, and John R. Mashey took that advise one step further by framing the prototype in terms of the scouts a moving army sends out, to find viable routes etc: You dont send scouts backwards or along routes you think will work, you send them out where things may not work and where you simply have no idea of things will work.
He also told us to be extremely sceptical of the promises made by the scouts and prototypes: A bridge which untroubled bore a scout on a motorbike is not by definition a way to move your tanks across the river.
The R1000 Emulation project has a lot of unknowns but we know it is not impossible, emulating computers is old hat, and the R1000 was even simulated to quite some extent before it was built in the first place. But getting from A to B still involves a lot of path-finding and I think we have the first scout back now, with some good and some bad news.
The good news is that KiCad+SystemC looks like a feasible path for us, the bad news is that it will require some very strong tools to debug problems along the way.
Amongst the major and absolutely inescapable items on the shopping list is a way to snapshot and restart the emulator, otherwise we will waste far too much time getting the emulator to run to the interesting point. Somewhat surprising to me, snapshort/restart is not part of SystemC, but I think I have an plan how to go about it.
I think that is a bridge we have to construct now, along with that some more flexible debugging facilities. So while I go of to a side doing that, enjoy this almost 50/50 success-ratio IOC test-run:
CLI> x expmon 1 32MB MEMORY BOARDS IN PROCESSOR - TOTAL OF 32 MEGABYTES. EM> TEST_IOC BEGINNING OF IOC BOARD TESTING TESTING IOC BOARD TILE 1 - IOC_DIPROC_TEST RESET.IOC RESET TEST PASSED TRIGGER_SCOPE.IOC TRIGGER SCOPE TEST PASSED TESTING IOC BOARD TILE 2 - IOC_SCAN_CHAIN_TEST TEST_PAREG_SCAN.IOC PARITY REGISTER SCAN TEST PASSED TEST_SYNDROME_SCAN.IOC SYNDROME REGISTER SCAN TEST PASSED TEST_UIR_SCAN.IOC MICRO INSTRUCTION REGISTER SCAN TEST PASSED TEST_RDR_SCAN.IOC "DUMMY" READ DATA REGISTER SCAN TEST PASSED TESTING IOC BOARD TILE 3 - IOC_WCS_TEST TEST_WCS_DATA.IOC SIMPLE WCS DATA TEST PASSED TEST_WCS_ADDRESSING.IOC WCS ADDRESS TEST FAILED FAILING EXPERIMENT IS : TEST_WCS_ADDRESSING TEST_WCS_BITS.IOC WCS CELL TEST PASSED TESTING IOC BOARD TILE 4 - IOC_ENABLE_TEST TEST_ADDRESS_ENABLES.IOC ADDRESS BUS SOURCE DECODER TEST PASSED TEST_FIU_ENABLES.IOC FIU BUS SOURCE DECODER TEST PASSED TEST_TV_0_ENABLES.IOC TV BUS SOURCE DECODER TEST - NO DUMMY_NEXT OR CSA_HIT FAILED FAILING EXPERIMENT IS : TEST_TV_0_ENABLES TEST_TV_1_ENABLES.IOC TYPE VAL BUS SOURCE DECODER TEST - WITH CSA_HIT FAILED FAILING EXPERIMENT IS : TEST_TV_1_ENABLES TEST_TV_2_ENABLES.IOC TYPE VAL BUS SOURCE DECODER TEST - WITH DUMMY_NEXT FAILED FAILING EXPERIMENT IS : TEST_TV_2_ENABLES TEST_TV_3_ENABLES.IOC TV BUS SOURCE DECODER TEST - WITH DUMMY_NEXT & CSA_HIT FAILED FAILING EXPERIMENT IS : TEST_TV_3_ENABLES TESTING IOC BOARD TILE 5 - IOC_TIMER_TEST TEST_COUNTER_DATA.IOC TIMER DATA TEST FAILED FAILING EXPERIMENT IS : TEST_COUNTER_DATA TESTING IOC BOARD TILE 6 - IOC_ECC_TEST TEST_SYNDROME_TRANSCEIVER.IOC SYNDROME TRANSCEIVER (ECC) TEST PASSED TEST_CHECKBITS.IOC CHECKBIT (ECC) TEST FAILED FAILING EXPERIMENT IS : TEST_CHECKBITS TEST_MULTIBIT_ERROR.IOC MULTIBIT ERROR (ECC) TEST PASSED TEST_ECC.IOC FULL ECC TEST FAILED FAILING EXPERIMENT IS : TEST_ECC TESTING IOC BOARD TILE 7 - IOC_EVENTS_TEST TEST_MACRO_EVENT_DELAY.IOC DELAY TIMER MACRO EVENT TEST FAILED FAILING EXPERIMENT IS : TEST_MACRO_EVENT_DELAY TEST_MACRO_EVENT_SLICE.IOC SLICE TIMER MACRO EVENT TEST FAILED FAILING EXPERIMENT IS : TEST_MACRO_EVENT_SLICE TEST_MICRO_EVENT_ECC.IOC TEST_MICRO_EVENT_EXIT.IOC EXIT MICRO EVENT TEST FAILED FAILING EXPERIMENT IS : TEST_MICRO_EVENT_EXIT TEST_IOC_CLOCKSTOP.IOC IOC CLOCKSTOP EVENT TEST PASSED TESTING IOC BOARD TILE 8 - IOC_TRACE_TEST TEST_TRACE_DATA.IOC TRACE RAM SIMPLE DATA TEST FAILED FAILING EXPERIMENT IS : TEST_TRACE_DATA TEST_TRACE_ADDRESSING.IOC TRACE RAM ADDRESS TEST FAILED FAILING EXPERIMENT IS : TEST_TRACE_ADDRESSING TEST_TRACE_BITS.IOC TRACE RAM CELL TEST FAILED FAILING EXPERIMENT IS : TEST_TRACE_BITS END OF IOC BOARD TESTING IOC BOARD FAILED 14 TEST(S) FAILED EM>
2021-10-26 Getting somewhere
I have found a set of TEST*.EM files, which seem to be a collection of per-board test and I have started with the IOC board, since the simulator runs fastest with only one board, and the IOC produces the clock signals.
Status so far:
CLI> x expmon 1 32MB MEMORY BOARDS IN PROCESSOR - TOTAL OF 32 MEGABYTES. EM> TEST_IOC BEGINNING OF IOC BOARD TESTING TESTING IOC BOARD TILE 1 - IOC_DIPROC_TEST RESET TEST PASSED TRIGGER SCOPE TEST PASSED TESTING IOC BOARD TILE 2 - IOC_SCAN_CHAIN_TEST PARITY REGISTER SCAN TEST PASSED SYNDROME REGISTER SCAN TEST PASSED MICRO INSTRUCTION REGISTER SCAN TEST PASSED "DUMMY" READ DATA REGISTER SCAN TEST PASSED TESTING IOC BOARD TILE 3 - IOC_WCS_TEST SIMPLE WCS DATA TEST FAILED FAILING EXPERIMENT IS : TEST_WCS_DATA
The bugs I have fixed so far have all been variants of typos, either in the schematics or in the component models.
It is my general policy to run development versions of software where I can, and the development version of KiCad seems to be somewhat in transit with respect to Electrical Rule Checks, at least I think it should have spotted one of my typos, but it did not.
It is probably about time to start adding some teeth to the code which converts the KiCad netlist to SystemC, so it spots trouble when all boards are "plugged in", since KiCad only sees one board at a time. This will also automatically plug some of the holes in KiCad's per-board ERC.
My F245AB component model did not react to changes on pin18, because I accidentally had left that pin out of the "sensitive" clause.
2021-10-25 First test passed
It may not do much, but this is officially the first passed test for the simulator:
CLI> x expmon 1 32MB MEMORY BOARDS IN PROCESSOR - TOTAL OF 32 MEGABYTES. EM> FIU_RESET_TEST TESTING FIU BOARD TILE 1 - FIU_RESET_TEST SIMPLE_DIAGNOSTIC_COMMAND PASSED EM>
The simulator can be told which boards to "plug in", and in this case only the IOC (which generates the clocks) and the FIU board were active, so the simulator ran quite fast:
2.903304450 s simulated 0.722312450 s SystemC simulation 272.648054 s Wall Clock Time 0.424236 s IOC stopped 4236823 IOC instructions 1/93.91 Simulation ratio 272.853 s User time 2.134 s System time 61428 Max RSS
Now we need to create a "Continuous Integration" facility, to run all the "*.EX" files on the DFS filesystem and record their outcomes.
2021-10-20 Bug found in TYP schematics
One of the really big gambles about basing the software emulation on the schematics, is the uncertainty if the schematics we have are truthful.
The RESHA schematics look unfinished, but fortunately they are rather peripheral to the project and there are relatively few components on that board.
Until now the other schematics seems to have held up well, but today I spotted the first inconsistency on one of the TYP schematics: Two chips both named DIXCV1.
Investigation of the S400 board images, convince me that the topmost one is DIVXCV0, and I see no realistic way the schematic could have contained that, and subsequent printing/copying/scanning etc. turned that into …1 as seen in the image.
This is not a problem, in the sense that it is not information which affects the netlist produced, but it does tell us to be (more) critical.
It has been the plan all along to digitize as many data sources as possible, in order to cross-reference all the information we have, in an attempt to detect discrepancies between the schematics and reality, for instance by typing in the chip names from the solder side silk screen, in our ChipMaps page, and later the chip types from the component side as well.
Hopefully that "census" matches the completed schematics, otherwise we have a LOT more work to do.
2021-10-17 Hole through the diagbus
We have hole through the simulated diagbus now:
CLI> x novram Options are: 0 => Exit. 1 => Display novram contents. 2 => Modify novram contents. 3 => Change TCP/IP board serial number. Enter option : 1 Part Serial Artwork ECO Date of Board Number Number Revision Level Manufacture IOC 49 10295 3 13 10-JUL-92 VAL 0 0 0 0 ??-???-?? TYP 0 0 0 0 ??-???-?? SEQ 45 10571 2 5 12-JAN-93 FIU 0 0 0 0 ??-???-?? MEM0 0 0 0 0 ??-???-?? RESHA 41 10272 3 13 24-JUN-92 TCP/IP (CMC) board serial number is 1671
The values from the SEQ board comes from the SystemC model of the Xicor X22C12 NOVRAM chip via the simulated i8052 DIPROC across the simulated diagbus to the simulated M68K20.
Getting it work involved hacking som random usleep(3) calls into the M68K20 side of the diagbus, so it would not time the DIPROC out before it got a chance to wiggle the READ_NOVRAM_DATA.SEQ experiment through the SystemC simulation.
The easiest way, and in many ways the correct way, to handle this would be let the SystemC 20MHz clock drive the simulated IOC.
But that would needlessly slow down the IOC simulation, which is useful on its own for various research purposes.
It looks like the DIPROC timeout test is driven by the PIT timer, so maybe tying only that to the SystemC clock, when that is running, will be enough ?
Entering the schematics in KiCad is relatively painless, it does get a bit slow near completion of a board, but I get through it. It looks like the completed schematics will generate a tad over 100KLOC of C++/SystemC code.
2021-10-09 Time is an illusion
I concluded the previous entry by noting that the NOVRAM saw an all-ones address, that is now solved, and it transpired to be quite interesting.
The R1000 is, by and large, a synchronous logic system, there is a centralized clock of 20 MHz and things happen on its beat.
Since things often have to happen sequentially, it is very common to divide the main clock down and provide for clock-phases.
The R1000 has four phases, named Q1 to Q4, each of which is 50 nanoseconds long, and these signals are produced locally on each board, if the board needs them. Q2 and Q4 are the popular ones, al boards seem to have them, while Q1 and Q3 are less used.
In practice the signals are produced by AND'ing a version of the "2X" and a version of the "1X" clock signals, these being 10MHz and 5 MHz respectively, divided down from the 20MHz, the waveforms looks something like this:
The 2X and the H1 signals are fed to a NAND gate, which produces a Q3~ low pulse, whenever they both are high.
The problem arises because the two inputs trace places from one being low and the other high to the opposite.
In real life that happens simultaneously, the R1000 clock circuits are designed very carefully to make sure of that.
But in the SystemC simulation these two events are simulated sequentially, and depending on the order, that can create a zero-width glitch, which will trigger downstream edge-triggered circuits.
Here is the debugging output where I first spotted it:
@4721125 ns sc_module FIU.fiu_63.U6317_F37 11|0 @4721175 ns sc_module FIU.fiu_63.U6317_F37 10|1 @4721175 ns sc_module FIU.fiu_65.U6516_F74 CLK 1111|10 @4721225 ns sc_module FIU.fiu_63.U6317_F37 11|0 @4721225 ns sc_module FIU.fiu_63.U6317_F37 01|1 <----- @4721225 ns sc_module FIU.fiu_65.U6516_F74 CLK 1011|01 @4721275 ns sc_module FIU.fiu_63.U6317_F37 00|1 @4721325 ns sc_module FIU.fiu_63.U6317_F37 01|1 @4721325 ns sc_module FIU.fiu_63.U6317_F37 11|0 @4721375 ns sc_module FIU.fiu_63.U6317_F37 10|1 @4721375 ns sc_module FIU.fiu_65.U6516_F74 CLK 1011|01 @4721425 ns sc_module FIU.fiu_63.U6317_F37 11|0 @4721425 ns sc_module FIU.fiu_63.U6317_F37 01|1 <----- @4721425 ns sc_module FIU.fiu_65.U6516_F74 CLK 1011|01 @4721475 ns sc_module FIU.fiu_63.U6317_F37 00|1 @4721525 ns sc_module FIU.fiu_63.U6317_F37 01|1 @4721525 ns sc_module FIU.fiu_63.U6317_F37 11|0 @4721575 ns sc_module FIU.fiu_63.U6317_F37 10|1 @4721575 ns sc_module FIU.fiu_65.U6516_F74 CLK 1011|01 @4721625 ns sc_module FIU.fiu_63.U6317_F37 11|0 @4721625 ns sc_module FIU.fiu_63.U6317_F37 01|1 <----- @4721625 ns sc_module FIU.fiu_65.U6516_F74 CLK 1011|01 @4721675 ns sc_module FIU.fiu_63.U6317_F37 00|1
Notice how at ....25 nanoseconds, the F37 output goes low and high with the same timestamp, and how that triggers the F74.
If SystemC had happened to update the two inputs in the other order, everything would be fine.
Interestingly, and certainly worth knowing, this glitch does not appear in the "wave-file" output:
Notice how the "NEW_CMD_B_inv" signal (output of F74) goes low despite there being no positive edge on the "fiu_globals_Q4_inv..." signal.
After thinking it over on a bike-ride, I decided that the cleanest and most efficient fix was to generate the four phase signals on the IOC board, where the 20MHz clock was available to synchronize them, and distribute them to the other boards via four extra pins on the backplane:
Having done that, the NOVRAM chip sees far more interesting addresses:
@4191275 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 1 store_1 recall_ 1 we_ 1 a 00000010 dq ZZZZ @5065675 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 0 store_1 recall_ 1 we_ 1 a 00000011 dq ZZZZ @5070875 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 1 store_1 recall_ 1 we_ 1 a 00000011 dq ZZZZ @5737675 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 0 store_1 recall_ 1 we_ 1 a 00000100 dq ZZZZ @5742875 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 1 store_1 recall_ 1 we_ 1 a 00000100 dq ZZZZ @6617275 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 0 store_1 recall_ 1 we_ 1 a 00000101 dq ZZZZ @6622475 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 1 store_1 recall_ 1 we_ 1 a 00000101 dq ZZZZ @7289275 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 0 store_1 recall_ 1 we_ 1 a 00000110 dq ZZZZ @7294475 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 1 store_1 recall_ 1 we_ 1 a 00000110 dq ZZZZ @8168875 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 0 store_1 recall_ 1 we_ 1 a 00000111 dq ZZZZ @8174075 ns sc_module SEQ.seq_83.U8315_NOVRAM cs_ 1 store_1 recall_ 1 we_ 1 a 00000111 dq ZZZZ
Next: Run the SystemC code in "real-time" and have it send the responses back to the IOC processor via the DIAGBUS.
2021-10-06 The worlds largest Blinky works!
I have tied the SystemC code, including the i8051 emulator into the IOC emulator, and have now seen the Chip-Select pin of the SEQ's NOVRAM go low 31 times.
Let me explain the sequence of events that takes us there:
- The simulated IOC 68k20 resets into the IOC-EEPROM
- It completes the selftests (although we skip some of them)
- It loads the KERNEL.0, FS.0 and PROGRAM.0 from the simulated disk
- We ask to enter the CLI
- We execute the CLI command "x novram"
- This command excutes using the "dummy diproc" code, so all it gets back is zeros.
- The traffic from the IOC to the DIAGBUS is written to the file "_diag" as we go.
- When the CLI prompt comes back, we launch the SystemC code with the IOC and SEQ boards, for 27 milliseconds
- Special-hacky-code reads the "_diag" file and sends those bytes to the DIAGPROCs, essentially replaying the "x novram" programs traffic.
- The IOC's DIAGPROC has no NOVRAM, but it handles the traffic correctly
- The SEQ's DIAGPROC downloads the experiment (http://datamuseum.dk/aa/r1k_dfs/32/32b885caa.html READ_NOVRAM_DATA.SEQ)
- The SEQ's DIAGPROC executes the experiment, wiggling the simulated i8051's pins in the SystemC model.
- The SystemC model propagates these wigglings through the SystemC model.
- The wiggling pins trigger the diagnostic finite state machine, where DPROM5 and DUIRG5 produce control signals for the NOVRAM chip.
- The SEQ's NOVRAM chip sees its Chip Select pin go low/high 31 times.
- The SEQ's DIAGPROC uploads the result of the experiment.
With 143754 lines of C and C++ source code, this is probably the worlds largest Blinky program :-)
Now the rest is really just debugging and the first question is: Why does the NOVRAM get all-ones address bits ?
2021-10-02 Schematics, i8051 emulator and a snag
I have the i8051 emulator code tied into the SystemC code, and on five of the six "boards", the 8051 correctly figures out which address to use on the serial diagnostic bus:
DIAGPROC IOC.ioc_63.U6302_8051 Step PC 0x063a MOV 0x03,A direct 0x03 is 0x04 nPC 0x063c DIAGPROC FIU.fiu_65.U6509_8051 Step PC 0x063a MOV 0x03,A direct 0x03 is 0x03 nPC 0x063c DIAGPROC TYP.typ_60.U6007_8051 Step PC 0x063a MOV 0x03,A direct 0x03 is 0x06 nPC 0x063c DIAGPROC VAL.val_63.U6307_8051 Step PC 0x063a MOV 0x03,A direct 0x03 is 0x07 nPC 0x063c DIAGPROC SEQ.seq_82.U8203_8051 Step PC 0x063a MOV 0x03,A direct 0x03 is 0x02 nPC 0x063c DIAGPROC MEM32.mem32_33.U3303_8051 Step PC 0x063a MOV 0x03,A direct 0x03 is 0x00 nPC 0x063c
But not the MEM32.
Studying the DIPROC firmware I dumped and the MEM32 schematic, there is no way it can get the correct address (0xc for MEM0 and 0xe for MEM2), because the chip which drives the signal is attached to PORT-2 and the firmware reads the address from PORT-1.
My tentative conclusion is that the DIPROC runs a different firmware on MEM32 boards, and there is some convincing evidence to support that:
The chip is a P8752, an EPROM model, where the other boards have a P8052AH mask-ROM model.
The checksum listed on the paper-label on the chip "0184" matches the entry in the BOM, but not the firmware I have dumped.
If only DIPROC chip had been socketed, this would have been a trivial matter, but it is soldered in, on all the three MEM32 boards we have.
2021-09-12 Schematics → KiCad → SystemC
I have spent some time in trains recently, and used it to redraw 25-ish pages of schematics in KiCad.
Each board has it's own KiCad project, from which a netlist is exported.
These netlists are chewed on by a python3 program which produces relatively well-structured SystemC source code.
And now the SystemC source code also compiles.
Next it needs to be hooked up with my own i8051 emulator and the IOC-processor emulation we already have running.
/phk
2021-09-03 Power Supply Replacement
Peter made a new back-plate for our replacement power-supplies:
We reuse the rails from the original power supply so it slides right into place:
I have not attached the control-cable through which the R1000 can instruct the PSU to raise or lower the voltage for diagnostic purposes, in fact I am not sure I have the nerves to experiment with that at all.
The current through the two orange garden-hoses is 155-165 Ampere, and people are apt to exclaim "but it is only 5 volts, what's the problem ?"
The problem is resistance, and since I have the numbers from my check at hand, let's go through the math:
On the two output terminals of the PSU, the voltage is 5.1146 Volt.
On the two huge spade lugs it is already down to 5.1111 Volts.
The difference is only 3.5 milliVolts, 0.0035 Volt, but at 160 Ampere that means that the contact surfaces between the PSU's terminals and the spade lugs get heated by 0.56 Watt.
At the copper conductors crimped into the spade lugs, the voltage is 5.1034 Volts, a loss of 7.7 milliVolts and 1.23 Watts of heating.
At the copper in the other end of the cables we are down to 5.0439 Volts, a whopping 59.5 milliVolts lost and 9.52W at heating.
Yes, the two cables do indeed warm noticeably when the machine runs.
From the copper in the cable to the terminals on the backplane we loose another 8.5 milliVolts, another 1.36 Watt of heating.
In total the loss from the PSU to the backplane is 79.2 milliVolts, and 12.67Watt.
Using Ohms law we find that this is a resistance of 0.0792 Volt / 160 Ampere = 0.000495 Ohm.
And since we are doing this:
The load resistance of the computer, as seen from the PSU is: 5.1146 Volt / 160 Ampere = 0.032 Ohm.
Or as normal people know it: A short-circuit.
I suspect that reason the original power-supply is ailing, (See more about this in our logbook) is that somebody at some time in its history did not clean the output terminals before clamping into a machine, and the increased resistance meant it had to deliver a higher voltage to satisfy the feedback circuit on the RESHA board.
It is absolutely trivial to end up with a contact resistance of several milliohms or even ten, if great care is not taken at these amperages, and for each milliohm extra resistance, the PSU must deliver 160 Ampere * 0.001 Ohm / 5 V = 3.2% higher output voltage.
By not connecting the control-cable to the power-supply, there is no risk of this happening, the power-supply outputs the voltage I have trimmed it to, 5.1146 Volts, and if there is a bad connection, the computer will see a lower voltage and act flaky, but the magic smoke stays put.
2021-08-24 Finally getting somewhere
We have started to really investigate the R1000 disk contents now, and as we do so, it have become clear that we would have been severely time-constrained on our regular thursday evenings: Some commands simply takes hours to run, in some cases many hours.
For instance the lib.space() command does pretty much what the unix du(1) command does, but it takes 3+ hours if you run it from the top of the filesystem.
Having the machine here in my own lab, means I can turn it on, start such a command and go about my regular business while it chews its way through the disk images.
We aslso have a network gateway now, a small FreeBSD VM where all TCP options have been disabled, so the CMC network processor will talk to it. Accounts will be handed out generously, mail me your ssh-key.
We need to figure out a solution for terminal emulation, but in the mean time FTP is usable to go spelunking in the filesystem.
Hopefully the information we gathered that way will enable me to finally complete AutoArchaeologist modules for the backup-tape and the raw disk images.
/phk
2021-08-03 Long overdue update
Datamuseum.dk has lost the practically free lease on our huge basement, and have not yet found new digs, so we are busy moving the entire collection into storage.
Or rather, not quite everything, the R1000 has taken up residence in my personal lab for the duration, so that we can continue our activites with this wonderful piece of kit.
Predictably the trouble-prone power supply threw a tantrum as first point of business, and I have therefore relieved it of duty.
We received a couple of the really huge 3kW supplies from PAM, and after a careful check & renovation in Peter's lab, they will probably be ready for service.
In the mean-time the R1000 will be powered by our fall-back solution: A "Cosel PBA1000F-5" which delivers 5V/200 Ampere, more than plenty for the 150-160A which the R1000 draws, and a "Cosel PBW50F-12-N" for the ±12V, which only drives the fans and the RS-232 port level converters, because we use the "SCSI-SD" rather than the original 5¼" full-height "Wren" disk-drives.
Next up is configuring a firewall/jump-host so the rest of the team can access the machine across the net on whatever evening we can agree on.
The R1000 will only be turned on when needed, even with the new much more efficient power-supplies, the R1000 draws a full kilowatt, which my A/C has to get rid of again, running it 24*365 would cost approx €3000 in electricity.
/phk
2021-06-29 Ten outstanding PROMs read
A scheduling conflict made it possible for me to visit the collection this afternoon, where I unsoldered the ten bipolar PROMS from the spare /200 board, read them, and remounted them in sockets.
Checksums match the BOM document, but one of the FIU proms have a surprising content, more about that when I have investigated it further.
Also scanned the schematics again, this time in 600dpi grayscale. It is clearly an improvement, but I have a hard time getting used to 2.5GB pdf files. Will have to look into ways to reduce the size.
Still working on brute-forcing the MEM32 GAL chips.
/phk
2021-06-15 PSU trouble diagnosed
The trouble mentioned on 2021-05-25 have been diagnosed: The -12V supply is sick.
Pulling the -12V fuse and using a bench-supply instead, the machine works fine, and draws 1.4A from the -12V rail, most of it from the fans on that rail.
2021-06-10 Reading programmable chips
I had a couple of chances to visit the collection after work, and have read as many of the programmable chips as could.
They go under Bits:Keyword/RATIONAL_1000/SW in the bit archive.
Thanks to the Schematic Bill Of Materials document, I am able to verify the checksums my reads, except where revisions differ between the BOM and what I find on the PCBs.
The following chips are outstanding:
- FIU PA020-01 - Not socketed
- FIU PA021-01 - Not socketed
- FIU PA022-01 - Not socketed
- FIU PA023-01 - Not socketed
- FIU PA024-01 - Not socketed
- SEQ PA040-02 - Not socketed
- SEQ PA041-01 - Not socketed
- SEQ PA042-01 - Not socketed
- SEQ PA043-01 - Not socketed
- SEQ PA044-01 - Not socketed
- MEM32 GAL-SETGAL-01 - Checksum mismatch, read=0xaf2d, BOM=0xa744
- MEM32 GAL-DISTGAL-02 - Bad read, copy protected? BOM has -01
- MEM32 GAL-MARGAL-02 - Bad read, copy protected? BOM has -01
- MEM32 GAL-MUXLGAL-02 - Bad read, copy protected? BOM has -01
- MEM32 GAL-DIBGAL-02 - Bad read, copy protected? BOM has -01
- MEM32 GAL-TPARGAL-02 - Bad read, copy protected? BOM has -01
2021-06-03 Researching SystemC
I have spent some time researching SystemC as a possible vehicle for simulating the R1000 CPU.
SystemC is "just" a bunch of C++ classes which can express a system emulation at pretty much any level of abstraction, from gate level to IP-component composed SoC device.
There are two huge attractions in SystemC.
The first is that we can use the IOC, the EXPeriments in the DFS filesystem and the "Diagnostic Archipelago" to debug the simulation.
The second is that we can start with a simulation at chip-level, by capturing net-lists from the schematics, and debug that until it works correctly (but likely very slow), and then improve performance by replacing the chip-level modules with higher level abstractions written in C++.
The bad news is that we would need to capture the net-lists to do so, and there are 417 dense pages in the schematic PDF files.
The best way to approach that task, would probably be to create our own symbol library in KiCad, such that the symbols match the Rational symbols used in the schematics, both with respect to pin placement and naming (the symbols in the schematics use the "Bit0 is MSB" convention), so that the result will look as identical as possible to the original schematics.
The KiCad produced net-list, would then have to be post processed into a SystemC "net-list", but that seems to be a mostly trivial data-conversion without too many problems.
The i8052 diag processors, and the IOC's 68k20 processor would need to be tied into the SystemC simulation, but that also seems doable.
While SystemC can be resynthesized into HDL languages (Verilog, VHDL), and while it could be fun to have a R1000 running in a FPGA, I have not research that aspect beyond browsing a document listing all the C++ things you are not allowed to do in that case. At the very least it would require either a HDL or real HW for the 68k20 and the i8052s.
A key question is how fast a chip-level simulation in SystemC actually would run?
If we are talking kHz clock rates, optimization would have to start very early in the process, if we are into MHz, we can postpone that to later.
I have talked to two SystemC specialists about it, and they were not very optimistic about the speed of such a simulation.
The best way forward is probably to make a prototype, for instance of the state machines around one of the i8052 processors, to get an idea how tedious the process would be, and learn something useful about how the diagnostic subsystem works along the way.
/phk
2021-05-27 i8052 Diagnostic Processor mask rom read
I managed to read out the mask rom contents of the i8052 diagnostic processor I borrowed from the /200 VAL card, now available in a bitarchive near you: Bits:30002517.
More about how the readout was done
A cursory disassembly looks like it makes sense.
2021-05-25 R1000 Stille working - sorta
For the first time in half a year I had a chance to enter our collection and of course I tried to boot the R1000.
It worked fine, and I shut it down again.
Then sometime later I remembered I wanted to dump the NVRAM content from the boards, powered it on and got no output on the console.
Thanks to the handy LEDs on the RESHA board, it took me only a moment to realize that the +12V supply was missing.
That prevented the console RS-232 driver from sending usable voltages down the line, perfectly explaining why I got no output.
Hypothesis at this point: We probably violated the PSU's minimum load specification. Reasoning: It is built to supply four 5¼" full-height high performance SCSI drives, but because we use the SCSI-SD emulator, it is only loaded by the two RS-232 ports.
I also read the EEPROMs from the ENP100 processor in PAM's machine, and a handful of 512x8 bipolar PROMS from the i8052/diag circuit on the /200 VAL board. These files are in the bitarchive under Bits:Keyword/RATIONAL_1000/SW
2021-03-14 Microcode spelunking
Now that the IOC-emulator has definitively told us which microcode bits go to which cards, work has (re)started to figure out what they mean.
This work takes place in a separate github repos called [R1kMicrocode] and we can already produce plots such as this one, showing the microinstructions involved in floating point arithmetic:
And listings of the actual microcodes:
2860: seq_cond_sel 0a VAL.ALU_LT_ZERO(late) seq_en_micro 0 seq_latch 1 fiu_len_fill_lit 74 zero-fill 0x34 fiu_len_fill_reg_ctl 1 Load Literal Load Literal fiu_load_oreg 1 hold_oreg fiu_oreg_src 0 rotator output fiu_tivi_src 4 fiu_var typ_alu_func 1a PASS_B typ_b_adr 16 CSA/VAL_BUS typ_c_adr 3b GP 0x4 typ_c_mux_sel 0 ALU val_a_adr 15 ZERO_COUNTER val_alu_func 1a PASS_B val_b_adr 05 GP 0x5 ioc_adrbs 2 typ ioc_fiubs 1 val ioc_tvbs 2 fiu/val
2021-03-06 IOC emulation usable
After a month, the IOC part of the emulator is now sufficiently usable to allow us to start working on the other parts of the system:
R1000-400 IOC SELFTEST 1.3.2 512 KB memory ... [OK] Memory parity ... [OK] I/O bus control ... [OK] I/O bus map parity ... [OK] I/O bus transactions ... [OK] PIT ... [OK] Modem DUART channel ... Warning: DUART crystal out of spec! ... [OK] Diagnostic DUART channel ... [OK] Clock / Calendar ... Warning: Calendar crystal out of spec! ... [OK] Checking for RESHA board RESHA EEProm Interface ... [OK] Downloading RESHA EEProm 0 - TEST Downloading RESHA EEProm 1 - LANCE Downloading RESHA EEProm 2 - DISK - Warning: Detected Checksum Error Downloading RESHA EEProm 3 - TAPE - Warning: Detected Checksum Error DIAGNOSTIC MODEM ... DISABLED RESHA DISK SCSI sub-tests ... [OK] Local interrupts ... [OK] Illegal reference protection ... [OK] I/O bus parity ... [OK] I/O bus spurious interrupts ... [OK] Temperature sensors ... [OK] IOC diagnostic processor ... [OK] Power margining ... [OK] Clock margining ... [OK] Selftest passed Restarting R1000-400S March 7th, 20EO at 19:54:42 Logical tape drive 0 is an 8mm cartridge tape drive. Logical tape drive 1 is declared non-existent. Logical tape drive 2 is declared non-existent. Logical tape drive 3 is declared non-existent. Booting I/O Processor with Bootstrap version 0.4 Boot from (Tn or Dn) [D0] : Kernel program (0,1,2) [0] : File system (0,1,2) [0] : User program (0,1,2) [0] : Initializing M400S I/O Processor Kernel 4_2_18 Disk 0 is ONLINE and WRITE ENABLED Disk 1 is ONLINE and WRITE ENABLED Logical Tape 0, physical drive 0 is declared in the map but is unreachable. IOP Kernel is initialized Initializing diagnostic file system ... [OK] ==================================================== Restarting system after loss of AC power CrashSave has created tombstone file R1000_DUMP1. READ_NOVRAM_DATA.TYP READ_NOVRAM_DATA.VAL READ_NOVRAM_DATA.FIU READ_NOVRAM_DATA.SEQ READ_NOVRAM_DATA.M32 >>> Automatic Crash Recovery is disabled >>> NOTICE: the EPROM WRT PROT switch is OFF (at front of RESHA board) <<< >>> WARNING: the system clock or power is margined <<< CLI/CRASH MENU - options are: 1 => enter CLI 2 => make a CRASHDUMP tape 3 => display CRASH INFO 4 => Boot DDC configuration 5 => Boot EEDB configuration 6 => Boot STANDARD configuration Enter option [enter CLI] : 1 CLI> x novram READ_NOVRAM_DATA.TYP READ_NOVRAM_DATA.VAL READ_NOVRAM_DATA.FIU READ_NOVRAM_DATA.SEQ READ_NOVRAM_DATA.M32 Options are: 0 => Exit. 1 => Display novram contents. 2 => Modify novram contents. 3 => Change TCP/IP board serial number. Enter option : 1 Part Serial Artwork ECO Date of Board Number Number Revision Level Manufacture IOC 49 10295 3 13 10-JUL-92 VAL 0 0 0 0 ??-???-?? TYP 0 0 0 0 ??-???-?? SEQ 0 0 0 0 ??-???-?? FIU 0 0 0 0 ??-???-?? MEM0 0 0 0 0 ??-???-?? RESHA 41 10272 3 13 24-JUN-92 TCP/IP (CMC) board serial number is 1671 Options are: 0 => Exit. 1 => Display novram contents. 2 => Modify novram contents. 3 => Change TCP/IP board serial number. Enter option : 0 CLI> Abort : Other error Console interrupt From CLI >>> NOTICE: the EPROM WRT PROT switch is OFF (at front of RESHA board) <<< >>> WARNING: the system clock or power is margined <<< CLI/CRASH MENU - options are: 1 => enter CLI 2 => make a CRASHDUMP tape 3 => display CRASH INFO 4 => Boot DDC configuration 5 => Boot EEDB configuration 6 => Boot STANDARD configuration Enter option [enter CLI] : 6 --- Booting the R1000 Environment --- READ_NOVRAM_DATA.TYP READ_NOVRAM_DATA.VAL READ_NOVRAM_DATA.FIU READ_NOVRAM_DATA.SEQ READ_NOVRAM_DATA.M32 READ_NOVRAM_DATA.SEQ READ_NOVRAM_DATA.FIU READ_NOVRAM_DATA.TYP READ_NOVRAM_DATA.VAL READ_NOVRAM_DATA.M32 LOAD_HRAM_32_0.FIU LOAD_HRAM_1.FIU ALIGN_CSA.VAL ALIGN_CSA.TYP LOAD_CONFIG.M32 CLEAR_TAGSTORE.M32 CLEAR_PARITY_ERRORS.M32 CLEAR_PARITY.FIU CLEAR_PARITY.VAL CLEAR_PARITY.TYP CLEAR_PARITY.SEQ Loading from file M207_54.M200_UCODE bound on March 17, 1993 at 2:33:02 PM Loading Register Files and Dispatch Rams .... [OK] Loading Control Store .............. [OK] LOAD_BENIGN_UWORD.TYP SET_HIT.M32 INIT_MRU.FIU CLEAR_HITS.M32 PREP_RUN.TYP PREP_RUN.VAL PREP_RUN.IOC PREP_RUN.SEQ PREP_RUN.FIU CLEAR_PARITY_ERRORS.M32 FREEZE_WORLD.FIU RUN_NORMAL.TYP RUN_NORMAL.VAL RUN_CHECK.SEQ RUN_CHECK.IOC RUN_CHECK.M32 RUN_NORMAL.FIU
2021-02-07 The foundation of an Emulator
I have started at github project with a R1000 Emulator: [R1000.Emulator]
It uses [Karl Stenerud's "Musashi" 68K emulation] to emulate the IOC processor, and I have built skeleton I/O devices around it, and it gets surprisingly far, for a single weekends work:
R1000-400 IOC SELFTEST 1.3.2 512 KB memory ... [OK] Memory parity ... [OK] I/O bus control ... [OK] I/O bus map parity ... [OK] I/O bus transactions ... [OK] PIT ... [OK] Modem DUART channel ... Warning: DUART crystal out of spec! ... [OK] Diagnostic DUART channel ... [OK] Clock / Calendar ... Warning: Calendar crystal out of spec! ... [OK] Checking for RESHA board RESHA EEProm Interface ... [OK] Downloading RESHA EEProm 0 - TEST Downloading RESHA EEProm 1 - LANCE Downloading RESHA EEProm 2 - DISK Downloading RESHA EEProm 3 - TAPE Local interrupts ... [OK] Illegal reference protection ... [OK] I/O bus parity ... [OK] I/O bus spurious interrupts ... [OK] Temperature sensors ... [OK] IOC diagnostic processor ... [OK] Power margining ... [OK] Clock margining ... [OK] Selftest passed Restarting R1000-400S February 8th, 2021 at 11:53:44 Logical tape drive 0 is an 8mm cartridge tape drive. Logical tape drive 1 is declared non-existent. Logical tape drive 2 is declared non-existent. Logical tape drive 3 is declared non-existent. Booting I/O Processor with Bootstrap version 0.4 Boot from (Tn or Dn) [D0] : Kernel program (0,1,2) [0] : File system (0,1,2) [0] : User program (0,1,2) [0] : Initializing M400S I/O Processor Kernel 4_2_18 IOP Kernel is initialized I/O Processor Kernel Crash: error 0806 (hex) at PC=000049F8 Trapped into debugger. RD0 00000000 RD1 00000002 RD2 00000002 RD3 0000000A RD4 00000013 RD5 00000006 RD6 0000001E RD7 00000005 RA0 0000E800 RA1 0003FADA RA2 00000954 RA3 0003FAD8 RA4 0003FFF4 RA5 0003FA96 RA6 0003FB70 ISP 0000FAB8 PC 0000A158 USP 0003FA96 ISP 0000FAB8 MSP 00000000 SR 2704 VBR 00000000 ICCR 00000009 ICAR 00000000 XSFC 0 XDFC 0 @
Some of the self-test routines ar skipped over, to postpone implementation of those I/O facilities until later.
The crash is after the KERNEL has initialized and called FS, which has zero'ed its data-segment and called PROGRAM.
/phk
2021-01-28 Disassembler found
A full text search for instruction strings revealed the machine code for a disassembler:
[⟦2fa0095f7⟧ Disassembler'code]
Comparing the result from this file with the previous guesses confirmed that this is a disassembler.
This has enabled decoding the remaining parts of the instruction set.
2021-01-01 Our first Rosetta Stones
A brute-force text-search at all bit-positions of all heap-segments, we have found our first "Rosetta Stones":
[⟦a564c4d78⟧ Numeric_Primitives'spec]
[⟦c5cdc1bc4⟧ Numeric_Primitives'body]
[⟦cb8e43375⟧ Numeric_Primitives'code]
and
Both of them are very simple, and that has enabled us to unravel the basic Float and Discrete mathematical operators, as well as some comparison and flow-control operations.
Happy New Year
/phk