Rational/R1000s400/Logbook

Current Status

Test Wall Clock (s) SystemC (s) Ratio (SystemC/Wall) Exp run Exp fail
expmon_reset_all 119.747 0.025898 1/4623.8 0 0
expmon_test_fiu 1764.892 17.850875 1/98.9 95 0
expmon_test_ioc 3006.305 11.204664 1/268.3 29 0
expmon_test_mem32 171593.144 264.129948 1/649.7 45 5
expmon_test_seq 1586.243 9.307945 1/170.4 108 12
expmon_test_typ 8092.592 7.464258 1/1084.2 73 0
expmon_test_val 7399.792 7.429965 1/995.9 66 0
novram 47.347 0.010041 1/4715.3 0 0

2022-05-08 Slowly making way

As can be seen in the table above, the main DRAM array now works on the emulated MEM32 board.

It takes 48 hours to run that test, because the entire DRAM array is tested 16 times, very comprehensively:

  TESTING TILE  4 -  TILE_MEM32_DATA_STORE

  DYNAMIC RAM DATA PATH TEST                                 PASSED
  DYNAMIC RAMS ADDRESS TEST                                  PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 0                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 1                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 2                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 3                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 4                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 5                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 6                        PASSED
  DYNAMIC RAM ZERO TEST - LOCAL SET 7                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 0                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 1                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 2                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 3                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 4                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 5                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 6                        PASSED
  DYNAMIC RAM ONES TEST - LOCAL SET 7                        PASSED

  TILE  4 -  TILE_MEM32_DATA_STORE                           PASSED


While "FAILURE" is printed five times on the console, there is actually only two failing experiments:

  TESTING TILE  3 -  TILE_MEM32_TAGSTORE

  TAGSTORE SHORTS/STUCK-ATS TEST                             PASSED
  TAGSTORE ADDRESS PATTERN TEST                              PASSED
  TAGSTORE PARITY TEST1                                      PASSED
  TAGSTORE PARITY TEST2                                                FAILED

            FAILING EXPERIMENT IS :  TEST_TAGSTORE_PARITY_2

  TAGSTORE RAMS ZERO TEST                                    PASSED
  TAGSTORE RAMS ONES TEST                                    PASSED
  LRU UPDATE TEST                                                      FAILED
            FAILING EXPERIMENT IS :  TEST_LRU_UPDATE
  
  TILE  3 -  TILE_MEM32_TAGSTORE                                       FAILED

Despite some effort, we have still not figured out what the problem is. We suspect a timing issue near or with the tag-RAM.

2022-04-16 A long overdue update

As can be seen in the table above, the simulated SEQ board is down to 12 FAILURE messages, and what the table does not show is that the MEM32 board simulation now completes, but takes more than 24 hours to do so, which makes the daily CI cron(8) job fail catastrophically.

The bug which had taken us almost a month to fix turned out to be the i8052 emulator's CPL C (Complement Carry) instruction not complementing, in a DIPROC bytecode instruction we had not previously encountered: calculate even/odd parity for a multi-byte word.
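
To illustrate why that matters (our sketch, not the DIPROC firmware or the actual emulator code), here is the kind of parity loop that silently breaks when a complement-carry operation is a no-op:

  // Our illustration, not the DIPROC firmware: a multi-byte parity loop that
  // toggles the carry flag once per set bit.  If the emulated "complement
  // carry" instruction does nothing, the result is always "even parity".
  #include <cassert>
  #include <cstdint>

  struct I8052 {
      bool cy = false;                 // carry flag
      void cpl_c() { cy = !cy; }       // complement carry; the buggy emulator left cy unchanged
  };

  static bool odd_parity(I8052 &cpu, const uint8_t *buf, unsigned len)
  {
      cpu.cy = false;
      for (unsigned i = 0; i < len; i++)
          for (unsigned bit = 0; bit < 8; bit++)
              if ((buf[i] >> bit) & 1)
                  cpu.cpl_c();         // one toggle per set bit
      return cpu.cy;                   // true: odd number of bits set
  }

  int main()
  {
      I8052 cpu;
      const uint8_t word[8] = { 0x01, 0x02, 0x04, 0, 0, 0, 0, 0 };
      assert(odd_parity(cpu, word, sizeof word));   // three bits set -> odd parity
      return 0;
  }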

Along the way we have attended to much other stuff: tracing, Python code for decoding scan-chains, "mega components" etc., and, notably, Python-generated SystemC component models.

Initially, all 12 thousand electrical networks in the simulated part of the system were sc_signal_resolved instances.

sc_signal_resolved is the most general signal type in SystemC: it has four possible levels, '0', '1', 'Z' and 'X', and allows multiple 'writers', but it is therefore also the slowest.

Migrating to faster types, bool for single-wire binary networks and uint%d_t for single-driver binary busses, requires component models for all the combinations we may encounter, and writing those by hand got old really fast.

For true Tri-state signals, we will still need to use the sc_signal_resolved type, but a lot of Tri-state output chips are used as binary drivers, by tying their OE pin to ground, so relying on the type of a component to tell us what type its output has misses a lot of optimization opportunities.
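
For illustration, here is a minimal SystemC sketch (ours, invented for this note, not code from the emulator; module and net names are made up) of the two kinds of nets:

  // One driver that sometimes releases its net, which therefore must be an
  // sc_signal_resolved, and one that always drives, where a plain bool
  // signal suffices and is much cheaper to evaluate.
  #include <systemc>
  #include <iostream>
  using namespace sc_core;
  using namespace sc_dt;

  SC_MODULE(Drivers)
  {
      sc_out_resolved tri;      // '0'/'1'/'Z'/'X', multiple writers allowed
      sc_out<bool>    bin;      // two-level, single writer

      SC_CTOR(Drivers) { SC_THREAD(run); }

      void run()
      {
          tri.write(SC_LOGIC_Z);    // tri-state driver lets go of the net
          bin.write(true);          // binary driver always drives 0 or 1
          wait(10, SC_NS);
          tri.write(SC_LOGIC_1);
      }
  };

  int sc_main(int, char **)
  {
      sc_signal_resolved net_a("net_a");
      sc_signal<bool>    net_b("net_b");
      Drivers d("d");
      d.tri(net_a);
      d.bin(net_b);
      sc_start(20, SC_NS);
      std::cout << net_a.read() << " " << net_b.read() << std::endl;   // prints "1 1"
      return 0;
  }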

And thus we now have Python "models" of components, which automatically produce adapted SystemC component models.

Here is an example of the 2149 SRAM model:

class SRAM2149(PartFactory):

    ''' 2149 CMOS Static RAM 1024 x 4 bit '''

    def state(self, file):
        ''' Extra state carried by the generated SystemC model '''
        file.fmt('''
                |       uint8_t ram[1024];
                |       bool writing;
                |''')

    def sensitive(self):
        ''' Sensitivity list: all pins except the data pins and constant nets '''
        for node in self.comp:
            if node.pin.name[0] != 'D' and not node.net.is_const():
                yield "PIN_" + node.pin.name

    def doit(self, file):
        ''' The meat of the doit() function '''

        super().doit(file)

        file.fmt('''
                |       unsigned adr = 0;
                |
                |       BUS_A_READ(adr);
                |       if (state->writing)
                |               BUS_DQ_READ(state->ram[adr]);
                |
                |''')

        if not self.comp.nodes["CS"].net.is_pd():
            file.fmt('''
                |       if (PIN_CS=>) {
                |               TRACE(<< "z");
                |               BUS_DQ_Z();
                |               next_trigger(PIN_CS.negedge_event());
                |               state->writing = false;
                |               return;
                |       }
                |''')

        file.fmt('''
                |
                |
                |       if (!PIN_WE=>) {
                |               BUS_DQ_Z();
                |               state->writing = true;
                |       } else {
                |               state->writing = false;
                |               BUS_DQ_WRITE(state->ram[adr]);
                |       }
                |       TRACE(
                |           << " cs " << PIN_CS?
                |           << " we " << PIN_WE?
                |           << " a " << BUS_A_TRACE()
                |           << " dq " << BUS_DQ_TRACE()
                |           << " | "
                |           << std::hex << adr
                |           << " "
                |           << std::hex << (unsigned)(state->ram[adr])
                |       );
                |''')

Notice how the code to put the output in high-impedance "3-state" mode is only produced if the chip's CS pin is not pulled down.

Note also that the code handles the address bus and the data bus as units, by calling C++ macros generated by common Python code. This allows the same component model to be used for wider "megacomp" variants of the components.
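
Purely as an illustration of the idea (this is our sketch, not the actual generated macros), treating a group of single-bit pins as one bus value, independent of its width, could look something like this in plain C++:

  // The Python framework emits per-component helpers so the C++ model can
  // read or write an entire bus in one call, whatever the width happens to be.
  #include <array>
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>

  template <std::size_t N>
  uint64_t bus_read(const std::array<bool, N> &pins)
  {
      uint64_t value = 0;
      for (std::size_t i = 0; i < N; i++)          // pin 0 is the most significant bit
          value = (value << 1) | (pins[i] ? 1u : 0u);
      return value;
  }

  template <std::size_t N>
  void bus_write(std::array<bool, N> &pins, uint64_t value)
  {
      for (std::size_t i = 0; i < N; i++)
          pins[i] = (value >> (N - 1 - i)) & 1u;
  }

  int main()
  {
      std::array<bool, 10> a_pins{};               // a 10 bit address bus, A0..A9
      bus_write(a_pins, 0x2a5u);
      std::printf("%03llx\n", (unsigned long long)bus_read(a_pins));   // prints 2a5
      return 0;
  }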

This is particularly important for the MEM32 board, which has 64(Type)+64(Value)+9(Ecc) DRAM chips in each of the two memory banks. The simulation runs much faster with just two "1MX64" and one "1MX9" components than it does with 137 "1MX1" components in each bank.

This optimization is what disabused us of the notion that the CHECK_MEMORY_ONES.M32 experiment hung: it did not, it just took several hours to run, and it is run once for each of the eight "sets" of memory.

With the current 11 failures, the entire MEM32 test takes 140 seconds of simulated time, which is 7½ hours of wall-clock time in our fastest "megacomp2" version of the schematics on our fastest machine.

However our "CI" machine is somewhat slower, and runs the un-optimized "main" version of the schematics, which means the next daily "CI" run is started before the previous one completed, and with them using the same filenames, they both crash.

So despite the world distracting us with actual work, travel, talks, social events, and notably the first ever opening of Datamuseum.dk for the public, we are still making good progress.

2022-03-06 Do not optimize until it works, unless …

It is very old wisdom in computing that it does not matter how fast you can make a program which does not work, and usually we stick firmly to that wisdom.

However, there are exceptions, and the R1000-emulator is one of them.

When the computer was designed, the abstract architecture had to be implemented with the chips available in the 74Sxx and later 74Fxx TTL families, and since those families contain no 64 bit buffers, a buffer for one of the 64 bit busses was decomposed into 8 parallel 8 bit busses, each running through its own 74F245 chip, and so on.

In hardware the 8 chips operate in parallel; in software, at least with SystemC, they are evaluated sequentially, so there is a performance impact.

What is worse, there is a debugging impact as well: instead of the trace-file showing the state of all 64 bits in a single line, it contains eight lines of 8 bits each, in random order.

Therefore we operate with three branches in the R1000.HwDoc github repository: "main", "optimized" and "megacomp".

"Main" is the schematics as they are on paper. That is the branch reported in the table above.

"Optimized" is primarily deduplication of multi-buffered signals, that is signals where multiple outputs in parallel are required to drive all the inputs of that signal, a canonical example being the address lines of the DRAM chips on the MEM32 board.

Finally in "megacomp" we invent our own chips, like a 64 bit version of the 74F245, whereby we both improve the clarity of the schematics, and make the simulation run faster, almost twice as fast as "main" at this point.

Here is the same table as above, for the "megacomp" branch, and run on the fastest CPU currently available to this project:

Test Wall Clock (s) SystemC (s) Ratio (SystemC/Wall) Exp run Exp fail
expmon_reset_all 51.787 0.026151 1/1980.3 0 0
expmon_test_fiu 1275.507 17.799928 1/71.7 95 0
expmon_test_ioc 1018.086 11.231571 1/90.6 29 0
expmon_test_mem32 5331.993 30.000000 1/177.7 28 9
expmon_test_seq 1183.407 13.081077 1/90.5 108 32
expmon_test_typ 3629.642 7.468383 1/486.0 73 2
expmon_test_val 3625.022 7.434761 1/487.6 66 0
novram 69.302 0.035584 1/1947.6 0 0

Note that the megacomponents have caused one of the TYP tests to fail, so the old wisdom does apply after all. (The table shows two failures because both the individual test and the entire test-script report "FAILED" on the console.)

2022-03-05 We will not need to emulate the ENP-100

The R1000/s400 has two Ethernet interfaces: one is on the internal IO bus and can be directly accessed by the IOC and, presumably, the R1000 CPU; the other is on a VME daughter-board mounted on the RESHA board.

Strangely enough, the TCP/IP protocol only seems to be supported on the latter, whereas the "direct" ethernet port is for use only in cluster configurations.

The VME board is an "ENP-100" from Communication Machinery Corp. of 125 Cremona Drive, Santa Barbara, CA 93117.

R1000 enp100.jpg

The board contains a full 12.5 MHz 68020 computer, including boot-code EPROMs, 512K RAM, two serial ports and an Ethernet interface.

The firmware for this board is downloaded from the R1000 CPU, and implements a TCP/IP stack, including TELNET and FTP services.

Interestingly, the TCP/IP implementation ignores all packets with IP or TCP options, so no contemporary computer can talk to it until "modern" options are disabled.

We have no hardware documentation for the ENP-100 board, but we expect emulation is feasible, given enough time and effort.

Fortunately it seems the R1000 can boot without the ENP-100 board; it complains a bit, but it boots.

That takes emulation of the ENP-100 off the critical path, and makes it optional whether we ever do it at all.

2022-03-01 The fish will appreciate this

Below the R1000 two genuine and quite beefy Papst fans blow cooling air up through the card-cage.

For a machine which most likely will end up in a raised-floor computing room, it can be done no other way.

However, if the machine is housed anywhere else, an air-filter is required to not suck all sorts of crap into the electronics.

And of course, air-filters should be maintained, so we pulled out the fan-tray and found that the filter mat was rapidly deteriorating, literally falling apart.

Not being air-cooling specialists, we initially ordered normal fan-filters, the kind that looks like loose felt made of plastic, but the exhaust temperature on the top of the machine climbed to over 54°C.

So what was the original filter material, and where could we buy it?

It looks a lot like the material used on the front of the obscure but deservedly renowned concrete Rauna Njord speakers, designed by Bo Hansson in the early 1980s, and that material also fell to pieces after about a decade.

Surfing fora where vintage hifi-nerds have restored Rauna Njord speakers, we found "PPI 10 polyurethane foam" mentioned, and that turns out to be what water filters for aquariums are made from.

A trip to the local aquarium shop got us a 50x50x3cm sheet of filter material, and the promise that the fish will really appreciate us buying it.

We cut a 12.5cm wide strip and split it lengthwise into two slices of roughly equal thickness, using two wooden strips as guides and a sharp bread-knife.

It is almost too bad that one cannot see the sporty blue color when it is mounted below the R1000:

Luftfilter1.png

The two small pieces in the middle are the largest fragment of the old air filter and an off-cut from the 17mm thick slice:

Luftfilter2.png

In the spirit of scientific inquiry, we measured the temperature with both thicknesses.

With the 17mm thick filter, the exhaust temperature rose to above 52°C.

With the 13mm thick filter, it stabilized around 41°C.

That is a pretty convincing demonstration of the conventional wisdom, that axial fans should push air, not pull it.

So why are the filters on the "pull" side of the fans in the R1000, when the fan-tray is plenty deep for filters to be mounted on the "push" side?

Maybe this is an after-market modification, trying to convert an unfiltered "data-center fan-tray" into a filtered "office fan-tray" ?

2022-02-20 TEST_VAL.EM passes

We're making progress.

Now we are going to focus on TYP, where most of the failing tests have something to do with parity checking.

2022-02-12 TEST_FIU.EM passes

As can be seen in the status above, there are no longer any failures on the FIU board when running TEST_FIU.EM.

The speed in the table above is when simulating the unadulterated schematics.

Concurrent with fixing bugs we are working on two levels of optimized schematics, one where buffered signals are deduplicated, and one where we use "megacomponents", for instance 64 bit wide versions of 74F240 etc.

The "megacomp" version of FIU runs twice as fast, 1.4% of hardware speed.

2022-01-10 SystemC performance is weird

As often alluded to, the performance of a SystemC simulation is … not ideal … from a usability point of view, so we spend a lot of time thinking about it and measuring it, and it is not at all intuitive for software people like us.

Take the following (redrawn) sheet from the IOC schematic (click for full size)

IOC RESPONSE FIFO

This is the 2048x16 bit FIFO buffer through which the IOC sends replies to the R1000 CPU.

None of the tests in the "TEST_IOC.EM" file gets anywhere near this FIFO, yet the simulation runs 15% faster if this sheet is commented out, because this sheet uses a lot of free-running clocks:

   1 * 2X~     @ 20 MHz        20 MHz
   2 * H2.PHD  @ 10 MHz        20 MHz
   2 * H1E     @ 10 MHz        20 MHz
   1 * H2E     @ 10 MHz        10 MHz
   1 * Q1~      @ 5 MHz         5 MHz
   2 * Q2~      @ 5 MHz        10 MHz
   1 * Q3~      @ 5 MHz         5 MHz
   ----------------------------------
   Simulation load             90 MHz

Where the clocks feed into edge-sensitive chips, for instance "Q1~" into "FOREG0" (left center), only one of the flanks needs to be simulated, but where they feed state-sensitive gates, like "2X~" into "FFNAN0A" (near the top), the component's class instance is called for both flanks, effectively doubling the frequency of the 10MHz clock signal.
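
A minimal sketch (ours, not code from the emulator; module and method names are invented) of the difference: a process sensitive to one clock edge runs once per period, while a process sensitive to the signal itself runs on every value change, i.e. twice per period:

  // Count how often each process is evaluated for the same 10 MHz clock.
  #include <systemc>
  #include <iostream>
  using namespace sc_core;

  SC_MODULE(Counters)
  {
      sc_in<bool> clk;
      int edge_calls = 0, level_calls = 0;

      SC_CTOR(Counters)
      {
          SC_METHOD(on_edge);
          sensitive << clk.pos();      // edge-sensitive: positive flank only
          dont_initialize();

          SC_METHOD(on_level);
          sensitive << clk;            // state-sensitive: every value change
          dont_initialize();
      }

      void on_edge()  { edge_calls++; }
      void on_level() { level_calls++; }
  };

  int sc_main(int, char **)
  {
      sc_clock clk("clk", 100, SC_NS);     // 10 MHz
      Counters c("c");
      c.clk(clk);
      sc_start(1, SC_US);
      // expect roughly twice as many level calls as edge calls
      std::cout << c.edge_calls << " vs " << c.level_calls << std::endl;
      return 0;
  }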

To make matters even worse, there is an identical FIFO feeding requests the opposite way, from R1000 to IOC, on the next sheet.

And to really drive the point home, all the simulation runs will have to include the IOC board.

In SystemC, a FIFO is one of the primitive objects, which could simulate these two pages much faster than this, but to do that we need enough of the machine simulated well enough to run the experiments which test the FIFOs.
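
For reference, this is roughly what the two sheets boil down to when expressed with SystemC's built-in sc_fifo primitive (our sketch, with invented names, not the emulator's eventual implementation):

  // A 2048-entry, 16-bit-wide FIFO as a single SystemC primitive channel,
  // instead of RAMs, counters and free-running clocks on two schematic sheets.
  #include <systemc>
  #include <cstdint>
  #include <iostream>
  using namespace sc_core;

  SC_MODULE(FifoDemo)
  {
      sc_fifo<uint16_t> response_fifo;

      SC_CTOR(FifoDemo) : response_fifo(2048)
      {
          SC_THREAD(producer);   // stands in for the IOC side
          SC_THREAD(consumer);   // stands in for the R1000 CPU side
      }

      void producer()
      {
          response_fifo.write(0x1234);   // blocks only if the FIFO is full
          response_fifo.write(0xbeef);
      }

      void consumer()
      {
          for (int i = 0; i < 2; i++)    // blocks until data is available
              std::cout << std::hex << response_fifo.read() << std::endl;
      }
  };

  int sc_main(int, char **)
  {
      FifoDemo demo("demo");
      sc_start();
      return 0;
  }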

Until then, we can save oceans of time by simply commenting these two FIFOs out.

2022-01-08 Making headway with FIU

We are making headway with the simulated FIU board: currently 19 tests fail, the 16 "Execute from WCS" tests and three parity-related tests. We hope the 16 WCS tests have a common cause.

On the FIU we have found the first test-case which depends on undocumented behaviour: TEST_ABUS_PARITY.FIU fails if OFFREG does not have even parity when the test starts.

Simulating the IOC and FIU boards, the simulation currently clocks around 1/380 of hardware speed; if the TYP, VAL and SEQ boards are also simulated, speed drops to 1/3000 of hardware speed. Not bad, but certainly not good.

We have started playing with "mega-symbols", for instance 64-bit versions of the 74F240, 74F244 and 74F374. There is a speed advantage, but the major advantage right now is that debugging operates on the entire bus-width at the same time.

2021…2012

  • 2014-2018 - The project got stuck for lack of a sufficiently beefy 5V power-supply, and then phk disappeared while he built a new house.

Many thanks to

  • Erlo Haugen
  • Grady Booch
  • Grek Bek
  • Pierre-Alain Muller
  • Pascal Leroy
  • Michael Druke