kernel 2.0.31 vs 2.0.30

Tuesday October 28 13:35:48 CET 1997

Since the question has come up here of whether and why to move to 2.0.31,
the following post may be of interest.  It responds to reports that some
people had strange problems with 2.0.31 (sig11 during compilation,
nonsensical output from ls, and so on) that the older version did not show.

Warning: it is LONG. (But interesting.)

--Pavel Kankovsky aka Peak (troja.mff.cuni.cz network administration)

---------- Forwarded message ----------
Date: Tue, 28 Oct 1997 01:09:44 -0600 (CST)
From: Doug Ledford <dledford at dialnet.net>
To: Mike Frisch <mfrisch at saturn.tlug.org>
Cc: mustang-list at redhat.com, d Brian Hall <brihall at pcisys.net>,
    redhat-list at redhat.com, Dave Wreski <dave at nic.com>
Subject: Re: 2.0.31 source (WAS: RE: Looking to upgrade/downgrade... )

On 28-Oct-97 Mike Frisch wrote:
>On Mon, 27 Oct 1997, Dave Wreski wrote:
>
>> I've even compiled many kernels using 'make -j' which says compile in
>> parallel, which essentially uses all the system resources I've got, and
>> haven't had any problems..
>> 
>> Sounds like there is something else wrong here -- change your RAM, or see
>> the sig11 web page:
>
>No, sorry, I don't buy it...  This hardware has _never_ experienced any
>SIG11s (or abnormal behaviour indicating a hardware problem) and does not
>experience _any_ SIG11s with 2.0.29.  I can build a kernel with "make -j5" 
>and bring the system to it's knees, but it _never_ SIG11s under 2.0.29. 
>2.0.31 gets regular SIG11s.  As much as that page likes to convince people
>it's not a software problem, in this case I am totally convinced it is.

Without being too blunt (hopefully), if you are so convinced that things are
absolutely software related, then you should review some basic concepts of
hardware/software diagnostics.

For a more complete explanation of the above assertion, please follow along:

In your email, you indicate that your computer system has problems compiling
source code when running 2.0.31.  You also indicate that you don't have
these problems when using 2.0.29.  The problem report seems to indicate a
generic problem with compiling, not specifically that only kernel compiles
fail.  It also seems to imply that the 2.0.31 source code itself is not the
problem (since we are assuming you had to succeed at compiling 2.0.31 under
some earlier kernel and haven't noted a problem with that compilation).
From this evidence, you have made a blanket indictment of the 2.0.31 kernel.
You have also given a blanket assertion that the hardware is not at fault.

Now, your conclusion is not supported by your evidence, and here's why. 
These are the facts that you have exposed:

1.  Compiles complete successfully under 2.0.29
2.  Compiles fail under 2.0.31
3.  The success or failure of a compile appears to be related to the running
    kernel in use, not to the source code being compiled.
4.  Your current hardware works fine with kernel version 2.0.29

These are the implicit presumptions that your listed facts require in order
to support your conclusion:

A.  If any given piece of hardware works reliably/properly under kernel
    version 2.0.29, then it will also work under 2.0.31 unless the 2.0.31
    kernel is broken.
B.  The SIG11 errors that occur under 2.0.31 when compiling are a result
    of either the software or hardware not working properly under 2.0.31.
C.  In accordance with presumption A, the problem cannot be the hardware.

Conclusion: Therefore, the hardware in your system is not at fault, but the
2.0.31 kernel is instead the culprit of your SIG11 errors.

At first glance, someone might think the above statements were logical.
However, given enough time and effort (or past experience), counter examples
to presumption A can be provided.  The existence of a counter example to a
presumption is sufficient to nullify the validity of that presumption.

Now, without going into a full-blown logical proof (which I can provide upon
request), suffice it to say that one such example is the change from 1.2.13
to 1.2.13-LMP kernels.  Certain core string operations in the kernel were
changed in the later kernel version in order to improve performance.  The
overall speed impact was roughly a 1 to 2% performance gain, which seems
rather small.  However, given that the new code was only used in small areas
of the kernel, an overall 1 to 2% performance increase indicates a much
greater impact inside the changed code itself (the changes were in such core
areas as include/asm-i386/string.h, etc., and consisted mostly of rewrites
of commonly used in-line assembly language routines to increase the speed of
those routines).  It was demonstrated that on various systems, these changes
went unnoticed aside from the performance increase.  On other systems, these
changes caused data corruption and SIG11 errors (as a result of corrupted
data).  On the systems that had trouble, it was also found that reducing the
aggregate memory access timings in the motherboard BIOS was sufficient to
solve the problem.

Another truth in the computer world (again, I skip the proof, but can
provide it upon request) is that the motherboard chipset is responsible for
controlling CPU and DMA access to main memory, for arbitrating between the
two in times of conflict, and for performing refresh operations on that same
main memory.  With properly operating hardware and properly configured
memory access timings, it is not possible for a programmer to overrun the
available memory bandwidth and cause resultant data corruption without
either intentionally or accidentally re-writing the timing values in the
chipset registers.  Therefore, if the same code works improperly at a given
RAM timing spec, and then works properly at another RAM timing spec, it
indicates either a faulty piece of hardware or faulty RAM timing specs in
the failing case.

Knowing this counter example and the assertion of chipset responsibility and
failure modes allows us to nullify presumption A.  Without presumption A,
presumption C also falls.  When those two presumptions are removed, you are
left with the following conclusion:

Conclusion: Therefore, it is unknown if the SIG11 problems that occur on my
system are a result of faulty hardware or faulty software.

Now, if we add to your argument the following facts:

5.  We now have knowledge of prior instances where increases in kernel
    performance revealed latent hardware problems.
6.  Kernel 2.0.31 has known performance increases in several areas, such
    as the tulip and aic7xxx drivers and the TCP/IP subsystem in general.
7.  These performance increases are known to increase Bus-Master DMA memory
    accesses and/or CPU memory accesses.
8.  Increases in DMA and CPU memory accesses are relevant to latent RAM
    refresh deficiencies.

We can now update the conclusion as follows:

Conclusion:  Therefore, it is unknown if the SIG11 problems that occur on my
system are a result of faulty hardware or software, and additionally it
would be premature to rule out or to stop investigating the hardware
relation in these problems.

Now, based on this conclusion, there are several areas left to investigate.

First, there is no sense in ruling out hardware, so the possible areas to
investigate are the RAM timings in the BIOS, the cache timings in the BIOS
(or, for a conclusive test of the caching system, disable both the CPU
internal cache and the external cache entirely and see how things go), and
the IDE timings in the BIOS (assuming you use IDE drives).
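
As a rough illustration of the kind of check one might run after changing a
BIOS timing, here is a minimal Python sketch of a write-then-read-back
pattern test.  It is not part of the original message: the function name
count_mismatches and the pattern bytes are made up for illustration.  It
runs in user space and goes through the CPU caches, and it does not exercise
bus-master DMA at all, so a clean result proves very little; it only shows
the idea of detecting data that does not come back as written.

```python
def count_mismatches(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern byte, read it back, count bad bytes.

    Hypothetical helper for illustration only; the pattern set is an
    arbitrary choice, not something taken from the original message.
    """
    buffer = bytearray(size_bytes)
    mismatches = 0
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buffer[:] = expected  # write pass: overwrite the whole buffer
        if buffer != expected:  # read-back pass: cheap whole-buffer compare
            # Only walk the buffer byte by byte when something differs.
            mismatches += sum(1 for got in buffer if got != pattern)
    return mismatches
```

A call such as count_mismatches(64 * 1024 * 1024) would check a 64 MB
buffer; any non-zero result on otherwise idle hardware would be a reason to
look harder at the memory and cache settings.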

The first steps to take in investigating the kernel are as follows.  First,
make sure PCI BRIDGE OPTIMIZATIONS are turned off, as this code does directly
fiddle with chipset registers, and on the off chance that it is mistakenly
fiddling with your RAM timings, you don't want it to interfere.  Also,
disable DMA mode on your IDE driver for the purpose of testing.  You can also
try compiling your kernel with the target being a 486 instead of Pentium or
other CPU to make sure that GCC optimizations aren't causing problems
(although GCC optimization problems can legitimately be either hardware or
software problems in the end, as this effectively slows the kernel down in
most cases).

If no single one of these things helps you, then there is probably some
incompatibility between some piece of hardware you have and the newer
kernel, most likely in some driver.  If one of these things does help you,
or some combination of these things helps you, then it would depend on
exactly what that thing/combination was as to where to proceed next in either
fixing the problem or determining if it is hardware or software caused.
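
To compare one changed setting at a time, it helps to count failures rather
than rely on impressions.  The following is a minimal Python sketch, not
part of the original message, that runs a build command several times and
counts the runs that look like SIG11 failures.  The helper names and the
marker strings are assumptions: make normally reports its own exit status
rather than the signal that killed the compiler, so the sketch also scans
the output for text such as "signal 11", and the exact wording printed by a
given gcc may differ.

```python
import signal
import subprocess
import sys

# Assumed marker text; adjust to whatever your compiler actually prints.
SIG11_MARKERS = ("signal 11", "Segmentation fault")


def looks_like_sig11(returncode, output, markers=SIG11_MARKERS):
    """True if the process died of SIGSEGV or its output mentions one."""
    if returncode == -signal.SIGSEGV:
        return True
    return any(marker in output for marker in markers)


def count_sig11(command, runs):
    """Run `command` `runs` times and count the runs that look like SIG11."""
    hits = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so one scan covers both
            universal_newlines=True,
            errors="replace",  # tolerate odd bytes in compiler output
        )
        if looks_like_sig11(result.returncode, result.stdout):
            hits += 1
    return hits
```

For example, count_sig11(["make", "-j5"], 5) run from a kernel source tree,
once per BIOS or kernel setting under test, gives a number per setting that
can be compared directly.  A clean tree before each run would be needed for
the runs to be comparable; that step is left out of the sketch.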

----------------------------------
E-Mail: Doug Ledford <dledford at dialnet.net>
Date: 28-Oct-97
Time: 01:09:48
----------------------------------