select bug?

Úterý Září 28 19:57:15 CEST 1999

On 28 Sep 1999 10:10:03 +0100, Pavel Machek wrote:

:> IMHO jasny kandidat na otazku do linux-kernel.
:> Poslete ju Vy, alebo mam pripadnu blamaz
:> riskovat sam? :-)

Uz som poslal :-) Zatial to nikoho nezaujalo, az
na "napad" nastavovat velkost buffera cez ioctl ...
Zrejme to chce rovno urobit patch, ale natolko
zase do jadra nevidim, aby som vedel odhadnut
dosledky nejakeho zasahu. Pre informaciu prikladam
text, ktory som do toho l-k poslal.

: Vypada to ze pollwrnorm se opravdu vraci az kdyz je pipa prazda... A
: rekl bych ze PIPE_BUF je opravdu tim duvodem.

Tak sa zda. Lenze to je podla mna skutocne pruser
(na druhej strane ma milo prekvapilo, ze Linux
zvlada 75000 context-switchov za sekundu
na uz pomerne odkvitnutom hardware) :-)

=== snip ===
Hello,

I am summarizing a discussion that took place in
a local mailing list.

Given following program

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>

int can_write(int fd)
{
      fd_set fds;
      struct timeval tv = {0, 0};
      FD_ZERO(&fds);
      FD_SET(fd, &fds);
      return select(fd + 1, NULL, &fds, NULL, &tv);
}

int main()
{
      int p[2];
      pipe(p);
      write(p[1], "x", 1);
      printf("%d\n", can_write(p[1]));
      return 0;
}

the output is that there is no possibility to write
to a pipe filedescriptor, when there is something
in the pipe that was not read. The behaviour
is the same regardless of the pipe being a loop
in the same process or 'regular' interprocess pipe.

The theory is that if the PIPE_BUF is set to the actual
kernel buffer size for the pipe, it is in fact impossible
to do anything else - the pipe writes up to PIPE_BUF
are guaranteed to be atomic and if the descriptor selects
for write, it should be guaranteed that the next write
does not block. As the amount of the data in the send
is not known at the time of select, it is impossible
to let the user do a write when the remaining size of
the kernel buffer is less than PIPE_BUF.

This is a bad news - many programs communicate via pipes
and the protocol is not always request-response - a simple
signalling of some non-fd events is often also done this
way. This situation forces completely unnecessary context
switching between such processes. If I do a simple
two-process test sending one byte each other:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>

int main()
{
    int p[2];
    pipe(p);

    if (fork())
    {
        int x = 0;
        close(p[0]);
        while(1)
	{
            fd_set fds;
            struct timeval tv = {1, 0};
            FD_ZERO(&fds);
            FD_SET(p[1], &fds);
            if (select(p[1] + 1, NULL, &fds, NULL, &tv) > 0)
	    {
                write(p[1], "x", 1);
	        x++;
     	        if (! (x % 100000))
     	            fprintf(stderr, ".");
	    }
	}
    }
    else
    {
        char foo;
	close(p[1]);
        while(1)
	   read(p[0], &foo, 1);
    }
}

the context-switch rate (as seen in /proc/stat) rockets
to about 75000 / sec, the data rate is half of that and the
user:system time ratio is about 20:80 (2.2.12, PPro 166 MHz)

The possible solution seems to either enlarge the buffer
in kernel, or to reduce the PIPE_BUF - the ratio 1:2
seems optimal.
=== snip ===

Zdravi
-- 
				Stano