[Unix] Pipes for parallel processing

You have a list of the most common English words and want to check every domain of the form https://WORD.com. There is a good chance that such a domain is registered, and you can check it with curl.

Obviously, this is a job for parallel processing: run a hundred curl processes in parallel. The Go PL book has many examples like that, implemented with Go channels.

Of course, Go PL is cool. But my philosophy is to use the simplest possible tool for a task. I know that a Go channel is a sophisticated/advanced Unix pipe. Can I use bash and pipes for that?
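The basic mechanics of a named pipe with a 'stop' sentinel can be tried in isolation first. This is only a minimal sketch, not the scripts from this post: it uses a single worker that reads line by line, the producer opens the pipe just once, and nothing touches the network. The names demo_dir, fifo, result and worker are hypothetical, chosen for illustration.

```shell
#!/usr/bin/env bash
# Minimal sketch: one producer, one background worker, one named pipe.
# demo_dir, fifo, result and worker are illustrative names only.
demo_dir=$(mktemp -d)
fifo="$demo_dir/pipe"
result="$demo_dir/result.txt"
mkfifo "$fifo"

# Worker: read lines from the pipe until the 'stop' sentinel arrives.
worker() {
        while IFS= read -r word
        do
                [ "$word" = "stop" ] && break
                # A real worker would run curl here; this one only prints.
                echo "would check https://$word.com"
        done < "$fifo" > "$result"
}
worker &

# Producer: open the pipe once, send two words, then the sentinel.
{
        printf '%s\n' apple banana
        echo stop
} > "$fifo"

# Wait for the background worker, then show what it wrote.
wait
output=$(cat "$result")
echo "$output"
rm -r "$demo_dir"
```

Opening the pipe blocks until both a reader and a writer are present, which is why the worker has to be started in the background before the producer writes anything.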

Here I create a pipe, run the workers in the background, read the list of English words and feed it to the pipe. Then I send a 'stop' string to the pipe, once for each worker, and wait for complete shutdown.

#!/usr/bin/env bash

# run it as: cat words.txt | ./main.sh

WORKERS=128
PIPE_FNAME=testpipe
mkfifo "$PIPE_FNAME"
# -p: do not fail if out/ is left over from a previous run
mkdir -p out

# run all workers in the background; each gets the pipe name and its own number:
for i in $(seq 1 $WORKERS)
do
        ./worker.sh "$PIPE_FNAME" "$i" &
done

# read words from stdin and pass them to the workers, one per line
while IFS= read -r line
do
        #echo main: $line
        echo "$line" > "$PIPE_FNAME"
        # flush pipe. https://stackoverflow.com/questions/3348614/how-to-flush-a-pipe-using-bash
        dd if="$PIPE_FNAME" iflag=nonblock of=/dev/null 2> /dev/null
done

# send one 'stop' per worker:
for i in $(seq 1 $WORKERS)
do
        echo stop > "$PIPE_FNAME"
        dd if="$PIPE_FNAME" iflag=nonblock of=/dev/null 2> /dev/null
done

# wait for all workers to exit:
wait

rm "$PIPE_FNAME"

And this is worker.sh, which reads from the pipe and runs curl. It exits when it gets the 'stop' string:

#!/usr/bin/env bash
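# $1 is the pipe name, $2 is the worker number
while true;
do
        # cat blocks until a writer opens the pipe, then reads until the writer closes it
        tmp=$(cat "$1")

        if [ "$tmp" = "stop" ]; then
                echo "$2 stopping"
                exit
        fi

        if [ -z "$tmp" ]
        then
                true
                #echo empty string received
        else
                echo "$2 received: $tmp"
                DOMAIN="https://$tmp.com"
                FNAME="out/$tmp.com"
                # skip words that already have an output file
                if [ ! -f "$FNAME" ]; then
                        curl --connect-timeout 5 "$DOMAIN" > "$FNAME" 2>/dev/null
                        #curl --connect-timeout 5 $DOMAIN > $FNAME
                fi
        fi
done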
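After a run, out/ holds one file per word. A rough way to see which domains answered with anything is to list the non-empty files. This is only a sketch under an assumption: an empty file is taken to mean that curl wrote nothing, which covers a failed connection but also a server that returned an empty body. The function name list_nonempty and the demo directory are hypothetical, and the demo uses fake files instead of real curl output.

```shell
#!/usr/bin/env bash
# Sketch: print the names of non-empty WORD.com files in a directory.
# Assumption: an empty file means curl produced no output for that word.
list_nonempty() {
        # $1 is the output directory (out in the scripts above).
        for f in "$1"/*.com
        do
                # -s is true only for files that exist and are larger than zero bytes.
                [ -s "$f" ] && basename "$f"
        done
        return 0
}

# Demo on a throwaway directory with fake results (no network involved).
demo_out=$(mktemp -d)
echo '<html>hi</html>' > "$demo_out/apple.com"
: > "$demo_out/zzzqqq.com"
found=$(list_nonempty "$demo_out")
echo "$found"
rm -r "$demo_out"
```

On the real data this would be called as list_nonempty out.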

This may be the simplest possible solution, but it has problems.



(This post was first published on 20220827.)

