
Parallel processing is easy in the shell.

a | b | c

will run a, b and c in parallel, passing data from one to the next. There is some sequencing here, implicit in the way each command does its business: if you are passing a file through commands that process one line at a time, there is a lovely sequence of lines flowing down the pipes in staggered formation, and the number of lines being processed at once approaches the number of commands in the chain.
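As a trivial, self-contained illustration (standard tools, arbitrary numbers), the following runs four processes at once, each line being handed to the next stage as soon as the previous one emits it:

seq 1000000 | sed 's/$/!/' | grep '0!' | wc -l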

It sounds like a good idea to make these chains really long if you want to maximise that easy parallelism, but there’s only so much you can do passing data along a single line. Named pipes get you around that, and there’s some support already: the tee command, combined with process substitution, lets you “fan out” a tree of pipes. That gives you even more parallelism, but the endpoints of the tree are then probably going to have to be files, which means waiting for all the fun to stop before you can continue. There needs to be a way to close these trees back up and turn the tree into a graph: fan out to extract different bits of data from the input, then merge those bits back together to form the output. Sounds easy to me!
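Here’s a sketch of that fan-out (the output file names are arbitrary). It splits /proc/cpuinfo into two branches, but both branches dead-end in plain files:

cat /proc/cpuinfo \
  | tee >(grep 'physical id' > physical-ids.txt) \
  | grep 'core id' > core-ids.txt

The caller then has to collect both files before doing anything further, and since bash doesn’t necessarily wait for the process substitution to finish before moving on, even knowing when they’re complete is awkward. That’s the stop-the-world step the graph version is meant to avoid.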

I propose two mechanisms: an easy way of constructing temporary named pipes and referring to them later, and a command to merge those pipes back together. I shall call the first utility leaf, referring to the bottom of a pipe tree, and I shall call the graph constructor eet, to refer to its role as (sort of) the inverse of tee.

Leaf creates some temporary pipes, and binds them to the variables passed in. eet deletes the pipes and unsets the associated variables when done with them. eet takes a command string to execute, with the pipe names appearing where file names would, suffixed by @.

e.g.

leaf p
cat /proc/cpuinfo \
  | tee >(egrep 'physical id' | cut -d : -f 2 > $p) \
  | egrep 'core id' | cut -d : -f 2 \
  | eet paste p@ - | sort -u | wc -l

Tells you how many physical cores there REALLY are.
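For comparison, here is roughly the same count written out by hand with an explicit FIFO (path chosen arbitrarily), which is the plumbing leaf and eet are meant to hide:

mkfifo /tmp/physids
cat /proc/cpuinfo \
  | tee >(egrep 'physical id' | cut -d : -f 2 > /tmp/physids) \
  | egrep 'core id' | cut -d : -f 2 \
  | paste /tmp/physids - | sort -u | wc -l
rm /tmp/physids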

I’ve done a tentative implementation below. This represents a couple of hours’ work, so it’s understandably awful. It sort of works.

Update: the command to unset the pipe variables after they’ve been used didn’t do anything; I’ve now wrapped it in an eval, which seems to do the trick (I know…).


# leaf NAME...: create a temporary named pipe for each NAME and export
# its path in a shell variable of that name, for eet to pick up later.
function leaf {
  local arg pipe
  for arg in "$@"; do
    pipe=$(make-leaf-pipe)
    export "$arg=$pipe"
  done
}


# make-leaf-pipe: create a FIFO under $TMPDIR (or /tmp) and print its path.
function make-leaf-pipe {
  local tmp=${TMPDIR:-/tmp}/pipe-graphs
  mkdir -p "$tmp"
  local pipe=$(mktemp -u "$tmp"/XXXXXXXXXXXXX)   # -u: name only; mkfifo creates it
  mkfifo "$pipe"
  echo "$pipe"
}


# eet COMMAND [ARG...]: run COMMAND, replacing every ARG of the form
# name@ with the pipe path stored in $name, then remove those pipes
# and unset the variables that leaf created.
function eet {
  local arg pipen= pipes= argt= arglst= cmd=$1
  shift
  for arg in "$@"; do
    case "$arg" in
      *@) argt=${arg%@}            # strip the trailing @ to get the variable name
          pipen="$pipen $argt"     # remember the variable, to unset later
          argt=${!argt}            # indirect expansion: the pipe's path
          pipes="$pipes $argt"     # remember the pipe, to delete later
          arglst="$arglst $argt"
        ;;
      *)  arglst="$arglst $arg"
        ;;
    esac
  done
  $cmd $arglst          # deliberate word-splitting: arglst is the rebuilt argument list
  rm $pipes
  eval "unset $pipen"   # a bare unset didn't take hold; eval does (see the update above)
}
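A quick smoke test of the functions above (the writer has to be backgrounded, because opening a FIFO for writing blocks until something opens it for reading):

leaf p
printf 'one\ntwo\n' > "$p" &   # feed the pipe from a background job
eet cat p@                     # prints the two lines, then removes the pipe and unsets p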
