
Parallel processing is easy in the shell.

a | b | c

will run a, b, and c in parallel, passing data from one to the next. There is some sequencing here, implicit in the way each command does its business. If you pass a file through commands that process one line at a time, you get a lovely sequence of lines flowing down the pipes in staggered formation, and the number of lines being processed at once approaches the number of commands in the chain.
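
For instance (my illustration, not part of the post), each stage of this pipeline is its own process, so while wc is counting one line, grep can already be filtering the next:

seq 1 1000000 |   # generate a million lines
  grep -v 13 |    # filter them out as they arrive
  wc -l           # count whatever trickles through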

It sounds like a good idea to make these chains really long if you want to maximise that easy parallelism. However, there’s only so much you can do passing data along a single line. Named pipes get you around that, and there’s already some support: the tee command, along with process substitution, allows you to “fan out” a tree of pipes. This gives us even more parallelism, but your endpoints are then probably going to have to be files, which means waiting for all the fun to stop before continuing. There needs to be a way to close these trees back up, and turn the tree into a graph. Then you can fan out, getting different bits of data out of the input, then merge those bits of data together to form the output. Sounds easy to me!
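
As a sketch of the fan-out half (the file names here are invented for illustration), tee plus process substitution splits one stream into two branches, but both branches bottom out in regular files:

cat access.log \
  | tee >(grep ' 404 ' > notfound.txt) \
  | grep ' 200 ' > ok.txt

Both greps run in parallel, but you can’t touch notfound.txt or ok.txt until the whole thing has finished; that’s the wait that closing the tree back up into a graph would remove.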

I propose two mechanisms: an easy way of constructing temporary named pipes and referring to them later, and a command to merge those named pipes back together. I shall call the first utility leaf, referring to the bottom of a pipe tree, and the graph constructor eet, to refer to its role as (sort of) the inverse of tee.

leaf creates some temporary pipes and binds their paths to the variables passed in; eet deletes the pipes and unsets the associated variables when done with them. eet takes a command to execute, with pipe variable names appearing where file names would, each suffixed by @.

e.g.

leaf p
cat /proc/cpuinfo \
  | tee >(egrep 'physical\ id' | cut -d : -f 2 > $p) \
  | egrep 'core\ id' | cut -d : -f 2 \
  | eet paste p@ - | sort -u | wc -l

Tells you how many processors there REALLY are: hyper-threaded siblings share the same (physical id, core id) pair, so counting the unique pairs counts physical cores rather than logical ones.

I’ve done a tentative implementation below. This represents a couple of hours’ work, so it’s understandably awful. It sort of works.

Update: the command to unset the pipe variables after they’ve been used didn’t do anything; I’ve now wrapped it in an eval, which seems to do the trick (I know…)


function leaf {
  # mint a fresh named pipe for each name given, and bind the pipe's
  # path to an (exported) variable of that name in the calling shell
  for arg in "$@"; do
    local pipe=$(make-leaf-pipe)
    eval "export $arg=\$pipe"
  done
}


function make-leaf-pipe {
  # keep our fifos together in a directory under $TMPDIR (or /tmp)
  local tmp=${TMPDIR:-/tmp}/pipe-graphs
  mkdir -p "$tmp"
  # mktemp -u picks a unique name without creating a file;
  # mkfifo then creates the named pipe at that path
  local pipe=$(mktemp -u "$tmp/XXXXXXXXXXXXX")
  mkfifo "$pipe"
  echo "$pipe"
}


function eet {
  local pipen= pipes= argt= arglst= cmd=$1
  shift
  for arg in "$@"; do
    case "$arg" in
      *@) # a pipe reference: strip the @ suffix, then look up the
          # path that leaf bound to the variable of that name
          argt=${arg%@}
          pipen="$pipen $argt"
          argt=${!argt}
          pipes="$pipes $argt"
          arglst="$arglst $argt"
        ;;
      *)  arglst="$arglst $arg"
        ;;
    esac
  done
  # run the command with pipe paths in place of the name@ arguments,
  # then remove the fifos and unset the variables leaf created
  $cmd $arglst
  rm $pipes
  eval "unset $pipen"
}
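
As a quick smoke test of the pair (my example, not from the post): bind two pipes, feed each from a background job, and let eet zip them back together.

leaf evens odds
seq 0 2 8 > $evens &    # each writer blocks until paste opens its fifo
seq 1 2 9 > $odds &
eet paste evens@ odds@  # prints the sequences side by side, then
                        # removes the pipes and unsets evens and odds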

Reading through R6RS. It seems to actually present a really nice system for programming in, even if it’s pretty big compared to R5RS. It feels seriously serious now. Looking through the library section on syntax-case, I found this:

Using datum->syntax, it is even possible to break hygiene entirely and write macros in the style of old Lisp macros.  The lisp-transformer procedure defined below creates a transformer that converts its input into a datum, calls the programmer’s procedure on this datum, and converts the result back into a syntax object scoped where the original macro use appeared.

(define lisp-transformer
  (lambda (p)
    (lambda (x)
      (syntax-case x ()
        [(kwd . rest)
         (datum->syntax #'kwd
           (p (syntax->datum x)))]))))

It’s nice to know that, after spending all that time and effort trying to ensure hygiene, it’s that easy to break.