Saturday, 12 September 2015

Automated Testing in Docker (part 5 of 5)

The first batch of student assignments has been graded, and so this is the penultimate post in this series. Today I want to write a little about the experience of testing via docker. The final post will be some time later as I want to build a queuing system for tests - the long-term idea is that students will have access to the testing facility before their submission, rather than a teacher using it before grading.

Battery consumption on the mac.

Using docker on the mac rapidly makes the Terminal application the most energy-intensive application. Opening Activity Monitor shows that the VBoxHeadless process is running and consuming power in-between test uses. The simple solution is to suspend/resume the virtual machine:

VBoxManage controlvm default pause
...
VBoxManage controlvm default resume

Underspecification in the assignment.

Students seem to have a natural genius for spotting parts of the assignment that are underspecified or ambiguous; then, as a weakly coordinated group, they spread out among the possibilities that are just within the specification, and their work samples the boundary of what is allowed. It's like watching a particle filter explore environmental constraints. Anyway, the compiler assignment they were given this year forgot to mention some things:

  • The output is weakly defined; there is no specification of what should go to stdout, which filenames the dot representation should be saved under, etc.
  • The filenames in the submission tarball are not defined; the reference to the previous assignment is ambiguous and depends on finding abbreviations that make sense to an English speaker. If ass1-int was the name of the first directory then the second should be ass2-comp, but this depends on cultural experience of abbreviation. Amazingly, one student in the batch did actually deduce this from the context, but amongst the others were reasonable estimates of ass2-int, ass1-int and ass2.
  • A problem that occurs in both assignments is that the gold standard of "make a working compiler for this language" has been weakened to support a range of passing grades. The output for the various target grades is not strictly monotonic, so the output for a student targeting a grade E is not compatible with a grade D, for example.
  • The "easier" result for a low grade is harder to check. This was not obvious while writing the assignment, as the focus is upon what they need to do (code) in order to get a passing mark. Because the output for a grade E on the second assignment is not executable - other than a human reading the CFG in the final pdf file - it requires more knowledge to evaluate whether the traces in the testcase are properly shown in the output. Some students forgot edge labels, making their CFGs ambiguous; others missed blocks in the CFG or confused ordering. It is not obvious how to make the "easier result" easier to validate for the student. This point requires some thought.

False positives on plagiarism.

One of the onerous duties of a teacher is that we need to be aware of plagiarism and spot when work has been duplicated. We must also avoid false positives, as the process for handling student plagiarism does not handle aborting cleanly (people are human and tend to get pissy when accused of something they have not done). In this batch it looked like there was a clear case of copying. Very clear cut. Almost too easy to spot.

Experience indicates that the more obvious something is, the harder we have to work to verify that it is true. So in this case the individual tests had to be rerun manually and the code examined against them to check what was happening. Luckily I have a high background level of disbelief, so after checking minutely within the testing environment it was time to dismantle it, and manually unpack and stage the submissions to verify that the testing environment had not magically made their code look the same. In this case this level of technical paranoia was justified, as the testing environment had indeed altered their code to make it look the same. Bad evil docker.

So what happened? If we recall, the file-system inside a container is a union-fs that stacks parts of the FS tree together from more basic pieces. Docker tries to be smart about this, so it uses a cache of those previous pieces. Rerunning the same build instructions over the same source filenames (which have been reloaded with a fresh submission) does not propagate dependencies - Docker is not make. It does not check whether the source has changed; it has a different set of rules for when to rebuild dependencies and when to reuse the cached FS layer. Bad docker.

When tools misbehave we just turn them off, and at least Docker is flexible:

docker build --no-cache -t=current_test .

This disables all caching and ensures that we build the container only from the submitted work and the pieces of the base image - never caching state from another submission. Phew, narrowly avoided disaster.

Return codes.

As far as I can tell Docker may be flaky and crappy at processing the return-codes of RUN commands within build scripts. Or not. The caching issue above may have contributed to executing the wrong executable. It is hard to pin down an exact failure, and the various bug-reports against Docker for this type of issue have been closed. This could have been a ghost issue resulting from the caching problem above. Either way, it does seem safer to write out the return-codes as a result of the test rather than using them to decide if the test continues:

RUN ./comp /testcases/1.lua >redirect.dot; echo $? >/testresults/1.retcode
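
To pull those recorded codes back out afterwards, something like the following works (a sketch only - the retcode_check name is invented here and the 0 is just the illustrative happy case):

$ docker run --name retcode_check current_test /bin/true
$ docker cp retcode_check:/testresults/1.retcode .
$ cat 1.retcode
0
$ docker rm retcode_check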

Performance

Well, it ain't quick. But on the other hand it is much faster than typing in the tests manually. Each run takes about 15 seconds at the moment (although scraping the various filenames together and handling every variation in the underspecified assignment takes many, many minutes). It is definitely slower than something I would stick behind a CGI gateway and give students access to at the moment. One speed-up will be attained by smashing the various RUN commands together into a single shell command separated by semicolons, as sketched below. But it will still be slow enough that it needs to be some kind of batch system. Also, having up to 50 students hitting the testing server at the same time may introduce a latency of 10-15 minutes.
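
The combined step I have in mind looks something like this - only a sketch, with illustrative output filenames, folding the per-test RUN commands into one layer while keeping the return-code trick from above so a failing test does not abort the rest:

RUN mkdir -p /testresults; \
    ./comp /testcases/1.lua >/testresults/1.dot; echo $? >/testresults/1.retcode; \
    ./comp /testcases/2.lua >/testresults/2.dot; echo $? >/testresults/2.retcode; \
    ./comp /testcases/3.lua >/testresults/3.dot; echo $? >/testresults/3.retcode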

Code that breaks

There are generally two ways that student submissions fail to terminate cleanly: either they crash or they hang. Programs that seg-fault show up in the build output, although I need to find a way to record this automatically - at the moment it is manual. Something very nice that is normally hard to do is to recover a stack-trace for the failure.

Running an interactive container on the current_test image allows the installation of gdb; I can manually flip on debug info in the student makefile and then replicate the crash inside a debugger. This allowed me to email stack traces to students who submitted crashing code. I am suitably happy with this :) I also need to think of a way to automate this process.
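
For the record, the manual sequence is roughly the following - the paths are illustrative, and the make invocation assumes the student Makefile picks up CXXFLAGS (otherwise the debug flags get edited in by hand):

$ docker run -it current_test /bin/bash
root@container:/testroot/ass2-comp# apt-get update && apt-get -y install gdb
root@container:/testroot/ass2-comp# make clean && make CXXFLAGS="-g -O0"
root@container:/testroot/ass2-comp# gdb --args ./comp /testcases/1.lua
(gdb) run
(gdb) bt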

Hanging code is a bit harder. I suspect that I will find a cpu quota option in Docker somewhere and put a cpu-timelimit in the assignment specification. This will interact with the performance issues above and affect the maximum latency on a submission queue.
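
Until I find the right Docker-level knob, the coreutils timeout command inside each RUN step is a workable sketch (the 30 second limit is an arbitrary illustration, and the filenames follow the pattern above):

# timeout kills the test after 30 seconds and exits with status 124,
# so a hung submission shows up as retcode 124 instead of stalling the build
RUN timeout 30 ./comp /testcases/1.lua >/testresults/1.dot; echo $? >/testresults/1.retcode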

Conclusions

Grading student work is quite a subjective process that depends on objective testing of the artefact they submit. The testing process is quite error-prone and fragile, much more so than many teachers would like to admit. For the testing process to really be objective it must be made robust; reproducibility of the testing used to inform a grading decision is vital to minimising the subjectivity of that decision. My initial experiences from trialling this approach suggest that it has great potential. I estimate that when students get access to the testing mechanism it could have great benefits for them.

Monday, 7 September 2015

Automated Testing in Docker (part 4 of 5)

Running tests in Docker and getting results


Let's speed up a bit and wrap a testing sequence inside Docker. This is the Dockerfile that we use to build the container. It is only a first step to check that we can get the results that we need - these six tests are not dependent on one another, so a failure in one should not prevent the rest from executing (with separate RUN steps a failing command aborts the whole build, which is something to sort out later). This build script does replicate much of the testing process. Something that is apparent immediately is that this way of working is akin to single-user mode. The file-system can be stomped over (making various top-level directories and ignoring the rest), and the system runs as root so we can ignore file-permissions and users completely.

FROM ubuntu_localdev
ADD student.tgz /testroot/
COPY 1.lua /testcases/
COPY 2.lua /testcases/
COPY 3.lua /testcases/
COPY 4.lua /testcases/
COPY 5.lua /testcases/
WORKDIR /testroot/ass1-int
RUN ls
RUN make
RUN mkdir /testresults/
RUN ./int >/testresults/noargs
RUN ./int /testcases/1.lua >/testresults/1
RUN ./int /testcases/2.lua >/testresults/2
RUN ./int /testcases/3.lua >/testresults/3
RUN ./int /testcases/4.lua >/testresults/4
RUN ./int /testcases/5.lua >/testresults/5

This runs and we can jump in to get results directly (picking a student that did well in the assignment helps - test the positive test cases first and then go back to test the failures):

$ docker build -t=current_test .
Sending build context to Docker daemon 326.7 kB
Step 0 : FROM ubuntu_localdev
 ---> f899773591db
Step 1 : ADD student.tgz /testroot/
 ---> Using cache
 ---> 7efb17601604
Step 2 : COPY 1.lua /testcases/
 ---> Using cache
 ---> e1981c1bd075
Step 3 : COPY 2.lua /testcases/
 ---> Using cache
 ---> a1097f1868e5
Step 4 : COPY 3.lua /testcases/
 ---> Using cache
 ---> 29438a35f659
Step 5 : COPY 4.lua /testcases/
 ---> Using cache
 ---> 185197776daf
Step 6 : COPY 5.lua /testcases/
 ---> Using cache
 ---> 5b10271f9ed8
Step 7 : WORKDIR /testroot/ass1-int
 ---> Using cache
 ---> d5dc06b80cd7
Step 8 : RUN ls
 ---> Using cache
 ---> 39d93f0e0d73
Step 9 : RUN make
 ---> Using cache
 ---> dcfa1cb02748
Step 10 : RUN mkdir /testresults/
 ---> Using cache
 ---> d1f6ece4509a
Step 11 : RUN ./int >/testresults/noargs
 ---> Using cache
 ---> ef45f84616c6
Step 12 : RUN ./int /testcases/1.lua >/testresults/1
 ---> Using cache
 ---> 349c50894154
Step 13 : RUN ./int /testcases/2.lua >/testresults/2
 ---> Using cache
 ---> c5586940a4fb
Step 14 : RUN ./int /testcases/3.lua >/testresults/3
 ---> Using cache
 ---> 888c850b9164
Step 15 : RUN ./int /testcases/4.lua >/testresults/4
 ---> Using cache
 ---> f75184fa8fa8
Step 16 : RUN ./int /testcases/5.lua >/testresults/5
 ---> Using cache
 ---> 61ec01ac3dba
Successfully built 61ec01ac3dba

Have an interactive poke around to check the results make sense...

docker run -it current_test /bin/bash
root@a02b5139bb7f:/testroot/ass1-int# ls /testresults
1  2  3  4  5  noargs
root@a02b5139bb7f:/testroot/ass1-int# cat /testresults/noargs
digraph graphname {

n1 [label="startProg"];n1 -> n3;
... <snip big-ass file>

Oh yes, now I remember that they output the Graphviz dot-format for an intermediate representation in that assignment. So that means that our testing container isn't equipped to deal with the test. Have we struck disaster, or is Docker so amazingly choice that we can overcome this? Stay tuned to find out..

$ docker restart moon_base1
moon_base1
$ docker attach moon_base1
root@b8e173f8481f:/# apt-get install graphviz

Lots of apt-get output snipped; the gist of it was that installing the missing packages will probably not screw up the development image, so let's go ahead and do that. Then we simply rebuild the test... If we think of the container as a delta from an image, then we've edited the image, so when we add the two together we get a new container:

$ docker commit moon_base1 ubuntu_localdev
74b59407d66fb5fd67fbb3255f9f435f079476bcac9ea259978ad85cf0005260
$ docker build -t=current_test .
Sending build context to Docker daemon 326.7 kB
Step 0 : FROM ubuntu_localdev
... *snip* ...
Removing intermediate container 341861334c56
Successfully built eebb3556c88f
$ docker run -it current_test /bin/bash
root@c993bd4fc1d0:/testroot/ass1-int# ls /testresults
1  2  3  4  5  noargs
root@c993bd4fc1d0:/testroot/ass1-int# dot -Tpdf /testresults/1 -o /testresults/1.pdf
root@c993bd4fc1d0:/testroot/ass1-int# ls -l /testresults
total 44
-rw-r--r-- 1 root root   614 Sep  7 13:32 1
-rw-r--r-- 1 root root 13852 Sep  7 13:35 1.pdf
-rw-r--r-- 1 root root  1891 Sep  7 13:32 2
-rw-r--r-- 1 root root  3552 Sep  7 13:32 3
-rw-r--r-- 1 root root  2748 Sep  7 13:32 4
-rw-r--r-- 1 root root  6772 Sep  7 13:32 5
-rw-r--r-- 1 root root    76 Sep  7 13:32 noargs

Very nice Rodney. We just rewrote the past a little, because the testing can be applied automatically to the new environment. Cool, so how do we get the results out to look at on the host machine?

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
c993bd4fc1d0        current_test        "/bin/bash"         3 minutes ago       Exited (0) 3 seconds ago                        sad_turing
$ docker cp sad_turing:/testresults/1.pdf .
$ ls -l 1.pdf
-rw-r--r--  1 andrewmoss  staff   13852 Sep  7 15:35 1.pdf

Excellent Smithers... A full testing cycle. Next time we will sort out the independence of the individual tests and maybe even do some actual grading...

A nanopass compiler in Haskell (part 4 of many)

A further explosion of boilerplate.


There are two standard candidates for a function that must be defined generically over the language trees: parsing and pretty-printing. In the compiler code I'm using parsec for parsing, mainly because of familiarity, but the number of "parsing in parsec" details does not compare nicely to the number of "writing boilerplate" details, so I will focus on pretty printing as the generic example.

The original plan was to use two separate typeclasses to represent the genericity [Spellchecker: I do not know if that is a real word either] in the code. A typeclass called Pool that would include functions defined for each node, and a second for each target generic function (e.g. Pretty, Parse etc). This did not work out well. I have decided that I cannot express this in Haskell - but I'm still not sure if the emphasis in that claim is upon me or Haskell :( On the bright-side it looks as if Typeable and Data from the SYB approach will take the role of the Pool class in this plan eventually.

When this plan broke down, shortly after making contact with the enemy, it crept into an attempt to wodge the generic functions directly into the Pool class. A foolish effort that will now be documented to make its failure entirely manifest.

type Indent = (Int,String)
oneline (tabs,text) = (indent tabs) ++ text ++ "\n"
indent depth = foldl (++) "" $ replicate depth "  "

class Pool a where
  pretty :: a -> String
  pretty = (foldl (++) "") . map oneline . expand
  expand :: a -> [Indent]

A more natural-seeming signature would map a Pool a onto a String directly. The input language is layout-sensitive, so the pretty printer needs to build the correct indenting. If a String is produced by each pretty node then the rules for line-breaking or swallowing internal indent become trickier. Instead the Indent type is used as an intermediate representation so that a soft-linebreak can merge adjacent lines more easily. The actual pretty printing is handled in the generic expand calls.

instance Pool Lit where
  expand (Lit n) = [(0,show n)]
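
For a compound node the expand body recursively expands its children and bumps their indentation. Something like this sketch for Do - the literal "do" text is invented here, and the real rendering depends on the source language:

instance Pool s => Pool (Do s) where
  expand (Do ss) = (0,"do") : map deeper (concatMap expand ss)
    where deeper (d,t) = (d+1,t)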

None of these instance declarations are boiler-plate - this is the actual body of the generic function for each case. The new boilerplate that we need to add is the following:

instance Pool ParseE where
        expand (ParseE1 e) = expand e
        expand (ParseE2 e) = expand e
        expand (ParseE3 e) = expand e
        expand (ParseE4 e) = expand e
        expand (ParseE5 e) = expand e
        expand (ParseE6 e) = expand e
        expand (ParseE7 e) = expand e

instance Pool ParseS where
        expand (ParseS1 s) = expand s
        expand (ParseS2 s) = expand s
        expand (ParseS3 s) = expand s
        expand (ParseS4 s) = expand s

Yes, this makes me die a little on the inside to sit and type in. This is pure boilerplate in the sense that we could cut and paste it with minor edits, but it needs to be kept up to date with the code despite its low semantic complexity. The original plan for splitting the typeclass into two parts was an estimate of how much of this would be acceptable. If we denote the number of generic functions by n, and the number of constructors in the various tree languages by m, then this approach requires n×m instances to work (plus the pattern synonyms and the actual tree declarations, which exist purely to provide unique names for each tag constructor). In the split approach it was hoped that n+m instances would be enough.

Far too much guffy code to write this way. Next we need to finish reading the SYB papers and work through some examples of Typeable and Data.

A nanopass compiler in Haskell (part 3 of many)

Writing lots of boilerplate


Sometimes it is not clear which is the best way to proceed - and then proceeding in any direction is the best move available. In this case there are nice approaches to solving the problem in the previous post, but I don't know the trade-offs between them as I've never used any of them before. So deciding which is the best approach requires some experience of the approaches. This is the kind of circular design tarpit that many programmers wrestle with. Realising this means that we can pick a good direction to head in: out, away, not-here!

Let's proceed in what is obviously the worst approach, to gain some experience in what makes it so bad. Or in other words, let's write lots of boilerplate before we try out the "Scrap Your Boilerplate" techniques. So we need a pool of nodes to describe a simple language:

-- Expressions
data Lit = Lit Integer
  deriving(Eq,Show)
data Var = Var String
  deriving(Eq,Show)
data Do s       = Do [s]
  deriving(Eq,Show)
data Case e = Case String [(Maybe Integer,e)]
  deriving(Eq,Show)
data Call e     = Call String [e]
  deriving(Eq,Show)
data Add e  = Add e e
  deriving(Eq,Show)
data Sub e  = Sub e e
  deriving(Eq,Show)

-- Statements
data Assign e   = Assign String e
  deriving(Eq,Show)
data Activate e = Activate String [e]
  deriving(Eq,Show)
data Receive e  = Receive String [String] e
  deriving(Eq,Show)
data Return e   = Return e
  deriving(Eq,Show)
data Send e     = Send String e
  deriving(Eq,Show)

The division into expression nodes and statements is somewhat arbitrary: if we were writing a single language the relative placement of these nodes within trees would be constrained by their types. It seems good to remember that distinction as we try a more generic approach. Occasionally snippets of documentation reveal where not to point a loaded gun with respect to our own footwear.

The first language we specify over the pool of nodes is a simple parse tree; at this point the typing for expression/statement nodes appears immediately:

data ParseE     = ParseE1 Lit
  | ParseE2 Var
  | ParseE3 (Do ParseS)
  | ParseE4 (Case ParseE)
  | ParseE5 (Add ParseE)
  | ParseE6 (Sub ParseE)
  | ParseE7 (Call ParseE)
  deriving(Eq,Show)

data ParseS  = ParseS1 (Assign ParseE)
  | ParseS2 (Activate ParseE)
  | ParseS3 (Receive ParseE)
  | ParseS4 (Return ParseE)
  deriving(Eq,Show)

The two languages are simple definitions of which partition of the pool of nodes can appear at a given "kind" of node in the tree. They are mutually recursive by necessity; the definition of expressions and statements within the parsed language is mutually recursive. Each constructor within ParseE/ParseS acts purely as a tag - it specifies the choice of which pool node can appear below.

These tags form a wrapping for the pool nodes - there are no "raw" constructors held in the tree. Each of the pool constructors is encapsulated by a specific constructor for the tree class. This is of course quite ugly and unwieldy in practice as we need to use this wrapping everywhere the pool node would be operated upon:

f (ParseE1 (Lit n)) = g (ParseE2 (Var "newname"))

It seems quite ugly to need to do this, and yet it seems entirely necessary to bypass the monomorphism restriction. Disabling the restriction is an approach that I'm not familiar with yet. At this point a familiar feeling descends for a researcher: this is ugly and it annoys me enough to fix it, yet other people have been doing this for years... Luckily we no longer live in an age where this means we need to spend a long time searching for the name of the concept that will fix this for us. Search engines have removed the search problem in one direction, but finding the accepted name for a thing you can only describe is still difficult. So let's rely on social media and people's willingness to take that final step in research dissemination - telling other people how stuff works. Beautiful.

{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE FlexibleInstances #-}
pattern PLiteral x     = ParseE1 (Lit x)
pattern PVariable x    = ParseE2 (Var x)
pattern PDo x          = ParseE3 (Do x)
pattern PCase x y      = ParseE4 (Case x y)
pattern PAdd x y       = ParseE5 (Add x y)
pattern PSub x y       = ParseE6 (Sub x y)
pattern PCall x y      = ParseE7 (Call x y)
pattern PAssign x y    = ParseS1 (Assign x y)
pattern PActivate x y  = ParseS2 (Activate x y)
pattern PReceive x y z = ParseS3 (Receive x y z)
pattern PReturn x      = ParseS4 (Return x)
...
f (PLiteral n) = g (PVariable "newname")

Pattern Synonyms solve this problem of crufty constructor tags perfectly, *tips hat toward danidiaz*, so we get to keep pattern matching over a language that is defined independently of the pool nodes, and thus the mutual recursion between expressions and statements does not pollute the pool definitions. We still have a hope of writing generic code on the pool nodes that we can execute over any language tree that we define this way. A hope. But we have not yet done anything with the language tree so celebration of the lack of boilerplate code would be a tad premature.
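
As a quick check that the synonyms behave as both constructors and patterns (the little tree below is invented purely for illustration):

example :: ParseS
example = PAssign "x" (PAdd (PLiteral 1) (PVariable "y"))

isLiteral :: ParseE -> Bool
isLiteral (PLiteral _) = True
isLiteral _            = False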

The next step is to define a generic function that actually does something...

Friday, 4 September 2015

Automated Testing in Docker (part 3 of 5)

Today the plan is to use the moon_base1 container as a staging point to build the testing environment, then clone it into a specialised container with the code to test. This is surprisingly easy, and suggests that Docker has been designed very well. But first a quick aside about the Mac. Docker sits on top of the namespaces support in the Linux kernel, and provides instances of a Linux installation packaged into a container. When it is running on a Linux host this is straightforward. But what about running a Docker container on a Mac (or even Windows)? Well, this is where things get really nice.

A short aside on running linux containers on OS-X


Docker Toolbox packages together everything needed to:

  • Run a custom linux distro inside VirtualBox as a headless machine.
  • Run the docker daemon inside the virtual machine.
  • Provide a Mac client that connects seamlessly to the daemon over an internal (simulated) ethernet bridge.
This is actually breath-taking, in the "oh my god that is so cool I cannot actually breathe" sense. One interface for an application container - it can be instantiated directly on a Linux host using namespaces for performance, or (optionally) deployed inside a virtual machine for maximum isolation. There are known hypervisor exploits (or were, I have not checked if they've been fixed), but an exploit that breaks out of the container and then through the hypervisor layer is such a specific attack surface that I find I can sleep really quite easily at night. (Again: CivEng security students who want a challenging project, feel free to come and discuss this.)

There is only one glitch here from my perspective: for ease of use the standard toolbox sets up a writeable share into the /Users directory that can be mounted from inside the container. Obviously this is useful in most target applications for docker, but in this application we want to disable it. The toolbox names the virtual machine "default" ... err, by default.

$ VBoxManage list vms
....
"default" {long hex id number}
$ VBoxManage showvminfo default
... (snip)
Shared folders:
Name: 'Users', Host path: '/Users' (machine mapping), writable
... (more snip)
$ docker-machine stop default
$ VBoxManage sharedfolder remove default --name Users
$ docker-machine start default
Starting VM...
Started machines may have new IP addresses. You may need to re-run the `docker-machine env` command
$ eval "$(docker-machine env default)"
$ VBoxManage showvminfo default | grep Shared
Shared folders:    <none>

That produces a nice warm feeling of satisfaction. Now of course we will need to load the files that we need into the container in a different way...

Back to building the specialised testing containers

If we remember that a Docker image is a union fs and a Docker container is a set of processes running over that file system then everything is nice and straightforward. We will jump into the moon_base1 container and directly install what we need as a dev environment. Then we can export the file system from this container into a new image that we can spawn individual containers from. Lovely jubbley.

$ docker restart moon_base1
$ docker attach moon_base1
root@b8e173f8481f:/# apt-get update
root@b8e173f8481f:/# apt-get -y install gcc make flex bison
root@b8e173f8481f:/# ^d
$ docker commit moon_base1 ubuntu_localdev
f899773591dbf38ddbd68aa73ffba3c26a151a84f385bb0892ab78677560d48d
$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
ubuntu_localdev     latest              f899773591db        7 minutes ago       281.9 MB
ubuntu              latest              91e54dfb1179        2 weeks ago         188.4 MB
$ docker run -it --name test1 ubuntu_localdev
root@dd851d126938:/# which gcc
/usr/bin/gcc
root@dd851d126938:/# which make
/usr/bin/make
root@dd851d126938:/# which bison
/usr/bin/bison
root@dd851d126938:/# which flex
/usr/bin/flex
root@dd851d126938:/# exit

Cool, so the important thing to understand here is that if we modify the filesystem in the test1 container it will not affect the moon_base1 container. The read-write layer in the union filesystem of the moon_base1 container becomes a readonly layer in the test1 container filesystem. The commit (export) and then new start have the effect of copying the moon_base1 filesystem into the new container, not aliasing / sharing it. The idea is that we will run our tests in this clean container and then destroy it afterwards without affecting the moon_base1.
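
A quick way to convince yourself of that isolation (a throwaway check - the scratch_test name is invented for this): create a file inside a fresh container and then look for it back in moon_base1:

$ docker run -it --name scratch_test ubuntu_localdev /bin/bash
root@container:/# touch /i_was_here
root@container:/# exit
$ docker restart moon_base1
$ docker attach moon_base1
root@b8e173f8481f:/# ls /i_was_here
ls: cannot access /i_was_here: No such file or directory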

So far we've used mainly interactive tools to explore docker and poke around images and containers to get a feel for the thing. But now, it's on. Docker's main interface is a simple scripting language called Dockerfiles. This gives us a simple syntax to automate building containers and images. First we need a build context - this really wants to be an empty directory as it will be transmitted to the docker daemon for execution of the script.

$ mkdir build_context
$ cd build_context
$ cp ../student.tgz .
$ vi Dockerfile

Obviously you don't have to use vi if you feel it is too oldskool. Atom is lovely. Or, if you feel that vi is one of them there new-fangled devices, then feel free to fire up ed, or just use a here-document to fill the file. We are editor-neutral in this place. But regardless, our first Dockerfile will contain:

FROM ubuntu_localdev
ADD student.tgz /testroot/

The FROM directive tells Docker which image is to be used as the starting point for the build. The ADD command is a slightly-magic version of copy that does things like unpack tarballs into the target file-system. We can take the result for a quick spin:

$ docker build -t=current_test .
Sending build context to Docker daemon 321.5 kB
Step 0 : FROM ubuntu_localdev
 ---> f899773591db
Step 1 : ADD student.tgz /testroot/
 ---> 7efb17601604
Removing intermediate container c27c7f81935f
Successfully built 7efb17601604
$ docker run -it --name inside_test current_test /bin/bash
root@10007f8dee6f:/# ls /
bin   dev  home  lib64  mnt  proc  run   srv  testroot  usr
boot  etc  lib   media  opt  root  sbin  sys  tmp       var
root@10007f8dee6f:/# cd testroot
root@10007f8dee6f:/testroot# ls
ass1-int
root@10007f8dee6f:/testroot# ls ass1-int/
Makefile  case2  case4  int       out.dot  parser.tab.c  parser.y
case1     case3  case5  lex.yy.c  out.png  parser.tab.h  scanner.flex
root@10007f8dee6f:/testroot#

Which looks a lot like a student submission for the assignment being graded. Progress! Onwards!

Thursday, 3 September 2015

Automated Testing in Docker (part 2 of 5)

So what is Docker?

Have you heard of Docker? You probably have—everybody’s talking about it. It’s the new hotness. Even my dad’s like, “what’s Docker? I saw someone twitter about it on the Facebook. You should call your mom.”
- Chris Jones 
Hmmm. Still not entirely clear? Imagine that you want to completely isolate a process from the rest of your machine. It should not be able to interact with any other process, or the file-system, or access the network. First let's think about how we would do this. The most obvious approach is to build a virtual machine with a separate OS install within it. The VM will form a boundary around the process running inside and prevent it from accessing anything that we don't set up explicitly. Cool. Now let's think about how clunky it would be to use.

The virtualised machine has separate I/O - e.g. a window pretending to be a monitor and mouse movements mapping to a virtual device. If we want to interact with it through a terminal we need to host sshd inside the machine to let us in. This is quite a complex mess just to run a command in an isolated environment.

Docker takes a different approach - a container is not a fully virtualised machine. Rather than simulate a computer for an OS to run within (e.g. as VirtualBox does), Docker relies on the kernel partitioning the machine into isolated pieces that cannot communicate with one another. Is this as secure as real virtualisation? Well, that is largely an open question right now - there are known exploits against hypervisors, and it is probable that the namespaces support in the kernel will get a lot of testing and patching. It is quite hard to quantify which approach is currently the more secure, and which has the potential to be the most secure. (Random aside: if you are a CivEng student in security looking for a project idea then this would be viable - get in touch if you are interested).

Docker uses a client-server architecture - a resident daemon handles these lightweight virtual machines that partition up resources. The file-system of the host machine is not partitioned; instead separate mount-points are used to access an entirely separate file-system in loop-back mode. The really neat part here is the use of a union file system. Standard chunks of the OS can be mounted read-only (and thus shared between different machines) with the writable parts of the current application sitting in overlaid layers.
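
Once an image has been pulled you can see this stack directly: each row of docker history is one read-only layer, and the writable layer for a container is added on top at run time.

$ docker history ubuntu
... <snip: one row per read-only layer of the image> ...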

So how does all of this help us in isolated testing and reproducibility?

The lowest layer in the file-system will be a standard disk image of a particular OS install. We will then turn this docker image into a docker container which can be specialised as a dev environment to build the units under test. This container will not be used directly - every test will clone this container to produce a one-off environment that runs the test, collects the results and then gets destroyed. This guarantees that state cannot propagate between tests; the actions in one test cannot pollute the behaviour of another.

Gosh this sounds complicated, thankfully we live in the future and the developers of Docker have made this all insanely easy:

$ docker run -it --name moon_base1 ubuntu /bin/bash

Docker makes really nice randomised names for containers automatically, but we will want to refer back to this one frequently as our starting container. The name ubuntu is a tag in the Docker repository that links to ubuntu:latest. When I run it today Docker resolves this to 14.04.2, works out which parts of the union fs are missing and downloads them for me. The -i flag sends my terminal input into the container, and the -t flag allocates a terminal for its output.

root@b8e173f8481f:/# uname -a 
Linux b8e173f8481f 3.14-2-amd64 #1 SMP Debian 3.14.15-2 (2014-08-09) x86_64 x86_64 x86_64 GNU/Linux

Absolutely awesome, the process of building the machine is completely automated, and the /bin/bash part of the run command-line has launched a shell process within the container with its standard streams mapped back to the controlling terminal. Gandalf's beard! So what happens next is the really interesting bit though:

root@b8e173f8481f:/# ^d 
$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

So passing an EOF to bash kills it, as it does in a normal terminal, and as that was the running (encapsulated) process Docker has killed the container. This is the standard use-case for Docker (application deployment) so handling one-shot containers is a joy. Even better, though, is the way in which Docker has killed the container. It is not deleted from the system until we tell it to be; rather it is sitting in a frozen state:

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                   PORTS               NAMES
b8e173f8481f        ubuntu              "/bin/bash"         2 hours ago         Exited (0) 2 hours ago                       moon_base1
$ docker restart moon_base1
$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
b8e173f8481f        ubuntu              "/bin/bash"         2 hours ago         Up 2 seconds                            moon_base1

Docker remembers that when we specified the /bin/bash binary to hold in the container we told it to bind the standard streams. When it restarts the process the streams are bound once again. People who are used to working in a screen session because ssh tends to drop long-term connections will then become really quite moist with excitement:

$ docker attach moon_base1
root@b8e173f8481f:/#

So Docker gives us the tools that we need to quickly setup, tear-down and reuse containers that wrap up particular installations of an OS. Next we need to specialise them to the particular environment to run a single test.

Automated Testing in Docker (part 1 of 5)

Everyone has an itch that they want to scratch using automated testing. Mostly these itches lie in a unique place depending on which parts of their job they find the most routine and boring. As a teacher I need to assess my students. It's not a particularly fun, or rewarding, part of the job. Sadly though it is highly necessary. In any given term I run through the following list hundreds of times:

  1. Retrieve the submitted item from the next student.
  2. Unpack their submission.
  3. Perform some syntactic checks: depending on the course this could be:
    • Pass their code through a compiler to see if their executable will be graded.
    • Run a custom parser against the syntax specified in their assignment.
  4. Run a suite of tests
    • Sometimes this involves feeding specific input to their executable and checking the output.
    • Sometimes this involves running shell commands in a known environment and checking how it has changed afterwards, in addition to any explicit output.
  5. Do some semantic analysis: i.e. read their code, poke around a bit and estimate their understanding.
  6. Take the results of the testing, combine it with an estimate of their understanding and award them a grade.
Some of these steps are easier than others. Some of these steps are more boring than others. It is only the final step that I actually get paid to do: the rest are just functional dependencies. Looking at this process from a student's perspective, steps 1-5 often produce more important information than step 6. During the first five steps there is a strong possibility that they will learn something new about their submission. Step six is really just a confirmation.

Learning something new about their submission is a highly profitable development for a student, and as a teacher I want to keep them in that state as often as I can get away with it. Possibly it is because interactive processes drive retention rates higher and improve learning outcomes, in the modern parlance. Or possibly it is because that is the bit that is actually teaching.

So, we're all highly accomplished computer scientists and formidable command-line ninjas. Why not just write a script for the first five steps? The simplest answer is that it involves running untrusted code in a live production environment. We don't do that round here. Normally, an experienced teacher will do step five ahead of step four. Just in case. It's not that many students would place something malicious in their submission, although honestly a few would. It's because...
"I must have put a decimal point in the wrong place or something. Shit! I always do that. I always mess up some mundane detail."     
- Michael Bolton
Accidental damage from code is a much larger risk than facing a malicious adversary (who knows that you are about to grade their work). So the real problem is isolation. How can we script the first five steps in a way that isolates them from the real system? Hello Docker.