Monday, 19 October 2015

System Deployment (part 7 of x)

The mechani.se server is back up and running, with a mirror of each git repository on it. Before teaching starts for the winter term the web server used for the courses needs to be brought back to life. This is complicated by a massive rewrite / redesign of large chunks of it that will probably stretch long into the term.

The web server.


This should be a simple beast: python code using the twisted library for HTTP processing. Large chunks of the site are content served dynamically to each student. Over the years it has been a testbed for pedagogic projects: generating unique assignments for students, integrating automatic testing into the submission system and other crazy ideas. As a result it has sprawled out of control, and the software architecture looks inspired by Picasso having a merry old time high on weapons-grade LSD.

The first step is to get the deployment system working again on the server. When the git repository hosting the server is updated, a post-update hook springs into life:

  • Copy the source tree and resources into the production tree.
  • Kill the old server.
  • Respawn the noob.
Git hooks are a strange mess of server-side state that is not versioned... Inside the bare repository on the server we update files in the hooks directory (a bare repository has no .git directory) that git will execute during certain actions. The post-update hook is the one that will redeploy the server:

#!/bin/sh
# post-update hook: deploy the pushed tree and bounce the web server.

echo "Website updated from commit" | logger -t gitolite
# Check the bare repository out into the production tree.
GIT_WORK_TREE=/var/www/thesite git checkout -f | logger -t gitolite
chmod 755 -R /var/www/thesite
chown git:git -R /var/www/thesite
# Ask the running server to shut itself down; the service script respawns it.
curl -s http://localhost/restart
sleep 2
# Debugging leftovers: dump evidence that a new server came back up.
top -bn1 | grep python
ps -A --forest | grep -C1 python
tail /var/log/syslog

Like all archaeologists we can find evidence of panic among the primitive people. The sleep followed by a dump of info is a sure sign that something did not work once, and that the confirmation output used to debug it was so comforting that it was never removed.

Using a URL to kill the server is asking for trouble. Currently we check on the incoming transport that the request originated on the 127.0.0.1 interface. This should not be spoofable, but if it is then we can use a random token in the file-system to lock this request down (see the sketch after this list). This approach works better than a direct kill from the gitolite user because:

  • No worries about serialisation; if we are in the processing hook for the restart page then any file I/O for another request is done.
  • No worries about the privileges needed to kill a process belonging to another user without introducing a privilege escalation attack.
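If that fallback is ever needed, a rough sketch of the idea - the token path and the query parameter are inventions for illustration, not part of the current hook or server:

# Write a fresh random token where the server can read it (ownership and
# permission details are glossed over here)...
head -c 16 /dev/urandom | xxd -p > /var/www/restart.token
# ...and present it with the restart request so the server can compare it
# against its own copy on disk before shutting down.
curl -s "http://localhost/restart?token=$(cat /var/www/restart.token)"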

First we need the user that will run the server, and their home-directory. The user www-data is already installed on debian for this purpose:

main@face:~$ grep www-data /etc/passwd
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
main@face:~$ sudo su www-data
[sudo] password for main:
This account is currently not available.
main@face:~$ sudo su -s /bin/bash www-data
www-data@face:/home/main$ cd
bash: cd: /var/www: No such file or directory

The account is designed not to permit casual use - once upon a time it was a dreadful security hole when people would forget to set a strong password on it, or even worse leave the default in place. We actually like it that way, so we will create the home directory that it needs and leave the account disabled, so that only root can switch into it by forcing a different shell.

root@face:/home/main# mkdir /var/www
root@face:/home/main# ls -ld /var/www
drwxr-xr-x 2 root root 4096 Oct 19 09:34 /var/www
root@face:/home/main# chown git:git /var/www
root@face:/home/main# ls -ld /var/www
drwxr-xr-x 2 git git 4096 Oct 19 09:34 /var/www
root@face:/home/main# cat >/var/www/webservice <<EOF

#!/bin/bash
cd /var/www/thesite
while true
do
authbind python server.py 2>&1 | logger -t www
echo "Web server exited, restarting" | logger -t www
sleep 2
done
EOF

root@face:/home/main# chown www-data:www-data /var/www/webservice
root@face:/home/main# chmod 744 /var/www/webservice
root@face:/home/main# su -s /bin/bash www-data
www-data@face:/home/main$ cd
www-data@face:~$ ls -al
total 12
drwxr-xr-x 2 git git 4096 Oct 19 09:37 .
drwxr-xr-x 13 root root 4096 Oct 19 09:34 ..
-rwxr--r-- 1 www-data www-data 167 Oct 19 09:37 webservice
www-data@face:~$ ./webservice
./webservice: line 3: cd: /var/www/thesite: No such file or directory
^C
^D
root@face:/home/main# mkdir /var/www/thesite
root@face:/home/main# chown git:git /var/www/thesite
root@face:/home/main# chmod 755 /var/www/thesite
root@face:/home/main# touch /etc/authbind/byport/80
root@face:/home/main# chown www-data:www-data /etc/authbind/byport/80
root@face:/home/main# chmod 500 /etc/authbind/byport/80
root@face:/home/main# ls -l /etc/authbind/byport/80
-r-x------ 1 www-data www-data 0 Oct 19 10:17 /etc/authbind/byport/80

Cool. The user has just enough privileges to execute the service script, but it cannot do anything else as it has no write permission anywhere. The gitolite user owns the www-data home directory and the thesite directory inside it. This is the target that the bare repository is checked out into each time the repo is updated. The basic workflow is like this:

  • Dev work happens off the server using -local to run the server in the non-production environment.
  • Deployment happens when the dev-commits are pushed back up to face.
  • The update-hook fires:
    • The production source tree is updated with the new code.
    • The old server is killed.
    • The service script spawns a new server after a couple of seconds.

Friday, 16 October 2015

System Deployment (part 6 of x)

Time to wander into a slightly different topic: now that the mechani.se domain is back up and running it is time to set up a gitolite3 installation on face.mechani.se.

Problem Context for gitolite.


My way of working on a linux system has evolved over the years because of some specific desires:

  • I work on several machines - my environment should always be the same
  • I dislike the hassle of maintaining backups - but I need to know that they are in place, and every time I switch machine I am effectively doing a restore.
  • Switching machines should break my context as little as possible.
The third point is the killer; during a session on one machine I build up a thick and fecund context. Depending on the work it may be edits to source, environment variables, command history and other forms of state that are local to the machine. Over time each machine acquires a layering of packages and installed artefacts (libraries, modules, random pieces of related source). Even seemingly inconsequential parts of the machine state are useful: the collection of workspaces and windows, positions and combinations are all memory prompts for a particular task.

The original dream (probably not even mine, these things tend to be infectious) was a teleporting environment: effectively hibernate a machine and transport the hibernated image to a new machine to be restored. These are the idle dreams of a grad student who works late through the night and doesn't want to start fresh when he trudges into the office. These dreams never quite found traction, although years of experimenting allowed them to morph into something more useful.

Virtualbox introduced teleportation a few years ago. The reality involves more suckage than the dream. Image files are large and cumbersome to transport. Somewhere I have a custom diff utility that syncs the hard-drive overlays inside a .VDI against a target-set to reduce the diff to a few hundred megs at a time (possible over my positively rural ADSL as well as on a flash drive). It just didn't really work out for us. Versioning entire OS images, avoiding branches and generally playing the maintain-the-repo-consistency game on a larger scale was even less fun than you would think.

It turns out that the answer is very boring and simple - it's more of a process than an artefact.

Version control for everything


Most of my work falls neatly into two categories:
  • Things that I know I will want to archive.
  • Things that will definitely be deleted after trying something out.
This clean taxonomic split is the stuff that programmers live for. It suggests a scratch directory that never needs to be backed up or transported off machine, and a set of version control repositories for everything that I would be displeased to lose. That is where people balk: at the idea of how much hassle it would be to keep everything in sync, and at the server-side issue of maintaining a bare repository for each of those projects. I wrote a script. Well, truth be told I wrote quite a few over the years to find the right way to work, but in the end they all collapsed into a single script.

Much like a sordid-ploy by Sauron there is a single repository that rules them all, gitenv:
  • Every configuration file in my home directory is linked into this folder. Using git on the dense representation (the folder) allows version control over a sparse hierarchy of files (the overlay on my home directory). This is a nice trick - there is a sketch of it just after this list. In some places these are symlinks to prevent chaos, and in other places we go straight for the jugular with hard-links (e.g. the .ssh directory is hard-linked to a directory inside the repository so that everything within can be versioned and shared across multiple machines).
  • Shell history for each machine is stored so history across all machines is searchable.
  • Custom bin directory, for all the magic.
  • A secrets file. Yup, these are a terrible idea, but then again so is losing access to a password. Which is why mine is encrypted using gpg and the plaintext contents never touch disk. In theory. Although my security needs are not particularly challenging and every time I screw up the passphrase I end up splashing the contents across the file-system. Yay!
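As a minimal sketch of the linking trick in the first bullet - the ~/gitenv checkout path and the file names are hypothetical stand-ins for the real layout:

# Plain config files are symlinked from $HOME into the repository folder, so
# git sees a dense directory while the home directory stays a sparse overlay.
ln -s ~/gitenv/bashrc    ~/.bashrc
ln -s ~/gitenv/gitconfig ~/.gitconfig
# The .ssh contents are hard-linked instead: both names share one inode, so
# edits made through either path are picked up by git.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ln ~/gitenv/ssh/known_hosts ~/.ssh/known_hosts
ln ~/gitenv/ssh/config      ~/.ssh/config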
This repository has a very strange property for source-control: the files within act as if they are in continuous change. Normally the state of a repository's contents acts discretely: things do not change in-between git commands. But linking the known_hosts file and the shell history into this repository means that the contents are always dirty. Because it is always dirty it always needs to merge against the remote - so each machine has a slightly different history for this repository. It is challenging to work with.

Everything else is simple in comparison, there is a single independent repository for each:
  • Project - independent tree for a piece of source, with docs and test data.
  • Course - all materials and archives of student submissions.
  • Document collections - articles, books etc
  • Web server - each active server has its contents in source control - these repositories have post-commit hooks to deploy the master branch live on the machine.
This means that each machine that I use has a collection of a few dozen repositories. This would be a serious pain to maintain by hand. Instead one script takes care of the difficult merge between the continuously changing environment repository and the server (and its mirrors), and then works out how close to consensus the rest of the repositories are. Where the actions to establish consensus are simple (i.e. the repository is purely ahead or behind) the script brings it into line automatically. This makes things sane.
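The ahead-or-behind test at the heart of that consensus check is easy to express on its own; a sketch, assuming the remote is called origin and the branch being chased is master:

# Count commits that exist only on the remote (behind) or only locally (ahead).
git fetch origin
read behind ahead <<< "$(git rev-list --left-right --count origin/master...master)"
if [ "$ahead" -eq 0 ] && [ "$behind" -gt 0 ]; then
    git merge --ff-only origin/master    # purely behind: fast-forward
elif [ "$behind" -eq 0 ] && [ "$ahead" -gt 0 ]; then
    git push origin master               # purely ahead: push
fi
# Anything else has diverged and is left for a human to sort out.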

Transporting state between machines is the same as using backup/restore. This is absolutely essential - it means that the backup and the restore mechanisms are in use every day. When you positively, absolutely need to rely on a system, make sure that you eat your own dog-food. Mmmm chewy. The weird thing about my backup and restore system is that any two machines rarely have exactly the same contents - but they all chase the same consensus state, and progress towards synchronisation is monotonic. This is actually good enough to make sure that nothing is ever lost.

Gitolite3


Git servers are nice and easy. Manually keeping track of repository details is an absolute pain in the arse. Thankfully gitosis, and now gitolite, have made that process incredibly simple. Despite that simplicity I have not yet worked out how to integrate this into the preseeded process, so for now this is a dangling live piece of state on the server. [Note to self: seeing it written out like this, it is quite obvious that running this with the sudo flipped around (root, or root->git) should make it easy to integrate.]

Each git server needs a user dedicated to gitolite3:

sudo adduser --system --shell /bin/bash --gecos 'Git version control' --group --disabled-password --home /home/git git
# Copy public key into git.pub
sudo su git
cp ../main/rsa_git.pub git.pub
gitolite setup -pk git.pub

The docs make it look much more complex, but on debian if you have installed the gitolite3 package this is all there is to it. Don't reuse a key - it may seem easier in the short-term but it actually makes things much more complex in the long term. Dedicate a key to git, and use an agent properly!

All the repositories inside gitolite are bare - this is the point of a server: pushes are guaranteed to succeed. This has been running quite happily against a single server for years; as I do the upgrade I'm setting up a second mirror for the git server. I haven't tried automating the sync between mirrors yet - there is a bit of thought to be had first about whether or not pushes are still guaranteed in a system with mirrors. I'm sure it will be fun to find out :)

I am always forgetting the admin repo URLs as there are slight differences in git-urls under the different protocol prefixes, but here it is as simple as:

git clone git@face.mechani.se:gitolite-admin face-gitolite-admin

Inside the admin repo the key is already in place so the config layout becomes completely uniform; conf/gitolite.conf looks like:

repo randomforests
    RW+ = git

repo paperBase
    RW+ = git

So now the allsync.py script needs some tweaks to handle multiple remotes as mirrors...
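In the meantime git itself can fan pushes out to a mirror without touching the script; a sketch, assuming the mirror answers on git@gimli.mechani.se (that address is a guess):

# Add both servers as push URLs on the existing remote; fetches still use
# the first URL, but a plain 'git push' now updates both machines.
git remote set-url --add --push origin git@face.mechani.se:randomforests
git remote set-url --add --push origin git@gimli.mechani.se:randomforests
git remote -v     # confirm the two push URLs are registered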



Thursday, 15 October 2015

System Deployment (5 of x)

Fixing the dodgy network settings. The debian-installer picks up a hostname over DHCP that belongs to another site running on my provider's service. Manually editing /etc/hostname and then rebooting solved this.

Seems like the static configuration worked out well. Seeing this output gives me a nice warm feeling:

main@face:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:ssh *:* LISTEN
tcp6 0 0 [::]:ssh [::]:* LISTEN

I like to know exactly what is running on a server, and how many access points it adds for an attacker. In this state the box should be safe until another openssh exploit is discovered, and they do not happen every year. Good enough.

Next I need to bring up bind9 for the domain (ha, there goes any security hopes :). Every step now will be done twice:

  • Live steps on the server (configuration drift)
  • Replicating the changes in the file-system overlay inside the remastering .iso

At the end I'll blast away the live system to check the installer can rebuild it properly. This won't happen very often, because the KVM simulated access to the raw disk is as slow as it is possible to be. The automated install (minus the manual modprobe for the driver, which is still bugging the crap out of me) takes about 40 minutes. Not so nice.

Bind9

The best way to configure bind9 is using an old configuration that works :) Otherwise proceed very carefully, and read as much as you can about security errors in bind9 configurations. For recovery of old files I took a not-so-elegant approach to backing up the live server: dd the disk and copy the image to a workstation. Mounting this as a loopback device (before killing the old server) gives me a simple way to recover things.

Disaster: apparently I did not dd the disk offsite. Instead I took the shortcut of tar'ing the entire file-system as it was running. This obviously made some sort of sense at the time. For inexplicable reasons it means that the /root directory is missing, along with the .bash_history inside it that was a record of the steps in building the server directly. Lesson: next time be ye not a dick and dd the disk offsite.

Not to worry, I deliberately cultivate an anally-retentive, OCD-level journal of work. The steps to build the server will be in there... Disaster the 2nd: as I was running 16-hour workdays with little sleep when the server was first installed, there is a blank spot of a month in my notes. Oh bugger. Lesson: kind of unclear...

Well it can't be that hard to look up again, I've already written the zone files once... Oh look, there's the zone files and everything is sealed inside a chroot jail so that I can run bind9 as a non-privileged user. That's a really cool idea, how the hell did I do that then?

Hmm, so it all changed with jessie then? I don't think that systemd and I will become friends after I prize its charred tentacles off of my drive.

The late_command in the preseed.cfg is getting a bit busy so we'll add a new script to the iso called chroot_bind9.bash:

#!/bin/bash
mkdir -p /var/bind9/chroot/{etc,dev,var/cache/bind,var/run/named}
mknod /var/bind9/chroot/dev/null c 1 3
mknod /var/bind9/chroot/dev/random c 1 8
chmod 660 /var/bind9/chroot/dev/{null,random}
mv /etc/bind /var/bind9/chroot/etc   # Installed default from package
ln -s /var/bind9/chroot/etc/bind /etc/bind
cp /etc/localtime /var/bind9/chroot/etc/
chown -R bind:bind /etc/bind/*
chmod 775 /var/bind9/chroot/var/{cache/bind,run/named}
chgrp bind /var/bind9/chroot/var/{cache/bind,run/named}
cp /cdrom/initd_bind9 /etc/init.d/bind9
# Next line is deliberately fragile - should be checked / rewritten if there is a major update to bind9
# Also - this is gnu sed style argument, not bsd.
sed -i 's:PIDFILE=/var/run/named/named.pid:PIDFILE=/var/bind9/chroot/var/run/named/named.pid:' /etc/init.d/bind9
echo "\$AddUnixListenSocket /var/bind9/chroot/dev/log" > /etc/rsyslog.d/bind-chroot.conf
# Skip service restarts as we will reboot soon
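Before trusting the reboot it is cheap to sanity-check the configuration, and afterwards to confirm that the jail actually took. A sketch - the config and zone file paths are assumptions about the layout created above, and the check tools come from the bind9utils package:

# Validate the main config and the zone file through the /etc/bind symlink.
named-checkconf /etc/bind/named.conf
named-checkzone mechani.se /etc/bind/mechani.db
# After the reboot, /proc shows the root directory the daemon actually has;
# this should print /var/bind9/chroot rather than /.
readlink /proc/$(pidof named)/root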

The makefile needs to be updated to get the new info into the .iso:

remaster.iso: copy
	cp preseed.cfg copy/
	cp isolinux.cfg copy/isolinux/
	cp /home/amoss/.ssh/rsa_face.pub copy/
	cp chroot_bind9.bash copy/
	chmod +x copy/chroot_bind9.bash
	cp mechani.db copy/
	tar czf copy/overlay.tgz -C config etc home/main
	genisoimage -b isolinux/isolinux.bin -c isolinux/boot.cat -o remaster.iso -J -R -no-emul-boot -boot-load-size 4 -boot-info-table copy/

And lastly the preseed is updated to execute the new script:

d-i preseed/late_command string \
tar xzf /cdrom/overlay.tgz -C /target ; \
in-target chown -R main:main /home/main ; \
in-target chown root:root /etc/hosts ; \
in-target chown root:root /etc/ssh/sshd_config ; \
chmod 700 /target/home/main/.ssh ; \
in-target chown main:main /home/main/.ssh/authorized_keys ; \
chmod 600 /target/home/main/.ssh/authorized_keys ; \
/cdrom/chroot_bind9.bash

Again, for emphasis: none of this has been tested yet - but what fun is life if we do not live dangerously, eh? After a robust exchange of views with my registrar about the quality of service, they have rebuilt the entente cordiale by manually flicking some switches somewhere, and lo and behold:

dig face.mechani.se +trace
; <<>> DiG 9.8.3-P1 <<>> face.mechani.se +trace
;; global options: +cmd
. 14196 IN NS j.root-servers.net.
. 14196 IN NS d.root-servers.net.
. 14196 IN NS c.root-servers.net.
. 14196 IN NS i.root-servers.net.
. 14196 IN NS h.root-servers.net.
. 14196 IN NS l.root-servers.net.
. 14196 IN NS k.root-servers.net.
. 14196 IN NS f.root-servers.net.
. 14196 IN NS g.root-servers.net.
. 14196 IN NS e.root-servers.net.
. 14196 IN NS m.root-servers.net.
. 14196 IN NS b.root-servers.net.
. 14196 IN NS a.root-servers.net.
;; Received 228 bytes from 8.8.8.8#53(8.8.8.8) in 89 ms

se. 172800 IN NS a.ns.se.
se. 172800 IN NS b.ns.se.
se. 172800 IN NS c.ns.se.
se. 172800 IN NS d.ns.se.
se. 172800 IN NS e.ns.se.
se. 172800 IN NS f.ns.se.
se. 172800 IN NS g.ns.se.
se. 172800 IN NS i.ns.se.
se. 172800 IN NS j.ns.se.
;; Received 492 bytes from 192.203.230.10#53(192.203.230.10) in 123 ms

mechani.se. 86400 IN NS face.mechani.se.
;; Received 63 bytes from 130.239.5.114#53(130.239.5.114) in 175 ms

face.mechani.se. 1800 IN A 46.246.89.132
mechani.se. 1800 IN NS face.mechani.se.
;; Received 63 bytes from 46.246.89.132#53(46.246.89.132) in 49 ms


Is good, no? The domain has only been offline for a month due to a "routine upgrade" :) Next up I will restore the gitolite configuration and mirror my lonely git server...

Wednesday, 14 October 2015

System Deployment (4 of x)

The server is back up and running, having managed to do a hdd install inside the KVM environment for the first time. Yay! There is not anything running on it yet, and there is a rough todo list before the site is back up.

  • Find out why the static network config is broken, disable dhcp, and verify that sshd is the only open port on the machine.
  • Stick the bind9 config back on the machine and bring the DNS back to life.
  • Install gitolite and set up bare-bones mirrors of the git server on gimli.
  • Put the web-user back in, and set up the hooks to run the server and redeploy from git.
But first a brief detour through the boot sequence (need to convert this to lecture slides)...

Linux Boot Sequence


This has changed over the years, and will probably change again. These details are against the current stable branch of jessie (late 2015).

Step 0: BIOS

The BIOS is effectively the firmware for the specific PC that it is running on. It exists in order to bootstrap the machine: the kernel to load is somewhere in the machine storage, and the BIOS should contain enough of the storage-specific drivers to access that storage and load in the next stage of the boot. The first BIOS was introduced on the original IBM PC in 1981. It has not changed much since then. In 1981 the range of peripheral devices available was limited; the BIOS was meant to function as a hardware abstraction layer in an era when that meant accessing the console, keyboard and disk drives. This layer is ignored by modern kernels.

The BIOS on every machine (the industry is currently in a transition to UEFI to replace this completely) follows a simple specification to find and execute the Master Boot Record (MBR).
For each disk in the user-specified boot order (e.g. HDD, cdrom, usb etc) :

  • Load sector 0 (512 bytes) into memory at 7C00h.
  • Verify the two-byte signature: 7DFEh=55h, 7DFFh=AAh.
  • Jump to the code in 16-bit real mode with the following register setup:
    • Code Segment = 0000h
    • Instruction Pointer = 7C00h

Step 1: MBR (stage 1)

The MBR used in DOS / Windows has changed over the years to include support for Logical Block Addresses / Plug'n'Play and other extensions. The BIOS-MBR interface must remain constant to guarantee that the boot sequence will work without knowing the specific combination of BIOS and O/S on the machine.

It is easy to access the MBR from a live linux system as it is at a fixed location on the disk, for example if we are on the first scsi disk in the system:

dd if=/dev/sda of=mbr_copy bs=512 count=1   # back the MBR up to a file
dd if=mbr_copy of=/dev/sda bs=512 count=1   # the reverse direction restores it

Booting Linux almost always means booting the GRUB MBR. If we want to see how that works then we can just disassemble the code in the mbr:

objdump -D -b binary -mi386 -Maddr16,data16 mbr_copy

mbr_copy:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   eb 63                   jmp    0x65
        ...
  65:   fa                      cli
  66:   90                      nop
  67:   90                      nop
  68:   f6 c2 80                test   $0x80,%dl
  6b:   74 05                   je     0x72
  6d:   f6 c2 70                test   $0x70,%dl
  70:   74 02                   je     0x74
  72:   b2 80                   mov    $0x80,%dl
  74:   ea 79 7c 00 00          ljmp   $0x0,$0x7c79

Here we can see a check on a parameter passed in by the BIOS (%dl indicates which disk was booted), a quick check to see if it is a fixed or removable disk, and then a far jump to an absolute real-mode address. The target 0x7c79 lies inside the 512 bytes loaded at 0x7c00, so this jump normalises CS:IP and continues in the rest of the MBR code. (The BIOS services themselves are reached through a published set of interrupt entry points and register setups.)

All of the stage 1 functionality has to fit into 510 bytes of 16-bit real-mode code. This is not a lot. To make life more interesting there is a data-structure embedded inside this code in a standard format: the partition table for the drive, which gives us four primary partitions. To be read and written by standard tools this table must sit at a specific location inside the sector. When we access this table, e.g. with something like:

sudo fdisk /dev/sda

The fdisk tool needs to access the table without executing any code in the MBR. The four 16-byte table entries and the two-byte signature reduce the space for executable code to 446 bytes. This is enough to find a larger boot stage on the disk using the raw BIOS routines and execute that second stage. An incredibly detailed (and thus useful) walkthrough of the booting scheme can be read on the grub mailing list.
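The fixed layout makes the table easy to eyeball from a live system; a quick sketch, assuming the first disk is /dev/sda and that xxd is installed:

# Dump the four 16-byte partition entries (offset 446) plus the 55 AA
# signature (offsets 510-511) straight off the disk.
sudo dd if=/dev/sda bs=1 skip=446 count=66 2>/dev/null | xxd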

Step 2: MBR (stage 2)

The second stage boot-loader is much larger. In the windows world this is called the VBR and is loaded directly from the beginning of the partition to boot. GRUB (in the classic MBR scheme) loads its second stage from a gap in the disk - the rest of the first cylinder on the disk is used as padding to align the first partition on a cylinder boundary. This padding consists of 63 sectors, or 31.5K of space. This is enough space to identify a boot partition with a known file system and load files (not raw sectors) from there. Typically the second stage of GRUB will display a splash screen with a menu and allow the user to select what to boot next. For a windows install this means chainloading the initial sectors from the installed partition. For a linux install it means loading a kernel file into memory, along with an initial ramdisk holding a bootstrap filesystem and passing control to the kernel.

The splash-screen menu is controlled by a GRUB2 configuration format that looks like this:

menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz
}

The root is the partition that contains the files for grub. Numbering is somewhat chaotic: harddisks are numbered from zero (hd0, hd1...) while partitions are numbered from 1, so (hd0,1) is the first partition on the first disk. This corresponds to the partitioning scheme in the previous post (a 1GB partition to hold the kernel, initrd and .iso images for installers). The second stage will mount the ext2 file-system on that partition, and the filenames ('/vmlinuz') are then absolute paths in that file-system. The kernel accepts some arguments from the boot-loader; panic=20 is very useful during development: when the kernel panics and refuses to boot it reboots the system back to grub after 20 seconds. When you don't have a physical keyboard to stab ctrl-alt-del on, this one saves a lot of coffee mugs from acts of extreme violence.

On some distros (gentoo springs to mind, although they may have updated it since I used it last) these config files are edited directly on the boot drive, normally under /boot/grub. Debian has some extra support for building the configuration. Each menuitem becomes an executable script under /etc/grub.d, so for example the above becomes /etc/grub.d/11_remaster by wrapping the contents in a here-document:

#!/bin/sh -e
cat << EOF
menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz
}
EOF
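The script then has to be executable and the generated configuration rebuilt; update-grub is the Debian wrapper that does the rebuild (the same command shows up in the previous post in this series):

chmod +x /etc/grub.d/11_remaster
update-grub    # regenerates /boot/grub/grub.cfg from the scripts in /etc/grub.d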

Step 3: Kernel / initrd

GRUB mounted the boot drive in order to load the kernel files, but the kernel cannot access this mount: we want the kernel to be independent of the boot-loader, and looking through memory for these data-structures would represent a huge dependency. So the kernel will need to access the hardware and load the file-system itself. Standard bootstrap problem: where are the drivers to do this? They are, of course, on the disk. Bugger.

In Linux the solution is quite elegant. Start with a / file-system already mounted, including all the necessary drivers. Use it to mount the real / file-system on the disk, and then move the original out of the way. This is much easier to achieve, and doesn't involve its own bootstrap problem, because we can just serialise the contents of a ramdisk directly onto the harddrive. The kernel can be booted with this ramdisk read back into memory. The tool for doing this is cpio; the contents get gzipped, and GRUB knows how to load and unzip the initrd.gz directly into memory for the kernel to use during boot.
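Because it is only a gzipped cpio archive, the image is easy to peek inside from a running system; a quick sketch, assuming an initrd.gz in the current directory:

# List the files packed into the initial ramdisk without unpacking it.
gunzip -c initrd.gz | cpio -t | head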

An important principle here is reuse of existing tools: the initrd is a standard file-system tree (before the serialisation and zipping) so we can edit any part of the boot image by unpacking it and using standard tools on it. This is what the cycleramd.bash script in the previous post did to insert the KVM para-virtualisation drivers.

Step 4: Init / inittab

Once the kernel has finished initialising itself and gained access to the root file-system it can proceed with booting the rest of the system. The /sbin/init executable is responsible for bringing the system up to the required level. In an ideal world this is a simple, isolated piece of code controlled by a single plaintext configuration file called /etc/inittab. Trying to pull this from my desktop system fell over the slimy entrails of systemd, which I will describe another time. The inittab from the hd-media initrd is a simpler example to work from:

# /etc/inittab
# busybox init configuration for debian-installer

# main rc script
::sysinit:/sbin/reopen-console /sbin/debian-installer-startup

# main setup program
::respawn:/sbin/reopen-console /sbin/debian-installer

# convenience shells
tty2::askfirst:-/bin/sh
tty3::askfirst:-/bin/sh

# logging
tty4::respawn:/usr/bin/tail -f /var/log/syslog

# Stuff to do before rebooting
::ctrlaltdel:/sbin/shutdown > /dev/null 2>&1

# re-exec init on receipt of SIGHUP/SIGUSR1
::restart:/sbin/init

The configuration file defines which programs are connected to the ttys that we can access through alt-f1, alt-f2 etc. On a desktop we would expect to see the X server connected to a login manager. Different forms of shutdown are associated with commands. On a desktop we would also see different runlevels associated with the programs executed to get there. In this ramdisk we see how to turn the linux system into a one-program kiosk, in this case to run the installer. Neat.

Step 5: Rest of the system

This depends entirely on the programs launched from init. The single-application kiosk image boots a very different system from a typical server (ha, what is that?) or a typical desktop environment. Seeing how the debian-installer manages its boot gives me some very evil ideas for single-process servers in a locked-down environment that I may explore later. For lecture slides I should probably also take some time, on another day, to describe:

  • The modern GPT scheme and UEFI
  • Booting from cdrom and usb.
  • Ways to launch server processes
  • Access to the debian-installer source on anonscm.
  • Alternative preseeding approach with the config on the ramdisk.

Sunday, 11 October 2015

System Deployment (3 of x)

The desktop installer seems to work (it's in day-to-day use now). Currently it builds an e17 desktop on top of Debian, with enough support to rebuild everything in my repositories. It is stable enough to host the build system for remastering a server: the target environment.

Server overview.

The server is a VPS (virtual private server) running in a data-center belonging to the provider. The virtual machine runs inside a KVM hosting environment. Physical access is simulated through VNC - importantly this connects to the host rather than the guest, so it remains available during reboot and reinstall. Unfortunately the keymap is a bit screwed up and there is no way to change it. Most(!?!) important punctuation can be found by setting the local keymap to US, but it is a minimal environment.

The simulated CD can only be changed by someone with admin privileges on the host, so it requires a support ticket and 24-48hr turn-around time. For this reason it is left set to a virgin debian installer image.

Bootstrap of the installation environment.

One issue that crops up straight away is that although the netinst image can find the virtual drive during installation, it cannot find it directly at boot. This is not a problem for a simple clean install, but the re-installer will use the hd-media kernel/initrd to boot the cdrom image, and that system cannot find the virtual drive.

KVM uses paravirtualisation, so the kernel will need the virtio drivers (in particular virtio_blk) and
these are not in the hd-media images by default. The initial environment will look like this:

Partition 1: 1000MB, ext2, bootable
   /vmlinuz - default kernel image from hd-media
   /initrd.gz - modified hd-media initial ramdisk with extra modules for virtio
   /remaster.iso - preseeded debian installer for the target installer
Partition 2: 1000MB, swap
Partition 3: Remaining space, ext4, mounted as /
Grub installed on the MBR
  - Standard menuitem to boot /dev/vda3 into the target system.
  - Extra menuitem to boot kernel/initrd from /dev/vda1

The standard installer is used through VNC to partition the disk and get a working system onto /dev/vda3. To rebuild the initrd we use the following script in the desktop environment. This saves a huge amount of work: the environment created by the jessie installer is the environment that the jessie installer was built inside.

#!/bin/bash
rm -rf tmpramd
mkdir tmpramd
(cd tmpramd && gunzip -c ../hdmedia/initrd.gz | cpio -i)
cp /lib/modules/3.16.0-4-amd64/kernel/drivers/block/virtio_blk.ko tmpramd/lib/modules/3.16.0-4-amd64/kernel/drivers/block/
mkdir tmpramd/lib/modules/3.16.0-4-amd64/kernel/drivers/virtio
cp /lib/modules/3.16.0-4-amd64/kernel/drivers/virtio/*ko tmpramd/lib/modules/3.16.0-4-amd64/kernel/drivers/virtio/
#sed -e 's:start-udev$:&\n/sbin/modprobe virtio_blk:' tmpramd/init >newinit
#mv newinit tmpramd/init
#chmod +x tmpramd/init
echo >>tmpramd/etc/rcS.d/virtio /sbin/modprobe virtio_blk
chmod +x tmpramd/etc/rcS.d/virtio
#echo virtio_blk >>tmpramd/etc/modules
(cd tmpramd && find . | cpio -H newc -o | gzip >../cycled.gz)
scp cycled.gz main@face:initrd.gz
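Before burning another slow install cycle it is worth a quick check, on the build machine, that the rebuilt image really contains what we think it does; a small sketch:

# The cycled image is just a gzipped cpio archive; confirm the virtio
# modules made it in.
gunzip -c cycled.gz | cpio -t | grep virtio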

As the comments in the script show there are several ways to do this that do not seem to work. I don't know why. If you have any idea please leave a comment. Trying to use /etc/modules to force loading the driver does nothing - perhaps this was a 2.4-only thing that has long been superseded in the kernel? Inserting the modprobe into the init for the ramdisk just causes a kernel panic when it fails. Inserting the modprobe into rcS.d means it is called later in init, when control is passed to the debian-installer-init script. This seems to work. [edit: no it doesn't. Some kind of stateful problem in testing: it looked like it worked but this is currently broken. Will update in a later post]. Inside the clean debian install we create /etc/grub.d/11_remaster and execute update-grub.

#!/bin/sh -e
cat << EOF
menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz
}
EOF

This puts us in the position where we can execute the preseeded installer directly from the harddrive to build the target system. There is no way to avoid the partitioning step inside the preseeded installer, so it is vital that the partitions made in the original clean install are identical to those made in the preseeded installer. Overwriting the partition table with the same data does not lose any data on the disk.

The preseeded installer.

As with the desktop install the preseed file is wrapped inside the .iso for the installer. Looks very similar, same partitioning scheme: 1GB for installer images, 1GB for swap, rest for a single / file-system. No dynamic network, hardcoded to the static setup of the target server. The overlay that gets untar'd at the end overwrites the sshd config. No passwords, no root access. Only strong-passphrase keys, the public halves being in the .iso and converted directly into an .ssh/authorized_keys file. There is a single random string with the password for the main user, but this can only be used over VNC. Basic package load for the server.

d-i debian-installer/language string en
d-i debian-installer/country string SE
d-i debian-installer/locale string en_SE.UTF-8
d-i keyboard-configuration/xkb-keymap select sweden

d-i netcfg/choose_interface select eth0

# To pick a particular interface instead:
#d-i netcfg/choose_interface select eth1

# To set a different link detection timeout (default is 3 seconds).
# Values are interpreted as seconds.
#d-i netcfg/link_wait_timeout string 10

# If you have a slow dhcp server and the installer times out waiting for
# it, this might be useful.
#d-i netcfg/dhcp_timeout string 60
#d-i netcfg/dhcpv6_timeout string 60

# If you prefer to configure the network manually, uncomment this line and
# the static network configuration below.
d-i netcfg/disable_autoconfig boolean true

# If you want the preconfiguration file to work on systems both with and
# without a dhcp server, uncomment these lines and the static network
# configuration below.
#d-i netcfg/dhcp_failed note
#d-i netcfg/dhcp_options select Configure network manually

# Static network configuration.
d-i netcfg/get_ipaddress string 46.246.89.132
d-i netcfg/get_netmask string 255.255.255.128
d-i netcfg/get_gateway string 46.246.89.129
d-i netcfg/get_nameservers string 8.8.8.8
d-i netcfg/confirm_static boolean true

# IPv6 example
#d-i netcfg/get_ipaddress string fc00::2
#d-i netcfg/get_netmask string ffff:ffff:ffff:ffff::
#d-i netcfg/get_gateway string fc00::1
#d-i netcfg/get_nameservers string fc00::1
#d-i netcfg/confirm_static boolean true

d-i netcfg/get_hostname string face
d-i netcfg/get_domain string mechani.se
d-i netcfg/hostname string face
d-i netcfg/wireless_wep string # Disable that annoying WEP key dialog.

### Mirror settings
d-i mirror/protocol string ftp
d-i mirror/country string se
d-i mirror/ftp/hostname string ftp.se.debian.org
d-i mirror/ftp/directory string /debian

### Account setup
# Skip creation of a root account (normal user account will be able to
# use sudo).
d-i passwd/root-login boolean false
# Alternatively, to skip creation of a normal user account.
#d-i passwd/make-user boolean false

# Root password, either in clear text
#d-i passwd/root-password password abc
#d-i passwd/root-password-again password abc
# or encrypted using an MD5 hash.
#d-i passwd/root-password-crypted password [MD5 hash]

# To create a normal user account.
d-i passwd/user-fullname string The main user
d-i passwd/username string main
d-i passwd/user-password password xxxxxxxx
d-i passwd/user-password-again password xxxxxxxx

d-i clock-setup/utc boolean true
d-i time/zone string Europe/Stockholm
d-i clock-setup/ntp boolean true

d-i partman-auto/disk string /dev/vda
d-i partman-auto/method string regular
# Manual use of the installer on face reports 30.1GB
d-i partman-auto/expert_recipe string \
remasterPart :: \
1000 1000 1000 ext2 \
$primary{ } $bootable{ } \
method{ keep } \
. \
1000 1000 1000 linux-swap \
$primary{ } \
method{ swap } format{ } \
. \
15000 15000 150000 ext4 \
$primary{ } $bootable{ } \
method{ format } format{ } \
use_filesystem{ } filesystem{ ext4 } \
mountpoint{ / } \
.
#d-i partman/choose_recipe select atomic
d-i partman-auto/choose_recipe select remasterPart
d-i partman/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true
d-i partman-basicmethods/method_only boolean false
d-i partman-md/confirm boolean true

# Package setup
tasksel tasksel/first multiselect minimal
d-i pkgsel/include string openssh-server git python python-dateutil sudo bind9 bind9-host gitolite3 binutils dnsutils authbind curl
popularity-contest popularity-contest/participate boolean false

# MBR
d-i grub-installer/only_debian boolean true
d-i grub-installer/with_other_os boolean true
d-i grub-installer/bootdev string /dev/vda
d-i finish-install/reboot_in_progress note

d-i preseed/late_command string \
tar xzf /cdrom/overlay.tgz -C /target ; \
in-target chown -R main:main /home/main ; \
in-target chown root:root /etc/hosts ; \
in-target chown root:root /etc/ssh/sshd_config ; \
chmod 700 /target/home/main/.ssh ; \
in-target chown main:main /home/main/.ssh/authorized_keys ; \
chmod 600 /target/home/main/.ssh/authorized_keys



Sunday, 4 October 2015

System Deployment (part 2 of x)

There is not much interesting to say about the debian installer any more: what used to be complex has become quite simple. Hit the appropriate buttons for localisation, choose a standard task selection and drive layout then wait around for the system to build.

The first thing that we want to do is to automate this process. The debian installer was written with scripting support in mind. Preseeding is a technique for supplying the answers to each installer prompt in a simple text format. If the installer can see the preseed file then it can execute the entire installation and reboot into the new system without any user input.

What do we need to make this work?
  1. A preseed file to configure the installer.
  2. A method of getting the preseed file into the installer image.
Let's see how to solve the second problem first.

Remastering the installer image to include a preseed file.

This is easy to do in a running linux system. It seems to be next to impossible in a modern OSX version (the newest debian installers use UDF rather than ISO and the specific UDF used does not mount under OSX).

remaster.iso: copy
	cp preseed.cfg copy/
	cp ~/.ssh/id_dsa.pub copy/
	genisoimage -b isolinux/isolinux.bin -c isolinux/boot.cat -o remaster.iso -J -R -no-emul-boot -boot-load-size 4 -boot-info-table copy/

copy: debian-8.2.0-amd64-netinst.iso
	mkdir copy 2>/dev/null || true
	mkdir loop 2>/dev/null || true
	mount debian-8.2.0-amd64-netinst.iso loop/
	rsync -rav loop/ copy
	umount loop
	rm -rf loop

debian-8.2.0-amd64-netinst.iso:
	curl -LO http://cdimage.debian.org/debian-cd/8.2.0/amd64/iso-cd/debian-8.2.0-amd64-netinst.iso

A makefile may not be the best way to do this, but it is fast and cheap. We don't want to grab the original installer each time - we want to cache it to save bandwidth. We don't want to unpack the image again unless we have to. A makefile is a very natural way to express building from caches.

Unfortunately there is something funky here - it could be that using the directory copy as a target is causing problems with time-stamping for the make logic. Either way I've had to kill the copy directory a few times to get changes to propagate. This mostly works.

Using a file called preseed.cfg in the root directory of the installer image is supported directly by the debian installer. The ssh key file is not used directly by the installer but it is explained later in this post.

Writing a preseed file.

Writing a preseed file is like any other configuration - start with a working example (in this case the jessie example file). Run it to check it works. Tweak it until it does the right thing. In this case the right thing is a desktop install targeted at a virtualbox VM for development work. There is an extra unused partition on the harddrive - this is for storing the .ISO on directly so we can do re-installs direct from the harddrive without needing access to the virtual cdrom.

d-i debian-installer/language string en
d-i debian-installer/country string SE
d-i debian-installer/locale string en_SE.UTF-8
d-i keyboard-configuration/xkb-keymap select sweden
d-i netcfg/choose_interface select auto
d-i netcfg/get_hostname string unassigned-hostname
d-i netcfg/get_domain string unassigned-domain

d-i netcfg/hostname string psuedo2
d-i netcfg/wireless_wep string

d-i mirror/protocol string ftp
d-i mirror/country string se
d-i mirror/ftp/hostname string ftp.se.debian.org
d-i mirror/ftp/directory string /debian

d-i passwd/root-login boolean false
d-i passwd/user-fullname string Amoss
d-i passwd/username string amoss
d-i passwd/user-password password abc
d-i passwd/user-password-again password abc

d-i clock-setup/utc boolean true
d-i time/zone string Europe/Stockholm
d-i clock-setup/ntp boolean true

d-i partman-auto/disk string /dev/sda
d-i partman-auto/method string regular
d-i partman-auto/expert_recipe string                         \
      remasterPart ::                                         \
              1000 1000 1000 ext2                             \
                      $primary{ } $bootable{ }                \
                      method{ keep }                          \
              .                                               \
              1000 1000 1000 linux-swap                       \
                      $primary{ }                             \
                      method{ swap } format{ }                \
              .                                               \
              15000 15000 150000 ext3                         \
                      $primary{ } $bootable{ }                \
                      method{ format } format{ }              \
                      use_filesystem{ } filesystem{ ext3 }    \
                      mountpoint{ / }                         \
              .
d-i partman/choose_recipe select remasterPart
d-i partman-partitioning/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true
d-i partman-basicmethods/method_only boolean false

tasksel tasksel/first multiselect minimal
d-i pkgsel/include string openssh-server xorg e17 xdm terminology
popularity-contest popularity-contest/participate boolean false

d-i grub-installer/only_debian boolean true
d-i grub-installer/with_other_os boolean true
d-i finish-install/reboot_in_progress note

d-i preseed/late_command string \
mkdir /target/home/amoss/.ssh ; \
in-target chown amoss:amoss /home/amoss/.ssh ; \
chmod 700 /target/home/amoss/.ssh ; \
cp /cdrom/id_dsa.pub /target/home/amoss/.ssh/authorized_keys ; \
in-target chown amoss:amoss /home/amoss/.ssh/authorized_keys ; \
chmod 600 /target/home/amoss/.ssh/authorized_keys

Some words about security. 


Don't really set your password to abc - even when you are just testing this in a machine sitting behind NAT. It just sets you up for fail when you change the network configuration in virtualbox and forget to update the password. The idea is to set a randomised password for the only user account and disable root access.

The late_command script at the end is very specific to my configuration, but it should be adaptable for anyone. I have one ssh key that I use to log into desktop machines. It has a strong passphrase on it as sometimes I store the private key in places that I do not entirely trust (and thus it needs to be strong enough to survive brute-force attacks for the expected lifetime of the key).

I place the public half of this key into the disk image inside the makefile. The installer then creates a .ssh directory for the single user and sets the public key as an authorised key for login to the new desktop. I then have a single way into the newly installed image - ssh to the virtualbox bridge IP address using the key. Once I have securely logged in I can set the password to something memorable - this avoids storing the password in the .ISO image that is being built. Even if we never release the .ISO into the wild, storing a plaintext password of any value in it is simply a bad idea.

Saturday, 3 October 2015

System Deployment (1 of x)

The time has come for a new series of posts. As a bit of background: this series is a sketch of new material that relates to one of my courses. In this case the course is the basic linux introduction, which will be getting some new material. The subject to be developed in this series is how we should build and deploy linux systems. I will be covering both desktop and server builds. The information on desktop systems is partly a record of how I have finally tamed the jungle of systems that I work on, and partly a description for students who may be installing a linux system to use within the course. The information on server systems is partly a record of how the course server was built and how it works, and partly a guide for students: although they will not be installing a server system within the work on the course, knowing how to do so is a useful body of knowledge to take away, as it may be something they need in future courses.

On a personal note, the course server was taken offline a couple of weeks ago for a major upgrade. Two things went wrong. The VPS provider had claimed they allow installation from a custom .ISO, which is what prompted this work originally. They did not mean they allowed the customer to upload a custom .ISO; they meant that support staff could do it through a ticket. This means I may need to find a new provider, as there are problems with running a hdd-install inside a para-virtualised system that I may not be able to overcome. The other problem is that I've been off work with some health problems for the past three weeks, and it is now too close to the start of the course to believe this will be working in time. It seems likely that this year the course will be run from an old copy inside It's Learning, and that deployment of new material on the course server will probably be delayed until a different academic year. Slippage is a bitch.

Overview


Regardless of the OS, and regardless of the machine there is an observation that is both timeless and ubiquitous: when a machine is first installed it is fast and stable. Over time it becomes less so. The computer industry has some similarities with the sale of used cars: we can make that "new car smell" that lasts until you get your purchase home and start to use it. Then things go slowly downhill.

Some people would have you believe that the reasons for this are difficult and not clearly understood. Because of this there is a human tendency to attribute agency, and assume somewhere in the process we are deliberately being screwed. I would tend towards a different explanation, that I believe most programmers would agree with. At installation time a computer system exists in a state of low-entropy. The variations between systems are due to different sets of drivers on different hardware, or different selections of features and services. The installer is simply a program that builds a known target state on the system. As a programmer I always like it when my program is trying to build a value that I know. Good times.

During most uses of a system it experiences unpredictable, unknown, changes. The system tends towards a higher state of entropy. Every piece of software that is installed, every change of driver, every update or upgrade of code creates more uncertainty about the state of the system. Eventually it reaches a state of maximum entropy - the heat-death of a computer, at which point an expert is summoned to "wipe all of that crap off and install it fresh". The cycle continues, after all thermodynamics does not take prisoners and the outcome is inevitable.

Problem statements.


1. System upgrades create more entropy than system installs.

[Diagram: two pathways through the space of possible configurations - install → upgrade → normal use, and install → normal use → upgrade - with circles marking the states reachable along each path.]

This should be read as a diagram of "non-commutativity", i.e. normal use + upgrade ≠ upgrade + normal use. The circles should be seen as estimates of "valid possible states / configurations". The amount of uncertainty / entropy is indicated by the number of circles.

2. Configuration drift destroys robustness.
Robustness should be interpreted as doing the least surprising thing. Editing the configuration of a live system until it seems to work does not create the least surprise.

3. Even the fastest human expert imposes latency on fixing a system.
If I type really fast then I can mimic a very slow script. Maybe.

4. Reliability is easiest to achieve through redundancy. Robustness requires returning to a known state.
When something breaks it needs a backup. If something has become questionable it helps if the backup is not also screwed.

5. The higher the degree of system entropy the less secure the system.
If we know things then we can rule out attack surfaces. Pop quiz: which is more secure, a dynamic IP or a static one? Sensible answer: why would it make a difference, when the information exchange between the dhcp server and dhcp client only needs to convey MAC addresses and IP addresses? Real-world answer: programmers are lazy idiots, and dhcp clients hand strings from the protocol to bash hook scripts. Unless patched and verified, using DHCP opens a system to the shellshock exploit.

Solution (to be developed over series)


Automated installation and deployment.

Ok, so can we unpack this a little to get an idea of why this would be a solution (all details in upcoming posts):
1. Avoid upgrading the system, capture the target as a delta from the default install, reapply the delta to an install of the new version. This is "the other pathway" on the diagram above.
2. Never reconfigure anything on the target. Preseed and script all configuration and installation. This allows deployment of the system into a development environment with some reassurance that the config matches the live system.
3. A preseeded installer can rebuild a system in about 10 minutes. Not to the clean OS install, but to a fully working system with working data downloaded from source control.
4. Configuration changes are captured in the version control for the system that builds the installer.
5. All configuration is documented. In the longer term it seems interesting to make the target system read-only to enforce this - running in a similar manner to a live distribution. This closes all the holes - if there is no root on the target box, then no attacker can own it.

Saturday, 12 September 2015

Automated Testing in Docker (part 5 of 5)

The first batch of student assignments has been graded, and so this is the penultimate post in this series. Today I want to write a little about the experience of testing via docker. The final post will come some time later as I want to build a queuing system for tests - the long-term idea is that students will have access to the testing facility before their submission, rather than a teacher using it before grading.

Battery consumption on the mac.

Using docker on the mac rapidly makes the Terminal application the most energy-intensive one. Opening it up inside Activity Monitor shows that the VBoxHeadless process is running and consuming power in-between test runs. The simple solution is to suspend/resume the virtual machine:

VBoxManage controlvm default pause
...
VBoxManage controlvm default resume

Underspecification in the assignment.

Students seem to have a natural genius for spotting parts of the assignment that are underspecified or ambiguous; then, as a weakly coordinated group, they spread out among the possibilities that are just within the specification, and their work samples the boundary of what is allowed. It's like watching a particle filter explore environmental constraints. Anyway, the compiler assignment they were given this year forgot to mention some things:

  • The output is weakly defined; there is no split between what should go to stdout, which filenames the dot representation should be saved under etc.
  • The filenames in the submission tarball are not defined; the reference to the previous assignment is ambiguous and depends on finding abbreviations that make sense to an english speaker. If ass1-int was the name of the first directory then the second should be ass2-comp, but this depends on cultural experience of abbreviation. Amazingly one student in the batch did actually deduce this from the context, but amongst the others were reasonable estimates of ass2-int, ass1-int and ass2.
  • A problem that occurs in both assignments is that the gold standard of "make a working compiler for this language" has been weakened to support a range of passing grades. The output of the various target grades is not strictly monotonic, so the output for a student targeting a grade E is not compatible with a grade D, for example.
  • The "easier" result for a low grade is harder to check. This was not obvious while writing the assignment as the focus is upon what they need to do (code) in order to get a passing mark. Because the output for a grade E on the second assignment is not executable - other than a human reading the CFG in the final pdf file - it requires more knowledge to evaluate if the traces in the testcase are properly shown in the output. Some student forgot edge labels making their CFGs ambiguous, others missed blocks in the CFG or confused ordering. It is not obvious how to make the "easier result" easier to validate for the student. This point requires some thought.

False positives on plagiarism.

One of the onerous duties of a teacher is the need to be aware of plagiarism and spot when work has been duplicated. We must also avoid false positives, as the process for handling student plagiarism does not abort cleanly (people are human and tend to get pissy when accused of something they have not done). In this batch it looked like there was a clear case of copying. Very clear cut. Almost too easy to spot.

Experience indicates that the more obvious something is, the harder we have to work to verify that it is true. So in this case the individual tests had to be rerun manually and the code examined against them to check what was happening. Luckily I have a high background level of disbelief, so after checking minutely within the testing environment it was time to dismantle it, then manually unpack and stage the submissions to verify that the testing environment had not magically made their code look the same. This level of technical paranoia was justified: the testing environment had indeed altered their code to make it look the same. Bad evil docker.

So what happened? Recall that the file-system inside a container is a union-fs that stacks parts of the FS tree together from more basic pieces. Docker tries to be smart about this, so it keeps a cache of those previous pieces. Rerunning the same build instructions over the same source filenames (which have been reloaded with a fresh submission) does not propagate dependencies - Docker is not make. It does not check whether the source has changed; it has a different set of rules for when to rebuild dependencies and when to reuse the cached FS layer. Bad docker.

When tools misbehave we just turn them off, and at least Docker is flexible:

docker build --no-cache -t=current_test .

This disables all caching and ensures that we build the container only from the submitted work and the pieces of the base image - never caching state from another submission. Phew, narrowly avoided disaster.

Return codes.

As far as I can tell Docker may be flakey and crappy at processing the return-codes of RUN commands within build scripts. Or not. The caching issue above may have contributed to executing the wrong executable, so it is hard to pin down an exact failure, and the various bug-reports against Docker for this type of issue have been closed. This could have been a ghost that vanished when the caching problem above was fixed. Either way it seems safer to write the return-codes out as a result of the test rather than using them to decide whether the test continues:

RUN ./comp /testcases/1.lua >redirect.dot; echo $? >/testresults/1.retcode
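
The recorded codes can then be checked afterwards on the host. A rough sketch, not from the post - materialising a throwaway container with docker create in order to reach the files is my assumption:

# Copy the recorded return-codes out of the image and list every test whose
# file does not contain a bare 0, i.e. the tests that failed.
cid=$(docker create current_test)
docker cp "$cid":/testresults ./testresults
docker rm "$cid"
grep -L '^0$' ./testresults/*.retcode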

Performance

Well, it ain't quick. But on the other hand it is much faster than typing in the tests manually: each run takes about 15 seconds at the moment, whereas scraping the various filenames together by hand and handling every variation in the underspecified assignment takes many, many minutes. It is still too slow to stick behind a CGI gateway and give students direct access to. One speed-up will come from smashing the various RUN commands together into a single shell command separated by semicolons, as sketched below, but even then it will be slow enough to need some kind of batch system. Also, having up to 50 students hitting the testing server at the same time may introduce a latency of 10-15 minutes.
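
A minimal sketch of that speed-up, assuming the test layout from the earlier Dockerfile (the exact saving per layer is a guess):

# One layer instead of five: run every testcase in a single RUN command and
# record each exit status rather than letting a failure abort the build.
RUN for t in 1 2 3 4 5; do \
      ./comp /testcases/$t.lua >/testresults/$t 2>&1; \
      echo $? >/testresults/$t.retcode; \
    done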

Code that breaks

There are generally two ways that student submissions fail to terminate cleanly: either they crash or they hang. Programs that seg-fault show up in the build output, although at the moment spotting them is a manual step and I need to find a way to record it automatically. Something very nice, and normally hard to do, is recovering a stack-trace for the failure.

Running an interactive container on the current_test image lets me install gdb, manually flip on debug info in the student makefile, and then replicate the crash inside a debugger, roughly as sketched below. This allowed me to email stack traces to students that submitted crashing code. I am suitably happy with this :) I also need to think of a way to automate this process.
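
A rough sketch of that manual session; the CXXFLAGS override is an assumption about the student makefiles, and the testcase is just an example:

# On the host: start a throwaway shell on the image used for the tests.
docker run -it current_test /bin/bash
# Inside the container: pull in the debugger and rebuild with debug info
# (this assumes the student makefile honours CXXFLAGS).
apt-get update && apt-get install -y gdb
make clean && make CXXFLAGS="-g -O0"
# Replicate the crash under gdb and capture the stack trace.
gdb --args ./comp /testcases/1.lua
(gdb) run
(gdb) backtrace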

Hanging code is a bit harder. I suspect that I will find a cpu quota option in Docker somewhere and put a cpu-timelimit in the assignment specification; a cheaper stopgap is sketched below. Either way this will interact with the performance issues above and affect the maximum latency on a submission queue.
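
Until that option turns up, coreutils' timeout is a cheap way to bound each test from inside the build script; the 30 second limit is an arbitrary choice of mine:

# timeout kills the test after 30 seconds and exits with status 124,
# which then shows up in the recorded return-code for that test.
RUN timeout 30 ./comp /testcases/1.lua >/testresults/1; echo $? >/testresults/1.retcode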

Conclusions

Grading student work is quite a subjective process that depends on an objective test of the artefact they submit. The testing process is quite error-prone and fragile, much more so than many teachers would like to admit. For the testing process to really be objective it must be made robust; reproducibility of the testing used to inform a grading decision is vital to minimising the subjectivity of that decision. My initial experiences from trialling this approach suggest that it has great potential, and when students get access to the testing mechanism themselves it should benefit them directly.

Monday, 7 September 2015

Automated Testing in Docker (part 4 of 5)

Running tests in Docker and getting results


Let's speed up a bit and wrap a testing sequence up inside Docker. This is the Dockerfile that we use to build the container. It is only a first step to check that we can get the results that we need: these six tests are not dependent on one another, so a failure in one should not prevent the rest from executing - that is not handled yet, but this build script does replicate much of the manual testing. Something that is apparent immediately is that this way of working is akin to single-user mode: the file-system can be stomped over (making various top-level directories and ignoring the rest), and the system runs as root so we can ignore file-permissions and users completely.

FROM ubuntu_localdev
ADD student.tgz /testroot/
COPY 1.lua /testcases/
COPY 2.lua /testcases/
COPY 3.lua /testcases/
COPY 4.lua /testcases/
COPY 5.lua /testcases/
WORKDIR /testroot/ass1-int
RUN ls
RUN make
RUN mkdir /testresults/
RUN ./int >/testresults/noargs
RUN ./int /testcases/1.lua >/testresults/1
RUN ./int /testcases/2.lua >/testresults/2
RUN ./int /testcases/3.lua >/testresults/3
RUN ./int /testcases/4.lua >/testresults/4
RUN ./int /testcases/5.lua >/testresults/5

This runs, and we can jump in to get the results directly (picking a student that did well in the assignment helps - test the positive cases first and then go back to examine the failures):

$ docker build -t=current_test .
Sending build context to Docker daemon 326.7 kB
Step 0 : FROM ubuntu_localdev
 ---> f899773591db
Step 1 : ADD student.tgz /testroot/
 ---> Using cache
 ---> 7efb17601604
Step 2 : COPY 1.lua /testcases/
 ---> Using cache
 ---> e1981c1bd075
Step 3 : COPY 2.lua /testcases/
 ---> Using cache
 ---> a1097f1868e5
Step 4 : COPY 3.lua /testcases/
 ---> Using cache
 ---> 29438a35f659
Step 5 : COPY 4.lua /testcases/
 ---> Using cache
 ---> 185197776daf
Step 6 : COPY 5.lua /testcases/
 ---> Using cache
 ---> 5b10271f9ed8
Step 7 : WORKDIR /testroot/ass1-int
 ---> Using cache
 ---> d5dc06b80cd7
Step 8 : RUN ls
 ---> Using cache
 ---> 39d93f0e0d73
Step 9 : RUN make
 ---> Using cache
 ---> dcfa1cb02748
Step 10 : RUN mkdir /testresults/
 ---> Using cache
 ---> d1f6ece4509a
Step 11 : RUN ./int >/testresults/noargs
 ---> Using cache
 ---> ef45f84616c6
Step 12 : RUN ./int /testcases/1.lua >/testresults/1
 ---> Using cache
 ---> 349c50894154
Step 13 : RUN ./int /testcases/2.lua >/testresults/2
 ---> Using cache
 ---> c5586940a4fb
Step 14 : RUN ./int /testcases/3.lua >/testresults/3
 ---> Using cache
 ---> 888c850b9164
Step 15 : RUN ./int /testcases/4.lua >/testresults/4
 ---> Using cache
 ---> f75184fa8fa8
Step 16 : RUN ./int /testcases/5.lua >/testresults/5
 ---> Using cache
 ---> 61ec01ac3dba
Successfully built 61ec01ac3dba

Have an interactive poke around to check the results make sense...

docker run -it current_test /bin/bash
root@a02b5139bb7f:/testroot/ass1-int# ls /testresults
1  2  3  4  5  noargs
root@a02b5139bb7f:/testroot/ass1-int# cat /testresults/noargs
digraph graphname {

n1 [label="startProg"];n1 -> n3;
... <snip big-ass file>

Oh yes, now I remember that they output the Graphviz dot-format for an intermediate representation in that assignment. So our testing container isn't equipped to render the test output. Have we struck disaster, or is Docker so amazingly choice that we can overcome this? Stay tuned to find out...

$ docker restart moon_base1
moon_base1
$ docker attach moon_base1
root@b8e173f8481f:/# apt-get install graphviz

Lots of apt-get output snipped, the gist of which was that installing the missing packages will probably not screw up the development image, so let's go ahead and do that. Then we simply rebuild the test... If we think of the container as a delta on top of an image, then we have just edited that delta; committing adds the two together to give a new image:

$ docker commit moon_base1 ubuntu_localdev
74b59407d66fb5fd67fbb3255f9f435f079476bcac9ea259978ad85cf0005260
$ docker build -t=current_test .
Sending build context to Docker daemon 326.7 kB
Step 0 : FROM ubuntu_localdev
... *snip* ...
Removing intermediate container 341861334c56
Successfully built eebb3556c88f
$ docker run -it current_test /bin/bash
root@c993bd4fc1d0:/testroot/ass1-int# ls /testresults
1  2  3  4  5  noargs
root@c993bd4fc1d0:/testroot/ass1-int# dot -Tpdf /testresults/1 -o /testresults/1.pdf
root@c993bd4fc1d0:/testroot/ass1-int# ls -l /testresults
total 44
-rw-r--r-- 1 root root   614 Sep  7 13:32 1
-rw-r--r-- 1 root root 13852 Sep  7 13:35 1.pdf
-rw-r--r-- 1 root root  1891 Sep  7 13:32 2
-rw-r--r-- 1 root root  3552 Sep  7 13:32 3
-rw-r--r-- 1 root root  2748 Sep  7 13:32 4
-rw-r--r-- 1 root root  6772 Sep  7 13:32 5
-rw-r--r-- 1 root root    76 Sep  7 13:32 noargs

Very nice Rodney. We just rewrote the past a little, because the testing can be applied automatically to the new environment. Cool, so how do we get the results out to look at on the host machine?

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
c993bd4fc1d0        current_test        "/bin/bash"         3 minutes ago       Exited (0) 3 seconds ago                        sad_turing
$ docker cp sad_turing:/testresults/1.pdf .
$ ls -l 1.pdf
-rw-r--r--  1 andrewmoss  staff   13852 Sep  7 15:35 1.pdf

Excellent Smithers... A full testing cycle. Next time we will sort out the independence of the individual tests and maybe even do some actual grading...
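
One way to get that independence - and essentially what the testing ends up doing in part 5 - is to record each test's exit status so that a failing test never aborts the build; a sketch against this Dockerfile:

# Each RUN step succeeds regardless of the test outcome, because the final
# command in the shell line is the echo; the real result lands in .retcode.
RUN ./int /testcases/1.lua >/testresults/1 2>&1; echo $? >/testresults/1.retcode
RUN ./int /testcases/2.lua >/testresults/2 2>&1; echo $? >/testresults/2.retcode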

A nanopass compiler in Haskell (part 4 of many)

A further explosion of boilerplate.


There are two standard candidates for a function that must be defined generically over the language trees: parsing and pretty-printing. In the compiler code I'm using parsec for parsing, mainly because of familiarity, but the volume of "parsing in parsec" details does not compare nicely with the volume of "writing boilerplate" details, so I will focus on pretty printing as the generic example.

The original plan was to use two separate typeclasses to represent the genericity [Spellchecker: I do not know if that is a real word either] in the code: a typeclass called Pool that would include functions defined for each node, and a second class for each target generic function (e.g. Pretty, Parse etc). This did not work out well. I have decided that I cannot express this in Haskell - but I'm still not sure whether the emphasis in that claim belongs on me or on Haskell :( On the bright side it looks as if Typeable and Data from the SYB approach will eventually take over the role of the Pool class in this plan.

When this plan broke down, shortly after making contact with the enemy, it degenerated into an attempt to wodge the generic functions directly into the Pool class. A foolish effort that will now be documented to make its failure entirely manifest.

type Indent = (Int,String)
oneline (tabs,text) = (indent tabs) ++ text ++ "\n"
indent depth = foldl (++) "" $ replicate depth "  "

class Pool a where
  pretty :: a -> String
  pretty = (foldl (++) "") . map oneline . expand
  expand :: a -> [Indent]

A more natural-seeming signature would map a Pool a onto a String directly. The input language is layout-sensitive, so the pretty printer needs to build the correct indenting. If each pretty node produced a String then the rules for line-breaking, or for swallowing internal indent, would become trickier. Instead the Indent type is used as an intermediate representation so that a soft line-break can merge adjacent lines more easily. The actual pretty printing is handled in the generic expand calls.

instance Pool Lit where
 expand (Lit n) = [(0,show n)]
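
A compound node would use the depth component to push its children one level deeper. A hedged sketch, not from the original code, of what the Do node might look like (the "do" keyword is a guess at the concrete syntax):

-- Expand each child statement and bump its depth by one level.
instance Pool s => Pool (Do s) where
  expand (Do stmts) = (0,"do") : concatMap (map deeper . expand) stmts
    where deeper (tabs,text) = (tabs + 1, text)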

None of these instance declarations are boiler-plate - this is the actual body of the generic function for each case. The new boilerplate that we need to add is the following:

instance Pool ParseE where
        expand (ParseE1 e) = expand e
        expand (ParseE2 e) = expand e
        expand (ParseE3 e) = expand e
        expand (ParseE4 e) = expand e
        expand (ParseE5 e) = expand e
        expand (ParseE6 e) = expand e
        expand (ParseE7 e) = expand e

instance Pool ParseS where
        expand (ParseS1 s) = expand s
        expand (ParseS2 s) = expand s
        expand (ParseS3 s) = expand s
        expand (ParseS4 s) = expand s

Yes, this makes me die a little on the inside to sit and type in. It is pure boilerplate in the sense that we could cut and paste it with minor edits, but it needs to be kept up to date with the code despite its low semantic complexity. The original plan for splitting the typeclass into two parts was an estimate of how much of this would be acceptable. If we denote the number of generic functions by n, and the number of constructors in the various tree languages by m, then this approach requires nm instances to work (plus the pattern synonyms and the actual tree declarations, which exist purely to provide unique names for each tag constructor). In the split approach it was hoped that n+m instances would be enough.

Far too much guffy code to write this way. Next we need to finish reading the SYB papers and work through some examples of Typeable and Data.

A nanopass compiler in Haskell (part 3 of many)

Writing lots of boilerplate


Sometimes it is not clear which way of proceeding is best, and then proceeding in any direction at all is as good as we can do. In this case there are nice approaches to solving the problem in the previous post, but I don't know the trade-offs between them as I've never used any of them before. So deciding which approach is best requires some experience of any approach. This is the kind of circular design tarpit that many programmers wrestle with. Realising this means that we can pick a good direction to head in: out, away, not-here!

Let's proceed with what is obviously the worst approach, to gain some experience of what makes it so bad. Or in other words, let's write lots of boilerplate before we try out the "Scrap Your Boilerplate" techniques. So we need a pool of nodes to describe a simple language:

-- Expressions
data Lit    = Lit Integer
  deriving(Eq,Show)
data Var    = Var String
  deriving(Eq,Show)
data Do s   = Do [s]
  deriving(Eq,Show)
data Case e = Case String [(Maybe Integer,e)]
  deriving(Eq,Show)
data Call e = Call String [e]
  deriving(Eq,Show)
data Add e  = Add e e
  deriving(Eq,Show)
data Sub e  = Sub e e
  deriving(Eq,Show)

-- Statements
data Assign e   = Assign String e
  deriving(Eq,Show)
data Activate e = Activate String [e]
  deriving(Eq,Show)
data Receive e  = Receive String [String] e
  deriving(Eq,Show)
data Return e   = Return e
  deriving(Eq,Show)
data Send e     = Send String e
  deriving(Eq,Show)

The division into expression nodes and statements is somewhat arbitrary: if we were writing a single language the relative placement of these nodes within trees would be constrained by their types. It seems good to remember that distinction as we try a more generic approach. Occasionally snippets of documentation reveal where not to point a loaded gun with respect to our own footwear.

The first language we specify over the pool of nodes is a simple parse tree; at this point the typing for expression/statement nodes appears immediately:

data ParseE     = ParseE1 Lit
  | ParseE2 Var
  | ParseE3 (Do ParseS)
  | ParseE4 (Case ParseE)
  | ParseE5 (Add ParseE)
  | ParseE6 (Sub ParseE)
  | ParseE7 (Call ParseE)
  deriving(Eq,Show)

data ParseS  = ParseS1 (Assign ParseE)
  | ParseS2 (Activate ParseE)
  | ParseS3 (Receive ParseE)
  | ParseS4 (Return ParseE)
  deriving(Eq,Show)

The two languages are simple definitions of which partition of the pool of nodes can appear at a given "kind" of node in the tree. They are mutually recursive by necessity; the definition of expressions and statements within the parsed language is mutually recursive. Each constructor within ParseE/ParseS acts purely as a tag - it specifies the choice of which pool node can appear below.

These tags form a wrapping for the pool nodes - there are no "raw" constructors held in the tree. Each of the pool constructors is encapsulated by a specific constructor for the tree class. This is of course quite ugly and unwieldy in practice as we need to use this wrapping everywhere the pool node would be operated upon:

f (ParseE1 (Lit n)) = g (ParseE2 (Var "newname"))

It seems quite ugly to need to do this, and yet it seems entirely necessary to bypass the monomorphism restriction; disabling the restriction is an approach that I'm not familiar with yet. At this point a familiar feeling descends on a researcher: this is ugly and it annoys me enough to fix it, yet other people have been doing this for years... Luckily we no longer live in an age where this means we need to spend a long time searching for the name of the concept that will fix it for us. Search engines have removed the search problem in one direction, but finding the name of a thing you do not already know is still difficult. So let's rely on social media and people's willingness to take that final step in research dissemination - telling other people how stuff works. Beautiful.

{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE FlexibleInstances #-}
pattern PLiteral x     = ParseE1 (Lit x)
pattern PVariable x    = ParseE2 (Var x)
pattern PDo x          = ParseE3 (Do x)
pattern PCase x y      = ParseE4 (Case x y)
pattern PAdd x y       = ParseE5 (Add x y)
pattern PSub x y       = ParseE6 (Sub x y)
pattern PCall x y      = ParseE7 (Call x y)
pattern PAssign x y    = ParseS1 (Assign x y)
pattern PActivate x y  = ParseS2 (Activate x y)
pattern PReceive x y z = ParseS3 (Receive x y z)
pattern PReturn x      = ParseS4 (Return x)
...
f (PLiteral n) = g (PVariable "newname")

Pattern synonyms solve this problem of crufty constructor tags perfectly, *tips hat toward danidiaz*. We get to keep pattern matching over a language that is defined independently of the pool nodes, so the mutual recursion between expressions and statements does not pollute the pool definitions. We still have a hope of writing generic code on the pool nodes that we can execute over any language tree we define this way. A hope. But we have not yet done anything with the language tree, so celebrating the lack of boilerplate would be a tad premature; a small example of the payoff is sketched below.
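
For instance, a small rewrite written against the synonyms rather than the raw ParseE tags; a hedged sketch, not code from the post:

-- Constant-fold an addition of two literals; anything else is left alone.
foldAdd :: ParseE -> ParseE
foldAdd (PAdd (PLiteral a) (PLiteral b)) = PLiteral (a + b)
foldAdd e                                = e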

The next step is to define a generic function that actually does something...