Friday, 14 October 2016

Automated Testing in Docker (part 6 of 5).

An extra postscript.

Testing submissions on the compiler course has been working well with docker over the last year. The latest version builds a report summary that is stored on the web-server, and then rendered into a viewable report for students. It seems to have made many platform-dependent bugs fixable for students, which has increased the quality of submissions. It has been successful enough that I ported it over to the Linux course, where it has also made an impact on the grading process.

Then it all broke.

The testing process was performed on one of two machines, depending on where I was working at the time:

  • A Mac laptop, using boot2docker inside virtualbox.
  • A linux desktop, using local install of docker.
There were no observable differences between testing results from the two platforms, although this may have been something of a fluke. The desktop machine died two weeks ago and was replaced with a much newer desktop with a Skylake processor. Unfortunately the processor is currently unstable on debian Jessie so the desktop is running Ubuntu 16.04 until the microcode / libc issues are resolved.

Running Docker under Ubuntu.

The benefit of docker (over manipulating raw VM images) is the convenience that the cmd-line tools give for handling containers and images. The performance benefits are not so important for this application. But both of these attributes arise because Docker builds images on union filesystems, building up read-only layers.
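
The layering is easy to see on any image: docker history lists the read-only layers and the commands that produced them (shown here against the stock debian base image, since the course image is only built further down).

docker history debian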

Switching to Ubuntu caused an unforeseen problem in the testing environment - all the core dumps disappeared. Investigating this revealed that the ulimit -c unlimited in the testing script was not sufficient to generate cores. The kernel checks /proc/sys/kernel/core_pattern to decide where to write the image.

In a docker container this is simply a read-only copy of the host! When /proc only served as an informative (reflective) interface to the kernel status this was not a problem. But now that /proc is also used as a configuration interface it means that details of the host are leaking into the container. In particular Ubuntu sets this to:

      |/usr/share/apport/apport %p %s %c %P

So that cores are piped into a reporting tool - which is not installed in the docker container, and is not the desired behaviour anyway.
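
To make the leak concrete: a fresh container just sees the host's value, read-only, and the only way to change it in this mode is to write it on the host itself (a sketch; the sysctl write needs root on the host).

docker run --rm debian cat /proc/sys/kernel/core_pattern   # shows the apport pipe inherited from the host
echo core | sudo tee /proc/sys/kernel/core_pattern         # must be done on the host, not in the container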

Conclusion: using docker in its default mode on linux as a form of configuration management for a testing environment is fatally flawed.

Wrapping the linux docker inside boot2docker.

The official way to install docker does not seem to include a virtualized linux option. The VM approach is used on Windows and OS X, but the installer for them (Docker Toolbox) is not available on linux. So this needs to be done manually:

curl -L https://github.com/docker/machine/releases/download/v0.8.2/docker-machine-`uname -s`-`uname -m` -odocker-machine
chmod 755 docker-machine
sudo mv docker-machine /usr/local/bin/
sudo chown root:root /usr/local/bin/docker-machine
docker-machine create --driver virtualbox default

Yes, I shit you not. It really is that ugly to get it onto an Ubuntu system. Life now takes a turn for the more "interesting":

Error creating machine: Error in driver during machine creation: This computer doesn't have VT-X/AMD-v enabled. Enabling it in the BIOS is mandatory

It seems that VT-X is disabled by default on the HP EliteDesks. Enabling it allows the boot2docker image to run successfully (https://github.com/docker/machine/issues/1983), and then docker-machine env default produces the right values to connect.

Note: all the old scripts use sudo docker. This still works - but it talks to the local daemon rather than the boot2docker VM. Running docker as the normal user (with the docker-machine environment set) reaches the machine where everything works. This is confusing to use.
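
For reference, the dance to point the client at the VM looks roughly like this (using the default machine created above):

docker-machine env default            # prints the DOCKER_HOST / TLS variables
eval "$(docker-machine env default)"  # export them into the current shell
docker ps                             # now talks to the daemon inside the boot2docker VM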

Standard install for the debian_localdev image used to test submissions:

docker run -it --name localdev debian /bin/bash
> apt-get update
> apt-get install gcc g++ clang gdb make flex bison graphviz vim
> ^d
docker commit localdev debian_localdev

After this we still get leakage from the server host - but boot2docker is quite minimal so we should be able to tolerate the configuration leakage.

cat /proc/sys/kernel/core_pattern
core

Need to remember to update the docker scripts before retesting all the submissions...

Monday, 21 March 2016

Quick notes on unbreaking gitolite

Some quick notes on de-fuck-ifying a gitolite3 installation.


Here is post-hoc explanation of what probably happened:

  • Installed gitolite3 on debian using a key called git.pub
  • Lost the private half.
  • "Fixed" the problem with a new key called rsa_git.pub, this was manually inserted into .ssh/authorized_keys instead of redoing the install.
  • Stuff worked (for about 5 months) long enough to dispel any suspicion that it was all funky and rotten underneath.
  • Tried to update the admin repo to add a new repo - all hell broke loose.

Symptoms


After committing the admin change, things got kind of weird and then this happened:

git clone ssh://git@face.mechani.se/gitolite-admin.git admin_face
Cloning into 'admin_face'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

The key is in an agent that has been running for months, so that seems to be unpossible!

At this point I realised that I had screwed the admin key and googled how to fix it. The instructions for gitolite3 say:

When in doubt, run 'gitolite setup' anyway; it doesn't do any harm, though it may take a minute or so if you have more than a few thousand repos!

This is not in any way true. It is entirely possible that running setup on a live install will break that install. It is not a safe operation at all. Do not believe the lies: there is no cake.

After using this to install the new key, something really bad happened. It looks like this:

FATAL: R any gitolite-admin rsa_git DENIED by fallthru
(or you mis-spelled the reponame)
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

So at this point things look really bad as access to the admin interface has been lost. This can be confirmed by trying to ssh directly into the git user on the server which shows this:


PTY allocation request failed on channel 0 
hello rsa_git, this is git@face running gitolite3 3.6.1-2 (Debian) on git 2.1.4

R W testing


Solution


A bit more googling shows that people then tend to panic and wipe their gitolite install and install from scratch to fix this. Instead I will now quote again from the gitolite3 docs:


Don't panic!

First, have a poke around the git user directory (if you no longer have a way to do this, i.e. you do not have root on the box, then go ahead and panic, that's probably the right approach). .gitolite/logs is very interesting and lets you reconstruct what has happened. More importantly:

.gitolite/conf is where the bare contents of your admin repo get blasted into!

So fix .gitolite/conf/gitolite.conf first to regain access (i.e. change the key names on every repo).
On its own this does nothing at all, because there is a file called gitolite.conf-compiled.pm that is obviously a cache. Delete it. This still does not work, but at least the error message changes to indicate that it needs that file. Finally, just run:

gitolite3 compile

This will regenerate the conf cache properly and let you back in. Problem solved, panic avoided.
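
To recap, the whole fix boils down to a few commands run as the git user (paths as on a stock Debian gitolite3 install):

cd ~/.gitolite/conf
vi gitolite.conf                  # fix the key names on every repo
rm gitolite.conf-compiled.pm      # drop the stale compiled cache
gitolite3 compile                 # regenerate it from the edited conf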



Monday, 19 October 2015

System Deployment (part 7 of x)

The Mechanise server is back up and running, with a mirror of each git repository on it. Before teaching starts for the winter term the web-server used for the courses needs to be brought back to life. This is complicated by a massive rewrite / redesign of large chunks of it that will probably stretch long into the term.

The web server.


This should be a simple beast: python code using the twisted library for HTTP processing. Large chunks of the site are content served dynamically to each student. Over the years it has been a testbed for pedagogic projects: generating unique assignments for students, integrating automatic testing into the submission system and other crazy ideas. As a result it has sprawled out of control, and the software architecture looks inspired by Picasso having a merry old time high on weapons-grade LSD.

The first step is to get the deployment system working again on the server. When the git repository hosting the server is updated a post-update hook springs into life:

  • Copy the source tree and resources into the production tree.
  • Kill the old server.
  • Respawn the noob.
Git hooks are a strange mess of server-side state that is not versioned... Inside the bare repository on the server we update files in the .git/hooks directory that git will execute during certain actions. The post-update hook is the one that will redeploy the server:

#!/bin/sh

echo "Website updated from commit" | logger -t gitolite
GIT_WORK_TREE=/var/www/thesite git checkout -f | logger -t gitolite
chmod 755 -R /var/www/thesite
chown git:git -R /var/www/thesite
curl -s http://localhost/restart
sleep 2
top -bn1 | grep python
ps -A --forest | grep -C1 python
tail /var/log/syslog

Like all archeologists we can find evidence of panic among the primitive people. The sleep followed by a dump of info is a sure sign that something did not work once, and that the confirmation used to debug that was so comforting that it was never removed.

Using a URL to kill the server is asking for trouble; currently we check that the incoming connection originated on the loopback (127.x) interface. This should not be spoof-able, but if it is then we can use a random number in the file-system to lock this request. This approach works better than a direct kill from the gitolite user as:

  • No worries about serialisation; if we are in the processing hook for the restart page then any file I/O for another request is done.
  • No worries about privileges to kill a process belonging to another user without introducing a privilege escalation attack.

First we need the user that will run the server, and their home-directory. The user www-data is already installed on debian for this purpose:

main@face:~$ grep www-data /etc/passwd
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
main@face:~$ sudo su www-data
[sudo] password for main:
This account is currently not available.
main@face:~$ sudo su -s /bin/bash www-data
www-data@face:/home/main$ cd
bash: cd: /var/www: No such file or directory

It is designed not to permit casual use - once upon a time it was a dreadful security hole when people would forget to set a strong password on the account, or even worse leave the default in place. We actually like it that way, so we will create the home directory that it needs and leave it disabled so that only the root user can log into it by forcing a different shell.

root@face:/home/main# mkdir /var/www
root@face:/home/main# ls -ld /var/www
drwxr-xr-x 2 root root 4096 Oct 19 09:34 /var/www
root@face:/home/main# chown git:git /var/www
root@face:/home/main# ls -ld /var/www
drwxr-xr-x 2 git git 4096 Oct 19 09:34 /var/www
root@face:/home/main# cat >/var/www/webservice <<EOF

#!/bin/bash
cd /var/www/thesite
while true
do
authbind python server.py 2>&1 | logger -t www
echo "Web server exited, restarting" | logger -t www
sleep 2
done
EOF

root@face:/home/main# chown www-data:www-data /var/www/webservice
root@face:/home/main# chmod 744 /var/www/webservice
root@face:/home/main# su -s /bin/bash www-data
www-data@face:/home/main$ cd
www-data@face:~$ ls -al
total 12
drwxr-xr-x 2 git git 4096 Oct 19 09:37 .
drwxr-xr-x 13 root root 4096 Oct 19 09:34 ..
-rwxr--r-- 1 www-data www-data 167 Oct 19 09:37 webservice
www-data@face:~$ ./webservice
./webservice: line 3: cd: /var/www/thesite: No such file or directory
^C
^D
root@face:/home/main# mkdir /var/www/thesite
root@face:/home/main# chown git:git /var/www/thesite
root@face:/home/main# chmod 755 /var/www/thesite
root@face:/home/main# touch /etc/authbind/byport/80
root@face:/home/main# chown www-data:www-data /etc/authbind/byport/80
root@face:/home/main# chmod 500 /etc/authbind/byport/80
root@face:/home/main# ls -l /etc/authbind/byport/80
-r-x------ 1 www-data www-data 0 Oct 19 10:17 /etc/authbind/byport/80

Cool. The user has just enough privileges to execute the service script, but it cannot do anything else as it has no write permission anywhere. The gitolite user owns the www-data home-directory and the thesite directory inside it. This is the target that we perform the bare checkout into each time the repo is updated. The basic workflow is like this:

  • Dev work happens off the server using -local to run the server in the non-production environment.
  • Deployment happens when the dev-commits are pushed back up to face.
  • The update-hook fires:
    • The production source tree is updated with the new code.
    • The old server is killed.
    • The service script spawns a new server after a couple of seconds.

Friday, 16 October 2015

System Deployment (part 6 of x)

Time to wander into a slightly different topic: now that the mechani.se domain is back up and running it is time to set up a gitolite3 installation on face.mechani.se.

Problem Context for gitolite.


My way of working on a linux system has evolved over the years because of some specific desires:

  • I work on several machines - my environment should always be the same
  • I dislike the hassle of maintaining backups - but I need to know that they are in place, and every time I switch machine I am effectively doing a restore.
  • Switching machines should break my context as little as possible.
The third point is the killer; during a session on one machine I build up a thick and fecund context. Depending on the work it may be edits to source, environment variables, command history and other forms of state that are local to the machine. Over time each machine acquires a layering of packages and installed artefacts (libraries, modules, random pieces of related source). Even seemingly inconsequential parts of the machine state are useful: the collection of workspaces and windows, positions and combinations are all memory prompts for a particular task.

The original dream (probably not even mine, these things tend to be infectious) was a teleporting environment: effectively hibernate a machine and transport the hibernated image to a new machine to be restored. These are the idle dreams of a grad student who works late through the night and doesn't want to start fresh when he trudges into the office. These dreams never quite found traction, although years of experimenting allowed them to morph into something more useful.

Virtualbox introduced teleportation a few years ago. The reality involves more suckage than the dream. Image files are large and cumbersome to transport. Somewhere I have a custom diff utility that syncs the hard-drive overlays inside a .VDI against a target-set to reduce the diff to a few hundred megs at a time (possible over my positively rural ADSL as well as on a flash drive). It just didn't really work out for us. Versioning entire OS images, avoiding branches and generally playing the maintain-the-repo-consistency game on a larger scale was even less fun than you would think.

It turns out that the answer is very boring and simple - it's more of a process than an artefact.

Version control for everything


Most of my work falls neatly into two categories:
  • Things that I know I will want to archive.
  • Things that will definitely be deleted after trying something out.
This clean taxonomic split is the stuff that programmers live for. It suggests a scratch directory that never needs to be backed up or transported off machine, and a set of version control repositories for everything that I would be displeased to lose. That is where people balk at the idea of how much hassle it would be to keep everything in sync, and the server-side issue of maintaining a bare-bones repository against each of those projects. I wrote a script. Well, truth be told I wrote quite a few over the years to find the right way to work, but in the end they all collapsed into a single script.

Much like a sordid-ploy by Sauron there is a single repository that rules them all, gitenv:
  • Every configuration file in my home directory is linked into this folder. Using git on the dense representation (the folder) allows version control over a sparse hierarchy of files (the overlay on my home directory). This is a nice trick. In some places these are symlinks to prevent chaos, and in other places we go straight for the jugular with hard-links (e.g. the files in .ssh are hard-linked into a directory inside the repository so that everything within can be versioned and shared across multiple machines).
  • Shell history for each machine is stored so history across all machines is searchable.
  • Custom bin directory, for all the magic.
  • A secrets file. Yup, these are a terrible idea, but then again so is losing access to a password. Which is why mine is encrypted using gpg and the plaintext contents never touch disk. In theory. Although my security needs are not particularly challenging and every time I screw up the passphrase I end up splashing the contents across the file-system. Yay!
This repository has a very strange property for source-control: the files within act as if they are in continuous change. Normally the state of a repository's contents acts discretely: things do not change in-between git commands. But linking the known_hosts file, and the shell history into this repository means that the contents are always dirty. Because it is always dirty it always needs to merge against the remote - so each machine has a slightly different history for this repository. It is challenging to work with.
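
A rough sketch of the linking trick described above (the file names are illustrative, not the actual layout of gitenv):

# symlinks for config files where an indirection is harmless
ln -s ~/gitenv/dotfiles/vimrc     ~/.vimrc
ln -s ~/gitenv/dotfiles/gitconfig ~/.gitconfig
# hard-links for the .ssh contents so known_hosts and friends stay versioned
ln ~/gitenv/ssh/known_hosts ~/.ssh/known_hosts
ln ~/gitenv/ssh/config      ~/.ssh/config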

Everything else is simple in comparison, there is a single independent repository for each:
  • Project - independent tree for a piece of source, with docs and test data.
  • Course - all materials and archives of student submissions.
  • Document collections - articles, books etc
  • Web server - each active server has its contents in source control - these repositories have post-commit hooks to deploy the master branch live on the machine.
This means that each machine that I use has a collection of a few dozen repositories. This would be a serious pain to maintain by hand. Instead one script takes care of the difficult merge between the continuous environment repository and the server (and its mirrors), and then works out how close to consensus the rest of the repositories are. Where the actions to establish consensus are simple (i.e. the repository is purely ahead or behind) the script brings it into line automatically. This makes things sane.
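
The core of the ahead/behind check is simple enough to sketch in shell (the real allsync.py is python, and the branch / remote names here are illustrative):

git fetch -q origin
# left count = commits only on the remote (behind), right count = commits only local (ahead)
read behind ahead < <(git rev-list --left-right --count origin/master...master)
if [ "$ahead" -gt 0 ] && [ "$behind" -eq 0 ]; then
    git push -q origin master               # purely ahead: push
elif [ "$behind" -gt 0 ] && [ "$ahead" -eq 0 ]; then
    git merge -q --ff-only origin/master    # purely behind: fast-forward
fi                                          # anything diverged is left for a human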

Transporting state between machines is the same as using backup/restore. This is absolutely essential - it means that the backup and the restore mechanism are in use every day. When you positively, absolutely need to rely on a system, make sure that you eat your own dog-food. Mmmm chewy. The weird thing about my backup and restore system is that any two machines rarely have exactly the same contents - but they all chase the same consensus state, and progress towards synchronisation is monotonic. This is actually good enough to make sure that nothing is ever lost.

Gitolite3


Git servers are nice and easy. Manually keeping track of repository details is an absolute pain in the arse. Thankfully gitosis, and now gitolite, have made that process incredibly simple. Despite that simplicity I have not yet worked out how to integrate this into the preseeded process, so for now this is a dangling live piece of state on the server. [Note to self: seeing it like this, it is quite obvious that running this with the sudo flipped around (root, or root->git) should make it easy.]

Each git server needs a user dedicated to gitolite3:

sudo adduser --system --shell /bin/bash --gecos 'Git version control' --group --disabled-password --home /home/git git
# Copy public key into git.pub
sudo su git
cp ../main/rsa_git.pub git.pub
gitolite setup -pk git.pub

The docs make it look much more complex, but on debian if you have installed the gitolite3 package this is all there is to it. Don't reuse a key - it may seem easier in the short-term but it actually makes things much more complex in the long term. Dedicate a key to git, and use an agent properly!

All the repositories inside gitolite are bare - this is the point of a server, guaranteed push. This has been running quite happily against a single server for years; as I do the upgrade I'm setting up a second mirror for the git server. I haven't tried automating the sync between mirrors yet - there is a bit of thought to be had first about whether or not pushes are guaranteed in a system with mirrors. I'm sure it will be fun to find out :)
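
One low-tech way to try it is to give each repository a second push URL, so that an ordinary push updates both servers (the mirror hostname and repo name below are hypothetical):

git remote set-url --add --push origin git@face.mechani.se:somerepo
git remote set-url --add --push origin git@mirror.mechani.se:somerepo
git push --all        # now goes to both push URLs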

I am always forgetting the admin repo URLs as there are slight differences in git-urls under the different protocol prefixes, but here it is as simple as:

git clone git@face.mechani.se:gitolite-admin face-gitolite-admin

Inside the admin repo the key is already in place so the config layout becomes completely uniform, conf/gitolite.conf looks like:

repo randomforests
    RW+ = git

repo paperBase
    RW+ = git

So now the allsync.py script needs some tweaks to handle multiple remotes as mirrors...



Thursday, 15 October 2015

System Deployment (5 of x)

Fixing the dodgy network settings. The debian-installer picks up a hostname over DHCP belonging to another site running on my provider's service. Manually editing /etc/hostname and then rebooting solves this.

Seems like the static configuration worked out well. Seeing this output gives me a nice warm feeling:

main@face:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:ssh *:* LISTEN
tcp6 0 0 [::]:ssh [::]:* LISTEN

I like to know exactly what is running on a server, and how many access points it adds for an attacker. In this state the box should be safe until another openssh exploit is discovered, and they do not happen every year. Good enough.

Next I need to bring up bind9 for the domain (ha, there goes any security hopes :). Every step now will be done twice:

  • Live steps on the server (configuration drift)
  • Replicating the changes in the file-system overlay inside the remastering .iso

At the end I'll blast away the live system to check the installer can rebuild it properly. This won't happen very often, because the KVM simulated access to the raw disk is as slow as it is possible to be. The automated install (minus the manual modprobe for the driver, which is still bugging the crap out of me) takes about 40 minutes. Not so nice.

Bind9

The best way to configure bind9 is using an old configuration that works :) Otherwise proceed very carefully, and read as much as you can about security errors in bind9 configurations. For recovery of old files I took a not so elegant approach to backing up the live server: dd the disk and copy it to a workstation. Mounting this loopback (before killing the old server) gives me a simple way to recover things.
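
For the record, that not-so-elegant approach is only a couple of commands (hostnames, the partition number and root access on the old box are all assumptions here; the image is attached read-only on the workstation):

ssh root@face 'dd if=/dev/vda bs=4M conv=sync,noerror | gzip -c' > face-disk.img.gz
gunzip face-disk.img.gz
sudo losetup -fP --show face-disk.img      # prints the loop device, e.g. /dev/loop0
sudo mount -o ro /dev/loop0p3 /mnt/face    # mount the root partition read-only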

Disaster: apparently I did not dd the disk offsite. Instead I took the shortcut of tar'ing the entire file-system as it was running. This obviously made some sort of sense at the time. For inexplicable reasons it means that the /root directory is missing, along with the .bash_history inside it that was a record of the steps in building the server directly. Lesson: next time be ye not a dick and dd the disk offsite.

Not to worry, I deliberately cultivate an anally-retentive OCD-level journal of work. The steps to build the server will be in there... Disaster the 2nd: as I was running on 16-hour workdays with little sleep when the server was first installed, there is a blank spot of a month in my notes. Oh bugger. Lesson: kind of unclear...

Well it can't be that hard to look up again, I've already written the zone files once... Oh look, there's the zone files and everything is sealed inside a chroot jail so that I can run bind9 as a non-privileged user. That's a really cool idea, how the hell did I do that then?

Hmm, so it all changed with jessie then? I don't think that systemd and I will become friends after I prize its charred tentacles off of my drive.

The late_command in the preseed.cfg is getting a bit busy so we'll add a new script to the iso called chroot_bind9.bash:

#!/bin/bash
mkdir -p /var/bind9/chroot/{etc,dev,var/cache/bind,var/run/named}
mknod /var/bind9/chroot/dev/null c 1 3
mknod /var/bind9/chroot/dev/random c 1 8
chmod 660 /var/bind9/chroot/dev/{null,random}
mv /etc/bind /var/bind9/chroot/etc   # Installed default from package
ln -s /var/bind9/chroot/etc/bind /etc/bind
cp /etc/localtime /var/bind9/chroot/etc/
chown -R bind:bind /etc/bind/*
chmod 775 /var/bind9/chroot/var/{cache/bind,run/named}
chgrp bind /var/bind9/chroot/var/{cache/bind,run/named}
cp /cdrom/initd_bind9 /etc/init.d/bind9
# Next line is deliberately fragile - should be checked / rewritten if there is a major update to bind9
# Also - this is gnu sed style argument, not bsd.
sed -i 's:PIDFILE=/var/run/named/named.pid:PIDFILE=/var/bind9/chroot/var/run/named/named.pid:' /etc/init.d/bind9
echo "\$AddUnixListenSocket /var/bind9/chroot/dev/log" > /etc/rsyslog.d/bind-chroot.conf
# Skip service restarts as we will reboot soon

The makefile needs to be updated to get the new info into the .iso:

remaster.iso: copy
	cp preseed.cfg copy/
	cp isolinux.cfg copy/isolinux/
	cp /home/amoss/.ssh/rsa_face.pub copy/
	cp chroot_bind9.bash copy/
	chmod +x copy/chroot_bind9.bash
	cp mechani.db copy/
	tar czf copy/overlay.tgz -C config etc home/main
	genisoimage -b isolinux/isolinux.bin -c isolinux/boot.cat -o remaster.iso -J -R -no-emul-boot -boot-load-size 4 -boot-info-table copy/

And lastly the preseed is updated to execute the new script:

d-i preseed/late_command string \
tar xzf /cdrom/overlay.tgz -C /target ; \
in-target chown -R main:main /home/main ; \
in-target chown root:root /etc/hosts ; \
in-target chown root:root /etc/ssh/sshd_config ; \
chmod 700 /target/home/main/.ssh ; \
in-target chown main:main /home/main/.ssh/authorized_keys ; \
chmod 600 /target/home/main/.ssh/authorized_keys ; \
/cdrom/chroot_bind9.bash

Again, for emphasis: none of this has been tested yet - but what fun is life if we do not live dangerously, eh? After a robust exchange of views with my registrar about the quality of service, they have rebuilt the entente cordiale by manually flicking some switches somewhere, and lo and behold:

dig face.mechani.se +trace
; <<>> DiG 9.8.3-P1 <<>> face.mechani.se +trace
;; global options: +cmd
. 14196 IN NS j.root-servers.net.
. 14196 IN NS d.root-servers.net.
. 14196 IN NS c.root-servers.net.
. 14196 IN NS i.root-servers.net.
. 14196 IN NS h.root-servers.net.
. 14196 IN NS l.root-servers.net.
. 14196 IN NS k.root-servers.net.
. 14196 IN NS f.root-servers.net.
. 14196 IN NS g.root-servers.net.
. 14196 IN NS e.root-servers.net.
. 14196 IN NS m.root-servers.net.
. 14196 IN NS b.root-servers.net.
. 14196 IN NS a.root-servers.net.
;; Received 228 bytes from 8.8.8.8#53(8.8.8.8) in 89 ms

se. 172800 IN NS a.ns.se.
se. 172800 IN NS b.ns.se.
se. 172800 IN NS c.ns.se.
se. 172800 IN NS d.ns.se.
se. 172800 IN NS e.ns.se.
se. 172800 IN NS f.ns.se.
se. 172800 IN NS g.ns.se.
se. 172800 IN NS i.ns.se.
se. 172800 IN NS j.ns.se.
;; Received 492 bytes from 192.203.230.10#53(192.203.230.10) in 123 ms

mechani.se. 86400 IN NS face.mechani.se.
;; Received 63 bytes from 130.239.5.114#53(130.239.5.114) in 175 ms

face.mechani.se. 1800 IN A 46.246.89.132
mechani.se. 1800 IN NS face.mechani.se.
;; Received 63 bytes from 46.246.89.132#53(46.246.89.132) in 49 ms


Is good, no? The domain has only been offline for a month due to a "routine upgrade" :) Next up I will restore the gitolite configuration and mirror my lonely git server...

Wednesday, 14 October 2015

System Deployment (4 of x)

The server is back up and running, having managed to do an hdd install inside the KVM environment for the first time. Yay! There is nothing running on it yet. There is a rough todo list before the site is back up.

  • Find out why the static network config is broken, disable dhcp, and verify that sshd is the only open port on the machine.
  • Stick the bind9 config back on the machine and bring the DNS back to life.
  • Install gitolite and setup barebones mirrors of the git server on gimli.
  • Put the web-user back in, and setup the hooks to run the server, redeploy from git.
But first a brief detour through the boot sequence (need to convert this to lecture slides)...

Linux Boot Sequence


This has changed over the years, and will probably change again. These details are against the current stable branch of jessie (late 2015).

Step 0: BIOS

The BIOS is effectively the firmware for the specific PC that it is running on. It exists in order to bootstrap the machine: the kernel to load is somewhere in the machine storage. The BIOS should contain enough of the storage-specific drivers to access the storage and load in the next stage for boot. The first BIOS was introduced on the original IBM PC in 1981. It has not changed much since then. In 1981 the range of peripheral devices available was limited; the BIOS was meant to function as a hardware abstraction layer in an era when that meant accessing the console, keyboard and disk drives. This layer is ignored by modern kernels.

The BIOS on every machine (the industry is currently in a transition to UEFI to replace this completely) follows a simple specification to find and execute the Master Boot Record (MBR).
For each disk in the user-specified boot order (e.g. HDD, cdrom, usb etc) :

  • Load sector 0 (512 bytes) into memory at 7C00h.
  • Verify the two-byte signature: 7DFEh=55h, 7DFFh=AAh.
  • Jump to the code in 16-bit real mode with the following register setup:
    • Code Segment = 0000h
    • Instruction Pointer = 7C00h

Step 1: MBR (stage 1)

The MBR used in DOS / Windows has changed over the years to include support for Logical Block Addresses / Plug'n'Play and other extensions. The BIOS-MBR interface must remain constant to guarantee that the boot sequence will work without knowing the specific combination of BIOS and O/S on the machine.

It is easy to access the MBR from a live linux system as it is at a fixed location on the disk, for example if we are on the first scsi disk in the system:

dd if=/dev/sda of=mbr_copy bs=512 count=1    # save the MBR to a file
dd if=mbr_copy of=/dev/sda bs=512 count=1    # ...and write it back

Booting Linux almost always means booting the GRUB MBR. If we want to see how that works then we can just disassemble the code in the mbr:

objdump -D -b binary -mi386 -Maddr16,data16 mbr_copy

mbr_copy:     file format binary
Disassembly of section .data:
00000000 <.data>:
   0:   eb 63                   jmp    0x65
...
65: fa cli
66: 90 nop
67: 90 nop
68: f6 c2 80 test $0x80,%dl
6b: 74 05 je 0x72
6d: f6 c2 70 test $0x70,%dl
70: 74 02 je 0x74
72: b2 80 mov $0x80,%dl
74: ea 79 7c 00 00 ljmp $0x0,$0x7c79

Here we can see a check on a parameter passed in by the BIOS (%dl indicates which disk was booted), a quick check to see if it is a fixed or removable disk, and then a far jump to 0000h:7C79h - an address inside the MBR's own 512 bytes - which normalises the CS:IP pair before execution continues.

All of the stage 1 functionality has to fit into 510 bytes of 16-bit real-mode code. This is not a lot. To make life more interesting there is a data-structure embedded inside this code in a standard format: the partition table for the drive, which gives us four primary partitions. To be read and written by standard tools this table must be at specific locations inside the sector. When we access this table, e.g. with something like:

sudo fdisk /dev/sda

The fdisk tool needs to access the table without executing any code in the MBR to do so. This reduces the space for executable code to 446 bytes (the table takes 4x 16-byte entries). This is enough to locate a larger boot-stage on the disk using the raw BIOS routines, and execute this second stage. An incredibly detailed (and thus useful) walkthrough of the booting scheme can be read on the grub mailing list.
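
The fixed layout is easy to poke at in the copy taken earlier: the 64-byte table starts at offset 446 and the two-byte signature fills the end of the sector.

xxd -s 446 -l 64 mbr_copy     # the four 16-byte partition entries
xxd -s 510 -l 2  mbr_copy     # the 55 aa boot signature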

Step 2: MBR (stage 2)

The second stage boot-loader is much larger. In the windows world this is called the VBR and is loaded directly from the beginning of the partition to boot. GRUB (in the classic MBR scheme) loads its second stage from a gap in the disk - the rest of the first cylinder on the disk is used as padding to align the first partition on a cylinder boundary. This padding consists of 63 sectors, or 31.5K of space. This is enough space to identify a boot partition with a known file system and load files (not raw sectors) from there. Typically the second stage of GRUB will display a splash screen with a menu and allow the user to select what to boot next. For a windows install this means chainloading the initial sectors from the installed partition. For a linux install it means loading a kernel file into memory, along with an initial ramdisk holding a bootstrap filesystem and passing control to the kernel.

The splashscreen menu is controlled by a file-format for GRUB2 that looks like this:

menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz
}

The root is the partition that contains the files for grub. Numbering is somewhat chaotic, harddisks are numbered from zero: hd0, hd1... while partitions are numbered from 1. So (hd0,1) is the first partition on the first disk. This corresponds to the partitioning scheme in the previous post (1GB partition to hold the kernel, initrd and .iso images for installers). The second stage will mount the ext2 file-system on that partition, then the filenames ('/vmlinuz') are absolute paths in that file-system. The kernel accepts some arguments from the boot-loader; panic=20 is very useful during development: when the kernel panics and refuses to boot it reboots the system back to grub after 20 seconds. When you don't have a physical keyboard to stab ctrl-alt-del on, this one saves a lot of coffee mugs from acts of extreme violence.

On some distros (gentoo springs to mind, although they may have updated it since I used it last) these config files are edited directly on the boot drive, normally under /boot/grub. Debian has some extra support for building the configuration. Each menuitem becomes an executable script under /etc/grub.d, so for example the above becomes /etc/grub.d/11_remaster by wrapping the contents in a here-document:

#!/bin/sh -e
cat << EOF
menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz
}
EOF

Step 3: Kernel / initrd

GRUB mounted the boot drive in order to load the kernel files, but the kernel cannot access this mount: we want the kernel to be independent of the boot-loader, and looking through memory for these data-structures would represent a huge dependency. So the kernel will need to access the hardware and load the file-system itself. Standard bootstrap problem: where are the drivers to do this? They are of course on the disk. Bugger.

In Linux the solution is quite elegant. Start with a / file-system already mounted, including all the necessary drivers. Use it to mount the real / file system on the disk, and then move the original out of the way. This is much easier to achieve, and doesn't involve its own bootstrap problem because we can just serialise the contents of a ramdisk directly onto the harddrive. The kernel can be booted with this ramdrive read back into memory. The tool for doing this is cpio, the contents get gzipped and GRUB knows how to load and unzip the initrd.gz directly into memory for the kernel to use during boot.
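
The round trip is short enough to show in full (a generic sketch; the cycleramd.bash script mentioned in the next paragraph is a concrete instance of the same idea):

mkdir tmp && cd tmp
gunzip -c ../initrd.gz | cpio -id           # unpack the initramfs into a directory tree
# ... edit the tree: drop in modules, tweak init scripts ...
find . | cpio -H newc -o | gzip > ../initrd-new.gz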

An important principle here is reuse of existing tools: the initrd is a standard filesystem tree (before the serialisation and zipping) so we can edit any part of the boot-image by unpacking it and using standard tools on it. This is what the cycleramd.bash script in the previous post did to insert the KVM para-virtualisation drivers.

Step 4: Init / inittab

Once the kernel has finished initialising itself, and gained access to the root file system, it can proceed with booting the rest of the system. The /sbin/init executable is responsible for bringing the system up to the required level. In an ideal world this is a simple isolated piece of code controlled by a single plaintext configuration file called /etc/inittab. Trying to pull this from my desktop system fell over the slimy entrails of systemd, which I will describe another time. The inittab from the hd-media initrd is a simpler example to work from:

# /etc/inittab
# busybox init configuration for debian-installer

# main rc script
::sysinit:/sbin/reopen-console /sbin/debian-installer-startup

# main setup program
::respawn:/sbin/reopen-console /sbin/debian-installer

# convenience shells
tty2::askfirst:-/bin/sh
tty3::askfirst:-/bin/sh

# logging
tty4::respawn:/usr/bin/tail -f /var/log/syslog

# Stuff to do before rebooting
::ctrlaltdel:/sbin/shutdown > /dev/null 2>&1

# re-exec init on receipt of SIGHUP/SIGUSR1
::restart:/sbin/init

The configuration file defines what programs we should connect to the ttys that we can access through alt-f1, alt-f2 etc. On a desktop we would expect to see the X server connected to a login manager. Different forms of shutdown are associated with commands. On a desktop we would see different runlevels associated with the programs executed to get there. In this ramdisk we see how to turn the linux system into a one-program kiosk system, in this case to run the installer. Neat.

Step 5: Rest of the system

This depends entirely on the programs launched from init. The single-application kiosk image boots a very different system from a typical server (ha, what is that?) or a typical desktop environment. Seeing how the debian-installer manages its boot gives me some very evil ideas for single-process servers in a locked down environment that I may explore later. For lecture slides I should probably also take some time to describe on another day:

  • The modern GPT scheme and UEFI
  • Booting from cdrom and usb.
  • Way to launch server processes
  • Access to the debian-installer source on anonscm.
  • Alternative preseeding approach with the config on the ramdisk.

Sunday, 11 October 2015

System Deployment (3 of x)

The desktop installer seems to work (it's in day-to-day use now). Currently it builds an e17 desktop on top of Debian, with enough support to rebuild everything in my repositories. It is stable enough to host the build system for remastering a server: the target environment.

Server overview.

The server is a VPS (virtual private server) running in a data-center belonging to the provider. The virtual machine runs inside a KVM hosting environment. Physical access is simulated through VNC - importantly this connects to the host rather than the guest so it remains available during reboot and reinstall. Unfortunately the keymap is a bit screwed up and there is no way to change it. Most(!?!) important punctuation can be found by setting the local keymap to US, but it is a minimal environment.

The simulated CD can only be changed by someone with admin privileges on the host, so it requires a support ticket and 24-48hr turn-around time. For this reason it is left set to a virgin debian installer image.

Bootstrap of the installation environment.

One issue that crops up straight away is that although the netinst image can find the virtual drive during installation, it cannot find it directly at boot. This is not a problem for a simple clean install, but the re-installer will use the hd-media kernel/initrd to boot the cdrom image - and this system cannot find the virtual drive.

KVM uses paravirtualisation, so the kernel will need the virtio drivers (in particular virtio_blk) and
these are not in the hd-media images by default. The initial environment will look like this:

Partition 1: 1000MB, ext2, bootable
   /vmlinuz - default kernel image from hd-media
   /initrd.gz - modified hd-media initial ramdisk with extra modules for virtio
   /remaster.iso - preseeded debian installer for the target installer
Partition 2: 1000MB, swap
Partition 3: Remaining space, ext4, mounted as /
Grub installed on the MBR
  - Standard menuitem to boot /dev/vda3 into the target system.
  - Extra menuitem to boot kernel/initrd from /dev/vda1

The standard installer is used through VNC to partition the disk and get a working system onto /dev/vda3. To rebuild the initrd we use the following script in the desktop environment. This saves a huge amount of work: the environment created by the jessie installer is the environment that the jessie installer was built inside.

#!/bin/bash
rm -rf tmpramd
mkdir tmpramd
(cd tmpramd && gunzip -c ../hdmedia/initrd.gz | cpio -i)
cp /lib/modules/3.16.0-4-amd64/kernel/drivers/block/virtio_blk.ko tmpramd/lib/modules/3.16.0-4-amd64/kernel/drivers/block/
mkdir tmpramd/lib/modules/3.16.0-4-amd64/kernel/drivers/virtio
cp /lib/modules/3.16.0-4-amd64/kernel/drivers/virtio/*ko tmpramd/lib/modules/3.16.0-4-amd64/kernel/drivers/virtio/
#sed -e 's:start-udev$:&\n/sbin/modprobe virtio_blk:' tmpramd/init >newinit
#mv newinit tmpramd/init
#chmod +x tmpramd/init
echo >>tmpramd/etc/rcS.d/virtio /sbin/modprobe virtio_blk
chmod +x tmpramd/etc/rcS.d/virtio
#echo virtio_blk >>tmpramd/etc/modules
(cd tmpramd && find . | cpio -H newc -o | gzip >../cycled.gz)
scp cycled.gz main@face:initrd.gz

As the comments in the script show there are several ways to do this that do not seem to work. I don't know why. If you have any idea please leave a comment. Trying to use /etc/modules to force loading the driver does nothing - perhaps this was a 2.4-only thing that has long been superseded in the kernel? Inserting the modprobe into the init for the ramdisk just causes a kernel panic when it fails. Inserting the modprobe into the rcS.d means it is called later in init, when control is passed to the debian-installer-init script. This seems to work. [edit: no it doesn't. Some kind of stateful problem in testing, it looked like it worked but this is currently broken. Will update in a later post]. Inside the clean debian install we create /etc/grub.d/11_remaster and execute update-grub.

#!/bin/sh -e
cat << EOF
menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz
}
EOF

This puts us in the position where we can execute the preseeded installer directly from the harddrive to build the target system. There is no way to avoid the partitioning step inside the preseeded installer, so it is vital that the partitions made in the original clean install are identical to those made in the preseeded installer. Overwriting the partition table with the same data does not lose any data on the disk.

The preseeded installer.

As with the desktop install the preseed file is wrapped inside the .iso for the installer. Looks very similar, same partitioning scheme: 1GB for installer images, 1GB for swap, rest for a single / file-system. No dynamic network, hardcoded to the static setup of the target server. The overlay that gets untar'd at the end overwrites the sshd config. No passwords, no root access. Only strong-passphrase keys, the public halves being in the .iso and converted directly into an .ssh/authorized_keys file. There is a single random string with the password for the main user, but this can only be used over VNC. Basic package load for the server.

d-i debian-installer/language string en
d-i debian-installer/country string SE
d-i debian-installer/locale string en_SE.UTF-8
d-i keyboard-configuration/xkb-keymap select sweden

d-i netcfg/choose_interface select eth0

# To pick a particular interface instead:
#d-i netcfg/choose_interface select eth1

# To set a different link detection timeout (default is 3 seconds).
# Values are interpreted as seconds.
#d-i netcfg/link_wait_timeout string 10

# If you have a slow dhcp server and the installer times out waiting for
# it, this might be useful.
#d-i netcfg/dhcp_timeout string 60
#d-i netcfg/dhcpv6_timeout string 60

# If you prefer to configure the network manually, uncomment this line and
# the static network configuration below.
d-i netcfg/disable_autoconfig boolean true

# If you want the preconfiguration file to work on systems both with and
# without a dhcp server, uncomment these lines and the static network
# configuration below.
#d-i netcfg/dhcp_failed note
#d-i netcfg/dhcp_options select Configure network manually

# Static network configuration.
d-i netcfg/get_ipaddress string 46.246.89.132
d-i netcfg/get_netmask string 255.255.255.128
d-i netcfg/get_gateway string 46.246.89.129
d-i netcfg/get_nameservers string 8.8.8.8
d-i netcfg/confirm_static boolean true

# IPv6 example
#d-i netcfg/get_ipaddress string fc00::2
#d-i netcfg/get_netmask string ffff:ffff:ffff:ffff::
#d-i netcfg/get_gateway string fc00::1
#d-i netcfg/get_nameservers string fc00::1
#d-i netcfg/confirm_static boolean true

d-i netcfg/get_hostname string face
d-i netcfg/get_domain string mechani.se
d-i netcfg/hostname string face
d-i netcfg/wireless_wep string # Disable that annoying WEP key dialog.

### Mirror settings
d-i mirror/protocol string ftp
d-i mirror/country string se
d-i mirror/ftp/hostname string ftp.se.debian.org
d-i mirror/ftp/directory string /debian

### Account setup
# Skip creation of a root account (normal user account will be able to
# use sudo).
d-i passwd/root-login boolean false
# Alternatively, to skip creation of a normal user account.
#d-i passwd/make-user boolean false

# Root password, either in clear text
#d-i passwd/root-password password abc
#d-i passwd/root-password-again password abc
# or encrypted using an MD5 hash.
#d-i passwd/root-password-crypted password [MD5 hash]

# To create a normal user account.
d-i passwd/user-fullname string The main user
d-i passwd/username string main
d-i passwd/user-password password xxxxxxxx
d-i passwd/user-password-again password xxxxxxxx

d-i clock-setup/utc boolean true
d-i time/zone string Europe/Stockholm
d-i clock-setup/ntp boolean true

d-i partman-auto/disk string /dev/vda
d-i partman-auto/method string regular
# Manual use of the installer on face reports 30.1GB
d-i partman-auto/expert_recipe string \
remasterPart :: \
1000 1000 1000 ext2 \
$primary{ } $bootable{ } \
method{ keep } \
. \
1000 1000 1000 linux-swap \
$primary{ } \
method{ swap } format{ } \
. \
15000 15000 150000 ext4 \
$primary{ } $bootable{ } \
method{ format } format{ } \
use_filesystem{ } filesystem{ ext4 } \
mountpoint{ / } \
.
#d-i partman/choose_recipe select atomic
d-i partman-auto/choose_recipe select remasterPart
d-i partman/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true
d-i partman-basicmethods/method_only boolean false
d-i partman-md/confirm boolean true

# Package setup
tasksel tasksel/first multiselect minimal
d-i pkgsel/include string openssh-server git python python-dateutil sudo bind9 bind9-host gitolite3 binutils dnsutils authbind curl
popularity-contest popularity-contest/participate boolean false

# MBR
d-i grub-installer/only_debian boolean true
d-i grub-installer/with_other_os boolean true
d-i grub-installer/bootdev string /dev/vda
d-i finish-install/reboot_in_progress note

d-i preseed/late_command string \
tar xzf /cdrom/overlay.tgz -C /target ; \
in-target chown -R main:main /home/main ; \
in-target chown root:root /etc/hosts ; \
in-target chown root:root /etc/ssh/sshd_config ; \
chmod 700 /target/home/main/.ssh ; \
in-target chown main:main /home/main/.ssh/authorized_keys ; \
chmod 600 /target/home/main/.ssh/authorized_keys