Wednesday, 14 October 2015

System Deployment (4 of x)

The server is back up and running, having managed to do a hdd install inside the KVM environment for the first time. Yay!. There is not anything running on it yet. There is a rough todo list before the site is back up.

  • Find out why the static network config is broken, and disable dhcp, verify that sshd is the open port open on the machine.
  • Stick the bind9 config back on the machine and bring the DNS back to life.
  • Install gitolite and setup barebones mirrors of the git server on gimli.
  • Put the web-user back in, and setup the hooks to run the server, redeploy from git.
But first a brief detour through the boot sequence (need to convert this to lecture slides)...

Linux Boot Sequence

This has changed over the years, and will probably change again. These details are against the current stable branch of jessie (late 2015).

Step 0: BIOS

The BIOS is effectively the firmware for the specific PC that it is running on. It exists in order to bootstrap the machine: the kernel to load is somewhere in the machine storage. The BIOS should contain enough of the storage-specific drivers to access the storage and load in the next stage for boot. The first BIOS was introduced on the original IBM PC in 1983. It has not changed much since then. In 1983 the range of peripheral devices available was limited; the BIOS was meant to function as a hardware abstraction layer in an era when that meant accessing the console, keyboard and disk driver. This layer is ignored by modern kernels.

The BIOS on every machine (the industry is currently in a transition to UEFI to replace this completely) follows a simple specification to find and execute the Master Boot Record (MBR).
For each disk in the user-specified boot order (e.g. HDD, cdrom, usb etc) :

  • Load sector 0 (512 bytes) into memory at 7C00h.
  • Verify the two-byte signature: 7DFEh=55h, 7DFFh=AAh.
  • Jump to the code in 16-bit real mode with the following register setup:
    • Code Segment = 0000h
    • Instruction Pointer = 7C00h

Step 1: MBR (stage 1)

The MBR used in DOS / Windows has changed over the years to include support for Logical Block Addresses / Plug'n'Play and other extension. The BIOS-MBR interface must remain constant to guarantee that the boot sequence will work without knowing the specific combination of BIOS and O/S on the machine.

It is easy to access the MBR from a live linux system as it is at a fixed location on the disk, for example if we are on the first scsi disk in the system:

dd if=/dev/sda of=mbr_copy bs=512 count=1
dd if=mbr_copy of=/dev/sda bs=512 count=1

Booting Linux almost always means booting the GRUB MBR. If we want to see how that works then we can just disassemble the code in the mbr:

objdump -D -b binary -mi386 -Maddr16,data16 /usr/mdec/mbr

mbr_copy:     file format binary
Disassembly of section .data:
00000000 <.data>:
   0:   eb 63                   jmp    0x65
65: fa cli
66: 90 nop
67: 90 nop
68: f6 c2 80 test $0x80,%dl
6b: 74 05 je 0x72
6d: f6 c2 70 test $0x70,%dl
70: 74 02 je 0x74
72: b2 80 mov $0x80,%dl
74: ea 79 7c 00 00 ljmp $0x0,$0x7c79

Here we can see a check on a parameter passed in by the BIOS (%dl indicates which disk was booted), a quick check to see if it is a fixed or removable disk and then a direct call into a BIOS routine by its real-memory address. (the API for the BIOS is a list of addresses and register setups).

All of the stage 1 functionality has to fit into 510 bytes of 16-bit real-mode code. This is not a lot. To make life more interesting there is a data-structure embedded inside this code in a standard format, this is the partition table for the drive that gives us four primary partitions. To be read and written by standard tools this table must be at specific locations inside the sector. When we access this table, e.g. something like:

sudo fdisk /dev/sda

The fdisk tools needs to access the table without executing any code in the MBR to do so. This reduces the space for executable code to 446 bytes (4x 16-byte table entries). This is enough to find and locate a larger boot-stage on the disk, using the raw BIOS routines and execute this second stage. An incredibly detailed (and thus useful) walkthrough of the booting scheme can be read on the grub mailing list.

Step 3 : MBR (stage 2) 

The second stage boot-loader is much larger. In the windows world this is called the VBR and is loaded directly from the beginning of the partition to boot. GRUB (in the classic MBR scheme) loads its second stage from a gap in the disk - the rest of the first cylinder on the disk is used as padding to align the first partition on a cylinder boundary. This padding consists of 63 sectors, or 31.5K of space. This is enough space to identify a boot partition with a known file system and load files (not raw sectors) from there. Typically the second stage of GRUB will display a splash screen with a menu and allow the user to select what to boot next. For a windows install this means chainloading the initial sectors from the installed partition. For a linux install it means loading a kernel file into memory, along with an initial ramdisk holding a bootstrap filesystem and passing control to the kernel.

The splashscreen menu is control by a file-format for GRUB2 that looks like this:

menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz

The root is the partition that contains the files for grub. Numbering is somewhat chaotic, harddisks are numbered from zero: hd0, hd1... while partitions are numbered from 1. So (hd0,1) is the first partition on the first disk. This corresponds to the partitioning scheme in the previous post (1GB partition to hold the kernel, initrd and .iso imagers for installers). The second stage will mount the ext2 file-system on that partition, then the filenames ('/vmlinuz') are absolute paths in that file-system. The kernel accepts some arguments from the boot-loader, panic=20 is very useful during development: when the kernel panics and refuses to boot it reboots the system back to grub after 20 seconds. When you don't have a physical keyboard to stab ctrl-alt-del on, this one saves a lot of coffee mugs from acts of extreme violence.

On some distros (gentoo springs to mind, although they may have updated it since I used it last) these config files are edited directly on the boot drive, normally under /boot/grub. Debian has some extra support for building the configuration. Each menuitem becomes an executable script under /etc/grub.d, so for example the above becomes /etc/grub.d/11_remaster by wrapping the contents in a here-document:

#!/bin/sh -e
cat << EOF
menuentry "Remaster ISO" {
set root='(hd0,1)'
#loopback loop /remaster.iso
linux /vmlinuz initrd=/initrd.gz root=/dev/vda1 vga=788 auto=true panic=20 priority=critical preseed/file=/cdrom/preseed.cfg ---
initrd /initrd.gz

Step 4: Kernel / initrd

GRUB mounted the boot drive in order to load the kernel files, but the kernel cannot access this mount: we want the kernel to be independent of the boot-loader, and looking through memory for these data-structures would represent a huge dependency. So the kernel will need to access the hardware and load the file-system itself. Standard boot-strip problem: where are the drivers to do this? They of course on the disk. Bugger.

In Linux the solution is quite elegant. Start with a / file-system already mounted, including all the necessary drivers. Use it to mount the real / file system on the disk, and then move the original out of the way. This is much easier to achieve, and doesn't involve its own bootstrap problem because we can just serialise the contents of a ramdisk directly onto the harddrive. The kernel can be booted with this ramdrive read back into memory. The tool for doing this is cpio, the contents get gzipped and GRUB knows how to load and unzip the initrd.gz directly into memory for the kernel to use during boot.

An important principe here is reuse of existing tools: the initrd is a standard filesystem (before the serialisation and zipping) so we can edit any part of the boot-image by mounting it and using standard tools on it. This is what the cycleramd.bash script in the previous post did to insert the KVM para-virtualisation drivers.

Step 5: Init / inittab

Once the kernel has finished initialising itself, and gained access to the root file system it can proceed with booting the rest of the system. The /sbin/init executable is responsible for bring the system up to the required level. In an ideal world this is a simple isolate piece of code controlled by a single plaintext configuration file called /etc/inittab. Trying to pull this from my desktop system fell over the slimey entrails of systemd, which I will describe another time. The inittab from the hd-media initrd is a simpler example to work from:

# /etc/inittab
# busybox init configuration for debian-installer

# main rc script
::sysinit:/sbin/reopen-console /sbin/debian-installer-startup

# main setup program
::respawn:/sbin/reopen-console /sbin/debian-installer

# convenience shells

# logging
tty4::respawn:/usr/bin/tail -f /var/log/syslog

# Stuff to do before rebooting
::ctrlaltdel:/sbin/shutdown > /dev/null 2>&1

# re-exec init on receipt of SIGHUP/SIGUSR1

The configuration file defines what programs we should connect to the ttys that we can access through alt-f1, alt-f2 etc. On a desktop we would expect to see the X server connected to a login manager. Different forms of shutdown are associated with commands. On a desktop we would see different runlevels associated with the programs executed to get there. In this ramdisk we see how to turn the linux system into a one-program kiosk system, in this case to run the installer. Neat.

Step 6: Rest of the system

This depends entirely on the programs launched from init. The single-application kiosk image boots a very different system from a typical server (ha, what is that?) or a typical desktop environment. Seeing how the debian-installer manages its boot gives me some very evil ideas for single-process servers in a locked down environment that I may explore later. For lecture slides I should probably also take some time to describe on another day:

  • The modern GPT scheme and UEFI
  • Booting from cdrom and usb.
  • Way to launch server processes
  • Access to the debian-installer source on anonscm.
  • Alterative preseeding approach with the config on the ramdisk.

No comments:

Post a Comment