Saturday, December 18, 2021

Linux from Scratch and Creating initrd for LVM


Hi there. Even though I have wanted to write about electronics for a while, it did not happen again, unfortunately. Today's topic is a configuration that took me two weeks, more precisely, creating an initrd. So, why do I need such a thing?

Let me start with LVM. Although I had bad experiences with it at first, I have gotten so used to it since 2014 that a Linux installation without it is now unimaginable. I do not have partitions spanning multiple disks, but I think LVM's most useful feature is its flexibility in moving and resizing partitions.

The Linux from Scratch (LFS) project is the other side of this subject. It is a project that lets users compile the necessary tools and the Linux kernel from scratch, i.e. from their source code, in order to build a working Linux environment. It starts with a working compiler, of course. After gcc is compiled as a cross compiler, the basic tools are compiled with it and a minimal chroot (change root) environment is created. The Linux kernel is compiled in the last step and made bootable with GRUB. I got to know this project from Viktor Engelmann's videos. I will follow a different path from his here. Both of us actually deviate slightly from the official LFS documentation (or "the book" in LFS terminology), but both paths are based on the book in the end. A few of the steps I follow come from the (B)eyond LFS book.

There are actually two versions of the book, systemd and initd, but they are basically the same up to chapter 8: Packages. This post is based on the eleventh version of the book for both systemd and initd. The book also comes in two formats for each version, under "Download" and "Read Online" on the website. Although the .pdf file on the Download page looks more compact, copying and pasting commands is definitely easier from the .html version. Btw, most of the time in this project is spent compiling packages (it takes around 16 hours on my machine to compile everything, excluding tests). Therefore, I will only mention the steps where I deviated from the book.

Note that the hyperlinks in this post are linked to the latest version of the book as of the date of writing. When a more recent version than v11 is released, some links may lead to different pages.


System Requirements

The book says that a 10 GB partition is enough to compile all packages, but a 30 GB partition is recommended for growth. In my case, disk usage exceeded 20 GB while compiling the Fedora kernel because it has so many kernel features enabled.

I will install the OS on the first disk of a two-disk virtual machine (VM) and compile the packages on the second disk. The advantage of this is that I do not risk corrupting the disk of my own (host) system if I accidentally run a command on the host instead of in the chroot environment. Compiling code requires high disk performance. Therefore, working on an SSD is highly recommended. A VM gives IO performance close to that of its host. Another advantage of working in a VM is that it provides an isolated environment even if the physical disk doesn't have enough free space to allocate a dedicated partition.

In his videos, Viktor Engelmann downloads and compiles the packages on a USB stick. I personally don't find this sensible because even spinning disks can provide more IOPS than USB sticks, so working on a USB stick is not efficient for this project. If LFS is to be booted from USB, everything can be configured and compiled on a disk and then copied to a USB stick before the GRUB step, and the bootloader can be written to the USB stick at the end.

There are no requirements other than disk space in this project. A fast CPU will of course shorten the compile time, but you will get the same result on a slow CPU.

There is also no restriction on the OS to be installed on the first disk. Since I am comfortable with Red Hat based systems and have wanted to try CentOS 8 Stream for a long time, I'll go for it.


VM Setup and Creating Partitions

To keep this post as short as possible, I won't go into the minor details of this setup. I used both VBox and vmware virtualization platforms. In the previous post, I wrote that two more kernel features must be enabled when compiling the kernel for vmware; I will mention it again in the kernel compilation section. I created a VM with 2 GB RAM and a 20 GB disk. I chose "Minimal Install" as the base and selected "Development Tools" as additional software (left image). All settings are as in the image below. LVM is created automatically during installation; I did not configure it manually. My main concern is creating the LVM for LFS, and that one I will configure manually. For that, I have to add another 40 GB disk. A new disk cannot be added to a VM on the fly in VBox (revision: it is actually possible). In vmware, if you add a disk on the fly, it can be discovered with the echo "- - -" > /sys/class/scsi_host/host2/scan command or by rebooting the VM. The same command works in VBox. The new disk was connected to host2 in my VM.
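The rescan command above is tied to host2, which is simply where the disk landed in my VM. If you don't know the host number, a minimal sketch like the following walks over every SCSI host instead. The function name rescan_scsi_hosts and its optional base-directory argument are my own additions for illustration (the argument exists only so the loop can be tried on a scratch directory); on a real system it has to run as root.

```shell
# Ask every SCSI host adapter to rescan its bus.
# $1 (optional, hypothetical): base directory, defaults to the real sysfs path.
rescan_scsi_hosts() {
    base="${1:-/sys/class/scsi_host}"
    for host in "$base"/host*; do
        # Skip the unexpanded pattern when no host directory exists.
        [ -e "$host/scan" ] || continue
        # "- - -" is a wildcard for channel, target and LUN.
        echo "- - -" > "$host/scan"
        echo "rescanned ${host##*/}"
    done
}
```

Called without arguments (as root), it should make a freshly attached disk visible without a reboot, the same way the single-host command does.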

On CentOS, sshd is enabled by default. I found the VM's IP and connected via SSH because it is easier to copy and paste commands into a terminal than to type them on the console. If the VM has a NAT configuration, you have to configure "port forwarding" to connect to it. I mentioned this in one of the previous posts.

My LVM template is a 512 MB /boot partition at the beginning of the disk and an LVM partition on the rest. In the LVM partition, there are two 4 GB partitions for /var and /home, two 2 GB partitions for /tmp and swap, and the rest goes to the root partition. Total usage on the partitions other than root does not exceed 60 MB. As a result, 12.5 GB of space remains untouched. As I mentioned above, root partition usage can reach 16 GB. Therefore, I added a relatively large disk of 40 GB. When the LFS compilation finished, the net size of the vmdk disk file was 28.8 GB.

I quickly partitioned the disk with following command:

echo -ne "n\np\n1\n\n+512M\nn\np\n2\n\n\nt\n2\n8E\np\nw\n" | sudo fdisk /dev/sdb

The result can be checked from the output:

Device     Boot   Start      End  Sectors  Size Id Type
/dev/sdb1          2048  1050623  1048576  512M 83 Linux
/dev/sdb2       1050624 83886079 82835456 39.5G 8e Linux LVM
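The fdisk one-liner above packs fifteen answers into a single string, which is hard to read back later. As a sketch, the same answers can be produced one per line, grouped by what they do. The function name fdisk_answers is just a name I picked; the answers themselves are exactly the ones from the one-liner.

```shell
# Print the answers fdisk expects, one per line, in the order of the one-liner:
#   n p 1 '' +512M  -> new primary partition 1, default start, 512 MB long
#   n p 2 '' ''     -> new primary partition 2, default start, default end (rest of disk)
#   t 2 8E          -> change the type of partition 2 to 8E (Linux LVM)
#   p               -> print the resulting table
#   w               -> write it to disk and quit
fdisk_answers() {
    printf '%s\n' \
        n p 1 '' +512M \
        n p 2 '' '' \
        t 2 8E \
        p \
        w
}
```

It would be used as fdisk_answers | sudo fdisk /dev/sdb, which feeds fdisk the same input as the echo -ne version.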

Then I created partitions in LVM:

sudo pvcreate /dev/sdb2
sudo vgcreate vg_lfs /dev/sdb2
sudo lvcreate -n lv_var  -L 4G vg_lfs
sudo lvcreate -n lv_home -L 4G vg_lfs
sudo lvcreate -n lv_swap -L 2G vg_lfs
sudo lvcreate -n lv_tmp  -L 2G vg_lfs
sudo lvcreate -n lv_root -l100%FREE vg_lfs
sudo lvscan
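If the layout changes often, the lvcreate lines above can be generated from a small "name:size" list instead of being typed one by one. This is only a sketch: the LAYOUT variable and the lv_plan function are names I made up, and the function prints the commands rather than running them, so they can be reviewed first.

```shell
# Print (not run) the lvcreate commands for a "name:size" layout.
# VG and LAYOUT mirror the values used in this post.
VG=vg_lfs
LAYOUT="var:4G home:4G swap:2G tmp:2G"

lv_plan() {
    for entry in $LAYOUT; do
        name=${entry%%:*}   # text before the colon
        size=${entry##*:}   # text after the colon
        echo "lvcreate -n lv_$name -L $size $VG"
    done
    # The root volume takes whatever is left in the volume group.
    echo "lvcreate -n lv_root -l 100%FREE $VG"
}
```

Once the printed list looks right, it can be piped to a root shell, e.g. lv_plan | sudo sh.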

In the output of the last command, I should see both the new and the existing partitions. Partitions need to be formatted after they are created:

sudo mkfs.ext4 /dev/sdb1
sudo mkfs.ext4 /dev/vg_lfs/lv_root
sudo mkfs.ext4 /dev/vg_lfs/lv_tmp
sudo mkfs.ext4 /dev/vg_lfs/lv_home
sudo mkfs.ext4 /dev/vg_lfs/lv_var
sudo mkswap    /dev/vg_lfs/lv_swap

The book of LFS assumes ext4 FS is used during the installation (section 2.5).


Let's Create LFS Work Environment

When I saved and ran the script in section 2.2, only python3 and makeinfo were not found. I installed python3 with sudo dnf install python3 command. I will come to the latter package later.

I created an LFS env. variable with the export LFS=/mnt/lfs command and added this to .bash_profile as well (section 2.6). I also created this directory. Now I need to mount the partitions (section 2.7), but I wrote a script so that I do not have to mount all four partitions manually:

#!/bin/bash

if [[ x$LFS == "x" ]]; then
    echo '$LFS' variable is empty.
    exit 1
fi

STEP=1
for PARTITION in "/" "var" "home" "tmp"; do
    if [[ $PARTITION == "/" ]]; then
        LVMNAME="root";
    else
        LVMNAME=$PARTITION;
    fi

    echo "[ $STEP / 5 ] Mounting $LVMNAME partition"
    if [ ! -d "$LFS/$PARTITION" ]; then
        sudo mkdir -pv "$LFS/$PARTITION";
        sudo chown $USER:$GROUPS "$LFS/$PARTITION";
    fi

    sudo mount "/dev/vg_lfs/lv_$LVMNAME" $LFS/$PARTITION
    sudo chown $USER:$GROUPS "$LFS/$PARTITION";

    STEP=$((STEP+1))
done

echo "[ 5 / 5 ] Activating swap.."
sudo swapon /dev/vg_lfs/lv_swap  2> /dev/null

If this script is called by just a single user, $USER and $GROUPS variables can be substituted with the username and group name. And swap doesn't necessarily need to be activated but I did it nevertheless.
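I did not write a counterpart for unmounting, but it would have to undo the steps in the opposite order, because /var, /home and /tmp are mounted on top of the root volume. A minimal sketch, assuming the same volume names as above (the function name lfs_umount_plan is hypothetical, and it only prints the commands):

```shell
# Print the commands that undo the mount script, sub-mounts first.
# $1: the LFS mount point, e.g. /mnt/lfs
lfs_umount_plan() {
    lfs="${1:?usage: lfs_umount_plan <lfs-mount-point>}"
    echo "swapoff /dev/vg_lfs/lv_swap"
    # Sub-mounts must go before the root volume they are mounted on.
    for part in tmp home var; do
        echo "umount $lfs/$part"
    done
    echo "umount $lfs"
}
```

For example, lfs_umount_plan "$LFS" | sudo sh would release everything before the disk is detached from the VM.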


LFS Packages

By "package", I do not mean .rpm or .deb files. These are source code packages. Before downloading them, I need some additional software, i.e. wget, vim-enhanced and makeinfo, which I skipped earlier. makeinfo comes in the texinfo package, but its repository "powertools" is disabled. So, I installed them with the following command:

sudo dnf install --enablerepo="powertools" texinfo wget vim-enhanced

Then I created "sources" directory (section 3.1), downloaded the files in this directory using wget list and checked their hashes. If there are some problems with download, you can give --no-check-certificate parameter to wget.
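The hash check boils down to running md5sum -c against the checksum file inside the sources directory. As a small sketch (the function name verify_sources and the default file name md5sums are assumptions on my side; use whatever name the checksum list was saved under):

```shell
# Verify every file listed in a checksum file.
# Returns non-zero if any file is missing or does not match.
# $1: directory holding the tarballs, $2: checksum file name inside it.
verify_sources() {
    dir="$1"
    sums="${2:-md5sums}"
    # md5sum -c resolves file names relative to the current directory,
    # so the check runs in a subshell that has changed into $dir.
    ( cd "$dir" && md5sum -c "$sums" )
}
```

verify_sources "$LFS/sources" prints one OK or FAILED line per tarball, so a broken download shows up before it turns into a confusing build error.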

I switched to root to run the commands in section 4.2, but first I exported LFS variable again, for root (because I had first exported it for normal user). Then I ran commands. I don't need to create an lfs user (4.3) since I am in a VM. I changed the ownership of the directories to my normal user. Then I saved the given .bashrc to my home directory (not root) with the name "lfs_env.sh" and loaded it with source command. If I reboot the VM, I will run this again.

I followed the fifth and sixth chapters exactly as they are. I ran the commands in section 7.2, up to 7.3.2. Since next commands (starting from section 7.3.3) will run on each entry to the chroot env. and the mounted resources must be unmounted in reverse order on exit, I created a script from the commands:

#!/bin/bash

if [[ x$LFS == "x" ]]; then
    echo '$LFS' variable is empty.
    exit 1
fi

mount -v --bind /dev $LFS/dev
mount -v --bind /dev/pts $LFS/dev/pts
mount -vt proc proc $LFS/proc
mount -vt sysfs sysfs $LFS/sys
mount -vt tmpfs tmpfs $LFS/run

if [ -h $LFS/dev/shm ]; then
  mkdir -pv $LFS/$(readlink $LFS/dev/shm)
fi

chroot "$LFS" /usr/bin/env -i HOME=/root  TERM="$TERM"  PS1='(lfs chroot) \u:\w\$ ' PATH=/bin:/usr/bin:/sbin:/usr/sbin /bin/bash --login +h

umount -v $LFS/run
umount -v $LFS/sys
umount -v $LFS/proc
umount -v $LFS/dev/pts
umount -v $LFS/dev

If /dev/pts is not unmounted while exiting chroot, a new terminal cannot be opened in VM and VM needs to be restarted.
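One way to make sure the unmounts always happen, even if chroot fails or the script is interrupted, is to register them with a trap. The following is only a sketch of that idea, not the script I actually used: enter_lfs and the RUN variable are hypothetical names, the /dev/shm handling and PS1 are left out for brevity, and with the default RUN=echo the commands are only printed.

```shell
# Sketch: register the unmounts with a trap so they run however the subshell ends.
# RUN=echo (the default here) only prints the commands; set RUN= to execute them.
RUN="${RUN-echo}"

enter_lfs() {
    lfs="$1"
    (
        # Runs when this subshell exits, whatever the reason.
        trap '
            for m in run sys proc dev/pts dev; do
                $RUN umount -v "$lfs/$m"
            done
        ' EXIT
        $RUN mount -v --bind /dev "$lfs/dev"
        $RUN mount -v --bind /dev/pts "$lfs/dev/pts"
        $RUN mount -vt proc proc "$lfs/proc"
        $RUN mount -vt sysfs sysfs "$lfs/sys"
        $RUN mount -vt tmpfs tmpfs "$lfs/run"
        $RUN chroot "$lfs" /usr/bin/env -i HOME=/root TERM="${TERM:-linux}" \
            PATH=/bin:/usr/bin:/sbin:/usr/sbin /bin/bash --login
        echo "chroot exited with status $?"
    )
}
```

The unmount order in the trap is the reverse of the mount order, same as in the script above.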

I entered chroot and continued to create necessary files and directories. Btw, /etc/passwd and /etc/group files in section 7.6 are the first point where systemd and sysVinit differ.

Since the script does unmount the resources when exiting chroot, the unmount commands in section 7.14 are not needed anymore. And I also can create a VM snapshot for backup, so the rest is also not needed.

In section 8.25, while compiling shadow with cracklib support, I got "undefined reference to `FascistCheck'" error. I reconfigured with following command and then the compilation succeeded:

LDFLAGS=-lcrack ./configure --sysconfdir=/etc --with-libcrack --with-group-name-max-length=32

The "make -k check" step in section 8.26 takes so long that I started the tests before going to bed and they were still incomplete when I woke up. From my understanding, there are more than 350K tests and some of them are stress tests. This section also states that some tests are known to fail. Test results are available here. My results were almost the same as these. There is a very simple sanity check at the end of the section. IMHO, doing just this check would be enough, but the book considers the "make check" step critical and not to be skipped.

In section 8.69, the systemd and sysVinit packages start to differ significantly.

At the end of chapter 8, I removed the +h parameter that I gave to bash, in lfs_enter_chroot.sh script and saved it.

The ninth chapter handles initd/systemd settings. This means these two chapters are completely different in each version. In this chapter, I followed the book exactly. I entered KEYMAP=trq in /etc/sysconfig/console for initd or in /etc/vconsole.conf for systemd for setting Turkish keyboard layout. If there is no KEYMAP set, English layout is loaded by default. I skipped section 9.10.3 of systemd book because I have a separate partition for /tmp.

Section 10.2 is very important because fstab is created there. Here, I have to add all the partitions I created in the beginning to fstab. The /boot partition is currently /dev/sdb1 because it is still on the second disk. But when I detach the first disk from the VM to boot from LFS, this partition will become /dev/sda1. Hence I cannot use this device name. Under Linux, each file system has a unique UUID, which stays the same until the file system is re-created. I have to use this, so that /boot can always be mounted regardless of whether it is on the first or the second disk. The value linked to /dev/sdb1 in the ls -la /dev/disk/by-uuid output is the UUID I need. Or using,

lsblk -o NAME,MAJ:MIN,RM,SIZE,RO,TYPE,MOUNTPOINT,UUID

command, I list the disks and their UUIDs. lsblk output is more verbose but UUID column is empty in chroot environment. Therefore, I ran the command outside of chroot (in VM), noted the value down and created fstab with this value:

/dev/mapper/vg_lfs-lv_root   /       ext4   defaults  1  1
/dev/mapper/vg_lfs-lv_var    /var    ext4   defaults  0  0
/dev/mapper/vg_lfs-lv_home   /home   ext4   defaults  0  0
/dev/mapper/vg_lfs-lv_tmp    /tmp    ext4   defaults  0  0
/dev/mapper/vg_lfs-lv_swap   swap    swap   pri=1     0  0
UUID=01234567-89ab-cdef-0123-456789abcdef /boot ext4 defaults 0 0

For the initd version, the proc, sysfs, devpts etc. entries must be added as well. I have not included them here, as they are already in the book.
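Instead of noting the UUID down by hand, it can also be read with blkid on the host (blkid -s UUID -o value /dev/sdb1 prints only the value) and turned into the fstab line directly. A minimal sketch; boot_fstab_line is a name I made up, and the mount options are the ones from my fstab above:

```shell
# Build the /boot line of fstab from a file system UUID.
# On the host, the UUID can be read with: blkid -s UUID -o value /dev/sdb1
boot_fstab_line() {
    uuid="$1"
    printf 'UUID=%s /boot ext4 defaults 0 0\n' "$uuid"
}
```

For example, boot_fstab_line "$(sudo blkid -s UUID -o value /dev/sdb1)" run outside of the chroot prints a line that can be appended to $LFS/etc/fstab.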


Compiling the Linux Kernel

After compiling all packages, it comes to compiling the kernel. Section 10.3 of both versions is about compiling the kernel. I created a default configuration by running make mrproper and make defconfig. I have explained these commands in detail in my previous article. I will use make menuconfig to select other features.

make menuconfig TUI

There is not much to change in defconfig for the initd version of the book. It is enough to have uevent helper turned off and devtmpfs support on. For the systemd version of the book, more features need to be enabled (next image).

systemd features

It is mentioned in Systemd's Errata, that CONFIG_SECCOMP is not under "Processor type and features" submenu, but that's OK, because this feature comes on in defconfig. It can still be searched in menuconfig or found in .config:

(lfs chroot) root:/sources/linux-5.13.12# grep -n SECCOMP  .config
687:CONFIG_HAVE_ARCH_SECCOMP=y
688:CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
689:CONFIG_SECCOMP=y
690:CONFIG_SECCOMP_FILTER=y
691:# CONFIG_SECCOMP_CACHE_DEBUG is not set
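Checking options one grep at a time gets tedious when the list is long. As a sketch, a small helper can report every option from a list that is not built in. The function name missing_options is hypothetical, and it deliberately treats =m as missing, because for the options discussed here I want them in the kernel itself.

```shell
# Report which of the given options are not built in (=y) in a kernel .config.
# $1: path to .config, remaining arguments: option names without the CONFIG_ prefix.
missing_options() {
    config="$1"
    shift
    for opt in "$@"; do
        grep -q "^CONFIG_${opt}=y\$" "$config" || echo "CONFIG_${opt}"
    done
}
```

For example, missing_options .config SECCOMP DEVTMPFS prints nothing when both are already set to y, and prints the name of each option that is not.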

Btw, this is just my personal opinion, but I prefer to keep such errors to myself, as I think, some people in LFS support channels are not friendly at all, based on my own experience in IRC and their mailing list archive.

Back to the topic. My goal is installing LFS with LVM. LFS contains only the bare minimum; for example, it doesn't have a window manager. LVM is also not counted as basic, and it is a part of another project called (B)eyond LFS, or BLFS in short. In the LVM section of the BLFS book, there is another list of features that need to be enabled for LVM support. I will enable them and build the modular ones into the kernel (just a personal preference). In the meantime, I looked at the Gentoo documentation for LVM and enabled a few more features that are not mentioned in BLFS but are recommended by Gentoo. Rather than recompiling the kernel later because of missing features, I prefer to boot with a kernel that is a few KBs larger.

Finally, the readers working with vmware have to activate "Fusion MPT device support" as well, which is the main topic of the previous article. This is basically the driver of SCSI controller in vmware and without it, the kernel cannot find the hard disk and cannot boot. This feature is under "Device Drivers" submenu.

After completing all these steps, I went back to LFS section 10.3.1, compiled the kernel with make and then installed the modules. Since my /boot partition is on /dev/sdb1, I mounted it with the mount /dev/sdb1 /boot command and copied the vmlinuz (kernel), System.map and config files there.
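Put together, the build and install steps look roughly like the list printed by the sketch below. The vmlinuz name matches the one I use in grub.cfg later; the System.map-<version> and config-<version> names are an assumption that follows the book's naming pattern, and kernel_install_plan is just a name I made up. It only prints the commands.

```shell
# Print the build and install steps for a given kernel version and LFS suffix.
kernel_install_plan() {
    ver="$1"      # e.g. 5.13.12
    suffix="$2"   # e.g. lfs-11.0-systemd
    echo "make"
    echo "make modules_install"
    echo "mount /dev/sdb1 /boot"
    echo "cp -iv arch/x86/boot/bzImage /boot/vmlinuz-$ver-$suffix"
    echo "cp -iv System.map /boot/System.map-$ver"
    echo "cp -iv .config /boot/config-$ver"
}
```

kernel_install_plan 5.13.12 lfs-11.0-systemd prints the six commands in the order they have to be run inside the chroot, from the kernel source directory.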


GRUB Bootloader

After compiling the kernel and copying it to /boot, now it's time to set up GRUB. At the moment, I have two options: I can install GRUB on /dev/sdb (actually, this was my original plan) or I can add LFS to CentOS's existing GRUB on /dev/sda. The advantage of the first option is that the disk is configured to run independently of CentOS. The advantage of the latter is that it is easy to set up.

First, I will configure the first option: I ran the grub-install /dev/sdb command. My grub.cfg file is slightly different from the one in the book:

set default=0
set timeout=5

insmod ext2
set root=(hd0,1)

menuentry "GNU/Linux, Linux 5.13.12-lfs-11.0-systemd" {
    linux   /vmlinuz-5.13.12-lfs-11.0-systemd root=/dev/vg_lfs/lv_root ro
}

With "set root" keyword, GRUB's root device is set to first partition of zeroth disk. From GRUB's point of view, LFS disk is not zeroth yet, but it will be when CentOS disk is detached. There is no need to specify paths prefixed with "/boot" because /boot partition is separated. The root argument, given to the kernel is the device path of the root partition. Btw, the configuration above is for systemd. For initd, the "menuentry" and "linux" lines will not have "-systemd", that's all.

I exited the chroot env. after saving this. Then I shut the VM down and removed its first disk. When I powered the VM up again, I saw the GRUB menu I had just created and got a kernel panic while trying to boot this entry: Yaay! OK, it's not something to be happy about, but it indicates two things: (1) GRUB is set up correctly, (2) the kernel is properly compiled and copied to /boot.

So, why did I get a kernel panic, then? As seen in call trace, in mount_block_root function, kernel could not find the disk (specified with root= in GRUB) to mount to root directory. Why? Because LVM has not been activated yet. Unfortunately, there is nothing to do here, so I added virtual disk back and returned to CentOS.

Do I have to remove the disk to boot to LFS and add it back to boot to CentOS when any problem occurs? Hell, no! I appended following lines to /etc/grub.d/40_custom in CentOS:

menuentry "GNU/Linux, Linux 5.13.12-lfs-11.0-systemd" {
  set root=(hd1,1)
  linux   /vmlinuz-5.13.12-lfs-11.0-systemd root=/dev/mapper/vg_lfs-lv_root ro
}

This is my second GRUB configuration option which I mentioned above. It is essentially the same configuration except "set root" keyword is in menuentry block. This snippet is for systemd again and there will be no "-systemd" part for initd. Then I transferred the line I added into CentOS' grub.cfg:

GRUB_DISABLE_OS_PROBER=true  grub2-mkconfig -o /boot/grub2/grub.cfg

OS prober is a very nice feature for finding other installed OSes and adding them to GRUB automatically, but it doesn't work well due to a bug. Now it is easier to switch between OSes, so I can continue from where I left off.


initramfs

I opened the LVM section (or the systemd LVM section) in the BLFS book. The second to last paragraph of About LVM says that an initramfs is required to use LVM on the root file system. An initramfs (short for 'Initial RAM Filesystem') is a compressed archive containing some basic programs and configs. If this file exists, the bootloader loads it into memory, the kernel unpacks it into a temporary root file system, and the programs and configs in it do the operations necessary for the system to continue booting. The "rescue kernel" that comes with many distros is actually a simple initramfs containing a shell.
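To make "the necessary operations" a bit more concrete, here is a simplified sketch of what the init program inside an LVM-aware initramfs has to do. It is not the BLFS init.in file; the function names and the /.root mount point are my own assumptions, and init_plan only prints the steps.

```shell
# Extract the value of root= from a kernel command line string.
root_from_cmdline() {
    for param in $1; do
        case "$param" in
            root=*) echo "${param#root=}" ;;
        esac
    done
}

# Print the order of operations an LVM-aware init has to follow.
# $1: kernel command line (on a running system: the content of /proc/cmdline).
init_plan() {
    root=$(root_from_cmdline "$1")
    echo "mount -t proc proc /proc"
    echo "mount -t sysfs sysfs /sys"
    echo "mount -t devtmpfs devtmpfs /dev"
    # Activate the volume groups, so that the root volume's device node appears.
    echo "lvm vgchange -a y"
    echo "mount -o ro $root /.root"
    # Hand over to the real init on the real root file system.
    echo "exec switch_root /.root /sbin/init"
}
```

The important line is vgchange: without it, the device given in root= simply does not exist when the kernel tries to mount it.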

BLFS has its own initramfs creation script. To add LVM support to the initramfs, LVM must be installed first. For this, the following must be installed in this order:

1) libaio, which is a prerequisite of LVM
2) which; it is actually not needed for LVM but it's a very useful utility for troubleshooting
3) mdadm. Its tests can be skipped; the command in the book doesn't even run them.*
4) cpio, which is used to pack the initramfs
5) LVM. Its tests also take a long time and some of them are even problematic. I configured LVM with the --with-thin* and --with-cache* parameters given in the book, as well as the --with-vdo=none parameter. There is an extra command in the systemd LVM section, though it's not critical.
6) the initramfs script

* I haven't tested it, but LVM should also work without mdadm.

mkinitramfs script consists of two parts. The first part is the script itself and the second part is the file named init.in, which will be copied to the initramfs file by the script. In LFS v10.1, this script had a bug. It was searching for coreutils and util-linux components (like ls, cp, mount, umount etc), which are essential for initramfs, in /bin instead of /usr/bin and in /lib instead of /usr/lib. The script was ending with an error. As a workaround, I had linked missing files to where they should be in /usr/bin and /usr/lib. This bug is fixed in v11.

Since I gave kernel version as a parameter to the script, it added kernel modules (.ko files) to initramfs and created it.

(lfs chroot) root:~# mkinitramfs 5.13.12
Creating initrd.img-5.13.12... done.

I copied the file to /boot (the partition must be mounted first) and changed /boot/grub/grub.cfg as follows before exiting the chroot:

menuentry "GNU/Linux, Linux 5.13.12-lfs-11.0-systemd" {
  linux /vmlinuz-5.13.12-lfs-11.0-systemd root=/dev/vg_lfs/lv_root ro
  initrd  /initrd.img-5.13.12
}

The configuration I made above will only work when the CentOS disk is removed. I am actually using CentOS' GRUB. So, after exiting the chroot, I added the same initrd line to /etc/grub.d/40_custom and reran this command:

GRUB_DISABLE_OS_PROBER=true  grub2-mkconfig -o /boot/grub2/grub.cfg
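Since the 40_custom entry now exists in two variants (with and without the initrd line), it can be handy to generate it. A minimal sketch, assuming the entry is booted from CentOS' GRUB, where the LFS disk is (hd1,1); grub_entry is a hypothetical helper name.

```shell
# Print a GRUB menuentry; the initrd line is added only when $4 is given.
# $1: title, $2: kernel path, $3: root device, $4 (optional): initrd path.
grub_entry() {
    title="$1"; kernel="$2"; rootdev="$3"; initrd="${4:-}"
    echo "menuentry \"$title\" {"
    echo "  set root=(hd1,1)"   # the LFS /boot partition as seen from CentOS' GRUB
    echo "  linux   $kernel root=$rootdev ro"
    if [ -n "$initrd" ]; then
        echo "  initrd  $initrd"
    fi
    echo "}"
}
```

The output can be appended to /etc/grub.d/40_custom, followed by the grub2-mkconfig command above.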

I restarted the VM, chose LFS and voilà:

and same result for systemd version:

Although some services are still failing on the systemd machine, both machines boot without any problem in general.

Saturday, December 4, 2021

Compiling the Linux Kernel for Vmware


Hi there. In this post, I will give general info about compiling the Linux kernel and describe the solution to a problem I encountered. This post was actually going to be a part of the next post, but this problem cost me so much time that it's worth writing about in a separate blog post. The title may give a hint about the next post.


Compiling Kernel in Linux

Compiling the kernel is not something that needs to be done on a daily basis, unless you are running Linux on a very special device, e.g. an embedded system or brand new hardware that the standard kernel does not support, or you are using an experimental distro such as LFS or a distro such as Gentoo, where the kernel does not come pre-compiled. On the other hand, being able to compile the kernel is a must for learning and mastering the system you are running at the lowest level. And it is easier than compiling other source code, once you have met a few prerequisites and know the basics.

In this post, I will compile kernel v5.13.12. What I need first is the source code, of course. Btw, the kernel version only matters for the .config file, which I will mention in the next paragraphs. The rest is applicable to (almost) all kernel versions. Any kernel is available at www.kernel.org/pub under the linux/kernel subdirectory. I have linked the .xz file above due to its smaller size, but a .gz version of the same data is also available.

I extracted this file. Some programs necessary for compilation (such as make, gcc and gcc-c++) were already installed on my machine. The only missing package was ncurses-devel, and it is only required for make menuconfig, not for the compilation itself; however, menuconfig makes choosing kernel properties quite easy. The prerequisites for the kernel are not limited to these. A detailed list for v5.13 is available in the kernel.org documentation. But it is not mandatory to have all the packages in that list installed. For example, if you are not going to compile anything related to PPP, pppd is not needed, and OpenSSL is only required if kernel modules are to be signed.

Here is some preliminary info before starting with the compilation: make clean deletes all compiled files and makes the code ready for recompilation (from scratch). make mrproper, which is specific to the Linux kernel makefile, deletes the .config file and all other generated files [1]. So, before I started, I cleaned the directory with this command and set the default kernel properties with make defconfig. The properties I am mentioning are kept in the .config file, so defconfig actually creates a default .config. They can be viewed or edited with a text editor or with the make menuconfig command over an ncurses TUI. Since there are 4845 lines in the .config file, it is not practical to edit all properties manually.

The .config file is different for each version, because each minor version adds or removes some kernel features. Thus, I cannot compile directly with the .config file of another version. But I can adapt an older .config file to a newer kernel with the make oldconfig command. This command only asks the user about the newly added properties that are not in the old .config. Having said this, it doesn't make much sense to try to adapt a v2.6 config file to v5.13.

make menuconfig TUI

I ran the make menuconfig command and the menu above appeared. Lines ending with "--->" indicate a submenu. Each line in this menu (unless it has a submenu) corresponds to a line in .config. In documents, features that must be enabled are given with [*] and features that must be disabled are given with [ ]. And [M] means that the feature is modular, which I explain later. The following image shows the kernel features recommended for configuring LFS with systemd:

Kernel features for LFS systemd


For example, features such as "Auditing Support", "Control Group support" and "Configure standard kernel ..." are under "General setup"; "open by fhandle syscalls" is under "Configure standard kernel ...". The names in square brackets at the end of each line are the item names in the .config file. Btw, I couldn't find the "Enable seccomp ..." feature, which should be under "Processor type and features" (the source is wrong here). According to Kernelconfig.io, this feature is under "General architecture-dependent options" in v5.13. In older versions such as v5.9.9, it is under "Processor type and features". A specific item can be searched by name among all items by pressing '/' in menuconfig. I can also change this feature in the .config file (below). Usually (but not always), the first letters of the menu entries in the visible area are their keyboard shortcuts. If I enter the "Processor type..." submenu, I can navigate through the features that start with "E" by pressing the "E" key on the keyboard.

$ grep -n SECCOMP .config
687:CONFIG_HAVE_ARCH_SECCOMP=y
688:CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
689:CONFIG_SECCOMP=y
690:CONFIG_SECCOMP_FILTER=y
691:# CONFIG_SECCOMP_CACHE_DEBUG is not set

CONFIG_SECCOMP is already enabled on line 689. By substituting 'y' with an 'n', I can disable it if it's not needed. All these features go in the .config file. For [M], here is an example:

Kernel features for using LVM in BLFS


M denotes that the feature will be compiled as a kernel module. Not all kernel functions have to be in compiled kernel, named vmlinuz-... under /boot directory. To keep kernel as small as possible, some functions can be kept as modules under /usr/lib/modules and loaded only if needed. In this way, LVM modules on a machine without an LVM disk don't need to be loaded into memory or sound card modules on a VM without a sound card.
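Switching an option between y, m and n by hand, as described above for CONFIG_SECCOMP, can also be scripted. The following is only a sketch: set_option is a name I made up, and it does not resolve dependencies between options, so I would still run make olddefconfig (or open menuconfig once) afterwards to let the kernel build system fill in anything that depends on the change.

```shell
# Set CONFIG_<name> to y, m or n in a .config file, replacing any earlier setting.
# $1: path to .config, $2: option name without CONFIG_, $3: y, m or n.
set_option() {
    config="$1"; name="$2"; value="$3"
    tmp="$config.tmp"
    # Drop both the "=..." form and the "# ... is not set" form of the option.
    grep -v -e "^CONFIG_${name}=" -e "^# CONFIG_${name} is not set\$" "$config" > "$tmp" || true
    if [ "$value" = n ]; then
        echo "# CONFIG_${name} is not set" >> "$tmp"
    else
        echo "CONFIG_${name}=${value}" >> "$tmp"
    fi
    mv "$tmp" "$config"
}
```

For example, set_option .config FUSION y builds the feature into the kernel, while set_option .config FUSION m would turn it into a module.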

Module files have the .ko (kernel object) extension. They are loaded with insmod or modprobe and listed with lsmod. Modules allow new functionality to be added to the Linux kernel without the need to recompile it [2]. Module configurations are kept in .conf files under /etc/modprobe.d/. With these files, arguments can be passed to the modules or modules can be disabled completely. Kernel modules are a much broader topic than I can cover here.
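As an illustration of such a .conf file, the sketch below writes one with the two most common directives. The function name, the file name example.conf, the module examplemod and its debug parameter are all made-up placeholders; pcspkr is only used as an example of a module people often do not want loaded automatically.

```shell
# Write a small modprobe configuration file into the given directory.
# "examplemod" and its "debug" parameter are made-up placeholders.
write_modprobe_conf() {
    dir="${1:-/etc/modprobe.d}"
    cat > "$dir/example.conf" <<'EOF'
# Ignore this module's aliases, so that it is not loaded automatically.
blacklist pcspkr
# Pass an argument to a module whenever it is loaded.
options examplemod debug=1
EOF
}
```

Writing into /etc/modprobe.d requires root; for a dry run, any scratch directory can be passed as the first argument.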

After setting the required options with menuconfig or manually, I've run make to compile the kernel. Of course, it does not get compiled instantly. With default options, it takes about 5 minutes on my computer. Since some features are compiled as modules by default, these modules need to be copied to the corresponding module directory with make modules_install after make.

After compilation, the file arch/x86/boot/bzImage (kernel) can be copied under /boot with name vmlinuz-... and loaded with grub. Every distribution copies its .config file, used for compilation, to /boot with name config-$(uname -r). This means that, if I want to recompile the same kernel I am using, I can copy that config file in /boot to the directory, where I compile the kernel, with name ".config" and run make again. Let's assume, I want to recompile Fedora kernel v5.13.12. I search kernel package from the build system koji and download kernel-core package for x86_64 architecture. Then I extract it using rpm2cpio kernel-core-5.13.12-200.fc34.x86_64.rpm | cpio -idmv command and copy lib/modules/5.13.12-200.fc34.x86_64/config file to linux-5.13.12 directory with name ".config" and compile it. I eventually get the same kernel as Fedora (additional reading: How to install a kernel from koji). However, because Fedora kernel is built to run on every machine, having that many features enabled (e.g. Fibre Channel even though it is modular), increases compile time drastically. It took 5-6 hours on my computer.
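The path of the config file inside the extracted package follows directly from the RPM file name, so it does not have to be typed by hand. A small sketch (config_path_from_rpm is a hypothetical helper; it assumes the package is named kernel-core-<version>.rpm, as in the example above):

```shell
# Derive the path of the config file inside an extracted kernel-core RPM.
config_path_from_rpm() {
    rpm="${1##*/}"               # strip any leading directory
    kver="${rpm#kernel-core-}"   # drop the package name prefix
    kver="${kver%.rpm}"          # drop the extension
    echo "lib/modules/$kver/config"
}
```

After rpm2cpio ... | cpio -idmv, something like cp "$(config_path_from_rpm kernel-core-5.13.12-200.fc34.x86_64.rpm)" linux-5.13.12/.config puts the Fedora configuration in place.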

After this much preliminary information, I can discuss the problem I had.


Compiling the Linux Kernel for Vmware

First, I installed two VMs, one in VBox and one in vmware. They are the same except for minor differences. As I explained above, I added some features to the defconfig, compiled the kernel, then copied it to /boot, and created an initramfs and a GRUB config to load them. So far so good.

When I booted the VMs with the new kernel, the one in VBox booted fine but the one in vmware failed to boot and dropped to the initramfs rescue shell. As can be seen in the next screenshot, there are no /dev/sd* disks and /dev/mapper is empty. This is strange. Both VMs have two disks and the first disk contains CentOS. I booted both of them to CentOS because the initramfs does not contain any decent tools like lsblk or lspci to diagnose the problem. Btw, Ubuntu Live could also help.

As it can be seen from the output below, I checked the devices and their kernel modules with lspci -k command:

In vmware, besides the common modules like ata_piix, vmw_vmci and pcieport, there is another module called mptspi for the SCSI controller. In the VBox output (bottom part of the above image), there are only standard drivers: ahci is used for the SATA controller and there is no mptspi. I did some research on this module on the internet. The mptbase, mptscsih and mptspi drivers are referred to as "Fusion MPT ScsiHost drivers for SPI" [3], which is under the "Device Drivers/Fusion MPT device support" menu, whose item name is CONFIG_FUSION according to kernelconfig.io. In the default .config, AHCI is turned on but Fusion MPT is off. That's why VBox is working fine but vmware isn't. And since this option is enabled in the CentOS kernel, CentOS has no problem on either platform.

$ grep AHCI .config
CONFIG_SATA_AHCI=y
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_ACARD_AHCI is not set

$ grep FUSION .config
# CONFIG_FUSION is not set

Problem found. I have run make menuconfig again and chosen this driver. Since the VM will be running under vmware, I've included it to the kernel (not a module) and recompiled. After copying the new kernel (arch/x86/boot/bzImage) to /boot and rebooting the VM, it booted up properly.
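To avoid this kind of surprise next time, the drivers in use can be listed from the working OS (CentOS in my case) and compared against the new .config before rebooting. A minimal sketch that extracts the driver names from lspci -k output; drivers_in_use is a hypothetical helper name:

```shell
# Read "lspci -k" output on stdin and print each driver in use once.
drivers_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p' | sort -u
}
```

Running lspci -k | drivers_in_use on the vmware VM lists mptspi among the others, which is the hint that the matching Fusion MPT options have to be enabled in the new kernel.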




References:
[1]: Why both make clean and make mrproper are used?
[2]: https://en.wikipedia.org/wiki/Loadable_kernel_module
[3]: https://cateee.net/lkddb/web-lkddb/FUSION_SPI.html
 - : https://www.youtube.com/watch?v=WiZ05pnHZqM

Tuesday, November 2, 2021

What is the Disk System and What isn't? #4.5: FAT32


Hi there. I continue the file systems article series (after a short break) with FAT32. I briefly mentioned it in the previous article, but did not discuss it thoroughly. In this article, I will install FreeDOS to a FAT32 disk, examine its structure and compare it with MS-DOS.

As I mentioned before, FAT32 was developed in the 90's, when disk sizes grew beyond the 2 GB limit of FAT16, and it was introduced to end users in August 1996 with Win95 OSR2 (bundled with MS-DOS 7.1). Therefore, no version of MS-DOS that was sold separately (not bundled with Windows) supports FAT32. Microsoft also announced that OS installations to FAT32 disks would not be supported after WinXP. FAT32 is quite similar to FAT16 in terms of its structure. Thanks to this, the file system routines in the DOS kernel did not have to be rewritten and FAT32 support was added to the DOS kernel with only about 5 KB of code [1].

Starting from the very basics, i.e. the MBR, the only difference with FAT32 is the partition type field in the partition table. FAT32 partitions addressed with CHS have 0x0B and those with LBA support have 0x0C in this field. I mentioned this in my second blog post.

The first 36 bytes of the boot sector are the same for both FAT16 and FAT32, but FAT32 has some extra fields. This was explained in my third blog post. The capabilities of FAT32 have been enhanced with the newly added mirroring flags, the root cluster entry and the FSINFO sector. The boot sector code is also slightly different from that of FAT16.

First, I installed a FreeDOS VM, so that what I explain would not remain pure theory. I will not go into the details of this installation, as I explained it in my previous article. Ideally, the VM disk should have a single partition larger than 2 GB, and fdisk's FAT32 support must be turned on before partitioning. I checked the partition table of the VM with the following command:

hexdump -C FreeDOS.vdi | less

As can be seen in the output below, the partition type is 0x0B, i.e. FAT32 CHS.

002001c0  01 00 0b fe bf 09 3f 00  00 00 4b f5 7f 00 00 00  |......?...K.....|

It is CHS, because the disk is smaller than 8 GB. When I installed FreeDOS to a disk larger than 8 GB, the partition type value was 0x0C.
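
To cross-check the reading of that line, here is a small Python sketch (an illustration only, not the tool I used). It assumes the standard MBR layout, where the first partition entry starts at offset 0x1BE, so the hexdump line at 0x1C0 begins two bytes into the entry; the names dictionary is my own shorthand.

```python
import struct

# The 16 bytes shown at file offset 0x2001c0, i.e. MBR offsets 0x1C0..0x1CF.
line = bytes.fromhex("01 00 0b fe bf 09 3f 00 00 00 4b f5 7f 00 00 00")
BASE = 0x1C0  # MBR offset of the first byte in `line`

part_type = line[0x1C2 - BASE]                        # partition type byte
lba_start, num_sectors = struct.unpack_from("<II", line, 0x1C6 - BASE)

names = {0x0B: "FAT32 (CHS)", 0x0C: "FAT32 (LBA)"}    # only the two types discussed here
print(hex(part_type), names.get(part_type, "other"))
print("first sector:", lba_start, "sector count:", num_sectors)
```

The first sector (63) and the sector count reappear in the boot sector as the hidden sectors and total sectors fields.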

The main difference between FAT16 and FAT32 is obviously in the boot sector. In the previous article, I put two tables about the FAT32 boot sector data structure (in other words, the DOS 7.1 EBPB). Below, I have combined those two tables:

Sector Offset  Size      Description
0x00           3 byte    JMP to the boot code
0x03           8 byte    OEM Name
0x0B           word      Bytes per sector
0x0D           byte      Sectors per cluster
0x0E           word      Reserved sectors
0x10           byte      Number of FATs
0x11           word      Reserved (Note1)
0x13           word      Reserved (Note2)
0x15           byte      Media descriptor byte
0x16           word      Reserved (Note3)
0x18           word      Sectors per track
0x1A           word      Number of heads
0x1C           dword     Number of hidden sectors
0x20           dword     Total number of sectors (Note4)
0x24           dword     Sectors per FAT
0x28           word      Mirroring flags (Note5)
0x2A           word      FAT version (Note6)
0x2C           dword     Root directory cluster
0x30           word      FSINFO sector
0x32           word      Backup boot sector
0x34           12 byte   Reserved (Note7)
0x40           byte      Physical drive number
0x41           byte      Reserved (Note8)
0x42           byte      Extended signature (0x28 or 0x29)
0x43           dword     Volume serial number
0x47           11 byte   Volume label
0x52           8 byte    File system type (Note9)
Compiled from Wikipedia (1, 2)

Note1: On systems prior to FAT32, this field holds max. number of root directory entries because before FAT32 the root directory was limited in size. This field is now zero because FAT32 removes this restriction.
Note2: In older FAT versions, this field holds total number of sectors. In FAT32, this field is zero (since the number will not fit here anymore) and the value at offset 0x20 is used instead.
Note3: In older FAT versions, this field holds the number of sectors per FAT, but since this value will not fit in a word with FAT32, the dword at offset 0x24 is used.
Note4: If this field is zero, OS reads the number of sectors from partition record.
Note5: Normally, FAT is always written in two copies and each file operation is written to both copies. With this flag, single table can be set to active.
Note6: This field is defined but not used. It is always zero.
Note7: In Microsoft documentation, this field is given as "boot file name". Normally, kernel file name to be loaded, appears hard coded in boot code. I guess, that this field is reserved to keep kernel file name in a fixed position in future.
Note8: This byte is always zero but Windows NT uses bits 0 and 1 as dirty bit. The details will be explained in FSINFO section.
Note9: Some OSes use this field to store the total number of sectors when it overflows a dword.

In the output of the hexdump command above, the boot sector comes right after the MBR. The MBR is of course in sector zero and the boot sector is in sector 63, but since I didn't give the -v parameter to hexdump, sectors filled with zeros are shown with just a '*' character.

00207e00  eb 58 90 46 52 44 4f 53  35 2e 31 00 02 08 20 00  |.X.FRDOS5.1... .|
00207e10  02 00 00 00 00 f8 00 00  3f 00 ff 00 3f 00 00 00  |........?...?...|
00207e20  4b f5 7f 00 ee 1f 00 00  00 00 00 00 02 00 00 00  |K...............|
00207e30  01 00 06 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00207e40  80 00 29 f7 16 04 38 46  52 45 45 44 4f 53 32 30  |..)...8FREEDOS20|
00207e50  31 36 46 41 54 33 32 20  20 20 fc fa 29 c0 8e d8  |16FAT32   ..)...|


OEM Name: FRDOS5.1
Bytes per sector: 512 byte
Sectors per cluster: 8
Reserved sectors: 32
Number of FATs: 2
Media descriptor: 0xF8 (Harddisk)
Sectors per track: 0x3F = 63
Number of heads: 0xFF = 255
Hidden sectors: 0x3F = 63
Total number of sectors: 0x7F F54B = 8 385 867
--- Below part is not compatible with FAT16 ---
Sectors per FAT: 0x1FEE = 8174
Mirroring flags: 0
FAT Version: 0
Root directory cluster: 2
FSINFO Sector: 1*
Backup boot sector: 6*
Physical drive number: 0x80
Extended signature: 0x29
Volume Serial Number: 3804-16F7
Volume Label: FREEDOS2016
File system: FAT32

* These fields will be explained later in this article.
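
The manual decoding above can be cross-checked with a short Python sketch. It is only an illustration that follows the offset table in this article; the field names in FIELDS are my own, and the bytes are copied from the hexdump above.

```python
import struct

# First 0x5A bytes of the boot sector, copied from the hexdump above.
boot = bytes.fromhex(
    "eb 58 90 46 52 44 4f 53 35 2e 31 00 02 08 20 00 "
    "02 00 00 00 00 f8 00 00 3f 00 ff 00 3f 00 00 00 "
    "4b f5 7f 00 ee 1f 00 00 00 00 00 00 02 00 00 00 "
    "01 00 06 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "80 00 29 f7 16 04 38 46 52 45 45 44 4f 53 32 30 "
    "31 36 46 41 54 33 32 20 20 20"
)

# (name, sector offset, struct format), all little-endian.
FIELDS = [
    ("oem_name",            0x03, "8s"),
    ("bytes_per_sector",    0x0B, "H"),
    ("sectors_per_cluster", 0x0D, "B"),
    ("reserved_sectors",    0x0E, "H"),
    ("number_of_fats",      0x10, "B"),
    ("media_descriptor",    0x15, "B"),
    ("sectors_per_track",   0x18, "H"),
    ("heads",               0x1A, "H"),
    ("hidden_sectors",      0x1C, "I"),
    ("total_sectors",       0x20, "I"),
    ("sectors_per_fat",     0x24, "I"),
    ("root_cluster",        0x2C, "I"),
    ("fsinfo_sector",       0x30, "H"),
    ("backup_boot_sector",  0x32, "H"),
    ("serial",              0x43, "I"),
    ("label",               0x47, "11s"),
    ("fs_type",             0x52, "8s"),
]

bpb = {name: struct.unpack_from("<" + fmt, boot, off)[0]
       for name, off, fmt in FIELDS}

for name, value in bpb.items():
    print(f"{name:20} {value}")
print("serial as shown by DOS:",
      f"{bpb['serial'] >> 16:04X}-{bpb['serial'] & 0xFFFF:04X}")
```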

Reading the boot sector data manually is obviously hard. This is how it looks in the disk editor:


Unfortunately, not all of the information fits on the screen. Therefore, I used a trick and changed the character size from 8x16 to 8x14 pixels before starting the disk editor, with the following code:

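mov ax,1111   ; values are hexadecimal: AH=11h, AL=11h loads the ROM 8x14 font
mov bl,0      ; font block 0
int 10        ; video BIOS interrupt (10h)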

Source: stackoverflow. In the answer, the author wrote that the above code snippet turns on 25-line mode, whereas it should be 28 lines, I guess. Maybe this is just an incompatibility in the VBox BIOS code. The same answer mentions that AX=1112h would enable 43-line mode. With that, I could fit the whole sector info on one screen, but it became quite difficult to read, so I gave up.


FSINFO Sector
FAT32 has two more important sectors besides the boot sector itself. One of them is the FSINFO sector: the file system information sector. I briefly explained this in the boot sector article. Until FAT32, free space on disk used to be calculated by counting free clusters. Let's omit the technical details for a moment. The maximum number of clusters in FAT16 is 2^16 = 65 536, while this increased to 2^28 ≈ 268M in FAT32. This means that the previous free space calculation algorithm would run 4096 times slower. To solve this, the FSINFO sector was added to FAT32, which keeps track of the free and used clusters. Even though there is a pointer to this sector in the boot sector, the value of the pointer is almost always 1, which means that the FSINFO sector follows the boot sector. Its structure is as follows:

Sector Offset  Size      Description
0x00           4 byte    Sector signature 'RRaA'
0x04           480 byte  Reserved
0x1E4          4 byte    Sector signature 'rrAa'
0x1E8          dword     Number of free clusters
0x1EC          dword     Most recently allocated cluster (roughly the number of occupied clusters on an unfragmented disk)
0x1F0          12 byte   Reserved
0x1FC          4 byte    Sector signature 0x0,0x0,0x55,0xAA
Compiled from Wikipedia

The values in these fields may not be up to date if a disk is not properly unmounted. In WinNT, when a FAT32 disk is connected, bit 0 of byte 0x41 of the boot sector is set (dirty bit), and it is reset when the disk is unmounted. When a disk is connected (again) and this bit is not zero, it indicates that the disk hasn't been unmounted properly. In this case, the user is asked to run CHKDSK, because the values in the FSINFO sector are presumably not correct. Similarly, if an IO error occurs, bit 1 of byte 0x41 is set and the user is asked to perform a surface scan when the disk is remounted. 0xFFFF FFFF is written to the FSINFO fields during formatting. This is an invalid value and the OS is expected to calculate and write the actual values here.

The hexdump output of these fields is given below:

00208000  52 52 61 41 00 00 00 00  00 00 00 00 00 00 00 00  |RRaA............|
00208010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
002081e0  00 00 00 00 72 72 41 61  be dd 0f 00 ed 18 00 00  |....rrAa........|
002081f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|


This output shows that there are 0x0F DDBE = 1 039 806 free clusters. Based on the boot sector, each cluster consists of 8 sectors of 512 bytes. This gives 1 039 806 * 8 * 512 = 4 259 045 376 bytes, i.e. 4061.742 MB of free space.


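The next field (offset 0x1EC) reads 0x18ED = 6381. Strictly speaking, this is not a count but the number of the most recently allocated cluster, which the OS uses as a hint for where to start looking for a free cluster. On a freshly installed, unfragmented disk like this one, it is still close to the number of occupied clusters: 6381 clusters correspond to about 24.926 MB. This value is of course not the total size of all files but the total space occupied on the disk. I mentioned the difference between them and the "slack space" concept in the clusters section of the FAT article.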
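
The arithmetic can be reproduced with a few lines of Python. This is only a sketch over the last 32 bytes of the FSINFO sector shown in the hexdump above; the variable names are mine, and MB here means 1024 * 1024 bytes.

```python
import struct

# FSINFO bytes at sector offsets 0x1E0..0x1FF, copied from the hexdump above.
tail = bytes.fromhex(
    "00 00 00 00 72 72 41 61 be dd 0f 00 ed 18 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa"
)
BASE = 0x1E0  # sector offset of the first byte in `tail`

signature = tail[0x1E4 - BASE:0x1E8 - BASE]
free_clusters, field_1ec = struct.unpack_from("<II", tail, 0x1E8 - BASE)

cluster_bytes = 8 * 512   # sectors per cluster * bytes per sector (boot sector)
MB = 1024 * 1024

print("signature:", signature)
print("free clusters:", free_clusters,
      "=", round(free_clusters * cluster_bytes / MB, 3), "MB")
print("field at 0x1EC:", field_1ec,
      "=", round(field_1ec * cluster_bytes / MB, 3), "MB")
```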


Backup Boot Sector
In FAT32, a backup of the boot sector is written to the sixth sector as a protection against data corruption. In theory, this backup can be written to any sector, but in their official documentation MS does not recommend keeping it anywhere except the sixth sector. Only Win95's boot sector code tries to read it (FreeDOS and WinXP do not) if an IO error occurs. If the pointer to this sector is itself in an unreadable sector, it doesn't make much sense to have a backup at all. Although this protects against viruses that destroy the boot sector, it does not prevent a newer generation virus from destroying the backup, too. The backup can only help disk recovery programs to restore the boot sector if the original one is corrupted. By the way, there is also a backup FSINFO sector after the backup boot sector.

I previously mentioned that FAT32 support started with Win95 OSR2 and continued until WinXP went out of support. I wanted to install OSR2 in VBox to play with, but I think it's not supported there (I got a blue screen every time while it was scanning devices), so I installed it in vmware. It's kinda tricky: I downloaded a Win95b system disk image from bootdisk.com and created a bootable .iso with k3b, using the floppy image and the OSR2 setup files, which I had downloaded from the ITU software server. I created a VM in vmware with 128 MB RAM, a 12 GB disk and a floppy drive, and selected Win95 as the operating system. Then I booted it with the .iso I created. At the A:\> prompt, I created a single partition on the entire disk with FDISK and rebooted the VM from the CD. While the VM is booting, you have to be quick to press Esc and select the CD drive to boot from. Otherwise, the VM will try to boot from the hard disk and halt at the "Missing operating system" screen. After booting, I formatted the disk, created a directory named "setup", copied the installation files from the CD there and finally ran SETUP.EXE from this directory. The setup program looks for device drivers in the directory where the installation was started. If you start the installation from the CD, you will be asked for that CD every time a new device is connected, and this is annoying. The rest of the installation is pretty straightforward.

Is it worth installing this? I think not. There is no significant difference in the boot sector data area. Actually, the values are more or less the same as in FreeDOS, except for the disk size (obviously). I can say the same for 32 and 64-bit WinXP: no fundamental difference. The boot sector can be viewed with the disk editor under Win95, but it detects Windows, activates read-only mode and does not allow the sector to be changed. In WinXP, the hard disk can be read with HxD, but with WinXP it is no longer possible to connect to any webpage and download a file, due to TLS incompatibility. For this reason, I configured file sharing in WinXP, downloaded the HxD installation file to my computer and copied it to the VM over file sharing.

Note: The latest Fedora doesn't allow the SMB1 protocol as a client (which is what XP supports), therefore it is necessary to add "client min protocol = NT1" to the [global] stanza in smb.conf to connect*.


File Allocation Table (FAT)
The location of the table is calculated by adding the hidden sectors value to the reserved sectors value, like in FAT16. In other words, it is located after the boot sector by the number of reserved sectors.

fat_start = hidden_sectors + reserved_sectors (1)

I already mentioned that, unlike FAT16, the root directory does not have a fixed location in FAT32. So, there is no need to calculate a root_dir_start value.

data_start = fat_start + number_of_FATs * sectors_per_FAT (2)


The function to convert a cluster number to a sector number is the same for both FATs:

clus2sect(c) = (c - 2) * sectors_per_cluster + data_start (3)
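
Formulas (1) to (3) translate directly into code. The following Python sketch is only an illustration; the function and parameter names follow the formulas, and the numbers come from the FreeDOS boot sector decoded earlier.

```python
def fat_start(hidden_sectors, reserved_sectors):
    return hidden_sectors + reserved_sectors                     # formula (1)


def data_start(fat_start_sector, number_of_fats, sectors_per_fat):
    return fat_start_sector + number_of_fats * sectors_per_fat   # formula (2)


def clus2sect(c, sectors_per_cluster, data_start_sector):
    return (c - 2) * sectors_per_cluster + data_start_sector     # formula (3)


# Values from the FreeDOS boot sector: hidden = 63, reserved = 32,
# two FATs of 8174 sectors each, 8 sectors per cluster.
fs = fat_start(63, 32)
ds = data_start(fs, 2, 8174)
print("fat_start:", fs)
print("data_start:", ds)
print("cluster 2 starts at sector", clus2sect(2, 8, ds))
print("cluster 3 starts at sector", clus2sect(3, 8, ds))
```

Cluster numbering starts at 2, which is why the first data cluster maps exactly to data_start.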


To obtain the directory tree, the root directory cluster value is read from the boot sector and its absolute sector is calculated by substituting it into formula (3). Before parsing the directory tree, let's have a look at the FAT. The boot sector is in sector 63 (hidden sectors) and 32 sectors are reserved for it (reserved sectors). From formula (1), the FAT is found in sector 95 (= 63 + 32). The .vdi file header is 0x200000 bytes long, i.e. 4096 blocks of 512 bytes. Adding the 95 blocks for the MBR and the reserved sectors gives 4191, which is the FAT location in the file:

dd if=~/VirtualBox\ VMs/FreeDOS/FreeDOS.vdi \
bs=512 skip=4191 | hexdump -C | less

00000000  f8 ff ff 0f ff ff ff 0f  03 00 00 00 04 00 00 00  |................|
00000010  05 00 00 00 ff ff ff 0f  00 00 00 00 ff ff ff 0f  |................|
[SNIP]


The logic behind the entries is the same as in previous versions of FAT, so there is no need to dwell on all entries. I've only included the first eight entries here. I added FAT32 support to my fatread code on github*:

Cluster0: 0xFFF FFF8 (0x0000)
Cluster1: 0xFFF FFFF (0x0004)
Cluster2: 0x3 (0x0008)
Cluster3: 0x4 (0x000C)
Cluster4: 0x5 (0x0010)
Cluster5: 0xFFF FFFF (0x0014)
Cluster6: 0x0 (0x0018)
Cluster7: 0xFFF FFFF (0x001C)
...

The zeroth entry holds the media descriptor byte or FAT ID, like in FAT16 and FAT12, but notice that the highest 4 bits of all entries are zero. The first entry contains an end of file (EoF) or end of chain (EoC) mark. The second entry points to the third, the third to the fourth, the fourth to the fifth, and the fifth entry contains an EoF. From the boot sector, remember that the root directory starts at the second cluster. So, we found the root directory: it occupies clusters 2 to 5. The sixth cluster is empty and the seventh contains another EoF.
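
Following a chain like this is a simple loop. Here is a minimal Python sketch over the 32 bytes of the FAT shown above; it is not my fatread code, and the function name chain is made up for the example.

```python
import struct

# First eight FAT32 entries, copied from the hexdump above.
fat_bytes = bytes.fromhex(
    "f8 ff ff 0f ff ff ff 0f 03 00 00 00 04 00 00 00 "
    "05 00 00 00 ff ff ff 0f 00 00 00 00 ff ff ff 0f"
)
# Only the low 28 bits of each entry are significant.
fat = [entry & 0x0FFFFFFF for entry in struct.unpack("<8I", fat_bytes)]


def chain(fat, start):
    """Collect the cluster numbers of a file or directory starting at `start`."""
    clusters = []
    c = start
    # Stop at free/reserved values (0, 1), at bad/EoC marks, or past the table.
    while 2 <= c < 0x0FFFFFF7 and c < len(fat):
        clusters.append(c)
        c = fat[c]
    return clusters


print("root directory clusters:", chain(fat, 2))
print("chain starting at 7:", chain(fat, 7))
```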

Although entries are 32 bits in size, only the low 28 bits are used. The highest nibble is reserved. Therefore, the theoretical upper limit of the number of clusters is 2^28 = 268 435 456. Since 12 values have a special usage, the practical limit is 12 less than the theoretical limit. These special values are: the values between 0x0FFF FFF8 and 0x0FFF FFFF for EoF, 0x0FFF FFF7 for a bad cluster, 0x0 for a free cluster, and the reserved values 0x1 and 0x0FFF FFF6. Additionally, usage of the values between 0x0FFF FFF0 and 0x0FFF FFF5 is discouraged for compatibility reasons.

Bit 27 of the first entry can be used as a dirty bit, like byte 0x41 of the boot sector. If this bit is found reset (zero) during mount, the OS tries to scan the disk, or at least assumes that the values in the FSINFO sector are unreliable. Similarly, bit 26 is used for IO errors.
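
The special values and the two flag bits can be summarized in a short Python sketch. The function names are hypothetical, and treating a set bit as "clean" and a reset bit as "problem" follows the description above.

```python
def classify(entry):
    """Describe a FAT32 entry using the special values listed above."""
    v = entry & 0x0FFFFFFF            # the highest nibble is reserved
    if v == 0x0:
        return "free"
    if v == 0x1 or v == 0x0FFFFFF6:
        return "reserved"
    if v == 0x0FFFFFF7:
        return "bad cluster"
    if v >= 0x0FFFFFF8:
        return "end of chain"
    return "next cluster"             # includes the discouraged 0x0FFFFFF0..F5


def volume_flags(fat_entry_1):
    """Interpret bits 27 and 26 of the first FAT entry (set means OK)."""
    return {
        "clean_shutdown": bool(fat_entry_1 & (1 << 27)),
        "no_io_error":    bool(fat_entry_1 & (1 << 26)),
    }


for value in (0x0, 0x3, 0x0FFFFFF7, 0x0FFFFFFF):
    print(hex(value), classify(value))
print(volume_flags(0x0FFFFFFF))       # first entry of the FreeDOS FAT above
```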

*Note: The FAT32 table takes up a lot of space (e.g. 2 * 8174 sectors). If there is not much data on the .vdi disk, most of the FAT consists of zero entries. Because such all-zero blocks are not stored in the .vdi file (unless the virtual disk is preallocated), fatread is unaware of the missing blocks and returns erroneous values, especially for large cluster numbers. For example, it returns the FAT ID for cluster 259840 of FreeDOS, because it has actually started to process the second FAT:



Directories and Directory Table
From the boot sector, I know that the root directory starts at the second cluster, and I found in the previous section that it continues up to the fifth cluster. fat_start = 95 and sectors_per_FAT = 8174. Then, from formula (2), data_start = 95 + 2 * 8174 = 16443 and from formula (3), clus2sect(2) = 16443. But there is a problem with this calculation, as can be seen below:

dd if=~/VirtualBox\ VMs/FreeDOS/FreeDOS.vdi bs=512 skip=$((4096+16443)) | hexdump -C | head
00000000  2d 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |----------------|
00000010  2d 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 0d 0a 54 61 64  |-----------..Tad|

The note marked with '*' above actually explains the problem. Because empty blocks are not kept in a dynamically allocated .vdi file, the wrong sector content appears, even though the calculation is correct. I checked the root directory contents using DISKEDIT from inside the VM. Actually, when I start DISKEDIT from C:\, it automatically opens the root directory. To prove that my calculation is correct, I pressed Alt+P, entered 16443 and saw the same data (F2: hexadecimal view).

Contents of the root directory

I saw the "FREEDOS2016" entry at the beginning of the root directory. I opened the .vdi file with the hexdump -C ~/VirtualBox\ VMs/FreeDOS/FreeDOS.vdi | less command, searched for the first entry by typing "/FREEDOS2016", and found this string at offset 0x407600 of the file.

The entries here are also similar to those in previous FAT versions. The most important difference is that cluster numbers are 32-bit: the low word is at offset 0x1A and the high word is at 0x14. There are also some new features that came with MS-DOS 7 and WinNT. Here is an updated version of the table from the FAT article:

Offset  Size    Description
0x00    8 byte  File name
0x08    3 byte  File extension
0x0B    1 byte  File attributes
0x0C    1 byte  MSDOS: Reserved, WinNT: Case information
0x0D    1 byte  Create time, fine part (in units of 10 msec.)
0x0E    word    Create time
0x10    word    Create date
0x12    word    Access date
0x14    word    Cluster num. (high word)
0x16    word    Modify time
0x18    word    Modify date
0x1A    word    Cluster num. (low word)
0x1C    dword   File size

Byte 0x0C is ignored by MS-DOS and Win95. In WinNT and XP, bit 3 of byte 0x0C indicates a lowercase filename and bit 4 indicates a lowercase extension:

0x00: TEST000.TXT
0x08: test000.TXT
0x10: TEST000.txt
0x18: test000.txt

The time and date format is the same as in FAT16. H: hours, M: minutes and S: seconds:

Offset 0x0F, 0x17          Offset 0x0E, 0x16
H4 H3 H2 H1 H0 M5 M4 M3    M2 M1 M0 S4 S3 S2 S1 S0

Likewise, Y: year, M: month, D: day:

Offset 0x11, 0x13, 0x19    Offset 0x10, 0x12, 0x18
Y6 Y5 Y4 Y3 Y2 Y1 Y0 M3    M2 M1 M0 D4 D3 D2 D1 D0

As mentioned previously, the time resolution in FAT16 is two seconds, because seconds are represented with five bits. In FAT32, byte 0x0D refines the create time in units of 10 milliseconds. Thereby, the time resolution is improved to 10 ms, but only values between 0..199 are valid.

Example:
0b706e40  4e 54 4c 44 52 20 20 20  20 20 20 27 08 00 00 10  |NTLDR      '....|
0b706e50  8e 38 50 53 01 00 00 10  8e 38 8f 1c c0 d0 03 00  |.8PS.....8......|

The NTLDR file attributes byte is 0x27. This means the archive, system, hidden and read-only flags are set. The value at 0xC is 0x8, so the filename is in lowercase. The extension doesn't exist, so bit 4 is irrelevant. The file creation and modification times are the same: 0x1000. Bit 1 of the hour field is set, so it is 02:00. The dates are also the same: 0x388E = 0011100 0100 01110 = 1980+28/04/14, i.e. 14 April 2008. The access date is 0x5350 = 0101001 1010 10000 = 1980+41/10/16, i.e. 16 October 2021, because I recently checked the file properties. The file starts at cluster 0x011C8F and its size is 0x3D0C0 = 250048 bytes.
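
The same decoding can be done programmatically. Below is a minimal Python sketch for the NTLDR entry above; it follows the directory entry table in this article, and the helper names fat_time and fat_date as well as the ATTRS table are my own for this example.

```python
import struct

# The 32-byte NTLDR directory entry, copied from the hexdump above.
entry = bytes.fromhex(
    "4e 54 4c 44 52 20 20 20 20 20 20 27 08 00 00 10 "
    "8e 38 50 53 01 00 00 10 8e 38 8f 1c c0 d0 03 00"
)

(name, ext, attr, case, ctime_fine, ctime, cdate, adate,
 clus_hi, mtime, mdate, clus_lo, size) = struct.unpack("<8s3sBBBHHHHHHHI", entry)


def fat_time(t):
    """HHHHHMMM MMMSSSSS -> (hours, minutes, seconds); seconds in 2 s steps."""
    return (t >> 11, (t >> 5) & 0x3F, (t & 0x1F) * 2)


def fat_date(d):
    """YYYYYYYM MMMDDDDD -> (year, month, day); years count from 1980."""
    return (1980 + (d >> 9), (d >> 5) & 0x0F, d & 0x1F)


ATTRS = {0x01: "read-only", 0x02: "hidden", 0x04: "system",
         0x08: "volume label", 0x10: "directory", 0x20: "archive"}

base = name.decode("ascii").rstrip()
if case & 0x08:                      # bit 3: lowercase file name (WinNT/XP)
    base = base.lower()
cluster = (clus_hi << 16) | clus_lo  # high word at 0x14, low word at 0x1A

print("name:", base)
print("attributes:", [n for bit, n in ATTRS.items() if attr & bit])
print("created:", fat_date(cdate), fat_time(ctime))
print("modified:", fat_date(mdate), fat_time(mtime))
print("accessed:", fat_date(adate))
print("first cluster:", hex(cluster), "size:", size)
```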

Long File Name (LFN) Support
Since FAT32 does not add anything to the VFAT long file names of FAT16, the details can be found in the previous article.


FAT32 Boot Code

a. FreeDOS Boot Code
I copied the FreeDOS boot code from my VM, compared it with the sources on github and found out that the (LBA supporting) boot32lb.asm is the one running. I downloaded it and added my comments, prefixed with "; --". This can be downloaded here. The line numbers given in the rest of the article refer to the file with my comments.

The boot code is loaded to 0:0x7C00 by default (line 54) and execution continues from the real_start label (l. 117) with a jmp instruction. Between lines 60 and 116, pointers to the boot sector data are defined. Like in the FAT16 code, the kernel will be loaded at address 0x60:0, therefore the code copies itself to 0x1FE0:0 and resumes its execution from this address (lines 123-129). The pointer at line 131 points to the address where the kernel will be loaded. "Loading FreeDOS" is printed to the screen at line 141 and, in the following calc_params block, the fat_start and data_start values are calculated using formulas (1) and (2) at lines 155 and 163.

In FAT16, the reserved_sectors value is mostly 1, and the FAT starts right after the boot sector. In FAT32, this value is always greater than one, as there are the FSINFO and backup boot sectors between the boot sector and the FAT. Since the root directory cluster is given in the boot sector, its location doesn't need to be calculated as in FAT16.

There is an interesting code block between lines 169 and 178. In a loop, the value of AX, which is initially 512, is compared with bytes_per_sector and multiplied by 2 if they are not equal. At each step, the operand of the shift operation on line 278 is increased by one (self modifying code), and this shift operation is used while calculating the location (abs. sect. values) of cluster entries in the FAT. In short, this code seems to support sectors bigger than 512 bytes.
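
My reading of that loop can be sketched in Python as follows. This is an interpretation, not a translation of boot32lb.asm: I assume the shift count starts at 7, because a 512-byte sector holds 512 / 4 = 128 = 2^7 FAT32 entries, and the function names are hypothetical.

```python
def fat_shift(bytes_per_sector):
    """Right-shift that turns a cluster number into a FAT sector index."""
    size, shift = 512, 7      # assumed start: 128 = 2**7 entries per 512-byte sector
    while size != bytes_per_sector:
        size *= 2             # AX is doubled ...
        shift += 1            # ... and the shift operand is patched by one
        if size > bytes_per_sector:
            raise ValueError("sector size must be 512 * 2**n")
    return shift


def locate_fat_entry(cluster, bytes_per_sector, fat_start):
    """Return (absolute sector, byte offset in that sector) of a FAT32 entry."""
    sector = fat_start + (cluster >> fat_shift(bytes_per_sector))
    offset = (cluster * 4) % bytes_per_sector
    return sector, offset


print(fat_shift(512), fat_shift(4096))
print(locate_fat_entry(2, 512, 95))      # root directory entry of the FreeDOS disk
print(locate_fat_entry(130, 512, 95))    # an entry in the second FAT sector
```

With 512-byte sectors, cluster 2 lands at byte 8 of sector 95, which matches the Cluster2 line in the FAT listing earlier.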

At line 189, the sector number of the root directory cluster is calculated, and the sector is read with the readDisk function at line 194. KERNEL.SYS is searched for in the root directory entries between lines 201 and 212. DI is increased while searching, and if its value exceeds the bytes_per_sector value, the next sector is read (l. 216) and DX, which initially contains sectors_per_cluster, is decreased by one. If all sectors in a cluster have been read, DX will be zero (line 216), so the next_cluster function is called, which returns the following cluster number for the cluster number given in EAX. All instructions between lines 188 and 220 repeat until KERNEL.SYS is found or there are no more entries in the directory table. If it is found, its cluster number is loaded into EAX (ff_done label). This is translated to an abs. sect. number using the convert_cluster function (line 232) and the entire cluster is read sector by sector with readDisk (line 236). The function on line 232 returns carry if the next cluster of the file contains the EoF mark. This indicates that the file has been completely read. In this case, execution is handed over to the kernel at the boot_success label.

I will not go into the details of the individual functions here. I tried to explain all of them with my comments in the code.


b. Windows Boot Code
While searching the internet for the Windows boot code, I found some resources about it on the personal webpage of Jens Elkner from UNI Magdeburg. Since the Win95 boot code is not open source, I will briefly explain this code dump. In this section, I give code references as offset addresses instead of line numbers.

The most important detail about the Windows boot code is that it consists of two parts. The real boot sector code is responsible for loading the rest, which is in the third sector of a Win95 partition and in the twelfth sector of a WinXP partition.

Win95 looks to have inherited the DOS boot code. At the beginning of the code, the values in a diskette parameter table are modified (0x7C6E, 0x7C81, etc.). It is doubtful that this table is ever used with hard disks. If the boot media is a floppy (0x7C8E), execution continues from 0x7CB5 to process the boot sector data. If the boot media is a disk (or has an MBR, to be more precise), the MBR is read and the partition record is found by comparing the hidden_sectors value with the starting LBA address in the partition entries. They must be equal for the correct partition entry. The partition type byte of the booting partition is ORed with 2 (at 0x7CAA) and written to 0x7C02 (overwriting the NOP instruction). This will be compared with 0xE at address 0x7D40, because if LBA is supported, the partition type is either 0xC or 0xE (0xC OR 2 = 0xE). In this case, function 0x42 of int 0x13 can be used. It is really strange that MS does not check for LBA support in code, but I also think every computer that was not too old in the 90s supported LBA.
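
The OR trick is easy to verify. The following Python lines are only an illustration of the comparison described above; the function name uses_lba is made up.

```python
def uses_lba(partition_type):
    """Mimic the Win95 check: (type OR 2) is compared with 0x0E."""
    return (partition_type | 0x02) == 0x0E


for ptype in (0x06, 0x0B, 0x0C, 0x0E):
    print(hex(ptype), "->", hex(ptype | 0x02), uses_lba(ptype))
```

Only 0x0C and 0x0E end up as 0x0E, so a single compare covers both LBA partition types.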

At 0x7CC4, CX=3. This value enters the read_disk function at 0x7D31 as 2. Thus, two more sectors after the boot sector are read into memory. These two sectors are the FSINFO sector and the second part of the boot code. If reading fails here (0x7CD2), the code tries to load the backup boot sector (0x7CD9), and if loading succeeds, it jumps to the beginning of the second phase boot code at 0x8000. Btw, the instruction at 0x7CD4 is nonfunctional (from my understanding), but because of the 0xF8 value (media descriptor), it may have a special meaning; for example, it could be used as a variable somewhere in the code.

Between 0x7D03 and 0x7D30, there are functions that show errors and reboot the computer. Between these and the error strings, there are the CHS and LBA disk read functions as well. Btw, just after the error messages, there are four strange pointers, each holding an address relative to itself. MS oddities again.

The second phase of the boot code contains a lot of unnecessary CLI/STI blocks. data_start is calculated up to 0x8016 and stored in [BP-04] at 0x801B. [BP-08] is written with -1 for future use (0x801F). With the SHLD instruction at 0x803E, EDX is shifted left by 16 bits and the high word of EAX is written to DX. In other words, the value in EAX is written to DX:AX with a single instruction. This instruction appears in many places in the code together with its counterpart, because EAX is used in calculations with 32-bit values but the sector number is given to the read_disk function in DX:AX. Hey, Microsoft, how about optimizing the read_disk function instead?! Two shift operations at 0x8047 write DX:AX back to EAX (the inverse of SHLD). Another oddity in read_disk is that a DAP packet is created before checking for LBA support. Huh, this will not be used if LBA is not supported, right?!
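
The register juggling boils down to splitting a 32-bit value into two 16-bit halves and joining them again. A tiny Python sketch of this idea, with hypothetical helper names, using the NTLDR start cluster from the directory section as sample value:

```python
def eax_to_dx_ax(eax):
    """Split a 32-bit value into (DX, AX), as the SHLD instruction does."""
    return (eax >> 16) & 0xFFFF, eax & 0xFFFF


def dx_ax_to_eax(dx, ax):
    """Join DX:AX back into one 32-bit value (the inverse operation)."""
    return ((dx & 0xFFFF) << 16) | (ax & 0xFFFF)


dx, ax = eax_to_dx_ax(0x00011C8F)
print(hex(dx), hex(ax), hex(dx_ax_to_eax(dx, ax)))
```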

The root directory cluster, read at 0x8028, is translated to an abs. sect. number between offsets 0x8050 and 0x8067, and the root directory table is read into memory at 0:0x700 (0x8068 to 0x8073). The table is checked for the IO.SYS entry (0x8081) and, if it is found, the code branches from 0x8084 to the routine at 0x809F, which loads the file into memory. In this routine, the cluster number of IO.SYS is found at 0x80A2 and written to DX:AX (0x80CF). Only four sectors from this cluster are read to 0:0x700 (0x80D9 and 0x80DC). At 0x80E4 and 0x80EA, the 'MZ' signature at the beginning of IO.SYS and the first two chars of the code are checked, respectively. If these are consistent, the file is run at 0x70:0x200, otherwise an "Invalid system disk" error pops up (but why?). IO.SYS has to read the rest of its sectors by itself.

Given a cluster number in DX:AX, the routine between 0x80FD and 0x811F calculates the number of the following cluster. At 0x8120, a sub-function is called to find in which FAT sector the entry of the given cluster is. The sector number found is written to the variable at 0x801F. Thus, e.g. if the first cluster of IO.SYS is 3, its entry is in the first sector of the FAT and most likely its next cluster is 4, whose entry is in the same sector. So, the same sector of the FAT doesn't need to be read again and again. If the value in EAX (0x8138) is the same as the last read FAT sector (in [BP-08]), the JE instruction at 0x813C jumps to the end of the function.
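
The caching idea can be sketched in Python as below. This is an illustration of the technique, not a reconstruction of the Win95 code: the class CachedFat, the read_sector callback and the toy disk are all made up for the example.

```python
import struct


class CachedFat:
    """Look up FAT32 entries, re-reading a FAT sector only when it changes."""

    def __init__(self, read_sector, fat_start, bytes_per_sector=512):
        self.read_sector = read_sector   # hypothetical callback: sector number -> bytes
        self.fat_start = fat_start
        self.bytes_per_sector = bytes_per_sector
        self.cached_sector = -1          # "nothing read yet", like the -1 in [BP-08]
        self.cached_data = b""

    def next_cluster(self, cluster):
        byte_pos = cluster * 4
        sector = self.fat_start + byte_pos // self.bytes_per_sector
        if sector != self.cached_sector:     # the compare + JE described above
            self.cached_data = self.read_sector(sector)
            self.cached_sector = sector
        offset = byte_pos % self.bytes_per_sector
        return struct.unpack_from("<I", self.cached_data, offset)[0] & 0x0FFFFFFF


# Toy disk: a single FAT sector holding the eight entries listed earlier.
entries = [0x0FFFFFF8, 0x0FFFFFFF, 3, 4, 5, 0x0FFFFFFF, 0, 0x0FFFFFFF]
fat_sector = struct.pack("<8I", *entries).ljust(512, b"\x00")
reads = []


def read_sector(n):
    reads.append(n)                      # record every simulated disk access
    return fat_sector


fat = CachedFat(read_sector, fat_start=95)
print(fat.next_cluster(3), fat.next_cluster(4), "disk reads:", len(reads))
```

Both lookups hit the same FAT sector, so the simulated disk is read only once.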

In WinXP, on the other hand, the floppy table is not processed anymore. The code loads the secondary part directly from the twelfth sector to 0x8000 and jumps there. LBA support is checked in the code on the fly. With the unnecessary parts removed from the code, there is a lot of space filled with zeroes in the boot sector. The second phase code is almost the same as in Win95, except that the unnecessary CLI/STI's are removed and, as I wrote above, the sector number is constantly kept in EAX and read_disk takes the parameter from EAX, thus no shift operation is needed. The difference between the two codes is 63 bytes. Btw, WinXP loads the NTLDR file to 0x2000:0, instead of IO.SYS. In short, the oddities in the Win95 boot code don't exist in the WinXP boot code and it is more optimized.


[1]: https://en.wikipedia.org/wiki/File_Allocation_Table#FAT32