Chapter 13. Troubleshooting (2.213)

Revision: $Revision: 451 $ ($Date: 2011-03-23 13:29:59 +0100 (Wed, 23 Mar 2011) $)

This topic has a total weight of 7 points and contains the following objectives:

Objective 2.213.1; Identifying boot stages and troubleshooting bootloaders

Candidates should be able to determine the cause of errors in loading and usage of bootloaders. GRUB and LILO are the bootloaders of interest.

Objective 2.213.2; General troubleshooting

A candidate should be able to recognize and identify boot loader and kernel specific stages and utilize kernel boot messages to diagnose kernel errors. This objective includes being able to identify and correct common hardware issues, and be able to determine if the problem is hardware or software.

Objective 2.213.3; Troubleshooting system resources

A candidate should be able to identify, diagnose and repair local system environment.

Objective 2.213.4; Troubleshooting environment configurations

A candidate should be able to identify common local system and user environment configuration issues and common repair techniques.

Identifying boot stages (2.213.1)

Revision: $Revision: 1.8 $

Candidate should be able to: determine, from bootup text, the 4 stages of boot sequence and distinguish between each.

Key files, terms and utilities include:

boot loader start and hand off to kernel
kernel loading
hardware initialization and setup
daemon initialization and setup

Resources: the man pages for the various commands.

The bootstrap process

The boot process has been described at length in Chapter 2, System Startup (202). This section underlines and enhances that description. We will limit our discussion to PC hardware, though most other hardware uses similar schemes.

The PC boot process is started on powerup. The processor will start execution of code contained in the Basic Input/Output System (BIOS). The BIOS is a program stored in Read Only Memory (ROM) and is a part of the PC hardware. Apart from the bootstrap code it contains routines to set up your hardware and to communicate with it. Most of the code in the BIOS is never used by Linux, but the bootstrap code is.

The bootstrap code will load a block of data from the first sector (cylinder 0, head 0, sector 1) of what has been configured to be your boot drive. In most cases this will be the first floppy drive. If reading the floppy disk fails or no floppy disk was inserted, the program in the BIOS will try to load the first sector from the first (specified) hard disk. Most BIOSes allow you to set up an alternate order, for example to try the hard disk first or to try booting from CD first.

Wherever the data was found (and if it was found, of course) the BIOS will load it into memory and try to execute it as if it were a program. In most cases the data either consists of code from a boot loader such as LILO, or the start of an operating system kernel like Linux. If the code on the boot sector is unreadable or invalid, the BIOS will try the next boot device.
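The fallback from one boot device to the next can be sketched as follows. This is a minimal illustration, not BIOS code: it assumes the conventional PC boot signature (bytes 0x55 0xAA at offsets 510 and 511 of the sector), which this text does not describe, and the function names are hypothetical.

```python
SECTOR_SIZE = 512
# Conventional PC boot signature; an assumption, not taken from the text.
BOOT_SIGNATURE = b"\x55\xaa"


def looks_bootable(sector):
    """Return True if a 512-byte sector ends with the boot signature."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def first_bootable(devices):
    """Walk the configured boot order and return the name of the first
    device whose first sector looks bootable, or None if none does.

    devices is a list of (name, first_sector_bytes) pairs.
    """
    for name, sector in devices:
        if looks_bootable(sector):
            return name
    return None
```

Called with an empty floppy sector followed by a valid hard disk sector, first_bootable skips the floppy and returns the hard disk, mirroring the boot order described above.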

Kernel loading

As can be determined from the text above, there are two ways for the kernel to be loaded:

  • by using the kernel code itself. The first sector of the boot disk will contain the first sector of the Linux kernel itself. That code loads the rest of the kernel from the boot device.

  • by using a bootstrap loader. There are 2 well-known bootstrap loaders for Linux: GRUB (GRand Unified Bootloader) and LILO. LILO is still widely used, but most modern distributions employ GRUB. GRUB has a number of advantages over LILO, such as built in knowledge of filesystems. Hence GRUB is capable of loading configuration files and the kernel directly by their filename. LILO uses a different method: the physical location (track/sector/offset) for the kernel is stored at installation time. The bootloader part of LILO doesn't need knowledge of the filesystem it is booting from.

If a bootstrap loader has been used, it will locate the kernel, load it and execute it. If the kernel has been raw-copied to a diskette, its first sector also contains code that loads the rest of the kernel code from the boot device and subsequently executes it.

The kernel will initialize its internal data structures and device drivers. Once it is completely initialized, it consults the contents of the ramdisk word, a fixed address in its binary that specifies where the kernel can find the filesystem that will be mounted as root (`/', the root filesystem). The ramdisk word also can specify that the filesystem is a RAMdisk. A RAMdisk is a memory region that is loaded with an (optionally compressed) image of a filesystem, and that is used as if it were a hard disk. If the kernel cannot find the root filesystem it halts.

Daemon initialization

Assuming all went well the kernel now is up and running and has mounted its root filesystem. Next, the kernel will start up the init program, located in either /bin or /sbin. init uses the configuration file /etc/inittab to determine which program(s) to start next.

The way init is used to start up the initial processes varies from distribution to distribution. init can be configured in many ways. But in all cases a number of commands will be issued to set up the basic system, such as running fsck on hard disks, initializing swapping and mounting disks that are configured in /etc/fstab. Next, a group of commands (often scripts) is executed. They define a so called runlevel. Such runlevels define a set of processes that need to be run to get the system in a certain state, for example multi-user mode or single-user mode.

The initdefault entry in the /etc/inittab file defines the initial runlevel of the system. If there is no such entry or the configuration file was not found, init will prompt for a runlevel at the system console. Subsequently, all processes specified for that runlevel in the inittab file will be started. In some cases the initial scripts are specified by the sysinit entry in the inittab file, in other cases they are considered part of a runlevel.
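How init finds the initial runlevel can be illustrated with a small sketch. It assumes the usual colon-separated inittab layout (id:runlevels:action:process); the sample file contents and the function name are hypothetical.

```python
def default_runlevel(inittab_text):
    """Return the runlevel named by the initdefault entry, or None.

    None corresponds to the case where init would have to prompt
    for a runlevel at the system console.
    """
    for line in inittab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":", 3)  # id, runlevels, action, process
        if len(fields) == 4 and fields[2] == "initdefault":
            return fields[1]
    return None


# Hypothetical inittab fragment, for illustration only.
SAMPLE_INITTAB = """\
# default runlevel
id:3:initdefault:
si::sysinit:/etc/rc.d/rc.sysinit
l3:3:wait:/etc/rc.d/rc 3
"""
```

For the sample above, default_runlevel(SAMPLE_INITTAB) yields the string "3"; a file without an initdefault entry yields None.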

When a runlevel defines multi-user use, typically a number of daemons are started next. This is done by using start-up scripts, which can be in various locations, depending on the distribution you use. Typically such start-up scripts are located in the /etc/rc.d/ directory and named aptly after the software they run, e.g. sendmail, inetd or sshd. Typically, these scripts are linked to another level of directories, one per runlevel. In all runlevels one or more getty programs will be spawned, to enable user logins.

Recognizing the four stages during boot

From what was written before you now should be able to identify the four stages of the boot sequence:

  • boot loader start and hand off to kernel - typically you can recognize this stage because LILO displays the four letters L, I, L, and O. Each of these letters identifies a certain stage in the initial boot process. Their meaning is described in more detail in the section called “LILO errors”.

  • kernel loading - this stage can be recognized by the message Loading followed by the name of your kernel, e.g. Linux-2.2.20, after which the kernel displays various messages.

  • hardware initialization and setup - can be identified by various messages that inform you about the various hardware components that were found and initialized.

  • daemon initialization and setup - this is fairly distribution specific, but this stage can be recognized by messages that typically contain lines like Starting the ... daemon.

The kernel stores its messages in a ring buffer. On most Linux systems that ring buffer is flushed to a file during the last phase of the boot process for later inspection. The command dmesg will display the contents of the current buffer (and is in fact often the command used to write the buffer to that file). Check the manual page for more information.
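The behaviour of a ring buffer, where the oldest messages are overwritten once the buffer is full, can be shown with a short sketch. This is only a model of the idea, not the kernel's implementation; the class name and the line-based capacity are assumptions made for illustration.

```python
from collections import deque


class RingBuffer:
    """Fixed-size message buffer that silently drops the oldest entries."""

    def __init__(self, capacity):
        # A deque with maxlen discards items from the left when full.
        self._lines = deque(maxlen=capacity)

    def log(self, message):
        """Append a message, evicting the oldest one if necessary."""
        self._lines.append(message)

    def dump(self):
        """Return the current contents, oldest first (what dmesg would show)."""
        return list(self._lines)
```

This also shows why the buffer is saved to a file at the end of the boot sequence: once enough later messages arrive, the early boot messages are no longer in the buffer.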

Troubleshooting LILO

Revision: $Revision: 1.7 $

Candidate should be able to: determine specific stage failures and corrective techniques.

Key files, terms and utilities include:

Know meaning of L, LI, LIL, LILO, and scrolling 010101 errors
Know the different LILO install locations, MBR, /dev/fd0, or primary/extended partition
/boot/boot.b
Know significance of /boot/boot.### files

Resources: the man pages for the various commands, Wirzenius98, Yap98.

Booting from CD-ROM and networks

As of this writing most BIOSes let you choose booting from hard disk, floppy, network or CDROM. To give an overview, these alternatives are outlined below. Since most systems boot from hard disk, that process is described in more detail later on.

Booting from CDROM requires that your hardware support the El Torito standard. El Torito is a specification that says how a CDROM should be formatted such that you can directly boot from it. A bootable CDROM contains a floppy-disk image in its initial sectors. This image is treated like a floppy by the BIOS and booted from.

Booting from the network is done using the Bootstrap Protocol (BOOTP) or the Dynamic Host Configuration Protocol (DHCP). DHCP is an evolution of BOOTP. In most cases the client has no means to address the boot server directly, so the client broadcasts a UDP packet over the network. Any boot server that has information about the client stored will answer. If more than one server responds, the client will select one of them. Since the requesting client does not yet have a valid IP address, the unique hardware (MAC) address of its network card is used to identify it to the BOOTP server(s) in your network. The BOOTP server(s) will issue the IP address, a hostname, the address of the server where the image of the kernel to boot can be found and the name of that image.

The client configures its network accordingly and downloads the specified image from that server using the Trivial File Transfer Protocol (TFTP). TFTP runs over UDP and is often considered to be an unsafe protocol, since there is no authentication. However, its simplicity also allows compact implementations that can be stored in a boot ROM, for example a PC BIOS. After the kernel image has been retrieved, it will be started the usual way. Often, the root filesystem is located on another server too, and NFS is used to mount it. This requires a Linux kernel that allows the root filesystem to be on NFS.
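To give an impression of how simple TFTP is, the following sketch builds the read request a boot client would send to fetch its kernel image. The packet layout (a two-byte opcode followed by a zero-terminated filename and a zero-terminated transfer mode) follows RFC 1350, which this text does not quote; the function name and the image name used below are hypothetical.

```python
import struct

TFTP_OPCODE_RRQ = 1  # "read request" opcode as defined in RFC 1350


def build_tftp_rrq(filename, mode="octet"):
    """Build a TFTP read request packet for the given file name.

    Layout: 2-byte opcode in network byte order, file name, a zero
    byte, transfer mode, a zero byte. There is no field for a user
    name or password, hence the lack of authentication.
    """
    return (
        struct.pack("!H", TFTP_OPCODE_RRQ)
        + filename.encode("ascii") + b"\x00"
        + mode.encode("ascii") + b"\x00"
    )
```

A request for a hypothetical image named vmlinuz is only 16 bytes long, which illustrates why a TFTP client fits comfortably in a boot ROM.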

Booting from disk or partition

Booting from floppy or disk is the common case. In previous chapters we already described the boot process used to boot from floppy. However, there is a slight difference between floppy and hard disk boots. Both contain a boot sector, located at cylinder 0, head 0, sector 1. On a floppy the boot sector often contains just the boot code to be loaded in memory and executed.

Booting from hard disk requires some additional functionality: a hard disk can contain one or more partitions, in which case the boot program needs to find out from which partition to boot. A partition in turn will contain its own boot code sector. The sector located at cylinder 0, head 0, sector 1 of the disk is called the master boot record (MBR).

Information about hard disk partitions is typically stored in partition tables, which are data-structures stored on a special partition sector. There are various types of partition tables, for example IRIX/SGI, Sun or DOS. It depends on the hardware in use which type of partition table is used. In this book we focus on the classical PC (DOS) partition table, which is typical for PC hardware. By default Linux accepts and uses DOS partition tables. Support for other partition types can be enabled in the kernel. On PC hardware the partition table is part of the MBR. You can use the fdisk command to print out your current partition table or to create a new one.

On a PC the BIOS starts by loading cylinder 0, head 0, sector 1 into memory. The first 446 bytes of that sector comprise the boot program, which is executed next. It is up to you which program to use to boot your system. By writing your own boot program you could continue the boot process any way you want. But there are many fine boot programs available, for example the DOS loader, the Linux Loader (LILO) and GRUB. Sometimes a Windows boot loader is used (e.g. Bootmagic).

DOS for example uses a loader program that scans the partition table for a bootable partition. When an entry marked active is found, the first sector of that partition is loaded into memory and executed. That code in turn continues the loading of the operating system.

Linux can install a loader program too. Often this will be LILO, the Linux Loader. LILO uses a two-stage approach: the boot sector has a boot program that loads a boot file, the second stage boot program. That program presents you with a simple menu-like interface, which either prompts you for the operating system to load or optionally times out and loads the default system. Note that the code in the MBR is limited: it does not have any knowledge about concepts like filesystems, let alone filenames. It can access the hard disk, but needs the BIOS to do so. And the BIOS is not capable of understanding anything but CHS (Cylinder/Head/Sector). Hence, to find its boot program, the code in the MBR needs an exact specification of the CHS values to use. These specifications are figured out by /sbin/lilo when it installs the boot sector.

The second stage LILO boot program needs information about the following items:

  • where /boot/boot.b can be found; it contains the second stage boot program. The second stage program will be loaded by the initial boot program in the MBR;

  • the /boot/map file, which contains information about the location of kernels, boot sectors etc.; this information is used mostly by the second stage boot program; see below for a more detailed description of the map file;

  • the location of kernel(s) you want to be able to boot

  • the boot sectors of all operating systems it boots

  • the location of the startup message, if one has been defined

Remember, to be able to access these files, the BIOS needs the CHS (Cylinder/Head/Sector) information to load the proper block. This also holds true for the code in the second stage loader. LILO therefore needs a so called map file, that maps filenames into CHS values. This file contains information for all files that LILO needs to know of during boot, for example locations of the kernel(s), the command line to execute on boot, and more. The default name for the map file is /boot/map. /sbin/lilo uses the file /etc/lilo.conf to determine what files to map and what bootprogram to use and creates a mapfile accordingly.

More about partition tables

The DOS partition table is embedded in the MBR at cylinder 0, head 0, sector 1, directly after the 446 bytes of boot code, that is from byte offset 446 (0x1BE) on. There are four entries in a DOS partition table. Only one of them can be marked as active: the boot program normally will load the first sector of the active partition in memory and hand over control to it.
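The search for the active partition can be sketched as follows. The sketch counts byte offsets from 0, so the table starts at offset 446, directly after the boot code, and it uses an entry size of 16 bytes. The value 0x80 for the active flag is the conventional one and is an assumption, since this text does not name it; the function names are hypothetical.

```python
BOOT_CODE_SIZE = 446  # bytes of boot program at the start of the MBR
ENTRY_SIZE = 16       # bytes per partition table entry
ENTRY_COUNT = 4       # a DOS partition table has four entries
ACTIVE_FLAG = 0x80    # conventional "active" marker (assumption)


def partition_entries(mbr):
    """Split the partition table of an MBR into four raw 16-byte entries."""
    end = BOOT_CODE_SIZE + ENTRY_COUNT * ENTRY_SIZE
    if len(mbr) < end:
        raise ValueError("sector too short to hold a partition table")
    table = mbr[BOOT_CODE_SIZE:end]
    return [table[i * ENTRY_SIZE:(i + 1) * ENTRY_SIZE]
            for i in range(ENTRY_COUNT)]


def active_partition(mbr):
    """Return the number (1-4) of the first active entry, or None."""
    for number, entry in enumerate(partition_entries(mbr), start=1):
        if entry[0] == ACTIVE_FLAG:  # first byte of an entry is the boot flag
            return number
    return None
```

A boot program in the style of the DOS loader would then load the first sector of the partition that active_partition reports.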

An entry in the partition table contains 16 bytes, as shown in the following figure:

Figure 13.1. A (DOS) partition table entry

 
|boot? ||start                 ||type  ||end                   |
|      ||cyl      |head   |sect||      ||cyl      |head   |sect|
|------||--------||------||----||------||--------||------||----|


|start in LBA                  ||size in sectors               |
|------||------||------||------||------||------||------||------|

As you can see, each partition entry contains the start and end location of the partition specified as the Cylinder/Head/Sector of the hard disk. Note that the Cylinder field has 10 bits, therefore the maximum number of cylinders that can be specified is 2^10 = 1024. BIOSes traditionally use CHS specifications, hence older BIOSes are not capable of accessing data stored beyond the first 1024 cylinders of the disk.

As disks grew in size the partition/disk sizes could not be properly expressed using the limited capacity of the CHS fields anymore. An alternate method of addressing blocks on a hard disk was introduced: Logical Block Addressing (LBA). LBA addressing specifies sections of the disk by their block number, counting from 0. A block can be seen as a 512 byte sector. The last 64 bits in a partition table entry describe the partition in LBA terms: the LBA address of the start of the partition and the number of sectors in the partition, 32 bits each.
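The fields of a partition table entry can be decoded with a few lines of code. This is a sketch under assumptions: the packed on-disk order of the three CHS bytes (head first, then the sector number sharing a byte with the two high cylinder bits, then the low cylinder byte), the little-endian LBA fields and the 0x80 active flag are the conventional PC layout and are not spelled out in this text. The names are hypothetical.

```python
import struct
from collections import namedtuple

PartitionEntry = namedtuple(
    "PartitionEntry", "active start_chs type end_chs start_lba sectors")


def unpack_chs(raw):
    """Decode three packed CHS bytes into (cylinder, head, sector)."""
    head, sect_cyl, cyl_low = raw
    # The top two bits of the sector byte are bits 8-9 of the
    # 10-bit cylinder number; the low six bits are the sector.
    cylinder = ((sect_cyl & 0xC0) << 2) | cyl_low
    sector = sect_cyl & 0x3F
    return cylinder, head, sector


def parse_entry(entry):
    """Decode one 16-byte partition table entry."""
    flag, start, ptype, end, lba, count = struct.unpack("<B3sB3sII", entry)
    return PartitionEntry(flag == 0x80, unpack_chs(start), ptype,
                          unpack_chs(end), lba, count)
```

An end location of cylinder 1023, head 254, sector 63 is the largest value these fields can commonly express, which is why the LBA fields are needed for larger partitions.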

Tip

Remember that your computer boots using the BIOS disk access routines. Hence, if your BIOS does not cope with LBA addressing you may not be able to boot from partitions beyond the 1024 cylinder boundary. For this reason people with large disks often create a small partition somewhere within the 1024 cylinder boundary, usually mounted on /boot and put the boot program and kernel in there, so BIOS can boot Linux from hard disk. Once loaded, Linux ignores the BIOS - it has its own disk access procedures which are capable of handling huge disks.
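The size of the region below the 1024 cylinder boundary follows from simple arithmetic, as does the relation between CHS and LBA addresses. The sketch below assumes a geometry of 255 heads and 63 sectors per track, which is a common translated geometry but is not given in this text; the function names are hypothetical.

```python
SECTOR_BYTES = 512


def chs_to_lba(cylinder, head, sector, heads=255, sectors_per_track=63):
    """Convert a CHS address to a block number.

    Sectors are numbered from 1 within a track, cylinders and heads
    from 0, so cylinder 0, head 0, sector 1 is block 0.
    """
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)


def chs_capacity_bytes(cylinders=1024, heads=255, sectors_per_track=63):
    """Number of bytes reachable with the given CHS geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES
```

With the assumed geometry, chs_capacity_bytes() gives 8422686720 bytes, so a /boot partition has to lie entirely within roughly the first 8.4 GB of the disk to be safe for such a BIOS. Older BIOSes with smaller geometries have a correspondingly lower limit.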

The type field contains the type of the partition, which usually relates to the purpose the partition was intended for. To give an impression of the various types of partitions available, a screen dump of the List command within fdisk follows:

 0  Empty           17  Hidden HPFS/NTF 5c  Priam Edisk     a6  OpenBSD
 1  FAT12           18  AST Windows swa 61  SpeedStor       a7  NeXTSTEP
 2  XENIX root      1b  Hidden Win95 FA 63  GNU HURD or Sys b7  BSDI fs
 3  XENIX usr       1c  Hidden Win95 FA 64  Novell Netware  b8  BSDI swap
 4  FAT16 <32M      1e  Hidden Win95 FA 65  Novell Netware  c1  DRDOS/sec (FAT-
 5  Extended        24  NEC DOS         70  DiskSecure Mult c4  DRDOS/sec (FAT-
 6  FAT16           3c  PartitionMagic  75  PC/IX           c6  DRDOS/sec (FAT-
 7  HPFS/NTFS       40  Venix 80286     80  Old Minix       c7  Syrinx
 8  AIX             41  PPC PReP Boot   81  Minix / old Lin db  CP/M / CTOS / .
 9  AIX bootable    42  SFS             82  Linux swap      e1  DOS access
 a  OS/2 Boot Manag 4d  QNX4.x          83  Linux           e3  DOS R/O
 b  Win95 FAT32     4e  QNX4.x 2nd part 84  OS/2 hidden C:  e4  SpeedStor
 c  Win95 FAT32 (LB 4f  QNX4.x 3rd part 85  Linux extended  eb  BeOS fs
 e  Win95 FAT16 (LB 50  OnTrack DM      86  NTFS volume set f1  SpeedStor
 f  Win95 Ext'd (LB 51  OnTrack DM6 Aux 87  NTFS volume set f4  SpeedStor
10  OPUS            52  CP/M            93  Amoeba          f2  DOS secondary
11  Hidden FAT12    53  OnTrack DM6 Aux 94  Amoeba BBT      fd  Linux raid auto
12  Compaq diagnost 54  OnTrackDM6      a0  IBM Thinkpad hi fe  LANstep
14  Hidden FAT16 <3 55  EZ-Drive        a5  BSD/386         ff  BBT
16  Hidden FAT16    56  Golden Bow

Extended partitions

The design limitation that imposes a maximum of four partitions proved to be troublesome as disks grew larger and larger. Therefore, a work-around was invented: by specifying one of the partitions as a DOS Extended partition it in effect becomes a container that holds one or more further partitions, aptly named logical partitions. The total size of all logical partitions within the extended partition can never exceed the size of that extended partition.

In principle Linux lets you create as many logical partitions as you want, of course restricted by the physical boundaries of the extended partition and hardware limitations. The logical partitions are described in a linked list of sectors. The four primary partitions, present or not, get numbers 1-4. Logical partitions start numbering from 5. The main disk contains a partition table that describes the primary partitions. The extended partition contains the logical partitions, each of which is preceded by a partition table that describes that logical partition and holds a pointer to the partition table of the next logical partition.
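The linked list of partition tables inside an extended partition can be walked with a short sketch. It relies on conventions that this text does not describe and that are stated here as assumptions: in each of these sectors the first table entry describes the logical partition with a start relative to that sector, the second entry points to the next table with a start relative to the beginning of the extended partition, and a type of 0 marks an unused entry. The read_sector callback and all names are hypothetical.

```python
import struct

TABLE_OFFSET = 446  # partition table position within a sector
ENTRY_SIZE = 16


def read_entry(sector, index):
    """Return (type, relative start, sector count) of one table entry."""
    offset = TABLE_OFFSET + ENTRY_SIZE * index
    ptype = sector[offset + 4]
    start, count = struct.unpack_from("<II", sector, offset + 8)
    return ptype, start, count


def logical_partitions(read_sector, ext_start):
    """List (number, start block, sector count) for each logical partition.

    read_sector(block) must return the 512 bytes of that block;
    ext_start is the first block of the extended partition.
    """
    found = []
    table_block = ext_start
    number = 5           # logical partitions are numbered from 5
    visited = set()      # guards against a corrupt, circular chain
    while table_block not in visited:
        visited.add(table_block)
        sector = read_sector(table_block)
        ptype, rel_start, count = read_entry(sector, 0)
        if ptype != 0:
            found.append((number, table_block + rel_start, count))
            number += 1
        next_type, next_rel, _ = read_entry(sector, 1)
        if next_type == 0:
            break        # end of the linked list
        table_block = ext_start + next_rel
    return found
```

The loop ends either at a table whose second entry is unused or when a table is visited twice, so a damaged chain cannot cause an endless loop.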

The LILO install locations

LILO's first stage loader program can either be put in the MBR, or it can be put in any partition's boot sector. Of course, you could put it in both locations if you wanted to: for example in the MBR to decide whether to boot Windows, DOS or Linux, and in the boot sector of the Linux partition as well, which would for example enable you to choose between different versions/configurations of the kernel once Linux has been selected.

The tandem Linux and Windows is frequently used to ease the migration of services to the Linux platform or to enable both Linux and Windows to run on the same computer. To dual boot Linux and Windows 95/98, you can install LILO on the master boot record. Windows NT and Windows 2000 require their own loader in the MBR. In these cases, you can install LILO in the Linux partition as a secondary boot loader. The initial boot will be done by the Windows loader in the MBR, which then can transfer control to LILO.

LILO backup files

/sbin/lilo can create the boot program in the MBR or in the first sectors of a partition. The boot program, sometimes referred to as the first stage loader, will try to load the second stage boot loader. The second stage boot loader is contained in a file on the boot partition of your Linux system; by default it is in the file /boot/boot.b.

If you use /sbin/lilo to write the bootprogram it will try to make a backup copy of the old contents of the bootsector and will write the old contents in a file named /boot/boot.####. The hash symbols are actually replaced by the major and minor numbers of the device where the original bootsector used to be, for example, the backup copy of the MBR on the first IDE disk would be stored as /boot/boot.0300: 3 is the major number for the device file /dev/hda, and 0 is the minor number for it. /sbin/lilo will not overwrite an already existing backup file.
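The naming scheme of the backup files can be expressed in a few lines. The sketch assumes that the major and minor numbers are each written as two hexadecimal digits, which matches the /boot/boot.0300 example above; the use of upper case for hexadecimal digits and the function name are assumptions.

```python
def lilo_backup_name(major, minor, directory="/boot"):
    """Return the name of the boot sector backup file for a device.

    major and minor are the device numbers of the device whose
    boot sector was saved, e.g. 3 and 0 for /dev/hda.
    """
    return "%s/boot.%02X%02X" % (directory, major, minor)
```

For /dev/hda (major 3, minor 0) this gives /boot/boot.0300; for the first partition on that disk (minor 1) it gives /boot/boot.0301.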

LILO errors

When LILO loads itself, it displays the word

LILO

Each letter is printed before or after performing some specific action. If LILO fails at some point, the letters printed so far can be used to identify the problem.

(nothing)

No part of LILO has been loaded. Either LILO isn't installed or the partition on which its boot sector is located isn't active.

L error

The first stage boot loader has been loaded and started, but it can't load the second stage boot loader. The two-digit error codes indicate the type of problem. This condition usually indicates a media failure or a geometry mismatch. The most frequent causes for a geometry mismatch are not physical defects or invalid partition tables but errors during the installation of LILO. Often these are caused by ignoring the 1024 cylinder boundary.

Some error codes signal a transient problem, in which case LILO will try to resume or halt the system. However, sometimes the problem is not transient and LILO will repeat the error code over and over again. This means that you end up with a scrolling screen that contains just the error codes. For example, the error code 01 signifies an illegal command, which means that the disk type is not supported by your BIOS or that the geometry cannot be determined correctly. Other error codes are described in full in LILO's user documentation.

LI

The first stage boot loader was able to load the second stage boot loader, but has failed to execute it. This can either be caused by a geometry mismatch or by moving /boot/boot.b without running the map installer.

LIL

The second stage boot loader has been started, but it can't load the descriptor table from the map file. This is typically caused by a media failure or by a geometry mismatch.

LIL?

The second stage boot loader has been loaded at an incorrect address. This is typically caused by a subtle geometry mismatch or by moving /boot/boot.b without running the map installer.

LIL-

The descriptor table is corrupt. This can either be caused by a geometry mismatch or by moving /boot/map without running the map installer.

LILO

All parts of LILO have been successfully loaded.
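The list above lends itself to a simple lookup table. The following sketch maps the letters shown on the screen to the diagnosis given in this section; the function name is hypothetical and the descriptions are abbreviated from the text above.

```python
LILO_DIAGNOSIS = {
    "": "no part of LILO loaded: not installed, or its partition is not active",
    "L": "first stage started but cannot load the second stage "
         "(media failure or geometry mismatch)",
    "LI": "second stage loaded but not executed "
          "(geometry mismatch, or /boot/boot.b moved)",
    "LIL": "second stage started but cannot load the descriptor table "
           "(media failure or geometry mismatch)",
    "LIL?": "second stage loaded at an incorrect address "
            "(subtle geometry mismatch, or /boot/boot.b moved)",
    "LIL-": "descriptor table corrupt "
            "(geometry mismatch, or /boot/map moved)",
    "LILO": "all parts of LILO loaded successfully",
}


def diagnose(screen_text):
    """Look up the diagnosis for the text LILO left on the screen.

    Only the first word is considered, so trailing error codes such
    as a scrolling "01 01 01" after the letter L are ignored.
    """
    words = screen_text.split()
    progress = words[0] if words else ""
    return LILO_DIAGNOSIS.get(progress, "unrecognized output")
```

For example, diagnose("L 01 01 01") returns the entry for a first stage loader that cannot load the second stage, the case in which the two-digit error codes appear.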

Copyright Snow B.V. The Netherlands