Troubleshooting a non-booting Linux box
January 31, 2011. Posted by Tournas Dimitrios in Linux.
This article is mainly targeted at Red Hat based Linux distributions, but the concepts remain unchanged for all Linux boxes. To troubleshoot the boot process effectively and track down the root cause, it is vital that you already know the boot sequence. For example, if the problem occurs before rc.sysinit runs, we immediately know not to bother checking services, or any partitions besides the root partition, because they are not brought up until rc.sysinit has finished.
If you need a refresher on the boot sequence, read this article first before you continue.
When a problem is so bad that it prevents the system from booting or users from logging in, there are three very important “working modes” that can help:
- single user mode
- emergency mode
- rescue mode (requires the boot CD)
Let’s outline each “mode” separately:
Single-user mode:
In single-user mode the system loads the kernel, runs rc.sysinit and then drops the user to a root shell, bypassing all authentication. Depending on how it is called, single-user mode may also run the initialization scripts in the /etc/rc.d/rc1.d/ directory. This mode skips the startup of high-level services such as apache, sendmail, vsftpd and nfs, so it is well suited to maintenance tasks such as file-system checks. It is also commonly used to change the root password if it has been lost or corrupted, or if anything else prevents the root user from logging in.
Be warned that because single-user mode allows complete control of the system without requiring a password, it can pose a serious security risk unless the proper precautions are taken. Sensitive machines that can be physically accessed should always be password-protected. Remember that password-protecting GRUB does not prevent systems from booting normally; it only prevents unauthorized users from passing extra kernel arguments.
To enter single-user mode, pass the argument S, s or single to the kernel at boot time.
Emergency mode:
In emergency mode the startup process is greatly simplified. Once the kernel is loaded, the user is simply prompted for the root password. If the correct password is entered, a root shell is created and the user is left to do the rest. Because rc.sysinit does not run, no partitions other than the root partition are mounted and no low-level services are started. Because rc does not run, no high-level services are started. This is obviously not a runlevel for “everyday” use. However, because it bypasses so much of the startup process, only a few things are required to enter emergency mode:
- There must be a bootloader present to load the kernel
- There must be a kernel present to load
- The root partition must be mountable
- The /etc/passwd file must be intact
- The /sbin/sulogin command and its associated libraries must be available to prompt for the password
- The user must have the correct credentials (password)
- There must be a working shell for the authenticated user
Compared to the number of elements involved in a normal startup (92 services on my CentOS 5 box), that is a relatively short list. If you can enter emergency mode, you know that none of the elements listed above is malfunctioning, and you can begin troubleshooting other elements of the system (probably starting with rc.sysinit and rc). If you cannot, you have narrowed the scope of your investigation significantly.
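If emergency mode itself fails, some of the file-level prerequisites listed above can be checked from another environment in which the damaged root file system is mounted. The following is only a minimal sketch: the function name `check_emergency_prereqs` is hypothetical, the default mount point /mnt/sysimage is an assumption, and the bootloader, kernel and password checks are not covered.

```shell
#!/bin/sh
# Hypothetical helper (not part of any distribution): check a few
# emergency-mode prerequisites on a root file system mounted at $1.
# The default mount point /mnt/sysimage is an assumption.
check_emergency_prereqs() {
    root=${1:-/mnt/sysimage}
    status=0

    # /etc/passwd must exist and contain an entry for root
    if ! grep -q '^root:' "$root/etc/passwd" 2>/dev/null; then
        echo "FAIL: no root entry in $root/etc/passwd"
        status=1
    fi

    # sulogin is what prompts for the root password
    if [ ! -x "$root/sbin/sulogin" ]; then
        echo "FAIL: $root/sbin/sulogin missing or not executable"
        status=1
    fi

    # root's login shell is the 7th field of its passwd entry
    shell=$(awk -F: '$1 == "root" { print $7; exit }' "$root/etc/passwd" 2>/dev/null)
    if [ -z "$shell" ] || [ ! -x "$root$shell" ]; then
        echo "FAIL: root shell '${shell:-unknown}' missing or not executable"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "OK: basic emergency-mode files look intact"
    return "$status"
}

# Usage (path is an example): check_emergency_prereqs /mnt/sysimage
```

The function returns a non-zero status if any check fails, so it can be used in an `if` statement. It does not verify the libraries that sulogin depends on.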
Emergency mode is entered by passing the argument “emergency” to the kernel via the bootloader (GRUB). When the GRUB menu appears, select the kernel you wish to boot, press “a” to append the argument and press “b” to start the boot process.
init itself can also be bypassed by passing the kernel boot parameter `init=/bin/bash` (no root password is needed).
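To confirm which of these arguments a system was actually booted with, the kernel command line can be inspected. The sketch below assumes that /proc/cmdline is available; the function name `boot_mode` and the sample command line are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: classify a kernel command line by the boot
# arguments discussed above (S, s, single, emergency, init=...).
boot_mode() {
    # $1: kernel command line; defaults to that of the running kernel
    cmdline=${1:-$(cat /proc/cmdline)}
    mode=normal
    for arg in $cmdline; do
        case $arg in
            S|s|single) mode=single-user ;;
            emergency)  mode=emergency ;;
            init=*)     mode="custom init (${arg#init=})" ;;
        esac
    done
    printf '%s\n' "$mode"
}

# Sample command line (made up for illustration)
boot_mode "ro root=/dev/hda1 quiet single"   # prints: single-user
```

Called without an argument, `boot_mode` reads the command line of the running kernel instead.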
The last resort for troubleshooting:
The previous two modes illustrated the important role of the GRUB bootloader when troubleshooting a non-booting system. Most problems can be solved by rebooting the machine and passing arguments to the kernel via the bootloader. The worst-case scenario is when we cannot pass arguments to the kernel at all, because:
- The bootloader is malfunctioning (damaged) or missing. Remember that the primary task of the bootloader is to load the kernel image and its supporting files (the initial RAM disk, “initrd”).
- The kernel is malfunctioning or missing
- The root partition is unusable. If it is damaged to the point that it cannot be mounted by the kernel, GRUB is useless.
In these situations, an administrator needs to resort to some other boot media to boot the system. Fortunately, two forms of help are available:
- Boot floppy: create it with the mkbootdisk command on a running system.
- The rescue mode
The rescue mode:
The rescue environment is provided by a boot CD or the first installation CD/DVD. The bootloader and the kernel are loaded from the CD media. These are the steps:
- After the system starts booting from the CD-ROM you will see a screen that displays a boot prompt. Enter “linux rescue” at the boot prompt to boot the system in rescue mode.
- You will be asked to select the language, which defaults to English. Select the appropriate language and press OK to continue.
- You will be asked to select the keyboard type, which defaults to us (USA). Select the appropriate keyboard type and press OK to continue.
- The rescue environment notes that it will attempt to mount your existing partitions under the /mnt/sysimage directory. If you are trying to fix a file system problem, select the Skip button: most file system tools, fsck included, cannot safely repair a mounted file system.
- Once you are at the shell prompt, run the fsck partition_name command to perform a file system check on your hard drive. For example, to check the first partition of your first IDE/EIDE disk, run the `fsck /dev/hda1` command. Similarly, to check the first partition of your first SCSI disk, run the `fsck /dev/sda1` command. This displays the errors that fsck finds in the named partition and asks you to take an action. If you do not want to be prompted for each file system error and want fsck to fix whatever it safely can, run the same command with the “-p” option. For example, the `fsck -p /dev/hda1` command checks the first partition of the first IDE/EIDE disk and automatically repairs the problems that can be fixed safely without user intervention.
- If you suspect that your disk has bad blocks, you can locate them by running the badblocks device_name command. For example, to find bad blocks on the second partition of the first IDE/EIDE disk, run the `badblocks /dev/hda2` command.
- Repeat the previous steps as many times as necessary to check all your file system partitions.
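Repeating the check for several partitions can be scripted from the rescue shell. The following is only a sketch: the function name `check_partitions` and the `DRY_RUN` switch are assumptions made for illustration, and the device names are the examples used above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run "fsck -p" on each partition given as an argument.
# DRY_RUN (a variable invented for this example) defaults to 1, which
# only prints the commands; set DRY_RUN=0 to really run fsck on
# unmounted partitions.
check_partitions() {
    for part in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "fsck -p $part"
        else
            fsck -p "$part" || echo "fsck exited with status $? on $part"
        fi
    done
}

# Device names are examples; substitute your own partitions.
check_partitions /dev/hda1 /dev/hda2
```

A non-zero exit status from fsck does not always mean failure (it can also indicate that errors were corrected), so the message simply reports the status for you to look up in the fsck manual page.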
Press Ctrl+Alt+Del to reboot the system and make sure to remove the CD-ROM so that your system can boot from your hard drive.