Identify, capture, and resolve common issues faced by Red Hat Enterprise Linux administrators using best practices and advanced troubleshooting techniques.

About This Book

Develop a strong understanding of the base tools available within Red Hat Enterprise Linux (RHEL) and how to use these tools to troubleshoot and resolve real-world issues
Gain hidden tips and techniques to help you quickly detect the reason for poor network or storage performance
Troubleshoot your RHEL systems to isolate problems using this example-oriented guide full of real-world solutions

Who This Book Is For

If you have a basic knowledge of Linux from administration or consulting experience and wish to add to your Red Hat Enterprise Linux troubleshooting skills, then this book is ideal for you. The ability to navigate and use basic Linux commands is expected.

What You Will Learn

Distinguish issues that need rapid resolution from those that warrant long-term root cause analysis
Discover commands for testing network connectivity, such as telnet, netstat, ping, ip, and curl
Spot performance issues with commands such as top, ps, free, iostat, and vmstat
Use tcpdump for traffic analysis
Repair a degraded filesystem and rebuild a software RAID
Identify and troubleshoot hardware issues using dmesg
Troubleshoot custom applications with strace and knowledge of Linux resource limitations

In Detail

Red Hat Enterprise Linux is an operating system that allows you to modernize your infrastructure, boost efficiency through virtualization, and prepare your data center for an open, hybrid cloud IT architecture. It provides the stability to take on today's challenges and the flexibility to adapt to tomorrow's demands.

In this book, you begin with simple troubleshooting best practices and get an overview of the Linux commands used for troubleshooting. The book covers troubleshooting methods for web applications and services such as Apache and MySQL. Then, you will learn to identify system performance bottlenecks and troubleshoot network issues, all while learning vital troubleshooting steps such as understanding the problem statement, establishing a hypothesis, trial and error, and documentation. Next, the book shows you how to capture and analyze network traffic, use advanced system troubleshooting tools such as strace, tcpdump, and dmesg, and discover common issues with system defaults. Finally, the book takes you through a detailed root cause analysis of an unexpected reboot, where you will learn to recover a downed system.

Style and approach

This is an easy-to-follow guide packed with examples of real-world core Linux concepts. All the topics are presented in detail while you perform the actual troubleshooting steps.

Table of Contents

Red Hat Enterprise Linux Troubleshooting Guide


About the Author

About the Reviewers


Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders


What this book covers

What you need for this book

Who this book is for


Reader feedback

Customer support

Downloading the example code




1. Troubleshooting Best Practices

Styles of troubleshooting

The Data Collector

The Educated Guesser

The Adaptor

Choosing the appropriate style

Troubleshooting steps

Understanding the problem statement

Asking questions



Attempting to duplicate the issue

Running investigatory commands

Establishing a hypothesis

Putting together patterns

Is this something that I've encountered before?

Trial and error

Start by creating a backup

Getting help


Team Wikis or Runbooks


Man pages

Reading a man page





Additional sections

Info documentation

Referencing more than commands

Installing man pages

Red Hat kernel docs


Following up


Root cause analysis

The anatomy of a good RCA

The problem as it was reported

The actual root cause of the problem

A timeline of events and actions taken

Any key data points to validate the root cause

A plan of action to prevent the incident from recurring

Establishing a root cause

Sometimes you must sacrifice a root cause analysis

Understanding your environment


2. Troubleshooting Commands and Sources of Useful Information

Finding useful information

Log files

The default location

Common log files

Finding logs that are not in the default location

Checking syslog configuration

Checking the application's configuration

Other examples

Using the find command

Configuration files

Default system configuration directory

Finding configuration files

Using the rpm command

Using the find command

The proc filesystem

Troubleshooting commands

Command-line basics

Command flags

Piping command output

Gathering general information

w – show who is logged on and what they are doing

rpm – RPM package manager

Listing all packages installed

Listing all files deployed by a package

Using package verification

df – report file system space usage

Showing available inodes

free – display memory utilization

What is free, is not always free

The /proc/meminfo file

ps – report a snapshot of current running processes

Printing every process in long format

Printing a specific user's processes

Printing a process by process ID

Printing processes with performance information


ip – show and manipulate network settings

Show IP address configuration for a specific device

Show routing configuration

Show network statistics for a specified device

netstat – network statistics

Printing network connections

Printing all ports listening for tcp connections



iotop – a simple top-like I/O monitor

iostat – report I/O and CPU statistics

Manipulating the output

vmstat – report virtual memory statistics

sar – collect, report, or save system activity information

Using the sar command


3. Troubleshooting a Web Application

A small back story

The reported issue

Data gathering

Asking questions

Duplicating the issue

Understanding the environment

Where is this blog hosted?

Look up IPs with nslookup

What about ping, dig, or other tools?

Ok, it's within our environment; now what?

What services are installed and running?

Validate the web server

Validating the database service

Validating PHP

A summary of installed and running services

Looking for error messages

Apache logs

Finding the location of Apache's logs

Reviewing the logs

Using curl to call our web application

Requesting a non-PHP page

Reviewing generated log entries

What we learned from httpd logs

Verifying the database

Verifying the WordPress database

Finding the installation path for WordPress

Checking the default configuration

Finding the database credentials

Connecting as the WordPress user

Validating the database structure

What we learned from the database validation

Establishing a hypothesis

Resolving the issue

Understanding database data files

Finding the MariaDB data folder

Resolving data file issues


Final validation


4. Troubleshooting Performance Issues

Performance issues

It's slow




Top – a single command to look at everything

What does this output tell us about our issue?

Individual processes from top

Determining the number of CPUs available

Threads and Cores

lscpu – Another way to look at CPU info

ps – Drill down deeper on individual processes with ps

Using ps to determine process CPU utilization

Putting it all together

A quick look with top

Digging deeper with ps


free – Looking at free and used memory

Linux memory buffers and caches

Swapped memory

What free tells us about our system

Checking for oomkill

ps – Checking individual processes' memory utilization

vmstat – Monitoring memory allocation and swapping

Putting it all together

Taking a look at the system's memory utilization with free

Watch what is happening with vmstat

Finding the processes that utilize the most memory with ps


iostat – CPU and device input/output statistics

CPU details

Reviewing I/O statistics

Identifying devices

Who is writing to these devices?

ps – Using ps to identify processes utilizing I/O

iotop – A top-like command for disk I/O

Putting it all together

Using iostat to determine whether there is an I/O bandwidth problem

Using iotop to determine which processes are consuming disk bandwidth

Using ps to understand more about processes


ifstat – Review interface statistics

Quick review of what we have identified

Comparing historical metrics

sar – System activity report





Review what we learned by comparing historical statistics


5. Network Troubleshooting

Database connectivity issues

Data collection

Duplicating the issue

Finding the database server

Testing connectivity

Telnet from blog.example.com

Telnet from our laptop


Troubleshooting DNS

Checking DNS with dig

Looking up DNS with nslookup

What did dig and nslookup tell us?

A bit about /etc/hosts

DNS summary

Pinging from another location

Testing port connectivity with cURL

Showing current network connections with netstat

Using netstat to watch for new connections

Breakdown of netstat states

Capturing network traffic with tcpdump

Taking a look at the server's network interfaces

What is a network interface?

Viewing device configuration

Specifying the interface with tcpdump

Reading the captured data

A quick primer on TCP

Types of TCP packet

Reviewing collected data

Taking a look on the other side

Identifying the network configuration

Testing connectivity from db.example.com

Looking for connections with netstat

Tracing network connections with tcpdump


Viewing the routing table

The default route

Utilizing IP to show the routing table

Looking for routing misconfigurations

More specific routes win


Trial and error

Removing the invalid route

Configuration files


6. Diagnosing and Correcting Firewall Issues

Diagnosing firewalls

Déjà vu

Troubleshooting from historic issues

Basic troubleshooting

Validating the MariaDB service

Troubleshooting with tcpdump

Understanding ICMP

Understanding connection rejections

A quick summary of what you have learned so far

Managing the Linux firewall with iptables

Verify that iptables is running

Show iptables rules being enforced

Understanding iptables rules

Ordering matters

Default policies

Breaking down the iptables rules

Putting the rules together

Viewing iptables counters

Correcting the iptables rule ordering

How iptables rules are applied

Modifying iptables rules

Testing our changes


7. Filesystem Errors and Recovery

Diagnosing filesystem errors

Read-only filesystems

Using the mount command to list mounted filesystems

A mounted filesystem

Using fdisk to list available partitions

Back to troubleshooting

NFS – Network Filesystem

NFS and network connectivity

Using the showmount command

NFS server configuration

Exploring /etc/exports

Identifying the current exports

Testing NFS from another client

Making mounts permanent

Unmounting the /mnt filesystem

Troubleshooting the NFS server, again

Finding the NFS log messages

Reading /var/log/messages

Read-only filesystems

Identifying disk issues

Recovering the filesystem

Unmounting the filesystem

Filesystem checks with fsck

The fsck and xfs filesystems

How do these tools repair a filesystem?

Mounting the filesystem

Repairing the other filesystems

Recovering the / (root) filesystem



8. Hardware Troubleshooting

Starting with a log entry

What is a RAID?

RAID 0 – striping

RAID 1 – mirroring

RAID 5 – striping with distributed parity

RAID 6 – striping with double distributed parity

RAID 10 – mirrored and striped

Back to troubleshooting our RAID

How RAID recovery works

Checking the current RAID status

Summarizing the key information

Looking at md status with /proc/mdstat

Using both /proc/mdstat and mdadm

Identifying a bigger issue

Understanding /dev

More than just disk drives

Device messages with dmesg

Summarizing what dmesg has provided

Using mdadm to examine the superblock

Checking /dev/sdb2

What we have learned so far

Re-adding the drives to the arrays

Adding a new disk device

When disks are not added cleanly

Another way to watch the rebuild status


9. Using System Tools to Troubleshoot Applications

Open source versus home-grown applications

When the application won't start

Exit codes

Is the script failing, or the application?

A wealth of information in the configuration file

Watching log files during startup

Checking whether the application is already running

Checking open files

Understanding file descriptors

Getting back to the lsof output

Using lsof to check whether we have a previously running process

Finding out more about the application

Tracing an application with strace

What is a system call?

Using strace to identify why the application will not start

Resolving the conflict


10. Understanding Linux User and Kernel Limits

A reported issue

Why is the job failing?

Background questions

Is the cron job even running?

User crontabs

Understanding user limits

The file size limit

The max user processes limit

The open files limit

Changing user limits

The limits.conf file

Future proofing the scheduled job

Running the job again

Kernel tunables

Finding the kernel parameter for open files

Changing kernel tunables

Permanently changing a tunable

Temporarily changing a tunable

Running the job one last time

A look back

Too many open files

A bit of clean up


11. Recovering from Common Failures

The reported problem

Is Apache really down?

Why is it down?

What else was happening at that time?

Searching the messages log

Breaking down this useful one-liner

The cut command

The sort command

The uniq command

Tying it all together

What happens when a Linux system runs out of memory?

Minimum free memory

A quick recap

How oom-kill works

Adjusting the oom score

Determining whether our process was killed by oom-kill

Why did the system run out of memory?

Resolving the issue in the long-term and short-term

Long-term resolution

Short-term resolution


12. Root Cause Analysis of an Unexpected Reboot

A late night alert

Identifying the issue

Did someone reboot this server?

What do the logs tell us?

Learning about new processes and services

What caused the high load average?

What are the run queue and load average?

Load average

Investigating the filesystem being full

The du command

Why wasn't the queue directory processed?

A checkpoint on what you learned

Sometimes you cannot prove everything

Preventing recurrence

Immediate action

Long-term actions

A sample Root Cause Analysis

Problem summary

Problem details

Root cause

Action plan

Further actions to be taken



