M53 - Troubleshooting Lab: 5 Break Scenarios

55 min ADVANCED BOTH Curriculum-reviewed

Prerequisites

Module M52 - Systematic Troubleshooting: PDIVET
Module M32 - Network Diagnostics
Module M46 - Logging & Event Analysis
Module M47 - Services Management

What you should be able to do after this

Apply structured troubleshooting to realistic OS failures by choosing the first checks, narrowing scope, and proposing a safe next action.

The Point of This Lab

This lab is not about memorizing one magic command per scenario.

It is about practicing three habits:

choose the first checks well
narrow the problem before changing things
make the next action safe and explainable

For each case, focus on:

what the symptom really tells you
what scope you should test first
which command or observation gives the most useful evidence next

Scenario 1: Name Fails, Network Might Not

Problem: Users cannot reach wiki.corporate.local.

Observed symptom: Browser reports a name-resolution error.

Good first questions:

is general network connectivity working?
is only the name lookup failing?

Strong first checks:

ping 8.8.8.8
nslookup wiki.corporate.local or dig wiki.corporate.local

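Likely lesson: If raw connectivity works but name lookup fails, the problem is probably DNS rather than the target server itself. Because wiki.corporate.local is an internal name, also confirm that the client is querying the internal DNS server and not a public resolver.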
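The two checks in this scenario can be combined into a small decision helper. This is a minimal bash sketch, not part of the original lab. The function names interpret_name_failure and check_scenario_1 are hypothetical, the ping flags -c and -W assume Linux iputils, and getent hosts is swapped in for nslookup or dig because it goes through the system resolver and exits non-zero when the name does not resolve.

```shell
#!/usr/bin/env bash

# Hypothetical helper: interpret the two first checks from this scenario.
# $1 = exit status of the connectivity check (0 = success)
# $2 = exit status of the name lookup (0 = success)
interpret_name_failure() {
  local net_status=$1 dns_status=$2
  if [ "$net_status" -ne 0 ]; then
    echo "network: fix basic connectivity before blaming DNS"
  elif [ "$dns_status" -ne 0 ]; then
    echo "dns: connectivity works but the name does not resolve"
  else
    echo "neither: the name resolves, look at the target service or the client"
  fi
}

# Runs the real checks. The IP address and host name come from the scenario text.
# Assumption: Linux iputils ping (-c count, -W timeout in seconds) and glibc getent.
check_scenario_1() {
  local net_status=0 dns_status=0
  ping -c 2 -W 2 8.8.8.8 >/dev/null 2>&1 || net_status=$?
  getent hosts wiki.corporate.local >/dev/null 2>&1 || dns_status=$?
  interpret_name_failure "$net_status" "$dns_status"
}
```

Running check_scenario_1 on an affected client prints one line. A result starting with dns: supports the DNS hypothesis before anyone touches the wiki server.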

Scenario 2: Connection Refused on One Port

Problem: A database host is reachable, but connections to port 3306 are refused.

Good first questions:

is the service listening?
is the service running?
is the refusal local to the service or caused by a filter in front of it?

Strong first checks:

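ss -tulpn | grep 3306 on Linux, or Get-NetTCPConnection -LocalPort 3306 -State Listen on Windows, to see whether anything is listening and on which address
systemctl status mysql on Linux (the unit may be named mysqld or mariadb, depending on the distribution) or Get-Service -Name "MySQL*" on Windows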
service logs if the process is stopped or failing

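Likely lesson: A refused connection means the host answered and rejected it, so the network path is working. That usually points to a stopped service, an application that is not listening on that port, or a service bound only to localhost. A firewall that silently drops traffic tends to produce a timeout instead, although one configured to reject can also produce a refusal.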
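A plain grep 3306 also matches ports such as 33060 and does not make the bind address obvious. The sketch below narrows the result to three cases. It assumes Linux ss -tln output, where column 1 is the socket state and column 4 is the local address and port (adding -u introduces an extra leading column and would shift these positions). The function name listen_state is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `ss -tln` output on stdin and report how a port is bound.
# $1 = TCP port number to look for
# Prints one of: not-listening, loopback-only, listening
listen_state() {
  local port=$1
  awk -v port="$port" '
    # Match only LISTEN sockets whose local address ends in ":<port>".
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      found = 1
      # Anything not bound to 127.x.x.x or [::1] is reachable from other hosts.
      if ($4 !~ /^(127\.|\[::1\])/) wide = 1
    }
    END {
      if (!found)    print "not-listening"
      else if (wide) print "listening"
      else           print "loopback-only"
    }'
}
```

On the database host this would be used as ss -tln | listen_state 3306. A result of not-listening sends you to the service status and logs, loopback-only sends you to the bind address in the service configuration, and listening sends you to whatever filter sits in front of the host.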

Scenario 3: System Feels Slow, Not Dead

Problem: A web server responds, but it is extremely slow.

Good first questions:

is the bottleneck CPU, memory, or disk?
is a background job competing with the main workload?

Strong first checks:

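top or htop for CPU and memory pressure
vmstat 1 or iostat -x 1 for swap activity and disk I/O wait
recent backup, compression, or maintenance activity, such as cron jobs, systemd timers, or scheduled tasks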

Likely lesson: Slow response often needs resource inspection before service restarts or config changes.
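The CPU, memory, or disk question can be answered roughly from a single vmstat sample. The sketch below assumes the default Linux procps vmstat column order (si and so in columns 7 and 8, us and sy in columns 13 and 14, wa in column 16). The function name classify_vmstat_line is hypothetical, and the thresholds of 20 and 80 percent are illustrative assumptions, not tuned values.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read ONE vmstat data line on stdin and name the likely bottleneck.
# Assumed column layout (Linux procps vmstat):
#   r b swpd free buff cache si so bi bo in cs us sy id wa st
classify_vmstat_line() {
  awk '{
    swap   = $7 + $8      # pages swapped in + out per second
    cpu    = $13 + $14    # user + system CPU percent
    iowait = $16          # percent of time waiting on I/O
    if (swap > 0)          print "memory: swapping in progress"
    else if (iowait >= 20) print "disk: high I/O wait"
    else if (cpu >= 80)    print "cpu: busy"
    else                   print "unclear: no obvious bottleneck in this sample"
  }'
}
```

A typical call is vmstat 1 2 | tail -n 1 | classify_vmstat_line. The first vmstat report is an average since boot, so the sketch uses the second one. One sample is only a hint: repeat it a few times before deciding which resource to inspect further.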

Scenario 4: The Fix Did Not Survive Reboot

Problem: A change seemed to work yesterday, but after reboot the system is broken again.

Good first questions:

was the fix made in a persistent location?
does startup overwrite or regenerate that state?
is the service reading the same config you changed?

Strong first checks:

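confirm the service is running and is enabled to start at boot (on Linux, systemctl is-enabled)
inspect the configuration file the service actually loads, not only the one you edited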
check whether automation, container recreation, or policy management rewrote the change

Likely lesson: Some fixes disappear because they were made in the wrong layer or the wrong place, not because the change itself was bad.
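On a systemd host, one quick way to test the "wrong layer" idea is to compare whether a service is running now with whether it is enabled at boot. This is a minimal sketch under that assumption. The function name persistence_verdict and the unit name myapp.service are hypothetical, and only a few of the states that systemctl is-enabled can report are handled explicitly.

```shell
#!/usr/bin/env bash

# Hypothetical helper: combine the outputs of `systemctl is-active` and
# `systemctl is-enabled` into a statement about persistence.
# $1 = is-active output (e.g. active, inactive, failed)
# $2 = is-enabled output (e.g. enabled, enabled-runtime, disabled)
persistence_verdict() {
  local active=$1 enabled=$2
  case "$enabled" in
    enabled-runtime)
      # Enablement symlinks live under /run, which is cleared at reboot.
      echo "temporary: enabled only under /run, lost at reboot" ;;
    disabled)
      if [ "$active" = "active" ]; then
        echo "temporary: running now but not enabled at boot"
      else
        echo "off: not running and not enabled"
      fi ;;
    enabled)
      echo "persistent: enabled at boot, look at config or automation instead" ;;
    *)
      echo "check: state '$active/$enabled' needs a closer look" ;;
  esac
}
```

A call could look like persistence_verdict "$(systemctl is-active myapp.service)" "$(systemctl is-enabled myapp.service)". If the verdict is persistent, systemctl cat myapp.service shows the unit file and drop-ins that systemd actually loaded, which helps answer whether the service reads the config you changed.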

Scenario 5: “No Space Left” Even Though Space Exists

Problem: An application cannot write temporary files, but df -h still shows free space.

Good first questions:

is the block space full?
is the filesystem out of inodes instead?

Strong first checks:

df -h
df -i

Likely lesson: Storage problems are not only about gigabytes. Filesystem metadata limits matter too.
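The block and inode checks can be compared side by side. The sketch below assumes GNU df, where -P forces one line per filesystem and -i switches the report to inodes, with the usage percentage in column 5 in both cases. The function names df_percent and classify_space are hypothetical, and /tmp in the usage note is an assumed location for the temporary files.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `df -P` or `df -Pi` output for ONE path on stdin
# and print the usage percentage from column 5 as a plain number.
# A "-" in that column (some filesystems do not report inodes) becomes 0.
df_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 + 0 }'
}

# Hypothetical helper: decide which limit was hit.
# $1 = block usage percent, $2 = inode usage percent
classify_space() {
  local block_pct=$1 inode_pct=$2
  if [ "$inode_pct" -ge 100 ]; then
    echo "inodes: no free inodes, look for directories with huge numbers of small files"
  elif [ "$block_pct" -ge 100 ]; then
    echo "blocks: the filesystem is out of space"
  else
    echo "neither: check quotas, permissions, or whether the app writes to a different mount"
  fi
}
```

A possible call is classify_space "$(df -P /tmp | df_percent)" "$(df -Pi /tmp | df_percent)". A result starting with inodes: explains how writes can fail while df -h still reports free gigabytes.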

How To Use This Lab Well

For each scenario, write down:

the first two checks you would run
what each result would help you confirm or exclude
the safest next action after those checks

If you can explain your sequence clearly, you are building real troubleshooting ability.

What You Just Practiced

distinguishing symptoms from causes
selecting high-value first checks
reducing scope before making changes
proposing a next action that tests a hypothesis instead of guessing

This is the right mindset to carry into the intensive CLI and capstone sections that follow.

Study flow

Read for understanding first, practice immediately after, then mark complete only when you can explain the idea back in your own words.

Self-check

How do you choose the first diagnostic commands in a new incident?
What makes a troubleshooting next step safe and useful?

If you cannot answer both yet, do one more practice pass before you move forward.

Move on when

You can name the first checks you would run for a realistic failure scenario and explain why.
You can propose a next action that tests a hypothesis without creating unnecessary new risk.

Completion matters only if you can do these without leaning on the page.

Review rhythm

After you finish, come back tomorrow for a short no-notes recall pass.

Use spaced recalls around day 1, day 3, day 7, and day 14.
Try to explain the main idea before reopening the page.
Do one related lab or command lookup from memory, not recognition alone.
Re-read only the parts you could not retrieve cleanly.

Use right after this lesson

Lab LAB-MON-02 - The Journalctl Vault
Lab LAB-SEC-04 - Hardening with chattr
Library ping
Library nslookup
Library dig
Library systemctl
Library Get-Service
Library journalctl
