Save Your Linux Machine From Certain Death
Recovering your root password and more
Apr 30 ·7min read
Troubleshooting damaged systems is an essential skill of every SysAdmin, SRE, or DevOps engineer. Every one of us runs into OS-related issues from time to time and it’s better to be prepared when things go terribly wrong.
It’s especially beneficial to be able to identify and act on the issue quickly to prevent any significant damage. To help with that, this article goes over a few common problems that you might encounter, as well as ways to gather information, troubleshoot, and solve them.
Note: This article uses RHEL 8 / CentOS, but most of the examples and concepts below carry over to other Linux distributions.
Recovering Root Password
What if you lose the root password and you don't have access to a privileged user? If you still have access to the machine, then there is a way to solve this inconvenient situation.
First, start by rebooting the machine. When the machine starts, hit any key to interrupt the countdown and access the boot menu.
In the boot menu, hit e to edit the boot options. Using the arrow keys, move to the line starting with linux and append rd.break. This breaks the boot process early on, before control passes from the initramfs to the real system.
Optionally, you can also append enforcing=0 to boot with SELinux in permissive mode. Next, hit CTRL+X to let the machine boot.
After a few seconds of booting, you should get a shell. At this point, you have access to the system in read-only mode: the real root filesystem is mounted read-only under /sysroot.
So, to change anything in the system, like the root password, we need to make the filesystem read-write. We can do that by running mount -o remount,rw /sysroot.
The next thing we need to do is enter the root jail using chroot /sysroot. This changes the root of the filesystem to /sysroot instead of /, so that any further commands we run operate on the installed system under /sysroot. Now we can change the root password using passwd.
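If you added enforcing=0 to the boot options, you can now hit CTRL+D (or type exit) twice, once to leave the chroot and once to leave the emergency shell, and let the system fully boot. If not, run touch /.autorelabel before exiting to trigger the SELinux system relabel.
This is needed because changing the password leaves /etc/shadow with an incorrect SELinux security context. Therefore, we need to relabel the whole filesystem during the next boot (this can take some time, depending on the size of the filesystem). If you booted with enforcing=0 instead, fix the context after logging in by running restorecon /etc/shadow, then switch back to enforcing mode with setenforce 1.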
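Taken together, the rd.break steps come down to three commands. The sketch below collects them in a shell function so that nothing runs until you call it; the name reset_root_password and its optional sysroot argument are my own, not part of any tool, and in the emergency shell you would more likely type the three lines by hand.

```shell
# Sketch: the rd.break recovery steps as one function.
# reset_root_password is a hypothetical helper name.
reset_root_password() {
    sysroot=${1:-/sysroot}
    # The real root filesystem is mounted read-only here; make it writable.
    mount -o remount,rw "$sysroot" || return 1
    # Run passwd with the real filesystem as its root directory.
    chroot "$sysroot" passwd root || return 1
    # Ask SELinux to relabel the filesystem on the next boot.
    chroot "$sysroot" touch /.autorelabel
}
```

Calling reset_root_password without arguments works on /sysroot. Creating /.autorelabel is harmless if you also booted with enforcing=0; it only makes the next boot slower.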
As an alternative solution, you could also use systemd's debug shell. This can be done, again, by editing the GRUB entry during boot and appending systemd.debug-shell instead of rd.break.
When you let the system boot with this option, you will end up at the normal login prompt, which isn’t very helpful. If you, however, switch to virtual terminal 9 using CTRL+ALT+F9, you will find the debug shell running there with full root permissions.
Here, you can change the password normally. At this point, you can switch back to the normal terminal (CTRL+ALT+F1) and log in.
Don’t forget to stop the debug shell afterwards, though, as it is a huge vulnerability: anyone with access to the console gets a root shell. You can do that by running systemctl stop debug-shell.service (you can still switch to terminal 9, but the shell there will be killed off and unresponsive).
Fixing Unmountable Filesystems
Creating new partitions, creating filesystems, mounting filesystems, etc. are common tasks for most SysAdmins.
But, even though these are basic tasks, it’s easy to make a mistake that may render your system unbootable. Let’s see how you can solve problems related to unmountable filesystems.
As with the previous solutions, we start by rebooting the machine, opening the boot menu and editing the entry, this time appending systemd.unit=emergency.target. This tells your system to boot into the emergency target instead of the default one (multi-user or graphical).
When the system boots and we get the shell, we log in as root and again remount the filesystem read-write, this time using mount -o remount,rw /. Now we can try mounting all filesystems by running mount -a.
If there is a problem with mounting a specific filesystem, you might see an error message like mount: /wrong-mount: mount point does not exist. or mount: /wrong-mount: special device /dev/sdb1 does not exist. These kinds of issues need to be fixed in the corresponding entry in /etc/fstab.
After fixing the issue in /etc/fstab, run systemctl daemon-reload so that systemd picks up the changes. Now, run mount -a again. If the issue was indeed fixed, you should see no errors (no news is good news). You can now exit using CTRL+D and let the system boot normally.
Aside from a mistyped device or mount point name, you might also encounter issues with VDO (Virtual Data Optimizer) or Stratis volumes, which require an extra mount option in the options field of their /etc/fstab entry: x-systemd.requires=vdo.service or x-systemd.requires=stratisd.service, respectively (for example, defaults,x-systemd.requires=vdo.service). Without it, the system won’t boot properly.
Another common and easily fixable mistake is a missing quote when using UUID="..." to specify the device (an editor with syntax highlighting for /etc/fstab can save you a lot of problems).
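The three mistakes mentioned in this section (a missing quote, a device that does not exist, a mount point that does not exist) can be caught before rebooting. The function below is a minimal sketch, not a full fstab parser; check_fstab is a hypothetical name, and it only looks at device fields that start with /dev/ and mount points that start with /.

```shell
# Sketch: flag common /etc/fstab mistakes before rebooting.
# check_fstab is a hypothetical helper name, not a standard tool.
check_fstab() {
    fstab=${1:-/etc/fstab}
    status=0
    lineno=0
    while read -r device mountpoint rest || [ -n "$device" ]; do
        lineno=$((lineno + 1))
        case $device in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        # An odd number of double quotes means one is missing.
        quotes=$(printf '%s' "$device" | tr -cd '"' | wc -c)
        if [ $((quotes % 2)) -ne 0 ]; then
            echo "line $lineno: unbalanced quote in $device"
            status=1
        fi
        # Device given as a path: the node has to exist.
        case $device in
            /dev/*)
                if [ ! -e "$device" ]; then
                    echo "line $lineno: device $device does not exist"
                    status=1
                fi ;;
        esac
        # Mount point given as a path: the directory has to exist.
        case $mountpoint in
            /*)
                if [ ! -d "$mountpoint" ]; then
                    echo "line $lineno: mount point $mountpoint does not exist"
                    status=1
                fi ;;
        esac
    done < "$fstab"
    return "$status"
}
```

Run check_fstab with no arguments to check /etc/fstab, or pass the path of a copy you are editing. It prints one line per problem and returns a non-zero status if it found any. On systems with a recent util-linux, findmnt --verify should perform a more thorough check.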
Troubleshooting SELinux Problems
This one is not a life-and-death kind of situation, but it can cause a lot of problems, so it’s beneficial to be able to identify it quickly when it happens.
It’s important to realize that most of the time, SELinux is doing its job correctly. But it might just happen that you are trying to achieve something SELinux doesn’t expect.
Some of the problems you might encounter may include issues with incorrect file context, for example, after moving a file from one place to another. Sometimes the issue might be with overly restrictive policies (SELinux booleans) or blocked service ports.
You can troubleshoot all of these problems by first temporarily switching SELinux to permissive (non-enforcing) mode using setenforce 0 and retrying the action that wasn’t working previously.
If the problem was fixed by switching SELinux to non-enforcing mode, then we know that the problem was caused by an SELinux violation.
Now, after turning SELinux enforcement back on using setenforce 1, we can try to analyze and fix the violation.
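The permissive-mode retry is easy to leave half done, with enforcement still switched off afterwards. As a rough sketch, the function below wraps the retry so that the previous mode is restored whatever the outcome; run_permissive is a hypothetical name, while getenforce and setenforce are the standard SELinux utilities.

```shell
# Sketch: retry a failing command with SELinux in permissive mode,
# then restore the previous mode. run_permissive is a hypothetical name.
run_permissive() {
    previous=$(getenforce)
    setenforce 0 || return 1
    "$@"
    status=$?
    # Only switch back if the system was enforcing to begin with.
    if [ "$previous" = "Enforcing" ]; then
        setenforce 1
    fi
    return "$status"
}
```

For example, run_permissive systemctl restart httpd retries a service restart. If the command succeeds here but fails in enforcing mode, an SELinux denial is the likely cause.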
First, install setroubleshoot-server using yum -y install setroubleshoot-server. This troubleshooting server will listen to /var/log/audit/audit.log and send summary messages to /var/log/messages.
Next, to find these messages, run grep sealert /var/log/messages, which should list a summary line for each recorded violation.
As an example here, I configured httpd to listen on port 8012, which SELinux blocks because 8012 is not among the ports the policy allows for the HTTP service. If we were not aware of this, then it would be quite hard to find the root cause of this issue.
The grep output can help with that. Each message contains a description of the SELinux violation as well as a command (sealert -l followed by the ID of the alert) that can help us troubleshoot further. Running that command produces a full report of what caused the violation, including a suggested (not necessarily the most appropriate) fix.
If you have some experience with SELinux, you might realize that the most appropriate way to fix this issue is to add the port to the SELinux port type used by the HTTP service (http_port_t). This can be done by running semanage port -a -t http_port_t -p tcp 8012.
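Before adding a port, it helps to check whether it is already listed for http_port_t. The sketch below does that with semanage port -l and only calls semanage port -a when the port is missing. The function name allow_http_port is hypothetical, and the check is deliberately simple: it matches individual port numbers and does not understand ranges such as 8080-8090.

```shell
# Sketch: add a TCP port to http_port_t only if it is not listed yet.
# allow_http_port is a hypothetical helper name.
allow_http_port() {
    port=$1
    # "semanage port -l" prints: type, protocol, comma-separated ports.
    if semanage port -l \
        | awk '$1 == "http_port_t" && $2 == "tcp"' \
        | grep -qw -- "$port"; then
        echo "tcp/$port is already labeled http_port_t"
    else
        semanage port -a -t http_port_t -p tcp "$port"
    fi
}
```

With the example above, you would call allow_http_port 8012. If the port is already assigned to a different port type, semanage port -a is expected to refuse, and semanage port -m with the same arguments is the usual way to change the existing assignment.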
This pattern of replicating the violation, looking for sealert messages in /var/log/messages, and viewing and analyzing the report can be applied to any SELinux violation, not just the example above.
Alternatively, you can also search directly in /var/log/audit/audit.log using ausearch. The specific command you would want to run is ausearch -m AVC -ts recent, which shows all denials from the last 10 minutes.
The output carries the same information, but is a little less user friendly.
Getting Logs From a Crashing System
By default, journal logs are stored in /run/log/journal and are not persisted across system reboots. That might become a problem if you need to debug a crashing system.
To preserve journal logs, we need to modify /etc/systemd/journald.conf. More specifically, we need to change the Storage parameter in its [Journal] section, which is commented out by default.
By uncommenting Storage and changing it to persistent, we tell journald to store all logs in /var/log/journal. Aside from this change, we also need to run systemctl restart systemd-journald to make sure that the change takes effect.
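The edit itself is a one-line change, which can be sketched with sed. The function name enable_persistent_journal is hypothetical, and the optional file argument is there only so that the change can be tried on a copy first. It assumes GNU sed and a file that already contains a Storage= line, commented out or not.

```shell
# Sketch: switch journald to persistent storage.
# enable_persistent_journal is a hypothetical helper; the optional
# argument lets you try the edit on a copy of the file first.
enable_persistent_journal() {
    conf=${1:-/etc/systemd/journald.conf}
    # Replace "#Storage=auto" (or any other Storage= line) in place.
    sed -i -E 's/^#?Storage=.*/Storage=persistent/' "$conf"
}
```

A cautious way to use it is to copy the file, run enable_persistent_journal on the copy, and check the result with grep Storage before touching the real configuration.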
Even though this change will persist logs on your system, it won’t keep all of them forever. By default, journald is configured to use no more than 10% of the filesystem and to leave at least 15% of it free (each limit is capped at 4 GB).
Now, to actually inspect the previously stored logs, first switch to the root user. Then run journalctl --list-boots, which lists the recorded boots, each with an offset, a boot ID, and the time range it covers.
Based on the dates and times, choose the boot whose logs you want to see. For example, to view messages from the boot with offset -2 at log level err or higher, run journalctl -b -2 -p err.
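If you find yourself comparing several boots, the query can be wrapped in a small function. This is only a convenience sketch; boot_errors is a hypothetical name, and the flags are the standard journalctl options for boot selection (-b), priority filtering (-p), and disabling the pager.

```shell
# Sketch: show messages of priority err or worse for one boot.
# boot_errors is a hypothetical helper; offset 0 is the current boot,
# -1 the previous one, and so on (see journalctl --list-boots).
boot_errors() {
    offset=${1:-0}
    journalctl -b "$offset" -p err --no-pager
}
```

Calling boot_errors -2 shows the errors from two boots ago, and boot_errors with no argument shows those from the current boot.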
If the logs above are not enough to troubleshoot your issues, then there are other log files to check, such as /var/log/messages and /var/log/boot.log.
Alternatively, if you can’t boot your machine normally, then you can boot into emergency.target as shown above and view the logs there in the same way.
Conclusion
There is a lot more that can go wrong with a Linux machine than what I have shown in the sections above. These examples/approaches, however, can be applied to a variety of other problems that you might encounter.
Also, not all of them are life-and-death kind of situations, but it’s always preferable to be able to solve them rather quickly, especially if this problematic machine is a production system.
Solving most of the issues depends on getting the right information and being able to restore previous configurations. Therefore, it’s crucial to always store logs and to back up critical files on your system before modifying them.
This article was originally posted at martinheinz.dev