After 30 years managing Linux servers, I've found these practices helped me stay focused and effective.
Knowing how to set up and maintain your servers and understanding how system commands work is not all it takes to be a good system administrator. You also need to know how to fix things when something breaks, how to keep systems and data secure, how to monitor performance, how to manage backups, and how to craft scripts that make your work more consistent and free up time for everything else. Knowing these things and holding yourself to a set of cardinal rules can help you keep your systems running smoothly and your users happy.
I spent more than 30 years managing Linux servers. My jobs ranged from doing all the systems work in a company with only a few employees to managing all the servers in the physics and astronomy department at a highly ranked university and at a couple of large and several other significant federal agencies. Keeping my skills honed and my attention focused was always invaluable.
I developed these rules over time, and they came to dominate how I handled my job and helped me stay focused on what was most important.
Rule 1: Never do anything you can’t back out of
Always be fully aware of the impact of the changes you are making on a Linux system, and know how to back out the changes if something goes wrong. This might involve restoring a user account from your backups, reverting to an older version of an application, or even, depending on where you’re working, reverting to a backup server. Always have plans for what you’re going to do if something goes wrong.
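For a single configuration file, a back-out plan can be as simple as saving a copy before you edit it. The Python sketch below is one minimal way to do that, not a full change-management process. The function names backup_file and restore_file and the timestamped .bak naming are assumptions made for this illustration.

```python
import shutil
import time
from pathlib import Path


def backup_file(path):
    """Copy path to a timestamped sibling file and return the copy's path.

    The naming scheme (name.YYYYMMDD-HHMMSS.bak) is an assumption for this sketch.
    """
    src = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, dest)  # copy2 also preserves timestamps and permission bits
    return dest


def restore_file(backup, path):
    """Back out a change by copying the saved version over the edited file."""
    shutil.copy2(backup, path)
```

Call backup_file before making the change and keep the returned path. If the change misbehaves, restore_file with that path puts the earlier version back.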
Rule 2: Avoid making changes on Fridays
If you’re going to make some significant changes to a system, don’t pick a time right before you’re going to disappear for a few days. Ensure that the system or application is running reliably before you move on to some other task or drive home.
Rule 3: Identify root causes
Whenever possible, identify the root causes of problems that you encounter. Knowing the underlying cause of a problem can help you avoid similar problems in the future.
Rule 4: Practice your disaster recovery plans
Develop disaster recovery plans and practice them like you might conduct a fire drill. Make sure you can smoothly slide to an alternate system or backup server as needed while you get the problematic one back in shape.
Rule 5: Automate anything you have to do more than three times, especially when it’s complicated
Scripting routine tasks, especially complex ones, helps you avoid mistakes. It will also save you quite a bit of time and make it easy for someone else to run the script when you can’t.
Rule 6: Never rely on a script you haven’t thoroughly tested
Always test your scripts to be sure that they work exactly as you intended – especially complicated scripts.
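One way to make a destructive script safer to test is to give it a dry-run mode that reports what it would do without doing it. The sketch below assumes a hypothetical cleanup task (removing files older than a given number of days). The function names and the default of dry_run=True are choices made for this example.

```python
import os
import time


def find_old_files(directory, max_age_days, now=None):
    """Return regular files in directory last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    old = []
    for entry in os.scandir(directory):
        # Skip symlinks and subdirectories; only plain files are candidates.
        if entry.is_file(follow_symlinks=False):
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                old.append(entry.path)
    return sorted(old)


def remove_files(paths, dry_run=True):
    """Delete the given files, or only report them when dry_run is True."""
    for path in paths:
        if dry_run:
            print(f"would remove {path}")
        else:
            os.remove(path)
    return list(paths)
```

Running the script against a scratch directory with dry_run left on shows exactly which files it would touch. Only once that list matches your expectations would you pass dry_run=False.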
Rule 7: Document your work
Document your scripts and your routines well enough that someone else understands what to do when you can’t. Scripts should have enough comments to make them easy to read and, if necessary, modify. Add just enough comments to make it clear what the script is doing. Over-commenting on obvious commands is counterproductive.
Rule 8: Pay attention to your mistakes
Everyone makes mistakes from time to time. Pay attention to the kinds of mistakes that you tend to make – that can help you avoid them.
Rule 9: Be a little paranoid
No, don’t actually be paranoid, but don’t be overconfident either. Look for potential problems in the work you do, and ask yourself what could go wrong and how you might prepare for it.
Rule 10: Be proactive
Always make time to consider what could be improved: what could be made more reliable, run faster, or be easier to use or maintain.
Rule 11: Pay a LOT of attention to security
Ensure that the systems you manage are secure. Require complex passwords with periodic expiration dates. Ensure that access to the root account is limited. Limit sudo privileges to necessary commands. Pay attention to access privileges for all users.
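As a small example of checking access privileges, the sketch below looks for files that any user on the system can modify. It covers only one narrow check and is no substitute for a proper security audit. The function name find_world_writable is an assumption for this illustration.

```python
import os
import stat


def find_world_writable(root):
    """Return regular files under root that any user on the system can write to."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # S_IWOTH is the "others may write" permission bit.
            if stat.S_ISREG(mode) and mode & stat.S_IWOTH:
                found.append(path)
    return sorted(found)
```

Pointing it at a directory such as a shared project area or a web root gives you a list to review. Whether a given file should be world-writable is still a judgment call.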
Rule 12: Don’t ignore your log files
Check your log files for any indications of problems and ensure they have adequate disk space.
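Both halves of this rule can be scripted. The sketch below assumes you want two quick numbers: how full the filesystem holding your logs is, and how many log lines mention a few worrying words. The function names and the default patterns are assumptions for this example, and real logs usually call for patterns tuned to the service that writes them.

```python
import re
import shutil


def disk_usage_percent(path):
    """Return how full the filesystem holding path is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # some pseudo-filesystems report no capacity
    return 100.0 * usage.used / usage.total


def count_matches(lines, patterns=("error", "fail", "denied")):
    """Count the lines that contain each pattern, ignoring case."""
    counts = {p: 0 for p in patterns}
    for line in lines:
        for p in patterns:
            if re.search(re.escape(p), line, re.IGNORECASE):
                counts[p] += 1
    return counts
```

You could pass an open log file to count_matches, since a file object yields its lines, and call disk_usage_percent on the directory that holds your logs.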
Rule 13: Back up nearly everything
Use reliable backup procedures to ensure that important files can be recovered as needed.
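A backup is only reliable if you have confirmed that it contains what you expect. The sketch below uses Python's standard tarfile module to archive a directory and then list what was stored. It assumes a simple full backup to a local archive, and the function names are made up for this example. It is not a replacement for a real backup tool.

```python
import os
import tarfile


def make_backup(source_dir, archive_path):
    """Write a gzip-compressed tar archive of source_dir to archive_path."""
    # Store paths relative to the directory's own name rather than its full path.
    top = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top)
    return archive_path


def list_backup(archive_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Write the archive somewhere outside the directory being backed up, and compare the output of list_backup against the files you expected to save. Periodically restoring a file from the archive is a stronger check still.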
Rule 14: Consider everyone’s time as valuable as your own
Be considerate of your users, your fellow sysadmins, and anyone who supplies you with important assistance. Appreciate their work.
Rule 15: Keep your users (customers) informed
Let the users of the systems you manage know when upgrades are happening, what changes they should expect to see, and how to report any problems they might encounter.
Rule 16: Go out of your way to be likable
Remain friendly and approachable. Let your users know when you’re up to your ears in some intensive work and when you might be able to address their concerns.
Rule 17: Never stop picking up new skills
Managing Linux servers can be a very time-consuming and demanding task. Even so, keep on the lookout for things you would like to learn – new skills to develop or how to better understand the problems you encounter or the applications you support.
Rule 18: Seek a balanced life
Be a competent and confident Linux sysadmin, but always take time to find enjoyment in many other things. Reward yourself for your hard work. Pursue other interests. Stay happy.