Semi-Belated SysAdvent Roundup
There is plenty of evergreen content, tidbits of wisdom, interesting perspectives, and more to be gleaned from community-run sources like this. Ops School is another one that comes to mind.
Have any great forums, sites, blogs etc like this in your bookmarks or RSS feeds? Pls share with meeee ➡️ https://hachyderm.io/web/@paigerduty
What is SysAdvent?
One article for each day of December, ending on the 25th article.
With the goals of sharing, openness, and mentoring, we aim to provide great articles about systems administration topics written by fellow sysadmins.
all this to say, back in December I ambitiously embarked on my own SysAdvent of sorts, where I intended to share 1 past post a day along with a snippet that resonated... and I made it 60% of the way through before vacation brain fully took over and I released myself from the tyranny of time. without further ado, below is my full list, and yes, I shamelessly included my own post at the bottom 😅.
1. What's the Problem?
Ask about the problem they need solved. Get details. Why does he or she need postgres? Can the existing mysql deployment and knowledge be used instead? Most of the time customers who simply ask for actions, "please implement this solution," often are unaware of existing, similar options already available. It's also possible that this customer is trying to solve a problem that doesn't exist, doesn't affect your company, or isn't feasible to solve completely.
If you get requirements, you might find they are simply "I need a database that speaks SQL." Alternately, you might find that the requirements include "I need to run this 3rd party tool which requires postgres." Dig deeper. What does this tool do? Can its features be provided by another tool that doesn't require burdening your team with additional products to support? Is the problem the customer wants to solve even in the scope of your team?
2. Lessons in Migrations
totes also applicable to major migrations
Your move will be long. It will be stressful. You will trip over things you didn't plan for, thought you'd planned for, and were sure someone else was planning for. Don't add to the misery by making another big change at the same time. This goes double for anything involving a complicated technology with multiple vendors (including a local monopoly that Does Not Like competition) that will leave everyone very upset if it fails to work right when they come in.
3. What Time Is It?
Time is a complex thing. Did you know there are a few bazillion time standards? Not just time representations, but actually standards on how to record and observe the passage of time! That's awesome! Further, learning about time standards helps explain why we have leap years and leap seconds.
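As a small illustration of the leap-year half of that sentence, here is a minimal sketch of the Gregorian rule in Python. The function names are made up for this example and are not from the quoted post.

```python
from datetime import date


def is_leap_year(year: int) -> bool:
    """Gregorian rule: every 4th year, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_year(year: int) -> int:
    """Cross-check using the standard library's own calendar arithmetic."""
    return (date(year + 1, 1, 1) - date(year, 1, 1)).days
```

Leap seconds are a different beast: they are announced as needed rather than computed from a formula, and Python's datetime module does not model them at all, so there is no equivalent one-liner for that half.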
4. Zen and the Art of Troubleshooting
All too often when troubleshooting it's easy to think of every possible thing that could go wrong. We get caught up in our own abstractions and forget about reality. We must focus on the moment, and deliberately acknowledge where we've created abstractions.
This is a deliberate form of thinking, and it takes some practice. In Zen this is called 初心 (shoshin), the Beginner's Mind. Seeing everything fresh, as if it were the first time you've seen it. Being in the moment. Being deliberate.
5. Down the "ls" Rabbit Hole
Too often sysadmins are afraid to dive into the source code of our core utilities to see how they really work. We're happy to edit our scripts but we don't do the same with our command line utilities, libraries, and kernel. Today we're going to do some source diving in those core components. We'll answer the age-old interview question, "What happens when you type ls at the command line and press enter?" The answer to this question has infinite depth, so I'll leave out some detail, but I'll capture the essence of what is going on, and I'll show the source in each component as we go. The pedants in the crowd may find much to gripe about but hopefully they'll do so by posting further detail in the comments.
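The excerpt doesn't list the steps, so treat this as my own very simplified sketch rather than the post's walkthrough. On Unix-like systems a shell typically forks a child process, execs the requested program in that child, and waits for it to exit. The function name below is hypothetical, and real shells do far more (parsing, builtins, job control).

```python
import os


def run_like_a_shell(argv: list[str]) -> int:
    """Fork a child, exec the program in it, and return the child's exit code.

    Unix only; needs Python 3.9+ for os.waitstatus_to_exitcode.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the program, searching PATH.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed (e.g. command not found); 127 is the usual shell convention.
            os._exit(127)
    # Parent: block until the child finishes, then decode its wait status.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Calling run_like_a_shell(["ls"]) lists the current directory and returns the exit code of ls, which is roughly the slice of the rabbit hole that sits between the prompt and the kernel.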
6. Speaking the Same Language
What was the last legal document you saw that seemed approachable?
7. Effective System Administration
Being an effective system administrator requires an ability to do several (seemingly obvious but often rather fraught) things: To break down projects into actions that we understand as a part, as a whole, and can manage in a discrete period of time; explaining this roadmap to other teams; and successfully keeping implementation on schedule while being flexible enough to handle any issues that arise. The job descriptions and responsibilities of system administrators can vary greatly in scope and the corresponding degrees of difficulty and creativity necessary to succeed. Since "system administrator" alone can sometimes function as a vague catch-all for such a diversity of tasks and functions we use a variety of sometimes unwieldy names to better specify our roles and focus. Regardless of title there is a great deal of commonality in how teams we work for/with view us and depend upon our knowledge and skills. In some cases it's a bit like being a member of a symphony in which the strings, the brass, and the wind sections cannot agree upon the tempo or even what piece to play.
8. Following the White Rabbit
Have you ever worked with vendor support and, after much back and forth, ended up with an answer like "works for us, so it must be something with your setup, sorry!"? This is such a story. And like most similar situations, I learned some good lessons worth sharing.
9. The Pursuit of Learning Through Bad Ideas
As you throw out the absolute worst idea possible to solve something, several outcomes can occur.
1. Your idea, while terrible, just isn’t bad enough. Somebody else in the discussion thinks they can do better (worse). They try to one-up you. They often succeed, and it’s amazing. This sport of spouting bad ideas leads to collaboration, as one person’s idea gets picked up and added to by others.
2. A terrible idea isn’t understood by everybody to be terrible. This often happens when there’s a wide range of experience, either in the job, or within this specific problem domain. The discussion can help spread knowledge, as a more experienced team member explains why your solution of “install head mounted GoPro cameras for auditing purposes” might not actually make your audits any cleaner.
3. Experienced people get a new viewpoint on problems. The problems you face today may be similar to ones you’ve seen before. Trying to think of the worst possible solution forces you to deviate from your usual viewpoint, and can lead to another level of understanding. It can also lead to you reaching for tools or solutions that you’d normally not have considered.
4. You come up with a real, legitimate solution. It’s likely one you and your team would not have arrived at without getting creative and trying to think of the worst idea. For example, choosing a Google spreadsheet[1] as the back end for an internal service. It sounds like a terrible idea. A spreadsheet isn’t really a database. It doesn’t really have a great query language, it can’t handle lots of updates per second, but it has access control, it’s a familiar interface for non-technical folks, and doesn’t require significant upgrades or maintenance.
5. The team learns to debate and discuss ideas. This is important. Because these ideas are intentionally terrible, people don’t get offended when somebody shoots down the idea (or builds on it to come up with something worse). It helps the team learn how to debate properly. Learning how to dismantle ideas without judgment is a much healthier and more productive practice than attacking the person with the idea.
10. DevOps for Horses: Moving an Enterprise Application to the Cloud
Remember that this process should be iterative — unless you have the budget to build a greenfield environment tomorrow, you are going to be tackling this one piece at a time. Don’t feel ashamed because your environments aren’t automated enough or you don’t have comprehensive enough tests for your application. Rather, focus on making things better. If you don’t have enough automation, build more. If there aren’t enough good tests, write just one. Then re-examine your environment, see what most needs improvement, and iterate there.
There’s no way to completely move an app without touching the code, but there’s plenty of work to do before you get there in preparation of scalable, loosely coupled code. Don’t wait for the perfect application to start doing the right thing.
11. Debugging for Systems Engineers
I wanted to write something for sysadvent that would be interesting but focussed on debuggers. There isn’t enough space here to give a full debugger tutorial but instead wanted to give some cases when I reach for a debugger and what for, with some specific tips and examples thrown in. If you’re already an expert at using debuggers you can probably stop reading now.
12. Simplicity in Complex Systems
...the tools we use in the IT landscape are inherently complex, but not as complex as the systems of people that create and maintain these technologies
Nothing adds complexity to an application or system quite like somebody who does not understand it, yet has been tasked with working on it. The fault does not lie with them! Knowledge and learning are everyone’s responsibility. When we can’t make a system simple enough for people to intuit, we must take responsibility for explaining it. It is our responsibility to become better teachers.
13. Fear and Loathing in Systems Administration
When someone says “DevOps Doesn’t Work”, they’re absolutely correct. DevOps is a concept, a philosophy, a professional movement based in trust and collaboration among teams, to align them to business needs. A concept doesn’t do work, and a philosophy does not meet goals - people do. I encourage you to seek out ways of working better with your fellow people.
14. Leading Change in Organizations of All Sizes
Many of us have stories of trying to change something about where we work, and becoming frustrated after putting in a lot of work for little or no impact. This happens for all kinds of reasons, but usually comes down to one thing: unmet prerequisites. These are not necessarily requirements of your specific change, but they need to be in place before your change can happen. An example is trying to introduce refrigerators into a building with unreliable electric power. It doesn't matter how compelling your argument for refrigeration is, if the power isn't reliable then refrigeration will not keep things consistently cold and food will spoil. The power reliability must be addressed first, and then refrigeration can follow.
Unreliable power can be replaced with "unreasonable management," "unreliable software testing," "unexpected regulatory constraints," etc. Some of those things will be in your power to change, some of them will not.
15. Write It Down or Suffer the Consequences
Humans store information in their brains in a set of fascinating ways, including mental maps of where they can go to look up auxiliary information. If you are depending on someone knowing how to find a key document, and they’re gone, it’s as if the document never existed at all. Think of it like RAM. When someone leaves an organization, it’s a reboot, and all the pointers to the information are destroyed. The information may still exist, but we’ll never know, because it was stored in the volatile memory of grey matter.
16. What Does Operations Do?
ah the age-old question
We can see how Oscar’s responsibilities grew over just two years. At first, it was just 6–8 laptops, office wifi, and a third-party office solution. Then it’s a cobbled-together server. Then development environments. Within a year, it’s 20 laptops, two application environments in the cloud, monitoring, alerting, and backups. After another 12 months, it’s a dozen third-party services, 40 laptops, 2 team members, offboarding processes, and monthly security audits.
17. No More On-Call Martyrs
...system integrity is only important when it impacts the bottom line. If a single engineer works herself half-to-death but keeps the lights on, everything is fine.
And from this void, a decades-old culture has arisen.
There is a cult of masochism around on-call, a pride in the pain and of conquering the rotating gauntlet. These martyrs are mostly found in ops teams, who spend sleepless nights patching deploys and rebuilding arrays. It’s expected and almost heralded. Every on-call sysadmin has war stories to share. Calling them war stories is part of the pride.
This is the language of the disenfranchised. This is the reaction of the unappreciated.
18. Why You Need a Post-Mortem Process
while I dislike the term post-mortem, the template questions are golden, as is the Knight Capital tale
A postmortem is intended to fill out the sort of knowledge gaps that inevitably exist after an outage:
1. Who was involved / Who should have been involved?
2. Was/is communication good between those parties?
3. How exactly did the incident happen, according to the people who were closest to it?
4. What went well / what did we do right?
5. What could have gone better?
6. What action items can we take from this postmortem to prevent future occurrence?
7. What else did we learn?
- Without a systematic examination of failure, observers can resort to baseless speculation.
- Without an analysis of what went right as well as what went wrong, the process can be viewed as a complete failure.
- Without providing key learnings and developing action items, observers are left to imagine that the problem will almost certainly happen again.
19. Open Source Licensing in the Real World
disclaimer: neither I nor the author are lawyers
it’s likely your contract will not cover things like this. Fix that. Talk to your boss/manager and get an understanding of what the company’s expectations are and what their expectations of you being an Open Source contributor are. It’s best to do this as part of your negotiations when being hired, but either way you need to have those conversations. It’s important to start with your boss/manager instead of legal because the last thing you want to do is confuse/annoy legal. If you work in a large company, expect this process to take a while and, while it does, do not contribute during company time or using company resources to open source projects. I cannot stress that enough. If the company doesn’t want you contributing and you do, then you are on the hook legally speaking. Be up front and transparent. If you have contributed, start discussions now and stop contributing while you do. Hiding information is worse than making an honest mistake.
20. Lighting Up Your Haunted Graveyards
A common conversation goes like this -
Enthusiastic New Person: “My manager suggested I add better monitoring to the $scary_thing, where should I start?”
Grumpy Senior Engineer: “Um, the code is over there, but it’s 50,000 lines of spaghetti. The last time we touched it we learned it also processes payroll, and the person who wrote it quit 6 months ago. We just try not to touch it.”
A Different Grumpy Senior Engineer: “Yup, I looked at fixing it a few years ago and gave up. When it crashes we just restart it and hope it keeps going.”
Great. Now you have a system so haunted that two senior members on the team refuse to go near it. You’ve chosen to encase it in concrete and warning signs rather than fix it.
This is a huge trap. If you’re very lucky, you’ll only have to walk into the graveyard for security and platform updates, which are probably going to be ok. More likely the system will break spectacularly when you least expect it, and you’ll write a postmortem containing the phrases “technical debt” and “key components with no owner.”
What do you do now? March in with flashlights and throw a party.
21. Breaking in a New Company as an SRE
Week 2
With a working laptop and most of my accounts set up, it was time to start digging in. I got set up on the bastions and started looking at some tickets for our databases. Getting into live machines for analysis helped expose gaps in the new-hire onboarding workflow. Where possible I updated docs and other things were added to my notes.
Lessons:
- Find a ticket to work on right away and give yourself plenty of time. I hit a few snags with access and local conventions that slowed me down.
- Try to figure things out by searching docs and poking.
- Start committing small changes to the team’s primary repos. e.g. formatting, spelling, tests, and minor bug fixes.
- Set a time limit on self-discovery and ask for help.
22. Root Cause is Plural
Like the word data is plural, the phrase “root cause” should be as well. There is never a singular root cause, but instead root causes that contribute to an incident. Furthering the plant metaphor, a root came from a seed, and in order to sprout, that seed needed sufficient nutrients, water, and sunlight. Failing to recognize those nutrients, water sources, and sunlight means you prematurely stopped your learning process. There are other roots growing in your field, maybe similar to the one remediated, maybe different, but feeding off of those same needs.
23. Being Kind to 3am You
- Showing empathy for your fellow engineer who is going to be thankful for that full night sleep is paid back in kind.
- Showing self-care by giving yourself the tools to help you get your job done so you can go back to counting sheep.
- Making sure your junior or new on-call engineers don't freak out in the middle of the night because you left them a note about that upgrade, so those new errors they're seeing are totally okay (well, not okay, but not unexpected).
- Thinking more about how less pages makes everyone sleep easier, and what can be done to achieve that.
24. What's in a Job Description (and Who Does It Keep Away)?
To this day, if I hear someone talking about a strong developer I might wonder "but how much can they deadlift?" Most job descriptions for roles outside physical datacenter management don't include this language anymore. This all got me thinking, what might be in job descriptions these days that could be turning off candidates?
25. Assembling Your Year In Review
My first company had a tradition of taking a moment to pause and review the year by the numbers. The showstopper was the chart showing the amount of data ingested year over year since the founding. In a single glance that chart conveyed a story that would take hours to tell!
It communicated the incredible efforts the employees took to scale the system to handle ingesting, processing, publishing and storing an ever increasing mountain of data. It illustrated how far the company had come and we were confronted head on with the realization that “what got you here, won’t get you there”.
The biggest impact I have seen comes after the presentation. Discussions from Year in Reviews have sparked sweeping oncall management changes as well as minor, but important, changes in the way developers engage with the SRE team.
CAT TAX
until next time,
paigerduty