2024

May 10, 2024
in TypeScript, Today I Learned
3 min read

Today I Learned: Destructure Objects in Function Signatures

When modeling types, one thing to keep in mind is to not leverage primitive types for your domain. This comes up when we use a primitive type (like a string) to represent core domain concepts (like a Social Security Number or a Phone Number).

Here's an example where it can become problematic:

// Definition for Customer
type Customer = {
  firstName: string,
  lastName: string,
  email: string,
  phoneNumber: string
}

// Function to send an email to customer about a new sale
async function sendEmailToCustomer(c:Customer): Promise<void> {
  const content = "Look at these deals!";

  // Uh oh, we're trying to send an email to a phone number...
  await sendEmail(c.phoneNumber, content);
}

async function sendEmail(email:string, content:string): Promise<void> {
  // logic to send email
}

There's a bug in this code, where we're trying to send an email to a phone number. Unfortunately, this code type checks and compiles, so we have to lean on other techniques (automated testing or code reviews) to discover the bug.

Since it's better to find issues earlier in the process, we can make this a compilation error by introducing a new type for Email since not all strings should be treated equally.

One approach we can do is to create a tagged union like the following:

type Email = {
  label:"Email",
  value:string
}

With this in place, we can change our sendEmail function to leverage the new Email type.

function sendEmail(email:Email, content:string): Promise<void> {
  // logic to send email
}

Now, when we get a compilation error when we try passing in a phoneNumber.

Compilation error that we can't pass a string to an email

One downside to this approach is that if you want to get the value from the Email type, you need to access it's value property. This can be a bit hard to read and keep track of.

function sendEmail(email:Email, content:string): Promise<void> {
  const address = email.value;
  // logic to send email
}

Leveraging Object Destructuring in Functions

One technique to avoid this is to use destructuring to get the individual properties. This allows us to "throw away" some properties and hold onto the ones we care about. For example, let's say that we wanted only the phoneNumber from a Customer. We could get that with an assignment like the following:

const customer: Customer = {
  firstName: "Cameron",
  lastName: "Presley",
  phoneNumber: "555-5555",
  email: {label:"Email", value:"Cameron@domain.com"}
}

const {phoneNumber} = customer; // phoneNumber will be "555-555"

This works fine for assignments, but it'd be nice to have this at a function level. Thankfully, we can do that like so:

// value is the property from Email, we don't have the label to deal with
function sendEmail({value}:Email, content:string): Promise<void> {
  const address = value; // note that we don't have to do .value here
  // logic to send email
}

If you find yourself using domain types like this, then this is a handy tool to have in your toolbox.

May 6, 2024
in Today I Learned, Tooling
4 min read

Today I Learned - Effective Pairing with Mob.sh

As someone who enjoys leveraging technology and teaching, I'm always interested in ways to simplify the teaching process.

For example, when I'm teaching someone a new skill, I follow the "show one, do one, lead one" approach and my tool of choice for the longest time was LiveShare by Microsoft.

Using VS LiveShare

I think this extension is pretty slick as it allows you to have multiple collaborators, the latency is quite low, and it's built into both Visual Studio Code (VS Code) and Visual Studio.

Drawbacks to LiveShare

Editor Lock-In

First, participants have to be using Visual Studio or VS Code. Since there's support for VS Code, this isn't quite a blocker as it could be. However, let's say that I'm wanting to work with a team on a Java application. They're more likely to be using IntelliJ or Eclipse as their editor and I don't want someone to have to change their editor just to collaborate.

Security Concerns

Second, there are some security considerations to be aware of.

Given the nature of LiveShare, collaborators either connect to your machine (peer-to-peer) or they go through a relay in Azure. Companies that are sensitive to where traffic is routed to won't allow the Azure relay option and given the issues with the URL creation (see next section), the peer-to-peer connection isn't much better.

To start a session, LiveShare generates a URL that the owner would share with their collaborators. As of today, there's no way to limit who can access that link. The owner has some moderator tools to block people, but there's not a way to stop anyone from joining who doesn't have the right kind of email address for example.

Introducing Mob.sh

While pairing with a colleague, he introduced me to an alternative tool, mob.sh

At first, I was a bit skeptical of this tooling as I enjoyed the ease of use that I got with LiveShare. However, after a few sessions, I find that this tool solves the problems that I was using LiveShare for just as good, if not better.

How It Works

At a high level, mob.sh is a command line tool that is a wrapper around basic git commands.

Because of this design choice, it doesn't matter what editor that a participant has, as long as the code under question is under git source control, the tooling works.

Let's explore how a pair, Adam and Brittany, would use this tool for work.

Adam and Brittany Start Pairing

Adam is looking to solve a logic issue in an AWS lambda could use Brittany's guidance since he's new to that domain.

Adam creates a new feature branch, fixing-logic-issue and starts a new mobbing session.

git switch -c fixing-logic-issue
mob start --create
# --create is needed because fixing-logic issue is not on the server yet

Under the hood, mob.sh has created a new branch off of fixing-logic-issue called mob/fixing-logic-issue. While Adam is making changes, they're going to occur on the mob/fixing-logic-issue.

Because the pair is working remotely, Adam shares his screen so that they're on the same page.

While on this branch, Adam writes a failing unit test that exposes the logic issue that he's running into. From here he signals that Brittany is up by running mob next

1	`mob next`

By running this command, mob.sh adds and commits all the changes made on this branch and pushes them up to the server. Once this command completes, it's Brittany's turn to lead.

Once Brittany see's the mob next command complete, she checks out the fixing-logic-issue branch and picks up the next portion of the work by running mob start

git pull # To get fixing-logic-issue branch
git switch fixing-logic-issue
mob start

Because she was on the fixing-logic-issue branch, mob.sh was able to see that there was a mob/fixing-logic-issue branch already, so that branch is checked out.

Based on the test, Brittany shows Adam where the failure is occurring and they write up a fix that passes the failing test.

Though there are more changes to be done, Brittany has a meeting to attend, so she ends the session by running mob done, committing, and then pushing the changes.

mob done
git commit -m "Fixed first logic bug"
git push

By running mob done command, all the changes that were on the mob/fixing-logic-issue are applied to the fixing-logic-issue branch. From here, Brittany can commit the changes and push them back to the server.

Wrapping Up

If you're looking to expand your pairing/mobbing toolkit, I recommend giving mob.sh a try. Not only is the initial investment small, but I find the tooling natural to pick up after a few tries and since it's a wrapper around Git, it reduces the amount of learning needed before getting started.

Apr 30, 2024
in Process, Leadership
7 min read

Running Effective Experiments With the Team

As a leader, you're always on the look out for new tools and approaches to help the team be more effective.

But what happens when you have an idea? How do you introduce it to the team and get buy-in? How do you encourage others to propose ideas as well (remember, you're job isn't to have all the ideas, but to encourage and choose the best ones).

Let's say that the idea works, what happens next? What if it failed, what do you do next? How do you share your lessons with others?

In this post, I'll walk you through my approach for running experiments with the team and how to answer these questions. Like any other advice, I've found success using this process, but you might find that you'll need to tweak or adjust for your team.

Working In the Open

When it comes to the team, I'm a big proponent of working in the open. Not only does this reduce the amount of questions from my leader about what we're doing, it also empowers others to chime in when they see something off or the team going down the wrong path.

With this philosophy in mind, I document our experiments in the team wiki. Now, I know that we should favor people over processes, however, I have found immense value in taking the 10 minutes to document as this helps get everyone on the same page and when we look at these experiments later, we have the context behind the experiment.

To me, this no different than a scientist writing down their experiments for later reference.

Defining an Experiment

As you might have guessed, I'm a big fan of using the scientific method for engineering work and especially so when it comes to experiments. As such, I capture the following info:

Purple liquid being injected into one of many test tubes — Seriously, if we're not taking notes, what kind of scientists are we?

Context - Why are we doing this? What inspired the experiment or what problem are we trying to solve?
Hypothesis - What change are we proposing and what outcome are we looking for?
Implementation - How are we going to run this experiment?
Duration - How long are we going to run this experiment for?
(Optional) Immediate Failure Criteria - Is there anything that could happen during this experiment that would cause to immediately stop?

For those looking for a template, you can find a markdown version in my Leadership Toolkit on GitHub

Scheduling the Retrospective

With the experiment documented, I send a meeting request the day after the experiment is scheduled to end. The goal of this meeting is to reflect on the experiment and to decide whether we should adopt the changes or to stop.

Leading the Team

After sending this meeting, my job is to help the team implement the experiment and coach/encourage as needed. Since it's a process change, it might take a bit for the team to adjust, so showing some patience and understanding is critical here.

While we are going the through the experiment, I'm going to note any changes that I'm noticing. For example, if we're running an experiment to have asynchronous stand-ups, I'm going to take notes on how I'm feeling about the team getting updates and how they're communicating with each other.

Depending on what comes up in our one-on-ones, I might even use this as a starter question.

Retrospective

Once the experiment has ran its course, it's time to reflect on the experiment and decide as a team on whether to adopt the changes or reject them.

To prepare, I'd recommend getting the right people in the room and setting the context.

During the retro, the team should be doing the majority of the talking. Your role is to seed the conversation and make sure everyone gets their opinions out. I like to capture these notes on a board so that the team has clear visibility on what worked and didn't work.

Once the notes have been added to the board, it's time for the team to decide to adopt the changes or not. During this step, I remind the team that this process isn't set in stone and if we want to tweak it in a future experiment, that's normal and encouraged.

At this point, I update the experiment write-up that we did earlier with the team decision and the logic behind the decision. This provides an easy way of sharing our lessons with others.

One cool thing about leading teams is that no two teams are the same. Between the personalities, skills, company culture, and motivations, what works for one team won't work for another team (and the other way around).

Because of this, it's critical to share your results with your leader and your peers. This way, they could see what we did, what worked, what didn't work, and hopefully get inspired to run their own experiments with teams.

If the team paid a learning tax for an experiment, why wouldn't we share those results with others so that they can learn from our experiences? They might be able to make suggestions to turn a failure into a success or to ask questions about how we dealt with an issue.

The group being successful is your success, do don't hoard knowledge, share it with others!

With the write-up completed, you can start simply by sending a link to the group. A better approach would be to have a standing agenda item for your team lead meeting where leaders can talk about experiments that have been ran recently and their outcomes.

Common Mistakes

When I've worked with leaders to introduce experiments, it can be a lot to take in because this is a different way of thinking. This is especially true if leaders are not in a psychologically safe environment or if they have priori experiences that weren't successful.

I can't guarantee that you'll run experiments flawlessly, however, if you avoid these common mistakes, your odds of success will be higher.

Not Time Boxing the Experiment

One of the key features of the experiment is that it's only going to run for a set period of time, so that if you find that it's not working, you've not permanently impacted the team.

If you have an experiment that's going to run into perpetuity, that's not an experiment anymore, that's a process change and that shouldn't go through this workflow because experiments can be abandoned, but process changes typically can't.

Treating Experiments as Foregone Conclusions

At some point, you're going to get a directive from your leader that you don't agree with, but you need to commit to the idea anyway. You know the idea isn't going to go over with the team, so you think framing it as an experiment can soften the blow.

DON'T DO THIS!

Jimmy Fallon saying It's Totally Not Worth It — Really, don't do this!

Experiments are just that, experiments. They are not a vehicle for you to make unpopular changes. If you use experiments for slipping in these types of changes, then the team will learn that experiments is code for "not great idea" and they'll stop using the process.

Remember, experiments are ideas that you and the team come up with to make things better, not directives from the top coming down.

Now, you could use an experiment to figure out a way to carry out the directive. A good leader tells you where we have to go, but not necessarily how to get there. The experiment could be to figure out how to get there, but not what the destination should be.

Running Multiple Experiments

When getting a new team or after identifying multiple areas that a team could improve in, it's going to be tempting to want to implement multiple changes at once.

Resist the urge.

Remember, an experiment, by definition, is a process change. So the more experiments you run, the more process changes happening, which in turn puts more stress on the team to remember all the changes.

In addition to all the process changes, you might find that one experiment futzes with another experiment and you may not get clear results.

Let's say that we had two experiments going on at the same, asynchronous stand-ups and spending Tuesday afternoons in independent learning. During your one-on-ones, you get some feedback that it's a bit odd to not know what other team members are working on.

What's driving that? Is it the async stand-ups? Or is it the dedicated learning time? Could it be both? You can't be sure.

Another way to think about this is to think about debugging a program. If something's not working, do you change 5 things at once? No, you're going to change one thing, re-run, and see what happens.

Same thing for experiments.

But Cameron! This team is a hot mess and could stand to improve in so many areas, what should I do then?

Instead of running all the experiments, instead, the team should decide which experiment would have the biggest payoff and then pursue that one. Remember, you're not playing the short game, but you're in for the long haul, so you'll have the time to make those changes.

Apr 25, 2024
in Process, AWS
3 min read

Troubleshooting a DynamoDB Connection Issue

Most of my blog posts cover process improvements, leadership advice, and new (to me) technologies. In this post, I wanted to shift a bit and cover some of the fun troubleshooting problems that I run into from time to time.

Enjoy!

The Setup - How Did We Get Here?

At a high level, the team had a need to process messages coming from a message queue, parse the data, and then insert into a DynamoDB table. At a high level, here's what the architecture looked like:

graph LR
Queue[Message Queue] --> Lambda[Lambda]
Lambda --> Process[Process?]
Process --> |Failed| DLQ[Dead Letter Queue]
Process --> |Success| DB[DynamoDB Table]

The business workflow is that a batch job was running overnight that would send messages to various queues (including this one). The team knew that we would receive about 100K messages, but had plenty of time to process them as this data was not needed for real-time.

What Went Wrong?

For the first night, everything worked as intended. However, for the second night, the team saw that only some of the messages made it to their DynamoDB table. A non-trivial number of them errored out with a message of Error: connect EMFIL <IP ADDRESS>.

I don't know about you, but I had never seen EMFIL as an error before and the logs weren't very helpful on what was going on.

Doing some digging, we found this GitHub Issue where someone has ran into a similar problem.

Digging through the comment chain, we found this comment, stating that you could run into this problem if you were exhausting the connection pool to DynamoDB.

Ah, now that's an idea! Even though I hadn't seen that error before, I know that if an application isn't cleaning up their connections properly, then the server can't accept new ones and that would fail the application. With almost 100K messages coming through and the large amount of failures, I could absolutely see how that might be the issue.

Inspecting the Code

With this in mind, I started to take a look at the lambda in question and found the following:

export const handler = (event) => {
  // logic to parse event

  const dbClient = DynamoDbDocumentClient.from(new DynamoDBClient());

  // logic to insert event
}

Aha! This code implies that for every execution of the lambda, it would attempt to create a new connection.

But Cameron, hold up. Yes, it will create the connection every time the lambda executes, but once the lambda is done, the connection will get cleaned up, so will it really try to spin up 100K connections?

You're right, when the lambda goes out of scope, the connection will get cleaned up.

But don't forget, it'll take the target server (DynamoDB) some time to tidy up. The problem is that since we were slamming 100K messages in rapid succession, DynamoDB didn't have enough time to clean up the connection before another connection was requested. And that was the problem.

Resolution

Now that we have an idea on what the problem could be, time to fix it. In this case, the change is straightforward (though the reasoning might not be.)

So instead of having this

export const handler = (event) => {
  // logic to parse event

  const dbClient = DynamoDbDocumentClient.from(new DynamoDBClient());

  // logic to insert event
}

We moved the client creation to be outside of the handler block altogether.

const dbClient = DynamoDbDocumentClient.from(new DynamoDBClient());

export const handler = (event) => {
  // logic to parse event
  // logic to insert event
}

Wait, wait. How does this solve the problem? You're still going to be executing this code for every message, so won't you have the same issue?

Now that's a great question! Something that the team learned is that when a Lambda gets spun-up, there's a context that's created that hosts the external dependencies. When a lambda execution finishes, the context is maintained by AWS for a certain amount of time to be reused in case the lambda is invoked again. This saves on the init/start-up times.

Because of the shared context, this allows us to essentially pool the connections and drastically reduce the amount of connections needed.

This same advice is given in the best practices documentation for lambdas.

Lessons Learned

After making the code change and redeploying, we were able to confirm that everything was working again with no issues.

Even though the problem was new to us, this was a great opportunity to learn more about how Lambdas work under the hood, understand more about execution context, and a bit of dive into troubleshooting unknown errors for the team.

Apr 22, 2024
in TypeScript, Today I Learned, Process
3 min read

Today I Learned: LibYear

When writing software, it's difficult (if not impossible) to write everything from scratch. We're either using a framework or various third-party libraries to make our code work.

On one hand, this is a major win for the community because we're not all having to solve the same problem over and over again. Could you imagine having to implement your own authentication framework (on second thought...)

However, this power comes at a cost. These libraries aren't free as in beer, but more like puppies. So if we're going to take in the library, then we need to make sure that our dependencies are up-to-date with the latest and greatest. There are new features, bug fixes, and security patches occurring all the time and the longer we let a library drift, the more painful it can be to upgrade.

If the library is leveraging semantic versioning, then we can take a guess on the likelihood of a breaking change base on which number (Major.Minor.Maintenance) has changed.

Major - We've made a breaking change that's not backward compatible. Your code may not work anymore.
Minor - We've added added new features that you might be interested in or made other changes that are backwards compatible.
Maintenance - We've fixed some bugs, sorry about that!

Keeping Up With Dependencies

For libraries that have known vulnerabilities, you can leverage tools like GitHub's Dependabot to auto-create pull requests that will upgrade those dependencies for you. Even though the tool might be "noisy", this is a great way to take an active role in keeping libraries up to date.

However, this approach only works for vulnerabilities, what about libraries that are just out-of-date? There's a cost/benefit of upgrading where the longer you go between the upgrades, the riskier the upgrade will be.

In the JavaScript world, we know that dependencies are listed in the package.json file with minimum versions and the package-lock.json file states the exact versions to use.

Using LibYear

I was working with one of my colleagues the other day and he referred me to a library called LibYear that will check your package.json and lock file to determine how much "drift" that you have between your dependencies.

Under the hood, it's combining the npm outdated and npm view <package> commands to determine the drift.

What I like about this tool is that you can use this as a part of a "software health" report for your codebase.

As engineers, we get pressure to ship features and hit delivery dates, but it's our responsibility to make sure that our codebase is in good shape (however we defined that term). I think this library is a good way for us to capture a data point about software health which then allows the team to make a decision on whether we should update our libraries now (or defer).

The nice thing about the LibYear package is that it lends itself to be ran in a pipeline and then you could take those results and post them somewhere. For example, maybe you could write your own automation bot that could post the stats in your Slack or Teams chat.

It looks like there's already a GitHub Action for running this tool today, so you could start there as a basis.

Apr 16, 2024
in Process
9 min read

Learning From Failures: Leveraging Postmortems for Good

As engineers, we love solving problems and digging into the details. I know that I feel a particular sense of joy when shipping a system to production for the first time.

However, once a system goes into production, it's going to fail. That's not a knock on engineering, that's a fact of our industry. We cannot build a system with 100% uptime, no matter how much we plan ahead or think about.

When this happens, our job is to fix the issue and bring the systems back up.

A common mistake that I see teams make is that they'll fix the problem, but never dig into the why it happened. Fast forward three months, and the team will run into the same problem, making the same mistakes. I'm always surprised when teams don't share their knowledge because if you've paid the tax of learning from the first outage, why would you pay the same tax to learn the lesson again?

This is where having a postmortem meeting comes into play. Borrowing from medicine, a postmortem is performed when the team gets together to analyze the outage and what could have been done differently to prevent it from happening. Other industries have similar mechanisms (for example, professional athletes review their games to learn where they made mistakes so that they can train differently).

One of the key concepts of the postmortem is that the goal isn't to assign blame, but to understand what happened and why. People aren't perfect and it's not reasonable for them to be. This concept is so fundamental that another term you might here for a postmortem is a blameless incident report (BIR).

Are you interesting learning how to run your own BIR process? Drop me a line and if there's enough interest, I'll write a follow-up post!

Getting Started

Every company has their own process when they have an outage, but health process should be able to answer these five questions at a minimum.

What stopped working?
Why did it stop working?
How did we fix it?
What led to the system breaking?
What are we doing to prevent this issue from happening in the future?

Organizations may have additional questions for their process, for example

What was the impact? (X customers were impacted for Y minutes)
How was this discovered? (Was it user reported, support found it, our monitoring tools paged on-call)

You can always make a process more complicated, but it's hard to simplify a process, so my recommendation is start with the 5 questions and then expand as your team evolves and matures.

What Stopped Working?

For there to be an outage, something had to stop working, right? So what was it? It's okay/normal to be a bit technical here, however, don't forget that this outage caused an impact to our users, so we should strive to explain the outage in those terms.

Another way to think about this is "What stopped working and what impact did it have for our users?"

Here's a not-great answer to the question

The Data Sync Java container stopped working.

While this does capture what stopped working, the details are quite vague. For example, did it not start? Did it start crashing? Was it the whole process that failed or only one part?

In addition, there's not a clear delineation of what the user impact was. For example, could users not log into the application at all? Were there certain parts of the application that stopped working?

We can improve this by including a bit more details in what specifically stopped working.

The Data Sync process was failing to connect to the UserHistory database.

Ah, okay, now we know that the sync process wasn't able to connect to a specific database and we can start getting a sense of why that would be a big deal. We're still not including the user impact, so let's add that bit of detail in.

The Data Sync process was failing to connect to the UserHistory database. As such, when users logged into their account, they could not see the latest transactions for their account.

Much better, now I know that our users couldn't see recent transactions and it was something to do with the Data Sync process.

As a side benefit, if I'm a new engineer to this codebase or team, I know know more about the architecture and that this process is involved for when users start transactions.

Why Did It Stop Working?

At some point, the system was working and if it's no longer working, something had to have happened, so what was it? Was there a code deployment to production? Did a feature flag get toggled that had an adverse effect? What about infrastructure changes like DNS entries or firewall?

This is a key critical step because if we don't know why it stopped working, then we don't have a good spot to start when we start diving into the circumstances behind what led to the outage.

This doesn't have to be a page worth of technical deep dive, a couple of sentences can suffice here. In our outage, the issue was due to a port being blocked by the firewall (where it wasn't before).

There was an infrastructure change for the database that blocked port 1433, which is the default port for the database. Because of this change, no application was able to successfully connect to the database.

How Did We Fix It?

If you've gotten to the point of writing the BIR, then you've fixed the issue and the system is up and running again, right? So what did you do to fix the issue? Did you rollback the deployment? Disable a feature flag? Burn down the application, change your name and get a new job? This is a cool part of the BIR because you're leveling up others that if they run into a similar issue, here's how you were able to get back up and running.

In our example, we were able to unblock the port, so we can answer this question with:

Once we realized that port 1433 was being blocked by the firewall, we worked with the Infrastructure team to unblock the port. Once that change was completed, the Data Sync service was able to start syncing data to the UserHistory database.

What Led To the System Breaking?

This is where the meat of the conversation should take place. In a healthy organization, we assume that people are wanting to do the right thing (if not, you have a much bigger problem that BIRs). So we've got to figure out how did we get here, what were the motivations and why did we do the things that we did?

One common approach is 5 Whys, made popular by the Toyota Production System. The idea is that we keep digging into why something happened and not stop at symptoms.

An example 5 Whys breakdown for this outage could be the following:

- Why did the Data Sync service started failing to connect to the UserHistory database?
    - Because the port that the Data Sync service was communicating with got blocked
    - Why did the port become blocked?
        - One of our security initiatives is to block default ports to lessen the changes of someone gaining access to our systems
        - Why is this an initiative?
            - Our current firewall solution doesn't support a way to have an _allowList_ of dynamic IP addresses. Since most vulnerability tools scan a network, they'll typically use default ports to see if there's a service running there and if so, try to compromise it.
            - Why doesn't our current firewall solution have support for dynamic IP addresses?
    - Why did we not see this issue in the lower environments?
        - The lower environments are not configured the same as our production environment
        - Why are these environments different?
            - Given that lowers receive less traffic than production, we have multiple databases installed on the same server, none of which are on the default port. By doing this, we're saving money on infrastructure costs.
        - Why did the team not realize that the lowers are configured differently?
            - The Data Sync process is an older part of our application that most of the team doesn't have knowledge of.
    - Why did our monitoring tools not catch the issue after deployment?
        - For the Data Sync process, we currently only have a health check, which only checks to see if the app is up, it doesn't check that it has line-of-sight to all its dependencies.
        - Why doesn't the health check include dependency checking?
            - Health checks are used to tell our cloud infrastructure to restart a service. Since restarting the service wouldn't have resolved the problem, that's why we don't have it included in the checks
        - Why don't we have other checks?
            - The Data Sync process predates our existing monitoring solutions and has been stable, so the work was never prioritized.

As you can see, this approach brings up lots of questions, including the motivation behind the work and why the team was doing it anyway. It is possible

What Are We Doing To Prevent Similar Issues in the Future?

The system is going to have an outage, that's not up for debate. However, it would be foolish to have an outage and not do anything to prevent it from happening in the future. If we already paid the learning tax once, let's not pay it again for the same issue.

This should be a concrete list of action items that the team takes ownership of. In some cases, it's work that they can do to prevent the issue going forward. In some cases, it could be working with others to help them fix things in their process.

In our hypothetical outage, we could have the following action items

Add Additional Monitoring for the Data Sync process
Work with Security to determine approach for securing our database instance
Create environment diagram for Data Sync process
Create architecture diagram for Data Sync process

Example Blameless Incident Report

In this post, we covered the 5 main questions to answer for a BIR and what good responses look like. In this section, I wanted to go over an example BIR for the database issue that we've been exploring. As you'll see, it's not a verbose document, however, it does capture the main points and this is easily consumable for other teams to learn from our mistakes.

# Title: Users Unable To See Latest Transactions

##  What Stopped Working?
The Data Sync process stopped being able to connect to the UserHistory Database. Because of this failure, when users logged into their account to see transactions, they were not able to see any new transactions.

## Why Did It Stop Working?

A change was made to the database infrastructure to block port 1433. This is the default port for a SQL Server database so when it was blocked, no application was able to communicate with the database.

## How Did We Fix It?

The firewall port change was reverted.

## What Led to the System Breaking?

To improve our security posture, we wanted to block default ports to our systems so that if someone was to gain access, they couldn't "guess" into the connection for the database.

When making these firewall changes, we start in the lower environments so that if there is a problem, we impact dev or staging and not production.

Unknown to the team, in the lower environments, we have multiple databases installed on a server, none of which are on port 1433. Because of this, we had false confidence that our changes were safe to deploy forward.

In production, each database has their own server, running on port 1433.

##  What Are We Doing To Prevent Issues In The Future?

- **Check the environment differences** - Before making infrastructure changes, we're going to check what the differences are in our lower environments vs production.
- **Create architecture diagram** - Since one of the main issues is that the team didn't understand the architecture of the Data Sync process, we're going to create an architecture diagram that covers the flow of the service.
- **Create environment diagram** - To better understand our system, we're going to create an environment diagram that covers the databases at play and how the Data Sync process communicates.
- **Work with Security on Approaches for Securing Database** - We'll work with the Security team to either setup a way to have dynamic IPs work with our firewall technology or to change our Data Sync process to have a static IP.

Apr 13, 2024
in Functional, TypeScript, Today I Learned
2 min read

Today I Learned - Iterating Through Union Types

In a previous post, we cover on how using union types in TypeScript is a great approach for domain modeling because it limits the possible values that a type can have.

For example, let's say that we're modeling a card game with a standard deck of playing cards. We could model the domain as such.

type Rank = "Ace" | "Two" | "Three" | "Four" | "Five" | "Six" | "Seven"
           | "Eight" | "Nine" | "Ten" |"Jack" | "Queen" | "King"
type Suite = "Hearts" | "Clubs" | "Spades" | "Diamonds"

type Card = {rank:Rank; suite:Suit}

With this modeling, there's no way to create a Card such that it has an invalid Rank or Suite.

With this definition, let's create a function to build the deck.

function createDeck(): Card[] {
  const ranks = ["Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King"];
  const suites = ["Hearts", "Clubs", "Spades", "Diamonds"];

  const deck:Card[] = [];
  for (const rank of ranks) {
    for (const suite of suites) {
      deck.push({rank, suite});
    }
  }
  return deck;
}

This code works, however, I don't like the fact that I had to formally list the option for both Rank and Suite as this means that I have two different representtions for Rank and Suite, which implies tthat if we needed to add a new Rank or Suite, then we'd need to add it in two places (a violation of DRY).

Doing some digging, I found this StackOverflow post that gave a different way of defining our Rank and Suite types. Let's try that new definition.

const ranks = ["Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King"] as const;
type Rank = typeof ranks[number];
const suites = ["Hearts", "Clubs", "Spades", "Diamonds"] as const;
type Suite = typeof suites[number]

In this above code, we're saying that ranks cannot change (either by assignment or by operations like push). With that definition, we can say that Rank is some entry in the ranks array. Similar approach for our suites array and Suite type.

I prefer this approach much more because we have our ranks and suites defined in one place and our code reads cleaner as this says Here are the possible ranks and Rank can only be one of those choices.

Limitations

The main limitation is that it only works for "enum" style unions. Let's change example and say that we want to model a series of shapes with the following.

type Circle = {radius:number};
type Square = {length:number};
type Rectangle = {height:number, width:number}

type Shape = Circle | Square | Rectangle

To use the same trick, we would need to have an array of constant values. However, we can't have a constant value for any of the Shapes because there are an infinite number of valid Circles, Squares, and Rectangles.

Apr 9, 2024
in TypeScript, Functional Programming
5 min read

Exploring Map with Property Based Thinking

When thinking about software, it's natural to think about the things that it can do (its features like generating reports or adding an item to a cart).

But what about the properties that those actions have? Those things that are always true?

In this post, let's take a look at a fundamental tool of functional programming, the map function.

All the code examples in this post will be using TypeScript, but the lessons hold for other languages with Map (or Select if you're coming from .NET).

Examining Map

In JavaScript/TypeScript, map is a function for arrays that allow us to transform an array of values into an array of different values.

For example, let's say that we have an array of names and we want to ensure that each name is capitalized, we can write the following:

const capitalize = (name:string): string => {
  return n[0].toUpperCase() + n.substring(1);
}
const names = ["alice","bob","charlotte"];

const properNames = names.map(capitalize);

In our example, as long as we have a pure function that takes a string and returns a new type, then map will work.

What Does Map Guarantee?

Map is a cool function because it has a lot of properties that we get for free.

Maintains Length - If you call map on an array of 3 elements, then you'll get a new array with 3 elements. If you call map on an empty array, you'll get an empty array.
Maintains Type - If you call map on array of type T with a function that goes from T to U, then every element in the new array is of type U.
Maintains Order - If you call map on array with one function, then call map with a function that "undoes" the original map, then you end up with the original array.

Writing Property Based Tests

To prove these properties, we can write a set of unit tests. However, it would be hard to write a single test that that covers a single property.

Most tests are example based in the sense that for a specific input, we get a specific output. Property based tests, on the other hand, uses random data and ensures that a property holds for all inputs. If it finds an input where the property fails, the test fails and you know which input caused the issue.

Most languages have a tool for writing property-based tests, so we'll be using fast-check for writing property based tests and jest for our test runner

Checking Length Property

import fc from "fast-check";

describe("map", () => {
  it("maintains length", () => {
    // This is known as the identify function
    // as it returns whatever input it received
    const identity = <T>(a: T): T => a;

    // Fast Check assert that the following holds for all arrays of integers
    fc.assert(
      // data is the array of numbers
      fc.property(fc.array(fc.integer()), (data): void => {
        // We call the map function with the identify function
        const result = data.map(identity);

        // We make sure that our result has the same length
        expect(result).toHaveLength(data.length);
      })
    );
  });

If we run this test, we'll end up passing. But what is the value of data?

By adding a console.log in the test, we'll see the following values printed when we run the test (there are quite a few, so we'll examine the first few).

console.log
    [
       2125251991,  1334674146,
      -1531633149,   332890473,
       1313556939,   907640912,
        887735692, -1979633703,
       -259341001,  2015321027
    ]

  console.log
    [ 1307879257 ]

  console.log
    []

  # quite a few more...

Checking Type Property

We've proven that the length property is being followed, so let's look at how we can ensure that result has the right type.

To keep things simple, we're going to start with a string[] and map them to their lengths, yielding number[].

If map is working, then the result should be all numbers.

We can leverage typeof to check the type of each element in the array.

// An additional test in the describe block

it("maintains type", () => {
  const getLength = (s:string)=>s.length;
  fc.assert(
    // asserting with an array of strings
    fc.property(fc.array(fc.string()), (data): void => {
      // mapping to lengths of strings
      const result = data.map(getLength);

      // checking that all values are numbers
      const isAllValid = result.every((x) => typeof x === "number");
      expect(isAllValid).toBeTruthy();
    })
  );
});

Like before, we can add a console.log to the test to see what strings are being generated

console.log
  [ 'ptpJTR`G4', 's >xmpXI', 'H++%;a3Y', 'OFD|+X8', 'gp' ]

console.log
  [ 'Rq', '', 'V&+)Zy2VD8' ]


console.log
  [ 'o%}', '$o', 'w7C', 'O+!e', 'NS$:4\\9aq', 'xPbb}=F7h', 'z' ]

console.log
  [ '' ]

console.log
  [ 'apply', '' ]

console.log
  []
## And many more entries...

Checking Order Property

For our third property, we need to ensure that the order of the array is being maintained.

To make this happen, we can use our identity function from before and check that our result is the same as the input. If so, then we know that the order is being maintained.

it("maintains order", () => {
  const identity = <T>(a: T) => a;

  fc.assert(
    fc.property(fc.array(fc.string()), (data): void => {
      const result = data.map(identity);

      expect(result).toEqual(data);
    })
  );
});

And with that, we've verified that our third property holds!

So What, Why Properties?

When I think about the code I write, I'm thinking about the way it works, the way it should work, and the ways it shouldn't work. I find example based tests to help understand a business flow because of it's concrete values while property based tests help me understand the general guarantees of the code.

I find that once I start thinking in properties, my code became cleaner because there's logic that I no longer had to write. In our map example, we don't have to write checks for if we have null or undefined because map always returns an array (empty in the worse case). There's also no need to write error handling because as long as the mapping function is pure, map will always return an array.

For those looking to learn more about functional programming, you'll find that properties help describe the higher level constructs (functors, monoids, and monads) and what to look for.

Finding properties can be a challenge, however, Scott Wlaschin (of FSharpForFunAndProfit) has a great post talking about design patterns that I've found to be immensely helpful.

Mar 26, 2024
in Leadership, Process
5 min read

How Using Vertical Slicing Can Minimize Dependencies and Deliver Value Faster

How do we break down this work?

It's a good question and it can help set the tone for the project. Assuming the work is more than a bug fix, it's natural to look at a big project and break it down to smaller, more approachable pieces.

Depending on how you break down the work, you can dramatically change the timeline from when you can get feedback from your users and find issues much sooner.

In this post, let's look at a team breaking down a new feature for their popular application, TakeItEasy.

A New Day - A New Feature

It's a new sprint and your team is tackling a highly requested feature for TakeItEasy, the ability to setup a User Profile. Everyone is clear on the business requirements as we need the ability to save and retrieve the following information so that we can personalize the application experience for the logged in user:

Display Name
Name
Email Address
Profile Picture

Going over the high level design with the engineers, it's discovered that there's not a way to save this data right now. In addition, we don't have a way to display this data for the user to enter or modify.

Breaking Work Down as Horizontal Layers

Working with the team, the feature gets broken down as the following stories:

Create the data storage
Create the data access layer
Create the User Profile screen

Once these stories are done, this feature is done and that seems easy enough. As you talk with the team though, a few things stand-out to you.

None of these stories are fully independent of each other. You can build out the User Profile screen, but without the Data Access Layer, it's incomplete. Same thing with the data access layer, it can't be fully complete until the data storage story is done.
It's difficult to demo the majority of the stories. Stakeholders don't care about data storage or the data access layer, but they do care about the user setting up their profile. With the current approach, it's not possible to demo any work until all three stories are done.

As you approach each story, they seem to be quite large:

For the Data Storage work, it's an upgrade script to modify the Users table with nullable columns.
For the data access story, it's updating logic to retrieve each of the additional fields and making sure to handle missing values from the database.
For the User Profile screen, it's creating a new page, update the routing, and adding multiple controls with validation logic for each of the new fields.

Is there a different way we can approach this work such that we can deliver something useful sooner?

Breaking Down the Work as Vertical Slices

The main issue with the above approach is that there's a story for each layer of the application (data, business rules, and user interface) and each of these layers are dependent upon each other. However, no one cares about any single layer, they care about all of it working together.

Two People Eating Nachos — Seriously, could you imagine enjoying a plate of nachos by first eating all the chips, then the beans, then the salsa?

One way to solve this problem would be to have a single story Implement User Profile that has all this work, but that sounds like more than a sprints worth of work. We know that that the more work in a story, the harder it is to give a fair estimate for what's needed.

Another approach to solve this problem is by changing the way we slice the work by taking a bit of each layer into a story. This means that we'll have a little bit of database, little bit of data access, and little bit of the user interface.

If we take this approach, we would have the following stories for our User Profile feature.

Feature: Implement User Profile

Story: Implement Display Name
Story: Implement Name
Story: Implement Email
Story: Implement Profile Picture

Each story would have the following tasks:

Add storage for field
Update data access to store/retrieve field
Update interface with control and validation logic

There are quite a few advantages with this approach.

First, instead of waiting for all the stories to get done before you can demo any functionality, you can demo after getting one story completed. This is huge advantage because if things are looking well, you could could potentially go live with one story instead of waiting for all three stories from before.

Second, these stories are independent of each other as the work to Implement Display Name doesn't depend on anything from Implement Email. This increases the autonomy of the team and allows us to shift priorities easier as at the end of any one story, we can pick any of the remaining stories.

For example, let's say that after talking more with customers, we need a way for them to add their favorite dessert. Instead of the business bringing in the new requirement and pushing back the timeline, engineering can work on that functionality next and get that shipped sooner.

Third, it's much easier to explain to engineers and stakeholders for when a certain piece of functionality will be available. Going back to horizontal layering, it's not clear when a user would be able to set-up their email address. Now, it's clear when that work is coming up.

Why The Horizontally Slicing?

I'm going to let you on a little secret. Most engineers are technically strong, but can be ignorant of the business domain that they're solving in. Unless you're taking time to coach them on the business (or if they've been working in the domain for a long period of time), engineers just don't know the business.

As such, it's difficult for engineers to speak in the ubiquitous language of the business, it's much easier to speak in the technical details. This, in turn, leads to user stories that are more technical details in nature (modify table, build service, update pipeline) instead of user focused (can set display name, can set email address).

If you're an Engineer, you need to learn the business domain that you're working in. This will help you prevent problems from happening in your software because it literally can't do that. In addition, this will help you see the bigger picture and build empathy with your users as you understand them better.

If you're in Product or Business, you need to work with your team to level up their business domain. This can be done by having them use the product like a user, giving them example tasks, and spending time to talk about the domain. If you can get the engineers to be hands-on, every hour you invest here is going to pay huge dividends when it comes time to pick up the next feature.

Wrapping Up

The next time you and the team have a feature, try experimenting with vertically slicing your stories and see how that changes the dynamics of the team.

To get started, remember, focus on the user outcomes and make sure that each story can stand independently of one another.

If this post resonated with you, I'd like to hear from you! Feel free to drop me a line at CoachingCorner@TheSoftwareMentor.com!

Mar 19, 2024
in Today I Learned
4 min read

Today I Learned – The Difference Between Bubble and Capture for Events

I've recently been spending some time learning about Svelte and have been going through the tutorials.

When I made it to the event modifiers section, I saw that there's a modifier for capture where it mentions firing the handler during the capture phase instead of the bubbling phase.

I'm not an expert on front-end development, but I'm not familiar with either of these concepts. Thankfully, the Svelte docs refer out to MDN for a better explanation of the two.

What is Event Bubbling?

Long story short, by default, when an event happens, the element that's been interacted with will fire first and then each parent will receive the event.

So if we have the following HTML structure where there's a body that has a div that has a button

<body>
  <div id="container">
    <button>Click me!</button>
  </div>
  <pre id="output"></pre>
</body>

With the an event listener at each level:

// Setting up adding string to the pre element
const output = document.querySelector("#output");
const handleClick = (e)=> output.textContent += `You clicked on a ${e.currentTarget.tagName} element\n`;

const container = document.querySelector("#container");
const button = document.querySelector("button");

// Wiring up the event listeners
document.body.addEventListener("click", handleClick);
container.addEventListener("click", handleClick);
button.addEventListener("click", handleClick);

And we click the button, our <pre> element will have:

1
2
3

You clicked on a BUTTON element
You clicked on a DIV element
You clicked on a BODY element

What is Event Capturing?

Event Capturing is the opposite of Event Bubbling where the root parent receives the event and then each inner parent will receive the event, finally making it to the innermost child of the element that started the event.

Let's see what happens with our example when we use the capture approach.

// Wiring up the event listeners
document.body.addEventListener("click", handleClick, {capture:true});
container.addEventListener("click", handleClick, {capture:true});
button.addEventListener("click", handleClick, {capture:true});

After clicking the button, we'll see the following messages:

1
2
3

You clicked on a BODY element
You clicked on a DIV element
You clicked on a BUTTON element

Why Would You Use Capture?

By default, events will work in a bubbling fashion and this intuitively makes sense to me since the element that was interacted with is most likely the right element to handle the event.

One case that comes to mind is if you finding yourself attaching the same event listener to every child element. Instead, we could move that up.

For example, let's say that we had the following layout

<div>
    <ul style="list-style-type: none; padding: 0px; margin:0px; float:left">
      <li><a id="one">Click on 1</a></li>
      <li><a id="two">Click on 2</a></li>
      <li><a id="three">Click on 3</a></li>
    </ul>
  </div>

li {
  list-style-type: none;
  padding: 0px;
  margin:0px;
  float:left
}

li a {
  color:black;
  background:#eee;
  border: 1px solid #ccc;
  padding: 10px 15px;
  display:block
}

Which gives us the following layout

Click on 1
Click on 2
Click on 3

With this layout, let's say that we need to do some business rules for when any of those buttons are clicked. If we used the bubble down approach, we would have the following code:

// Stand-in for real business rules
function handleClick(e) {
  console.log(`You clicked on ${e.target.id}`);
}
// Get all the a elements
const elements = document.querySelectorAll("a");
// Wire up the click handler
for (const e of elements) {
  e.addEventListener("click", handleClick);
}

This isn't a big deal with three elements, but let's pretend that you had a list with tens of items, or a hundred items. You may run into a performance hit because of the overhead of wiring up that many event listeners.

Instead, we can use one event listener, bound to the common parent. This can accomplish the same affect, without as much complexity.

Let's revisit our JavaScript and make the necessary changes.

// Stand-in for real business rules
function handleClick(e) {
  // NEW: To handle the space of the unordered list, we'll return early
  // if the currentTarget is the same as the original target
  if (e.currentTarget === e.target) {
    return;
  }
  console.log(`You clicked on ${e.target.id}`);
}
// NEW: Getting the common parent
const parent = document.querySelector("ul");
// NEW setting the eventListener to be capture based
parent.addEventListener("click", handleClick, {capture:true});

With this change, we're now only wiring up a single listener instead of having multiple wirings.

Wrapping Up

In this post, we looked at two different event propagation models, bubble and capture, the differences between the two and when you might want to use capture.