The Site Reliability Engineering (SRE) team at Vendasta has focused on reducing the number of solutions that developers use. We’ve done this by completing new or unfinished migrations, in turn decreasing hidden tech debt.
Completing these migrations has also reduced cognitive load. Product teams no longer have as many potential solutions they could use to solve a problem, thus reducing analysis paralysis. We’ve done this by using active open-source software or cloud provider managed tools, such as Google Cloud Platform Redis Memorystore, tracking via Snowplow Analytics, and managing translations via weblate. Due to these successes, we’ve sought to better capture our process in writing. The spirit of this post is not to prescribe a process, rather these ideas should inspire you and your team to observe your own successes on the path to removing tech debt.
There are three important phases for completing migrations: planning, work, and cleanup.
What is a migration?
There tends to be some ambiguity around what a migration is at Vendasta. The definition of a migration can be as flexible as “movement from one part of something to another”. This definition is useful for our purposes because it is vague (“moving”) and doesn’t limit the scope (“something”). Here are some examples of activities that could be a “migration”:
- moving customer data into a new model
- moving to an active, open-source application from a home-grown (and unsupported) application
- moving microservices to a new cluster
- three teams are doing the same thing three different ways, agree on a single way, and delete code
Once the migration had been determined, we moved on to the next steps in the process.
The planning phase
The SRE team set aside a number of days to plan out what a successful migration would look like. This gave us adequate time to create a comprehensive plan and to also communicate necessary information to the development teams. Communication is very important to us as it’s something we’re striving to improve.
What does a successful migration need?
Based on the SRE team’s experiences, a successful migration needs:
- a summary of the scope by providing a clear declaration of what “done” and “success” looks like
- a list of the specific tasks we need to do to achieve the plan.
- a dedicated landing page for:
- task assignments
- tracking progress
- how-to guides
- slack channels
- a regular audit of the plan and refining the plan as needed
- a regular communication cadence of the progress
It was also important to us not to impose any downtime or cause undue hardship on Vendasta business or Vendasta’s development teams. The SRE team loves our customers – both our paying customers and our team of developers.
Successful outcome, definition of done
With an agile background, it seemed natural for us to start with a clear declaration of what “done” looks like. In other words, we wanted to answer the following question: “If we imagine complete success, what would that look like?” Using one of our recent migrations as an example, we made a list:
- Identify all code repos using the “old way” and schedule work to convert these repos to the “new way”
- Be able to provide help if development teams needed it. For example, plan for having an SRE member join a product team for a sprint.
- Complete brownout testing
- Archive the repository of the “old way”
- Delete Google Cloud Platform resources required for the “old way” after testing
However, before we even got started on the migration work itself, we needed to understand the level of effort.
Measuring the migration
To measure the migration, we created a script to find references to the repository to be sunset. This was a simple bash script grepping for a shared repository reference. We had to execute it across all Vendasta repositories (which at the time of writing is well over 200). A few weeks before this measurement work, one of our Infrastructure developers created repository mapper. This tool allows us to run commands on all Vendasta’s repos in a simple way. All we had to do was add the new script to the repository mapper tool, and we were off to the races! We now had a list of repositories that needed to convert their code from the “old way” to the “new way.”
Tracking, Help, and FAQs
Using the data from the last step, we created a table in our internal documentation tool. Along with the repositories, we added a few more columns, including:
- Migrated Y/N?
- Deployed to Demo Y/N?
- Deployed to Prod Y/N?
- Assigned to (usually a team or a person)
We had a list of deliverables and identified which teams would be responsible for each item. At this point, we were able to clearly see if there were teams that had a lot of repositories to convert. We were also able to identify which teams might need some extra support.
One SRE team member suggested creating a few models to support others. We offered these models to teams and let them decide how to proceed. We created both a “Do It Yourself (DIY)” model and a “Do It With Me (DIWM)” model, the latter for teams that wanted additional guidance. In this case, an SRE developer would join a team for a week/sprint and help them with the migration.
Finally, we populated a tracking page with links to references, how-tos, Slack channels, and FAQs. We also provided a migration guide “overview”—a 30,000’ picture of what your team will likely have to think about and do before writing code. “Thought before code” is a core value of Vendasta’s development process.
The work phase
After developing and scoping a plan, it was now time to commit to the execution. This section will focus on the steps used to communicate and engage with our customers.
Every Thursday at 1 pm, Vendasta hosts a one-hour time period set aside for internal tech demos and lightning talks. This is a time where members of R&D can present something cool and interesting that they’ve learned about in a structured (tech demo) or unstructured (lightning talk) way. It was during one of these tech demos that the SRE team announced our plans to complete a major migration and sunset a workflow service that was baked into a lot of business logic.
As part of our announcement and plans, we also needed to establish what a deprecation and sunset meant to Vendasta. We took the time to explain each of these terms during the announcement.
- Deprecation: supported, but don’t develop new software with it
- Sunset: it’s going away, action needs to be taken before a specific date
Before the announcement was made, we decided to pick a date that was loosely informed by the list of work needed to be done, but in truth, it was arbitrary. Despite it being mostly arbitrary, we felt that providing a date was important so that we could hold each other accountable, i.e. we wouldn’t just complete the work “whenever.” Teams were thus empowered to use the date as part of their sprint planning negotiations.
Audits and refinements, communication
Once the plan and dates were announced, it was time to get to work. SRE tracked the migration progress weekly. Initially, we had a few teams take on the migrations right away, and some weeks went by when we saw little to no progress towards completion. It was during these slow weeks that we decided to try different communication techniques in order to motivate the teams to keep the looming sunset date in mind.
We have an internal R&D “on-topic” Slack channel and this was where we chose to remind teams about the migration and sunset dates as well as where to find the migration landing page. Our CTO releases a weekly newsletter talking about Vendasta news, recent RFCs, and postmortems. It was also during this time that we decided to include an “active migration progress” section to the newsletter. We also reminded teams about the migration at a second tech demo, presented details on the progress, and further explained the DIY and DIWM models. We were sincere in our message about how well the DIWM model had gone so far — not only was the DIWM model completing the migration mission, it was also a great team-building activity and learning experience for everyone involved.
In hindsight, without progress tracking and consistent communication about planned work, the migration would not have been successful. I’ve witnessed firsthand moments when Vendasta has created a migration goal that was unsuccessful because there wasn’t a weekly touchpoint and a feedback loop open for iteration and improvement.
The cleanup phase
The day before the official (cough arbitrary) sunset date, we received word that the last migration was complete. We were able to formally announce that the teams had completed all of the migration work! While we still needed to complete the final migration steps (brownout testing and deletion tasks), the SRE team was overjoyed and ready to celebrate!
It was on a Friday morning that all teams had finished migrating their code to our new system. The SRE team felt it appropriate to announce R&D’s success at our next company all-hands meeting, which always takes place on Fridays over the lunch hour. This all-hands meeting is attended by Vendasta’s 400+ employees and it’s where we gather to align our goals and celebrate our wins week over week. I quickly put together a presentation and publicly thanked and congratulated the R&D group of approximately 100 people. Not only did R&D work together, but we completed the migration in an orderly and timely way—something almost worth celebrating more than the migration itself! The SRE team reinforced this success by putting a flag in the ground and giving everyone a pat on the back.
Complete the migration
Once the excitement of the codebase migration settled, the final steps for completing the migration were in front of us. These are often the steps that we struggle to complete, and the struggle is real. Why is this the case? For one, we’re explicitly (and permanently) deleting or removing something that has been in production for a long time; it’s hard to turn things off or to delete/archive repositories, just like it is hard to declutter and get rid of things that you think you need. Human nature, I suppose. Call me Marie Kondo, but I’m here to say that once you’ve removed the old code, old database, etc. you’ll feel a new sense of freedom (especially if you’ve done your due diligence before actually deleting it).
The first thing we needed to complete was to test that the migrations were functional and that we didn’t have any “stragglers” that weren’t found in the work definitions from the planning phase. We thoroughly tested this by actually taking down the service (brownout) for brief periods. We started with small periods (1 hour) and expanded to longer periods (1+ weeks) to ensure that we had some of the periodic workflows covered, too. After we were satisfied with the outcome, we deleted the resources.
The brownout testing went mostly without a hitch. We did notice a single discrepancy during the brownouts and the appropriate team was able to fix the issue before we had any impact on the platform. If there’s only one thing you take away from this post, let it be this: make sure to have a safety net in place before deleting resources. We’re very thankful that we decided to test the migrations before deleting the old resources.
Another word of advice: it’s a good idea to complete brownout tests initially for a few hours at a time, followed by a few days, and then even a few weeks. This will catch those weekly or monthly tasks (eg: a billing cycle) that may not show up in shorter brownout test durations.
Summary and celebration #2
All in all, the migration was complete and we couldn’t be happier with the outcome. When I say “we,” I mean Vendasta—it was really a milestone in R&D’s maturity to be able to agree and act on removing tech debt in a distributed, collaborative way.
We had a bit of an unintentional celebration over the next few weeks. There was a lot of chatter about “SRE’s migrations” (credit where credit is due, this should be “R&D’s migrations”) and how by implementing a process, we were able to complete a number of migrations all the way.
As a result, teams started to ask or consult with SRE about the process for many “migration” topics that I mentioned at the beginning of this post—moving customer data, removing tech debt, consolidating databases, etc.
To me, this was by far the biggest point to celebrate beyond completing the migration: we’re talking about removing tech debt and, when we are, we’re talking about how teams should have a plan in place to run their own successful migrations!
I’m awarding +10 experience points to Vendasta for this success! I’d also like to award an extra +10 experience points to Jenna Barth for making some great images for this blog. Thanks, Jenna!
This section provides a checklist that can serve as a starting point for your migration.
- Is there a clear definition of “done” established for a successful outcome? If not, work until that is complete.
- Measure what work needs to be completed to work towards a successful outcome. Make a list.
- if needed, establish a pattern to determine the list – this will help you audit your completion step.
- Based on the work above, create a landing page to organize everything. Have a place for:
- tracking the progress of work (put the work into a table or a similar tracking mechanism).
- the target timeline for the migration
- how-tos, help, references
- sample code
- Announce the migration in an unambiguous way. Provide details and dates, establish any definitions you need to establish, and assign the work.
- check-in every week – what is working, what isn’t working? Refine the plan if needed
- communicate frequently and in different venues – newsletter, group message, email, presentation, etc.
- test before deleting any resources
- complete a retrospective ceremony reflecting on:
- what worked and what didn’t
- capture highlights of the experience, and improving future iterations by updating your framework
- celebrate when complete