Written by Dave Marsland, VP of Engineering
23 Jul 2019  |  Engineering

Recent Technical Incidents and How We Dealt With Them


In Engineering at Paddle, we take reliability and uptime very seriously. Our users rely on us to always be available, and when mistakes and accidents happen it’s important to reflect on what we could have done better. In this post, I’ll look at a few recent issues caused by third-party incidents or by our own mistakes, and at what we’re doing in each case to make sure they don’t happen again. We value being open and transparent and hold ourselves to an extremely high standard, so this is the first in a series on how the Paddle Product and Engineering teams deal with incidents.

How we respond to issues

Within our Engineering team, our Software Operations Centre (SOC) is on hand to deal with the first 15 minutes of any incident, escalating the issue to the correct teams and stakeholders and managing internal communications during the outage. Our status page is the source of truth for external communications, and we use it to update our customers regularly on the state of ongoing incidents. The status page is also hooked into our monitoring services so that state changes appear automatically and immediately. After any incident, the SOC is responsible for organizing a post-mortem debrief and retrospective so we can analyze what happened, generate actions and put steps in place to reduce or remove the impact of similar incidents in the future. We strive to achieve a simple goal: no incident should occur twice.

Cloudflare - June 24 and July 3, 2019

What happened

We use Cloudflare as a proxy for Distributed Denial of Service (DDoS) protection. Cloudflare is well respected and widely treated as an industry-standard way of protecting websites, used by companies such as Zendesk, IBM and Discord.

On June 24, Cloudflare was impacted by a BGP route leak involving Verizon - detailed in their blog post - which made our checkout and dashboard unavailable externally whilst the issue was ongoing. On July 3, a misconfigured firewall rule caused issues across Cloudflare’s network - detailed in another blog post - which again caused checkout and dashboard instability.

During the first incident on June 24, it took us too long to realize that it was a Cloudflare-related issue as we saw periods of recovery followed by more instability.

The second issue, on July 3, affected Cloudflare’s customer-facing dashboard too, meaning our ability to route traffic away from Cloudflare was also unavailable.


What we did to fix the issue

Once we realized the cause of the first incident, we rerouted traffic away from Cloudflare to mitigate the issue until the network problem was resolved. The July 3 issue was more challenging because the Cloudflare dashboard we would normally use to route traffic away from them was also unavailable, which meant we needed to work through a time-consuming plan to remove our dependence on them completely.

Follow-up actions

Experiencing these two issues in quick succession highlighted that - however reliable a third party may seem - it’s vital to have plans in place to deal with their outages. We are currently evolving a plan to ensure no Cloudflare outage takes longer than 10 minutes to mitigate, minimizing potential impact as much as possible. Although these issues are rare, our customers expect us to be as prepared as possible and, in high-pressure situations, mitigating actions must be clear and quick to execute.

Safari - July 1, 2019

What happened

Whilst rolling out a fix for a JavaScript error alert we identified through our logging, we made a change that clashed with our analytics code deployed to the browser via Google Tag Manager. This caused certain versions of the Safari browser to hang when loading the checkout, resulting in a 35% drop in opened checkouts for Safari traffic. Our logging for browser-specific issues didn’t alert us to this, meaning we didn’t catch the problem until our customers started reporting it.
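As an illustration of the kind of browser-specific alerting that was missing here, the sketch below compares opened-checkout counts per browser against a baseline and flags any browser whose traffic has dropped sharply. It is a minimal sketch under stated assumptions, not a description of Paddle’s monitoring: the function name, the 25% threshold and the sample counts are all hypothetical.

```python
from typing import Dict, List


def browsers_with_drop(
    baseline: Dict[str, int],
    current: Dict[str, int],
    threshold: float = 0.25,  # hypothetical: flag drops of 25% or more
) -> List[str]:
    """Return browsers whose opened-checkout count fell by at least `threshold`."""
    flagged = []
    for browser, expected in baseline.items():
        if expected <= 0:
            continue  # no meaningful baseline to compare against
        observed = current.get(browser, 0)
        drop = (expected - observed) / expected
        if drop >= threshold:
            flagged.append(browser)
    return sorted(flagged)


# Hypothetical counts for the same time window on a normal day and today.
baseline = {"Chrome": 1000, "Safari": 400, "Firefox": 200}
current = {"Chrome": 990, "Safari": 260, "Firefox": 205}
print(browsers_with_drop(baseline, current))
```

Splitting the metric by browser matters because a 35% drop in one browser can be hidden inside an aggregate figure that barely moves.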

What we did to fix the issue

Once reported, we immediately reversed the change, restoring full Safari traffic.

Follow-up actions

On investigation, we discovered differences between our staging and production environments. These are meant to be identical copies of each other to ensure consistent testing and to reduce the number of bugs found in production. Although we have testing tools for specific browsers - and run them regularly after JavaScript changes to catch browser compatibility issues - certain Google Tag Manager snippets only apply in our production environment, and we found these to be the cause of the issue.

We are continually evaluating how we test all front-end changes cross-browser and we’re ensuring we have tools in place to regularly check that staging and production are as identical as possible.
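A regular staging-versus-production check can be as simple as diffing a description of each environment. The sketch below is one possible shape for such a check, assuming each environment can be summarized as a dictionary of settings; the keys and values shown are hypothetical and are not Paddle’s actual configuration.

```python
from typing import Any, Dict, Mapping, Tuple

MISSING = "<missing>"  # marker for a setting that exists in only one environment


def diff_environments(
    staging: Mapping[str, Any],
    production: Mapping[str, Any],
) -> Dict[str, Tuple[Any, Any]]:
    """Map each drifting setting to its (staging value, production value) pair."""
    drift = {}
    for key in sorted(set(staging) | set(production)):
        staging_value = staging.get(key, MISSING)
        production_value = production.get(key, MISSING)
        if staging_value != production_value:
            drift[key] = (staging_value, production_value)
    return drift


# Hypothetical environment summaries.
staging = {"checkout_js": "1.4.2", "gtm_snippets": []}
production = {"checkout_js": "1.4.2", "gtm_snippets": ["analytics"]}

for key, (s, p) in diff_environments(staging, production).items():
    print(f"{key}: staging={s!r} production={p!r}")
```

Run on a schedule, a check like this would surface a production-only snippet before a front-end change is tested against an environment that does not contain it.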

Delayed license delivery - June 18, 2019

What happened

Part of Paddle’s licensing platform, which is being deprecated at the end of the year, was affected by an infrastructure issue. This meant that all licenses of that type failed to fulfil. Because all license types share a single queue, the failing items blocked it and delivery of every type of license was backlogged for a number of hours.


What we did to fix the issue

Our provider, Amazon Web Services (AWS), notified us of the infrastructure issue overnight. Once we had terminated the affected instance, recovery was instant and the queue resumed processing. We also added logic to the license queue to ensure failing items are skipped in future, rather than blocking the whole queue.
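The idea of skipping failing items rather than letting them block a shared queue can be sketched as follows. This is an illustrative sketch only, not Paddle’s implementation: the `drain_queue` function, the retry limit and the "parked" list are assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain_queue(
    queue: Deque[str],
    deliver: Callable[[str], None],
    max_attempts: int = 3,  # hypothetical retry limit
) -> Tuple[List[str], List[str]]:
    """Process every item, parking repeat failures so healthy items keep flowing."""
    delivered: List[str] = []
    parked: List[str] = []  # items set aside for later inspection
    attempts: Dict[str, int] = {}

    while queue:
        item = queue.popleft()
        try:
            deliver(item)
            delivered.append(item)
        except Exception:
            attempts[item] = attempts.get(item, 0) + 1
            if attempts[item] < max_attempts:
                queue.append(item)  # retry later, behind the healthy items
            else:
                parked.append(item)  # give up so the queue keeps moving
    return delivered, parked


def fake_deliver(item: str) -> None:
    # Hypothetical delivery function: one license type has a broken backend.
    if item.startswith("legacy"):
        raise RuntimeError("backend unavailable")


delivered, parked = drain_queue(
    deque(["legacy-1", "new-1", "legacy-2", "new-2"]), fake_deliver
)
print(delivered, parked)
```

Failed items go to the back of the queue rather than being retried in place, so one broken license type delays only itself.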

Follow-up actions

We are ensuring that notifications of infrastructure issues are received by the appropriate people and that we have out-of-hours technical support available to mitigate these infrastructure issues as they happen. This issue also highlighted that our licensing alerting should be improved to differentiate between items which are stuck in queues and items we are currently working on.
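To make the distinction between stuck items and items being worked on concrete, the sketch below classifies a queue item from two timestamps. It is a minimal sketch assuming each item records when it was enqueued and when a worker picked it up; the `QueueItem` fields and both time limits are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class QueueItem:
    item_id: str
    enqueued_at: datetime
    started_at: Optional[datetime] = None  # None: no worker has picked it up yet


def classify(
    item: QueueItem,
    now: datetime,
    max_wait: timedelta = timedelta(minutes=5),  # hypothetical limit
    max_run: timedelta = timedelta(minutes=2),   # hypothetical limit
) -> str:
    """Return 'waiting', 'in_progress' or 'stuck' for a single queue item."""
    if item.started_at is None:
        # Nobody has started it: only a problem once it has waited too long.
        return "stuck" if now - item.enqueued_at > max_wait else "waiting"
    # A worker has started it: only a problem once it has run too long.
    return "stuck" if now - item.started_at > max_run else "in_progress"


now = datetime(2019, 6, 18, 3, 0)
item = QueueItem("license-42", enqueued_at=now - timedelta(hours=1))
print(classify(item, now))
```

An alert keyed on the "stuck" state alone would fire for a blocked queue without also firing for ordinary in-flight work.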

Our conclusion

Whether issues are caused by us or by our third-party partners, Paddle’s success has been built on always being available for our customers, and we continue to work hard to ensure these issues are as infrequent and low-impact as possible. We recognize when we’ve let our customers down and reflect on these occasions, pushing hard to learn and improve so that we never step on the same rake twice.
