Don’t break the internet.

What companies can learn from the Floyd Mayweather vs. Logan Paul boxing event, aside from how to make it through eight rounds with one of the greatest boxers of all time.

By: Ashley Einerson, Head of New Business and Account Strategy, and Jesse Manning, Principal Software Engineer at Mondo Robot.

Whether you’re a boxing fan or not, one thing you can take away from the recent exhibition between one of boxing’s greatest showmen and one of YouTube’s most-watched stars is how important it is to ensure your websites, mobile applications, and platforms can handle heavy blows.

Last night, a highly publicized event aired via Showtime, and it left many wanting. The app crashed, leaving most viewers unable to watch the bout and new users unable to even buy it. I sat down with Jesse Manning, Principal Software Engineer at Mondo Robot, to ask why platforms crash, why apps break, and what businesses can do to ensure their digital products don’t get knocked out by the pressure of heavy traffic.

Image Source: Logan Paul Instagram

Q: Last night the Floyd Mayweather vs. Logan Paul fight crashed on Showtime. Can you share why applications and websites crash?

There are a lot of reasons a site can crash or become unresponsive: large spikes in traffic, DoS (denial-of-service) attacks, degraded or insufficient resources, and errors in application logic, just to name a few.

Q: What kind of traffic do you think could take down a site/application like Showtime?

Large traffic spikes, either benign or malicious, could have contributed to the issues. Insufficient scalability or redundancy could also have played a role in the site going down.

Q: What can companies do to prevent it from happening?

There are a lot of different mitigation techniques depending on the architecture.

It starts with designing a modular system with redundancy, scaling, and observability built into all layers. A modular system allows for a separation of concerns: each piece is purpose-built and independent, so a failure in one area has limited impact on the others.
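
As a rough sketch of that failure-isolation idea in code (the service name, URL, and timeout below are illustrative assumptions, not a prescription), a web tier can put a hard deadline on each downstream call so that one slow dependency degrades a single feature instead of hanging the whole page:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchRecommendations calls a hypothetical downstream service with a hard
// deadline. If that service is slow or down, the caller gets an error quickly
// and can fall back, instead of the whole request stalling behind it.
func fetchRecommendations(ctx context.Context, userID string) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://recommendations.internal/v1/users/"+userID, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err // timed out or unreachable: fail fast
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	http.HandleFunc("/home", func(w http.ResponseWriter, r *http.Request) {
		recs, err := fetchRecommendations(r.Context(), "42")
		if err != nil {
			recs = "[]" // degrade gracefully: render the page without this widget
		}
		fmt.Fprintf(w, "home page; recommendations: %s\n", recs)
	})
	http.ListenAndServe(":8080", nil)
}
```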

Observability and the ability to quickly and/or automatically scale resources are key when building a system to withstand large traffic. Load testing can help identify the most resource-intensive areas of the platform and guide the architecture around scaling. Additionally, adding caching and DoS protection where applicable and splitting the platform appropriately can also improve resiliency.
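
To make one slice of that concrete, here is a minimal sketch of per-client rate limiting, one common form of DoS protection, using Go's golang.org/x/time/rate package. The per-IP keying and the specific limits are assumptions for illustration; real platforms typically layer this with protection at the CDN or load balancer as well:

```go
package main

import (
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// limiters holds one token-bucket limiter per client IP. An unbounded map is
// fine for a sketch; production code would evict idle entries.
var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

func limiterFor(ip string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[ip]
	if !ok {
		l = rate.NewLimiter(rate.Limit(10), 20) // ~10 req/s with a burst of 20 (illustrative)
		limiters[ip] = l
	}
	return l
}

// rateLimit rejects clients that exceed their budget before the request
// ever reaches expensive application logic.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, _ := net.SplitHostPort(r.RemoteAddr)
		if !limiterFor(ip).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", rateLimit(mux))
}
```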

For pre-launch, launch, and ongoing maintenance efforts, having an on-call team rotation, good documentation of the architecture, and mitigation-procedure checklists can help catch issues before they cause large sitewide problems. It’s also a good idea to run dry-run scenarios before launch to test the resiliency of the platform and the mitigation procedures.

Q: What goes on behind the scenes to fix problems like this when they happen?

These platforms are likely large and complex and involve several third-party services. In most cases, teams evaluate the platform using their various metrics to determine the root cause(s). Sometimes this is enough to move forward, but other times the problem is more nuanced and not immediately clear. The complexity often means it is not as simple as scaling resources to fix the issue, and it may take time and more investigation to see whether a fix actually solves the problem. Additionally, once the system is degraded and still receiving heavy traffic, it can be difficult to recover quickly.
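
That recovery problem is why many platforms add load shedding: cap the work in flight and reject the excess immediately, so the requests you do accept still finish quickly. Here is a minimal sketch (the capacity number is an illustrative assumption):

```go
package main

import (
	"net/http"
	"time"
)

// shed admits at most `capacity` concurrent requests; everything beyond that
// gets an immediate 503 instead of queueing. Failing fast keeps latency sane
// for the requests that are admitted, which is what lets a degraded system
// work its way back to healthy.
func shed(capacity int, next http.Handler) http.Handler {
	slots := make(chan struct{}, capacity)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}:
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default:
			w.Header().Set("Retry-After", "5")
			http.Error(w, "server busy, try again shortly", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(100 * time.Millisecond) // stand-in for real work
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", shed(500, slow)) // at most 500 requests in flight (illustrative)
}
```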

Most likely, their team was trying to identify the root cause using metrics. Based on that investigation, they probably put together a plan with multiple options weighed by time to implement, risk, size of the change to the platform, and likelihood of working. They would then choose a path with input from the team, implement the solution, and continue to monitor metrics for changes.

Q: What type of monitoring should apps and websites have in place?

This is heavily dependent on the type of application being built. In general, system- and application-level performance monitoring, as well as error monitoring at both the application and resource levels, are good ideas.
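
As a hedged sketch of what application-level performance monitoring can look like in code, here is basic request instrumentation with the Prometheus Go client; the metric names and label choices are assumptions for illustration, not a standard:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counts requests so traffic spikes show up on a dashboard.
	requests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "http_requests_total", Help: "Requests by path."},
		[]string{"path"},
	)
	// Tracks latency so degradation is visible before a full outage.
	latency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{Name: "http_request_duration_seconds", Help: "Latency by path."},
		[]string{"path"},
	)
)

func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		requests.WithLabelValues(r.URL.Path).Inc()
		latency.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
	})
}

func main() {
	prometheus.MustRegister(requests, latency)

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	mux.Handle("/metrics", promhttp.Handler()) // scraped by the monitoring system

	http.ListenAndServe(":8080", instrument(mux))
}
```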

For large distributed platforms with multiple resource layers and autonomy built into the platform, observability is a daunting task, as tracing a request through all the different pieces is difficult. Ideally, each piece of the architecture has system-level monitoring for performance, errors, and other metrics, and is also integrated into a larger distributed tracing system that allows requests to be tracked across the platform. Data from all these levels of monitoring should then be aggregated into dashboards for ease of viewing.
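
Here is a small sketch of what that tracing integration can look like at the application level, using the OpenTelemetry Go API. The service and span names are hypothetical, and the exporter/SDK setup that actually ships spans to a tracing backend is omitted:

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// chargeCard wraps one unit of work in a span. When every service does this
// and propagates the context, a single purchase can be traced end to end
// across the platform.
func chargeCard(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("checkout-service") // hypothetical service name
	ctx, span := tracer.Start(ctx, "charge-card")
	defer span.End()

	span.SetAttributes(attribute.String("order.id", orderID))

	if err := callPaymentProvider(ctx, orderID); err != nil {
		span.RecordError(err) // the failure shows up on the trace, not just in logs
		return err
	}
	return nil
}

func callPaymentProvider(ctx context.Context, orderID string) error {
	// Stand-in for a real downstream call; it would start its own child span.
	_ = ctx
	if orderID == "" {
		return errors.New("missing order id")
	}
	return nil
}

func main() {
	// NOTE: without an SDK/exporter configured, the spans above are no-ops;
	// wiring up an exporter (OTLP, Jaeger, etc.) is deployment-specific.
	_ = chargeCard(context.Background(), "order-123")
}
```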

Q: At what point should an app or website think about its ability to scale?

Scaling should be thought about from the beginning. Understanding the need to scale is very important in architecture design and resource estimation. Scaling is very dependent on the type of application or platform being built, and there is a good chance that not all layers have the same scalability requirements. Planning for this up front, and testing for resiliency along the way, will help in building a better overall platform.
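
One way to plan for scale up front, sketched below, is to keep the web tier stateless: session state lives behind an interface that a shared store (Redis, a database) would implement in production, so more replicas can be added behind a load balancer without sticky sessions. The endpoint, header name, and in-memory store here are all illustrative assumptions:

```go
package main

import (
	"fmt"
	"net/http"
)

// SessionStore abstracts session state out of the web tier. In production
// this would be backed by a shared service so any replica can serve any
// request; the map version below is just a stand-in for the sketch.
type SessionStore interface {
	Get(id string) (string, bool)
}

type memoryStore map[string]string

func (m memoryStore) Get(id string) (string, bool) { v, ok := m[id]; return v, ok }

func main() {
	var store SessionStore = memoryStore{"abc123": "jane"}

	// Because the handler holds no per-user state itself, this binary can be
	// scaled horizontally behind a load balancer as traffic grows.
	http.HandleFunc("/whoami", func(w http.ResponseWriter, r *http.Request) {
		user, ok := store.Get(r.Header.Get("X-Session-ID"))
		if !ok {
			http.Error(w, "unknown session", http.StatusUnauthorized)
			return
		}
		fmt.Fprintf(w, "hello, %s\n", user)
	})
	http.ListenAndServe(":8080", nil)
}
```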

Q: Anything else you’d like to share?

These problems are hard and have a lot of moving parts. While there are many ways to mitigate risk, there are usually constraints (time, budget, etc.) that also factor into the final platform. Strong planning up front and an appropriate architecture go a long way toward setting a project up for success, and extensive testing helps minimize risk. The hope is that you’ve built a solid foundation so that unforeseen issues can be investigated and responded to quickly, mitigating downtime.

In Conclusion: There is a lot that goes on behind the scenes when designing, developing, and supporting digital products, so it’s important to have the right team and partners in your corner. Investing wisely and mitigating risk while navigating constraints is a difficult feat, almost as difficult as a YouTube star standing toe-to-toe with one of the greatest. Hopefully this helped shed some light on how things like this can happen, and on the importance of planning, investing in your product, and being able to roll with the punches when challenges arise.
