Turn Downtime Into a Communication WIN

by Cindy Alvarez June 3rd, 2009 

Uh-oh, something's down. Emergency procedures at ProductoSoft spring into action. Pagers go off. The QA team works to reproduce the error. Engineers identify who can fix it, and those folks push away from the dinner table and log in to the company VPN.

Within an hour, the problem is identified. Within three hours, a fix has been coded and QA is testing it. Within five hours, the fix has been released. Customer service has been notified that some customers may need to log out and log back in to see the updated changes.

The ProductoSoft CEO, who just got off a plane, skims through fifty urgent email messages to the final one – "the all-clear" – and smiles: downtime stinks, but everyone did what they were supposed to.

Meanwhile — What Happened Outside

6:53pm: First post to Twitter.

6:55pm: Followups start coming in on Twitter.

7:02pm: Followups start getting less polite, and the #ServeProFAIL hashtag is invoked.

7:15pm: It may not be peak business hours in the US, but it's peak Internet time. The customer service wait isn't long (less than 5 minutes), but people are used to Internet time. They hang up and do a Google search. But Google hasn't mastered real-time search yet: all that comes up is the official homepage and a couple of recent press releases.

8:12pm: Most of the key ProductoSoft customers have listened to their voicemails or read their emails, so they know there is an outage and that it's being worked on. They're understanding of the problem (everyone has downtimes) but silent about it.

9:03pm: An angry customer who runs the popular Joe's Blog posts a transcript of his call with customer support. The ProductoSoft customer service rep didn't do anything wrong, but the customer feels she isn't taking him seriously, and that's how he frames the write-up on his blog.

9:12pm: Fifty people retweet the customer service transcript from @thisisjoesblog, reaching over 11,000 people.

12:04am: The fix has been released, but no one who might announce it is still awake.

7:16am: It's Friday morning, and customers are trying to use ProductoSoft ServePro. However, only the people who logged out and back in have gotten the bug fix. The others are still seeing the same error.

7:56am: Over 100 people have posted complaints on the ProductoSoft Facebook Page, which an intern created two months ago. Monitoring it is no one's direct responsibility, so all those posts sit unanswered.

9:02am: #ServeProFAIL is the top trend on Twitter. Two major tech blogs have published posts about the downtime and how "no one seems to know what's happening – apparently the official position from ProductoSoft is that 'it's fixed', but we have dozens of people who are still reporting issues." They post screenshots of Twitter searches and the Facebook page: all customer complaints, no ProductoSoft responses.

At 9:08am, the head of marketing walks into the building, coffee in hand, and is immediately called into a conference room. She and her team scramble to write an official explanation and call the vacationing CEO for sign-off so they can post it on the corporate website. Damage control is in full swing. Instead of a leisurely Friday before the long weekend, marketing, account managers, and product managers spend the day on phone calls doing spin control and walking customers step-by-step through the fix.

How It Could Have Been Handled

6:59pm: Once the DBA who discovered the problem has sent out internal notifications, she logs on to Twitter and posts a standard message: "ProductoSoft #ServePro is experiencing an outage." She then runs a search and responds directly to @DCsalesguy and the other customer who had already tweeted.

7:03pm: The head of customer service emails all customer service reps, instructing them to tell customers that the service is down and that engineering is working on a fix. They can reassure customers who ask that this outage is not due to hackers and that no data has been lost. This information is also posted on the customer support forum, which is regularly monitored.

7:14pm: A handful of ProductoSoft employees (product managers, engineers, account managers) have logged on to post replies on Twitter, the ProductoSoft customer support forum, and the ProductoSoft Facebook page. A product manager has posted his individual work email address and asked customers who are experiencing problems other than the ones he already knows about to email him directly.

7:49pm: Now that the cause of the problem has been identified, an engineer quickly writes up a short, factual description. It's posted on Twitter and forwarded internally, where other employees email it to customers, repeat it on customer service calls, and post it to all of the places where the public is discussing the issue.

8:18pm: Over 100 people have retweeted the ProductoSoft response, reaching over 38,000 people. Customer responses are mostly factual and helpful, with only a few "angry" tweets.

10:14pm: Even though the fix isn't ready yet, the QA engineer posts an update to Twitter saying that the fix is being tested.

12:04am: The fix is released. The QA engineer posts that customers must log out and then log back in to get the fix. He thinks about it for a minute and adds a second post: "If you're still seeing errors after logging out and logging back in, email me at gibbs@productosoft.com #ServePro". He emails a quick bullet-point description of what happened to the internal team and goes to sleep.

8:00am: The ProductoSoft product manager "cleans up" the engineer's report to make it more "customer-friendly" and posts it to the corporate website.

At 9:08am, the head of marketing enters the building and logs in to find a blog post forwarded to her email. A popular blogger has written about how well ProductoSoft handled the outage. "No one likes a downtime," he writes, "but by being transparent about what was going wrong and how it was being fixed, providing multiple updates, and employees volunteering to personally be contacted, ProductoSoft showed that they valued their customers."

3 E's

The rules of communication have changed. Getting it right isn't hard, but it requires giving up some control and applying the 3 E's: educate, empower, example.

Educate: All employees need to understand that whatever they write gets indexed by Google, forwarded, and reposted. Show them what Google Alerts and Twitter Search can uncover!

Empower: With that in mind, though, they need to feel like it's okay – that it's their job – to make factual statements or try to help people without having every word approved.

Example: Not everyone is a good writer. Provide bad examples ("oh, crap, the ServePro DB is fried") and good examples ("Looks like the ServePro database is down – looking for the problem now").

In the last few weeks, several major companies (Amazon, Google, Twitter, Facebook) have had downtimes, product issues, or bugs that resulted in runaway customer complaints and bad publicity. There's no way they could have controlled the conversation about these issues, since there are too many people and too many disparate sources of information, but they all fell into at least some of these "Don'ts".

The "Don'ts" of Communication

  • Attempt to trivialize the problem or dismiss it as only affecting certain users
  • Wait until you know exactly what's wrong to acknowledge that there's a problem
  • Ignore factually incorrect statements made by customers
  • Omit customer support numbers and email addresses because you worry you can't keep up with responding to them anyway
  • Wait for a "qualified" person like your PR coordinator or an exec to make an "official" response
  • Argue with negative customers or dispute customer bug reports
  • Claim that "most of our users prefer this change" instead of acknowledging that it has negative impact on others
  • Make statements that contradict previous statements made by the company
  • Sound defensive
  • Claim that the problem has been resolved without asking for customers to report any additional issues
  • Claim that the problem has been resolved without acknowledging that you made a mistake and explaining how you will repair it and why it will not occur again

It's a hard list, and abiding by it can be painful to the ego. But most companies can't afford to not be brutally transparent and honest. Google survives bad press. But can you?

Cindy Alvarez is serious about launching great products. She blogs about product management and user experience at The Experience is the Product and runs the Smarter Product Managers book club.

Thanks to netmeg, amabaie and garrettfrench for populating the #serveprofail tag!

4 Responses to “Turn Downtime Into a Communication WIN”

  1. David Locke says:

    When people resolve the problem, they think they are done. Nope. Now, you need to fix the process that allowed the problem to be expressed on the production server.

    In the software world we call QC QA, so we never have to do QA. QA, quality assurance, isn't about fixing the bug, it's about fixing the cause of the bug, which is always a process. QC is finding the bug.

    Maybe customers shouldn't stop talking about a bug until they are reassured that QA has been done to the underlying processes.

    The Twitter fails seem constant. They originate from a wide variety of issues/bugs, some deliberate. The fix does let us get on with our work, but the constant fails don't let us trust. Trust is harder to achieve than random uptime. Trust requires quality assurance in the manufacturing sense of the word.

  2. hendro says:

    Thanks for writing, I really enjoyed your latest post.

  3. [...] Social media is a fantastic way to build connections and awareness, but you need to know how to use it to handle the bad as well as the good, and this series of cause-and-effect examples can show you [...]

  4. Cool Gifts says:

    I really think your "Don'ts" are spot on. As I was reading through those, I could think of exact examples where each of those have happened. I really believe it's important for companies to swallow their pride and admit that there was a problem. In times of crisis, it usually makes customers more comfortable when the company is as transparent as possible.