RIM Outage Explanation Leaves Big Questions

RIM did more dancing around the issues than frank sharing as it tried to explain the BlackBerry outage--leaving CIOs to speculate. And goodwill's running short.

Jonathan Feldman, CIO, City of Asheville, NC

October 13, 2011

After an almost four-day outage of RIM's BlackBerry service, RIM's co-CEOs gave a status update Thursday morning. Mike Lazaridis delivered what appeared to be a prepared statement, followed by questions, largely from the media. The way that RIM reacted to the outage will likely shape the company's fortunes for the foreseeable future. And on the key issue, the future health of RIM's network and its ability to scale, too many questions went unanswered.

Lazaridis started out with an apology and something of a promise. "You expect better of us, I expect better of us," he said. "We are, and will take every action feasibly, to minimize the risk of this happening again."

Apparently, one switch's failure, combined with a bonked-up backup system, had such a tremendous "ripple effect" that it caused a worldwide outage for days.

The question that many CIOs and CTOs are asking is this: If the architecture was planned out right, and testing occurred on a reasonably diligent basis, how exactly could that happen?

[ For more analysis, see BlackBerry Service Outage Spells RIM Doom. ]

Lazaridis says that "root cause analysis" is still ongoing. In his words: "A dual, redundant, high-capacity core switch designed to protect the core infrastructure failed." This apparently caused outages and delays in Europe, the Middle East, Africa, India, Brazil, Chile, and Argentina. "This caused a cascade failure in our system. There was a backup switch, but the backup did not function as intended and this led to a backlog of data in the system. The failure in Europe in turn overloaded systems elsewhere. When we restarted the system based in Europe, the queue processing took longer than expected."

This, in turn, caused service outages everywhere else, including the United States.
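A back-of-the-envelope sketch suggests why "the queue processing took longer than expected" is at least plausible. RIM disclosed no figures for backlog size or throughput, so every number and name below is a hypothetical chosen for illustration, not a description of RIM's systems. The point is only that a backlog drains at the rate of spare capacity (service rate minus the rate of new arrivals), not at the full service rate.

```python
import math


def drain_time_hours(backlog_msgs, arrival_per_hour, service_per_hour):
    """Hours needed to clear a backlog while new messages keep arriving.

    All inputs are hypothetical; RIM published no such figures.
    The backlog shrinks only by the spare capacity each hour.
    """
    spare_capacity = service_per_hour - arrival_per_hour
    if spare_capacity <= 0:
        # With no headroom the queue never empties.
        return math.inf
    return backlog_msgs / spare_capacity


if __name__ == "__main__":
    # Hypothetical: 1,000,000 queued messages, 90,000 arriving per hour,
    # and a system able to process 100,000 per hour.
    print(drain_time_hours(1_000_000, 90_000, 100_000))
```

With only 10% headroom in this made-up scenario, the backlog takes ten times longer to clear than the raw processing rate would suggest, which is the sort of detail a root cause report would need to address.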

Lazaridis took pains to point out that RIM tests systems on a regular basis. He cited a 99.97% service level over the past 18 months, and promised that RIM is doing everything in its power to minimize the risk of a recurrence. Specifically, RIM will work with the vendor to "correct the particular failure mode in the switch that occurred Monday," audit the infrastructure, and continue its root cause analysis, he said.
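It helps to put the 99.97% figure in concrete terms. The sketch below is simple arithmetic, not RIM's own calculation: it assumes a 365-day year and treats downtime as a total outage for every user, which the actual incident was not. The function names are made up for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignores leap days


def period_hours(months):
    """Length of a reporting period in hours."""
    return months / 12 * HOURS_PER_YEAR


def downtime_budget_hours(availability_pct, months):
    """Hours of downtime consistent with a given availability percentage."""
    return period_hours(months) * (100 - availability_pct) / 100


def availability_after(downtime_hours, months):
    """Availability percentage implied by a given number of down hours."""
    return 100 * (1 - downtime_hours / period_hours(months))


if __name__ == "__main__":
    # The 99.97% over 18 months figure is from Lazaridis's statement.
    print(round(downtime_budget_hours(99.97, 18), 2))
    # Hypothetical worst case: treat four full days (96 hours) as total downtime.
    print(round(availability_after(96, 18), 2))
```

Under those assumptions, 99.97% over 18 months allows roughly 3.9 hours of downtime in total. Counting a full four days as downtime over the same period would imply about 99.27%, so the size of the gap depends heavily on how RIM measures a partial, regional outage.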

When asked what vendors were involved, RIM was cagey, saying that the company had a multi-vendor infrastructure and that it was too soon to start talking about vendors.

During the call, one analyst asked how the European failure could have cascaded the way that it did, and specifically asked whether RIM only had two operating centers. Jim Balsillie, co-CEO, jumped in and said that "it happened exactly the way Mike described it." Which, of course, didn't really answer the question.

Other great questions lobbed on the call weren't answered in ways that gave listeners great confidence.

Was it definitely a hardware failure? "We don't know why it failed the way that it did," Lazaridis said. RIM seemed clear that the outage was NOT preceded by changes in hardware or software, but didn't elaborate.

And, given RIM's layoffs, it was natural for some listeners to question whether the reliability of the infrastructure had been compromised by key staff departures. After all, once the hammer starts coming down, your best folks start leaving. The answer to this, too, was "no." But "the team that manages emergency ops is a highly skilled team that manages this, this would not have affected them," isn't exactly a resounding answer as to retention policies that might prevent a mass exodus of engineers.

CIOs listening to these answers will notice what was said--and what was not. The stakes could not be higher for RIM, nor the timing worse, as InformationWeek's Fritz Nelson noted yesterday.

RIM's customers know that reliability, management tools, and security strength are what RIM has to offer enterprises now. RIM's not competing on features.

There's no doubt that more information will be forthcoming in the days to come. But for now, serious worries remain about the architecture, testing procedures, and other aspects of RIM's venerable and once-mighty data service.

Jonathan Feldman is a contributing editor for InformationWeek and director of IT services for a rapidly growing city in North Carolina. Reach him at @_jfeldman.

About the Author

Jonathan Feldman

CIO, City of Asheville, NC

Jonathan Feldman is Chief Information Officer for the City of Asheville, North Carolina, where his business background and work as an InformationWeek columnist have helped him to innovate in government through better practices in business technology, process, and human resources management. Asheville is a rapidly growing and popular city; it has been named a Fodor top travel destination, and is the site of many new breweries, including New Belgium's east coast expansion. During Jonathan's leadership, the City has been recognized nationally and internationally (including the International Economic Development Council New Media, Government Innovation Grant, and the GMIS Best Practices awards) for improving services to citizens and reducing expenses through new practices and technology. He is active in the IT, startup and open data communities, was named a "Top 100 CIO to follow" by the Huffington Post, and is a co-author of Code For America's book, Beyond Transparency. Learn more about Jonathan at Feldman.org.