Thursday, March 26, 2009

Conficker Worm Warning

Wanted to make everyone aware of the Conficker (or Downadup) worm that's rumored to strike on April 1st. The Conficker worm disables Windows security features and can compromise an infected computer so that it can be used to attack others. The worm can also gather personal and other information.

To protect your personal and home machines:

  • Ensure virus protection and all Microsoft security patches are up-to-date on your machine. Turn on your Windows firewall or download one such as Comodo.

  • Run a malware scanning tool (such as Malwarebytes or Spybot Search and Destroy) on your system.

  • If you are currently running an unsupported Microsoft operating system (such as a trial-period OS) and/or an unsupported virus protection product, please purchase supported versions.

  • Members of the ASU community can visit MyApps for free anti-virus software (VirusScan).

To read more about this worm, visit ASU's Get Protected site or this article posted on CNET News.


Tuesday, March 10, 2009

Communicating System Outages

In my last post I promised I'd tell you about the new mechanisms ASU is putting into place to improve the University's ability to notify the community of disruption in service.

If our mail is any indicator, these kinds of notifications are important to you. When systems are going to be changed or we have outages planned, you tell us you'd like to know well in advance. When we experience unplanned outages, you tell us that you want to know what's wrong and when it'll be fixed.

We've done some things already -- our System Health page is well visited and moved us forward in terms of keeping you informed about planned outages. The emergency notices we put on My ASU have helped a lot of you know when systems are out and when we can expect them back.

But they don't always work. Sometimes when we have extreme outages, like say the Internet is unavailable or power is lost to the data center, we can't get to System Health and My ASU to update them, and even if we could, sometimes you can't get to them to find out what's going on.

So this month, right after St. Patrick's Day, we're releasing a set of new improvements that we hope will make you better informed in the event of an emergency.

First, System Health is being moved off-site. We're moving it to our Denver facility to improve its availability in those times when major portions of our infrastructure are unavailable.

Second, we're expanding our outage notifications. In addition to announcing outage information on System Health, all unplanned outages will also be announced through a notification group associated with our new ASU Alert service provided by e2Campus. This new service will allow members of the ASU community to receive a text message and/or email message whenever System Health turns red. We are pre-subscribing members of the mailing list to this service, but if you are not already subscribed, you can self-subscribe to ASU Alert. Click here for complete instructions on signing up.

Third, all planned outages and system changes will be announced through UTO's new Change Management System. For authorized users, the Change Management System provides a complete history of proposed and implemented changes. Our system was designed by the Communications Subcommittee of the UTC. Again, if you are already a member of the list, you will be pre-subscribed to the system. If you are not a member and wish to subscribe, please send a note to

We're continuing to work on reliability and we hope not to have to use these notification systems as often as we have of late. But we know that when we do have a system disruption, you want to know as much information as you can about what's wrong and when we'll be back online. We're hoping these changes we're making will help with that.

As always, your comments and suggestions are welcome, particularly the constructive ones.


Thursday, March 05, 2009

System Outages Explained

If you've been at ASU this semester, you've probably been inconvenienced by one of the five major system outages we've experienced so far this term. And as you might imagine, I've received more than a few emails from concerned members of the ASU Community expressing their frustration with being unable to access Blackboard, or My ASU or the various other services that have been affected in the past several weeks.

For example:

I am writing as an ASU graduate student, enrolled in an online 8-week, 3-hour graduate course this semester. The repeated system outages, including the one this morning -- unable to get into Blackboard ... AGAIN ... unable to get into "System Health" to find out what's going on and how long it will take to correct -- has left me frustrated beyond words.

Or this one from an ASU professor:
This week, for the second time this semester, we have had large numbers of students unable to log in to Blackboard in order to take their weekly quiz. The problems seem to have been intermittent, but at a rough guess I would say that about 20 percent of the class has been affected.

I'd like everyone at ASU to know that here at the UTO, we all understand the frustration and lost productivity that these outages cause, and on behalf of myself and our team I apologize for the poor system performance we've had this term. This is not the level of service that you have come to expect from ASU, and none of us are satisfied with our performance so far this term.

Given how central information technology is to the life of the University, we know that anything short of 100% uptime has become unacceptable. Our team works hard towards that goal, and we treat every system outage as a critical event. So we feel very keenly the amount of disruption that recent system instability has caused this term and we're committed to correcting it.

Since January 1st, we've had five major incidents:

  • Incident 1: On the afternoon of January 21/22, DARS degree audits and course catalog searches were unusually and inconveniently slow.

  • Incident 2: From the evening of Feb. 1st to the morning of Feb 3rd, Internet access from ASU was unreliable.

  • Incident 3: On February 5th, wireless access in a portion of the Tempe campus was unavailable. There was also a one hour interruption to My ASU and Citrix.

  • Incident 4: For two hours on the morning of February 16th, Internet access from ASU was unreliable.

  • Incident 5: On February 25th, many of ASU's Web services were inaccessible for most of the day.

To a user experiencing these issues -- even one who regularly visits our System Health site for updates and information -- it must seem that all these outages arise from some common cause. Our graduate student writes:
What's even more troubling is that no one can explain what the problem is, or why it continues to happen, or what the plan is to get the University beyond this. Additionally, there's been no communication on Blackboard or via email to students with any information.

The professor understands that people are working hard, but makes it clear that we can't confuse effort with results:
I have generally found your staff to be knowledgeable and hardworking, so I am sure that this down time has been hard on them. I have no idea what the technical problems are behind these time out and login errors. But these issues really do create havoc in this kind of course and make it impossible to keep to the weekly class schedule.

As tempting as it is to think all these incidents are manifestations of the same thing, each of the 5 incidents we've experienced this term has arisen from a different proximate cause:

  • Incident 1 arose from an unusually large demand for DARS and Catalog services, a demand far greater than in prior terms.

  • Incident 2 was caused by a latent defect in our border firewall, which a rogue server successfully exploited to overload the ASU network. Together with Cisco engineers, our team worked around the clock to identify and correct the problem, but only after many hours of interruption.

  • Incident 3 was the result of a flood in BAC.

  • Incident 4 was caused by a mistake made by ASU's Internet service provider that affected all of its clients.

  • Incident 5 was caused by a failure in the UPS backup system that protects the systems in the University data centers in the event of power outages. The resulting hard reset of the systems in the data center made service restoration complex -- it took more than 8 hours to rebuild the systems and restore services.

But even if each incident has a separate technical cause, surely all of them are evidence of incompetence? With all these problems, surely it's a sign that the people running these systems don't know what they are about?

It's certainly a sentiment that more than one writer has communicated as an outgrowth of their frustration. And it's understandable. But understandable or not, it isn't true.

Having witnessed firsthand the level of dedication shown by the ASU technical professionals who design and maintain our systems and who scramble out of bed in the middle of the night or spend their weekends responding to the equipment failures, power outages, floods, denial of service attacks and hundreds of other failure modes that information systems are heir to, one thing I can assure you is that at ASU we don't lack for competent, dedicated, hard-working people committed to providing the information services the University requires.

Unfortunately, despite the dedication and skill of our systems people, to quote the t-shirt, "Stuff Happens." And when it happens, we are either in position with redundant equipment and services that allow us to recover without interruption -- or, if not, we have to rely on people to put the pieces back together again. While our investments in redundant equipment, our 24x7 monitoring, and the expertise and diligence of the ASU staff that oversee our systems successfully protect us from many different sources of outage, we remain vulnerable to system failures along many dimensions.

Over the long run, our overall systems performance is between 2 and 3 "nines." That means they are available a little more than 99% of the time. Sounds pretty good until you realize that 1% of a year is more than 87 hours. And that doesn't count planned outage windows. Include those and the number of off-line hours gets even worse. 90 hours a year may not seem like much, but if one of those hours is when you have a class that needs to take a quiz on Blackboard, it's completely unacceptable.
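For readers who'd like to check the arithmetic behind the "nines," here is a minimal sketch in Python. The function name and the 365-day year are assumptions made for illustration, and the calculation covers unplanned downtime only; it is not drawn from any ASU monitoring system.

```python
# Illustrative only: converts a number of "nines" of availability
# into the hours of downtime that level allows in one year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(nines: int) -> float:
    """Return the yearly downtime, in hours, permitted at `nines` nines.

    Two nines means 99% availability, three nines means 99.9%, and so on,
    so the unavailable fraction of the year is 10 ** -nines.
    """
    unavailable_fraction = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in range(2, 6):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Run as a script, this prints one line per level from two to five nines. Two nines works out to roughly 87.6 hours a year, while five nines leaves only about five minutes.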

So what does it take to achieve higher levels of reliability? What do we have to do to move our systems from 2 "nines" to 5?

The answer, unfortunately, lies in additional investments, which is not what anyone wants to hear during the present economic situation. ASU's primary data centers are more than 25 years old, and while they have served the University well, they were built for a day when IT was a luxury, not a necessity. We've helped the situation over the years, with some strategic investments and by working with strategic partners like Google and CedarCrestone. Because data services are our partners' core business and they operate at scales much greater than ours, they've helped us increase our levels of reliability. We've also migrated some of the services we run ourselves from our older data centers to some of ASU's newer facilities. But in doing so we've had to be conservative in our spending, moving gradually over time as hardware ages, to consolidate servers and storage and simplify their delivery.

And up until this term, we'd been pretty lucky. Term over term, our reliability was increasing. But clearly, this term, our lucky streak has run out.

I want you to know we're doing more than waiting for our luck to change.

The president has challenged UTO to quickly put a plan together to get us a couple more "nines" of reliability. We are hard at work on that plan now. It will suggest accelerating migration out of the oldest data centers, moving additional services to strategic partners, and improving power redundancy and backup. And obviously, it will be constrained by the realities of our fiscal situation. But I'm confident we will get the support we need to make things better for our community.

In my next post, I'll tell you about some systems we're going to release this month to improve our ability to communicate with you during those times when the information systems are having trouble. We've learned a lot from our recent history and I think we have a good plan to make things better.

As always, we're interested in your comments on these issues, and any others. At the UTO, we're committed to overcoming these recent issues and steadily improving system reliability. We're sincerely sorry for the inconvenience when things don't work right.

Thanks for your patience. We don't take it for granted.