ISP Experiences with DDOS attacks

Steve Gibbard

Steve Gibbard Consulting – http://www.stevegibbard.com

Originally presented at PAIX Peering Forum, Palo Alto, CA, December 11, 2003

Introduction

 

I’ve been working in the Internet industry since 1996, originally handling everything network related for a small ISP, and more recently working for Digital Island, which became part of Cable and Wireless, where I probably became far too specialized in issues that had nothing to do with security.  When I left there in December, I decided to try consulting rather than looking for a full-time job.  One of the first consulting gigs I got was for a “dedicated server” hosting startup that was just getting big enough to start running into network issues.  Since then, they’ve more than tripled in size, which has brought all sorts of scaling challenges that didn’t exist in my previous small ISP role, and that would have been somebody else’s job at the large hosting company and telco where I spent the previous few years.

 

Perhaps the hardest challenge there has been dealing with DDOS attacks.

 

So, what are DDOS attacks?

 

As you’ve probably figured out by now, DDOS stands for “distributed denial of sleep,” or occasionally “distributed denial of service,” which is basically a fancy way of saying somebody’s sending your network more data than it can handle, from a distributed source.  Generally this gets targeted at a specific host, but often the real damage that gets done is to the network, somewhere before the attack traffic gets to the host.

           

When I first started dealing with this sort of thing it generally took the form of “smurf” attacks, in which a ping with its source address forged to be the victim’s would be directed at a network broadcast address, drawing replies to the victim from several hundred hosts, all on the same network.  These were relatively easy to defend against because you could look at the source addresses on the incoming floods, see that it was all one network, and block traffic from that network block.  Maybe you could even call whoever owned the “amplifier” network and ask them to block directed broadcast traffic.  Anyhow, those attacks were limited to whatever connectivity the network being used as an amplifier had, so their effectiveness was somewhat limited.

 

The current generation tends to be different.  Rob Thomas frequently tells stories of networks of tens of thousands of “owned” Windows boxes, which would all be sending tiny amounts of data at somebody their “owner” didn’t like.  When multiplied by 50 or 100 thousand, the small amount of traffic from each source can turn into hundreds of megabits.  Since it’s coming from all over the place, it’s hard to block based on source addresses.  Blocking based on the destination is generally easier and more effective, although it also makes the attack 100% effective against its intended target.
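
As a rough illustration of that multiplication, here is a minimal Python sketch.  The per-host rate of 4 kb/s is an assumed figure chosen for the example, not a measurement from any of these attacks, and the function name is hypothetical.

```python
def aggregate_mbps(hosts: int, per_host_kbps: float) -> float:
    """Total attack rate in Mb/s when `hosts` sources each send `per_host_kbps`."""
    return hosts * per_host_kbps / 1000.0


# Assumed per-host rate: 4 kb/s, small enough to go unnoticed on each source.
for hosts in (50_000, 100_000):
    print(hosts, "hosts ->", aggregate_mbps(hosts, 4.0), "Mb/s")
```

Under that assumption, 50,000 sources add up to 200 Mb/s and 100,000 sources to 400 Mb/s, which is already several times the 100 Mb/s uplink described in the next section.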

 

What’s the problem when attacks happen?

 

Generally, an attack is aimed at a specific host IP address.  Often the attack traffic is too big, and ends up taking out the network in front of the host instead.  The ISP I’m talking about now had a 100 Mb/s pipe to the Internet when I started working for them, and a single Cisco 7206 VXR router, the same sort of router as we were using in the simulation.  An attack of only 50 Mb/s, which seems to be a small attack these days, would use up more than the total of their unused bandwidth, and cause a slowdown.  I suppose we can say they were comparatively lucky – even a few years ago, a DS3, at 45 Mb/s, was considered a really large connection for a small ISP.

 

They’d see an attack come in and the whole network would grind to a halt.  Pretty soon my phone would start ringing, and I’d have to get in and tell them what was going on.  But there was a problem with this too, since the router’s CPU would get saturated as well.  We’d be trying to get in and figure out what was going on with the router, and it just wouldn’t respond.  This made troubleshooting really difficult, since if your router goes completely unresponsive you lose a lot of diagnostic information.  It was really hard, for example, to tell the difference between an attack and a router crash.

 

In many cases, the easiest way to solve a capacity-related problem is to throw more capacity at it, and that’s been largely true in this case.  Partly just out of a need to solve this problem, and partly because the company’s been growing and more capacity was needed anyway, the network has gotten a lot bigger.  The upstream connectivity has gone from being 100 Mb/s of Ethernet to multiple OC-12s.  The routers the OC-12s plug into are now GSRs.  An access layer got created out of a bunch of 7206s, each handling a small subset of the customer base.  This has solved a lot of the problem, but not all of it, and it cost a lot.  We haven’t yet seen an attack that filled one of the OC-12s, or knocked over one of the GSRs, but there will probably be one some day.  We do see attacks that knock over one of the 7206 access routers and take down a subset of the customers, so that’s still a very serious problem.

 

Meanwhile, as the network has grown, so too have the attacks.  Part of this has no doubt been a matter of the attackers getting more sophisticated, and part of it is because a bigger network has more targets.

           

What form do the attacks take?

 

Denial of service attacks come in several forms: distributed or non-distributed, internal or external, directed at a host or at a network.

 

Internal attacks – that is, attacks sourced by one’s own customers – can be the most destructive.  In fact, I think until we got those under control, attacks coming from the outside were pretty much lost in the noise.  All the servers in this datacenter are connected via 100 megabit Ethernet, so if a server fills its port with small packets, that’s a lot of packets for the closest router to process, with nothing in between to slow the attack down.  Generally, these attacks take down their local access router, and are pretty well mitigated before they hit the network core, which is made up of more powerful routers, but they’re pretty devastating for a subset of the customer base.  Fortunately, internal attacks are also the easiest to block, since the sources belong to the company’s direct customers, and the company can easily take control of those servers.
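
To see why small packets matter so much, it helps to work out how many packets fit on a full port.  The sketch below is only arithmetic, using the standard Ethernet per-frame overhead of 20 bytes on the wire (preamble, start delimiter, and inter-frame gap); the function name is hypothetical.

```python
# Per-frame overhead on the wire beyond the frame itself:
# 7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def max_packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Upper bound on frames per second for a link carrying same-sized frames."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


for frame in (64, 1518):
    pps = max_packets_per_second(100e6, frame)
    print(f"{frame}-byte frames on 100 Mb/s: about {pps:,.0f} packets/s")
```

A port full of minimum-sized 64-byte frames carries roughly 148,800 packets per second, against roughly 8,100 for full-sized 1518-byte frames.  The same bandwidth therefore costs the router about eighteen times as many forwarding decisions.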

 

Attacks coming from outside are a much more difficult problem, since the source can’t be easily identified and stopped, and simply unplugging the target won’t help if there’s still a route sending traffic to the target into the ISP’s network.  For this reason, while attacks from the outside are generally a lot less disruptive, they’re also what can cause real lasting problems, because it’s harder to make them stop.

 

I understand that lower packet rate attacks directed at some service on the servers can also be a problem, but that’s where I can claim it’s a systems problem, and therefore not my problem.

 

How do we deal with them now?

 

The first project we took on was blocking internally sourced attacks.  Since blocking those in its simplest form was simply a matter of unplugging the customer’s server from its switch port, this was really a matter of detection.  This turned out to be pretty easy.  We use Cricket, with the polling interval set to 30 seconds rather than the default of five minutes, to track traffic on every switch port.  This doubles as the data source for the billing system.  Then we set Cricket to alarm and notify the NOC if a 100 megabit switch port fills up.  The blocking is then done manually.  If we’re just dealing with too much traffic coming from a host, we can take care of that by turning the ports down to ten megabits per second.  Occasionally, we’ve seen attacks involving really large numbers of really small packets, which can be pretty destructive to the routers even at ten megabits per second, so in those cases we’ll turn down the port and completely cut off the box sourcing the attack.
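
The detection logic amounts to turning two successive interface counter readings into a utilization figure and comparing it with a threshold.  The following is a minimal sketch of that idea, not Cricket’s own configuration or code: the function names, the 95% threshold, and the use of a 32-bit octet counter are all assumptions made for illustration.

```python
COUNTER32_MAX = 2**32  # assumed: a 32-bit octet counter that wraps to zero


def port_utilization(prev_octets: int, curr_octets: int,
                     interval_s: float, port_bps: float) -> float:
    """Fraction of port capacity used between two counter samples."""
    # The modulo copes with a single counter wrap between polls.
    delta_octets = (curr_octets - prev_octets) % COUNTER32_MAX
    return delta_octets * 8 / interval_s / port_bps


def should_alarm(utilization: float, threshold: float = 0.95) -> bool:
    """True when the port is close enough to full to notify the NOC."""
    return utilization >= threshold


# Two samples taken 30 seconds apart on a 100 Mb/s port; the counter wrapped.
u = port_utilization(4_000_000_000, 65_032_704, 30, 100e6)
print(f"utilization {u:.2f}, alarm: {should_alarm(u)}")
```

A short polling interval helps here in two ways: the alarm fires sooner, and a 32-bit octet counter on a saturated 100 Mb/s port wraps about every 5.7 minutes, so 30-second samples never span more than one wrap.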

 

This is still too slow.  It requires people to notice the attack is going on, figure out where it is, and get logged into the switch and turn the port down.

 

Dealing with outside attacks is so far a much more manual process, although we still have Cricket alerting us that something’s going on. 

 

Depending on the size of an external attack, the effect varies considerably, and the process of dealing with it does too.  The first step is, of course, detection.  The trigger to start investigating is generally an alarm from Cricket, but unlike the internal attacks for which Cricket tells us exactly where the attack is coming from, with the external attacks it merely tells us something’s wrong.  We still have to investigate it ourselves.

 

Since with distributed attacks from the outside what we generally see is traffic coming from hundreds or thousands of source addresses to a single destination address, blocking based on the source doesn’t work.  In general, we have to block based on the destination instead.  This has the unfortunate effect of making the attacks 100% effective against their target, but does save the rest of the network.

 

There are two main techniques we use to find the destination.  One is going back to the Cricket graphs on the switch ports, which, if the host being attacked is up, will tell us which switch port the host is connected to.  This narrows our target down to a relatively small number of IP addresses, but since most of these hosts have more than one IP address, it doesn’t tell us everything we need to know.  We also use “show ip cache flow” output to give us a more specific view of what the target is.
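
Picking the target out of flow data is mostly a matter of adding up packets per destination address.  The sketch below assumes each flow row has eight whitespace-separated columns (source interface, source address, destination interface, destination address, protocol, source port, destination port, packet count) with a plain integer packet count; rows that do not fit are skipped.  The sample rows and addresses are invented for the example.

```python
from collections import Counter


def top_destinations(flow_rows, n=3):
    """Rank destination addresses by total packets across all flow rows."""
    packets = Counter()
    for row in flow_rows:
        fields = row.split()
        # Skip headers and anything not in the assumed eight-column layout.
        if len(fields) != 8 or not fields[7].isdigit():
            continue
        packets[fields[3]] += int(fields[7])
    return packets.most_common(n)


# Invented sample rows using documentation address ranges.
rows = [
    "SrcIf SrcIPaddress DstIf DstIPaddress Pr SrcP DstP Pkts",
    "Fa0/0 198.51.100.7 Fa1/0 192.0.2.10 11 0401 0035 900",
    "Fa0/0 203.0.113.9 Fa1/0 192.0.2.10 11 0402 0035 850",
    "Fa0/0 198.51.100.8 Fa1/0 192.0.2.77 06 0C01 0050 12",
]
print(top_destinations(rows))
```

With many sources and one victim, the victim’s address stands out at the top of the list even though no single flow is large.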

 

Initially, we had to call the upstream providers and ask them to black hole the IP address being attacked.  Now, with bigger border routers on the network, and bigger connections into those border routers, we can generally block them ourselves.
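
One common way to black hole a single address on a Cisco router is a host route pointing at the Null0 interface.  The paper does not say exactly which mechanism was used, so treat the following as an assumed example; the helper name is hypothetical, and the Python only validates the address and formats the configuration line.

```python
import ipaddress


def null_route_statement(target: str) -> str:
    """Build an IOS static route that discards traffic to one IPv4 host."""
    addr = ipaddress.IPv4Address(target)  # raises ValueError on a bad address
    return f"ip route {addr} 255.255.255.255 Null0"


# 192.0.2.10 is a documentation address standing in for the attacked host.
print(null_route_statement("192.0.2.10"))
```

Validating the address before generating the line matters in practice, since a typo made in the middle of an attack can black hole the wrong customer.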

 

Earlier this week, on an evaluation basis, we installed a Riverhead box, which sources a BGP announcement for the destination being attacked, and then attempts to filter out the attack traffic while putting the legitimate traffic back into the network.  In the next week, we’re planning on installing a Riverhead box that will sniff the Ethernet and detect incoming attacks, thus triggering the BGP-based diversion to the “guard” box without us having to do anything.  If it works, this will be a very nice solution, but unfortunately it’s really expensive.  We probably can’t afford to keep it past the evaluation period.

 

How will we deal with them going forward?

 

I’d really like to automate the process of dealing with inbound attacks, like the Riverhead box will do.  This should both improve response time and allow those who currently deal with such things to sleep through them, but we haven’t found an affordable long-term solution for that yet.  There are a number of relatively cheap products that claim they can be put just inside the network borders, which will sit in the traffic path looking for attack traffic, and filter attack traffic out if they find any.  This is scary, since if those boxes die, or get overloaded during an attack, they are an extra point of failure and can take out the network on their own.  This isn’t what I’m trying to accomplish.  More encouraging is the Riverhead system we’re testing, as well as devices from Arbor Networks and others, which watch for attack traffic and respond to it.  I haven’t looked at the Arbor box much, but my impression is that it can either just send alerts to the NOC, or send access lists or null route statements to insert into a router.  At that point automating the process could be quite easy.  The Riverhead box does all that automation right out of the box, and at least in theory does some really fine grained filtering that should let us allow legitimate traffic through even to the host that’s being attacked.

 

Unfortunately, for my purposes, all the automated solutions that look like they’ll do the right thing are too expensive.  This client, like a lot of ISPs their size, does not have the multi-million dollar budgets that many of the really big networks do, so buying any one device with a hundred thousand dollar price tag is out of the question.  What I’d really like to see is for somebody to come out with a much cheaper device that would simply look for single IP addresses receiving enough traffic to fill the entire network, and issue a BGP statement to redirect the traffic elsewhere.  This wouldn’t be as elegant as the Riverhead solution, but saving all the customers that aren’t being attacked would probably be good enough.  We’ve also been looking at doing such a system in house, but that has its own economic problems if we aren’t sharing the costs with anybody else.
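
The decision logic of such a device could be quite small.  Here is a minimal sketch, assuming per-destination byte counts are already available for a measurement interval (from flow data, for example).  The function name, the 80% threshold, and the sample figures are all hypothetical, and the BGP announcement itself is left out because it depends on the routing setup.

```python
import ipaddress


def destinations_to_divert(bytes_by_dst, interval_s, capacity_bps, fraction=0.8):
    """Return /32 prefixes whose inbound rate is at least `fraction` of capacity."""
    limit_bps = capacity_bps * fraction
    flagged = []
    for dst, nbytes in bytes_by_dst.items():
        rate_bps = nbytes * 8 / interval_s
        if rate_bps >= limit_bps:
            # A host route is the narrowest thing that can be redirected.
            flagged.append(str(ipaddress.ip_network(f"{dst}/32")))
    return sorted(flagged)


# Invented 30-second sample on a 100 Mb/s link.
sample = {"192.0.2.10": 330_000_000, "192.0.2.77": 3_000_000}
print(destinations_to_divert(sample, 30, 100e6))
```

Note that the test is purely about volume: nothing here tries to decide whether the traffic is legitimate, only whether one address is taking enough of the link to hurt everybody else.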

 

I’ve been hearing a lot of talk lately about how to tell the difference between attack traffic and non-attack traffic, and this capability is what separates something like the Riverhead from something we could do easily in house.  My answer is that in this context it really doesn’t matter.  While certain DDOS programs send certain traffic patterns that are recognizable with the right equipment, the most useful test seems to be whether or not the traffic is causing damage.  While there are situations where this isn’t true, in the cases I’ve been working on no one customer is so important as to justify taking down all the others.  The general rule seems to me to be that if somebody is sending or receiving enough traffic to take down a chunk of the network, I don’t care how legitimate the traffic is; I want it gone.  There is precedent for this in other utilities.  For example, the power grid (if it’s working properly) cuts off people in areas with voltage anomalies in order to save the rest of the grid.

 

A simpler solution is to increase capacity, but that’s also expensive.  In our case, once we got beyond the OC-3 level and into OC-12s, upstream bandwidth became less of an issue.  Since we had to switch to GSRs for our border routers at that point, we picked up a bunch of excess routing capacity as well.  For smaller ISPs, increasing capacity by that amount would be pretty cost prohibitive.  In our case, we still have a bunch of work to do on our access layer, simultaneously trying to keep the pieces small enough that a single problem doesn’t take out too many customers, and trying to build in enough excess capacity to soak up DOS attacks.  What we’ve learned so far is that while the NPE-300 and NPE-400 based 7206 VXRs are cheap and wonderfully sized for our normal operations, they simply don’t have enough excess capacity to handle attacks.  Newer Cisco hardware, such as the NPE-G1 processor for the 7206, or the 7301 router, looks more promising, and we’ll probably end up upgrading our access layer in that direction fairly soon.

 

Why do attacks happen?

 

My big question on all of this is, of course, why do we have to deal with it at all?  Don’t people have anything better to do than to go knock networks over?  Wouldn’t it be much simpler if people would just stop?  Unfortunately, at this point, I’m not holding my breath.

 

Sometimes the attack victim is just an attractive target, or even selected completely randomly.  More often, the attack victim has done something to piss somebody off.  When we’re hosting servers for anybody who can come up with $100 per month, and we’ve got more than a thousand of them, there’s a lot of potential for that.  We get spammers who are upset about being cut off, and people who are upset that the hosting company hasn’t cut off spammers fast enough.  We’ve got IRC servers, which seem to have been everybody’s favorite attack target for years.  We’ve probably also got groups taking unpopular political positions.  We didn’t host Al Jazeera, but those who did saw some real nastiness.

 

What not to do

 

It’s also worth noting that there are some really bad ways to handle these attacks, which can end up just digging you deeper into the hole:

 

Obviously, don’t panic.  Most attacks at this point seem to be things that we’ve seen before, and if we’re calm we can pretty easily figure out where they’re coming from and how to block them.  On the other hand, panicking because the network is down probably means you’re going to make a bunch of mistakes before you actually manage to solve the problem.

 

Don’t attack back.  Over the years, I’ve seen a number of forms of this one.  In some cases, administrators of the attacked systems have tried to probe or break into the systems the attack was coming from, as a way of figuring out what was going on.  This didn’t actually tell them anything useful, and wasted a bunch of time that could have been spent on more effective responses.  Worse, it transforms you from an innocent victim into a fellow attacker, causing a loss of credibility when asking for help.  Often, the systems attacking you will be compromised systems, and it’s not their owners that you’re having a problem with.

 

Also, avoid using access-lists that log.  Much as they’re a useful source of information, they’re also really hard on the routers.  The security department at a former employer once decided they needed to know about all attack traffic, and thus set up logging access-lists to capture it.  I then got woken up in the middle of the night by the NOC, complaining about extensive packet loss.  Sure enough, it was a DOS attack, but of such a low grade that it wasn’t managing to fill the circuits, and didn’t look like it ought to be enough to take out the router.  But wait, there was that logging access-list, giving our security people a full picture of what was taking out the network, but apparently not prompting them to fix it.  I removed the logging statement and the network went back to working just fine, carrying its relatively small amount of extra attack traffic without issue.  I was able to go back to sleep.  The security people weren’t too happy the next morning that I’d removed their logging – how were they supposed to know we were being attacked without it?  There’s a lesson here.  If what you’re trying to protect is your network’s ability to forward packets, defense mechanisms that get in the way of that aren’t useful.

 

Acknowledgements

 

Irving Popovetsky has let me bounce these ideas off him, and contributed many ideas of his own, over the last several months.

 

Chris Quesada at Switch and Data invited me to give a talk at the PAIX Peering Forum in December 2003, prompting me to write this paper.

 

The hosting company described here let me experiment on their network, paid me for several months of work, and then let me write this paper about them, for which I owe a lot of thanks.  Many of the solutions discussed here have been created by their employees.  They have, however, requested to remain anonymous.