[Battlemesh] Routing Tables of Death

Wed Mar 8 00:29:10 CET 2017

On Tue, Mar 7, 2017 at 2:29 PM, Pau <pau at dabax.net> wrote:
> On 07/03/17 22:56, Dave Taht wrote:
>> On Tue, Mar 7, 2017 at 1:29 PM, Pau <pau at dabax.net> wrote:
>>> It would be interesting to see whether bmx7 (with security extensions
>>> [1]) can minimize the impact of this kind of attack. Of course the
>>> network flooding will still be a problem, but if the routing protocol
>>> can survive it, the attack's impact would be reduced drastically.
>>>
>>> [1] http://bmx6.net/documents/30
>>
>> My principal observation on this front at the moment is that routers tend
>> to bottleneck on serving packets, and starve userspace daemons under
>> loads like that.
>>
>> More testing is needed in these circumstances. Lately I have been using
>> the flent tools to stress the forwarding path, and observing many
>> anomalies in routing behavior. Here, for example, is what happened to me
>> with 10k routes on an apu2, which, as a quad-core x86_64, is fully
>> capable of forwarding well over a gbit/sec. But as the routing daemon
>> ate more than a core and couldn't keep up, it would lose connectivity
>> periodically.
>>
>> http://www.taht.net/~d/apu_router_failure/apu2_out_of_cpu.png
>>
>> I've seen the same things happen at smaller scales on weaker boxes.
>> (I'd been saying for years that many routing problems were congestive
>> in nature, but had not got around to reliably simulating it until this
>> past month, figuring that with fq_codel and ATF, many would go away.
>> They didn't, because we haven't fixed wifi multicast yet.)
>>
>> I was also recently doing myself in with some bad interactions between
>> odhcpd, network-manager, and babeld in multiple circumstances.
>>
>> There's a ton of mostly crappy work going on in my rabeld repo, which
>> I'm in the process of organizing and cleaning up, to try to handle
>> errors better and to deal better with cpu overload and network
>> bufferbloat. Don't look there. It's a mess, currently,
>> with only 6 or so good ideas out of 200+ commits.....
>>
>> As for bmx7:
>>
>> Adding userspace crypto into the mix may make things worse, rather
>> than better. Perhaps punting more of the work to the kernel, and
>> adopting a more hard-RT perspective in multiple daemons so that vital
>> packets like hellos and IHUs still get out while other processing is
>> going on, will help.
>
> I've not got deeply into your research, so correct me if I'm saying
> something incorrect.

Until recently I was mostly working on implementing the ideas
discussed at battlemesh two years back[1,2] - which have all mostly
landed in lede and in the upstream linux kernel.

But we didn't fix multicast. Analyzing that and fixing it is next, but
I'm not sure if we'll get around to implementing the obvious fixes before
this year's bmesh.

(answer: short, fq'd drop head mcast/powersave queues)
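
To make "short, fq'd, drop head" concrete, here is a minimal Python
sketch, for illustration only. The class name, the limit of 4 packets
and the flow ids are all hypothetical; an actual fix would presumably
live in the kernel's wifi queueing, not in userspace. The idea shown:
keep the queue short, serve flows round-robin, and when the queue is
full drop the oldest packet of the longest flow rather than the new
arrival.

```python
from collections import OrderedDict, deque


class DropHeadFairQueue:
    """Hypothetical sketch of a short, per-flow fair, drop-head queue."""

    def __init__(self, limit=8):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit          # total packets held across all flows
        self.flows = OrderedDict()  # flow id -> deque, kept in service order
        self.size = 0
        self.dropped = 0

    def enqueue(self, flow_id, packet):
        if self.size >= self.limit:
            # Full: drop the OLDEST packet of the longest flow, so stale
            # data goes first and the new arrival is still accepted.
            fattest = max(self.flows.values(), key=len)
            fattest.popleft()
            self.size -= 1
            self.dropped += 1
        self.flows.setdefault(flow_id, deque()).append(packet)
        self.size += 1

    def dequeue(self):
        # Round-robin: take one packet from the flow at the front, then
        # move that flow to the back if it still has packets queued.
        while self.flows:
            flow_id, queue = self.flows.popitem(last=False)
            if not queue:
                continue  # flow was emptied by a head drop; forget it
            packet = queue.popleft()
            self.size -= 1
            if queue:
                self.flows[flow_id] = queue
            return packet
        return None


q = DropHeadFairQueue(limit=4)
for name in ("a1", "a2", "a3", "a4"):
    q.enqueue("a", name)
q.enqueue("b", "b1")  # queue is full, so the stale "a1" is dropped
order = [q.dequeue() for _ in range(4)]
print(order, "dropped:", q.dropped)
```

With tail drop, "b1" would have been the packet lost here; with drop
head, the flood from flow "a" pays for its own backlog and the single
packet from "b" goes out second.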

...

I've run out of time to pursue some of the ideas generated in the
rabeld effort. I will try to write up the ideas that worked, and those
that didn't, over the coming week.

> As far as I understand bmx7 (not much yet), it uses crypto to verify
> and identify nodes. Then you can create trusted rings and different
> routing policies/strategies. For instance, you can set the daemon to not
> relay hello or OGM messages from untrusted origins or from nodes doing
> suspicious things. In that case, creating 1k fake nodes would not have
> any impact on the network (just the flooding). This is what I mean by
> mitigating the impact of the attack.

A) The cpu overhead of validating each packet via authentication is high.

(anyone know what the pps is for bmx7?)

As one example of a further mitigation, you could register that a
given node is being evil and start dropping most of its packets (via
netfilter? ipset? bpf) without doing the "verify signature step" at
all.
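
A rough Python sketch of that mitigation, under loud assumptions: HMAC
stands in for whatever signature scheme bmx7 really uses, the class
name and the max_failures threshold are hypothetical, and a real
deployment would push the drop into the kernel (netfilter, ipset or
bpf, as above) instead of doing it in the daemon. It only shows the
ordering: check a cheap blocklist first, and pay for verification only
for sources not yet marked evil.

```python
import hashlib
import hmac


class PrefilterVerifier:
    """Hypothetical sketch: blocklist check before signature verification."""

    def __init__(self, keys, max_failures=3):
        self.keys = keys                  # source id -> shared secret (bytes)
        self.max_failures = max_failures  # failures tolerated before blocking
        self.failures = {}                # source id -> failed verifications
        self.blocked = set()              # sources dropped without verifying
        self.verifications = 0            # count of expensive checks performed

    def accept(self, source, payload, tag):
        if source in self.blocked:
            return False  # cheap drop: no crypto work at all

        self.verifications += 1
        key = self.keys.get(source)
        # HMAC-SHA256 is only a stand-in for the real signature check.
        ok = key is not None and hmac.compare_digest(
            hmac.new(key, payload, hashlib.sha256).digest(), tag
        )
        if ok:
            return True

        self.failures[source] = self.failures.get(source, 0) + 1
        if self.failures[source] >= self.max_failures:
            self.blocked.add(source)
        return False


verifier = PrefilterVerifier({"good": b"secret"}, max_failures=3)
good_tag = hmac.new(b"secret", b"hello", hashlib.sha256).digest()
print(verifier.accept("good", b"hello", good_tag))

# An attacker sends 1000 bogus packets; only the first 3 cost a verification.
results = [verifier.accept("evil", b"junk", b"bad-tag") for _ in range(1000)]
print(any(results), verifier.verifications)
```

This does nothing about point C below: the bogus packets still occupy
airtime and queues before they reach the filter.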

B) If the router is trusted, but either it or the network is
overburdened, what happens?

C) The flooding effects are significant, regardless. We can end up
flooding the local mcast queue so much, that in one test, on one
chipset, I had 11 minutes of queuing there. (no kidding)

...

[1]  https://www.youtube.com/watch?v=Rb-UnHDw02o&t=242s
[2]  https://arxiv.org/abs/1703.00064