Capturing packets in Linux at a rate of tens of millions of packets per second without external libraries

8 November 2015

This article will show you how to capture 10 million packets per second without libraries such as Netmap, PF_RING or DPDK. We will do it with a stock Linux kernel (version 3.16) and a certain amount of C and C++ code.

First, a few words about how pcap – the well-known packet capture method – works. It is used by popular utilities such as iftop, tcpdump and arpwatch, and it puts a very high load on the CPU.

So, you have opened the interface and are waiting for packets in the usual way, via bind/recv. The kernel receives data from the network card and stores it in kernel space; then it sees that the user wants it in user space and, via the argument of the recv call, gets the address of the buffer where the data should be placed. The kernel dutifully copies the data (for the second time!). Sounds cumbersome, but that is not pcap’s only problem.

Remember also that recv is a system call, and we issue one for every packet arriving on the interface. System calls are usually fast, but at the rates of modern 10GE interfaces (up to 14.6 million packets per second) even a cheap call becomes very expensive simply because of how often it is made.

Note as well that a server usually has more than two logical cores, and the data can arrive on any of them, while an application receiving data via pcap uses only one. This is where kernel-side locking appears and dramatically slows down capture: now we are not just copying memory and processing packets, but also waiting for locks held by other cores to be released. Believe me, locking can eat up to 90% of the CPU resources of the entire server.

A nice list of problems, isn’t it? So let’s try to solve them!

To avoid misunderstandings, let me clarify that we are working with mirrored ports (that is, we receive a copy of all the traffic of a particular server from somewhere outside our network). The ports, in turn, receive the traffic – a SYN flood of minimum-size packets at 14.6 mpps / 7.6 GE.

The NIC is ixgbe with the 4.1.1 drivers from SourceForge, on Debian 8 Jessie. Module configuration: modprobe ixgbe RSS=8,8 (this is important!). The CPU is an i7 3820 with 8 logical cores. Therefore, wherever I use “8” (including in the code), substitute the number of cores that you have.

Distribute interrupts to available cores

Let me draw your attention to the fact that the port receives packets whose destination MAC addresses do not match the MAC address of our network card. Otherwise the Linux TCP/IP stack would kick in and the machine would choke on the traffic. This point is very important: we are only discussing the capture of other machines’ traffic, not the processing of traffic destined for this machine (although my method handles that case easily too).

Now let’s see how much of the available traffic we can actually pick up.

Let’s turn on promiscuous mode on the NIC:

ifconfig eth6 promisc
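
For reference, here is a small C sketch – my own illustration, not code from this article’s programs – of roughly what that command does under the hood: it reads the interface flags with SIOCGIFFLAGS and adds IFF_PROMISC with SIOCSIFFLAGS. “eth6” is just my example interface name.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main() {
    /* Any ordinary socket works as a handle for the interface ioctls. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth6", IFNAMSIZ - 1);  /* example interface name */

    /* Read the current flags, add IFF_PROMISC, write them back. */
    if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) { perror("SIOCGIFFLAGS"); return 1; }
    ifr.ifr_flags |= IFF_PROMISC;
    if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) { perror("SIOCSIFFLAGS"); return 1; }

    close(fd);
    return 0;
}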

 

After that we see an unpleasant picture in htop: one core is completely overloaded.

1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]

2 [ 0.0%]

3 [ 0.0%]

4 [ 0.0%]

5 [ 0.0%]

6 [ 0.0%]

7 [ 0.0%]

8 [ 0.0%]

 

We will use the pps.sh script (gist.github.com/pavel-odintsov/bc287860335e872db9a5) to measure the packet rate on the interface.
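
The script itself lives in the gist; conceptually it simply samples the kernel’s per-interface packet counters in sysfs once a second and prints the difference. A rough C equivalent (an approximation of mine, with the interface name hard-coded) could look like this:

#include <stdio.h>
#include <unistd.h>

/* Read one counter from sysfs, e.g. /sys/class/net/eth6/statistics/rx_packets */
static unsigned long long read_counter(const char *path) {
    unsigned long long value = 0;
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%llu", &value);
        fclose(f);
    }
    return value;
}

int main() {
    const char *rx = "/sys/class/net/eth6/statistics/rx_packets";
    const char *tx = "/sys/class/net/eth6/statistics/tx_packets";

    unsigned long long prev_rx = read_counter(rx);
    unsigned long long prev_tx = read_counter(tx);

    for (;;) {
        sleep(1);
        unsigned long long cur_rx = read_counter(rx);
        unsigned long long cur_tx = read_counter(tx);
        printf("TX eth6: %llu pkts/s RX eth6: %llu pkts/s\n",
               cur_tx - prev_tx, cur_rx - prev_rx);
        prev_rx = cur_rx;
        prev_tx = cur_tx;
    }
    return 0;
}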

The rate on the interface is quite low – about 4 million packets per second:
bash /root/pps.sh eth6

TX eth6: 0 pkts / s RX eth6: 3,882,721 pkts / s
TX eth6: 0 pkts / s RX eth6: 3,745,027 pkts / s

 

 

To fix this and spread the load across all the logical cores (I have 8), run the following script: gist.github.com/pavel-odintsov/9b065f96900da40c5301. It distributes interrupts from all 8 NIC queues across all available logical cores.
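
The gist does the work for you; the underlying idea is simply to pin each NIC RX queue interrupt to its own core by writing a CPU number into /proc/irq/<irq>/smp_affinity_list (this needs root). A minimal C sketch of the same idea – the IRQ numbers below are placeholders, take the real eth6-TxRx-* numbers from /proc/interrupts:

#include <stdio.h>

int main() {
    /* Placeholder IRQ numbers: look up the real eth6-TxRx-0..7 lines
     * in /proc/interrupts on your machine. */
    int irqs[8] = {64, 65, 66, 67, 68, 69, 70, 71};
    int cores = 8;

    for (int i = 0; i < 8; i++) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irqs[i]);

        FILE *f = fopen(path, "w");
        if (!f) { perror(path); continue; }

        /* Pin queue i to core i (round-robin if there are fewer cores). */
        fprintf(f, "%d\n", i % cores);
        fclose(f);
    }
    return 0;
}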

Great, the rate immediately grows to 12 mpps (this is not capture yet – only proof that we can read traffic off the wire at that rate):

bash /root/pps.sh eth6

TX eth6: 0 pkts/s RX eth6: 12528942 pkts/s

TX eth6: 0 pkts/s RX eth6: 12491898 pkts/s

TX eth6: 0 pkts/s RX eth6: 12554312 pkts/s

The load on the cores has evened out:

1 [||||| 7.4%]

2 [||||||| 9.7%]

3 [|||||| 8.9%]

4 [|| 2.8%]

5 [||| 4.1%]

6 [||| 3.9%]

7 [||| 4.1%]

8 [||||| 7.8%]

 

I’d like to point out that two code examples will be used throughout the text; here they are:
AF_PACKET, AF_PACKET + FANOUT: gist.github.com/pavel-odintsov/c2154f7799325aed46ae
AF_PACKET + RX_RING, AF_PACKET + RX_RING + FANOUT: gist.github.com/pavel-odintsov/15b7435e484134650f20

These are complete applications with the maximum level of optimization. I do not include the intermediate (slower) versions of the code, but all the switches for the individual optimizations are exposed in the code as bool variables – you can easily reproduce each step on your own machine.

The first attempt to launch AF_PACKET capture without optimization
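
Before running it, here is roughly what the unoptimized starting point looks like – a simplified sketch of my own, not an excerpt from the gist: one raw AF_PACKET socket bound to the interface and one recvfrom() system call (and one copy to user space) per packet. It needs root (CAP_NET_RAW) to run.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int main() {
    /* Raw packet socket that sees every protocol on the wire. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Bind it to one interface ("eth6" is just an example). */
    struct sockaddr_ll bind_addr;
    memset(&bind_addr, 0, sizeof(bind_addr));
    bind_addr.sll_family   = AF_PACKET;
    bind_addr.sll_protocol = htons(ETH_P_ALL);
    bind_addr.sll_ifindex  = if_nametoindex("eth6");

    if (bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* One recvfrom() system call (and one copy to user space) per packet. */
    char packet[2048];
    unsigned long long received = 0;

    for (;;) {
        ssize_t len = recvfrom(fd, packet, sizeof(packet), 0, NULL, NULL);
        if (len > 0 && ++received % 1000000 == 0)
            printf("We received: %llu packets so far\n", received);
    }
    return 0;
}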

So, let’s launch the application that captures traffic with AF_PACKET:

We process: 222,048 pps
We process: 186,315 pps

And the cores are loaded close to the maximum:

1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.1%]

2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||| 84.1%]

3 [|||||||||||||||||||||||||||||||||||||||||||||||||||| 79.8%]

4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 88.3%]

5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||| 83.7%]

6 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 86.7%]

7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 89.8%]

8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||90.9%]

 

The reason is that the kernel has drowned in locks, spending most of its processing time on them:

Samples: 303K of event ‘cpu-clock’, Event count (approx.): 53015222600
59.57% [kernel] [k] _raw_spin_lock
9.13% [kernel] [k] packet_rcv
7.23% [ixgbe] [k] ixgbe_clean_rx_irq
3.35% [kernel] [k] pvclock_clocksource_read
2.76% [kernel] [k] __netif_receive_skb_core
2.00% [kernel] [k] dev_gro_receive
1.98% [kernel] [k] consume_skb
1.94% [kernel] [k] build_skb
1.42% [kernel] [k] kmem_cache_alloc
1.39% [kernel] [k] kmem_cache_free
0.93% [kernel] [k] inet_gro_receive
0.89% [kernel] [k] __netdev_alloc_frag
0.79% [kernel] [k] tcp_gro_receive

 

Optimizing AF_PACKET capture with FANOUT

So, what next? Let’s think 🙂 Locking occurs when several processors try to use the same resource. In our case it is a single socket served by a single application, which leaves the rest of the logical cores constantly waiting.

FANOUT is an excellent feature that will help us here. We can open several AF_PACKET sockets (the most effective number of processes, of course, equals the number of logical cores) and specify the algorithm by which data is distributed between them. I chose the PACKET_FANOUT_CPU mode, since in our case the data is spread evenly across the NIC queues and this, in my opinion, is the least resource-consuming balancing option (though it is worth double-checking in the kernel source).

In the example code this is enabled with bool use_multiple_fanout_processes = true;
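
At the socket level this flag boils down to a single setsockopt() per worker: each process opens its own AF_PACKET socket and joins the same fanout group, and the kernel spreads packets between the group members. A hedged sketch of that step (the group id here is an arbitrary example):

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Call this on each worker's own AF_PACKET socket. All sockets that use
 * the same group id end up in one fanout group. */
int join_fanout_group(int packet_socket) {
    int fanout_group_id = 1234;               /* arbitrary example id */
    int fanout_type     = PACKET_FANOUT_CPU;  /* distribute by receiving CPU */

    int fanout_arg = (fanout_type << 16) | fanout_group_id;

    if (setsockopt(packet_socket, SOL_PACKET, PACKET_FANOUT,
                   &fanout_arg, sizeof(fanout_arg)) < 0) {
        perror("PACKET_FANOUT");
        return -1;
    }
    return 0;
}

With PACKET_FANOUT_CPU the kernel chooses the member socket based on the CPU that received the packet, so each NIC queue’s traffic consistently lands on the same socket.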

Again, launch the application.

It’s a miracle! The rate grows tenfold:

We process: 2250709 pps

We process: 2234301 pps

We process: 2266138 pps

 

Processors, of course, are still fully loaded:

1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||92.6%]

2 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.1%]

3 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.2%]

4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.3%]

5 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.1%]

6 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.7%]

7 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.7%]

8 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||93.2%]

 

But the perf top picture looks quite different – no more locks:

Samples: 1M of event ‘cpu-clock’, Event count (approx.): 110166379815
17.22% [ixgbe] [k] ixgbe_clean_rx_irq
7.07% [kernel] [k] pvclock_clocksource_read
6.04% [kernel] [k] __netif_receive_skb_core
4.88% [kernel] [k] build_skb
4.76% [kernel] [k] dev_gro_receive
4.28% [kernel] [k] kmem_cache_free
3.95% [kernel] [k] kmem_cache_alloc
3.04% [kernel] [k] packet_rcv
2.47% [kernel] [k] __netdev_alloc_frag
2.39% [kernel] [k] inet_gro_receive
2.29% [kernel] [k] copy_user_generic_string
2.11% [kernel] [k] tcp_gro_receive
2.03% [kernel] [k] _raw_spin_unlock_irqrestore

 

Besides, you can enlarge the socket receive buffer with SO_RCVBUF (I’m not sure whether it applies to AF_PACKET), but it gave no result in my tests.
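
For completeness, enlarging the buffer is just one more setsockopt(); a helper like the one below (the function name and the value are my own example) is all it takes.

#include <sys/socket.h>

/* Hypothetical helper: ask the kernel for a bigger receive buffer on the
 * capture socket. The kernel caps the value at net.core.rmem_max unless
 * SO_RCVBUFFORCE is used instead. */
int set_receive_buffer(int packet_socket, int bytes) {
    return setsockopt(packet_socket, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}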

Optimizing AF_PACKET capture with RX_RING – a ring buffer

 

What should we do next? Why is it still so slow? The answer is the build_skb function: it means that memory is still being copied twice inside the kernel!

Now we will try to allocate the memory using RX_RING.

And hurray, 4 Mpps!!!

We process: 3582498 pps
We process: 3757254 pps
We process: 3669876 pps
We process: 3757254 pps
We process: 3815506 pps
We process: 3873758 pps

 

This speed increase comes from the change we made: the memory is now copied out of the NIC buffer only once, and it is not copied again on the way from kernel space to user space. That is made possible by a common buffer, allocated in the kernel and handed over to user space.

This also changes the way we work: with the help of the poll call we can now wait for a signal that an entire block has been filled, and only then start processing it.
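
To make this concrete, here is a condensed sketch of what a TPACKET_V3 ring setup and the block-by-block poll() loop typically look like. This is my own illustration rather than the code from the gist, and the ring geometry and interface name are example values only:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <poll.h>
#include <arpa/inet.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>

int main() {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Use the block-oriented TPACKET_V3 ring format. */
    int version = TPACKET_V3;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

    /* Ring geometry: example values only, tune them for your traffic. */
    struct tpacket_req3 req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 1 << 22;                   /* 4 MB per block */
    req.tp_block_nr   = 64;
    req.tp_frame_size = 1 << 11;                   /* 2 KB per frame */
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    req.tp_retire_blk_tov = 60;                    /* hand over a partial block after 60 ms */

    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING");
        return 1;
    }

    /* One shared buffer mapped into user space: no per-packet copy to us. */
    size_t ring_size = (size_t)req.tp_block_size * req.tp_block_nr;
    unsigned char *ring = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex  = if_nametoindex("eth6");    /* example interface */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    unsigned long long packets = 0;
    unsigned int current_block = 0;
    time_t last_report = time(NULL);

    for (;;) {
        struct tpacket_block_desc *block = (struct tpacket_block_desc *)
            (ring + (size_t)current_block * req.tp_block_size);

        /* Wait until the kernel hands the whole block over to user space
         * (real code reads block_status with a volatile/atomic access). */
        while (!(block->hdr.bh1.block_status & TP_STATUS_USER)) {
            struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLERR };
            poll(&pfd, 1, -1);
        }

        /* Walk the packets inside the block; a real tool would parse them. */
        struct tpacket3_hdr *pkt = (struct tpacket3_hdr *)
            ((unsigned char *)block + block->hdr.bh1.offset_to_first_pkt);
        for (unsigned int i = 0; i < block->hdr.bh1.num_pkts; i++) {
            packets++;
            pkt = (struct tpacket3_hdr *)((unsigned char *)pkt + pkt->tp_next_offset);
        }

        /* Give the block back to the kernel and move to the next one. */
        block->hdr.bh1.block_status = TP_STATUS_KERNEL;
        current_block = (current_block + 1) % req.tp_block_nr;

        time_t now = time(NULL);
        if (now != last_report) {
            printf("We process: %llu pps\n", packets);
            packets = 0;
            last_report = now;
        }
    }
    return 0;
}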

 

Optimizing AF_PACKET capture with RX_RING and FANOUT

We still have a problem with locks! How do we beat them? The old method – enable FANOUT and allocate a block of memory for each capture thread. perf top confirms that the locks are back:

Samples: 778K of event ‘cpu-clock’, Event count (approx.): 87039903833

74.26% [kernel] [k] _raw_spin_lock

4.55% [ixgbe] [k] ixgbe_clean_rx_irq

3.18% [kernel] [k] tpacket_rcv

2.50% [kernel] [k] pvclock_clocksource_read

1.78% [kernel] [k] __netif_receive_skb_core

1.55% [kernel] [k] sock_def_readable

1.20% [kernel] [k] build_skb

1.19% [kernel] [k] dev_gro_receive

0.95% [kernel] [k] kmem_cache_free

0.93% [kernel] [k] kmem_cache_alloc

0.60% [kernel] [k] inet_gro_receive

0.57% [kernel] [k] kfree_skb

0.52% [kernel] [k] tcp_gro_receive

0.52% [kernel] [k] __netdev_alloc_frag

So, let’s enable FANOUT mode in the RX_RING version!

HOORAY! A RECORD! 9 Mpps!!!

 

We process: 9611580 pps

We process: 8912556 pps

We process: 8941682 pps

We process: 8854304 pps

We process: 8912556 pps

We process: 8941682 pps

We process: 8883430 pps

We process: 8825178 pps

perf top:

Samples: 224K of event ‘cpu-clock’, Event count (approx.): 42501395417

21.79% [ixgbe] [k] ixgbe_clean_rx_irq

9.96% [kernel] [k] tpacket_rcv

6.58% [kernel] [k] pvclock_clocksource_read

5.88% [kernel] [k] __netif_receive_skb_core

4.99% [kernel] [k] memcpy

4.91% [kernel] [k] dev_gro_receive

4.55% [kernel] [k] build_skb

3.10% [kernel] [k] kmem_cache_alloc

3.09% [kernel] [k] kmem_cache_free

2.63% [kernel] [k] prb_fill_curr_block.isra.57

 

By the way, updating to kernel 4.0.0 did not provide any speed boost; the rate stayed in the same range. But the load on the cores dropped significantly!

1 [||||||||||||||||||||||||||||||||||||| 55.1%]

2 [||||||||||||||||||||||||||||||||||| 52.5%]

3 [|||||||||||||||||||||||||||||||||||||||||| 62.5%]

4 [|||||||||||||||||||||||||||||||||||||||||| 62.5%]

5 [||||||||||||||||||||||||||||||||||||||| 57.7%]

6 [|||||||||||||||||||||||||||||||| 47.7%]

7 [||||||||||||||||||||||||||||||||||||||| 55.9%]

8 [||||||||||||||||||||||||||||||||||||||||| 61.4%]

 

In closing, I would like to add that Linux is simply a stunning platform for traffic analysis, even in an environment where you cannot build any specialized kernel module. It is very, very encouraging. There is hope that in the next kernel versions it will be possible to process a full 10GE wire-speed stream of 14.6 million packets per second on a 1800 MHz processor 🙂