Reducing UDP Latency


Hi! I’m one of Embox RTOS developers, and in this article I’ll tell you about one of the typical problems in the world of embedded systems and how we were solving it.


Stating the problem

Control and responsiveness are key points for a wide range of embedded systems. On the one hand, sensors and detectors must notify other devices that some event has occurred; on the other hand, other systems should react as soon as possible. Examples of such systems include CNC machines, vehicle control, avionics, distributed sensor systems and many others.

At the same time, it’s really hard to develop bare-metal programs for a number of reasons:

  • Developers don’t have much choice of frameworks and languages: it will probably be ANSI C and assembly, even for the non-time-critical parts of the code which could be developed faster with something else (for example, debug output, statistics collection, a user interface for diagnostics and so on)
  • There are lots of solutions which require different hardware drivers: network, interrupt controller, timer and UART drivers are the bare minimum
  • Some systems have both an FPGA and an HPS, which leads to additional steps to “glue” all the parts together

This explains the popularity of the Linux kernel in embedded systems, and it works great in lots of applications, as it provides a portable and stable code base.

But let’s see some specific case: time-critical applications that rely on network.

“Time-critical” may mean different things:

  • Applications that require high bandwidth
  • Applications that require low latency

Linux works great in the first case, as there are a number of possible optimizations (tuning interrupt coalescing and so on), but can you achieve better results in terms of low latency? Let’s find out!

Real-life example

We had the following task: minimize the latency of every single UDP response over Ethernet. A DE0-Nano-SoC board was used as the embedded system core, which would control some peripheral devices in reaction to the commands in those UDP packets.

The network topology is point-to-point, so there are no intermediate hubs, routers or other network devices.

The maximum acceptable latency is 0.1 ms, while a basic Linux solution could only provide 0.5 ms.

At the same time it was necessary to support POSIX-compatible programs.


To measure the response time we will use two hosts.

The first host will be a desktop computer running a GNU/Linux operating system; the second host will be the DE0-Nano-SoC development board. This board has an FPGA and an HPS (Hard Processor System, basically an ARM core), and we’re going to minimize the response time of the HPS running Embox RTOS.

We will use a simple testing application which looks like this:

while (1) {
    char buf[BUFLEN];
    recvfrom(s, buf, BUFLEN);   /* arguments simplified for brevity */
    sendto(s, buf, BUFLEN);
}

This program will run on the second host, i.e. the DE0-Nano-SoC.
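The snippet above omits the socket setup and the full recvfrom()/sendto() signatures for brevity. Below is a self-contained sketch of what such an echo server might look like with the plain POSIX socket API; the port number and buffer size are assumptions for illustration, the actual test sources are on GitHub (linked below):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BUFLEN 1024
#define PORT   12345            /* assumed port, not from the article */

int main(void) {
    struct sockaddr_in addr, peer;
    socklen_t peer_len;
    char buf[BUFLEN];
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    bind(s, (struct sockaddr *)&addr, sizeof(addr));

    while (1) {
        peer_len = sizeof(peer);
        /* Block until a datagram arrives, then echo it back to the sender. */
        ssize_t len = recvfrom(s, buf, BUFLEN, 0,
                               (struct sockaddr *)&peer, &peer_len);
        if (len > 0) {
            sendto(s, buf, (size_t)len, 0, (struct sockaddr *)&peer, peer_len);
        }
    }
}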

The first host will send UDP packets and wait for a response to each of them, measuring the time it takes for the response to arrive.

for (int i = 0; i < N; i++) {
    char buf_tx[BUFLEN], buf_rx[BUFLEN];
    sprintf(buf_tx, "This is packet %d\n", i);
    time_t time_begin = time_now();
    sendto(s, buf_tx, BUFLEN);
    recvfrom(s, buf_rx, BUFLEN);
    time_t time_end = time_now();
    if (memcmp(buf_tx, buf_rx, BUFLEN)) {
        printf("%d: Buffer mismatch\n", i);
    }
    if (time_end - time_begin > TIME_LIMIT) {
        printf("Slow answer #%d: %d\n", i, (int)(time_end - time_begin));
    }
}

We also measure the average, minimum and maximum response times.
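The client above glosses over how time_now() is implemented; it is not a standard function, and since we are measuring sub-millisecond intervals it needs better resolution than time(). A minimal sketch, assuming a POSIX monotonic clock and microsecond units (in that case the timestamps would be kept in 64-bit integers rather than time_t):

#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current time in microseconds from a monotonic clock. */
static uint64_t time_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000;
}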

Source code is available on GitHub.

A test run confirmed that packets were received successfully, so we started with some basic optimizations:

  • Get rid of all debug UART output: it turned out to be the slowest part
  • Compiling with -O2
  • Enabling L2 cache controller PL310 (this point was the least effective)

After sending 500 000 packets we got the following measurements:

Avg: 4.52ms
Min: 3.12ms
Max: 12.24ms

This is still several times slower than the time limit we need to meet, and the average response time would have to be almost ten times lower just to compete with Linux.

Finding out the reason

One possible source of slow data processing is other processes using system resources, but in this case nothing else is running.

Maybe there are too many interrupts from some peripherals? That’s not the case either: we only handle network and timer interrupts. The former are necessary to process Ethernet frames, and the latter have no real effect: making the timer tick less often doesn’t decrease the response time anyway.

Eventually we found out that the high latency was caused by a low link speed: we used a 100 Mbit/s USB-to-Ethernet adapter, and the network driver didn’t support 1 Gbit/s links either.

After patching the driver and replacing the Ethernet adapter with a faster one, we got the following results:

Avg: 0.08ms
Min: 0.07ms
Max: 4.31ms

Linux comparison

As we are using a POSIX-compatible application for our measurements, it’s very easy to cross-build it for Linux:

arm-linux-gnueabihf-gcc server.c -O2

which builds an ELF file.

Running it with the same client on the host side:

Avg: 0.77ms
Min: 0.74ms
Max: 5.31ms

As you can see, in this test Embox is able to respond almost 9 times faster than Linux, which is a pretty good result.

Dispersion

While the average response time is pretty good, the maximum time kills the positive effect for two reasons:

  • It’s long enough to exceed the time limit, but even more importantly,
  • It introduces significant uncertainty into the system’s behavior

How can you investigate the reason for such dispersion? We decided to start by measuring the time it takes an Ethernet frame to be fully processed, from receiving it to responding. It would have been possible to collect statistics on the development board for later analysis, but it’s much simpler to just send this data in the UDP packet itself and process it on the desktop computer.

The packet receive time is written to a variable inside the interrupt handler; the send time is written just before the network card’s DMA is activated.

int net_tx(...) {
    if (is_udp_packet()) {
        /* timestamp1 was saved in the receive interrupt handler */
        timestamp2 = timer_get();
        memcpy(&packet[UDP_OFFT],
               &timestamp1,
               sizeof(timestamp1));
        memcpy(&packet[UDP_OFFT + sizeof(timestamp1)],
               &timestamp2,
               sizeof(timestamp2));
        ...
    }
}
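On the desktop side the two timestamps can then be read back out of the echoed payload and their difference collected. A short sketch of that step, assuming 32-bit timer values that land at a known offset in the UDP payload (TS_OFFT, the helper name and the timestamp width are illustrative, not taken from the original code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TS_OFFT 0   /* assumed offset of the timestamps in the payload */

/* Sketch: pull the board-side receive and send timestamps out of the reply. */
static void report_board_time(const char *buf_rx) {
    uint32_t ts_recv, ts_send;
    memcpy(&ts_recv, buf_rx + TS_OFFT, sizeof(ts_recv));
    memcpy(&ts_send, buf_rx + TS_OFFT + sizeof(ts_recv), sizeof(ts_send));
    printf("On-board processing time: %u timer units\n",
           (unsigned)(ts_send - ts_recv));
}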

This time we got the following results (in timer units):

Avg: 8673
Min: 6191 
Max: 11950

It turned out that the dispersion of Embox’s UDP packet processing time is not big at all: only about 25%, which hardly explains the final 5000% dispersion (Avg: 0.08 ms, Max: 4.31 ms).

Even if Embox processed every UDP packet in exactly the same time, that would cut the maximum response time by only about a quarter, which would still be too much, so we started looking for another cause of this behavior.

What if problem is on the other side?

So now we have two potential problems:

  • Hardware issues
  • Linux host latency

The first problem would be much harder to solve, so, hoping that it’s not the case, we started thinking about how to check the second one.

How do we check it?

First of all, we can simply give the test the highest priority on the host system:

nice -n -20 ./client

However, this didn’t have any significant effect. The average time seemed to decrease slightly, but the improvement was too small compared to the large dispersion.

Another option is to change the scheduling policy to round-robin. You can do that with the chrt command like this:

chrt --rr 99 ./client

Finally, it worked!

The number of “slow” responses decreased dramatically. The histogram below shows the difference between round-robin and default scheduling:

[Histogram: response time distribution with round-robin vs. default scheduling]
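For completeness, the switch to SCHED_RR can also be requested from inside the client itself with the standard sched_setscheduler() call; a minimal sketch (the helper name is ours, and the call requires root privileges or CAP_SYS_NICE):

#include <sched.h>
#include <stdio.h>

/* Ask the kernel to run this process under SCHED_RR with priority 99. */
static void use_round_robin(void) {
    struct sched_param sp = { .sched_priority = 99 };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
    }
}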

Other ways to reduce latency for Linux host

  • Using raw sockets. It’s not exactly the same task, but if you really need the lowest possible latency, it’s probably not a good idea to use UDP at all :)
  • Interrupt coalescing may increase network latency, so it can be helpful to turn it off (see the example after this list)
  • You can use libpcap and TPACKET_V3, which is supported by the Linux kernel. The speedup comes from removing the overhead of copying packets from kernel space to user space; pcap also allows applying packet filtering
  • XDP (eXpress Data Path) is a BPF-based project which also allows lowering the overhead
  • Some other approaches are considered in this Cloudflare blogpost
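As an example of the interrupt-coalescing point above: on Linux, coalescing is usually controlled with ethtool. The interface name below is an assumption, and which parameters are actually supported depends on the NIC driver:

ethtool -c eth0                         # show current coalescing settings
ethtool -C eth0 rx-usecs 0 rx-frames 1  # interrupt on every received frame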
