nnnoiseless: porting audio code from C to rust

栏目: IT技术 · 发布时间: 3年前

内容简介:I ported a C library to rust last week, and it went pretty smoothly. This is the story, andThe library in question isSo why bother porting it? Well, I need to patch it so that it would compile with MSVC, but my PR went unnoticed for a month. I thought abou

I ported a C library to rust last week, and it went pretty smoothly. This is the story, and here is the repo.

The library in question is RNNoise , a library for removing noise from audio. It works well, it runs fast, and best of all it has no knobs that you need to tune. There’s even a rust binding .

So why bother porting it? Well, I need to patch it so that it would compile with MSVC, but my PR went unnoticed for a month. I thought about maintaining my own fork, but it’s been more than 10 years since I last wrote anything in C or C++. And that’s how I ended up porting RNNoise to rust. It probably wasn’t the most efficient use of my time, but I had fun and learned something.

There’s a lot of information out there about porting C to rust, but the most useful resource for me was the fantastic talk by Carol (Nichols || Goulding). It lays out a simple process for porting one function at a time: first, you set up the cargo to compile as a static library and you set up the C build system to link that static library into the C library (see the slides for the relevant Makefile and Cargo.toml snippets). Then you can port one function at time: the C code goes like this:

+extern void _celt_lpc(opus_val16 *_lpc, const opus_val32 *ac, int p);
+void __celt_lpc(opus_val16 *_lpc, const opus_val32 *ac, int p)
-void _celt_lpc(opus_val16 *_lpc, const opus_val32 *ac, int p)
{
    /* C body of _celt_lpc */
}

and the rust code goes like this:

+#[no_mangle]
+pub extern "C" fn _celt_lpc(lpc: *mut f32, ac: *const f32, p: c_int) {
+    unsafe {
+        let lpc_slice = std::slice::from_raw_parts_mut(lpc, p as usize);
+        let ac_slice = std::slice::from_raw_parts(ac, p as usize + 1);
+        rs_celt_lpc(lpc_slice, ac_slice);
+    }
+}
+
+fn rs_celt_lpc(lpc: &mut [f32], ac: &[f32]) {
+// rust body of celt_lpc
+}

If you’ve watched the talk (which you should), you might notice that this is a tiny bit different from what they recommend: I’ve renamed the original C function instead of deleting it. I found that this helped me narrow down porting mistakes, because it made it easy to switch back and forth between the C and rust implementations.

Pain points

Most of the porting process was mechanical and easy. One of the less fun parts was porting code involving C structs. RNNoise has structs that (when ported to rust) look like this:

#[repr(C)]
struct RnnState {
    model: *const RnnModel,
    // Various buffers, whose sizes are determined by some subfields of `model`.
    vad_gru_state: *mut f32,
    noise_gru_state: *mut f32,
    denoise_gru_state: *mut f32,
}

An idomatic rust version might look something like

struct RnnState {
    model: &'static RnnModel,
    vad_gru_state: Vec<f32>,
    noise_gru_state: Vec<f32>,
    denoise_gru_state: Vec<f32>,
}

but this isn’t layout-compatible with the original C version, and so I need to stick with the original struct for as long as RnnState is being accessed by both C and rust code. This increases the amount of unsafe sprinkled around the rust code, and it was also the source of an annoying bug of the sort that I thought I had left behind by moving to rust.

A Heisenbug

At some point during the porting process, my tests started failing in release mode, but not in debug mode. Most likely some undefined behavior triggered by my amateurish attempts at unsafe code, but I couldn’t quickly spot the problem and the prospect of a more careful round of debugging didn’t spark a whole lot of joy. So I did something that I never would have dared to do in my C/C++ days: I ignored the problem and kept porting; after all, the tests were still working in debug mode. And sure enough, a few more ported functions later and rustc found the problem for me: in a function taking a &RnnState parameter, I was modifying data in the vad_gru_state buffer. Since I was using unsafe code, rustc didn’t complain at first. But once I ported the RnnState struct to safe and idiomatic rust, the compiler flagged the problem immediately.

Performance

After getting everything to 100% safe (if not particulary idiomatic) rust, it was time to check whether performance had suffered.

nnnoiseless: porting audio code from C to rust

Yes, apparently, by about 50%. The most obvious culprit was bounds checking: there was a lot of indexing in the C code, and some of it wasn’t trivial to convert to a more rust-friendly, iterator-based version. First priority was the neural network evaluation:

let m = ...; // At most 114.
let n = ...; // At most 96.

for i in 0..n {
    let output[i] = layer.bias[i] as f32;
    for j in 0..m {
        output[i] += layer.input_weights[j * n + i] as f32 * input[j];
    }
}

I can already see you shaking your head. I’m doing naive matrix-vector multiplication with a 100x100ish matrix in column-major format ? Not only is this costing me bounds checks, it’s terrible for memory locality. Swapping the weights storage from column- to row-major order only made things about 1.5% faster, but more importantly it made the whole thing iterator-friendly. Converting to zips and sums bought another 15%, leaving me only about 25-30% slower than the C code.

for i in 0..n {
    let output[i] =
        layer.bias[i] as f32 + 
        layer.input_weights[(i * m)..((i + 1) * m)]
            .iter()
            .zip(input)
            .map(|(&x, &y)| x as f32 * y)
            .sum();
}

For my next optimization opportunity, I moved on to the function that computes cross-correlations . The un-optimized version of this function looks like

fn pitch_xcorr(xs: &[f32], ys: &[f32], xcorr: &mut [f32]) {
    for i in 0..xcorr.len() {
        xcorr[i] = xs.iter().zip(&ys[i..]).map(|(&x, &y)| x * y).sum();
    }
}

but the C code contained a massive, manually-unrolled version. I’d skipped it while porting, but maybe I’d gain something from porting it over. Here’s an abbreviated version of the optimized function, assuming that all lengths are a multiple of 4 (the real code also handles the case that they aren’t).

for i in (0..xcorr.len()).step_by(4) {
    let mut c0 = 0.0;
    let mut c1 = 0.0;
    let mut c2 = 0.0;
    let mut c3 = 0.0;

    let mut y0 = ys[i + 0];
    let mut y1 = ys[i + 1];
    let mut y2 = ys[i + 2];
    let mut y3 = ys[i + 3];

    for (x, y) in xs.chunks_exact(4).zip(ys[(i + 4)..].chunks_exact(4)) {
        c0 += x[0] * y0;
        c1 += x[0] * y1;
        c2 += x[0] * y2;
        c3 += x[0] * y3;

        y0 = y[0];
        c0 += x[1] * y1;
        c1 += x[1] * y2;
        c2 += x[1] * y3;
        c3 += x[1] * y0;

        y1 = y[1];
        c0 += x[2] * y2;
        c1 += x[2] * y3;
        c2 += x[2] * y0;
        c3 += x[2] * y1;

        y2 = y[2];
        c0 += x[3] * y3;
        c1 += x[3] * y0;
        c2 += x[3] * y1;
        c3 += x[3] * y2;

        y3 = y[3];
    }
}

Basically, both inner and outer loops have been unrolled four times, and I’ve exploited the inner loop’s unrolling to optimize the memory access pattern. Thanks to the amazing cargo asm , I can happily report that there’s no bounds-checking in the inner loop and that all the arithmetic has been auto-vectorized to work four f32 s at a time. (Maybe it would get even faster if I unrolled 8 times and compiled with AVX enabled; I haven’t tried that yet.)

This change more than doubled the speed of pitch_xcorr , and gained me about 10% overall. More importantly, it showed me how to coerce the compiler into auto-vectorizing something that it hadn’t auto-vectorized before. I went back to the neural network code and replaced things like

xs.iter().zip(ys).map(|(&x, &y)| x as f32 * y).sum()

with things like

{
    let mut sum0 = 0.0;
    let mut sum1 = 0.0;
    let mut sum2 = 0.0;
    let mut sum3 = 0.0;

    for (x, y) in xs.chunks_exact(4).zip(ys.chunks_exact(4)) {
        sum0 += x[0] * y[0];
        sum1 += x[1] * y[1];
        sum2 += x[2] * y[2];
        sum3 += x[3] * y[3];
    }
    sum0 + sum1 + sum2 + sum3
}

for another 20% improvement.

Current score: the rust version (still 100% safe) is about 15% faster, and there’s probably plenty more still on the table.

nnnoiseless: porting audio code from C to rust

The performance lesson I learned from this is that bounds checking can be expensive in numerical code and iterator-style code can help a bit, but if you really want faster numerical code then you need to write in a style that the auto-vectorizer likes. (Or you could use the SIMD intrinsics directly, but that’s another story.)

A huge thank you to cargo

Like I wrote above, it’s been a while since I did any C/C++, and because of that I’ve started to take tools like cargo for granted. This little porting project brought back some memories, mostly because about half of the code in RNNoise was actually “vendored” from opus . I put “vendored” in quotes because I usually think of vendoring as involving a subdirectory (maybe even a git submodule if I’m lucky) with its own build artifacts. That’s not what’s going on here, though; I’m just talking about files that were copied from the source directory of one project to the source directory of another, complete with never-used functions and never-def’ed ifdefs. The thing is, though, that I understand exactly why they did it: it’s by far the easiest way to share code between C projects. So I just want to finish by saying a big “thank you” to cargo and crates.io for making me not have to deal with C dependency management any more.


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

深入理解程序设计

深入理解程序设计

[美] Jonathan Bartlett / 郭晴霞 / 人民邮电出版社 / 2014-1 / 49.00

是否真正理解汇编语言,常常是普通程序员和优秀程序员的分水岭。《深入理解程序设计:使用Linux汇编语言》介绍了Linux平台下的汇编语言编程,教你从计算机的角度看问题,从而了解汇编语言及计算机的工作方式,为成就自己的优秀程序员之梦夯实基础。 很多人都认为汇编语言晦涩难懂,但New Medio技术总监Jonathan Bartlett的这本书将改变人们的看法。本书首先介绍计算机的体系结构,然后......一起来看看 《深入理解程序设计》 这本书的介绍吧!

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具