Friday, February 20, 2015

Turing tarpits in Rust's macro system

Bitwise Cyclic Tag is an extremely simple automaton slash programming language. BCT uses a program string and a data string, each made of bits. The program string is interpreted as if it were infinite, by looping back around to the first bit.

The program consists of commands executed in order. There is a single one-bit command:

0: Delete the left-most data bit.

and a single two-bit command:

1 x: If the left-most data bit is 1, copy bit x to the right of the data string.

We halt if ever the data string is empty.

Remarkably, this is enough to do universal computation. Implementing it in Rust's macro system gives a proof (probably not the first one) that Rust's macro system is Turing-complete, aside from the recursion limit imposed by the compiler.

#![feature(trace_macros)]

macro_rules! bct {
    // cmd 0:  d ... => ...
    (0, $($ps:tt),* ; $_d:tt)
        => (bct!($($ps),*, 0 ; ));
    (0, $($ps:tt),* ; $_d:tt, $($ds:tt),*)
        => (bct!($($ps),*, 0 ; $($ds),*));

    // cmd 1p:  1 ... => 1 ... p
    (1, $p:tt, $($ps:tt),* ; 1)
        => (bct!($($ps),*, 1, $p ; 1, $p));
    (1, $p:tt, $($ps:tt),* ; 1, $($ds:tt),*)
        => (bct!($($ps),*, 1, $p ; 1, $($ds),*, $p));

    // cmd 1p:  0 ... => 0 ...
    (1, $p:tt, $($ps:tt),* ; $($ds:tt),*)
        => (bct!($($ps),*, 1, $p ; $($ds),*));

    // halt on empty data string
    ( $($ps:tt),* ; )
        => (());
}

fn main() {
    trace_macros!(true);
    bct!(0, 0, 1, 1, 1 ; 1, 0, 1);
}

This produces the following compiler output:

bct! { 0 , 0 , 1 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }
bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }
bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }
bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }
bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }
...
bct.rs:19:13: 19:45 error: recursion limit reached while expanding the macro `bct`
bct.rs:19         => (bct!($($ps),*, 1, $p ; $($ds),*));
                      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can try it online, as well.

Notes about the macro

I would much rather drop the commas and write

// cmd 0:  d ... => ...
(0 $($ps:tt)* ; $_d:tt $($ds:tt)*)
    => (bct!($($ps)* 0 ; $($ds)*));

// cmd 1p:  1 ... => 1 ... p
(1 $p:tt $($ps:tt)* ; 1 $($ds:tt)*)
    => (bct!($($ps)* 1 $p ; 1 $($ds)* $p));

// cmd 1p:  0 ... => 0 ...
(1 $p:tt $($ps:tt)* ; $($ds:tt)*)
    => (bct!($($ps)* 1 $p ; $($ds)*));

but this runs into the macro future-proofing rules.

If we're required to have commas, then it's at least nice to handle them uniformly, e.g.

// cmd 0:  d ... => ...
(0 $(, $ps:tt)* ; $_d:tt $(, $ds:tt)*)
    => (bct!($($ps),*, 0 ; $($ds),*));

// cmd 1p:  1 ... => 1 ... p
(1, $p:tt $(, $ps:tt)* ; $($ds:tt),*)
    => (bct!($($ps),*, 1, $p ; 1 $(, $ds)*, $p));

// cmd 1p:  0 ... => 0 ...
(1, $p:tt $(, $ps:tt)* ; $($ds:tt),*)
    => (bct!($($ps),*, 1, $p ; $($ds),*));

But this too is disallowed. An $x:tt variable cannot be followed by a repetition $(...)*, even though it's (I believe) harmless. There is an open RFC about this issue. For now I have to handle the "one" and "more than one" cases separately, which is annoying.

In general, I don't think macro_rules! is a good language for arbitrary computation. This experiment shows the hassle involved in implementing one of the simplest known "arbitrary computations". Rather, macro_rules! is good at expressing patterns of code reuse that don't require elaborate compile-time processing. It does so in a way that's declarative, hygienic, and high-level.

However, there is a big middle ground of non-elaborate, but non-trivial computations. macro_rules! is hardly ideal for that, but procedural macros have problems of their own. Indeed, the bct! macro is an extreme case of a pattern I've found useful in the real world. The idea is that every recursive invocation of a macro gives you another opportunity to pattern-match the arguments. Some of html5ever's macros do this, for example.

Saturday, January 10, 2015

151-byte static Linux binary in Rust

Part of the sales pitch for Rust is that it's "as bare metal as C".1 Rust can do anything C can do, run anywhere C can run,2 with code that's just as efficient, and at least as safe (but usually much safer).

I'd say this claim is about 95% true, which is pretty good by the standards of marketing claims. A while back I decided to put it to the test, by making the smallest, most self-contained Rust program possible. After resolving a few issues along the way, I ended up with a 151-byte, statically linked executable for AMD64 Linux. With the release of Rust 1.0-alpha, it's time to show this off.

Here's the Rust code:

#![crate_type="rlib"]
#![allow(unstable)]

#[macro_use] extern crate syscall;

use std::intrinsics;

fn exit(n: usize) -> ! {
    unsafe {
        syscall!(EXIT, n);
        intrinsics::unreachable()
    }
}

fn write(fd: usize, buf: &[u8]) {
    unsafe {
        syscall!(WRITE, fd, buf.as_ptr(), buf.len());
    }
}

#[no_mangle]
pub fn main() {
    write(1, "Hello!\n".as_bytes());
    exit(0);
}

This uses my syscall library, which provides the syscall! macro. We wrap the underlying system calls with Rust functions, each exposing a safe interface to the unsafe syscall! macro. The main function uses these two safe functions and doesn't need its own unsafe annotation. Even in such a small program, Rust allows us to isolate memory unsafety to a subset of the code.

Because of crate_type="rlib", rustc will build this as a static library, from which we extract a single object file tinyrust.o:

$ rustc tinyrust.rs \
    -O -C no-stack-check -C relocation-model=static \
    -L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   be 00 00 00 00          mov    $0x0,%esi
                        b: R_X86_64_32  .rodata.str1625
   f:   ba 07 00 00 00          mov    $0x7,%edx
  14:   0f 05                   syscall 
  16:   b8 3c 00 00 00          mov    $0x3c,%eax
  1b:   31 ff                   xor    %edi,%edi
  1d:   0f 05                   syscall 

We disable stack exhaustion checking, as well as position-independent code, in order to slim down the output. After optimization, the only instructions that survive come from inline assembly blocks in the syscall library.

Note that main doesn't end in a ret instruction. The exit function (which gets inlined) is marked with a "return type" of !, meaning "doesn't return". We make good on this by invoking the unreachable intrinsic after syscall!. LLVM will optimize under the assumption that we can never reach this point, making no guarantees about the program behavior if it is reached. This represents the fact that the kernel is actually going to kill the process before syscall!(EXIT, n) can return.

Because we use inline assembly and intrinsics, this code is not going to work on a stable-channel build of Rust 1.0. It will require an alpha or nightly build until such time as inline assembly and intrinsics::unreachable are added to the stable language of Rust 1.x.

Note that I didn't even use #![no_std]! This program is so tiny that everything it pulls from libstd is a type definition, macro, or fully inlined function. As a result there's nothing of libstd left in the compiler output. In a larger program you may need #![no_std], although its role is greatly reduced following the removal of Rust's runtime.

Linking

This is where things get weird.

Whether we compile from C or Rust,3 the standard linker toolchain is going to include a bunch of junk we don't need. So I cooked up my own linker script:

SECTIONS {
    . = 0x400078;
  
    combined . : AT(0x400078) ALIGN(1) SUBALIGN(1) {
        *(.text*)
        *(.data*)
        *(.rodata*)
        *(.bss*)
    }
}

We smash all the sections together, with no alignment padding, then extract that section as a headerless binary blob:

$ ld --gc-sections -e main -T script.ld -o payload tinyrust.o
$ objcopy -j combined -O binary payload payload.bin

Finally we stick this on the end of a custom ELF header. The header is written in NASM syntax but contains no instructions, only data fields. The base address 0x400078 seen above is the end of this header, when the whole file is loaded at 0x400000. There's no guarantee that ld will put main at the beginning of the file, so we need to separately determine the address of main and fill that in as the e_entry field in the ELF file header.

$ ENTRY=$(nm -f posix payload | grep '^main ' | awk '{print $3}')
$ nasm -f bin -o tinyrust -D entry=0x$ENTRY elf.s
$ chmod +x ./tinyrust
$ ./tinyrust
Hello!

It works! And the size:

$ wc -c < tinyrust
158

Seven bytes too big!

The final trick

To get down to 151 bytes, I took inspiration from this classic article, which observes that padding fields in the ELF header can be used to store other data. Like, say, a string constant. The Rust code changes to access this constant:

use std::{mem, raw};

#[no_mangle]
pub fn main() {
    let message: &'static [u8] = unsafe {
        mem::transmute(raw::Slice {
            data: 0x00400008 as *const u8,
            len: 7,
        })
    };

    write(1, message);
    exit(0);
}

A Rust slice like &[u8] consists of a pointer to some memory, and a length indicating the number of elements that may be found there. The module std::raw exposes this as an ordinary struct that we build, then transmute to the actual slice type. The transmute function generates no code; it just tells the type checker to treat our raw::Slice<u8> as if it were a &[u8]. We return this value out of the unsafe block, taking advantage of the "everything is an expression" syntax, and then print the message as before.

Trying out the new version:

$ rustc tinyrust.rs \
    -O -C no-stack-check -C relocation-model=static \
    -L syscall.rs/target
$ ar x libtinyrust.rlib tinyrust.o
$ objdump -dr tinyrust.o
0000000000000000 <main>:        
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   bf 01 00 00 00          mov    $0x1,%edi
   a:   be 08 00 40 00          mov    $0x400008,%esi
   f:   ba 07 00 00 00          mov    $0x7,%edx
  14:   0f 05                   syscall 
  16:   b8 3c 00 00 00          mov    $0x3c,%eax
  1b:   31 ff                   xor    %edi,%edi
  1d:   0f 05                   syscall 

...
$ wc -c < tinyrust
151
$ ./tinyrust
Hello!

The object code is the same as before, except that the relocation for the string constant has become an absolute address. The binary is smaller by 7 bytes (the size of "Hello!\n") and it still works!

You can find the full code on GitHub. The code in this article works on rustc 1.0.0-dev (44a287e6e 2015-01-08). If I update the code on GitHub, I will also update the version number printed by the included build script.

I'd be curious to hear if anyone can make my program smaller!


  1. C is not really "bare metal", but that's another story

  2. From a pure language perspective. If you want to talk about availability of compilers and libraries, then Rust still has a bit of a disadvantage ;) 

  3. In fact, this code grew out of an earlier experiment with really small binaries in C. 

Wednesday, October 29, 2014

A taste of Rust (yum) for C/C++ programmers

If, like me, you've been frustrated with the status quo in systems languages, this article will give you a taste of why Rust is so exciting. In a tiny amount of code, it shows a lot of ways that Rust really kicks ass compared to C and C++. It's not just safe and fast, it's a lot more convenient.

Web browsers do string interning to condense the strings that make up the Web, such as tag and attribute names, into small values that can be compared quickly. I recently added event logging support to Servo's string interner. This will allow us to record traces from real websites, which we can use to guide further optimizations.

Here are the events we can log:

#[deriving(Show)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

Interned strings have a 64-bit ID, which is recorded in every event. The String we store for "insert" events is like C++'s std::string; it points to a buffer in the heap, and it owns that buffer.

This enum is a bit fancier than a C enum, but its representation in memory is no more complex than a C struct. There's a tag for the three alternatives, a 64-bit ID, and a few fields that make up the String. When we pass or return an Event by value, it's at worst a memcpy of a few dozen bytes. There's no implicit heap allocation, garbage collection, or anything like that. We didn't define a way to copy an event; this means the String buffer always has a unique owner who is responsible for freeing it.

The deriving(Show) attribute tells the compiler to auto-generate a text representation, so we can print an Event just as easily as a built-in type.

Next we declare a global vector of events, protected by a mutex:

lazy_static! {
    pub static ref LOG: Mutex<Vec<Event>>
        = Mutex::new(Vec::with_capacity(50_000));
}

lazy_static! will initialize both of them when LOG is first used. Like String, the Vec is a growable buffer. We won't turn on event logging in release builds, so it's fine to pre-allocate space for 50,000 events. (You can put underscores anywhere in a integer literal to improve readability.)

lazy_static!, Mutex, and Vec are all implemented in Rust using gnarly low-level code. But the amazing thing is that all three expose a safe interface. It's simply not possible to use the variable before it's initialized, or to read the value the Mutex protects without locking it, or to modify the vector while iterating over it.

The worst you can do is deadlock. And Rust considers that pretty bad, still, which is why it discourages global state. But it's clearly what we need here. Rust takes a pragmatic approach to safety. You can always write the unsafe keyword and then use the same pointer tricks you'd use in C. But you don't need to be quite so guarded when writing the other 95% of your code. I want a language that assumes I'm brilliant but distracted :)

Rust catches these mistakes at compile time, and produces the same code you'd see with equivalent constructs in C++. For a more in-depth comparison, see Ruud van Asseldonk's excellent series of articles about porting a spectral path tracer from C++ to Rust. The Rust code performs basically the same as Clang / GCC / MSVC on the same platform. Not surprising, because Rust uses LLVM and benefits from the same backend optimizations as Clang.

lazy_static! is not a built-in language feature; it's a macro provided by a third-party library. Since the library uses Cargo, I can include it in my project by adding

[dependencies.lazy_static]
git = "https://github.com/Kimundi/lazy-static.rs"

to Cargo.toml and then adding

#[phase(plugin)]
extern crate lazy_static;

to src/lib.rs. Cargo will automatically fetch and build all dependencies. Code reuse becomes no harder than in your favorite scripting language.

Finally, we define a function that pushes a new event onto the vector:

pub fn log(e: Event) {
    LOG.lock().push(e);
}

LOG.lock() produces an RAII handle that will automatically unlock the mutex when it falls out of scope. In C++ I always hesitate to use temporaries like this because if they're destroyed too soon, my program will segfault or worse. Rust has compile-time lifetime checking, so I can do things that would be reckless in C++.

If you scroll up you'll see a lot of prose and not a lot of code. That's because I got a huge amount of functionality for free. Here's the logging module again:

#[deriving(Show)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

lazy_static! {
    pub static ref LOG: Mutex<Vec<Event>>
        = Mutex::new(Vec::with_capacity(50_000));
}

pub fn log(e: Event) {
    LOG.lock().push(e);
}

This goes in src/event.rs and we include it from src/lib.rs.

#[cfg(feature = "log-events")]
pub mod event;

The cfg attribute is how Rust does conditional compilation. Another project can specify

[dependencies.string_cache]
git = "https://github.com/servo/string-cache"
features = ["log-events"]

and add code to dump the log:

for e in string_cache::event::LOG.lock().iter() {
    println!("{}", e);
}

Any project which doesn't opt in to log-events will see zero impact from any of this.

If you'd like to learn Rust, the Guide is a good place to start. We're getting close to 1.0 and the important concepts have been stable for a while, but the details of syntax and libraries are still in flux. It's not too early to learn, but it might be too early to maintain a large library.

By the way, here are the events generated by interning the three strings foobarbaz foo blockquote:

Insert(0x7f1daa023090, foobarbaz)
Intern(0x7f1daa023090)
Intern(0x6f6f6631)
Intern(0xb00000002)

There are three different kinds of IDs, indicated by the least significant bits. The first is a pointer into a standard interning table, which is protected by a mutex. The other two are created without synchronization, which improves parallelism between parser threads.

In UTF-8, the string foo is smaller than a 64-bit pointer, so we store the characters directly. blockquote is too big for that, but it corresponds to a well-known HTML tag. 0xb is the index of blockquote in a static list of strings that are common on the Web. Static atoms can also be used in pattern matching, and LLVM's optimizations for C's switch statements will apply.

Wednesday, September 17, 2014

Raw system calls for Rust

I wrote a small library for making raw system calls from Rust programs. It provides a macro that expands into in-line system call instructions, with no run-time dependencies at all. Here's an example:

#![feature(phase)]

#[phase(plugin, link)]
extern crate syscall;

fn write(fd: uint, buf: &[u8]) {
    unsafe {
        syscall!(WRITE, fd, buf.as_ptr(), buf.len());
    }
}

fn main() {
    write(1, "Hello, world!\n".as_bytes());
}

Right now it only supports x86-64 Linux, but I'd love to add other platforms. Pull requests are much appreciated. :)

Wednesday, August 27, 2014

Calling a Rust library from C (or anything else!)

One reason I'm excited about Rust is that I can compile Rust code to a simple native-code library, without heavy runtime dependencies, and then call it from any language. Imagine writing performance-critical extensions for Python, Ruby, or Node in a safe, pleasant language that has static lifetime checking, pattern matching, a real macro system, and other goodies like that. For this reason, when I started html5ever some six months ago, I wanted it to be more than another "Foo for BarLang" project. I want it to be the HTML parser of choice, for a wide variety of applications in any language.

Today I started work in earnest on the C API for html5ever. In only a few hours I had a working demo. And this is a fairly complicated library, with 5,000+ lines of code incorporating

It's pretty cool that we can use all this machinery from C, or any language that can call C. I'll describe first how to build and use the library, and then I'll talk about the implementation of the C API.

html5ever (for C or for Rust) is not finished yet, but if you're feeling adventurous, you are welcome to try it out! And I'd love to have more contributors. Let me know on GitHub about any issues you run into.

Using html5ever from C

Like most Rust libraries, html5ever builds with Cargo.

$ git clone https://github.com/kmcallister/html5ever
$ cd html5ever
$ git checkout dev
$ cargo build
    Updating git repository `https://github.com/sfackler/rust-phf`
   Compiling phf_mac v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)
   Compiling html5ever-macros v0.0.0 (file:///tmp/html5ever)
   Compiling phf v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)
   Compiling html5ever v0.0.0 (file:///tmp/html5ever)

The C API isn't Cargo-ified yet, so we'll build it using the older Makefile-based system.

$ mkdir build
$ cd build
$ ../configure
$ make libhtml5ever_for_c.a
rustc -D warnings -C rpath -L /tmp/html5ever/target -L /tmp/html5ever/target/deps \
  -o libhtml5ever_for_c.a --cfg for_c --crate-type staticlib /tmp/html5ever/src/lib.rs
warning: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: rt
note: library: dl
note: library: pthread
note: library: gcc_s
note: library: pthread
note: library: c
note: library: m

Now we can build an example C program using that library, and following the link instructions produced by rustc.

$ H5E_PATH=/tmp/html5ever
$ gcc -Wall -o tokenize tokenize.c -I $H5E_PATH/capi -L $H5E_PATH/build \
  -lhtml5ever_for_c -lrt -ldl -lpthread -lgcc_s -lpthread -lc -lm

$ ./tokenize 'Hello&comma; <i class=excellent>world!</i>'
CHARS : Hello
CHARS : ,
CHARS :  
TAG   : <i>
  ATTR: class="excellent"
CHARS : world!
TAG   : </i>

The build process is pretty standard for C; we just link a .a file and its dependencies. The biggest obstacle right now is that you won't find the Rust compiler in your distro's package manager, because the language is still changing so rapidly. But there's a ton of effort going into stabilizing the language for a Rust 1.0 release this year. It won't be too long before rustc is a reasonable build dependency.

Let's look at the C client code.

#include <stdio.h>

#include "html5ever.h"

void put_str(const char *x) {
    fputs(x, stdout);
}

void put_buf(struct h5e_buf text) {
    fwrite(text.data, text.len, 1, stdout);
}

void do_start_tag(void *user, struct h5e_buf name, int self_closing, size_t num_attrs) {
    put_str("TAG   : <");
    put_buf(name);
    if (self_closing) {
        putchar('/');
    }
    put_str(">\n");
}

// ...

struct h5e_token_ops ops = {
    .do_chars = do_chars,
    .do_start_tag = do_start_tag,
    .do_tag_attr = do_tag_attr,
    .do_end_tag = do_end_tag,
};

struct h5e_token_sink sink = {
    .ops = &ops,
    .user = NULL,
};

int main(int argc, char *argv[]) {
    if (argc < 2) {
        printf("Usage: %s 'HTML fragment'\n", argv[0]);
        return 1;
    }

    struct h5e_tokenizer *tok = h5e_tokenizer_new(&sink);
    h5e_tokenizer_feed(tok, h5e_buf_from_cstr(argv[1]));
    h5e_tokenizer_end(tok);
    h5e_tokenizer_free(tok);
    return 0;
}

The struct h5e_token_ops contains pointers to callbacks. Any events we don't care to handle are left as NULL function pointers. Inside main, we create a tokenizer and feed it a string. html5ever for C uses a simple pointer+length representation of buffers, which is this struct h5e_buf you see being passed by value.

This demo only does tokenization, not tree construction. html5ever can perform both phases of parsing, but the API surface for tree construction is much larger and I didn't get around to writing C bindings yet.

Implementing the C API

Some parts of Rust's libstd depend on runtime services, such as task-local data, that a C program may not have initialized. So the first step in building a C API was to eliminate all std:: imports. This isn't nearly as bad as it sounds, because large parts of libstd are just re-exports from other libraries like libcore that we can use with no trouble. To be fair, I did write html5ever with the goal of a C API in mind, and I avoided features like threading that would be difficult to integrate. So your library might give you more trouble, depending on which Rust features you use.

The next step was to add the #![no_std] crate attribute. This means we no longer import the standard prelude into every module. To compensate, I added use core::prelude::*; to most of my modules. This brings in the parts of the prelude that can be used without runtime system support. I also added many imports for ubiquitous types like String and Vec, which come from libcollections.

After that I had to get rid of the last references to libstd. The biggest obstacle here involved macros and deriving, which would produce references to names under std::. To work around this, I create a fake little mod std which re-exports the necessary parts of core and collections. This is similar to libstd's "curious inner-module".

I also had to remove all uses of format!(), println!(), etc., or move them inside #[cfg(not(for_c))]. I needed to copy in the vec!() macro which is only provided by libstd, even though the Vec type is provided by libcollections. And I had to omit debug log messages when building for C; I did this with conditionally-defined macros.

With all this preliminary work done, it was time to write the C bindings. Here's how the struct of function pointers looks on the Rust side:

#[repr(C)]
pub struct h5e_token_ops {
    do_start_tag: extern "C" fn(user: *mut c_void, name: h5e_buf,
        self_closing: c_int, num_attrs: size_t),
    
    do_tag_attr: extern "C" fn(user: *mut c_void, name: h5e_buf,
        value: h5e_buf),

    do_end_tag:  extern "C" fn(user: *mut c_void, name: h5e_buf),

    // ...
}

The processing of tokens is straightforward. We pattern-match and then call the appropriate function pointer, unless that pointer is NULL. (Edit: eddyb points out that storing NULL as an extern "C" fn is undefined behavior. Better to use Option<extern "C" fn ...>, which will optimize to the same one-word representation.)

To create a tokenizer, we heap-allocate the Rust data structure in a Box, and then transmute that to a raw C pointer. When the C client calls h5e_tokenizer_free, we transmute this pointer back to a box and drop it, which will invoke destructors and finally free the memory.

You'll note that the functions exported to C have several special annotations:

  • #[no_mangle]: skip name mangling, so we end up with a linker symbol named h5e_tokenizer_free instead of _ZN5for_c9tokenizer18h5e_tokenizer_free.
  • unsafe: don't let Rust code call these functions unless it promises to be careful.
  • extern "C": make sure the exported function has a C-compatible ABI. The data structures similarly get a #[repr(C)] attribute.

Then I wrote a C header file matching this ABI:

struct h5e_buf {
    unsigned char *data;
    size_t len;
};

struct h5e_buf h5e_buf_from_cstr(const char *str);

struct h5e_token_ops {
    void (*do_start_tag)(void *user, struct h5e_buf name,
        int self_closing, size_t num_attrs);

    void (*do_tag_attr)(void *user, struct h5e_buf name,
        struct h5e_buf value);

    void (*do_end_tag)(void *user, struct h5e_buf name);

    /// ...
};

struct h5e_tokenizer;

struct h5e_tokenizer *h5e_tokenizer_new(struct h5e_token_sink *sink);
void h5e_tokenizer_free(struct h5e_tokenizer *tok);
void h5e_tokenizer_feed(struct h5e_tokenizer *tok, struct h5e_buf buf);
void h5e_tokenizer_end(struct h5e_tokenizer *tok);

One remaining issue is that Rust is hard-wired to use jemalloc, so linking html5ever will bring that in alongside the system's libc malloc. Having two separate malloc heaps will likely increase memory consumption, and it prevents us from doing fun things like allocating Boxes in Rust that can be used and freed in C. Before Rust can really be a great choice for writing C libraries, we need a better solution for integrating the allocators.

If you'd like to talk about calling Rust from C, you can find me as kmc in #rust and #rust-internals on irc.mozilla.org. And if you run into any issues with html5ever, do let me know, preferably by opening an issue on GitHub. Happy hacking!

Tuesday, June 10, 2014

On depression, privilege, and online activism

[Content warning: depression, privilege, online activism]

This isn't a general account of my experiences with depression. Many people have written about that, and I don't have much to add. But there's one aspect that I don't hear about very often. It's something that bothers me a lot, and others have told me that it bothers them too.

The thing is, I'm not just a person with a mental illness. I'm also a well-off white guy, and I enjoy a whole set of unearned privileges from that. Every day people around the world are harassed, abused, and killed over things I never have to worry about. Even in mundane daily life, most everyone is playing on a higher difficulty setting than I ever will.

I've thought about this a lot over the past few years, and I'm trying to understand how I can help make the world more fair and less oppressive. So I give money and I volunteer a little and I speak up when it seems useful, but mostly I listen. I listen to the experiences of people who are different from me. I try to get some understanding of how they feel and why.

How is this related to depression? Because the reality of privilege and oppression is fucking depressing. Of course it's depressing to those who are directly harmed. That's a lot of what I read about, and some of the despair transfers to me. But my profiting from the suffering of others in a way that I mostly can't change is also depressing, at least if I make an attempt not to ignore it.

And my distress over my role in systems of oppression brings its own layer of guilt. People are actually suffering and I feel sorry for myself because I'm dimly aware of it? But this comes from the voice that has always taunted me about depression. “How can you be sad? Your life is great. If you had real problems you wouldn't be so pathetic. You're not really sick. You're just a whiner.”

All of which is part of the disease. I need to own it and work on it every day. But it seems like every time I read an online discussion about social justice, I take a huge step backwards.

It's hard to shrug off the “men are horrible” comments when I spend so much effort trying to convince myself that I'm not horrible. When I hear people gloating about delicious white male tears, I think about all the times when I would come home from work and collapse in bed crying. Is this what they want my life to be?

I can't give myself permission to tune out, because the same people lecture constantly about my obligation to be a good ally, which mostly takes the form of “shut up and listen.” And then when I'm upset by the things they say, the response is “This isn't for you! Why are you listening?”

A local group, one that had recently invited me to hang out as a guest, retweeted a member's declaration to would-be allies: “We're not friends. Fuck you.” Can you see why it feels like they're trying to hurt me?


Let me be clear: I truly don't care if people in a room somewhere are talking about how men are the worst. I don't feel oppressed by it, and I have no desire to argue with it. But I can't handle direct exposure.

And don't tell me that I'm too stupid to understand why they say these things. I know intellectually that it's not about me. I understand the need to vent and the importance of building solidarity. None of that matters on the emotional level where these comments register like a punch to the gut. I do feel this way, even if I shouldn't and I wish I didn't.

I'm talking about mental health, triggers, and unintentionally hurtful speech. Does that sound familiar? One reason I was drawn to intersectional feminism is that it seemed to have a good set of ground rules for how to treat everyone decently. But now I feel like I'm excluded from protection. “Men are horrible” is apparently the one form of speech where intent is all that matters, and I'm a bad person if it triggers something. I've been told it's offensive that I would even try to describe my experience in those terms.

It hurts a whole lot to try and really feel someone's pain, and then realize they don't even slightly give a shit about me. It hurts even more when they'll bend over backwards for anyone except me.

Look, I get it. You argue all the time with trolls who claim that men have it just as bad as women and will shout “what about the men” as a way to disrupt any discussion. When you're engaged in meme warfare, you can't show them any human empathy. They certainly wouldn't return the favor. And if my voice sounds a little like theirs, that's just too bad for me.

I know that this article will serve as ammunition for some people with views I find disgusting. That sucks, but I'm done using political strategy as a reason to stay silent. I understand tone policing as a derailing tactic, and I understand the need to call it out. But at this point it seems there's no room for a sincere request for kindness, especially coming from someone who doesn't get much benefit of the doubt. (The Geek Feminism Wiki basically says that asking for kindness is tone policing if and only if you're a man.)

I'm not trying to silence anyone here. I'm not jumping in and derailing an existing conversation. I'm writing on my own blog, on my own schedule, about my own feelings. But I'm told that even this is crossing a line.

I know that I can't dictate how others feel about our fucked-up world. Does that mean I must absolutely suppress the way I feel? Even when we agree about the substance of what's wrong? I know that if I ask someone to share their life experiences, they have a right to express anger. When does expressing anger become sustained, deliberate cruelty?

“People are being oppressed and you're asking us to care about your feelings?” Yes, I am asking you to care. Just a little bit. I don't claim that my feelings should be a top priority. I hope it wouldn't come up very often. But according to the outspoken few who set the tone, I'm never allowed to bring it up. I don't deserve to ask them to be nice.

And that's why I can no longer have anything to do with this movement. It's really that simple. I guess it says something about my state of mind that I felt the need to attach 1,700 words of preemptive defenses.


The truth is, when I'm not allowed to say or even think “not all men,” part of me hears “Yes, all men, especially you.” And if I'm ever confused about whether I'm allowed to say “not all men,” there are a dozen unprompted reminders every day. Little jokes, repeated constantly to set the climate about what will and won't be tolerated.

When you treat me like one of the trolls, I start to believe that I am one. Guys who say “I support feminism but sometimes they go too far” are usually trying to excuse sexist behavior. So what do I conclude about myself when I have the same thought?

I get that “ally” is not a label you self-apply, it's a thing you do, and the label comes from others. The problem is, if a hundred people say I'm a good ally, and one person says I'm a sexist asshole, who do you think I'm going to believe?

I'm not allowed to stand up for myself, because doing so is automatically an act of oppression. If a woman treats me like shit, and she's being “more feminist” than me, I conclude that I deserve to be treated like shit. That is the model I've learned of a good ally.

I'm not a good ally, or even a bad one. I'm collateral damage.

If the point of all this is to give me a tiny little taste of the invalidation that others experience on a regular basis, then congratulations, it worked. You've made your point. Now that you've broken me, how can I possibly help you, when it seems like I'm part of the problem just by existing? It feels like all I can do is engage in emotional self-harm to repay the debt of how I was born.

I can't just take a break “until I feel better.” My depressive symptoms will always come and go, and some thoughts will reliably bring them back. I spent years reading about how the most important thing I can do, as a winner of the birth lottery, is to be an ally to marginalized people. And now I've realized that I'm too sick and weak to do it.

Even if I give up on being an ally, I can't avoid this subject. It affects a lot of my friends, and I feel even worse when I ask them not to talk about it around me. I don't want to silence anyone. At least I've mostly stopped using Twitter.

So this is how I feel, but I'm not sure anyone else can do anything about it. Really, most of the people I've talked to have been sympathetic. Maybe I need to learn not to let bullies get to me, even when they're bullying in service of a cause I support. They don't seem to get much pushback from the wider community, at any rate.

What gives me hope is, I recognize that my participation in the endless shouting online wasn't really useful to anyone. If I can let myself ignore all that, maybe I can recover some of my energy for other activities that actually help people.

That's all I have to say right now. Thank you for listening to me.

Tuesday, February 11, 2014

x86 is Turing-complete with no registers

In which x86 has too many registers after all.

Introduction

The fiendish complexity of the x86 instruction set means that even bizarrely restricted subsets are capable of arbitrary computation. As others have shown, we can compute using alphanumeric machine code or English sentences, using only the mov instruction, or using the MMU as it handles a never-ending double-fault. Here is my contribution to this genre of Turing tarpit: x86 is Turing-complete with no registers.

No registers?

What do I mean by "no registers"? Well, really just whatever makes the puzzle interesting, but the basic guideline is:

No instruction's observable behavior can depend on the contents of any ordinary user-space register.

So we can't read from R[ABCD]X, R[SD]I, R[SB]P (that's right, no stack), R8-R15, any of their smaller sub-registers, or any of the x87 / MMX / SSE registers. This forbids implicit register access like push or movsb as well as explicit operands. I think I would allow RIP-relative addressing, but it's probably not useful when you're building a single executable which loads at a fixed address.

We also can't use the condition flags in EFLAGS, so conditional jumps and moves are right out. Many instructions will set these flags, but those dead stores are okay by me.

All memory access depends on segment selectors, the page table base in CR3, and so on. We trust that the OS (Linux in my example) has set up a reasonable flat memory model, and we shouldn't try to modify that. Likewise there are debug registers, parts of EFLAGS (such as the trap bit), and numerous MSRs which can influence the execution of nearly any instruction. We ignore all that too. Basically, the parts of CPU state which normal user-space code doesn't touch are treated as constants.

So what's left that we can work with? Just

  • the instruction pointer,
  • memory operands, and
  • self-modifying code.

But it would be too easy to self-modify an instruction into having a register operand. The above restrictions must hold for every instruction we execute, not just those appearing in our binary. Later on I'll demonstrate experimentally that we aren't cheating.

The instruction set

In a RISC architecture, every memory access is a register load or store, and our task would be completely impossible. But x86 does not have this property. For example we can store a constant directly into memory. Here's machine code along with NASM (Intel syntax) assembly:

c6042500004000ba          mov byte  [0x400000], 0xba
66c7042500004000dbba      mov word  [0x400000], 0xbadb
c7042500004000efbead0b    mov dword [0x400000], 0xbadbeef
48c7042500004000efbead0b  mov qword [0x400000], 0xbadbeef

In the latter case the 4-byte constant is sign-extended to 8 bytes.

We can also perform arithmetic on a memory location in place:

8304250000400010          add dword [0x400000], 0x10

But moving data around is going to be hard. As far as I know, every instruction which loads from one address and stores to another, for example movsb, depends on registers in some way.

Conditional control flow is possible thanks to this gem of an instruction:

ff242500004000            jmp qword [0x400000]

This jumps to whatever address is stored as a 64-bit quantity at address 0x400000. This seems weird but it's really just a load where the destination register is the instruction pointer. Many RISC architectures also allow this.

Compiling from Brainfuck

Let's get more concrete and talk about compiling Brainfuck code to this subset of x86. Brainfuck isn't the simplest language out there (try Subleq) but it's pretty familiar as an imperative, structured-control language. So I think compiling from Brainfuck makes this feel "more real" than compiling from something super weird.

A Brainfuck program executes on a linear tape of (typically) byte-size cells.

TAPE_SIZE equ 30000

tape_start:
    times TAPE_SIZE dq cell0

head equ tape_start + (TAPE_SIZE / 2)

Like many Brainfuck implementations, the tape has a fixed size (more on this later) and we start in the middle. head is not a variable with a memory location; it's just an assembler constant for the address of the middle of the tape.

Since our only way to read memory is jmp [addr], the tape must store pointers to code. We create 256 short routines, each representing one of the values a cell can hold.

cont_zero:     dq 0
cont_nonzero:  dq 0
out_byte:      db 0

align 16
cell_underflow:
    jmp inc_cell

align 16
cell0:
    mov byte [out_byte], 0
    jmp [cont_zero]

%assign cellval 1
%rep 255
    align 16
    mov byte [out_byte], cellval
    jmp [cont_nonzero]
    %assign cellval cellval+1
%endrep

align 16
cell_overflow:
    jmp dec_cell

There are two things we need to do with a cell: get its byte value for output, and test whether it's zero. So each routine moves a byte into out_byte and jumps to the address stored at either cont_zero or cont_nonzero.

We produce most of the routines using a NASM macro. We also have functions to handle underflow and overflow, so a cell which would reach -1 or 256 is bumped back to 0 or 255. (We could implement the more typical wrap-around behavior with somewhat more code.)

The routines are aligned on 16-byte boundaries so that we can implement Brainfuck's + and - by adding or subtracting 16. But how do we know where the head is? We can't store it in a simple memory variable because we'd need a double-indirect jump instruction. This is where the self-modifying code comes in.

test_cell:
    jmp [head]

inc_cell:
    add qword [head], 16
    jmp test_cell

dec_cell:
    sub qword [head], 16
    jmp test_cell

move_right:
    add dword [inc_cell+4],  8
    add dword [dec_cell+4],  8
    add dword [test_cell+3], 8
    jmp [cont_zero]

move_left:
    sub dword [inc_cell+4],  8
    sub dword [dec_cell+4],  8
    sub dword [test_cell+3], 8
    jmp [cont_zero]

Recall that head is an assembler constant for the middle of the tape. So inc_cell etc. will only touch the exact middle of the tape — except that we modify the instructions when we move left or right. The address operand starts at byte 3 or 4 of the instruction (check the disassembly!) and we change it by 8, the size of a function pointer.

Also note that inc_cell and dec_cell jump to test_cell in order to handle overflow / underflow. By contrast the move instructions don't test the current cell and just jump to [cont_zero] unconditionally.

To output a byte we perform the system call write(1, &out_byte, 1). There's no escaping the fact that the Linux system call ABI uses registers, so I allow them here. We can do arbitrary computation without output; it's just nice if we can see the results. Input is messier still but it's not fundamentally different from what we've seen here. Code that self-modifies by calling read() is clearly the future of computing.

Putting it all together, I wrote a small Brainfuck compiler which does little more than match brackets. For each Brainfuck instruction it outputs one line of assembly, a call to a NASM macro which will load cont_[non]zero and jump to one of test_cell, inc_cell, etc. For the program [+] the compiler's output looks like

k00000000: do_branch k00000003, k00000001
k00000001: do_inc    k00000002
k00000002: do_branch k00000003, k00000001
k00000003: jmp exit

which blows up into something like

401205:  48c70425611240005c124000  mov qword ptr [0x401261], 0x40125c
401211:  48c704256912400022124000  mov qword ptr [0x401269], 0x401222
40121d:  e90cefffff                jmp 40012e <test_cell>

401222:  48c70425611240003f124000  mov qword ptr [0x401261], 0x40123f
40122e:  48c70425691240003f124000  mov qword ptr [0x401269], 0x40123f
40123a:  e9f6eeffff                jmp 400135 <inc_cell>

40123f:  48c70425611240005c124000  mov qword ptr [0x401261], 0x40125c
40124b:  48c704256912400022124000  mov qword ptr [0x401269], 0x401222
401257:  e9d2eeffffe9d2eeffff      jmp 40012e <test_cell>

40125c:  e9c1eeffff                jmp 400122 <exit>

Even within our constraints, this code could be a lot more compact. For example, a test could be merged with a preceding inc or dec.

Demos

Let's try it out on some of Daniel B Cristofani's Brainfuck examples.

$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | ./rip
Hello, world!

$ curl -s http://www.hevanet.com/cristofd/brainfuck/fib.b | ./compiler
$ ./rip
0
1
1
2
3
5
8
13
…

And now let's try a Brainfuck interpreter written in Brainfuck. There are several, but we will choose the fastest one (by Clive Gifford), which is also compatible with our handling of end-of-file and cell overflow.

$ curl -s http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b | ./compiler
$ (curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b;
   echo '!Uryyb, jbeyq!') | ./rip
Hello, world!

This takes about 4.5 seconds on my machine.

Verifying it with ptrace

How can we verify that a program doesn't use registers? There's no CPU flag to disable registers, but setting them to zero after each instruction is close enough. Linux's ptrace system call allows us to manipulate the state of a target process.

uint64_t regs_boundary;

void clobber_regs(pid_t child) {
    struct user_regs_struct   regs_int;
    struct user_fpregs_struct regs_fp;

    CHECK(ptrace(PTRACE_GETREGS, child, 0, &regs_int));
    if (regs_int.rip < regs_boundary)
        return;

    CHECK(ptrace(PTRACE_GETFPREGS, child, 0, &regs_fp));

    // Clear everything before the instruction pointer,
    // plus the stack pointer and some bits of EFLAGS.
    memset(&regs_int, 0, offsetof(struct user_regs_struct, rip));
    regs_int.rsp = 0;
    regs_int.eflags &= EFLAGS_MASK;

    // Clear x87 and SSE registers.
    memset(regs_fp.st_space,  0, sizeof(regs_fp.st_space));
    memset(regs_fp.xmm_space, 0, sizeof(regs_fp.xmm_space));

    CHECK(ptrace(PTRACE_SETREGS,   child, 0, &regs_int));
    CHECK(ptrace(PTRACE_SETFPREGS, child, 0, &regs_fp));

    clobber_count++;
}

For the layout of struct user_regs_struct, see /usr/include/sys/user.h.

We allow registers in the first part of the program, which is responsible for system calls. regs_boundary is set by looking for the symbol FORBID_REGS in the binary.

We run the target using PTRACE_SINGLESTEP, which sets the trap flag so that the CPU will raise a debug exception after one instruction. Linux handles this exception, suspends the traced process, and wakes up the tracer, which was blocked on waitpid.

while (1) {
    // For this demo it's simpler if we don't deliver signals to
    // the child, so the last argument to ptrace() is zero.
    CHECK(ptrace(PTRACE_SINGLESTEP, child, 0, 0));
    CHECK(waitpid(child, &status, 0));

    if (WIFEXITED(status))
      finish(WEXITSTATUS(status));

    inst_count++;
    clobber_regs(child);
}

And the demo:

$ gcc -O2 -Wall -o regclobber regclobber.c
$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler
$ echo 'Uryyb, jbeyq!' | time ./regclobber ./rip
Hello, world!

Executed 81366 instructions; clobbered registers 81189 times.
0.36user 1.81system 0:01.96elapsed 110%CPU (0avgtext+0avgdata 1392maxresident)k

At almost two seconds elapsed, this is hundreds of times slower than running ./rip directly. Most of the time is spent in the kernel, handling all those system calls and debug exceptions.

I wrote about ptrace before if you'd like to see more of the things it can do.

Notes on universality

Our tape has a fixed size of 30,000 cells, the same as Urban Müller's original Brainfuck compiler. A system with a finite amount of state can't really be Turing-complete. But x86 itself also has a limit on addressable memory. So does C, because sizeof(void *) is finite. These systems are Turing-complete when you add an external tape using I/O, but so is a finite-state machine!

So while x86 isn't really Turing-complete, with or without registers, I think the above construction "feels like" arbitrary computation enough to meet the informal definition of "Turing-complete" commonly used by programmers, for example in the mov is Turing-complete paper. If you know of a way to formalize this idea, do let me know (I'm more likely to notice tweets @miuaf than comments here).