Documenting the ar file format

2022-04-01

Introduction

Wikipedia has this to say about what ar actually is:

The archiver, also known simply as ar, is a Unix utility that maintains groups of files as a single archive file. Today, ar is generally used only to create and update static library files that the link editor or linker uses and for generating .deb packages for the Debian family; it can be used to create archives for any purpose, but has been largely replaced by tar for purposes other than static libraries.

As it turns out, ar is kind of a big deal: it is mainly used to create archives of object files that serve as static libraries.

A curious thing about the ar file format is that it has never been standardized. As a result, a few variants have cropped up over the fifty or so years of its existence. This is an attempt at documenting the main variants of the ar file format by writing parsers for each of them, starting with the common variant:

Common Variant

Format Description

The common variant, used for Debian package (.deb) files among other things, only supports filenames up to 16 characters.

  • Each archive starts with the eight-character string !<arch>\n
  • Each archive member is preceded by a 60-byte header containing:
    • The name of the member, right padded to 16 characters by spaces
    • The modification date, as a decimal number of seconds since the beginning of 1970 (12 characters)
    • The user and group IDs as decimal numbers (6 characters each)
    • The UNIX file mode as an octal number (8 characters)
    • The size of the file in bytes as a decimal number (10 characters). If the file size is odd, the file’s contents are padded with a newline character to make the total length even, although the pad character isn’t counted in the size field
    • The two characters reverse quote (`) and newline, to make the header a line of text and provide a simple check that the header is indeed a header
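To make the layout concrete, here is a minimal sketch that slices a header into its raw fields using only the standard library. The names FIELDS and split_header are made up for this illustration, and the widths are the ones the parsers later in this post rely on (16 + 12 + 6 + 6 + 8 + 10 + 2 = 60).

```rust
/// Field names and widths of the common-variant header, in file order.
/// `FIELDS` and `split_header` are illustrative names, not part of any library.
const FIELDS: [(&str, usize); 7] = [
    ("name", 16),
    ("mtime", 12),
    ("uid", 6),
    ("gid", 6),
    ("mode", 8),
    ("size", 10),
    ("terminator", 2),
];

/// Splits a header into its raw, still space-padded fields.
/// Returns `None` unless the input is exactly 60 bytes long.
fn split_header(header: &[u8]) -> Option<Vec<(&'static str, &[u8])>> {
    if header.len() != 60 {
        return None;
    }
    let mut offset = 0;
    let mut fields = Vec::with_capacity(FIELDS.len());
    for &(name, width) in FIELDS.iter() {
        fields.push((name, &header[offset..offset + width]));
        offset += width;
    }
    Some(fields)
}

fn main() {
    // The first header from the sample archive used in the tests.
    let header = b"foo.txt         1487552916  501   20    100644  7         `\n";
    if let Some(fields) = split_header(header) {
        for (name, raw) in fields {
            println!("{:<10} {:?}", name, String::from_utf8_lossy(raw));
        }
    }
}
```

Nothing is interpreted here; the fields are only cut at fixed offsets, which is all the header layout really is.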

ar common variant

Parser

With the format described, let's move on to writing a parser for it. We are going to write the parser in Rust using nom.

Tip

Basic understanding of nom is assumed. To grok nom, this is the tutorial I recommend.

Note

These parsers are aimed at being as easy to understand as possible. Performance is a non-goal.

Let's parse the file signature first:

use nom::bytes::complete::tag;
use nom::IResult;

fn parse_file_signature(input: &[u8]) -> IResult<&[u8], &[u8]> {
    tag(b"!<arch>\n")(input)
}

and test it:

#[test]
fn test_parse_file_signature() {
    let input = "\
        !<arch>\n\
        foo.txt         1487552916  501   20    100644  7         `\n\
        foobar\n\n\
        bar.awesome.txt 1487552919  501   20    100644  22        `\n\
        This file is awesome!\n\
        baz.txt         1487552349  42    12345 100664  4         `\n\
        baz\n";
    let (_, file_sig) =
        parse_file_signature(input.as_bytes()).expect("failed in parsing file signature");
    assert_eq!(file_sig, b"!<arch>\n");
}

Tip

Because the headers contain only printable ASCII characters and line feeds, an archive containing only text files still appears to be a text file itself.
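As a rough illustration of this, the sketch below goes the other way and writes a small archive as a plain String, using only the standard library. The helpers format_header and build_archive are hypothetical names for this example; they do no validation (over-long names or values are not truncated), and the mtime, uid and gid are simply set to 0.

```rust
/// Formats one common-variant entry header as a line of text.
/// Hypothetical helper: it assumes every value fits its field width.
fn format_header(name: &str, mtime: u64, uid: u32, gid: u32, mode: u32, size: u64) -> String {
    // Left-align each field and pad it with spaces; the mode is written in octal.
    format!(
        "{:<16}{:<12}{:<6}{:<6}{:<8o}{:<10}`\n",
        name, mtime, uid, gid, mode, size
    )
}

/// Builds an archive from (name, contents) pairs of text files.
fn build_archive(members: &[(&str, &str)]) -> String {
    let mut archive = String::from("!<arch>\n");
    for &(name, contents) in members {
        archive.push_str(&format_header(name, 0, 0, 0, 0o100644, contents.len() as u64));
        archive.push_str(contents);
        // Odd-sized contents get one newline of padding.
        if contents.len() % 2 == 1 {
            archive.push('\n');
        }
    }
    archive
}

fn main() {
    let archive = build_archive(&[("foo.txt", "foobar\n"), ("baz.txt", "baz\n")]);
    // Text members plus text headers give a text archive.
    assert!(archive.is_ascii());
    print!("{}", archive);
}
```

Because every piece is text, the result can be opened in any editor and read line by line.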

Next, let's parse the member (entry) header, starting with the entry identifier.

We first take the 16 bytes that hold the identifier, which may be right padded with spaces, and then strip any trailing spaces:

use nom::bytes::complete::take;
use nom::combinator::map;

fn parse_identifier(input: &[u8]) -> IResult<&[u8], Vec<u8>> {
    map(take(16usize), |identifier_padded: &[u8]| {
        // Copy the field, then drop the space padding on the right.
        let mut identifier: Vec<u8> = identifier_padded.to_vec();
        while identifier.last() == Some(&b' ') {
            identifier.pop();
        }
        identifier
    })(input)
}

#[test]
fn test_parse_identifier() {
    let input = "\
        !<arch>\n\
        foo.txt         1487552916  501   20    100644  7         `\n\
        foobar\n\n\
        bar.awesome.txt 1487552919  501   20    100644  22        `\n\
        This file is awesome!\n\
        baz.txt         1487552349  42    12345 100664  4         `\n\
        baz\n";
    let (i, file_sig) =
        parse_file_signature(input.as_bytes()).expect("failed in parsing file signature");
    assert_eq!(file_sig, b"!<arch>\n");
    let (_, identifier) = parse_identifier(i).expect("failed in parsing entry identifier");
    assert_eq!(identifier, b"foo.txt");
}
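If copying the identifier into a Vec is not needed, the same trimming can be done by borrowing from the input. This is only a sketch of an alternative, not what the parser in this post does, and trim_identifier is a made-up name:

```rust
/// Returns the identifier without its trailing space padding,
/// borrowing from the input instead of allocating.
/// `trim_identifier` is a hypothetical helper name.
fn trim_identifier(padded: &[u8]) -> &[u8] {
    // Index just past the last non-space byte, or 0 if the field is all spaces.
    let end = padded
        .iter()
        .rposition(|&byte| byte != b' ')
        .map_or(0, |index| index + 1);
    &padded[..end]
}

fn main() {
    assert_eq!(trim_identifier(b"foo.txt         "), b"foo.txt");
    // Spaces inside the name are kept; only the padding is removed.
    assert_eq!(trim_identifier(b"my file.txt     "), b"my file.txt");
    assert_eq!(trim_identifier(b"                "), b"");
}
```

The trade-off is that the returned slice is tied to the lifetime of the input buffer, whereas the Vec version owns its data.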

The rest of the fields in the entry header are numbers encoded as ASCII, so let's write a generic parser for that case:

use std::str;

// Converts a space-padded ASCII field into a number in the given radix.
// Panics on malformed input to keep the example simple.
fn byte_slice_to_number(byte_slice: &[u8], radix: u32) -> u64 {
    let s = str::from_utf8(byte_slice).expect("failed in converting byte slice to string slice");
    u64::from_str_radix(s.trim_end(), radix)
        .unwrap_or_else(|_| panic!("failed in converting: {} to number", s))
}

fn parse_number(input: &[u8], length: usize, radix: u32) -> IResult<&[u8], u64> {
    map(take(length), |byte_slice| {
        byte_slice_to_number(byte_slice, radix)
    })(input)
}
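byte_slice_to_number panics when a field is malformed, which is fine for a walkthrough. As a sketch of what a more forgiving conversion could look like, the std-only function below returns None instead; parse_padded_number is a hypothetical name and is not used by the parser in this post.

```rust
/// Parses a space-padded ASCII number in the given radix.
/// Returns `None` instead of panicking when the field is malformed.
fn parse_padded_number(field: &[u8], radix: u32) -> Option<u64> {
    let text = std::str::from_utf8(field).ok()?;
    u64::from_str_radix(text.trim_end_matches(' '), radix).ok()
}

fn main() {
    // Decimal field, as used for mtime, uid, gid and size.
    assert_eq!(parse_padded_number(b"1487552916  ", 10), Some(1487552916));
    // Octal field, as used for the mode: 0o100644 is 33188 in decimal.
    assert_eq!(parse_padded_number(b"100644  ", 8), Some(33188));
    // Malformed or empty fields are reported rather than causing a panic.
    assert_eq!(parse_padded_number(b"12x4  ", 10), None);
    assert_eq!(parse_padded_number(b"      ", 10), None);
}
```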

parse_number takes length bytes of data representing a number in ASCII in base radix and returns that number as a u64. We can use parse_number to write parsers for the rest of the entry header fields:

fn parse_mtime(input: &[u8]) -> IResult<&[u8], u64> {
    parse_number(input, 12, 10)
}

fn parse_uid(input: &[u8]) -> IResult<&[u8], u64> {
    parse_number(input, 6, 10)
}

fn parse_gid(input: &[u8]) -> IResult<&[u8], u64> {
    parse_number(input, 6, 10)
}

fn parse_mode(input: &[u8]) -> IResult<&[u8], u64> {
    parse_number(input, 8, 8)
}

fn parse_size(input: &[u8]) -> IResult<&[u8], u64> {
    parse_number(input, 10, 10)
}

#[test]
fn test_parse_header_numbers() {
    let input = "\
        !<arch>\n\
        foo.txt         1487552916  501   20    100644  7         `\n\
        foobar\n\n\
        bar.awesome.txt 1487552919  501   20    100644  22        `\n\
        This file is awesome!\n\
        baz.txt         1487552349  42    12345 100664  4         `\n\
        baz\n";
    let (i, file_sig) =
        parse_file_signature(input.as_bytes()).expect("failed in parsing file signature");
    assert_eq!(file_sig, b"!<arch>\n");
    let (i, identifier) = parse_identifier(i).expect("failed in parsing entry identifier");
    assert_eq!(identifier, b"foo.txt");
    let (i, mtime) = parse_mtime(i).expect("failed in parsing mtime");
    assert_eq!(mtime, 1487552916);
    let (i, uid) = parse_uid(i).expect("failed in parsing uid");
    assert_eq!(uid, 501);
    let (i, gid) = parse_gid(i).expect("failed in parsing gid");
    assert_eq!(gid, 20);
    let (i, mode) = parse_mode(i).expect("failed in parsing mode");
    assert_eq!(mode, 0o100644);
    let (_, size) = parse_size(i).expect("failed in parsing size");
    assert_eq!(size, 7);
}
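The mode asserted above, 0o100644, packs two things into one number: the file type in the high bits and the permission bits in the low nine. The sketch below decodes both with the standard library only. The constants use the conventional UNIX values for S_IFMT and S_IFREG, and the function names are made up for this example.

```rust
/// Mask selecting the file-type bits of a UNIX mode (conventional `S_IFMT` value).
const S_IFMT: u32 = 0o170000;
/// File-type value for a regular file (conventional `S_IFREG` value).
const S_IFREG: u32 = 0o100000;

/// Returns true if the mode describes a regular file.
fn is_regular_file(mode: u32) -> bool {
    (mode & S_IFMT) == S_IFREG
}

/// Renders the lowest nine permission bits in `ls -l` style.
fn permission_string(mode: u32) -> String {
    let flags = ['r', 'w', 'x'];
    (0..9usize)
        .map(|i| {
            // Walk from the owner-read bit (0o400) down to the other-execute bit (0o001).
            let bit = 1u32 << (8 - i);
            if (mode & bit) != 0 { flags[i % 3] } else { '-' }
        })
        .collect()
}

fn main() {
    let mode = 0o100644;
    println!("regular file: {}", is_regular_file(mode));
    println!("permissions:  {}", permission_string(mode));
}
```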

Next up, we need to parse the entry header terminator:

fn parse_entry_header_terminator(input: &[u8]) -> IResult<&[u8], &[u8]> {
    tag(b"`\n")(input)
}

With the rest of the pieces in place, we can finally parse the complete entry header:

#[derive(Debug, PartialEq)]
struct EntryHeader {
    identifier: Vec<u8>,
    mtime: u64,
    uid: u32,
    gid: u32,
    mode: u32,
    size: u64,
}

fn parse_entry_header(input: &[u8]) -> IResult<&[u8], EntryHeader> {
    // `?` hands parse errors back to the caller instead of panicking.
    let (i, identifier) = parse_identifier(input)?;
    let (i, mtime) = parse_mtime(i)?;
    let (i, uid) = parse_uid(i)?;
    let (i, gid) = parse_gid(i)?;
    let (i, mode) = parse_mode(i)?;
    let (i, size) = parse_size(i)?;
    let (i, _) = parse_entry_header_terminator(i)?;
    let entry_header = EntryHeader {
        identifier,
        mtime,
        uid: uid as u32,
        gid: gid as u32,
        mode: mode as u32,
        size,
    };
    Ok((i, entry_header))
}

#[test]
fn test_parse_entry_header() {
    let input = "\
        !<arch>\n\
        foo.txt         1487552916  501   20    100644  7         `\n\
        foobar\n\n\
        bar.awesome.txt 1487552919  501   20    100644  22        `\n\
        This file is awesome!\n\
        baz.txt         1487552349  42    12345 100664  4         `\n\
        baz\n";
    let (i, file_sig) =
        parse_file_signature(input.as_bytes()).expect("failed in parsing file signature");
    assert_eq!(file_sig, b"!<arch>\n");
    let (_, entry_header) = parse_entry_header(i).expect("failed in parsing entry header");
    assert_eq!(entry_header.identifier, b"foo.txt");
    assert_eq!(entry_header.mtime, 1487552916);
    assert_eq!(entry_header.uid, 501);
    assert_eq!(entry_header.gid, 20);
    assert_eq!(entry_header.mode, 0o100644);
    assert_eq!(entry_header.size, 7);
}

Next, let's parse the entry data:

fn parse_entry_data(input: &[u8], size: u64) -> IResult<&[u8], &[u8]> {
    take(size)(input)
}

#[test]
fn test_parse_entry_data() {
    let input = "\
        !<arch>\n\
        foo.txt         1487552916  501   20    100644  7         `\n\
        foobar\n\n\
        bar.awesome.txt 1487552919  501   20    100644  22        `\n\
        This file is awesome!\n\
        baz.txt         1487552349  42    12345 100664  4         `\n\
        baz\n";
    let (i, file_sig) =
        parse_file_signature(input.as_bytes()).expect("failed in parsing file signature");
    assert_eq!(file_sig, b"!<arch>\n");
    let (i, entry_header) = parse_entry_header(i).expect("failed in parsing entry header");
    assert_eq!(entry_header.identifier, b"foo.txt");
    assert_eq!(entry_header.mtime, 1487552916);
    assert_eq!(entry_header.uid, 501);
    assert_eq!(entry_header.gid, 20);
    assert_eq!(entry_header.mode, 0o100644);
    assert_eq!(entry_header.size, 7);
    let (_, data) = parse_entry_data(i, entry_header.size).expect("failed in parsing entry data");
    assert_eq!(data, "foobar\n".as_bytes());
}

Now that we can parse the entry header as well as data, let's parse the complete entry:

#[derive(Debug, PartialEq)]
struct Entry {
    header: EntryHeader,
    data: Vec<u8>,
}

use nom::combinator::cond;

// Consumes the single padding newline that follows odd-sized entry data.
fn parse_newline_padding(input: &[u8], size: u64) -> IResult<&[u8], Option<&[u8]>> {
    cond(size % 2 == 1, tag(b"\n"))(input)
}

fn parse_entry(input: &[u8]) -> IResult<&[u8], Entry> {
    let (i, header) = parse_entry_header(input)?;
    let (i, data) = parse_entry_data(i, header.size)?;
    let (i, _) = parse_newline_padding(i, header.size)?;
    let entry = Entry {
        header,
        data: data.to_vec(),
    };
    Ok((i, entry))
}

We need to consume the newline padding because, as the format description says:

If the file size is odd, the file’s contents are padded with a newline character to make the total length even
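The arithmetic behind this rule is small enough to spell out. The sketch below, with the made-up helper names padded_size and next_header_offset, computes where each header of the sample archive starts, assuming the 8-byte signature and 60-byte headers described earlier.

```rust
/// Number of bytes an entry's data occupies in the archive, padding included.
/// Hypothetical helper for illustration.
fn padded_size(size: u64) -> u64 {
    // Odd sizes are rounded up to the next even number.
    size + (size % 2)
}

/// Offset of the next entry header, given the current header's offset
/// and the size recorded in it.
fn next_header_offset(header_offset: u64, size: u64) -> u64 {
    header_offset + 60 + padded_size(size)
}

fn main() {
    // The first header follows the 8-byte "!<arch>\n" signature.
    let first = 8;
    let second = next_header_offset(first, 7); // foo.txt: 7 bytes of data + 1 pad byte
    let third = next_header_offset(second, 22); // bar.awesome.txt: 22 bytes, no padding
    let end = next_header_offset(third, 4); // baz.txt: 4 bytes, no padding
    println!("headers at {}, {}, {}; archive ends at {}", first, second, third, end);
}
```

Note that the size field itself never includes the pad byte, so the padding has to be derived from the size rather than read from the header.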

We can finish writing the parser now:

use nom::combinator::eof;
use nom::multi::many_till;

fn parser(input: &[u8]) -> IResult<&[u8], (Vec<Entry>, &[u8])> {
    let (i, _) = parse_file_signature(input)?;
    many_till(parse_entry, eof)(i)
}

#[test]
fn test_parser() {
    let input = "\
        !<arch>\n\
        foo.txt         1487552916  501   20    100644  7         `\n\
        foobar\n\n\
        bar.awesome.txt 1487552919  501   20    100644  22        `\n\
        This file is awesome!\n\
        baz.txt         1487552349  42    12345 100664  4         `\n\
        baz\n";
    let (i, (entries, empty)) =
        parser(input.as_bytes()).expect("failed in parsing the archive file");
    assert_eq!(i, b"");
    assert_eq!(empty, b"");

    assert_eq!(entries[0].header.identifier, b"foo.txt");
    assert_eq!(entries[0].header.mtime, 1487552916);
    assert_eq!(entries[0].header.uid, 501);
    assert_eq!(entries[0].header.gid, 20);
    assert_eq!(entries[0].header.mode, 0o100644);
    assert_eq!(entries[0].header.size, 7);
    assert_eq!(entries[0].data, "foobar\n".as_bytes());

    assert_eq!(entries[1].header.identifier, b"bar.awesome.txt");
    assert_eq!(entries[1].header.mtime, 1487552919);
    assert_eq!(entries[1].header.uid, 501);
    assert_eq!(entries[1].header.gid, 20);
    assert_eq!(entries[1].header.mode, 0o100644);
    assert_eq!(entries[1].header.size, 22);
    assert_eq!(entries[1].data, "This file is awesome!\n".as_bytes());

    assert_eq!(entries[2].header.identifier, b"baz.txt");
    assert_eq!(entries[2].header.mtime, 1487552349);
    assert_eq!(entries[2].header.uid, 42);
    assert_eq!(entries[2].header.gid, 12345);
    assert_eq!(entries[2].header.mode, 0o100664);
    assert_eq!(entries[2].header.size, 4);
    assert_eq!(entries[2].data, "baz\n".as_bytes());
}

You can find the complete code for this parser here.

Note

This is a living document and will be updated as parsers for other variants are written.

This work is licensed under CC BY-NC-SA 4.0.