Build a Font Reader from Zig (Part 2)

2025-06-17

In the last post, we took our first steps into the fascinating world of fonts. We explored what fonts actually are under the hood - those mysterious binary files that somehow transform into beautiful text on our screens. We also built a simple font reader that could parse basic font structures and extract some fundamental information.

Today, we’ll extend our font reader to handle more complex font features and dive deeper into the inner workings of font files. Ready to get our hands dirty with some more advanced parsing? Let’s go!

Remember those required tables I mentioned in the previous post? Well, it’s time to roll up our sleeves and start implementing them!

CMAP - The Character Translation Engine

Before diving into the CMAP table implementation, let me paint a picture of what this table actually does. Think of CMAP as a massive “translation dictionary” inside your font file. When you type the character ‘A’ on your keyboard, your computer needs to figure out which specific glyph (visual shape) in the font corresponds to that ‘A’. The CMAP table is that crucial translator - it maps Unicode codepoints to glyph indices within the font.

It’s like having a phone book, but instead of mapping names to phone numbers, it maps characters to their visual representations!

Let’s start by updating our parser to handle table parsing:

--- parser.zig
+++ parser.zig
@@ +1,1 @@
for (0..num_tables) |_| {
  const tag_bytes = try self.reader.read_tag();
  const tag = tag_bytes;
  const checksum = try self.reader.read_u32_be();
  const offset = try self.reader.read_u32_be();
  const length = try self.reader.read_u32_be();
  self.table_records.appendAssumeCapacity(TableRecord{
      .tag = tag,
      .checksum = checksum,
      .offset = offset,
      .length = length,
  });
}
+ try self.parse_all_tables();

For this tutorial, we’ll keep things simple with a linear parsing approach. First, let’s create a structure to hold our parsed tables:


pub const ParsedTables = struct {
    cmap: ?table.cmap.Table = null,
    // We'll add more tables as we implement them

    pub fn deinit(self: *ParsedTables) void {
        if (self.cmap) |*cmap_table| cmap_table.deinit();
    }

    pub inline fn is_parsed(self: *const ParsedTables, tag: TableTag) bool {
        return switch (tag) {
            .cmap => self.cmap != null,
            else => false,
        };
    }
};

Now let’s extend our Parser with table parsing capabilities:

pub const Parser = struct {
    // ...existing code...
    parsed_tables: ParsedTables,
    
    fn parse_all_tables(self: *Self) !void {
        for (self.table_records.items) |record| {
            if (!self.parsed_tables.is_parsed(record.tag)) {
                try self.parse_single_table(record);
            }
        }
    }

    fn parse_single_table(self: *Self, record: TableRecord) !void {
        // Seek to the table's offset from the beginning of the font file
        try self.reader.seek_to(record.offset);

        switch (record.tag) {
            // Assumes the Parser keeps its allocator in a field named `allocator`
            .cmap => self.parsed_tables.cmap = try table.cmap.Table.parse(&self.reader, self.allocator),
            else => {
                // Other tables will be implemented later
            },
        }
    }
};

Understanding CMAP Formats

Here’s where things get interesting! CMAP tables support multiple encoding formats - think of them as different “translation methods.” For our educational journey, we’ll focus on the two most important ones:

  • Format 4: The classic workhorse for Basic Multilingual Plane (BMP) characters - this covers most of the text you see daily
  • Format 12: The modern approach that supports the full Unicode range, including those emoji and extended characters
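
To make the split between the two formats concrete, here is a minimal sketch of the rule behind it: Format 4 stores 16-bit character codes, so anything above U+FFFF can only live in a Format 12 subtable. The helper name `is_supplementary` is made up for this illustration and is not part of our parser.

```zig
const std = @import("std");

/// Hypothetical helper: true when a codepoint lies outside the Basic
/// Multilingual Plane and therefore cannot be stored in a Format 4 subtable.
fn is_supplementary(codepoint: u21) bool {
    return codepoint > 0xFFFF;
}

test "BMP versus supplementary codepoints" {
    try std.testing.expect(!is_supplementary('A')); // U+0041, fits in Format 4
    try std.testing.expect(!is_supplementary(0xFFFF)); // last BMP codepoint
    try std.testing.expect(is_supplementary(0x1F680)); // rocket emoji, Format 12 only
}
```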

Let’s implement the foundation of our CMAP table:

// table/cmap.zig
const std = @import("std");
const Allocator = std.mem.Allocator;
const Reader = @import("../reader.zig").Reader;


pub const SubTableFormat = enum(u16) {
    format4 = 4,
    format12 = 12,
};

// Placeholders for now; both structs are fleshed out later in this post.
pub const Format4 = struct {};

pub const Format12 = struct {};

pub const FormatData = union(SubTableFormat) {
    format4: Format4,
    format12: Format12,
};

pub const Table = struct {
    version: u16,
    num_tables: u16,
    encoding_records: []EncodingRecord,
    subtables: std.ArrayList(FormatData),
    allocator: std.mem.Allocator,

    const Self = @This();

    pub const EncodingRecord = struct {
        platform_id: u16,
        encoding_id: u16,
        subtable_offset: u32,
    };

    pub fn parse(reader: *Reader, allocator: std.mem.Allocator) !Self {
        // Subtable offsets are relative to the start of the cmap table,
        // so remember where the table begins before reading anything.
        const table_start_offset = reader.current_offset();
        const version = try reader.read_u16_be();
        const num_tables = try reader.read_u16_be();

        // Parse encoding records first
        const encoding_records = try allocator.alloc(EncodingRecord, num_tables);
        errdefer allocator.free(encoding_records);
        for (0..num_tables) |i| {
            encoding_records[i] = EncodingRecord{
                .platform_id = try reader.read_u16_be(),
                .encoding_id = try reader.read_u16_be(),
                .subtable_offset = try reader.read_u32_be(),
            };
        }

        var subtables = std.ArrayList(FormatData).init(allocator);
        errdefer {
            for (subtables.items) |*subtable| {
                switch (subtable.*) {
                    .format4 => |*f4| f4.deinit(),
                    .format12 => |*f12| f12.deinit(),
                }
            }
            subtables.deinit();
        }

        var parsed_offsets = std.ArrayList(u32).init(allocator);
        defer parsed_offsets.deinit();
        // Several encoding records can point at the same subtable: fonts share
        // one subtable across platforms to save space. For example:
        // EncodingRecord 1: platform=3, encoding=1,  offset=100  (Windows Unicode BMP)
        // EncodingRecord 2: platform=0, encoding=3,  offset=100  (Unicode BMP)
        // EncodingRecord 3: platform=3, encoding=10, offset=200  (Windows Unicode UCS-4)
        // So we parse each distinct offset only once.
        for (encoding_records) |record| {
            var already_parsed = false;
            for (parsed_offsets.items) |offset| {
                if (offset == record.subtable_offset) {
                    already_parsed = true;
                    break;
                }
            }
            if (!already_parsed) {
                try parsed_offsets.append(record.subtable_offset);
                try reader.seek_to(table_start_offset + record.subtable_offset);
                const subtable = parse_subtable(allocator, reader) catch |err| switch (err) {
                    // Skip formats we don't implement instead of failing the whole table
                    error.UnsupportedFormat => continue,
                    else => return err,
                };
                try subtables.append(subtable);
            }
        }

        return Self{
            .version = version,
            .num_tables = num_tables,
            .encoding_records = encoding_records,
            .subtables = subtables,
            .allocator = allocator,
        };
    }

    pub fn get_glyph_index(self: *const Self, codepoint: u32) ?u16 {
        // Try format 12 first (it supports the full Unicode range)
        for (self.subtables.items) |subtable| {
            switch (subtable) {
                .format12 => |f12| {
                    if (f12.get_glyph_id(codepoint)) |glyph_id| {
                        return glyph_id;
                    }
                },
                else => {},
            }
        }
        
        // Fall back to format 4 for BMP characters
        if (codepoint <= 0xFFFF) {
            for (self.subtables.items) |subtable| {
                switch (subtable) {
                    .format4 => |f4| {
                        const glyph_id = f4.get_glyph_id(@intCast(codepoint));
                        if (glyph_id != 0) {
                            return glyph_id;
                        }
                    },
                    else => {},
                }
            }
        }
        
        return null; // Character not found in this font
    }

    pub fn deinit(self: *Self) void {
        self.allocator.free(self.encoding_records);
        for (self.subtables.items) |*subtable| {
            // Both variants are handled, so no `else` prong is needed (or allowed)
            switch (subtable.*) {
                .format4 => |*f4| f4.deinit(),
                .format12 => |*f12| f12.deinit(),
            }
        }
        self.subtables.deinit();
    }

    fn parse_subtable(
        allocator: std.mem.Allocator,
        reader: *Reader,
    ) !FormatData {
        // Peek at the format field, then rewind so each format's init
        // can read its own header from the start.
        const format = try reader.read_u16_be();
        try reader.seek_to(reader.current_offset() - 2);

        return switch (format) {
            4 => FormatData{ .format4 = try Format4.init(allocator, reader) },
            12 => FormatData{ .format12 = try Format12.init(allocator, reader) },
            else => return error.UnsupportedFormat,
        };
    }
};
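
To see what `parse` is actually walking over, here is a small sketch that decodes a hand-written cmap header straight from bytes. The byte values and the helpers `be16` and `be32` are invented for this illustration (they stand in for our Reader's `read_u16_be` and `read_u32_be`); the point is that the subtable offset counts from the start of the cmap table, not from the start of the file.

```zig
const std = @import("std");

// Hypothetical stand-ins for the Reader's big-endian helpers.
fn be16(bytes: []const u8, at: usize) u16 {
    return (@as(u16, bytes[at]) << 8) | bytes[at + 1];
}

fn be32(bytes: []const u8, at: usize) u32 {
    return (@as(u32, be16(bytes, at)) << 16) | be16(bytes, at + 2);
}

// A made-up, minimal cmap: header, one encoding record, and the first
// field of the subtable that record points to.
const cmap_bytes = [_]u8{
    0x00, 0x00, // version = 0
    0x00, 0x01, // numTables = 1
    0x00, 0x03, // platformID = 3 (Windows)
    0x00, 0x01, // encodingID = 1 (Unicode BMP)
    0x00, 0x00, 0x00, 0x0C, // subtable offset = 12, from the start of cmap
    0x00, 0x04, // subtable begins here: format = 4
};

test "subtable offset is relative to the cmap table start" {
    try std.testing.expectEqual(@as(u16, 1), be16(&cmap_bytes, 2));
    const subtable_offset = be32(&cmap_bytes, 8);
    try std.testing.expectEqual(@as(u32, 12), subtable_offset);
    // Jumping to that offset lands on the subtable's format field.
    try std.testing.expectEqual(@as(u16, 4), be16(&cmap_bytes, subtable_offset));
}
```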

Format 4 - The BMP Workhorse

Format 4 is like a clever indexing system. Instead of storing every single character mapping, it uses segments (ranges) to efficiently map character codes to glyph indices. Think of it as having multiple “chapters” in our translation dictionary, where each chapter covers a range of characters.

pub const Format4 = struct {
    const Self = @This();
    allocator: Allocator,
    format: u16,
    length: u16,
    language: u16,
    seg_count_x2: u16,
    search_range: u16,
    entry_selector: u16,
    range_shift: u16,
    end_code: []u16,
    reserved_pad: u16,
    start_code: []u16,
    id_delta: []i16,
    id_range_offset: []u16,
    glyph_id_array: []u16,

    pub fn init(allocator: Allocator, reader: *Reader) !Self {
        const start_offset = reader.current_offset();
        const format = try reader.read_u16_be();
        const length = try reader.read_u16_be();
        const language = try reader.read_u16_be();

        const seg_count_x2 = try reader.read_u16_be();
        const seg_count = seg_count_x2 / 2;
        const search_range = try reader.read_u16_be();
        const entry_selector = try reader.read_u16_be();
        const range_shift = try reader.read_u16_be();

        // These slices are filled element by element but never reassigned,
        // so they must be `const` (Zig rejects a `var` that is never mutated).
        const end_code = try allocator.alloc(u16, seg_count);
        errdefer allocator.free(end_code);
        for (0..seg_count) |i| {
            end_code[i] = try reader.read_u16_be();
        }

        const reserved_pad = try reader.read_u16_be();

        const start_code = try allocator.alloc(u16, seg_count);
        errdefer allocator.free(start_code);
        for (0..seg_count) |i| {
            start_code[i] = try reader.read_u16_be();
        }

        const id_delta = try allocator.alloc(i16, seg_count);
        errdefer allocator.free(id_delta);
        for (0..seg_count) |i| {
            id_delta[i] = try reader.read_i16_be();
        }

        const id_range_offset = try allocator.alloc(u16, seg_count);
        errdefer allocator.free(id_range_offset);
        for (0..seg_count) |i| {
            id_range_offset[i] = try reader.read_u16_be();
        }

        const remaining_bytes = length - @as(u16, @intCast(reader.current_offset() - start_offset));
        const glyph_id_array_size = remaining_bytes / 2;

        var glyph_id_array: []u16 = undefined;
        if (glyph_id_array_size > 0) {
            glyph_id_array = try allocator.alloc(u16, glyph_id_array_size);
            errdefer allocator.free(glyph_id_array);
            for (0..glyph_id_array_size) |i| {
                glyph_id_array[i] = try reader.read_u16_be();
            }
        } else {
            glyph_id_array = &[_]u16{};
        }

        return Self{
            .allocator = allocator,
            .format = format,
            .length = length,
            .language = language,
            .seg_count_x2 = seg_count_x2,
            .search_range = search_range,
            .entry_selector = entry_selector,
            .range_shift = range_shift,
            .end_code = end_code,
            .reserved_pad = reserved_pad,
            .start_code = start_code,
            .id_delta = id_delta,
            .id_range_offset = id_range_offset,
            .glyph_id_array = glyph_id_array,
        };
    }

    pub fn get_glyph_id(self: *const Self, char_code: u16) u16 {
        const seg_count: usize = self.seg_count_x2 / 2;

        // Segments are sorted by end_code, so the first segment whose
        // end_code is >= char_code is the only candidate.
        for (0..seg_count) |i| {
            if (char_code > self.end_code[i]) continue;
            if (char_code < self.start_code[i]) break; // falls in a gap between segments

            // We found our segment! Now decode the glyph ID
            const delta: u16 = @bitCast(self.id_delta[i]);
            if (self.id_range_offset[i] == 0) {
                // Simple case: just add the delta, modulo 65536
                return char_code +% delta;
            }

            // Complex case: look up in the glyph_id_array.
            // id_range_offset is a byte offset measured from the position of
            // id_range_offset[i] itself in the file. The remaining
            // (seg_count - i) entries of id_range_offset sit between that
            // position and the start of glyph_id_array, so subtract them.
            const pos = @as(usize, self.id_range_offset[i]) / 2 + (char_code - self.start_code[i]);
            const skip = seg_count - i;
            if (pos < skip) return 0;
            const index = pos - skip;
            if (index >= self.glyph_id_array.len) return 0;

            const glyph_id = self.glyph_id_array[index];
            // A stored 0 means "missing glyph"; the delta is not applied to it
            return if (glyph_id == 0) 0 else glyph_id +% delta;
        }
        return 0; // Character not found
    }

    pub fn deinit(self: *Self) void {
        self.allocator.free(self.end_code);
        self.allocator.free(self.start_code);
        self.allocator.free(self.id_delta);
        self.allocator.free(self.id_range_offset);
        if (self.glyph_id_array.len > 0) {
            self.allocator.free(self.glyph_id_array);
        }
    }
};
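
The two lookup paths in Format 4 are easier to trust with numbers plugged in. Below is a minimal sketch of both calculations as free functions; the names `glyph_from_delta` and `glyph_array_index` and the sample values are made up for illustration. The index formula follows the OpenType description, where idRangeOffset is measured from the location of the idRangeOffset entry itself.

```zig
const std = @import("std");

/// idRangeOffset == 0: the glyph is char_code + idDelta, modulo 65536,
/// so a negative delta simply wraps around.
fn glyph_from_delta(char_code: u16, id_delta: i16) u16 {
    return char_code +% @as(u16, @bitCast(id_delta));
}

/// idRangeOffset != 0: compute the index into glyph_id_array for segment i.
/// Returns null when the offset points before the start of the array.
fn glyph_array_index(
    seg_count: usize,
    i: usize,
    id_range_offset: u16,
    start_code: u16,
    char_code: u16,
) ?usize {
    const pos = @as(usize, id_range_offset) / 2 + (char_code - start_code);
    const skip = seg_count - i; // id_range_offset entries from i to the end
    if (pos < skip) return null;
    return pos - skip;
}

test "format 4 arithmetic" {
    // Segment 'A'..'Z' with idDelta = -29: 'A' (65) maps to glyph 36.
    try std.testing.expectEqual(@as(u16, 36), glyph_from_delta('A', -29));
    // A final 0xFFFF segment with idDelta = 1 wraps to glyph 0.
    try std.testing.expectEqual(@as(u16, 0), glyph_from_delta(0xFFFF, 1));
    // 4 segments, segment 1, idRangeOffset = 6 bytes: three u16 slots past
    // id_range_offset[1] is exactly glyph_id_array[0].
    try std.testing.expectEqual(@as(usize, 0), glyph_array_index(4, 1, 6, 0x20, 0x20).?);
    try std.testing.expectEqual(@as(usize, 2), glyph_array_index(4, 1, 6, 0x20, 0x22).?);
}
```

Using wrapping addition (`+%`) keeps the modulo-65536 rule in a single operator instead of widening to a larger integer and masking.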

Format 12 - The Unicode Champion

Format 12 is much simpler in concept - it’s basically a sorted list of character ranges with their corresponding glyph mappings. It’s like having a well-organized index that can handle any Unicode character you throw at it.

pub const Format12 = struct {
    const Self = @This();
    allocator: Allocator,
    format: u16,
    reserved: u16,
    length: u32,
    language: u32,
    groups: []SequentialMapGroup,

    pub const SequentialMapGroup = struct {
        start_char_code: u32,
        end_char_code: u32,
        start_glyph_id: u32,
    };

    pub fn init(allocator: Allocator, reader: *Reader) !Self {
        const format = try reader.read_u16_be();
        const reserved = try reader.read_u16_be();
        const length = try reader.read_u32_be();
        const language = try reader.read_u32_be();
        const num_groups = try reader.read_u32_be();

        const groups = try allocator.alloc(SequentialMapGroup, num_groups);
        errdefer allocator.free(groups);

        for (0..num_groups) |i| {
            groups[i] = SequentialMapGroup{
                .start_char_code = try reader.read_u32_be(),
                .end_char_code = try reader.read_u32_be(),
                .start_glyph_id = try reader.read_u32_be(),
            };
        }

        return Self{
            .allocator = allocator,
            .format = format,
            .reserved = reserved,
            .length = length,
            .language = language,
            .groups = groups,
        };
    }

    pub fn get_glyph_id(self: *const Self, char_code: u32) ?u16 {
        // Binary search through the groups for efficiency
        var left: usize = 0;
        var right: usize = self.groups.len;

        while (left < right) {
            const mid = left + (right - left) / 2;
            const group = self.groups[mid];

            if (char_code < group.start_char_code) {
                right = mid;
            } else if (char_code > group.end_char_code) {
                left = mid + 1;
            } else {
                // Found our group! Calculate the glyph ID
                const offset = char_code - group.start_char_code;
                const glyph_id = group.start_glyph_id + offset;
                return if (glyph_id <= 0xFFFF) @intCast(glyph_id) else null;
            }
        }

        return null; // Character not found
    }

    pub fn deinit(self: *Self) void {
        self.allocator.free(self.groups);
    }
};
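
Here is the same binary search run against a tiny, hand-made group list, as a rough sketch of how a lookup plays out. The `Group` struct, the `lookup` function, and the glyph numbers are invented sample data, not values from a real font.

```zig
const std = @import("std");

const Group = struct { start: u32, end: u32, start_glyph: u32 };

// Made-up sample data, sorted by start code as Format 12 requires.
const groups = [_]Group{
    .{ .start = 0x41, .end = 0x5A, .start_glyph = 36 },
    .{ .start = 0x1F600, .end = 0x1F64F, .start_glyph = 900 },
    .{ .start = 0x1F680, .end = 0x1F6C5, .start_glyph = 980 },
};

/// Binary search over sorted, non-overlapping ranges.
fn lookup(list: []const Group, cp: u32) ?u32 {
    var lo: usize = 0;
    var hi: usize = list.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const g = list[mid];
        if (cp < g.start) {
            hi = mid;
        } else if (cp > g.end) {
            lo = mid + 1;
        } else {
            // Glyph IDs inside a group are consecutive.
            return g.start_glyph + (cp - g.start);
        }
    }
    return null;
}

test "format 12 style lookup" {
    try std.testing.expectEqual(@as(?u32, 36), lookup(&groups, 'A'));
    try std.testing.expectEqual(@as(?u32, 982), lookup(&groups, 0x1F682));
    // U+1F650 sits in the gap between the second and third group.
    try std.testing.expect(lookup(&groups, 0x1F650) == null);
}
```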

Putting It All Together

Now that we have both Format 4 and Format 12 implementations, our CMAP table can handle virtually any character you throw at it! The beauty of this design is that it automatically falls back from the more comprehensive Format 12 to Format 4 when needed, ensuring maximum compatibility.

Here’s a quick example of how you’d use it:

// Get the glyph index for the letter 'A'
if (font.parsed_tables.cmap) |cmap| {
    if (cmap.get_glyph_index('A')) |glyph_index| {
        std.debug.print("Character 'A' maps to glyph {}\n", .{glyph_index});
    }
}

// Get the glyph index for an emoji 🚀
if (font.parsed_tables.cmap) |cmap| {
    if (cmap.get_glyph_index(0x1F680)) |glyph_index| {
        std.debug.print("Rocket emoji maps to glyph {}\n", .{glyph_index});
    }
}
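
In practice you rarely look up one character at a time. As a rough sketch, here is how a whole UTF-8 string could be turned into glyph indices with `std.unicode.Utf8View`. The function `fake_glyph_index` is a stand-in with made-up values for `cmap.get_glyph_index`, and `map_text` is a hypothetical helper, so the block runs without a font file.

```zig
const std = @import("std");

/// Stand-in for `cmap.get_glyph_index`; the glyph numbers are invented.
fn fake_glyph_index(codepoint: u32) ?u16 {
    return switch (codepoint) {
        'H' => 43,
        'i' => 76,
        0x1F680 => 1200,
        else => null,
    };
}

/// Decodes `text` as UTF-8 and writes one glyph index per codepoint into
/// `out`, using glyph 0 (the missing glyph) for unmapped characters.
/// Returns how many entries were written.
fn map_text(text: []const u8, out: []u16) !usize {
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |cp| {
        if (n == out.len) return error.BufferTooSmall;
        out[n] = fake_glyph_index(cp) orelse 0;
        n += 1;
    }
    return n;
}

test "map a short string to glyph indices" {
    var buf: [8]u16 = undefined;
    const n = try map_text("Hi\u{1F680}?", &buf);
    // '?' has no mapping in the stand-in table, so it becomes glyph 0.
    try std.testing.expectEqualSlices(u16, &[_]u16{ 43, 76, 1200, 0 }, buf[0..n]);
}
```

One glyph per codepoint is a simplification: real text layout also has to deal with ligatures and combining marks, which live in other tables.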

Pretty awesome, right? We’ve just built a character-to-glyph mapping system that covers the two most widely used CMAP formats, handling everything from basic ASCII to emoji!

In our next post, we’ll dive into the GLYF table to actually extract and render those glyphs we’re now able to find. Stay tuned - we’re about to bring these fonts to life! 🎨

CC BY-NC-SA 4.0  2024-PRESENT © Kanno