I’ve been meaning to write about this for a year! Finally got around to building a font reader tool in Zig that creates minimal font files with only the characters my blog actually uses. Figured it’s time to document the whole process and share what I learned along the way.
Note
- I’m following the Microsoft OpenType specification as the reference for this series. I’ll do my best to write things as I understand them, but if you notice anything that seems off, don’t hesitate to let me know. Always happy to learn!
- All code snippets from this series will be collected into a complete, runnable project. I’ll open-source it once everything is polished up!
- Coding style is opinionated.
Before we Begin
There are some terms I need to introduce.
Typeface
Think of typefaces like music genres. Just as music has classical, jazz, pop, and rock - each with its own distinct feel and expression - typefaces have their own variations like Bold, Regular, Light, Italic, and more. Each typeface style conveys a different mood and serves a different purpose, much like how a jazz piece feels completely different from a rock song, even if they share the same melody.
Font
Following our music analogy, a font is like a specific album within a genre. If “Helvetica” is the genre, then “Helvetica Bold” or “Helvetica Light” are the individual albums. Technically, a font is a collection of glyphs with a specific style, weight, and size.
Character
A character is the abstract concept - like the letter “A” or the number “5”. It’s the idea of what something represents, regardless of how it looks. Think of it as the lyrics of a song - the meaning stays the same no matter who sings it.
Glyph
A glyph is the visual representation of a character in a specific font. The letter “A” might look completely different in Times New Roman versus Comic Sans - those are different glyphs representing the same character. It’s like how different singers perform the same song with their own unique style.
CodePoint
A codepoint is like a unique catalog number for each character in the Unicode system. Just as every song has a unique identifier in a music database (like a UPC code), every character has a specific number assigned to it. For example, the letter “A” is always U+0041, no matter what font you use. It’s the universal way computers identify which character you’re talking about.
Glyph Index
A glyph index is like a track number on a specific album. While the codepoint tells us which character we want (like saying “we want that Roselia song”), the glyph index tells us exactly where to find the visual representation of that character within a particular font file. Different fonts might store the same character at different positions, just like different albums might put the same song at different track numbers.
Main
OTF files use a container format called SFNT (scalable font), which hosts various font technologies, including PostScript outlines (OTF), TrueType outlines (TTF), and others.
According to the Microsoft data-type specification, all data in OpenType fonts is stored in big-endian byte order, so we need to parse the binary accordingly. Before we start parsing, we also need a simple utility function, readFile.
const std = @import("std");
const Allocator = std.mem.Allocator;
const fs = std.fs;
pub fn readFile(allocator: Allocator, input: []const u8) ![]u8 {
const file = fs.cwd().openFile(input, .{}) catch |err| switch (err) {
error.FileNotFound => {
std.log.err("File not found: {s}", .{input});
return err;
},
error.AccessDenied => {
std.log.err("Access denied: {s}", .{input});
return err;
},
else => return err,
};
defer file.close();
const file_size = try file.getEndPos();
const contents = try allocator.alloc(u8, file_size);
_ = try file.readAll(contents);
return contents;
}
Now we need to read big-endian values, so let’s define a small utility reader.
// Zig provides a powerful API to help us do these things quickly.
// All the read methods we need for the tables are defined here.
const std = @import("std");
pub const ByteReaderError = error{
BufferTooSmall,
InvalidOffset,
};
pub const ByteReader = struct {
const Self = @This();
buffer: []const u8,
offset: usize = 0,
pub fn init(buffer: []const u8) Self {
return Self{
.buffer = buffer,
};
}
pub fn seek_to(self: *Self, offset: usize) ByteReaderError!void {
if (offset >= self.buffer.len) {
return ByteReaderError.InvalidOffset;
}
self.offset = offset;
}
pub fn skip(self: *Self, bytes: usize) ByteReaderError!void {
if (self.offset + bytes > self.buffer.len) {
return ByteReaderError.BufferTooSmall;
}
self.offset += bytes;
}
pub fn read_i8(self: *Self) ByteReaderError!i8 {
const value = try self.read_u8();
return @bitCast(value);
}
pub fn read_u8(self: *Self) ByteReaderError!u8 {
if (self.offset + 1 > self.buffer.len) {
return ByteReaderError.BufferTooSmall;
}
const value = self.buffer[self.offset];
self.offset += 1;
return value;
}
pub fn read_u16_be(self: *Self) ByteReaderError!u16 {
if (self.offset + 2 > self.buffer.len) {
return ByteReaderError.BufferTooSmall;
}
const bytes = self.buffer[self.offset .. self.offset + 2];
self.offset += 2;
return std.mem.readInt(u16, bytes[0..2], .big);
}
pub fn read_u32_be(self: *Self) ByteReaderError!u32 {
if (self.offset + 4 > self.buffer.len) {
return ByteReaderError.BufferTooSmall;
}
const bytes = self.buffer[self.offset .. self.offset + 4];
self.offset += 4;
return std.mem.readInt(u32, bytes[0..4], .big);
}
pub fn read_i16_be(self: *Self) ByteReaderError!i16 {
const value = try self.read_u16_be();
return @bitCast(value);
}
pub fn read_i32_be(self: *Self) ByteReaderError!i32 {
const value = try self.read_u32_be();
return @bitCast(value);
}
pub fn read_i64_be(self: *Self) ByteReaderError!i64 {
if (self.offset + 8 > self.buffer.len) {
return ByteReaderError.BufferTooSmall;
}
const bytes = self.buffer[self.offset .. self.offset + 8];
self.offset += 8;
return std.mem.readInt(i64, bytes[0..8], .big);
}
pub fn read_bytes(self: *Self, len: usize) ByteReaderError![]const u8 {
if (self.offset + len > self.buffer.len) {
return ByteReaderError.BufferTooSmall;
}
const bytes = self.buffer[self.offset .. self.offset + len];
self.offset += len;
return bytes;
}
pub fn read_tag(self: *Self) ByteReaderError![4]u8 {
const bytes = try self.read_bytes(4);
return bytes[0..4].*;
}
pub fn remaining(self: Self) usize {
return self.buffer.len - self.offset;
}
pub fn current_offset(self: Self) usize {
return self.offset;
}
};
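As a quick sanity check, here’s how the reader behaves on a small hand-written buffer (a sketch; the byte values are made up for illustration):

```zig
const std = @import("std");

test "ByteReader reads big-endian values" {
    // 0x00 0x2A is 42 as a big-endian u16; 'c','m','a','p' is a table tag.
    const data = [_]u8{ 0x00, 0x2A, 'c', 'm', 'a', 'p' };
    var reader = ByteReader.init(&data);
    try std.testing.expectEqual(@as(u16, 42), try reader.read_u16_be());
    const tag = try reader.read_tag();
    try std.testing.expectEqualSlices(u8, "cmap", &tag);
    try std.testing.expectEqual(@as(usize, 0), reader.remaining());
}
```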
After preparing these tools, we can start writing the parser. (I’ll include the code in test blocks so we don’t need a main function or CLI tool.) Font files are composed of tables that contain different types of information. Each table has a 4-byte tag that can be interpreted as either a uint32 or four ASCII characters. These tags are not stored alongside the table data itself; instead, they live in a special table called the table directory, which appears at the beginning of the file and records the offset and length of every other table in the font. To keep things simple, we’ll only consider a linear parser.
According to the Table Directory documentation, we can define the following struct.
const std = @import("std");
const Allocator = std.mem.Allocator;
// ByteReader and readFile from the previous snippets are assumed to be in scope.
const ParserError = error{
InvalidInputBuffer,
};
pub const TableRecord = struct {
tag: [4]u8,
checksum: u32,
offset: u32,
length: u32,
};
pub const Parser = struct {
const Self = @This();
buffer: []const u8,
allocator: Allocator,
reader: ByteReader,
table_records: std.ArrayList(TableRecord),
pub fn init(allocator: Allocator, buffer: []const u8) !Self {
if (buffer.len == 0) {
return ParserError.InvalidInputBuffer;
}
return Self {
.allocator = allocator,
.buffer = buffer,
.reader = ByteReader.init(buffer),
.table_records = std.ArrayList(TableRecord).init(allocator),
};
}
pub fn parse(self: *Self) !void {
// The first step we should read the Table header
_ = try self.reader.read_u32_be();
const num_tables = try self.reader.read_u16_be();
_ = try self.reader.read_u16_be(); // search_range
_ = try self.reader.read_u16_be(); // entry_selector
_ = try self.reader.read_u16_be(); // range_shift
try self.table_records.ensureTotalCapacity(num_tables);
for (0..num_tables) |_| {
const tag_bytes = try self.reader.read_tag();
const tag = tag_bytes;
const checksum = try self.reader.read_u32_be();
const offset = try self.reader.read_u32_be();
const length = try self.reader.read_u32_be();
self.table_records.appendAssumeCapacity(TableRecord{
.tag = tag,
.checksum = checksum,
.offset = offset,
.length = length,
});
// add a print to show the table records
std.debug.print("Table: {s}, Offset: {d}, Length: {d}\n", .{
tag,
offset,
length,
});
}
}
pub fn deinit(self: *Self) void {
self.table_records.deinit();
}
};
test "Parser" {
const allocator = std.testing.allocator;
const file_path = "./lxgw.ttf";
const file_content = try readFile(allocator, file_path);
defer allocator.free(file_content);
var parser = try Parser.init(allocator, file_content);
defer parser.deinit();
try parser.parse();
}
Running the test produces the following output. Next, let’s proceed to extract the specific tables we require.
Table: DSIG, Offset: 7656, Length: 8
Table: GSUB, Offset: 7424, Length: 230
Table: OS/2, Offset: 5672, Length: 96
Table: cmap, Offset: 5768, Length: 124
Table: glyf, Offset: 204, Length: 4796
Table: head, Offset: 5216, Length: 54
Table: hhea, Offset: 5636, Length: 36
Table: hmtx, Offset: 5272, Length: 364
Table: loca, Offset: 5032, Length: 184
Table: maxp, Offset: 5000, Length: 32
Table: name, Offset: 5892, Length: 1091
Table: post, Offset: 6984, Length: 439
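By the way, the checksum field we stored but ignored can actually be verified. Per the OpenType spec, a table’s checksum is the wrapping sum of its contents read as big-endian u32 words, with the table zero-padded to a multiple of four bytes. A sketch (note: the head table is special - its checksumAdjustment field, bytes 8..12 of the table, must be treated as zero, which this sketch does not handle):

```zig
// A sketch of the OpenType table-checksum algorithm.
// `data` is the whole font file; `offset`/`length` come from a TableRecord.
fn calcTableChecksum(data: []const u8, offset: u32, length: u32) u32 {
    var sum: u32 = 0;
    var i: usize = offset;
    const end = offset + length;
    while (i < end) : (i += 4) {
        var word: u32 = 0;
        // Assemble a big-endian u32, zero-padding past the end of the table.
        var j: usize = 0;
        while (j < 4) : (j += 1) {
            const b: u32 = if (i + j < end) data[i + j] else 0;
            word = (word << 8) | b;
        }
        sum = sum +% word; // wrapping add, per the spec
    }
    return sum;
}
```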
According to the Required Tables specification, we need to parse several essential tables. However, given the length of this article, I’ll defer the detailed analysis of these tables to a later section. Remember the table tags we discussed earlier? For better code maintainability, we should create an enumeration to handle them more elegantly.
// https://learn.microsoft.com/en-us/typography/opentype/spec/otff#required-tables
pub const TableTag = enum(u32) {
// Required tables
cmap = std.mem.readInt(u32, "cmap", .big),
head = std.mem.readInt(u32, "head", .big),
hhea = std.mem.readInt(u32, "hhea", .big),
hmtx = std.mem.readInt(u32, "hmtx", .big),
maxp = std.mem.readInt(u32, "maxp", .big),
name = std.mem.readInt(u32, "name", .big),
os2 = std.mem.readInt(u32, "OS/2", .big),
post = std.mem.readInt(u32, "post", .big),
// unknown or unsupported tags fall through here
_,
pub inline fn from_bytes(bytes: [4]u8) TableTag {
const value = std.mem.readInt(u32, &bytes, .big);
return @enumFromInt(value);
}
pub inline fn to_str(self: TableTag) [4]u8 {
const value = @intFromEnum(self);
var result: [4]u8 = undefined;
std.mem.writeInt(u32, &result, value, .big);
return result;
}
pub inline fn is_required(self: TableTag) bool {
return switch (self) {
.cmap, .head, .hhea, .hmtx, .maxp, .name, .os2, .post => true,
else => false,
};
}
};
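A quick round-trip check of the enum helpers (a sketch):

```zig
test "TableTag round trip" {
    const tag = TableTag.from_bytes(.{ 'c', 'm', 'a', 'p' });
    try std.testing.expect(tag == .cmap);
    try std.testing.expect(tag.is_required());
    const s = tag.to_str();
    try std.testing.expectEqualSlices(u8, "cmap", &s);
    // Tags outside the enum land in the non-exhaustive `_` bucket.
    const dsig = TableTag.from_bytes(.{ 'D', 'S', 'I', 'G' });
    try std.testing.expect(!dsig.is_required());
}
```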
With the enum in place, update the parse loop to convert the raw tag bytes:
- const tag = tag_bytes;
+ const tag = TableTag.from_bytes(tag_bytes);
(The TableRecord.tag field then becomes a TableTag, and the debug print uses tag.to_str() instead of the raw bytes.)
Alright, that’s a wrap for today! We’ve got our basic table structure working, which is pretty cool.
I was thinking about diving into the required tables parsing next, but honestly, this post is getting long enough already. Plus, that topic deserves its own dedicated post anyway - there’s quite a bit to unpack there.
So I’ll save that for next time. Should be fun to explore!
Thanks for sticking around, and catch you in the next one! 👋