I have spent an unhealthy amount of time thinking about ID systems and encodings, both as a user and a developer. In this post I’m going to be talking about cford32, a Go implementation of the base32 encoding scheme specified by Douglas Crockford, and the way it encodes uint64 values.
Think of the simplest ID system: a sequential ID, like an AUTO_INCREMENT field in a SQL database, or any other counter. It is obvious to a human eye which item is “older”. On the other hand, with most compact encodings, a typo as simple as confusing 0 with O leads to a different ID.
I wanted to devise a system that takes the best of all of the above and improves on them, so that it has all of the following properties:
- lexicographically sortable, so that greater values always sort after smaller ones;
- reasonably compact, even for large values;
- tolerant of common typos, such as writing O instead of 0, or i I l L instead of 1.
With the “compact encoding” of the package cford32, IDs look like this:
0000001: 1
000019s: 1337
00cr7sk: 13377331
cenkr6w: 13377331420
g00016xvhqnhn: 1337733142069 # switch to "expanded encoding"
These IDs are lexicographically sortable and, as you can see, reasonably compact even for “large” decimal values (the compact encoding is 7 bytes long and can hold values below 2^34, ~17 billion, which written in decimal would take 11 bytes).
Past 2^34, the encoding automatically switches over to the “expanded encoding”, which is 13 bytes long and can represent the full range of uint64 values. The expanded encoding maintains the correct lexicographic ordering relative to the compact encoding (ie. g00016xvhqnhn > cenkr6w, just like their underlying numerical values).
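For reference, here is a quick sketch that reproduces the listing above using PutCompact, the same function used in the sonyflake example further down:

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    // PutCompact returns the 7-character compact form for values below
    // 2^34, and the 13-character expanded form for everything else.
    for _, id := range []uint64{1, 1337, 13377331, 13377331420, 1337733142069} {
        fmt.Printf("%s: %d\n", cford32.PutCompact(id), id)
    }
}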
Furthermore, under the hood both encodings use the base32 encoding scheme specified by Douglas Crockford, which is crafted to provide the typo correction mentioned in the desired properties. This means that i I l L 1 all decode to the value 1, and likewise o O 0 all decode to the value 0.
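To see the typo correction in action, here is a small sketch. Note that cford32.Uint64 is my assumption for the name of the decoding function whose body is shown later in this post; check the package documentation for the exact exported name.

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    // All of these are typo'd spellings of "000019s" (= 1337): o/O decode
    // as 0 and i/I/l/L decode as 1, so every line should print the same value.
    // NOTE: cford32.Uint64 is an assumed name for the package's decoder.
    for _, s := range []string{"000019s", "0000l9s", "0000I9S", "0o0019s"} {
        n, err := cford32.Uint64([]byte(s))
        fmt.Println(s, "->", n, err)
    }
}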
I devised this encoding both because of my interest in ID systems and because gno.land needed an ID-generation system that was lexicographically sortable, for use as keys in AVL trees (a kind of self-balancing binary search tree). This is why seqid, a simple package to generate sequential IDs, uses this encoding; sequential ID generation on a blockchain is unlikely to exceed 17 billion values for most contracts, but in case it does, there’s a smooth transition.
The implementation of the two encodings builds on the fact that 13 bytes of a base32 string encode 5*13 = 65 bits of information; in other words, a “simple” encoding of a big-endian 64-bit integer would waste 1 bit of information.
Crockford’s base32 scheme is already lexicographically sorted, so no adjustments need to be made in that sense.
Given this starting information, for encoding the uint64 values, the package applies a simple algorithm:
const (
    encTableLower = "0123456789abcdefghjkmnpqrstvwxyz"
    maxCompact    = 1 << 34
    mask          = 31
)

if id < maxCompact {
    return append(b,
        encTableLower[id>>30&mask],
        encTableLower[id>>25&mask],
        // ...
        encTableLower[id&mask],
    )
}
return append(b,
    encTableLower[id>>60&mask|0x10],
    encTableLower[id>>55&mask],
    // ...
    encTableLower[id>>5&mask],
    encTableLower[id&mask],
)
In other words: values below 2^34 are encoded with 7 characters, whose first character can only fall in the range 0 through f; larger values are encoded with 13 characters, whose first character always has its high bit set (the | 0x10), whereas in the “trivial” encoding (ie. b32Encode(encodeBigEndian(n))) it would always be set to 0, as it is the “wasted” bit we mentioned above. Taking advantage of this bit allows us to have IDs which seamlessly switch from compact to expanded encoding, while maintaining the lexicographic ordering.
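The switch is easiest to see at the boundary: 2^34 - 1 is the largest value that gets the compact form, and 2^34 the smallest that gets the expanded one, yet the two strings still compare in the right order. A quick sketch, with the outputs derived from the algorithm above:

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    last := string(cford32.PutCompact(1<<34 - 1)) // "fzzzzzz"
    first := string(cford32.PutCompact(1 << 34))  // "g00000g000000"
    // The compact form tops out at 'f...', while every expanded form
    // starts at 'g' or later, so lexicographic order is preserved.
    fmt.Println(last < first) // true
}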
Conversely, the decoding function checks the length and the range of the first byte; then, for the expanded encoding, it disregards the high bit of the first character.
switch {
default:
    return 0, CorruptInputError(0)
case len(b) == 7 && b[0] >= '0' && b[0] <= 'f':
    decVals := [7]byte{
        decTable[b[0]],
        // ...
        decTable[b[6]],
    }
    for idx, v := range decVals {
        if v >= 32 {
            return 0, CorruptInputError(idx)
        }
    }
    return 0 +
        uint64(decVals[0])<<30 |
        // ...
        uint64(decVals[5])<<5 |
        uint64(decVals[6]), nil
case len(b) == 13 && b[0] >= 'g' && b[0] <= 'z':
    decVals := [13]byte{
        decTable[b[0]] & 0x0F, // disregard high bit
        decTable[b[1]],
        // ...
        decTable[b[12]],
    }
    for idx, v := range decVals {
        if v >= 32 {
            return 0, CorruptInputError(idx)
        }
    }
    return 0 +
        uint64(decVals[0])<<60 |
        // ...
        uint64(decVals[11])<<5 |
        uint64(decVals[12]), nil
}
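Putting the two halves together, a round trip looks roughly like this (again, cford32.Uint64 is my assumption for the decoder’s exported name):

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    // Encode a value past 2^34 and decode it back.
    enc := cford32.PutCompact(1337733142069)
    dec, err := cford32.Uint64(enc)    // assumed name, see note above
    fmt.Println(string(enc), dec, err) // g00016xvhqnhn 1337733142069 <nil>
}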
The cutoff between the compact and expanded encodings was arbitrarily picked at 34 bits, but it could have been placed at any other (n*5)-1 number of bits (like 24, 29, 34, 39…). I think 17 billion is a reasonable discriminant between identifiers used by applications at a small / medium scale, and those used at very large scales.
Package cford32 is a fork of Go’s standard library encoding/base32, and as such also matches its API for encoding and decoding arbitrary byte slices. This can be useful for encoding arbitrary data, or content-addressable IDs: for instance, diffy uses 40 bits of the sha256 sum to encode identifiers for its uploaded files.
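As a sketch of the content-addressable idea, here is roughly what deriving a diffy-style identifier looks like. I’m using the standard library’s encoding/base32 with the Crockford alphabet, since cford32’s byte-slice function names aren’t shown in this post; going through cford32’s equivalent API would additionally give the typo tolerance described above.

package main

import (
    "crypto/sha256"
    "encoding/base32"
    "fmt"
)

// Crockford's lowercase alphabet, matching encTableLower above.
var crockford = base32.NewEncoding("0123456789abcdefghjkmnpqrstvwxyz").
    WithPadding(base32.NoPadding)

func main() {
    // 40 bits (5 bytes) of the SHA-256 sum encode to exactly 8 characters,
    // with no padding needed.
    sum := sha256.Sum256([]byte("contents of the uploaded file"))
    fmt.Println(crockford.EncodeToString(sum[:5]))
}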
There is currently no associated decentralized ID-generation scheme; using sonyflake, the values are very large and thus always end up in the expanded encoding:
package main

import (
    "fmt"
    "github.com/sony/sonyflake"
    "github.com/thehowl/cford32"
)

func main() {
    sf, err := sonyflake.New(sonyflake.Settings{
        MachineID: func() (uint16, error) { return 0, nil },
    })
    if err != nil {
        panic(err)
    }
    ni, err := sf.NextID()
    if err != nil {
        panic(err)
    }
    b := cford32.PutCompact(ni)
    fmt.Println(string(b))
}
One ID generated with this setup is the following: gffqp6ej00000. My advice, though, is that most applications are unlikely to reach the scale where decentralized ID generation is truly necessary; sequential IDs with the compact encoding are likely to be better for you and your users.
In case you do reach that level of scale, switching over is trivial, as the new IDs are most definitely larger than the largest of your sequentially generated IDs.
The default base64 URL-safe encoding has the following order:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
.
It can be made lexicographically sortable by moving the numbers, dash, and underscore to the order they appear in ASCII:
-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
. ↩︎