I have spent an unhealthy amount of time thinking about ID systems and encodings, both as a user and a developer. In this post I’m going to be talking about cford32, a Go implementation of the base32 encoding scheme specified by Douglas Crockford, and the way it encodes uint64 values.
Think of the simplest ID system: a sequential ID, like an AUTO_INCREMENT field in a SQL database, or any other counter. It is obvious to a human eye which item is “older”. On the other hand, with most compact encodings, a typo as simple as confusing 0 with O leads to a different ID.
I wanted to devise a system that takes the best of all of the above and improves on them, so that it has all of the following properties:
- lexicographically sortable, so that greater values always sort after smaller ones;
- reasonably compact, even for large values;
- tolerant of common typos, such as writing O instead of 0, or i I l L instead of 1.
With the “compact encoding” of the package cford32, IDs look like this:
0000001: 1
000019s: 1337
00cr7sk: 13377331
cenkr6w: 13377331420
g00016xvhqnhn: 1337733142069 # switch to "expanded encoding"
These IDs are lexicographically sortable and, as you can see, reasonably compact even for “large” decimal values (the compact encoding is 7 bytes long and can hold values below 2^34, ~17 billion, which written in decimal would take 11 bytes).
Past 2^34, the encoding automatically switches over to the “expanded encoding”, which is 13 bytes long and can represent the full range of uint64 values. The expanded encoding maintains the correct lexicographic ordering relative to the compact encoding (ie. g00016xvhqnhn > cenkr6w, just like their underlying numerical values).
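For reference, here is a quick sketch that reproduces the listing above using PutCompact, the same function used in the sonyflake example further down:

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    // PutCompact returns the 7-character compact form for values below
    // 2^34, and the 13-character expanded form for everything else.
    for _, id := range []uint64{1, 1337, 13377331, 13377331420, 1337733142069} {
        fmt.Printf("%s: %d\n", cford32.PutCompact(id), id)
    }
}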
Furthermore, under the hood both encodings use the base32 encoding scheme specified by Douglas Crockford, which is crafted to provide the typo correction mentioned in the desired properties. This means that i I l L 1 all decode to the value 1, and likewise o O 0 all decode to the value 0.
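To see the typo correction in action, here is a small sketch. Note that cford32.Uint64 is my assumption for the name of the decoding function whose body is shown later in this post; check the package documentation for the exact exported name.

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    // All of these are typo'd spellings of "000019s" (= 1337): o/O decode
    // as 0 and i/I/l/L decode as 1, so every line should print the same value.
    // NOTE: cford32.Uint64 is an assumed name for the package's decoder.
    for _, s := range []string{"000019s", "0000l9s", "0000I9S", "0o0019s"} {
        n, err := cford32.Uint64([]byte(s))
        fmt.Println(s, "->", n, err)
    }
}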
I devised this encoding both because of my interest in ID systems and because gno.land needed an ID-generation system that was lexicographically sortable, for use as keys in AVL trees (a kind of self-balancing binary search tree). This is why seqid, a simple package to generate sequential IDs, uses this encoding; sequential ID generation on a blockchain is unlikely to exceed 17 billion values for most contracts, but in case it does, there’s a smooth transition.
The implementation of the two encodings builds on the fact that 13 bytes of a base32 string encode 5*13 = 65 bits of information; in other words, a “simple” encoding of a big-endian 64-bit integer would waste 1 bit of information.
Crockford’s base32 scheme is already lexicographically sorted, so no adjustments need to be made in that sense.
Given this starting information, for encoding the uint64 values, the package applies a simple algorithm:
const (
    encTableLower = "0123456789abcdefghjkmnpqrstvwxyz"
    maxCompact    = 1 << 34
    mask          = 31
)

if id < maxCompact {
    return append(b,
        encTableLower[id>>30&mask],
        encTableLower[id>>25&mask],
        // ...
        encTableLower[id&mask],
    )
}
return append(b,
    encTableLower[id>>60&mask|0x10],
    encTableLower[id>>55&mask],
    // ...
    encTableLower[id>>5&mask],
    encTableLower[id&mask],
)
In other words: values below 2^34 are encoded with 7 characters, whose first character can only fall in the range 0 through f; larger values are encoded with 13 characters, whose first character always has its high bit set (the | 0x10), whereas in the “trivial” encoding (ie. b32Encode(encodeBigEndian(n))) it would always be set to 0, as it is the “wasted” bit we mentioned above. Taking advantage of this bit allows us to have IDs which seamlessly switch from compact to expanded encoding, while maintaining the lexicographic ordering.
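The switch is easiest to see at the boundary: 2^34 - 1 is the largest value that gets the compact form, and 2^34 the smallest that gets the expanded one, yet the two strings still compare in the right order. A quick sketch, with the outputs derived from the algorithm above:

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    last := string(cford32.PutCompact(1<<34 - 1)) // "fzzzzzz"
    first := string(cford32.PutCompact(1 << 34))  // "g00000g000000"
    // The compact form tops out at 'f...', while every expanded form
    // starts at 'g' or later, so lexicographic order is preserved.
    fmt.Println(last < first) // true
}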
Conversely, the decoding function checks the length and the range of the first byte; then, for the expanded encoding, it disregards the high bit of the first character.
switch {
default:
    return 0, CorruptInputError(0)
case len(b) == 7 && b[0] >= '0' && b[0] <= 'f':
    decVals := [7]byte{
        decTable[b[0]],
        // ...
        decTable[b[6]],
    }
    for idx, v := range decVals {
        if v >= 32 {
            return 0, CorruptInputError(idx)
        }
    }
    return 0 +
        uint64(decVals[0])<<30 |
        // ...
        uint64(decVals[5])<<5 |
        uint64(decVals[6]), nil
case len(b) == 13 && b[0] >= 'g' && b[0] <= 'z':
    decVals := [13]byte{
        decTable[b[0]] & 0x0F, // disregard high bit
        decTable[b[1]],
        // ...
        decTable[b[12]],
    }
    for idx, v := range decVals {
        if v >= 32 {
            return 0, CorruptInputError(idx)
        }
    }
    return 0 +
        uint64(decVals[0])<<60 |
        // ...
        uint64(decVals[11])<<5 |
        uint64(decVals[12]), nil
}
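Putting the two halves together, a round trip looks roughly like this (again, cford32.Uint64 is my assumption for the decoder’s exported name):

package main

import (
    "fmt"
    "github.com/thehowl/cford32"
)

func main() {
    // Encode a value past 2^34 and decode it back.
    enc := cford32.PutCompact(1337733142069)
    dec, err := cford32.Uint64(enc)    // assumed name, see note above
    fmt.Println(string(enc), dec, err) // g00016xvhqnhn 1337733142069 <nil>
}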
The cutoff between the compact and expanded encodings was arbitrarily picked at 34 bits, but it could have been placed at any other (n*5)-1 number of bits (like 24, 29, 34, 39…). I think 17 billion is a reasonable discriminant between identifiers used by applications at a small / medium scale, and those used at very large scales.
Package cford32 is a fork of Go’s standard library encoding/base32, and as such also matches its API for encoding and decoding arbitrary byte slices. This can be useful for encoding arbitrary data, or content-addressable IDs: for instance, diffy uses 40 bits of the sha256 sum to encode identifiers for its uploaded files.
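As a sketch of the content-addressable idea, here is roughly what deriving a diffy-style identifier looks like. I’m using the standard library’s encoding/base32 with the Crockford alphabet, since cford32’s byte-slice function names aren’t shown in this post; going through cford32’s equivalent API would additionally give the typo tolerance described above.

package main

import (
    "crypto/sha256"
    "encoding/base32"
    "fmt"
)

// Crockford's lowercase alphabet, matching encTableLower above.
var crockford = base32.NewEncoding("0123456789abcdefghjkmnpqrstvwxyz").
    WithPadding(base32.NoPadding)

func main() {
    // 40 bits (5 bytes) of the SHA-256 sum encode to exactly 8 characters,
    // with no padding needed.
    sum := sha256.Sum256([]byte("contents of the uploaded file"))
    fmt.Println(crockford.EncodeToString(sum[:5]))
}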
There is currently no associated decentralized ID-generation scheme; using sonyflake, the values are very large and thus always end up in the expanded encoding:
package main

import (
    "fmt"
    "github.com/sony/sonyflake"
    "github.com/thehowl/cford32"
)

func main() {
    sf, err := sonyflake.New(sonyflake.Settings{
        MachineID: func() (uint16, error) { return 0, nil },
    })
    if err != nil {
        panic(err)
    }
    ni, err := sf.NextID()
    if err != nil {
        panic(err)
    }
    b := cford32.PutCompact(ni)
    fmt.Println(string(b))
}
One ID generated with this setup is the following: gffqp6ej00000. My advice, though, is that most applications are unlikely to reach the scale where decentralized ID generation is truly necessary; sequential IDs with the compact encoding are likely to be better for you and your users.
In case you do reach that level of scale, switching over is trivial, as the new IDs are most definitely larger than the largest of your sequentially generated IDs.
The default base64 URL-safe encoding has the following order:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
.
It can be made lexicographically sortable by moving the numbers, dash, and underscore to the order they appear in ASCII:
-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
. ↩︎