How to generate a Base36 quick code sequence using the SHA2 hash
On a 'slowish' 2GHz machine it can generate 20,000 quickcodes per minute. The spread of the resulting quickcodes is near-perfectly even, which means that substrings of the quickcode can be used to construct a hashed directory tree for holding, for example, the huge amount of product image files associated with a product catalog. Using a hashed directory tree is a quick and efficient method to host millions of separate files for quick, random access, as most file systems only perform optimally with less than 1,000 contained in a directory
This deterministic techique uses a hash of the sequential Integer Id of an item to generate a typical product quickcode for it, as found in many shopping catalogues. Example of how a quickcode is generated from its primary key integer Id:
- 1 => 8M9LFLN2
- 2 => HZ40H3K0
- 3 => 02LUJYQ2
You can download the Base36 Quickcode Generator test script, which demonstrates an implementation in Perl and MySQL.
What are Quickcodes?
Base36 Quickcodes are used as a human-friendly way to indentify items in very large catalogues as opposed to having to reel off endlessly-long and ambigous integer identifiers - consider why car registrations are not all integers. You nearly always find them alongside items in printed product catalogs and online shops. Airline bookings are also identified with such quickcodes, called PNR's (Passenger Name Records). One well known and altogether different quickcode from that shown here is the one from IKEA, which actually attempts to make spoken words out of their product quickcodes. This results in constant hilarity and occasionally, outrageous cultural insensitivity.
How do I use Quickcodes?
- Use quickcodes for public references to your products, instead of integers. This way, no-one can track your inventory or product history.
- Use a sequential Integer key for internal database references of your product as you normally would, for best performance
- You sequence should be unique to your business. You can do this my applying your own 'salt' to the hash.
- Decide how many Interger Keys your application may require over its lifetime.
- Run the test script for the number of Integer Keys and check that you don't get any duplicate collisions
- Store your unique sequence in a table
Why are Quickcodes useful?
- It is a more human-friendly way of reading and writing a unique product identifier compared to say, an integer identifier.
- They hide your product tables primary key from the outside word, like web apps. This way no-one can guess the size of, say, your product inventory or when last you added new product to your inventory. An adversary can easily iterate along the integer sequence, starting at 1, and slurp your entire product inventory off your website.
- It is near impossible to guess an 8-character base36 quickcode. For example; in a product database of 1,000,000 items, an adversary has a 1 in 2,800,000 chance of guessing a correct product. The chances of winning the UK Lottery are only 5 times worse.
- It is impossible to determine the next, or any other values, in a sequence when one value is known. Not so with sequential integers - you simply add 1 to the key and test.
- Gaps in the series of Integer keys from which quickcodes are formed are not detectable.
- Because of the even (random) distribution of quickcodes (using this technique at least) one can use it as the basis of a hashed directory tree to hold millions of product images.
Durability of this sequence
Durability of such a hash sequence can be quantified in terms of the length of the sequence before a previously-used value occurs and the evenness of the distribution of all the characters in any given column over the whole sequence. This sequence is based on the SHA hash. SHA1 and SHA2 work equally well, but avoid MD5, as it fails this important criterion; the Strict Avalanche Criterion (SAC), i.e. the inversion of any one bit in the input should cause half or more of the bits in the output to change. The result is that the output varies dramatically from one integer input to the next incremental integer input. Here is a statistical view of the evenness of the distribution after 4,500,000 quickcodes have been generated, looking at the first term only of all quickcodes:
|# of quickcodes||stddev||variance||avg||min||max||StdDev/Avg %
The evenness of the distribution can be seen in the relatively low standard deviation from the average.
How does it work?
To make sure that you have a unique sequence that no-one else can replicate, you need to salt the sequence with a secret term. The function below takes the integer value and with your salt value generates an SHA1/SHA2 hash in hex. We keep 11 of the 40 hex digits and convert it to a base-36 value. Sometimes we may end with a base36 result that is longer than 8 characters long, so we always keep only the right-most 8 characters. The important thing here is that the process is consistent. The code below can be
my $hash = Digest::SHA2->new;
# Generate hash in hex form:
my $x = Math::BigInt->new('0x'.$hexhash);
my $quickcode = encode_base36($x);
# Rightmost 8 chars part of base36 hash
LOGDIE("$quickcode too short") if length($quickcode)!=QUICKCODE_SIZE;
But collisions will eventually occur!
Why? Because and 8-digit base36 value is 42 bits wide and the SHA2 hash is 256/384/512 bits wide . We only use the 42 bits of the SHA2 hash and discard the remainder of the hash - it is irrelevant which 42 bits we use, as long we are consistent through the sequence.
Experiments with the downloaded Base36 Quickcode Generator test script create unique sequences of between 1,000,000 and 4,000,000 quickcodes for various salts before a new quickcode collides with a previously-created quickcode. This may or may not be a sufficiently long sequence for you. If you need a longer sequence, you can only find one through trial and error with different salts.