magnify
Home Grokbase Base58: Fast Hashing with GMP
formats

Base58: Fast Hashing with GMP

Base58 is an alternative to Base64 that is growing in popularity for case-sensitive encodings due to several characteristics including multi-protocol-safety and human-readability. It is used by Flickr, Bitcoin, and now Grokbase. The general characteristics of Base58 are that it uses the protocol-safe alpha-numeric alphabet (Base62) and excludes easy to confuse digits. Both Flickr and Grokbase use [0-9a-zA-Z] excluding [0O1l] while Bitcoin uses [0-9A-Za-z] excluding [0O1l].

Grokbase uses MD5 hashes of email addresses for UIDs and when I did the update, there were over 500,000 user profiles in the system so I wanted a fast encoder. I ended up using GMP with tr and released the code as Encode::Base58::GMP for Perl and base58gmp for Ruby.

As for how much faster, I ran a few benchmarks comparing Encode::Base58::GMP to the pure Perl Encode::Base58 implementation by Miyagawa and found that the GMP version was about 700x (70,000%) faster encoding MD5 hex values using the Ubuntu 11.10 system Perl. However, for 32-bit integers, the pure Perl version was faster by about 50%.

Two notes about these numbers:

  1. Ubuntu 11.10 x86_64 system Perl isn’t compiled with 64-bit integer support and Encode::Base58 needs integers so I used bignum to enable hex() support. One possibilty is to compile a Perl with 64-bit int support; however, Ubuntu system Perl is part of my stack and many others also rely on system Perl so I left it as that.
  2. Math::GMPz can accept hex values natively and since the goal was to generate MD5 values, I bypassed the integer conversion used with Encode::Base58. When adding integer conversion, GMP based encoding is still faster but drops by an order of magnitude.

70,000% Faster for MD5 (BigInts)

use strict;
use warnings;

use Benchmark qw/cmpthese/;
use bignum qw/hex/;
use Encode::Base58;
use Encode::Base58::GMP;
use Digest::MD5 qw/md5_hex/;

cmpthese(-5, {
  pp  => sub {
    Encode::Base58::encode_base58(hex(md5_hex(rand())))
  },
  gmp => sub {
    Encode::Base58::GMP::encode_base58('0x'.md5_hex(rand()))
  }
} );

Here is a sample result:

       Rate     pp    gmp
pp    111/s     --  -100%
gmp 79644/s 71484%     --


Rate pp gmp
pp 209/s — -100%
gmp 168636/s 80644% –

33% Slower on 32-bit Ints

Here are some results using 32-bit integers.

cmpthese(-5, {
  pp  => sub {
    Encode::Base58::encode_base58(int(rand 2**32))
  },
  gmp => sub {
    Encode::Base58::GMP::encode_base58(int(rand 2**32))
  }
} );
       Rate  gmp   pp
gmp 66733/s   -- -32%
pp  98803/s  48%   --


Rate gmp pp
gmp 161754/s — -33%
pp 243206/s 50% –

Conclusion

While GMP-based encoding is about 33% slower for 32-bit ints, it is significantly faster for 64-bit ints on Ubuntu 11.10 system Perl and very desirable for creating hashes. Numbers will vary with a Perl compiled with 64-bit int support.

Note: Grokbase User URLs

Currently, the Grokbase profile URLs use unsalted MD5 Base58 hashes of email addresses so you can find anyone’s profile page using the following:

use Encode::Base58::GMP qw/md5_base58/;

my $url = 'http://grokbase.com/user/-/'
        . md5_base58( 'test@example.com' );

The ‘-’ after ‘/user’ will get redirected to the user’s display name as shown in their emails.

Using an unsalted MD5 hash is similar to how Gravatar works and has the added benefit of simply requiring a hex to Base58 conversion to convert a Gravatar UID to a Grokbase UID.

Credits:

  • Marcus Bointon‘s article on Base62 encoding using GMP provided the initial idea of using GMP for arbitrary-base transcoding.
  • Sam Rawlins provided an update for Ruby’s gmp to handle arbitrary-base decoding.
  • Case Van Horsen provided an update for Python’s gmpy to support Base37-62.
 
 Share on Facebook Share on Twitter Share on Reddit Share on LinkedIn
1 Comment  comments 
  • sam

    Encode::Base58::GMP needs to support the Bitcoin alphabet.