Quick question: are there special hash functions that are optimized for use in hash tables? Or do typical hash table implementations in e.g. Python just use standard hash functions like MD5?
Edit: It's 100% clear now. Thanks for the great answers everyone!
Typically the hash functions you're familiar with from cryptographic or data-integrity use (the SHA family, MD family, etc.) don't make good hash table choices: they produce hashes much larger than needed and are slow to compute, because they're designed for cryptographic properties (extremely low collision rates, no information leakage about inputs, difficulty of finding preimages). When picking a hash function for a hash table, you want one that produces a hash just big enough, with low enough collisions, while still being fast and handling variable-length keys easily. This could be something as simple as byte-wise XOR or addition with some shifting as you iterate over the key, followed by a mod (or even a bitwise AND mask) to pick an index.
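A toy version of that shift-and-XOR idea might look like the sketch below (purely illustrative, not something to use in production; the function name and constants are made up for the example):

```python
def toy_hash(key: bytes, table_size: int = 1024) -> int:
    """Toy hash: shift-and-XOR over the key bytes, then mask to a table index.

    table_size must be a power of two for the bitwise-AND mask to be
    equivalent to a mod.
    """
    h = 0
    for b in key:
        # Mix each byte in with some shifting; mask to keep it 32-bit.
        h = ((h << 5) ^ (h >> 27) ^ b) & 0xFFFFFFFF
    # Bitwise AND against (size - 1) picks an index, same as h % table_size
    # when table_size is a power of two.
    return h & (table_size - 1)

print(toy_hash(b"hello"))  # some index in the range [0, 1024)
```

Real table hashes (FNV, xxHash, SipHash, ...) are far better mixed than this, but the shape is the same: a cheap per-byte mixing step followed by a reduction to the table size.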
However, collision resistance must still be quite good for a general-purpose hash table, or for any table that might be exposed to attackers; otherwise denial-of-service attacks (flooding the table with keys crafted to collide) become very easy.
Many "modern" implementations (Python, Ruby, Perl, Rust, Redis, ...) use SipHash with a random seed for this very reason.
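You can observe the random seed in CPython directly: `str` hashing is seeded per process, and fixing `PYTHONHASHSEED` makes it reproducible. A small sketch (spawning child interpreters so each gets its own seed):

```python
import os
import subprocess
import sys

def hash_with_seed(seed: str) -> int:
    """Run a child interpreter with a fixed PYTHONHASHSEED and report hash('hello')."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('hello'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

print(hash_with_seed("1"))
print(hash_with_seed("2"))  # almost certainly different: the SipHash key changed
```

Same seed, same hash; different seed, (almost certainly) different hash. That per-process key is what stops an attacker from precomputing colliding keys.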
In order to use a standard hash function, you would first need to serialize the object. I'm not aware of any programming language that does this.
Instead, the approach taken by (at least) Java and Python is to define a "hash" function on objects that classes can override. The standard way of implementing such a function is to combine the hashes of the object's fields.
Python advises doing this by wrapping the fields in a tuple and returning hash((self.a, self.b, ...)). [1]
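Concretely, for a hypothetical two-field class that might look like this (note the double parentheses: the fields are packed into a single tuple):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # __hash__ should always be paired with a matching __eq__
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Delegate to the tuple hash, which already mixes the fields well
        return hash((self.x, self.y))

# Equal objects now hash equally, so they work as dict keys:
d = {Point(1, 2): "here"}
print(d[Point(1, 2)])  # prints "here"
```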
Java takes a similar approach, but does not make an explicit recommendation on how to implement hashCode(). In my experience, most programmers just XOR the hashes of the fields, which (depending on the object) could be very sub-optimal, but is often good enough. Based on the docs, the typical implementation of Object.hashCode is to just use the memory address of the object.
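Why XOR-only combining is sub-optimal is easy to demonstrate (here in Python, though the same holds for Java's hashCode): XOR is commutative, so swapping two fields produces a guaranteed collision, which a tuple-style positional hash avoids.

```python
def xor_hash(a, b):
    # Naive field combination: the order of the fields is lost
    return hash(a) ^ hash(b)

def tuple_hash(a, b):
    # Tuple hashing mixes in position, so (a, b) and (b, a) differ in practice
    return hash((a, b))

print(xor_hash(1, 2) == xor_hash(2, 1))      # True: guaranteed collision
print(tuple_hash(1, 2) == tuple_hash(2, 1))  # False
```

XOR also cancels identical fields (h ^ h == 0), so any object whose two fields are equal collides with every other such object.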
There's also FarmHash, whose 32-bit version is 2x as fast as xxHash on little endian machines (at least according to this benchmark suite https://github.com/rurban/smhasher).
I recall reading at some point that Go can use the CPU's AES instructions (AES-NI) to compute fast, high-quality hashes, which is cool.