How to Hash String into GUID in C#

Eternal problem: how to turn Hello World! into 2ed51d39-b424-8f5d-bc12-4be4f3fbc6fc?

I made a Word Search game, and part of the code is saving the words which were guessed.

To save memory and disk space (there are around 5000 words in 100 categories kept in memory during the game), I used a HashSet with word IDs.

The issue was that it was still marked as guessed whenever I fixed the word spelling, and when I changed its ID, it became unguessed.

I've decided to use GUIDs generated directly from the word (string).

I needed something which generates GUIDs fast and makes unique GUIDs for each word.

I came up with the following extension function, which is also available on GitHub.

public static class StringExtensions
{
    private const ulong KnuthInitValue = 3074457345618258791ul;
    private const ulong KnuthMultiValue = 3074457345618258799ul;

    // Based on https://stackoverflow.com/questions/9545619/a-fast-hash-function-for-string-in-c-sharp
    public static Guid ToGuid(this string str)
    {
        // Double Knuth hash
        ulong hashedValueEven = KnuthInitValue;
        ulong hashedValueOdd = KnuthInitValue;
        for (int i = 0; i < str.Length; i++)
        {
            if (i % 2 == 0)
            {
                hashedValueEven += str[i];
                hashedValueEven *= KnuthMultiValue;
            }
            else
            {
                hashedValueOdd += str[i];
                hashedValueOdd *= KnuthMultiValue;
            }
        }
        return new Guid(new byte[] {
            (byte)(0xff & hashedValueEven),
            (byte)(0xff & hashedValueEven >> 8),
            (byte)(0xff & hashedValueEven >> 16),
            (byte)(0xff & hashedValueEven >> 24),
            (byte)(0xff & hashedValueEven >> 32),
            (byte)(0xff & hashedValueEven >> 40),
            (byte)(0xff & hashedValueEven >> 48),
            (byte)(0xff & hashedValueEven >> 56),
            (byte)(0xff & hashedValueOdd),
            (byte)(0xff & hashedValueOdd >> 8),
            (byte)(0xff & hashedValueOdd >> 16),
            (byte)(0xff & hashedValueOdd >> 24),
            (byte)(0xff & hashedValueOdd >> 32),
            (byte)(0xff & hashedValueOdd >> 40),
            (byte)(0xff & hashedValueOdd >> 48),
            (byte)(0xff & hashedValueOdd >> 56),
        });
    }
}

It uses two primes to shuffle things around. It generates a 64-bit hash for odd letters, separate 64-bit-hash for even letters, and squeezes them into a GUID.

Console.WriteLine("Hello World!".ToGuid());

> 2ed51d39-b424-8f5d-bc12-4be4f3fbc6fc

To test uniqueness, I've generated GUIDs for over 28 million English words and word-like strings from the Google Books project, using Google Books Reducer data.

There were no collisions, which is not surprising, given that there are 3.4e+38 combinations in a 128-bit GUID.

I've also tested how fast it was. On my beefy, Windows-11-unworthy Xeon E3, it generated 12M (millions) hashes per second.

I compared it with MD5 and SHA256. It is 4.4 times faster than MD5 and 6.7 times faster than SHA256.

That is hardly a surprise; it trades security (which I didn't need) for speed, simplicity and a small hash size.