Module 7 - Hash Map


1. Main Concept of Hash Map

Hashing is the process of converting data of any size (like a string, number, or object) into a fixed-length integer value. This integer value is called a "hash value," "hash code," or simply "hash."

The primary data structure that uses this concept is called a Hash Table.

1.1. Components of a Hash System

A hash-based search system consists of three main components:

The main goal of hashing is to achieve extremely fast data access. In the ideal case, the hash function provides a unique index for every key, allowing insertion, search, and deletion operations to be performed in constant time, or O(1).

Hash Map Analogy:

Imagine a large file cabinet (Hash Table) with 100 drawers numbered 0-99. To store a document for "Budi," you don't search one by one. You use a rule (Hash Function), for example, "Take the first letter ('B' -> 2), store it in drawer #2." When you need to find "Budi," you compute the same rule and immediately jump to drawer #2.

1.2. Good Hash Function Properties

The Hash Function is the most critical part of the Hash Table. A good function must have the following properties:

If the hash function doesn't distribute data well (e.g., it hashes all names starting with 'A', 'B', and 'C' to the same slot), it will cause collisions, and the performance will degrade significantly.

1.3. Common Hash Function Methods

There are several methods to design a hash function, each with different properties. The goal is always to create a "well-distributed" hash that avoids clustering keys in the same buckets.

1.3.1. Division Method

This is the most common and simplest method. It takes the key (which must be a numeric value or convertible to one) and finds the remainder after dividing by the table size.

1.3.2. Multiplication Method

This method is a popular choice because it is far less sensitive to the tableSize (which can be any value, often a power of 2 for efficiency).

1.3.3. Mid-Square Method

This method is particularly effective at breaking up patterns in keys that are sequential or have similar prefixes/suffixes.

1.3.4. Folding Method

This method is excellent for hashing keys that are very long, such as account numbers, ISBNs, or long strings, where other methods might lead to overflow or only use a portion of the key.

2. Collision Handling

A Collision occurs when two or more different keys produce the same hash value (index). For example, "Budi" and "Dina" are both hashed to row #5. We can't overwrite Budi's data. We must have a strategy to handle this.

2.1. Separate Chaining

This is the most common strategy to avoid collisions.

2.1.1. Manual Implementation: Insertion

The insert function handles adding a new key-value pair. Its logic is:

  1. Hash the key to find the correct bucket index.
  2. Get the linked list (the chain) at that index.
  3. Traverse the list:
    • If the key is already in the list, update its value and stop.
    • If the end of the list is reached and the key was not found, add the new key-value pair to the end of the list.
void insert(std::string key, int value) {
    // Hash the key to get the bucket index
    int index = hashFunction(key);

    // Get the chain (linked list) at that index
    std::list<KeyValuePair>& chain = table[index];

    // Traverse the chain
    for (auto& pair : chain) {
        if (pair.key == key) {
            // Key found: update the value and return
            pair.value = value;
            return;
        }
    }

    // Key not found: add new pair to the end of the list
    chain.push_back(KeyValuePair(key, value));
}

2.1.2. Manual Implementation: Searching

The search function finds the value associated with a key. Its logic is:

  1. Hash the key to find the bucket index.
  2. Get the linked list at that index.
  3. Traverse the list:
    • If the key is found, return its value.
    • If the end of the list is reached and the key was not found, return a sentinel value (like -1) or throw an exception.
int search(std::string key) {
    // Hash the key to get the bucket index
    int index = hashFunction(key);

    // Get the chain at that index
    std::list<KeyValuePair>& chain = table[index];

    // Traverse the chain
    for (auto& pair : chain) {
        if (pair.key == key) {
            // Key found: return the value
            return pair.value;
        }
    }

    // Key not found
    return -1; 
}

2.1.3. Manual Implementation: Deletion

The remove function deletes a key-value pair. Its logic is:

  1. Hash the key to find the bucket index.
  2. Get the linked list at that index.
  3. Traverse the list using an iterator. We must use an iterator because we need to know the position of the element to delete it.
  4. If the key is found:
    • Call the list's erase() method, passing the iterator.
    • Stop and return.
void remove(std::string key) {
    // Hash the key to get the bucket index
    int index = hashFunction(key);

    // Get a reference to the chain
    auto& chain = table[index];
    
    // Traverse the chain using an iterator
    for (auto it = chain.begin(); it != chain.end(); ++it) {
        // 'it' is an iterator (position pointer)
        // 'it->key' accesses the key of the element at that position
        if (it->key == key) {
            // Key found: erase the element at this position
            chain.erase(it); 
            return;
        }
    }
    // If loop finishes, key was not found. Do nothing.
}

2.2. Open Addressing (Probing)

This is an alternative strategy where all data is stored within the table itself. No linked lists are used.

3. Load Factor and Rehashing

3.1. Load Factor (λ)

The Load Factor (λ) is a measure of how full the hash table is. It is a critical metric for performance.

3.2. Rehashing

To maintain good performance (close to O(1)), we must keep the load factor low. When the load factor exceeds a certain threshold (e.g., λ > 0.75), the table is "rehashed."

Rehashing is the process of creating a new, larger hash table and re-inserting all existing elements into it. Here is the process of rehashing:

Rehashing is an O(n) operation. It is slow and computationally expensive, but it happens infrequently. Its cost is "amortized" over many O(1) insertions, so the average insertion time remains O(1).

3.3 Rehashing Implementation

// We must track item count 'n' and table size 'm'

int n = 0; // Number of items
int m = 10; // Initial table size
std::vector<std::list<KeyValuePair>> table;
const float MAX_LOAD_FACTOR = 0.75;

void rehash() {
    // Get old table and size
    std::vector<std::list<KeyValuePair>> oldTable = table;
    int oldSize = m;

    // Allocate new, larger table
    m = m * 2; 
    table.clear();
    table.resize(m);
    n = 0; // Reset item count

    // Re-insert all elements from old table
    for (int i = 0; i < oldSize; ++i) {
        for (auto& pair : oldTable[i]) {
            // Re-hash and insert into new table
            insert(pair.key, pair.value); 
        }
    }
}

void insert(std::string key, int value) {
    // Check load factor before inserting
    if ((float)n / m > MAX_LOAD_FACTOR) {
        rehash();
    }

    // Hash the key (uses the new 'm' if rehashed)
    int index = hashFunction(key); 

    // Traverse chain to find or update
    for (auto& pair : table[index]) {
        if (pair.key == key) {
            pair.value = value;
            return;
        }
    }

    // Not found, add new pair and increment item count
    table[index].push_back(KeyValuePair(key, value));
    n++;
}

4. Example: Full Manual Implementation

This chapter combines all the concepts from Chapters 2 and 3 into a single, complete MyHashMap class. This class handles insertion, searching, deletion, and automatic rehashing.

4.1. Manual Implementation Using Separate Chaining

#include <iostream>
#include <string>
#include <vector>
#include <list>

// Data Node: Struct to store Key-Value pairs
struct KeyValuePair {
    std::string key;
    int value;
    KeyValuePair(std::string k, int v) : key(k), value(v) {}
};

class MyHashMap {
private:
    int n; // Number of items
    int m; // Number of buckets (table size)
    std::vector<std::list<KeyValuePair>> table;
    const float MAX_LOAD_FACTOR = 0.75;

    // Hash Function (Simple Division Method)
    int hashFunction(std::string key) {
        int hash = 0;
        // A simple hash: sum ASCII values
        for (char c : key) {
            hash = (hash + (int)c);
        }
        return hash % m; // Use current table size 'm'
    }

    // Rehashing Function
    void rehash() {  
        // Get old table and size
        std::vector<std::list<KeyValuePair>> oldTable = table;
        int oldSize = m;

        // Allocate new, larger table
        m = m * 2;
        table.clear();
        table.resize(m);
        n = 0; // Reset item count

        std::cout << ". New size: " << m << std::endl;

        // Re-insert all elements from old table
        for (int i = 0; i < oldSize; ++i) {
            for (auto& pair : oldTable[i]) {
                // Re-hash and insert into new table
                insert(pair.key, pair.value);
            }
        }
    }

public:
    // Constructor: Initialize table
    MyHashMap(int size = 10) { // Default size of 10
        n = 0;
        m = size;
        table.resize(m);
    }

    // Insert Function (with rehashing)
    void insert(std::string key, int value) {
        // Check load factor before inserting
        if ((float)(n + 1) / m > MAX_LOAD_FACTOR) {
            rehash();
        }

        // Hash the key (uses the new 'm' if rehashed)
        int index = hashFunction(key);

        // Traverse chain to find or update
        for (auto& pair : table[index]) {
            if (pair.key == key) {
                pair.value = value;
                return;
            }
        }

        // Not found, add new pair and increment item count
        table[index].push_back(KeyValuePair(key, value));
        n++;
    }

    // Search Function
    int search(std::string key) {
        int index = hashFunction(key);
        for (auto& pair : table[index]) {
            if (pair.key == key) {
                return pair.value;
            }
        }
        return -1; // Not found
    }

    // Remove Function
    void remove(std::string key) {
        int index = hashFunction(key);
        auto& chain = table[index];
        
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->key == key) {
                chain.erase(it); // Remove element at position 'it'
                n--; // Decrement item count
                return;
            }
        }
    }
};

4.2. Manual Implementation Using Open Addressing

#include <iostream>
#include <string>
#include <vector>

// Define the state for each slot
enum class SlotState {
    EMPTY,    // The slot has never been used
    OCCUPIED, // The slot contains an active key-value pair
    DELETED   // The slot used to have data, but it was removed
};

struct HashSlot {
    std::string key;
    int value;
    SlotState state;

    // Constructor to initialize new slots as EMPTY
    HashSlot() : key(""), value(0), state(SlotState::EMPTY) {}
};

class MyHashMap {
private:
    int n; // Number of items
    int m; // Number of buckets (table size)
    std::vector<HashSlot> table; 
    const float MAX_LOAD_FACTOR = 0.75; 

    // Hash Function (Simple Division Method)
    int hashFunction(std::string key) {
        int hash = 0;
        for (char c : key) {
            hash = (hash + (int)c);
        }
        // Ensure hash is positive before modulo
        return (hash & 0x7FFFFFFF) % m;
    }

    // Rehashing Function
    void rehash() {
        std::cout << "[Info] Rehashing triggered. Old size: " << m;

        // Save the old table
        std::vector<HashSlot> oldTable = table;
        int oldSize = m;

        // Create a new, larger table.
        m = m * 2;
        table.clear(); 
        table.resize(m); 
        n = 0; // Reset item count

        std::cout << ". New size: " << m << std::endl;

        // 6. Re-insert all active elements from the old table
        for (int i = 0; i < oldSize; ++i) {
            // Only re-insert items that are currently occupied
            if (oldTable[i].state == SlotState::OCCUPIED) {
                insert(oldTable[i].key, oldTable[i].value);
            }
        }
    }

public:
    // Constructor: Initialize table
    MyHashMap(int size = 10) { // Default size of 10
        n = 0;
        m = size;
        table.resize(m); 
    }

    // Insert Function
    void insert(std::string key, int value) {
        // Check load factor before inserting
        if ((float)(n + 1) / m > MAX_LOAD_FACTOR) {
            rehash();
        }

        // Get the initial hash index
        int index = hashFunction(key);
        int firstDeletedSlot = -1; // To store the index of the first DELETED slot

        // Start probing (loop max 'm' times)
        for (int i = 0; i < m; ++i) {
            int probedIndex = (index + i) % m;
            auto& slot = table[probedIndex]; // Get a reference to the slot

            // Case 1: Slot is OCCUPIED and the key matches
            if (slot.state == SlotState::OCCUPIED && slot.key == key) {
                // Key already exists, just update the value
                slot.value = value;
                return;
            }

            // Case 2: Slot is EMPTY
            if (slot.state == SlotState::EMPTY) {
                // We found an empty slot, so the key is not in the table.
                // We should insert it here, OR in the first DELETED slot we found.
                int insertIndex = (firstDeletedSlot != -1) ? firstDeletedSlot : probedIndex;
                
                table[insertIndex].key = key;
                table[insertIndex].value = value;
                table[insertIndex].state = SlotState::OCCUPIED;
                n++; // Increment item count
                return;
            }

            // Case 3: Slot is DELETED
            if (slot.state == SlotState::DELETED) {
                // We can insert here, but we must keep searching
                // to see if the key already exists further down the probe chain.
                if (firstDeletedSlot == -1) {
                    firstDeletedSlot = probedIndex;
                }
            }
        }
    }

    // Search Function
    int search(std::string key) {
        // Get the initial hash index
        int index = hashFunction(key);

        // Start probing
        for (int i = 0; i < m; ++i) {
            int probedIndex = (index + i) % m;
            auto& slot = table[probedIndex];

            // Case 1: Slot is OCCUPIED and key matches
            if (slot.state == SlotState::OCCUPIED && slot.key == key) {
                return slot.value; // Item found
            }

            // Case 2: Slot is EMPTY
            if (slot.state == SlotState::EMPTY) {
                // If we hit an EMPTY slot, the key cannot be
                // further down the probe chain.
                return -1; // Not found
            }

            // Case 3: Slot is DELETED
            // We must continue probing, as the key we want
            // might have collided and be after this deleted slot.
            if (slot.state == SlotState::DELETED) {
                continue;
            }
        }

        // We looped through the entire table and didn't find it
        return -1;
    }

    // Remove Function
    void remove(std::string key) {
        int index = hashFunction(key);

        // Start probing to find the key
        for (int i = 0; i < m; ++i) {
            int probedIndex = (index + i) % m;
            auto& slot = table[probedIndex];

            // Case 1: Slot is OCCUPIED and key matches
            if (slot.state == SlotState::OCCUPIED && slot.key == key) {
                // Found it. Set state to DELETED.
                slot.state = SlotState::DELETED;
                n--; // Decrement item count
                return;
            }

            // Case 2: Slot is EMPTY
            if (slot.state == SlotState::EMPTY) {
                // Key is not in the table
                return;
            }

            // Case 3: Slot is DELETED
            if (slot.state == SlotState::DELETED) {
                // Keep probing
                continue;
            }
        }
    }
};

5. Hashing Implementation with C++ STL

In C++, instead of creating a Hash Table manually, you could use Standard Template Library (STL). The STL implementation automatically handles hash functions, collisions, and rehashing when the load factor gets too high.

5.1. std::unordered_map (Key-Value)

std::unordered_map is a Hash Map implementation for Key-Value pairs. The purpose of this template is to map Keys to Values. It acts like a dictionary, where you look up a word (key) to get its definition (value).

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Key: string (name), Value: int (grade)
    std::unordered_map<std::string, int> studentGrades;

    studentGrades["Budi"] = 90;
    studentGrades["Ani"] = 85;

    // Access value using key
    std::cout << "Budi's Grade: " << studentGrades["Budi"] << std::endl;
}

5.2. std::unordered_set (Unique Keys)

std::unordered_set is a Hash Set implementation that only stores unique keys. This template is used to check whether a key is on the list or not, like a guest book.

#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> attendanceList;

    attendanceList.insert("Budi");
    attendanceList.insert("Ani");
    attendanceList.insert("Budi"); // Ignored, because "Budi" already exists

    // Check existence
    if (attendanceList.count("Budi") > 0) {
        std::cout << "Budi is present." << std::endl;
    }
}

5.3. Comparison: map vs. set

Aspect map set
Feature std::unordered_map<Key, Value> std::unordered_map<Key>
What is Stored Key-Value pairs Key only
Element Type std::pair<const Key, Value> Key
Main Purpose Mapping Keys to Values Storing unique Keys
Data Access map[key] (get/set value) set.count(key) (check existence)

6. Custom Struct Keys

A notable limitation of the C++ STL's std::unordered_map and std::unordered_set is their inability to natively support user-defined types (such as a struct or class) as keys. An attempt to instantiate a map with a custom struct, as shown below, will result in a compile-time error.

// This declaration will fail to compile
std::unordered_map<Mahasiswa, int> nilaiUjian;

This failure occurs because the STL hash containers have two fundamental requirements for their key types, which C++ cannot automatically generate for a custom struct:

The compiler cannot, for instance, assume whether two Mahasiswa objects are equal if only their nim matches, or if both nim and nama must match. To resolve this, the programmer must explicitly "teach" C++ how to perform these two operations for the custom type.

6.1. Defining the Equality Operation

The first requirement is met by overloading the equality operator (operator==) for the custom struct. This is the standard C++ mechanism for defining identity and is used directly by the unordered_map to compare keys within a bucket.

struct Mahasiswa {
    std::string nama;
    int nim;
};

// This function defines what makes two Mahasiswa objects
// "equal" in the eyes of the hash map.
bool operator==(const Mahasiswa& a, const Mahasiswa& b) {
    return a.nama == b.nama && a.nim == b.nim;
}

6.2. Providing a Hashing Function via std::hash Specialization

The second requirement is met by providing a custom hash function. The standard mechanism for this is to create a template specialization of the std::hash struct, which resides in the std namespace.

This specialization "injects" our custom hashing logic into the STL, allowing unordered_map to find and use it automatically whenever it needs to hash a Mahasiswa object. This is achieved by defining operator() within the specialized struct.

#include <functional> // Required for std::hash

// We "inject" our hashing logic into the std namespace
namespace std {

// We declare a specialization of the 'hash' template for our type
template<>
struct hash<Mahasiswa> {
    
    // operator() is what the unordered_map will call.
    // It must be 'const' and return a 'size_t'.
    size_t operator()(const Mahasiswa& m) const {
        
        // Get the hash for each member individually.
        size_t hashNama = std::hash<std::string>{}(m.nama);
        size_t hashNim = std::hash<int>{}(m.nim);

        // Combine the individual hashes into one final hash.
        return hashNama ^ (hashNim << 1);
    }
};

} // End of namespace std

6.3. Usage Example

With both the equality operator and the std::hash specialization provided, the std::unordered_map now has all the components necessary to manage the Mahasiswa key. The following code will now compile and function as expected

int main() {
    std::unordered_map<Mahasiswa, int> nilaiUjian;
    
    Mahasiswa budi = {"Budi", 12345};
    nilaiUjian[budi] = 90;
    
    // The map can now hash "budi" to find a bucket
    // and use operator== to check for collisions.
}