Hash Table (해시 테이블)

Reference (참고 사이트)
  • 6.5. Hashing - Problem Solving with Algorithms and Data Structures: https://runestone.academy/ns/books/published/pythonds/SortSearch/Hashing.html
  • Hash table - Wikipedia: https://en.wikipedia.org/wiki/Hash_table
  • Hash Tables - programming.guide: https://programming.guide/hash-tables.html

What is a Hash Table?

Data structure that stores elements in key-value pairs, where
  • Key - unique integer (or property name in JavaScript objects) that is used for indexing the values
  • Value - data that is associated with the key
  • Bucket - array cell/slot where nodes of key-value pairs are stored
notion image

For fast data storage and retrieval

  • In a linear search (e.g. a linked list) or a binary search (e.g. a BST), lookups have a time complexity of O(n) or O(log n), respectively. For large datasets, these costs can become significant
  • Hashing allows lookups to occur in constant time (O(1))

Actions and their time complexity

insert ⇒ O(1)
  • dependent on the hash function (but we can assume that most programming languages use a hash function that runs in O(1))
lookup (finding a value based on a key) ⇒ O(1)
  • same as insert: the given key (property name) is hashed, which points directly to the address where the value is stored
  • however, this depends on whether there are any collisions
delete ⇒ O(1)
  • same idea, because we locate the entry by its key
  • also, since hash tables are unordered, we don’t have to shift indexes like we would with arrays
search (check if a certain key exists) ⇒ O(1)
  • same as lookup
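As a rough sketch of these operations, JavaScript's built-in Map (which most engines back with a hash-table-like structure) exposes all four in amortized constant time; the keys and values below are made up for illustration:

// average-case O(1) operations on a built-in Map
const prices = new Map();

prices.set('grapes', 10000);        // insert
prices.set('apples', 500);          // insert

console.log(prices.get('grapes'));  // lookup -> 10000
console.log(prices.has('apples'));  // search -> true

prices.delete('apples');            // delete
console.log(prices.has('apples'));  // -> false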

Hash functions (해시 함수)

A hash function maps each element of a dataset to an index in the table. It generates a value of fixed length for every input it receives. There are many types of hash functions (e.g. MD5, SHA-1, SHA-256), and we can assume that the hash functions built into our programming languages run in O(1).
 
A hash function for a hash table generates a hash output, which is then used as an index into the hash table, and an index space (or address space) is allotted for that index.
 
Some key aspects of hash functions:
  • One-way function (일방향) - it is practically impossible to recover the input from the output (the key cannot be retrieved from its hash)
  • Idempotent (멱등성) - the same input always gives the same output. It is not a random gibberish generator: if you keep feeding in the same value, the output stays the same until the input changes. (A good hash function should also distribute different keys evenly so that they rarely produce the same hash.)
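A quick sketch of these two properties, assuming a Node.js environment with the built-in crypto module (the sha256 helper below is just for illustration):

const crypto = require('crypto');

// helper (for illustration): SHA-256 digest of a string, as hex
const sha256 = (input) => crypto.createHash('sha256').update(input).digest('hex');

// Idempotent: hashing the same input always gives the same output
console.log(sha256('grapes') === sha256('grapes')); // true

// Different inputs (almost always) give different outputs
console.log(sha256('grapes') === sha256('grape'));  // false

// One-way: there is no practical way to recover 'grapes' from sha256('grapes')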

Hash Collisions (해시 충돌)

A collision occurs when two keys get mapped to the same index/memory space.
notion image
Remember, our computer has limited space, and when we create a hash table, it decides how much space to allocate. It does not allocate all of its memory to this hash table; just a portion of it.
 
Within this limited space, the hash function maps each key to a slot in memory where the data is stored. Although hash functions are designed to distribute data evenly across the available slots, there is still a possibility that two keys will map to the same index. This situation, where a newly inserted key maps to an already occupied slot in the hash table, is called a hash collision, and there are multiple ways of handling collisions.
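For example, with a toy hash function that just sums character codes modulo the table size (made up here purely for illustration), two different keys can easily land on the same index:

const TABLE_SIZE = 10;

// toy hash: sum of character codes, kept within the table bounds
function toyHash(key) {
  let sum = 0;
  for (let i = 0; i < key.length; i++) {
    sum += key.charCodeAt(i);
  }
  return sum % TABLE_SIZE;
}

console.log(toyHash('ab')); // (97 + 98) % 10 = 5
console.log(toyHash('ba')); // (98 + 97) % 10 = 5 -> collision: different keys, same index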
 

Load Factor & Capacity

The load factor is the average number of key-value pairs per bucket.
When the load factor reaches a given limit, rehashing kicks in. Since rehashing increases the number of buckets, it reduces the load factor.
 
However, there is a tradeoff between time and space costs that we must consider when configuring the load factor limit.
notion image
Statistically speaking, a hash table performs best when it operates at a load factor between 0.25 and 0.75.
 
The capacity is the maximum number of key-value pairs for the given load factor limit and current bucket count. Since rehashing increases the number of buckets, it also increases the capacity.
Here is an example of a hash table configured with a load factor limit of 4:
notion image
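A small sketch of how these numbers relate; the 0.75 threshold here mirrors the resize check used in the implementation further below, while the chaining example above uses a limit of 4:

// load factor = entries / buckets; capacity = buckets * load factor limit
const bucketCount = 16;
const loadFactorLimit = 0.75;

const capacity = bucketCount * loadFactorLimit; // 12 entries before rehashing kicks in

function shouldRehash(entryCount, buckets, limit = loadFactorLimit) {
  return entryCount / buckets > limit;
}

console.log(shouldRehash(11, bucketCount)); // false (11 / 16 ≈ 0.69)
console.log(shouldRehash(13, bucketCount)); // true  (13 / 16 ≈ 0.81)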

Methods to resolve collisions (해시 충돌 해결)

Separate Chaining (분리 연결법)

Separate Chaining - forming linked lists at slots where collisions occur

Data structures used for Separate Chaining:

  1. Linked List
  2. Self-Balancing Binary Search Tree (Red-Black Tree)
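A minimal sketch of separate chaining with plain arrays as the chains (a full class-based version appears in the implementation section below; the names here are made up for illustration):

const SIZE = 8;
const buckets = new Array(SIZE);

function hash(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) h = (h + key.charCodeAt(i)) % SIZE;
  return h;
}

function insert(key, value) {
  const index = hash(key);
  if (!buckets[index]) buckets[index] = [];  // start a new chain at this slot
  buckets[index].push([key, value]);         // colliding keys share the chain
}

function lookup(key) {
  const chain = buckets[hash(key)];
  if (!chain) return undefined;
  for (const [k, v] of chain) {
    if (k === key) return v;                 // walk the chain to find the key
  }
  return undefined;
}

insert('grapes', 10000);
insert('apples', 500);
console.log(lookup('grapes')); // 10000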

Rehashing (Resize)

As the lengths of the linked lists grow, the average lookup time increases. Therefore, to keep linked lists short and lookups fast, the number of buckets must be increased (and nodes redistributed). This process is called rehashing.
notion image
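Conceptually, rehashing just allocates more buckets and re-inserts every existing entry against the new size (the _resize method in the implementation below does exactly this); a rough sketch, where hashWith is a hypothetical hash helper parameterized by table size:

// rough sketch: double the bucket count and redistribute all entries
function rehash(oldBuckets, hashWith) {
  const newBuckets = new Array(oldBuckets.length * 2);
  for (const chain of oldBuckets) {
    if (!chain) continue;
    for (const [key, value] of chain) {
      const index = hashWith(key, newBuckets.length); // re-hash against the new size
      if (!newBuckets[index]) newBuckets[index] = [];
      newBuckets[index].push([key, value]);
    }
  }
  return newBuckets;
}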
 

Open Addressing (개방 주소법)

Tries to find another open slot to hold the item that caused the collision.
notion image

Linear Probing

  • Systematically visiting each slot one at a time to find the next open slot (adding +1)
  • Creates a tendency for clustering
    • If many collisions occur at the same hash slot, the surrounding slots get filled up by linear probing. This affects items inserted later (one collision tends to induce further collisions)
// step forward one slot at a time, wrapping around the table
function linearProbing(pos, tableSize) {
  return (pos + 1) % tableSize;
}

const newHashValue = linearProbing(oldHashValue, tableSize);

Quadratic Probing

  • Similar to linear probing but instead of adding +1, adds a successive value of an arbitrary quadratic polynomial until an open slot is found. This is to lessen the problem of clustering.
notion image
// on the i-th probe attempt, offset by i * i from the original position
function quadraticProbing(pos, i, tableSize) {
  return (pos + i * i) % tableSize;
}
// try i = 0, 1, 2, ... until an open slot is found
 

Open Addressing vs. Separate Chaining

  • First of all, both approaches are O(n) in the worst case.
  • However, because open addressing stores data in contiguous memory, it has better cache efficiency than separate chaining. So when the number of items is small enough, open addressing performs better than separate chaining.
  • There is one more difference: compared to separate chaining, open addressing keeps consuming its buckets. Separate chaining can therefore postpone table expansion (rehashing) longer.
    • The purpose of a supplemental hash function (보조 해시 함수) is to transform the key's hash value so as to reduce the chance of collisions. It is used together with separate chaining, and it helps avoid cases that approach the worst case.

Arrays vs. Sets and Objects vs. Maps

Arrays
notion image
  • Use when you need an ordered list of values (which may contain duplicates)
  • Use when you need to manipulate data
 
Sets
notion image
  • Use when you need to work with unique values
  • Use when high performance is really important (operations like searching for an item or deleting an item from a set can be 10x faster than in arrays)
  • Use to remove duplicates from arrays
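For example, deduplicating an array with a Set:

const values = [2, 5, 1, 2, 3, 5, 1, 2, 4];

const unique = [...new Set(values)]; // a Set keeps only unique values
console.log(unique);                 // [2, 5, 1, 3, 4]

const uniqueSet = new Set(values);
console.log(uniqueSet.has(3));       // fast membership check -> true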
Objects
notion image
  • More “traditional” key/value store
  • Easier to write and access values with . and [] operators

  • Use when you need to include functions (methods) and also use this keyword to access properties of the same object
  • Use when working with JSON (can convert to map)
Maps
notion image
  • Better performance
  • Keys can have any data type
  • Easy to iterate
  • Easy to compute size of map

  • Use when you simply need to map keys to values
  • Use when you need keys that are NOT strings
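A short sketch contrasting the two (values made up for illustration):

// Object: string (or Symbol) keys, convenient literal syntax, plays well with JSON
const objStore = { grapes: 10000, apples: 500 };
console.log(objStore.grapes);                 // 10000
console.log(Object.keys(objStore).length);    // size must be computed manually -> 2

// Map: keys of any type, built-in size, directly iterable
const mapStore = new Map();
mapStore.set('grapes', 10000);
mapStore.set(42, 'the answer');               // non-string key
mapStore.set(true, 'boolean key');

console.log(mapStore.size);                   // 3
for (const [key, value] of mapStore) {
  console.log(key, value);                    // easy iteration in insertion order
}

// JSON interop: a plain object can be converted into a Map
const fromJson = new Map(Object.entries(objStore));
console.log(fromJson.get('apples'));          // 500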

Implementing Hash Tables (해시 테이블 구현)

Simple hash function

_hash(key) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // combine character codes and keep the result within the table bounds
    hash = (hash + key.charCodeAt(i) * i) % this.data.length;
  }
  return hash;
}

Simple hash table constructor

class HashTable {
  constructor(size) {
    this.limit = size;                  // initial number of buckets
    this.data = new Array(this.limit);
    this.count = 0;                     // how many key-value pairs are currently stored
  }

  _hash(key) {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      hash = (hash + key.charCodeAt(i) * i) % this.data.length;
    }
    return hash;
  }

  _resize(limit) {
    const oldData = this.data;
    this.limit = limit;
    this.data = new Array(this.limit);
    this.count = 0;

    // iterate through oldData and re-insert every entry into the new buckets
    for (let bucket of oldData) {
      if (bucket) {
        for (let entry of bucket) {
          this.insert(entry[0], entry[1]);
        }
      }
    }
  }

  insert(key, value) {
    const index = this._hash(key);
    if (!this.data[index]) {
      this.data[index] = [];            // start a new bucket (separate chaining)
    }
    this.data[index].push([key, value]);
    this.count++;

    // check if the new insert pushes the load factor above 0.75
    if (this.count / this.limit > 0.75) {
      this._resize(this.limit * 2);
    }
  }

  lookup(key) {
    const index = this._hash(key);
    const currentBucket = this.data[index];
    if (!currentBucket) return console.error('Item does not exist');
    for (const entry of currentBucket) {
      if (entry[0] === key) return entry[1];
    }
  }

  keys() {
    const keysArr = [];
    for (let bucket of this.data) {
      if (bucket) {
        for (let entry of bucket) {
          keysArr.push(entry[0]);
        }
      }
    }
    return keysArr;
  }
}

const myHashTable = new HashTable(100);
myHashTable.insert('grapes', 10000);
myHashTable.insert('grape', 12345);
myHashTable.insert('apples', 500);
myHashTable.insert('banana', 2000);
console.log(myHashTable);
console.log(myHashTable.lookup('grape'));
  • In keys(), we have to loop over every bucket in the table and then over every entry inside each bucket, because the table is unordered and we have to account for collisions. So retrieving all keys is not O(1); it costs time proportional to the table size plus the number of entries
    • This is also part of why iterating over object keys with for...in in JavaScript is relatively slow
 

Exercise: Find First Recurring Character in Array

Test Cases
  • [2, 5, 1, 2, 3, 5, 1, 2, 4] ⇒ 2
  • [2, 1, 1, 2, 3, 5, 1, 2, 4] ⇒ 1
  • [2, 3, 4, 5] ⇒ undefined
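One way to solve this with a hash-table-backed structure (a Set), sketched below: scan left to right and return the first value that has already been seen, which makes a single O(n) pass instead of comparing every pair of elements:

function firstRecurringCharacter(array) {
  const seen = new Set();
  for (const item of array) {
    if (seen.has(item)) return item; // first value we have already seen
    seen.add(item);
  }
  return undefined;                  // no value recurs
}

console.log(firstRecurringCharacter([2, 5, 1, 2, 3, 5, 1, 2, 4])); // 2
console.log(firstRecurringCharacter([2, 1, 1, 2, 3, 5, 1, 2, 4])); // 1
console.log(firstRecurringCharacter([2, 3, 4, 5]));                // undefined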