解析unordeded_map和unordeded_set的底層實現

2019-11-08 02:06:35

字體：大中小

來源：轉載

供稿：網友

//本代碼全部編輯于win10底下的vs2013編譯器 在上一篇文章里邊，我講解了map和set的簡單用法，以及簡單的剖析了map和set的底層實現。那么，現在就簡單的對unordered_map和unordered_set做以簡單的剖析。其實，在STL函數庫里邊有關map和set的總共有8個函數。 map/set/multimap/multiset/unordered_map/unordered_set/unordered_multiset/unordered_multimap(8個) 簡單的做以區分： map:K/V結構。 set:K結構。 multi:可以冗余，如沒有這個關鍵字，則是防冗余版本。 unordered:底層由哈希表（哈希桶算法）來實現，如無該關鍵字，則底層是由紅黑樹來實現。現在想著重講一下unordered_map和unordered_set，由于這兩個用法類似（主要區別是結構不一樣）。主要講解unordered_map。首先，unordered_map的底層是一個防冗余的哈希表（采用除留余數法）。重所周知，哈希表的主要問題在于處理哈希沖突。處理哈希沖突的方法很多：開鏈法（所謂的哈希桶）（主要講解哈希桶，其他的方法就不在這里列舉） 那么什么是開鏈法？其底層結構又是什么樣子呢？

	template<typename K, typename V>  //每個節點的結構	struct HashNode	{		pair<K, V> _kv;		HashNode<K,V>* _next;		HashNode(pair<K,V> p)			:_next(NULL)			, _kv(p)		{}	};	template<typename K, typename V, class HashFunc = _HashFunc<K>> //哈希表的結構，第三個參數是仿函數，為了實現可以存儲string	class HashTable	{	PRotected:		vector<Node*> _table;		size_t _size;	};      這就是它的底層結構！          1.用一個vector來作為一個指針數組來存儲節點的指針，_size來保存當前哈希表中的有效元素個數。     2.由于是K/V結構，所以選擇一個pair的結構來存儲K/V。     3.vector中的每一個元素都指向一個鏈表，所有節點中需要一個next域的指針來指向下一個節點（采用單鏈表表結構）。     4.采用模板來實現哈希表可以存儲任意數據類型的目的。     5.使用了仿函數技術（此處先暫不講解原因）。     vector中不存儲節點而存儲節點指針的原因：首先考慮到指針占的內存比較小，不管是什么類型的數據，其指針都只占4個字節（32位下）。其次，如果直接存節點，那么假如，該列沒有任何一個元素，那么其將存儲一個無用的節點，那么還需要來區分到底是有效節點還是無效節點。比較復雜，所以存儲節點的指針，如果該列無數據，則指針為空。     那么，哈希表最重要的就是增刪查改了。     在這里，采用除留余數法進行數據的插入。（當前的key值%vector的大小，取其余數作為vector中的下標，從而定位該數據存儲的位置）。          1.Insert          在這之前，我將引入一個叫做負載因子（載荷因子）的概念。     其作用就是：由于哈希表的時間復雜度為O（1），如果數據太多，而在vector的空間不變的情況下，勢必每個指針下邊掛的數據會越來越多，那么在查找的時候就勢必要遍歷這個鏈表，從而效率大打折扣，那么一般規定載荷因子的值為1（哈希表的有效元素個數最多達到vector的大小），一旦超出，則將進行擴容操作。     那么，我的做法就是：每次插入前先進行容量檢測（單獨實現一個函數完成），然后通過除留余數法定位（單獨實現一個函數完成，這里有問題，如果是int則沒有問題，后邊將會解決這個問題），其次再被定位的這一列遍歷整個鏈表（防冗余），如不冗余則進行頭插（尾插也可以）。    	pair<Node*, bool> Insert(const pair<K,V>& p)   //防冗余	{		if (_size == _table.size())//控制負載因子為1,當負載因子超過1時,進行擴容		{			CheckCapacity();		}		size_t index = GetIndex(p.first);  //定位下標，除留余數法		Node* cur = _table[index];		while (cur)		{			if (cur->_kv.first == p.first) //防冗余			{				return make_pair(cur, false);			}						cur = cur->_next;		}		Node* tmp = new Node(p);		tmp->_next = _table[index]; //頭插		_table[index] = tmp;		++_size;		return make_pair(_table[index], true);	}      這里先不要管其返回值為什么這么設計，等會會一一解釋。      那么，下邊是擴容的函數和除留余數法定位函數。     	const size_t GetIndex(const K& key) const  //除留余數法定位（使用了仿函數）	{		HashFunc hf;		size_t hash = hf(key);		return hash % _table.size();	}	void CheckCapacity()//擴容	{		HashTable<K, V,HashFunc> ht(_table.size());		for (size_t i = 0; i < _table.size(); ++i)		{			Node* cur = _table[i];			while (cur)			{				ht.Insert(cur->_kv);				cur = cur->_next;			}		}		_table.swap(ht._table);	}      首先，為什么使用除留余數法求下標的時候，要采用仿函數呢？      原因就是：如果插入的數據為string類型，那么string類型的數據是不能求模取余的。因此需要字符串哈希算法來完成轉化。      因此為了實現string的數據存儲，就必須使用仿函數來完成模板中不同類型對象的推演。并且利用模板參數，以及特化來實現。      而模板的推演會匹配與自己類型最接近那份代碼，然后生成相應代碼。     	template<typename K>	struct _HashFunc	{		size_t Operator()(const K& key)		{			return key;		}	};	template<>   //模板的特化	struct _HashFunc<string>	{		size_t operator()(const string& key)		{			return BKDRHash(key.c_str());		}		size_t BKDRHash(const char* str)  //字符串哈希算法		{			register size_t hash = 0;			while (*str)			{				hash = hash * 131 + *str;				str++;			}			return hash;		}	};       為了講Insert然后講到了擴容和除留余數法定位，講到除留余數又講到了定位下標為了保證stirng類型的數據成功存儲，又講到了之前哈希表結構中第三個的模板參數以及仿函數和模板的特化。       那么，擴容是怎么實現的呢？       大家看這個代碼，是不是覺得疑問百出？為什么這么寫就可以實現擴容，而且將數據一一復制過去，并且還沒有釋放原來的空間，這到底是為什么呢？         其實，如果你懂operator=()這個函數的現代寫法的話，就可以很容易看懂上邊的代碼。       首先，看一段string類的operator=()重載代碼。       	String& operator=(const String& s)	{		if (this != &s)		{             	       String str(s._ptr);		       swap(_ptr,str._ptr);		}        	return *this;	}         看這個代碼是不是和上邊擴容的代碼很像呢？都是創建了一個臨時變量，然后對該臨時變量進行賦值，然后交換這個臨時變量和當前對象的指針。         交換了之后，原來this的空間已經由這個臨時變量來管理，出了作用域臨時變量會自己析構（這就是不用釋放空間的原因）。         還要注意的一點就是：總結顯示，如果哈希表的大小為一個質數，則會講題哈希沖突的概率。         質數表：         	unsigned long GetNextSize(unsigned long size) //使用素數作為哈希表的大小可以減少哈希沖突	{		const int _PrimeSize = 28;		static const unsigned long _PrimeList[_PrimeSize] =		{			53ul, 97ul, 193ul, 389ul, 769ul,			1543ul, 3079ul, 6151ul, 12289ul, 24593ul,			49157ul, 98317ul, 196613ul, 393241ul, 786433ul,			1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,			50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,			1610612741ul, 3221225473ul, 4294967291ul		};		for (int i = 0; i < _PrimeSize; ++i)		{			if (_PrimeList[i] > size)			{				return _PrimeList[i];			}		}		return 0;	}          2.Find         查找其實很簡單啦！只需要對被查找數進行定位然后遍歷該位置的單鏈表就可以啦。         	Node* Find(const K& key)	{		size_t index = GetIndex(key);		Node* cur = _table[index];		while (cur)		{			if (cur->_kv.first == key)			{				return cur;			}						cur = cur->_next;		}		return NULL;	}          3.Erase          對于刪除操作，其實坑挺多的，這里要注意啦。          大家是不是想，既然我寫了Find函數，那么我就直接用Find來查找這個被刪的值，如果找到了Find就會返回一個該節點的指針，然后在刪除不就ok了。          其實這么想就錯了。這里是不能這么做的。          因為，只能拿到被刪節點的指針，而該節點位于單鏈表中，必須拿到其前一個節點的指針才可以刪除當前節點。          那么，有的人又想到假刪除（鏈表面試題（刪除一個單鏈表中的非尾節點，要求只能遍歷一次）），其方法就是吧當前節點的下一個節點的值拷貝給當前節點，然后刪除當前節點的下一個節點。看似能完成，那么，如果要刪除的是尾節點，又該怎么辦呢？          有的人又提出，如果是尾節點就和第一個節點進行交換。那么這樣是不是很麻煩呢！而我們起初為了調用Find還不是為了簡單才調用它的嗎？          那么，還不如我們按步就搬。	bool Erase(const K& key)	{		size_t index = GetIndex(key);		Node* cur = _table[index];		Node* prev = NULL;		while (cur)		{			if (cur->_kv.first == key)			{				if (prev == NULL)				{					_table[index] = cur->_next;				}				else				{					prev->_next = cur->_next;				}				delete cur;				--_size;				return true;			}			prev = cur;			cur = cur->_next;		}		return false;	}         這樣，也不會出現很多問題。                  以上就是哈希表的簡單操作，當然遠遠不止這些。最重要的還屬其迭代器的實現了（比較精妙）。         這里先給出其代碼，以后再做講解。                     而unordered_map的借口則是分別調用哈希表的所有接口。         完整代碼：        #pragma once  //unordered_map的底層#include <vector>#include <assert.h>template<typename K, typename V>struct HashNode{	pair<K, V> _kv; 	HashNode<K,V>* _next;	HashNode(pair<K,V> p)		:_next(NULL)		, _kv(p)	{}};template<class K, class V, class HashFunc> //聲明class HashTable;//stl庫里邊是沒有使用Ref和Ptr來達到代碼復用的目的。而是將const和非const迭代器分開來寫。//具體原因在于新加的那個哈希表指針，我自己又將其該為const的，暫時沒有發現問題template<typename K, typename V, typename HashFunc, typename Ref, typename Ptr>  struct HashTableIterator{	typedef HashTableIterator<K, V, HashFunc, Ref, Ptr> Self;	typedef HashTable<K, V, HashFunc> HashTable;	typedef HashNode<K, V> Node;	HashTableIterator(){}	HashTableIterator(Node* ptr, const HashTable* table)		: _ptr(ptr)		, _hashtable(table)	{}	Ref operator*()	{		return _ptr->_kv;	}	Ptr operator->()	{		return &(operator*());	}	Self& operator++()	{		_ptr = _Next(_ptr);		return *this;	}	Self operator++(int)	{		Self tmp = *this;		++*this;		return tmp;	}	bool operator==(const Self& s)	{		return _ptr == s._ptr;	}	bool operator!=(const Self& s)	{		return _ptr != s._ptr;	}	Node* _ptr;	const HashTable* _hashtable;protected:/*	Node* _Next(Node* cur)	{		assert(cur);				Node* old = cur;		cur = cur->_next;				if (!cur)		{			size_t index = _hashtable->GetIndex(old->_kv.first);			while (!cur && ++index < _hashtable->_table.size())   //需要友元				cur = _hashtable->_table[index];		}		return cur;	}*/		Node* _Next(Node* cur)	{		assert(cur);		Node* old = cur;		cur = cur->_next;		if (!cur)		{			size_t index = _hashtable->GetIndex(old->_kv.first);			while (!cur && ++index < _hashtable->GetTable()->size())				cur = _hashtable->GetTable()->operator[](index);		}		return cur;	}};template<typename K>struct _HashFunc{	size_t operator()(const K& key)	{		return key;	}};template<>struct _HashFunc<string>{	size_t operator()(const string& key)	{		return BKDRHash(key.c_str());	}	size_t BKDRHash(const char* str)  //字符串哈希算法	{		register size_t hash = 0;		while (*str)		{			hash = hash * 131 + *str;			str++;		}		return hash;	}};template<typename K, typename V, class HashFunc = _HashFunc<K>>class HashTable{	typedef HashNode<K, V> Node;public:	typedef HashTableIterator<K, V, HashFunc, pair<K, V>&, pair<K, V>*> Iterator;	typedef HashTableIterator<K, V, HashFunc, const pair<K, V>&, const pair<K, V>*> ConstIterator;public:	HashTable()		:_size(0)	{}	HashTable(size_t size)		:_size(0)	{		_table.resize(GetNextSize(size));	}	pair<Node*, bool> Insert(const pair<K,V>& p)   //防冗余	{		if (_size == _table.size())//控制負載因子為1,當負載因子超過1時,進行擴容		{			CheckCapacity();		}		size_t index = GetIndex(p.first);		Node* cur = _table[index];		while (cur)		{			if (cur->_kv.first == p.first)			{				return make_pair(cur, false);			}						cur = cur->_next;		}		Node* tmp = new Node(p);		tmp->_next = _table[index];		_table[index] = tmp;		++_size;		return make_pair(_table[index], true);	}	Node* Find(const K& key)	{		size_t index = GetIndex(key);		Node* cur = _table[index];		while (cur)		{			if (cur->_kv.first == key)			{				return cur;			}						cur = cur->_next;		}		return NULL;	}	bool Erase(const K& key)	{		size_t index = GetIndex(key);		Node* cur = _table[index];		Node* prev = NULL;		while (cur)		{			if (cur->_kv.first == key)			{				if (prev == NULL)				{					_table[index] = cur->_next;				}				else				{					prev->_next = cur->_next;				}				delete cur;				--_size;				return true;			}			prev = cur;			cur = cur->_next;		}		return false;	}	V& operator[](const K& key)	{		pair<Node*, bool> ret;		ret = Insert(make_pair(key,V()));		return ret.first->_kv.second;	}		Iterator Begin()	{		for (size_t i = 0; i < _table.size(); ++i)			if (_table[i])				return Iterator(_table[i], this);		return End();	}	Iterator End()	{		return Iterator(NULL,this);	}	ConstIterator Begin() const	{		for (size_t i = 0; i < _table.size(); ++i)			if (_table[i])				return ConstIterator(_table[i], this);		return End();	}	ConstIterator End() const	{		return ConstIterator(NULL, this);	}	~HashTable()	{		Clear();	}	void Clear()	{		for (size_t i = 0; i < _table.size(); ++i)		{			Node* cur = _table[i];			Node* del = NULL;			while (cur)			{				del = cur;				cur = cur->_next;				delete del;			}			_table[i] = NULL;		}	}	const size_t GetIndex(const K& key) const	{		HashFunc hf;		size_t hash = hf(key);		return hash % _table.size();	}	const vector<Node*>* GetTable() const	{		return &_table;	}protected:	void CheckCapacity()	{		HashTable<K, V,HashFunc> ht(_table.size());		for (size_t i = 0; i < _table.size(); ++i)		{			Node* cur = _table[i];			while (cur)			{				ht.Insert(cur->_kv);				cur = cur->_next;			}		}		_table.swap(ht._table);	}	unsigned long GetNextSize(unsigned long size) //使用素數作為哈希表的大小可以減少哈希沖突	{		const int _PrimeSize = 28;		static const unsigned long _PrimeList[_PrimeSize] =		{			53ul, 97ul, 193ul, 389ul, 769ul,			1543ul, 3079ul, 6151ul, 12289ul, 24593ul,			49157ul, 98317ul, 196613ul, 393241ul, 786433ul,			1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,			50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,			1610612741ul, 3221225473ul, 4294967291ul		};		for (int i = 0; i < _PrimeSize; ++i)		{			if (_PrimeList[i] > size)			{				return _PrimeList[i];			}		}		return 0;	}protected:	vector<Node*> _table;	size_t _size;};