It could be. I am not saying that is the case always. But I have been a witness to several usages where the Identity column is used without being fully aware of all the issues surrounding it.
Please make no mistake. I am referring to the same IDENTITY column provided by Microsoft SQL Server and Sybase, AUTO_INCREMENT provided by MYSQL, COUNTER or IDENTITY used in Microsoft Access, “GENERATED BY DEFAULT AS IDENTITY” syntax supported by IBM DB2 and “SEQUENCE” used in Oracle.
When a Primary Key is defined then it means the values of the Primary Key have to be unique. The database enforces the uniqueness by creating a Primary Index and prevents any attempt to insert a row containing a duplicate value. Thus it is not necessary to explicitly use a Create Index statement. The Primary Index is automatically created once the Primary Key is defined.
The good news is that most of the time we can get away. Modern databases have gone to a large extent with various optimizations to facilitate the use of Identity Column and to be used by itself in a primary or secondary index. The advancements in the hardware and the hard disks are also of major help in reducing the bottleneck. Hence for normal usage which is the case most of the time it might be OK. Still I hope you would agree that it does not hurt to increase our understanding of Identity column and its usage as Primary Key.
Let us see what is the main issue with using an Identity column as a Primary Key.
We all know that Identity column generates numbers in a sequence, incremented by one (the default increment) for every new request. Now if we use it directly in an index it is disaster whether it is primary index or secondary index.
A database index, which is normally implemented as B+ tree, stores data arranged in sorted fashion so that locating a particular key either for retrieval or update is faster. For the sake of simplicity we can set aside B+ tree and instead we can consider a simple linked list maintained in the sorted order. Then every time we insert a new record containing the Identity field, the newly generated auto increment value always gets added to the end of the sorted linked list. (Because the new value is greater than the previously added value). We all know that multiple additions to the end of the sorted link list cannot take place simultaneously. It can only happen serially one after the other. Parallelism is not just compromised, it is totally absent. Switching back to B+ tree and its terminology, the inserts would be trying to add the key to the same index page. Thus a contention does exist for adding entries to the last page of the index and in some cases this bottleneck could be crippling.
This is the major issue I wanted to cover in this blog post. But as I have been saying, this issue need not be of major concern in most of the cases as modern databases are capable of handling few hundred inserts per second if not thousands and depending on the use case, hardware and the RDBMS used, we just need to ensure that we are not having any performance problems.
But if the number of inserts to the database table are large then using the Identity column as a Primary Key could cause performance issues.
Identity column nicely generates a unique id. So imagine because of that feature the Identity column is used as Primary Key in a table
- by a Bank to maintain the records of all withdrawals happening in its ATM across the nation or the globe.
- by a Cellular company to record all the phone calls.
- by Facebook to record all the posts.
- by Twitter to record all the tweets.
These are clear examples of database tables busy with inserts and we may not be able to ignore the contention bottleneck that happens in the last B+ tree page described above.
How do we solve it and what are the alternatives we have?
There is no single solution that can work for all cases.
Obviously if there are already unique fields in the table then we could use them as the Primary Key. Not the Identity column.
Sometimes the intended purpose of the database table could yield us a solution to the problem. For example let us consider the table used by the Cellular company to record all the phone calls. Such a table invariably has to contain a CustomerId or Telephone number or SIM card number to associate the call with a particular customer. Let us say we have the SIM card number which we shall call as SIM_Id. In such a case the primary key could be SIM_Id, CallId where CallId is the Identity Column.
What does it get us? Because SIM_Id is used in the prefix part of the key it provides a nice bifurcation. All the inserts would not cause contention to the last page of the B+ tree. As the ordering is not just CallId but instead it is SIM_Id, CallId, different insert requests would end up inserting their entries in different B+ tree pages and not just in the last page. If we consider the example of sorted linked list which is sorted by SIM_Id, CallId, all the new inserts would not be trying to add only to the end of the linked list. Depending on the value of the SIM_Id, different inserts would be trying to insert at different spots in the sorted linked list and and not just at the end.
Thus in the above case we have used a Compound Primary Key (Compund Key is key with more than one field). The Identity column appears in the trailing part of the key and not in the leading part.
Has this solved the problem discussed here? Yes. I think definitely it has addressed the problem. By whatever mechanism if we can force the inserts to use different B+ tree pages rather than trying to always write to the last page then the contention for the last page gets eliminated.
Is this the recommended solution? Probably not always. It depends on the fields we have and the intended purpose of the database table. We just need to be aware of the various possible solutions and then try to come up with the best solution for the use case we might have at hand.
Are there any more issues with the Identity column? Definitely there are and we shall try to discuss them in the future blog posts.
I have just scraped the surface in this post. There are various concepts, thoughts, insights and few more issues pertaining to the Identity column. If there is enough interest in the readers then I shall put more efforts to bring in more material on this subject. For example, in the previous company to whom I was working as a Consultant, they have quite a bit of write-up on this topic. I can request them to provide that material to share it with you. I can also contact few experts in the database area whom I respect and had a chance to work with and request them to contribute their thoughts.
So I request you to kindly register your vote and comments so that we could take this discussion forward.
I have withdrawn the earlier post and re-drafted it to convey the intention of this blog post much more precisely and succinctly. I hope this time the intended objective has been met. Please let us know your opinion.
Latest posts by Arun Kumar (see all)
- Block IP using ModSecurity - April 9, 2013
- Templates in C Language??!!! - April 1, 2013
- Essential Macros for C Programming - March 20, 2013
- Final Part – Customizing AWS Elastic Beanstalk - March 12, 2013
- Development Environment – Customizing AWS Elastic Beanstalk - March 9, 2013