SQL CREATE INDEX Statement
The CREATE INDEX statement is used to create indexes in tables.
Indexes allow the database application to find data fast; without reading the whole table.
Indexes
An index can be created in a table to find data more quickly and efficiently.
The users cannot see the indexes, they are just used to speed up searches/queries.
Note: Updating a table with indexes takes more time than updating a table without (because the indexes also need an update). So you should only create indexes on columns (and tables) that will be frequently searched against.
SQL CREATE INDEX Syntax
Creates an index on a table. Duplicate values are allowed:
CREATE INDEX index_name
ON table_name (column_name)
SQL CREATE UNIQUE INDEX Syntax
Creates a unique index on a table. Duplicate values are not allowed:
CREATE UNIQUE INDEX index_name
ON table_name (column_name)
Note: The syntax for creating indexes varies amongst different databases. Therefore: Check the syntax for creating indexes in your database.
CREATE INDEX Example
The SQL statement below creates an index named "PIndex" on the "LastName" column in the "Persons" table:
CREATE INDEX PIndex
ON Persons (LastName)
If you want to create an index on a combination of columns, you can list the column names within the parentheses, separated by commas:
CREATE INDEX PIndex
ON Persons (LastName, FirstName)
SQL DROP INDEX, DROP TABLE
Indexes, tables, and databases can easily be deleted/removed with the DROP statement.
The DROP INDEX Statement
The DROP INDEX statement is used to delete an index in a table.
DROP INDEX Syntax for MS Access:
DROP INDEX index_name ON table_name
DROP INDEX Syntax for MS SQL Server:
DROP INDEX table_name.index_name
DROP INDEX Syntax for DB2/Oracle:
DROP INDEX index_name
DROP INDEX Syntax for MySQL:
ALTER TABLE table_name DROP INDEX index_name
How do database indexes work? And, how do indexes help?
Let’s explain why you would need a database index by going through an example. Let’s say that we have a database table called Employee with three columns – Employee_Name and Employee_Age Employee_Address. Assume that the Employee table has thousands of rows.
Now, let’s say that we want to run a query to find all the details of any employees who are named ‘Jesus’? So, we decide to run a simple query like this:
SELECT * FROM Employee
WHERE Employee_Name = 'Jesus'
What would happen without an index on the table?
Once we run that query, what exactly goes on behind the scenes to find employees who are named Jesus? Well, the database software would literally have to look at every single row in the Employee table to see if the Employee_Name for that row is ‘Jesus’. And, because we want every row with the name ‘Jesus’ inside it, we can not just stop looking once we find just one row with the name ‘Jesus’, because there could be other rows with the name Jesus. So, every row up until the last row must be searched – which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name ‘Jesus’. This is what is called a full table scan.
How a database index can help performance
You might be thinking that doing a full table scan sounds inefficient for something so simple – shouldn’t software be smarter? It’s almost like looking through the entire table with the human eye – very slow and not at all sleek. But, as you probably guessed by the title of this article, this is where indexes can help a great deal. The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined.
What is an index?
So, what is an index? Well, an index is a data structure (most commonly a B- tree) that stores the values for a specific column in a table. An index is created on a column of a table. So, the key points to remember are that an index consists of column values from one table, and that those values are stored in a data structure. The index is a data structure – remember that.
Subscribe to our newsletter for more free interview questions.
What kind of data structure is an index?
B- trees are the most commonly used data structures for indexes. The reason B- trees are the most popular data structure for indexes is because of the fact that they time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B- tree can be sorted. But, which data structure is actually used for index is determined by the RDBMS. And, in some scenarios, you can actually specify which data structure you want your database to use when you create the index itself.
How does a hash table index work?
Hash tables are another data structure that you may see being used as indexes – these indexes are commonly referred to as hash indexes. The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index. For instance, the query we discussed earlier (SELECT * FROM Employee WHERE Employee_Name = ‘Jesus’) could benefit from a hash index created on the Employee_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like “Jesus => 0×28939″, where 0×28939 is a reference to the table row where Jesus is stored in memory. Looking up a value like “Jesus” in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of “Jesus” in the Employee_Name column.
The disadvantages of a hash index
Hash tables are not sorted data structures, and there are many types of queries which hash indexes can not even help with. For instance, suppose you want to find out all of the employees who are less than 40 years old. How could you do that with a hash table index? Well, it’s not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality (like “WHERE name = ‘Jesus’”). What is implied in the key value mapping in a hash table is the concept that the keys of a hash table are not sorted or stored in any particular order. This is why hash indexes are usually not the default type of data structure used by database indexes – because they aren’t as flexible as B- trees when used as the index data structure. Also see: Binary trees versus Hash Tables.
What are some other types of indexes?
Indexes that use a R- tree data structure are commonly used to help with spatial problems. For instance, a query like “Find all of the Starbucks within 2 kilometers of me” would be the type of query that could show enhanced performance if the database table uses a R- tree index.
Another type of index is a bitmap index, which work well on columns that contain Boolean values (like true and false), but many instances of those values – basically columns with low selectivity.
How does an index improve performance?
Because an index is basically a data structure that is used to store column values, looking up those values becomes much faster. And, if an index is using the most commonly used data structure type – a B- tree – then the data structure is also sorted. Having the column values be sorted can be a major performance enhancement – read on to find out why.
Let’s say that we create a B- tree index on the Employee_Name column This means that when we search for employees named “Jesus” using the SQL we showed earlier, then the entire Employee table does not have to be searched to find employees named “Jesus”. Instead, the database will use the index to find employees named Jesus, because the index will presumably be sorted alphabetically by the Employee’s name. And, because it is sorted, it means searching for a name is a lot faster because all names starting with a “J” will be right next to each other in the index! It’s also important to note that the index also stores pointers to the table row so that other column values can be retrieved – read on for more details on that.
What exactly is inside a database index?
So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.
An index also stores a pointer to the table row
So, the question is if the value that we are looking for is found in an index (like ‘Jesus’) , how does it find the other values that are in the same row (like the address of Jesus and his age)? Well, it’s quite simple – database indexes also store pointers to the corresponding rows in the table. A pointer is just a reference to a place in memory where the row data is stored on disk. So, in addition to the column value that is stored in the index, a pointer to the row in the table where that value lives is also stored in the index. This means that one of the values (or nodes) in the index for an Employee_Name could be something like (“Jesus”, 0×82829), where 0×82829 is the address on disk (the pointer) where the row data for “Jesus” is stored. Without that pointer all you would have is a single value, which would be meaningless because you would not be able to retrieve the other values in the same row – like the address and the age of an employee.
How does a database know when to use an index?
When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Jesus’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table. Read this article to understand more about those scenarios: Selectivity in SQL.
Can you force the database to use an index on a query?
Generally, you will not tell the database when to actually use an index – that decision will be made by the database itself. Although it is worth noting that in most databases (like Oracle and MySQL), you can actually specify that you want the index to be used.
How to create an index in SQL:
Here’s what the actual SQL would look like to create an index on the Employee_Name column from our example earlier:
CREATE INDEX name_index
ON Employee (Employee_Name)
How to create a multi-column index in SQL:
We could also create an index on two of the columns in the Employee table , as shown in this SQL:
CREATE INDEX name_index
ON Employee (Employee_Name, Employee_Age)
What is a good analogy for a database index?
A very good analogy is to think of a database index as an index in a book. If you have a book about dogs and you are looking for the section on Golden Retrievers, then why would you flip through the entire book – which is the equivalent of a full table scan in database terminology – when you can just go to the index at the back of the book, which will tell you the exact pages where you can find information on Golden Retrievers. Similarly, as a book index contains a page number, a database index contains a pointer to the row containing the value that you are searching for in your SQL.
What is the cost of having a database index?
So, what are some of the disadvantages of having a database index? Well, for one thing it takes up space – and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up to the minute data as whatever is in the table column(s) that the index covers.
As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.
Interview Questions: Database Indexes
Continuing my series on interview questions, I'm going to spend some time covering ops and sysadmin questions. We'll start by writing up an introduction to database indexes and their structure.
The Question
Most consumer-facing web startups these days use one of the major open source databases, either MySQL or PostgreSQL, to some degree. If you want to prove your worth it's a good idea to get down to the nitty gritty and gain some understanding about these databases' internals.
So, the question: "Explain to me what databases indexes are and how they work."
The Answer In a nutshell a database index is an auxiliary data structure which allows for faster retrieval of data stored in the database. They are keyed off of a specific column so that queries like "Give me all people with a last name of 'Smith'" are fast.
The Theory
Database tables, at least conceptually, look something like this:
id age last_name hometown
-- -- -- --
1 10 Johnson San Francisco, CA
2 27 Smith San Joe, CA
3 15 Rose Palo Alto, CA
4 64 Farmer Mill Valley, CA
5 55 Pauling San Francisco, CA
6 17 Smith Oakland, CA
... ... ... ...
100 49 Meyer Berkeley, CA
101 30 Wayne Monterey, CA
102 18 Schwartz San Francisco, CA
104 6 Johnson San Francisco, CA
... ... ... ...
10000 41 Fetterman Mountain View, CA
10001 25 Breyer Redwood City, CA
That is, a table is a collection of tuples1. If we have a file like this sitting on disk how do we get all records that have a last name of 'Smith?'
The code would wind up looking something like this:
results = []
for row in rows:
if row[2] == 'Smith':
results.append[row]
Finding the appropriate records requires checking the conditions (here, having a last name of 'Smith') for each row. This is linear in the number of rows which, for many databases, could be millions or billions of rows. Bad news.
How can we make it faster?
Database Indexes
Any type of data structure that allows for (potentially) faster access can be considered an index. Let's look at some.
Hash Indexes
Take the same example from above, finding all people with a last name of 'Smith.' One solution would be to create a hash table. The keys of the hash would be based off of the last_name field and the values would be pointers to the database row.
This type of index is called, unsurprisingly, a "hash index." Most databases support them but they're generally not the default type. Why?
Well, consider a query like this: "Find all people who are younger than 45." Hashes can deal with equality but not inequality. That is, given the hashes of two fields, there's just no way for me to tell which is greater than the other, only whether they're equal or not.
B-tree Indexes
The data structure most commonly used for database indexes are B-trees, a specific kind of self-balancing tree. A picture's worth a thousand words, so here's an example. B-tree
The main benefit of a B-tree is that it allows logarithmic selections, insertions, and deletions in the worst case scenario. And unlike hash indexes it stores the data in an ordered way, allowing for faster row retrieval when the selection conditions include things like inequalities or prefixes.
For example, using the tree above, to get the records for all people younger than 13 requires looking at only the left branch of the tree root.
Other Indexes
Hash indexes and B-tree indexes are the most common types of database indexes, but there are others, too. MySQL supports R-tree indexes, which are used to query spatial data, e.g., "Show me all cities within ten miles of San Francisco, CA."
The e are also bitmap indexes, which allow for almost instantaneous read operations but are expensive to change and take up a lot of space. They are best for columns which have only a few possible values.
Subtleties
Performance
Indexes don't come for free. What you gain for in retrieval speed you lose in insertion and deletion speed because every time you alter a table the indexes must be updated accordingly. If your table is updating frequently it's possible that having indexes will cause overall performance of your database to suffer.
There is also a space penalty, as the indexes take up space in memory or on disk. A single index is smaller than the table because it doesn't contain all the data, only pointers to the data, but in general the larger the table the larger the index2.
Design
Nodes in a B-tree contain a value and a number of pointers to children nodes. For database indexes the "value" is really a pair of values: the indexed field and a pointer to a database row. That is, rather than storing the row data right in the index, you store a pointer to the row on disk.
For example, if we have an index on an age column, the value in the B-tree might be something like (34, 0x875900). 34 is the age and 0x875900 is a reference to the location of the data, rather than the data itself.
This often allows indexes to be stored in memory even for tables that are so large they can only be stored on disk.
Furthermore, B-tree indexes are typically designed so that each node takes up one disk block. This allows each node to be read in with a single disk operation.
Also, for the pedants among us, many databases use B+ trees rather than classic B-trees for generic database indexes. InnoDB's BTREE index type is closer to a B+ tree than a B-tree, for example.
Summary
Database indexes are auxiliary data structures that allow for quicker retrieval of data. The most common type of index is a B-tree index because it has very good general performance characteristics and allows a wide range of comparisons, including both equality and inequalities.
The penalty for having a database index is the cost required to update the index, which must happen any time the table is altered. There is also a certain about of space overhead, although indexes will be smaller than the table they index.
For specific data types different indexes might be better suited than a B-tree. R-trees, for example, allow for quicker retrieval of spatial data. For fields with only a few possible values bitmap indexes might be appropriate.
Good Question, Bad Question
I like this question because it shows whether the interviewee is curious enough to dive into these details. For certain higher-level engineering positions knowing this should be second-nature, but even for a generic web development position knowing how your database works will only help you improve the performance of your web application.
Also, it's just arcane enough that you can go through the motions without knowing it, but not so arcane that it's inaccessible to someone without an advanced education. Any decent programmer should be able to understand it — the exceptional ones will go out of their way to learn it.
No comments:
Post a Comment