Top database Questions

568
Charles Faiga

What are common database development mistakes made by application developers?

Answered By: cletus ( 1004)

1. Not using appropriate indices

This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE clause you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
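
For example, a minimal sketch of the kinds of indexes meant here (table and column names are invented for illustration):

CREATE INDEX idx_orders_customer_id ON orders (customer_id);            -- index the foreign key
CREATE INDEX idx_orders_status_created ON orders (status, created_at);  -- multi-column index covering a common WHERE/ORDER BY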

2. Not enforcing referential integrity

Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.

It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM, or who are using InnoDB but aren't using referential integrity anyway.
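
As a sketch (hypothetical table names), enforcing the relationship on an InnoDB table looks like this:

ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customer (id);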

3. Using natural rather than surrogate (technical) primary keys

Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).

In my opinion you should always use surrogate keys.

This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people who think natural keys are OK in some situations, you won't find any criticism of surrogate keys other than being arguably unnecessary. That's quite a small downside if you ask me.

EDIT: I know this is not how EDITs were intended, but "you won't find any criticism of surrogate keys other than being arguably unnecessary" is false. You can find it by googling, and by reading the comments to this answer. Using surrogate keys for everything means you lose the ability to use clustered indexes properly. If you have a large table, that is an awful lot to give up.

Remember, even countries can cease to exist (for example, Yugoslavia).

4. Writing queries that require DISTINCT to work

You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:

SELECT DISTINCT ...

This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this by hand as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a substitute or a stopgap for writing correct queries.

From Why I Hate DISTINCT:

Where things start to go sour in my opinion is when a developer is building a substantial query, joining tables together, and all of a sudden he realizes that it looks like he is getting duplicate (or even more) rows and his immediate response... his "solution" to this "problem" is to throw on the DISTINCT keyword and POOF all his troubles go away.
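
As an illustration (table names invented), here is the "papered over" version next to a corrected one that avoids producing duplicates in the first place:

-- Duplicates introduced by the join, hidden with DISTINCT
SELECT DISTINCT c.id, c.name
FROM customer c
JOIN orders o ON o.customer_id = c.id;

-- Same result without generating duplicates at all
SELECT c.id, c.name
FROM customer c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);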

5. Favouring aggregation over joins

Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.

To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:

From SQL statement - “join” vs “group by and having”:

First query:

SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3

Query time: 0.312 s

Second query:

SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
WHERE t1.roleid = 1

Query time: 0.016 s

That's right. The join version I proposed is twenty times faster than the aggregate version.

6. Not simplifying complex queries through views

Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:

  • Party: people and organisations;
  • Party Role: things those parties did, for example Employee and Employer;
  • Party Role Relationship: how those roles related to each other.

Example:

  • Ted is a Person, being a subtype of Party;
  • Ted has many roles, one of which is Employee;
  • Intel is an organisation, being a subtype of a Party;
  • Intel has many roles, one of which is Employer;
  • Intel employs Ted, meaning there is a relationship between their respective roles.

So there are five tables joined to link Ted to his employer. Assuming all employees are Persons (not organisations), you can provide this helper view:

CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id
JOIN party p2 ON parent.party_id = p2.id

And suddenly you have a very simple view of the data you want but on a highly flexible data model.
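
Querying it is then trivial (assuming the columns defined above):

SELECT given_names, surname, employer_name
FROM vw_employee
WHERE employer_name = 'Intel';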

7. Not sanitizing input

This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.

Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.

8. Not using prepared statements

Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:

SELECT * FROM users WHERE username = 'bob'

vs

SELECT * FROM users WHERE username = ?

or

SELECT * FROM users WHERE username = :username

depending on your platform.
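
Normally these placeholders are handled by your language's database driver, but as a sketch, MySQL also exposes prepared statements directly in SQL:

PREPARE stmt FROM 'SELECT * FROM users WHERE username = ?';
SET @name = 'bob';           -- the data is supplied separately from the compiled query
EXECUTE stmt USING @name;
DEALLOCATE PREPARE stmt;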

I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).

Using prepared statements will also give you meaningful statistics about how often certain queries are used.

Prepared statements will also better protect you against SQL injection attacks.

9. Not normalizing enough

Database normalization is basically the process of optimizing database design or how you organize your data into tables.

Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat each element of that array as a separate row in a child table (ie a one-to-many relationship).

This also came up in Best method for storing a list of user IDs:

I've seen in other systems that the list is stored in a serialized PHP array.

But lack of normalization comes in many forms.
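
As a sketch of the normalized alternative (names invented), each element of the list becomes a row in a child table:

-- Instead of storing user_ids = '1,5,7,12' in a single column...
CREATE TABLE list_member (
    list_id INT NOT NULL,             -- foreign key to the parent list
    user_id INT NOT NULL,             -- one row per member
    PRIMARY KEY (list_id, user_id)
);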

10. Normalizing too much

This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.

I once worked on a system that had a huge hierarchy for clients that went something like:

Licensee ->  Dealer Group -> Company -> Practice -> ...

such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.

More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.

11. Using exclusive arcs

An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
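
A sketch of what an exclusive arc looks like (invented names), including the check constraint needed to hold it together:

-- An address must belong to exactly one of: a person OR an organisation
CREATE TABLE address (
    id              INT PRIMARY KEY,
    person_id       INT NULL,         -- foreign key to person
    organisation_id INT NULL,         -- foreign key to organisation
    CHECK (
        (person_id IS NOT NULL AND organisation_id IS NULL) OR
        (person_id IS NULL AND organisation_id IS NOT NULL)
    )
);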

From A Practical Guide to Relational Database Design:

We have strongly advised against exclusive arc construction wherever possible, for the good reason that they can be awkward to write code and pose more maintenance difficulties.

12. Not doing performance analysis on queries at all

Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.

13. Over-reliance on UNION ALL and particularly UNION constructs

A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
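
For illustration (table names invented):

-- Simple concatenation: keeps every row from both sets
SELECT userid FROM archived_users
UNION ALL
SELECT userid FROM active_users;

-- Implicitly de-duplicates the combined result (extra sort/hash work)
SELECT userid FROM archived_users
UNION
SELECT userid FROM active_users;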

UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.

UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.

14. Using OR conditions in queries

This might seem harmless. After all, ANDs are OK. ORs should be OK too, right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it, but not in a way that lends itself to optimisation, particularly when the different OR conditions might intersect, thus forcing the optimizer to effectively do a DISTINCT operation on the result.

Bad:

... WHERE a = 2 OR a = 5 OR a = 11

Better:

... WHERE a IN (2, 5, 11)

Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.

15. Not designing their data model to lend itself to high-performing solutions

This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.

In some ways this point summarizes all the earlier ones, but it's more of a cautionary tale: query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:

Premature optimization is the root of all evil

16. Incorrect use of Database Transactions

All data changes for a specific process should be atomic, i.e. if the operation succeeds, it does so fully; if it fails, the data is left unchanged. There should be no possibility of 'half-done' changes.

Ideally, the simplest way to achieve this is to design the entire system so that all data changes are made through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine does it automatically.

However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.

  • Begin a Transaction before the first statement.
  • Commit the Transaction after the last statement.
  • On any error, Rollback the Transaction. And, very importantly, don't forget to skip/abort all statements that follow the error.

It is also recommended to pay careful attention to the subtleties of how your database connectivity layer and database engine interact in this regard.
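
In plain SQL the pattern above looks roughly like this (exact syntax varies by engine; the account table and amounts are invented):

BEGIN TRANSACTION;

UPDATE account SET balance = balance - 100 WHERE id = 1;
UPDATE account SET balance = balance + 100 WHERE id = 2;

COMMIT;
-- On any error: ROLLBACK; and skip any remaining statements in the unit of work.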

17. Not understanding the 'set-based' paradigm

The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in languages like Java, C#, Delphi etc.

This lack of understanding manifests itself in a few ways.

  • Inappropriately imposing too much procedural or imperative logic on the database.
  • Inappropriate or excessive use of cursors. Especially when a single query would suffice.
  • Incorrectly assuming that triggers fire once per row affected in multi-row updates.

Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
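
As a sketch of the cursor point above (hypothetical tables; MySQL-style syntax), a single set-based statement replaces a row-by-row loop:

-- Procedural habit: open a cursor over customer rows and update them one at a time.
-- Set-based equivalent: one statement does the whole job.
UPDATE customer c
JOIN country co ON co.id = c.country_id
SET c.region = co.region;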

415
James Hall

So I'm using an app that stores images heavily in the DB. What's your outlook on this? I'm more the type to store the location in the filesystem than to store it directly in the DB.

What do you think are the pros/cons?

Answered By: Mark Harrison ( 352)

I'm in charge of some applications that manage many TB of images. We've found that storing file paths in the database is best.

There are a couple of issues:

  • database storage is usually more expensive than file system storage
  • you can super-accelerate file system access with standard off the shelf products
    • for example, many web servers use the operating system's sendfile() system call to asynchronously send a file directly from the file system to the network interface. Images stored in a database don't benefit from this optimization.
  • things like web servers, etc, need no special coding or processing to access images in the file system
  • databases win out where transactional integrity between the image and metadata is important.
    • it is more complex to manage integrity between db metadata and file system data
    • it is difficult (within the context of a web application) to guarantee data has been flushed to disk on the filesystem

366
Alan

I was wondering if anyone can tell me if MongoDB or CouchDB are ready for a production environment.

I'm now looking at these storage solutions (I'm favouring MongoDB at the moment), however these projects are quite young and so I foresee that I'm going to have to work quite hard to convince my manager that we should adopt this new technology.

What I'd like to know is:

  1. Who is using MongoDB or CouchDB today in a production environment?

  2. How are you using MongoDB/CouchDB?

  3. What problems (if any) did you come across when you adopted this new storage mechanism (and how did you overcome them)?

  4. How did you deal with any migration issues that you had to deal with?

  5. Do you have any good/bad experiences with either of these solutions that you'd like to share?

Answered By: user111359 ( 197)

I'm the CTO of 10gen (developers of MongoDB) so I'm a bit biased, but I also manage a few sites that are using MongoDB in production.

businessinsider has been using mongo in production for over a year now. They are using it for everything from users and blog posts, to every image on the site.

shopwiki is using it for a few things including real time analytics and a caching layer. They are doing over 1000 writes per second to a fairly large database.

If you go to the mongodb Production Deployments page you'll see some people who are using mongo in production.

If you have any questions about the scale or scope of production deployments, post on our user list and we'll be more than happy to help.

351
Mathias

How do you add a column, with a default value, to an existing table in SQL Server 2000/2005?

Answered By: James Boother ( 482)
ALTER TABLE {TABLENAME} 
ADD {COLUMNNAME} {TYPE} {NULL|NOT NULL} 
CONSTRAINT {CONSTRAINT_NAME} DEFAULT {DEFAULT_VALUE}
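
For example (illustrative names; SQL Server syntax), adding a bit column that defaults to 0:

ALTER TABLE SomeTable
ADD SomeCol bit NOT NULL
CONSTRAINT D_SomeTable_SomeCol DEFAULT (0)
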
341
Brian MacKay

I feel that my shop has a hole because we don't have a solid process in place for versioning our database schema changes. We do a lot of backups so we're more or less covered, but it's bad practice to rely on your last line of defense in this way.

Surprisingly, this seems to be a common thread. Many shops I have spoken to ignore this issue because their databases don't change often, and they basically just try to be meticulous.

However, I know how that story goes. It's only a matter of time before things line up just wrong and something goes missing.

Are there any best practices for this? What are some strategies that have worked for you?

Answered By: Gulzar Nazim ( 230)

Must read Get your database under version control. Check the series of posts by Scott Allen.

When it comes to version control, the database is often a second or even third-class citizen. From what I've seen, teams that would never think of writing code without version control in a million years-- and rightly so-- can somehow be completely oblivious to the need for version control around the critical databases their applications rely on. I don't know how you can call yourself a software engineer and maintain a straight face when your database isn't under exactly the same rigorous level of source control as the rest of your code. Don't let this happen to you. Get your database under version control.

311
Teifion

SQLite is a single-file based database and MySQL is a normal database. That's great, but I'm not sure which is faster where or better for what...what are the pros and cons of each option?

Answered By: Justin Standard ( 219)

SQLite is great for testing and prototyping, or for embedding in applications. MySQL is worthy of (large scale) production environments.

This site has guidance on when to use SQLite

Here is my personal summary:

SQLite:

  • easier to set up
  • great for temporary (testing) databases
  • great for rapid development
  • great for embedding in an application
  • doesn't have user management
  • doesn't have many performance features
  • doesn't scale well.

MySQL:

  • far more difficult/complex to set up
  • better options for performance tuning
  • can scale well if tuned properly
  • can manage users, permissions, etc.

279
Xenph Yan

Given that indexing is so important as your dataset increases in size, can someone explain how indexing works at a database agnostic level?

For information on queries to index a field, check out http://stackoverflow.com/questions/1156/how-do-i-index-a-database-field

Answered By: Xenph Yan ( 586)

Why is it needed?

When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Because a set of records can only be sorted on one field, searching on a field that isn't sorted requires a Linear Search, which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn't contain unique entries) then the entire table space must be searched, at N block accesses.

With a sorted field, a Binary Search may be used instead, requiring log2 N block accesses. Also, since the data is sorted, even for a non-key field the rest of the table doesn't need to be searched for duplicate values once a higher value is found. Thus the performance increase is substantial.

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

The downside to indexing is that these indexes require additional space on disk; since the indexes are stored together in a MyISAM database, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema;

Field name       Data type      Size on disk
id (Primary key) Unsigned INT   4 bytes
firstName        Char(50)       50 bytes
lastName         Char(50)       50 bytes
emailAddress     Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value. This sample database contains five million rows, and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).

Example 1

Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a MyISAM database which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value given that the id field is a key field. But since the id field is also sorted a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.

Now the firstName field is neither sorted (so a binary search is impossible) nor unique, and thus the table must be searched to the end: exactly N = 1,000,000 block accesses. It is this situation that indexing aims to correct.

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;

Field name       Data type      Size on disk
firstName        Char(50)       50 bytes
(record pointer) Special        4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilise the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. Finding the address of the actual record requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.

When should it be used?

Given that creating an index requires additional disk space (277,778 blocks extra from the above example), and that too many indexes can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.

Since indexes are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.
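
In SQL terms, creating the firstName index discussed above is a single statement (the table name person is assumed for illustration):

CREATE INDEX idx_person_firstName ON person (firstName);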

270
sgibbons

Is there a good object-relational-mapping library for PHP?

I know of PDO/ADO, but they seem to only provide abstraction of differences between database vendors not an actual mapping between the domain model and the relational model. I'm looking for a PHP library that functions similarly to the way Hibernate does for Java and NHibernate does for .NET.

Answered By: Ian P ( 112)

Look into Doctrine.

Doctrine 1.2 implements Active Record. Doctrine 2+ is a DataMapper ORM.

Also, check out Xyster. It's based on the Data Mapper pattern.

Also, take a look at DataMapper vs. Active Record.

I'm working on a project which involves a lot of database writes, I'd say (70% inserts and 30% reads). This ratio would also include updates which I consider to be one read and one write. The reads can be dirty (e.g. I don't need 100% accurate information at the time of read).
The task in question will be doing over 1 million database transactions an hour.

I've read a bunch of stuff on the web about the differences between MyISAM and InnoDB, and MyISAM seems like the obvious choice to me for the particular database/tables that I'll be using for this task. From what I seem to be reading, InnoDB is good if transactions are needed since row level locking is supported.

Does anybody have any experience with this type of load (or higher)? Is MyISAM the way to go?

Answered By: rix0rrr ( 130)

I'm not a database expert, and I do not speak from experience. However:

MyISAM tables use table-level locking. Based on your traffic estimates, you have close to 200 writes per second. With MyISAM, only one of these could be in progress at any time. You have to make sure that your hardware can keep up with these transactions to avoid being overrun, i.e., a single query can take no more than 5ms.

That suggests to me you would need a storage engine which supports row-level locking, i.e., InnoDB.

On the other hand, it should be fairly trivial to write a few simple scripts to simulate the load with each storage engine, then compare the results.

230
Shadow_x99

I am trying to insert into a table using the input from another table. Although this is entirely feasible for many database engines, I always seem to struggle to remember the correct syntax for the SQL engine of the day (MySQL, Oracle, SQL Server, Informix, and DB2).

Is there a silver-bullet syntax coming from an SQL standard (for example, SQL-92) that would allow me to insert the values without worrying about the underlying database?

Answered By: Shadow_x99 ( 263)

Here is an SQL Statement that seems to be quite representative of what should work on any RDBMS and achieve the result. However, due to my incapacity to install all RDBMS software and test this query, I cannot be sure.

If this is wrong in an RDBMS, please notify me.

The SQL:

INSERT INTO table1 ( column1 )
SELECT  col1
FROM    table2