What is the difference between INNER JOIN and OUTER JOIN?
Assuming you're joining on columns with no duplicates, which is by far the most common case:
An inner join of A and B gives the result of A intersect B, i.e. the inner part of a Venn diagram intersection.
An outer join of A and B gives the result of A union B, i.e. the outer parts of a Venn diagram union.
Suppose you have two tables, with a single column each, and data as follows:
    A    B
    -    -
    1    3
    2    4
    3    5
    4    6
Note that (1,2) are unique to A, (3,4) are common, and (5,6) are unique to B.
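If you want to run the examples yourself, here is a minimal setup sketch (assuming a dialect that accepts multi-row VALUES lists) that creates the two tables with the data above:

    -- Two single-column example tables
    create table a (a int);
    create table b (b int);

    -- (1,2) are unique to a, (3,4) are common, (5,6) are unique to b
    insert into a values (1), (2), (3), (4);
    insert into b values (3), (4), (5), (6);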
Inner join
An inner join using either of the equivalent queries gives the intersection of the two tables, i.e. the two rows they have in common.
    select * from a INNER JOIN b on a.a = b.b;

    select a.*,b.* from a,b where a.a = b.b;

    a | b
    --+--
    3 | 3
    4 | 4
Left outer join
A left outer join will give all rows in A, plus any common rows in B.
    select * from a LEFT OUTER JOIN b on a.a = b.b;

    select a.*,b.* from a,b where a.a = b.b(+);

    a |  b
    --+-----
    1 | null
    2 | null
    3 | 3
    4 | 4
Full outer join
A full outer join will give you the union of A and B, i.e. All the rows in A and all the rows in B. If something in A doesn't have a corresponding datum in B, then the B portion is null, and vice versa.
    select * from a FULL OUTER JOIN b on a.a = b.b;

     a    |  b
    ------+-----
     1    | null
     2    | null
     3    |  3
     4    |  4
     null |  6
     null |  5
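Not covered above, but for symmetry: a right outer join is the mirror image of the left outer join, giving all rows in B plus any common rows in A. A sketch of the query and the result it should produce on the same data:

    select * from a RIGHT OUTER JOIN b on a.a = b.b;

     a    | b
    ------+---
     3    | 3
     4    | 4
     null | 5
     null | 6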
This has always confused me. It seems like this would be nicer:
    my_list = ["Hello", "world"]
    print my_list.join("-")    # Produce: "Hello-world"

than this:

    my_list = ["Hello", "world"]
    print "-".join(my_list)    # Produce: "Hello-world"
Is there a specific reason it does it like this?
It's because any iterable can be joined, not just lists, but the result and the "joiner" are always strings; that's why the method lives on str rather than on every iterable type. For example:
    import urllib2
    print '\n############\n'.join(
        urllib2.urlopen('http://data.stackexchange.com/users/7095'))
Let's say I have two tables, TableA and TableB. TableB's primary key is a single column (BId), which is also a foreign key column in TableA.
In my situation, I want to remove all rows in TableA that are linked to specific rows in TableB. Can I do that through joins, deleting all rows that are pulled in by the join? Something like this:
    DELETE FROM TableA
    FROM TableA a
    INNER JOIN TableB b
        ON b.BId = a.BId
        AND [my filter condition]
Or am I forced to do this:
    DELETE FROM TableA
    WHERE BId IN
        (SELECT BId
         FROM TableB
         WHERE [my filter condition])
The reason I ask is that it seems to me the first option would be much more efficient when dealing with larger tables.
    DELETE TableA
    FROM TableA a
    INNER JOIN TableB b
        ON b.Bid = a.Bid
        AND [my filter condition]
I have a database with account numbers and card numbers. I match these to a file to update any card numbers to the account number, so that I am only working with account numbers.
I created a view linking the table to the account/card database to return the Table ID and the related account number, and now I need to update those records where the ID matches with the Account Number.
This is the Sales_Import table, where the account number field needs to be updated:
    LeadID  AccountNumber
    147     5807811235
    150     5807811326
    185     7006100100007267039
And this is the RetrieveAccountNumber table, where I need to update from:
    LeadID  AccountNumber
    147     7006100100007266957
    150     7006100100007267039
I tried the below, but no luck so far:
    UPDATE [Sales_Lead].[dbo].[Sales_Import]
    SET [AccountNumber] =
        (SELECT RetrieveAccountNumber.AccountNumber
         FROM RetrieveAccountNumber
         WHERE [Sales_Lead].[dbo].[Sales_Import].LeadID = RetrieveAccountNumber.LeadID)
It updates the card numbers to account numbers, but the account numbers get replaced by NULL.
I believe an UPDATE FROM with a JOIN will help:
    UPDATE Sales_Import
    SET Sales_Import.AccountNumber = RAN.AccountNumber
    FROM Sales_Import SI
    INNER JOIN RetrieveAccountNumber RAN
        ON SI.LeadID = RAN.LeadID
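A hedged aside, not from the original answer: before running the UPDATE it can be worth issuing the same join as a plain SELECT, so you can see exactly which rows will be touched and what value each will receive:

    -- Preview the rows the UPDATE above will modify (same join, read-only)
    SELECT SI.LeadID,
           SI.AccountNumber  AS CurrentAccountNumber,
           RAN.AccountNumber AS NewAccountNumber
    FROM Sales_Import SI
    INNER JOIN RetrieveAccountNumber RAN
        ON SI.LeadID = RAN.LeadID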
I'm doing some research into databases and I'm looking at some limitations of relational DBs.
I gather that joins of large tables are very expensive, but I'm not completely sure why. What does the DBMS need to do to execute a join operation, and where is the bottleneck?
How can denormalization help to overcome this expense? How do other optimization techniques (indexing, for example) help?
Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I know where to find that already.
PS. In relation to this, I'm wondering about the denormalized approaches used by cloud service databases like BigTable and SimpleDB. See this question.
Denormalising to improve performance? It sounds convincing, but it doesn't hold water.
Chris Date, who advanced the relational data model along with its creator, Dr Ted Codd, got tired of misinformed arguments against normalisation and systematically demolished them using the scientific method: he got large databases and tested these assertions. I think he wrote it up in Relational Database Writings 1988-1991, but this book was later rolled into edition six of Introduction to Database Systems. That is the definitive text on database theory and design, currently in its eighth edition. Chris Date was an expert in this field when most of us were still running around barefoot.
He found that it all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.
Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.
Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.
Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.
It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds: there is no usable index on the join or filter columns; the predicate is so unselective that most of the table's rows qualify anyway; or the table is so small that scanning it is cheaper than traversing an index.
Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.
If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerate the spread of Date's findings.
For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.
You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.
I'd also like to respond to the assertion that "joins are just cartesian products with some lipgloss".
What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as cartesian products to which predicates apply only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.
The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate:
SELECT * FROM A,B
David Aldridge provides some important additional information.
There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.
A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.
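As a sketch of that advice (the table and column names here are hypothetical, not taken from the original):

    -- Hypothetical schema: orders.customer_id references customers.customer_id.
    -- Indexing the foreign-key column gives the optimiser an index strategy
    -- to choose from when these tables are joined.
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);

    -- The join can now prune via the index instead of scanning orders:
    SELECT c.customer_id, o.order_id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.customer_id = 12345;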
I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.
"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.
The reason that I responded so savagely was that the statement as worded says that "joins are cartesian products...".
This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.
I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.
The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The c term is a fixed overhead mostly made of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.
Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.
You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.
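As a rough sketch of that idea (hypothetical tables, and assuming a dialect that supports CREATE TABLE ... AS SELECT), the join along a known reporting access path is computed once during the bulk load and stored, rather than re-executed for every query:

    -- Precompute the customer/order join for a known reporting access path,
    -- rebuilt periodically as part of the warehouse load.
    CREATE TABLE report_customer_orders AS
    SELECT c.customer_id,
           c.customer_name,
           o.order_id,
           o.order_date,
           o.order_total
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id;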
A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).
An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.
Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?
Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
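For instance, a precomputed monthly rollup might look something like this sketch (hypothetical names, and EXTRACT assumed to be available); the point is that the aggregate is computed once during the periodic load rather than at query time:

    -- Precomputed monthly totals, rebuilt during the periodic warehouse load
    CREATE TABLE monthly_sales_totals AS
    SELECT customer_id,
           EXTRACT(YEAR  FROM order_date) AS order_year,
           EXTRACT(MONTH FROM order_date) AS order_month,
           SUM(order_total)               AS total_sales,
           COUNT(*)                       AS order_count
    FROM orders
    GROUP BY customer_id,
             EXTRACT(YEAR FROM order_date),
             EXTRACT(MONTH FROM order_date);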