Friday, May 14, 2010

Filter Early

Yesterday, my 12 year-old son Alex was excited to tell me that he had learned a new trick that made it easier to multiply fractions. Here’s the trick:

The neat thing for me is that this week I’m working on my slides for ODTUG Kaleidoscope 2010 (well, actually, for the Performance Symposium that’ll occur on Sunday 27 June), and I need more examples to help encourage application developers to write code that will Filter Early. This “trick” (it’s actually an application of the Multiplicative Inverse Law) is a good example of the Filter Early principle.

Filter Early is all about throwing away data that you don’t need, as soon as you can know that you don’t need it. That’s what this trick of arithmetic is all about. Without the trick, you would do more work to multiply 4/7 × 3/4 = (4 × 3)/(7 × 4) = 12/28, and then you would do even more work again to figure out that 12 and 28 both share a factor of 4, which is what you need to know before you then divide 12/4 = 3 and then 28/4 = 7 to reduce 12/28 to 3/7. It’s smarter, faster, and more fun to use the trick. Multiplying fractions without the trick is a Filter Late operation, which is just dumb and slow.

Here are some other examples of the Filter Early pattern’s funnier (unless you’re the victim of it), sinister antipattern, Filter Late. You shouldn’t do these things:
  • Drop a dozen brass needles into a haystack, shuffle the haystack, and then try to retrieve the needles. (Why did I specifically choose brass? Two reasons. Can you guess?)
  • Pack everything you own into boxes, hire a moving company to move them to a new home, and then, after moving into your new home, determine that 80% of your belongings are junk that should be thrown away.
  • Return thousands of rows to the browser, even though the user only wants one or two.
  • To add further insult to returning thousands of rows to the browser, return the rows in some useless order. Make the user click on an icon that takes time to sort those rows into an order that will allow him to figure out which one or two he actually wanted.
  • Execute a database join operation in a middle-tier application instead of the database. I’m talking about the Java application that fetches 100,000 rows from table A and 350,000 rows from table B, and then joins the two result sets in a for loop, in an operation that makes 100,000 comparisons to figure out that the result set of the join contains two rows, which the database could have told you much more efficiently.
  • Slog row-by-row through a multimillion-row table looking for the four rows you need, instead of using an index scan to compute the addresses of those four rows and then access them directly.
Converting a Filter Late application into a Filter Early application can make performance unbelievably better.

One of my favorite features of the Oracle Exadata machine is that it applies the Filter Early principle where a lot of people would have never thought to try it. It filters query results in the storage server instead of the database server. Before Exadata, the Oracle Database passed disk blocks (which contain some rows you do need, but also some rows you don’t) from the storage server to the database. Exadata passes only the rows you need back to the database server (Chris Antognini explains).

How many Filter Early and Filter Late examples do you know?

9 comments:

Chen Shapira said...

brass needles? So I can't filter them easily by sight (color) or by magnet.

Cary Millsap said...

@Chen, exactly.

Now, ...what great Filter Early (or Filter Late) examples do we have out there?

Coskan Gundogar said...

Similar to the second one and main reason of our yearly fights with my wife :)

Filter your £X of goods before you close the holiday luggage, so you won't have to pay £2X extra luggage fee when you realize you are 5kg more and it is £5 per kg

Gary Myers said...

An excellent 'filter early' is when Oracle decides that, due to a constraint, a certain predicate can never be true for a table row and it doesn't even both to check the table.

Stew Ashton said...

President Nixon letting Haldeman and Ehrlichman screen all his visitors and prepare digests of newspapers so he wouldn't have to read the originals (he boasted of this last practice to journalists).

Brian Tkatch said...

Burn the haystack?

Debra Lilley said...

When I taught discoverer and the audience was by definition 'non technical', I explained filtering as being a basket of data, and each condition reduced the size of the basket. The example I would give was buttons. No point searching for a small white button with only 2 holes if your basket contained every button in the box. Bin or filter the coloured buttons, becuase as Chen says, that is easy; then start removing big, or 4 hole buttons. I also used this analagy to explain indexing.

Dallas said...

I had a nice "filter late" experience two weeks ago.

We had an app go into 11g. Immediately upon implementation, a function that normally ran late in the execution plan suddenly started running very early (before filtering), in the driven part of a nested loops join. Instead of running 50,000 times, it started running 460 million times before finally being killed.

Arnoud said...

The most common filter late example I see is in data warehousing. The staging area often pulls all columns and rows from a source table, because the business will ask for them soon enough so why touch the code twice.