This project is archived and is in readonly mode.

#3346 ✓stale

find_in_batches is slow on large datasets

Reported by Chris | October 8th, 2009 @ 12:07 AM

Using find_in_batches on a table of roughly 1 million rows generates queries that take more than 3 seconds each, even for a very simple query, because the ORDER BY clause has to consider every row in the table with an id greater than the start row.
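For context, the per-batch query the current implementation issues has roughly the shape shown below (the SQL is an approximation of what the 2.3-era batches.rb produces, and Payment is just an illustrative model):

    # Stock behavior: each batch is fetched with a query shaped roughly like
    #
    #   SELECT * FROM payments WHERE (payments.id > 123456)
    #   ORDER BY payments.id ASC LIMIT 1000
    #
    # The WHERE/ORDER BY pair has to consider every id above the start row,
    # which is what gets slow on a table this size.
    Payment.find_in_batches(:batch_size => 1000) do |batch|
      # process batch ...
    end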

http://github.com/chriseppstein/rails/commit/0a726f1b1447ddf105a3e5...

is a patch that changes the implementation of find_in_batches such that it selects a section of ids and only sorts that section.
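I won't paste the whole diff here, but the idea is roughly the following sketch (illustrative only, using 2.3-era finder syntax; the names, option handling, and exact windowing logic differ from the actual patch, and it assumes an integer primary key):

    # Illustrative sketch of the id-window idea -- not the actual patch.
    # Intended as a class method on ActiveRecord::Base.
    def self.find_in_batches_by_id_window(options = {})
      batch_size = options[:batch_size] || 1000
      start_id   = (options[:start] || 0).to_i
      max_id     = maximum(primary_key).to_i
      pk         = "#{quoted_table_name}.#{connection.quote_column_name(primary_key)}"

      while start_id < max_id
        # Only ids inside the current window are scanned and sorted, so the
        # ORDER BY never has to consider the whole tail of the table.
        records = find(:all,
                       :conditions => ["#{pk} > ? AND #{pk} <= ?", start_id, start_id + batch_size],
                       :order      => "#{pk} ASC")
        # Skip empty windows so that every yielded batch has at least one record.
        yield records unless records.empty?
        start_id += batch_size
      end
    end

The key property is that each query only ever looks at a window of batch_size ids, which is where the constant-time behavior described below comes from.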

With this patch, my queries take about 3ms on my local system - roughly a 1000x speedup - and, more importantly, they run in constant time with respect to the number of records instead of growing slower over time.

To do this, I made the assumption that the primary key is an integer -- I'm not sure if that's a valid assumption, but it's certainly the most common scenario.

This patch also adds support for :limit, because there's really no good reason not to support it.
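For reference, a call against the patched API would look something like this (hypothetical usage - the option name comes from the description above, and Payment is again just an example model):

    # Hypothetical usage of the patched API: stop after 10,000 records total,
    # fetching 500 at a time.
    Payment.find_in_batches(:batch_size => 500, :limit => 10_000) do |batch|
      # process batch ...
    end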

Lastly, this patch breaks the guarantee that only the last batch can contain fewer records than the batch size (for example, a batch whose id window spans a run of deleted rows can come up short even though it isn't the last one). It does, however, ensure that every batch contains at least one record. This doesn't seem like a terribly important aspect of the API, but it's worth noting.

Comments and changes to this ticket


Tickets have moved to Github

The new ticket tracker is available at https://github.com/rails/rails/issues
