In Rails, Active Record provides batch processing for ActiveRecord::Relation
with the in_batches
method. In Rails 7.1 the implementation of in_batches
has improved to give optimized results for whole table iterations. In this article, we will see how it has improved.
Example
User.in_batches(of: 3) do |relation|
puts relation.to_sql
end
Before Rails 7.1
Loading development environment (Rails 7.0.4)
3.0.0 :001 > User.in_batches(of: 3) do |relation|
3.0.0 :002 > puts relation.to_sql
3.0.0 :003 > end
User Pluck (0.4ms) SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT $1 [["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" IN (1, 2, 3)
User Pluck (0.4ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2 [["id", 3], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" IN (4, 5, 6)
User Pluck (0.3ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2 [["id", 6], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" IN (7, 8, 9)
User Pluck (0.4ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2 [["id", 9], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" = 10
=> nil
Previously, the in_batches
method fetches all the ids and constructs an IN
-based query for each batch. While iterating over the whole tables, this approach is not optimal. Performing IN
queries with a large number of values affects the performance.
Rails 7.1 onwards
Loading development environment (Rails 7.1.0.alpha)
3.0.0 :001 > User.in_batches(of: 3) do |relation|
3.0.0 :002 > puts relation.to_sql
3.0.0 :003 > end
User Pluck (0.3ms) SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT $1 [["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" <= 3
User Pluck (0.2ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2 [["id", 3], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" > 3 AND "users"."id" <= 6
User Pluck (0.4ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2 [["id", 6], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" > 6 AND "users"."id" <= 9
User Pluck (0.3ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2 [["id", 9], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" > 9 AND "users"."id" <= 10
=> nil
Now, whole table iterations use range query (id >= x AND id <= y
) by default, making it several times faster.
Only whole table iterations use this style of iteration by default. You can disable this behavior by passing use_ranges: false
.
User.in_batches(of: 3, use_ranges: false) do |relation|
puts relation.to_sql
end
If you iterate over the table with a where
clause e.g. disabled_at: nil
, it will apply an IN
query by default. But, if the results of the where
query are large enough, it makes sense to opt for the range query by passing use_ranges: true
.
User.where(disabled_at: nil).in_batches(use_ranges: true) do |relation|
# do something
end
Lets verify in_batches
optimization for 1 million records
-
Querying
start = Process.clock_gettime(Process::CLOCK_MONOTONIC) User.in_batches do |batch| batch.count end elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start puts "Elapsed: #{elapsed}s"
-
Updating
start = Process.clock_gettime(Process::CLOCK_MONOTONIC) User.in_batches.update_all("score = score + 1") elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start puts "Elapsed: #{elapsed}s"
-
Deleting
start = Process.clock_gettime(Process::CLOCK_MONOTONIC) User.in_batches.delete_all elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start puts "Elapsed: #{elapsed}s"
Database → | Postgres | MySQL | SQLite | |||
---|---|---|---|---|---|---|
Operations↓ | < rails 7.1 | >= rails 7.1 | < rails 7.1 | >= rails 7.1 | < rails 7.1 | >= rails 7.1 |
Querying | ~24.81s | ~5.14s | ~33.08s | ~6.37s | ~15.56s | ~5.25s |
Updating | ~94.56s | ~58.45s | ~50.22 | ~43.37s | ~17.89s | ~6.51s |
Deleting | ~86.69s | ~38.62s | ~62.23s | ~26.58s | ~21.19s | ~5.21s |
Note: above are statistics from my local machine in the development environment, it may vary based on your machine environment.
Check out this pull request for more details.