In Rails, Active Record provides batch processing for ActiveRecord::Relation with the in_batches method. In Rails 7.1 the implementation of in_batches has improved to give optimized results for whole table iterations. In this article, we will see how it has improved.

Example

User.in_batches(of: 3) do |relation|
  puts relation.to_sql
end

Before Rails 7.1

Loading development environment (Rails 7.0.4)
3.0.0 :001 > User.in_batches(of: 3) do |relation|
3.0.0 :002 >   puts relation.to_sql
3.0.0 :003 > end
  User Pluck (0.4ms)  SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT $1  [["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" IN (1, 2, 3)
  User Pluck (0.4ms)  SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 3], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" IN (4, 5, 6)
  User Pluck (0.3ms)  SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 6], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" IN (7, 8, 9)
  User Pluck (0.4ms)  SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 9], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" = 10
 => nil

Previously, the in_batches method fetches all the ids and constructs an IN-based query for each batch. While iterating over the whole tables, this approach is not optimal. Performing IN queries with a large number of values affects the performance.

Rails 7.1 onwards

Loading development environment (Rails 7.1.0.alpha)
3.0.0 :001 > User.in_batches(of: 3) do |relation|
3.0.0 :002 >   puts relation.to_sql
3.0.0 :003 > end
  User Pluck (0.3ms)  SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT $1  [["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" <= 3
  User Pluck (0.2ms)  SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 3], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" > 3 AND "users"."id" <= 6
  User Pluck (0.4ms)  SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 6], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" > 6 AND "users"."id" <= 9
  User Pluck (0.3ms)  SELECT "users"."id" FROM "users" WHERE "users"."id" > $1 ORDER BY "users"."id" ASC LIMIT $2  [["id", 9], ["LIMIT", 3]]
SELECT "users".* FROM "users" WHERE "users"."id" > 9 AND "users"."id" <= 10
 => nil

Now, whole table iterations use range query (id >= x AND id <= y) by default, making it several times faster.

Only whole table iterations use this style of iteration by default. You can disable this behavior by passing use_ranges: false.

User.in_batches(of: 3, use_ranges: false) do |relation|
  puts relation.to_sql
end

If you iterate over the table with a where clause e.g. disabled_at: nil, it will apply an IN query by default. But, if the results of the where query are large enough, it makes sense to opt for the range query by passing use_ranges: true.

User.where(disabled_at: nil).in_batches(use_ranges: true) do |relation|
  # do something
end

Lets verify in_batches optimization for 1 million records

  • Querying

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.in_batches do |batch|
      batch.count
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"
    
  • Updating

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.in_batches.update_all("score = score + 1")
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"
    
  • Deleting

    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    User.in_batches.delete_all
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    puts "Elapsed: #{elapsed}s"
    
Database → Postgres MySQL SQLite
Operations↓ < rails 7.1 >= rails 7.1 < rails 7.1 >= rails 7.1 < rails 7.1 >= rails 7.1
Querying ~24.81s ~5.14s ~33.08s ~6.37s ~15.56s ~5.25s
Updating ~94.56s ~58.45s ~50.22 ~43.37s ~17.89s ~6.51s
Deleting ~86.69s ~38.62s ~62.23s ~26.58s ~21.19s ~5.21s

Note: above are statistics from my local machine in the development environment, it may vary based on your machine environment.

Check out this pull request for more details.