Sometimes dealing with DB processes in a Rails application consume a lot of time. Often there are a bunch of small and simple DB queries that go endless. For example, bulk insertions or bulk updates in tables. Dividing it into various background jobs is an idle solution. In a few scenarios, we can't prefer background jobs e.g. the change you apply, makes the existing data invalid.

In this article, we will see how to complete a large set of DB queries under minutes which would otherwise take hours if not optimized.

It's a better practice to create a rake task, that will run after a DB migration call. But for simplicity, we'd look at some inline code and you can be a better judge to put it into better practice!

The example in this article uses the following Rails & PG version along with Heroku PaaS:

  • Rails >=5
  • Postgresql >=8
  • PaaS: Heroku

In this database optimization, we will be modifying an existing column of a table. It's a bulk update operation. With serialization/encryption - we'd be resetting older values with encrypted values.

Connections Pool Size

By default, Rails has only 5 connections pooled. Postgresql will allow many more connections as per the server capability. Often the unused connections are better utilized via background processes. The next step is to find out if we have more than 5 connections. This can be checked using the command heroku pg:info i.e. through Heroku console.

=== DATABASE_URL
Plan:         Standard 0
Status:       Available
Data Size:    5.23 GB
Tables:       134
PG Version:   9.5.22
Connections:  7/120
Credentials:  1
Fork/Follow:  Available

Now we know that the DB plan we are using has a DB pool size of 120 max connections. Great! Let's reconfigure ActiveRecord Connection Pool to use the max potential. (At least config for this single process as a migration)

Max out the Rails Connection Pool

CUSTOM_CONN_POOL = 100

def set_conn_pool_from_env!
  config = ActiveRecord::Base.configurations[Rails.env]
  config['pool'] = CUSTOM_CONN_POOL
  ActiveRecord::Base.establish_connection(config) if ENV['DB_POOL']
end

The above code will allow us to increase ConnectionPool size. To know more about Connection Pool management follow a guide by Heroku here.

With above, ActiveRecord#ConnectionPool can now pool much larger connections. Now we go and take advantage of Threads i.e. Concurrency.

Add new Thread per connection

def new_thread_connection
  Thread.new do |t|
    ApplicationRecord.connection.with_connection { yield }
  end
end

ActiveRecord on Ruby (MRI) works synchronously. So any queries executed within a connection will block the main thread. If a database query takes 200ms, the next line of code has to wait for those 200ms. The purpose of more threads is so that we do not wait for query calls. We cannot make all queries parallel - but we can optimize code to make most of the idle connections.

How to use Threaded Connection Pool

encryptor = MyEncryptor.new

def new_thread_connection
  Thread.new do |t|
    ApplicationRecord.connection.with_connection { yield }
  end
end

def update_batch(batch)
  batch.each do |visitor|
    visitor.update(encrypted_info: encryptor.encrypt(visitor.info))
  end
end

def migrate
#..update column from string to text type...
  set_conn_pool_from_env!

  updater_threads = Visitor.in_batches_of(max_conn_size).map do |visitors_batch|
    new_thread_connection { update_batch(visitors_batch) }
  end
# join all threads to main thread
  updater_threads.each(&:join)
end

  1. If given a 100 max_conn_size, Divided visitors in a batch size of 100.
  2. A total of 10,000 visitors converted to a batch of 100 by the main thread.
  3. Each batch can now process in a separate thread.
  4. Each Thread now fetches a connection, then releases it when done.
  5. At last with updater_threads get merged and provides all control back to the main thread.

There is more scope for improvement i.e. we can bulk update in a single statement using SQL case statements


Building PG's Case Statements with Arel

arel_builder = Arel::Nodes::Case.new(Visitor.arel_attribute(:id))
arel_builder.when(visitor.id).then(encryptor_lib.encrypt(visitor.info))

Arel is a query builder used by ActiveRecord. Unfortunately, there's no ActiveRecord API/helper in place for case statements. Arel has a ton of SQL builder helpers in house.


Summing it up

class UpdateVisitorsInfo < ActiveRecord::Migration[6.0]
  CUSTOM_CONN_POOL = 100

  def migrate
    change_column(:visitors, :info, :text)

 # update all columns from plain to encrypted values
    set_conn_pool_from_env!

    updater_threads = Visitor.in_batches_of(max_conn_size).map do |visitors_batch|
      new_thread_connection do
        visitors_batch.update_all("info = ?", build_case_stmt(visitors_batch))
      end
    end

    updater_threads.each(&:join)
  end

private

  def new_thread_connection
    Thread.new do |t|
      ApplicationRecord.connection.with_connection { yield }
    end
  end

  def set_conn_pool_from_env!
    config = ActiveRecord::Base.configurations[Rails.env]
    config['pool'] = CUSTOM_CONN_POOL
    ActiveRecord::Base.establish_connection(config) if ENV['DB_POOL']
  end

  def build_case_stmt(batch)
    arel_builder = Arel::Nodes::Case.new(Visitor.arel_attribute(:id))

    batch.each do |visitor|
      arel_builder.when(visitor.id).then(encryptor_lib.encrypt(visitor.info))
    end

    arel_builder.to_sql
  end
end


Have threads ever been this useful to you or they have been a disaster? Do you run into issues with Connection Pool? Did you like the idea in this article of how connection pools and threads are used?