Third part of the diary of Ricardo Vila, software engineer at TEIMAS, upgrading Teixo to the latest version of Rails.
The upgrade to Rails 3 also required replacing the servers we had been using with Rails 2. We had to configure and deploy new servers with an updated OS and updated libraries. We also had to update all our scripts and watch out for the problems that came from having two staging and two production environments with different configurations.
Nevertheless, the most relevant change was switching the application server from Unicorn to Puma. Although Rails 3 does not require Puma, we decided to make the change now to get it out of the way before the next Rails upgrade we will face (Rails 4.2).
This was the biggest headache, because we faced a loss of performance that we did not understand at first. With Rails 2 we configured several Unicorn processes on each server, depending on the server's CPU and RAM. With Puma we wanted to change this behavior and take advantage of Puma's threads. So initially we configured only one Puma process with several threads on every frontend server, and our puma.rb looked like this:
puts "1 worker and 7 threads"
workers 1
# Min and Max threads per worker
threads 1, 7
At this point we didn't know that Rails 3.2, as we had it configured, does not process requests concurrently across threads. So Puma with 1 worker and 7 threads behaves the same as 1 worker and 1 thread. With this configuration each server could only serve one request at a time, while the other requests waited in Puma's queue. We noticed it thanks to New Relic's monitoring tool.
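A minimal sketch of why the extra threads did not help, assuming every request is wrapped in one global lock (in Rails 3.2 this is the job of the Rack::Lock middleware unless config.threadsafe! is enabled). The `handle_request` and `serve` helpers are hypothetical and only simulate a server; they are not Puma or Rails code.

```ruby
require "benchmark"

# Hypothetical stand-in for a request handler; sleep simulates I/O-bound work.
def handle_request
  sleep 0.05
end

# Runs `requests` simulated requests, one thread each, and returns the elapsed
# time in seconds. When `lock` is given, every request must hold it while it
# runs, mimicking a framework-wide request lock.
def serve(requests, lock: nil)
  Benchmark.realtime do
    threads = Array.new(requests) do
      Thread.new do
        if lock
          lock.synchronize { handle_request }
        else
          handle_request
        end
      end
    end
    threads.each(&:join)
  end
end

locked   = serve(7, lock: Mutex.new) # requests run one after another
unlocked = serve(7)                  # requests overlap

puts format("with global lock: %.2fs, without: %.2fs", locked, unlocked)
```

With the lock in place the seven threads take roughly seven times as long as a single request, which is the queueing effect seen in the monitoring graphs.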
When we noticed it we changed our Puma configuration to this:
puts "5 workers and 1 thread"
workers 5
# Min and Max threads per worker
threads 1, 1
With that, performance returned to normal levels. We also had to fine-tune memory usage and the number of processes. Finally, we adopted a gem called puma_worker_killer to keep a behavior we previously had with Unicorn, where the unicorn_worker_killer gem helped us avoid memory problems. Those problems seem to have gone away on Rails 3, so we plan to remove puma_worker_killer in the future.
At this point we had two Teixos in the Staging environment: one on Rails 2.3 (Stable) and the other on Rails 3.2 (Edge). Both worked with the same database and the same Memcached server. Each version had its own set of DelayedJob workers, but they all polled the same database for jobs to run. User sessions were also stored in the database, and all these things together led to some new problems we had to deal with.
User sessions were stored in the database for both channels. This led to deserialization issues when a user with a session stored on Edge (Rails 3.2) tried to use the Stable version (Rails 2.3), because Rails 2.3 didn't know how to deserialize Rails 3 objects. There is no single solution for this; it depends on the data attached to the session and how deserialization works for each object. In our case we simply added this to config/initializers/a_config.rb:
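As an illustration of the kind of tuning involved, here is a hedged sketch of a sizing rule that picks the number of single-threaded workers from the server's CPU and RAM. The function name, the per-worker memory figure and the reserved amount are all hypothetical; they are not the values we used in production.

```ruby
require "etc"

# Hypothetical sizing rule: one single-threaded worker per CPU core, capped by
# how many workers fit in the RAM left after reserving some for the OS.
# Always returns at least 1.
def puma_worker_count(cpu_cores:, ram_mb:, worker_mb:, reserved_mb: 512)
  by_ram = (ram_mb - reserved_mb) / worker_mb
  [[cpu_cores, by_ram].min, 1].max
end

# Example with made-up numbers: the local core count, 4 GB of RAM and an
# assumed 600 MB per worker.
puts puma_worker_count(cpu_cores: Etc.nprocessors, ram_mb: 4096, worker_mb: 600)
```

The result could then be passed to `workers` in puma.rb instead of a hard-coded number.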
ActionController::ParamsHashWithIndifferentAccess = ActionDispatch::Http::ParamsHashWithIndifferentAccess
This way Rails 3 could unmarshal Rails 2 objects of the class ParamsHashWithIndifferentAccess, and vice versa.
We also defined a session store initializer (config/initializers/session_store.rb) to deal with flash messages stored in the session:
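The reason a constant alias is enough is that Marshal stores the class name and looks it up again when loading. A small sketch in plain Ruby, where `OldApp` and `NewApp` are hypothetical namespaces standing in for the two framework versions:

```ruby
# Hypothetical namespaces standing in for the two framework versions.
module OldApp
  class ParamsHash < Hash; end
end

module NewApp
  class ParamsHash < Hash; end
end

# Bytes as the "old" application would have stored them in the session.
stored = Marshal.dump(OldApp::ParamsHash.new.merge!("page" => "2"))

# Simulate the newer application, where the old constant no longer exists.
OldApp.send(:remove_const, :ParamsHash)

load_error =
  begin
    Marshal.load(stored)
    nil
  rescue ArgumentError => e
    e.message # Marshal cannot resolve the stored class name
  end

# Point the old constant name at the new class, as the initializer does.
OldApp.const_set(:ParamsHash, NewApp::ParamsHash)
restored = Marshal.load(stored)

puts load_error
puts restored.class
```

After the alias is defined, the stored bytes load as an instance of the new class with the same contents.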
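module ActionController
  module Flash
    # Stand-in class so a FlashHash stored in the session by the other
    # channel can be deserialized.
    class FlashHash < Hash
      # Ignore any method called on the deserialized flash object.
      def method_missing(m, *a, &b)
      end
    end
  end
end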
Another similar problem was deserialization of entries in the Rails cache. In this case we changed Teixo to read an environment variable on both the Stable and Edge channels. The Rails 3.2 version of Teixo prepends the value of this variable to every cache key it uses. This makes Rails 3 cache entries different from Rails 2 ones (the keys are the same except that Rails 3 keys have a prefix), so the Stable and Edge channels do not share cache entries.
Be aware that this can lead to inconsistencies between environments. For example, a partial from an object's show view cached on Stable is not invalidated when the object is modified on Edge, and vice versa. So watch out for cache problems if users work on both environments at the same time.
As mentioned previously, Teixo uses the DelayedJob gem. It defines tasks as different job classes extending Delayed::Backend::ActiveRecord::Job. We also simulate job categories by grouping job types by priority: for example, priorities 0 to 4 mean category A, priorities 5 to 9 mean category B, and so on. In the production environment we run a different number of workers for each category: two workers for category A, one for category B, etc.
When we planned to have two channels, we needed a way to ensure that workers on Stable would run only jobs created by the Rails 2 version, and likewise for Edge and Rails 3.
We solved it with a priority offset. Each environment has a DelayedJob offset parameter, which is added to the job's base priority when the job is created. So jobs on Stable had priorities between 0 and 99, and Edge jobs had priorities between 100 and 199. We also had to adapt worker initialization to apply this offset at launch.
The workers for category A jobs on Stable were launched like this:
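The prefixing and the resulting inconsistency can be sketched as follows. The variable name `CACHE_KEY_PREFIX`, the helper `channel_cache_key` and the sample key are hypothetical, and a plain Hash stands in for the shared Memcached server.

```ruby
# Hypothetical variable name; the real one used by Teixo is not shown here.
CACHE_PREFIX_VAR = "CACHE_KEY_PREFIX"

# Builds the key actually sent to the cache store for this channel.
# With no prefix configured the key is left untouched.
def channel_cache_key(key, env = ENV)
  prefix = env[CACHE_PREFIX_VAR].to_s
  prefix.empty? ? key : "#{prefix}/#{key}"
end

# A plain Hash stands in for the shared Memcached server.
store = {}
stable_env = {}                               # Stable: no prefix
edge_env   = { CACHE_PREFIX_VAR => "rails3" } # Edge: prefixed keys

store[channel_cache_key("views/object/42", stable_env)] = "partial rendered on Stable"
store[channel_cache_key("views/object/42", edge_env)]   = "partial rendered on Edge"

# Edge modifies the object and expires only its own entry.
store.delete(channel_cache_key("views/object/42", edge_env))

# The Stable entry is still there and is now stale.
puts store.keys.inspect
```

The last line shows the downside: expiring the entry on one channel leaves the other channel's copy in the cache.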
/script/delayed_job -n 2 --min-priority 0 --max-priority 4 run
The same workers on Edge:
/script/delayed_job -n 2 --min-priority 100 --max-priority 104 run
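The offset arithmetic behind these two commands can be sketched like this. The constant and method names are hypothetical; only the ranges (0 to 4 for category A, 5 to 9 for category B) and the offsets (0 for Stable, 100 for Edge) come from the description above.

```ruby
# Base priority ranges per category, as described in the text.
CATEGORY_RANGES = { "A" => 0..4, "B" => 5..9 }.freeze
# Offsets per channel, matching the 0-99 / 100-199 split.
CHANNEL_OFFSETS = { stable: 0, edge: 100 }.freeze

# Priority to store on a job when it is created on a given channel.
def job_priority(base_priority, channel)
  base_priority + CHANNEL_OFFSETS.fetch(channel)
end

# Priority flags for the workers of one category on one channel.
def worker_flags(category, channel)
  range  = CATEGORY_RANGES.fetch(category)
  offset = CHANNEL_OFFSETS.fetch(channel)
  "--min-priority #{range.min + offset} --max-priority #{range.max + offset}"
end

puts worker_flags("A", :stable) # --min-priority 0 --max-priority 4
puts worker_flags("A", :edge)   # --min-priority 100 --max-priority 104
puts job_priority(7, :edge)     # 107
```

Because the two channels' ranges never overlap, a worker started with one channel's flags never picks up a job created by the other.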
QA has a huge impact on the migration process, ensuring that automated tests cover most of Teixo's features. But not all features and user workflows are covered by automated tests. Some features, procedures and behaviors are complex and specific to certain customers, and thus have no test coverage.
This is where the Customer Success department comes into play, ensuring that all those specific points work correctly, following a previously defined test plan. They ran several rounds of manual validation against the test plan, detecting various issues and, most valuably, sources of issues even in untested parts of the application. All this testing was done in our staging environment on the Edge channel.
When all the (known) issues were fixed, we deployed our brand new Teixo to production on the Edge channel, accessible via a specific URL. At this point we had two different versions of Teixo working with the same database instance: Stable on Rails 2.3 and Edge on Rails 3.2.
We worked on this Edge version internally for a few days.
Meanwhile, the Customer Success department started talking to a selected group of customers, asking whether they would like to try the new Teixo release before it was opened to every customer.
The advantage for them was that, if there were any problems with their workflows on Teixo, we could fix them very quickly. Many customers agreed and we started giving them access to the Edge version. Each week more customers started working on the new version, until we had enough customers working on the platform.
This double production environment meant extra work, as explained earlier (section Double Trouble), but it was worth the effort.
When we considered the Edge version stable enough, we planned the final step. It would require several stages:
When everything was ready and the date arrived, the change was relatively easy. We still had to face some issues when all customers started using the new Teixo version, but they were easily handled and within a few days the situation was stable, since the more complex problems had already been solved.