Production and Lessons Learned - How we updated our main product’s core without downtime

Third part of the diary of Ricardo Vila, software engineer at TEIMAS, on upgrading Teixo to the latest version of Rails.

2.3.3 Preparing production environment

2.3.3.1 Servers and more servers

The upgrade to Rails 3 also meant replacing the servers we had been using with Rails 2. We had to configure and deploy new servers with an updated OS and up-to-date libraries, update all our scripts, and watch out for the problems that came from having two staging and two production environments with different configurations.

Nevertheless, the most relevant change was switching the application server from Unicorn to Puma. Although Rails 3 does not require Puma, we decided to go ahead with this change to get it out of the way before the next Rails upgrade we will face (Rails 4.2).

This was the biggest headache, because we faced a loss of performance that we did not understand at first. With Rails 2 we ran several Unicorn processes on each server, depending on the server's CPU and RAM. With Puma we wanted to change this and take advantage of Puma’s threads. So initially we configured a single Puma process with several threads on every frontend server, and our puma.rb looked like this:

            puts "1 workers and 7 threads"

            workers 1

            # Min and Max threads per worker

            threads 1, 7

At this point we did not know that Rails 3.2 does not handle requests in multiple threads by default. So Puma with 1 worker and 7 threads behaved the same as 1 worker and 1 thread. With this configuration each server could only process one request at a time, while the other requests waited in Puma’s queue. We noticed it thanks to New Relic’s monitoring tool.
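
A minimal sketch of why the extra threads did not help, assuming every request is wrapped in a process-wide lock (in Rails 3.2 the Rack::Lock middleware does this unless config.threadsafe! is enabled; that this matches the setup described here is an assumption). `ConcurrencyProbe` and `serve` are hypothetical names used only for illustration, not Puma or Rails APIs.

```ruby
require "thread"

# Hypothetical probe standing in for a request handler.
# It records the highest number of requests handled at the same time.
class ConcurrencyProbe
  attr_reader :max_in_flight

  def initialize
    @guard = Mutex.new
    @in_flight = 0
    @max_in_flight = 0
  end

  def call
    @guard.synchronize do
      @in_flight += 1
      @max_in_flight = [@max_in_flight, @in_flight].max
    end
    sleep 0.01 # simulated work
    @guard.synchronize { @in_flight -= 1 }
  end
end

# Runs `requests` simulated requests on `thread_count` threads.
# When `app_lock` is given, every request is wrapped in it, which is the
# assumption this sketch makes about a Rails 3.2 app without threadsafe!.
def serve(thread_count, requests, app_lock = nil)
  probe = ConcurrencyProbe.new
  queue = Queue.new
  requests.times { queue << :request }

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        app_lock ? app_lock.synchronize { probe.call } : probe.call
      end
    end
  end
  workers.each(&:join)
  probe.max_in_flight
end

locked = serve(7, 14, Mutex.new)
unlocked = serve(7, 14)
puts "with app lock: #{locked} request(s) in flight at most"
puts "without lock:  #{unlocked} request(s) in flight at most"
```

With the lock in place the probe never sees more than one request in flight, however many threads are configured, which is consistent with the queueing seen in monitoring.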

Once we noticed it, we changed our Puma configuration to this:

            puts "5 workers and 1 threads"

            workers 5

            # Min and Max threads per worker

            threads 1, 1

And performance returned to normal levels. We also had to fine-tune memory usage and the number of processes. Finally, we adopted a gem called puma_worker_killer to keep a behavior we already had with Unicorn, where we used unicorn_worker_killer to avoid memory problems. Those problems seem to have gone away on Rails 3, so we plan to remove puma_worker_killer in the future.
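
As a rough illustration of tuning the number of single-threaded workers to a server's CPU and RAM, here is a small sketch. The helper name `puma_worker_count`, the per-worker memory figure and the reserved memory are all hypothetical; the numbers are illustrative only and are not measurements from Teixo's servers.

```ruby
# Hypothetical sizing helper: one single-threaded worker per CPU core,
# capped by how many workers fit in the RAM left after a reserve,
# and never fewer than one worker.
def puma_worker_count(cpu_cores, ram_mb, per_worker_mb, reserved_mb)
  by_memory = (ram_mb - reserved_mb) / per_worker_mb # integer division
  [[cpu_cores, by_memory].min, 1].max
end

# Illustrative numbers only (assumptions, not real server figures).
count = puma_worker_count(8, 4096, 600, 512)
puts "workers #{count}"
puts "threads 1, 1"
```

The result would then go into the `workers` line of puma.rb, and a memory watchdog such as puma_worker_killer covers the case where the per-worker estimate turns out to be too optimistic.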

2.3.3.2 Dealing with Stable and Edge channels together

At this point we had two Teixos in the staging environment, one on Rails 2.3 (Stable) and the other on Rails 3.2 (Edge), both working with the same database and the same Memcached server. Each version had its own set of DelayedJob workers, but they all polled the same database for jobs to run. User sessions were also stored in the database, and all of this together led to some new problems we had to deal with.

Sessions

User sessions were stored in the database for both channels. This led to deserialization issues when a user with a session stored on Edge (Rails 3.2) tried to use the Stable version (Rails 2.3), because Rails 2.3 did not know how to deserialize Rails 3 objects. There is no single solution for this; it depends on the data stored in the session and on how each kind of object is deserialized. In our case we simply added this to config/initializers/a_config.rb:

          ActionController::ParamsHashWithIndifferentAccess = ActionDispatch::Http::ParamsHashWithIndifferentAccess

This way Rails 3 could unmarshal Rails 2 objects of the ParamsHashWithIndifferentAccess class, and vice versa.
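
The mechanism behind this alias can be sketched in plain Ruby, without Rails. `OldNamespace` and `NewNamespace` are hypothetical stand-ins for the two framework namespaces; the sketch only shows how Marshal resolves class names, not how Rails stores sessions.

```ruby
module OldNamespace
  # Stands in for the class name written into the stored session.
  class ParamsHash < Hash; end
end

module NewNamespace
  # Stands in for the class that replaced it in the newer framework.
  class ParamsHash < Hash; end
end

# Serialize an object under the old class name, as the older app would.
old_params = OldNamespace::ParamsHash.new
old_params["user_id"] = 42
payload = Marshal.dump(old_params)

# Simulate the newer app, where the old constant no longer exists.
OldNamespace.send(:remove_const, :ParamsHash)

begin
  Marshal.load(payload)
  load_failed = false
rescue ArgumentError => e
  load_failed = true
  puts "without alias: #{e.message}"
end

# The alias: the old constant name now points at the new class.
OldNamespace.const_set(:ParamsHash, NewNamespace::ParamsHash)

restored = Marshal.load(payload)
puts "with alias: #{restored.class} #{restored.inspect}"
```

Marshal stores the class name as a string and looks it up when loading, so pointing the old name at the new class is enough for loading. Objects dumped by the newer class carry the new name, so in this sketch the other side needs its own matching constant for the reverse direction.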

We also defined a session store initializer (config/initializers/session_store.rb) to deal with flash messages stored in the session:

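          # Minimal FlashHash class so that flash objects stored in the
          # session under this class name can be deserialized.
          module ActionController
            module Flash
              class FlashHash < Hash
                # Swallow any call to a method this stub does not define.
                def method_missing(m, *a, &b)
                end
              end
            end
          end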

Cache problems

Another similar problem was the deserialization of Rails cache entries. In this case we changed Teixo to add an environment variable on both channels, Stable and Edge. The Rails 3.2 version of Teixo used this variable as a prefix for every cache key. This makes Rails 3 cache entries different from Rails 2 ones (the keys are the same except that Rails 3 keys have a prefix), so the Stable and Edge channels do not share cache entries.

Be aware that this can lead to inconsistencies between environments. For example, a cached partial showing an object on Stable is not invalidated when that object is modified on Edge, and vice versa. So watch out for cache problems if users work on both environments at the same time.
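
Both the isolation and the stale-entry caveat can be seen in a small sketch. `PrefixedCache`, the key name and the prefix value are hypothetical, and a plain Hash stands in for the shared Memcached server; this is not the Rails cache API.

```ruby
# Hypothetical wrapper: prefixes every key before touching the shared store.
class PrefixedCache
  def initialize(store, prefix)
    @store = store
    @prefix = prefix.to_s
  end

  def write(key, value)
    @store[namespaced(key)] = value
  end

  def read(key)
    @store[namespaced(key)]
  end

  def delete(key)
    @store.delete(namespaced(key))
  end

  private

  def namespaced(key)
    @prefix.empty? ? key : "#{@prefix}:#{key}"
  end
end

shared_store = {} # plain Hash standing in for the shared Memcached server

# In the setup described above the prefix came from an environment
# variable; a literal (assumed) value is used here to keep the sketch simple.
stable = PrefixedCache.new(shared_store, "")       # Rails 2.3: no prefix
edge   = PrefixedCache.new(shared_store, "rails3") # Rails 3.2: prefixed keys

stable.write("views/item/1", "<li>Old name</li>")
edge_hit = edge.read("views/item/1") # nil: Edge never sees Stable's entry

# The object is modified on Edge, which only expires Edge's own key.
edge.delete("views/item/1")
stale = stable.read("views/item/1") # Stable still serves the old fragment
```

The prefix avoids loading entries serialized by the other Rails version, at the price of each channel expiring only its own copies.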

DelayedJobs, workers and offsets

As mentioned previously, Teixo uses the DelayedJob gem. It defines tasks as different job classes extending Delayed::Backend::ActiveRecord::Job. We also simulate job categories by grouping job types by priority. For example, priorities 0 to 4 mean job category A, priorities 5 to 9 mean category B, and so on. In the production environment we run a different number of workers for each category: two workers for category A, one for category B, etc.

When we planned to have two channels, we needed a way to ensure that workers on Stable would only run jobs created by the Rails 2 version, and likewise for Edge and Rails 3.

We solved it with a priority offset. Each environment has a DelayedJob offset parameter, which is added to the job's base priority when the job is created. On the Stable environment the offset was 0; on Edge it was 100. So jobs on Stable had priorities between 0 and 99, and Edge jobs had priorities between 100 and 199. We also had to adapt worker startup to apply this offset to each worker's priority range.

The workers for category A jobs on Stable were launched like this:

          /script/delayed_job -n 2 --min-priority 0 --max-priority 4 run

The same workers on Edge:

          /script/delayed_job -n 2 --min-priority 100 --max-priority 104 run
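
The offset arithmetic can be sketched as follows. The category ranges and the offsets 0 and 100 come from the description above; the constant and method names are hypothetical and are not part of the DelayedJob API.

```ruby
# Base priority ranges per job category (A: 0 to 4, B: 5 to 9).
CATEGORY_RANGES = { "A" => (0..4), "B" => (5..9) }.freeze

# Offsets per channel: Stable 0, Edge 100.
CHANNEL_OFFSETS = { "stable" => 0, "edge" => 100 }.freeze

# Priority stored on a job when it is created on a given channel.
def job_priority(base_priority, channel)
  CHANNEL_OFFSETS.fetch(channel) + base_priority
end

# Priority flags for the workers serving one category on one channel.
def worker_priority_flags(category, channel)
  range = CATEGORY_RANGES.fetch(category)
  offset = CHANNEL_OFFSETS.fetch(channel)
  "--min-priority #{range.min + offset} --max-priority #{range.max + offset}"
end

stable_flags = worker_priority_flags("A", "stable")
edge_flags = worker_priority_flags("A", "edge")
edge_job = job_priority(7, "edge") # a category B job created on Edge

puts stable_flags
puts edge_flags
puts edge_job
```

Because every job and every worker range is shifted by the same offset, a worker on one channel never picks up a job created by the other.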

2.4 QA and Customer Success department

QA has a huge impact on the migration process, ensuring that automated testing covers most of Teixo's features. But not all features and user workflows are covered by automated tests. Some features, procedures and behaviors are complex and specific to certain customers, and thus have no test coverage.

This is where the Customer Success department comes into play, ensuring that all those specific points work properly by following a previously defined test plan. They ran several rounds of manual validation following the test plan, detecting various issues and, most valuably, sources of issues even in untested parts of the application. All this testing was done in our staging environment, on the Edge channel.

2.5 Stepping into production

When all the (known) issues were fixed, we deployed our brand-new Teixo to production on the Edge channel, accessible via a specific URL. At this point we had two different versions of Teixo working with the same database instance.

For a few days we used this Edge version internally.

Meanwhile, the Customer Success department started talking to a selected group of customers, asking whether they would like to try the new Teixo release before it was fully open to every customer.

The advantage for them was that, if there were any problems with their workflows on Teixo, we could fix them very quickly. Many customers agreed and we started giving them access to the Edge version. Each week more customers started working on the new version, until we had enough customers on the platform.

This double production environment meant extra work, as explained earlier (section Double Trouble), but it was worth the effort.

When we considered the Edge version stable enough, we planned the final step. It would require several stages:

  • Duplicate our Edge environment in production, creating a new one known from then on as Stable, and redirect traffic from the Stable URL to the new environment.
  • Merge the edge branch into the master branch.
  • Run the new Stable environment on code from the master branch.

When everything was ready and the date arrived, the change was relatively easy. We still had to face some issues once all the customers started using the new Teixo version, but they were easily handled, and within a few days the situation was stable, since the more complex problems had already been solved.

3 Lessons learned

  • Nobody can handle your code better than your own team. External teams may help and guide, but the best developers you will find for a project like this are in-house.
  • If you team up with external help, teach them how your product works and why it behaves as it does. Make sure the external team also understands your workflow and tools and, last but not least, keep communication with them fluent.
  • We made the right choice using two branches of code instead of the dual boot mechanism, and taking major Rails steps instead of minor ones.
  • Expect more problems than the known ones, many more, especially if your project has drifted away from standard gems and libraries.
  • Updating the core of your app may look like a good moment to clean and tidy code beyond the changes the Rails update needs. But it pays not to be too ambitious, because big changes greatly increase the entropy already present in the project. On our first step to Rails 3.2 we also had to change all our AWS servers, and this brought many problems on top of those related to the Rails update. Don’t try to improve everything; take notes for future improvement tasks, and do the cleaning and tidying before or after the update.
  • The non-technical part of the update process is as important to success as the technical one. Project testing, validation and customer management are key.
  • Don’t be afraid of monkey patches or workarounds. Keep them under control and try to fix them in a later phase of the project.

Date
23/6/23
Category
Technology