Fernando Alvarez, software engineer, shares the challenges and learnings from a recent content migration process.


It’s a great day at the office, the birds are singing, the flowers are blooming, and the coffee is pleasant. On this day, our client contacted us to tell us about a new requirement that, at the time, we thought would be a piece of cake… but we were so wrong…

The client told us they needed to migrate all the content from their CMS to the new Content Delivery Stream we were building. “Move content from one node to another” sounds quite easy, but there were a lot of problems to be solved behind that phrase.

Old vs New

The client wanted a fancy, brand-new Content Delivery Stream, including a headless CMS, a serverless back-end, and a front-end to display its content. The CMS they were using had a lot of problems, from poor client-side performance and a slow delivery stream to constant server downtime.

The architecture we proposed looked like this:

High Level Architecture

In this proposal, CMS users create and manage content inside a headless CMS based on WordPress. When a user publishes an article, the CMS sends it to the Content Stream, which uses serverless AWS Lambda functions to transform the content and conform it to the delivery schema. The transformed content is stored in an ElasticSearch service, and web visitors browse the website through the front-end, built with VueJS and Nuxt.js (server-side rendering for VueJS), which fetches its data from ElasticSearch. For third-party news feeds, content is loaded using the WordPress REST API.
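
To make the hand-off between WordPress and the Content Stream a bit more concrete, here is a minimal sketch of what one of the transform Lambdas could look like. The field names, the index name, and the client setup are illustrative assumptions, not the production code.

// Hypothetical transform Lambda in the Content Stream: receives a published
// article from WordPress and indexes it into the delivery ElasticSearch.
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({
  host: process.env.DELIVERY_ES_HOST, // delivery ElasticSearch endpoint (assumed env var)
});

exports.handler = async (event) => {
  // Assume the WordPress plugin sends the article as the request body.
  const article = JSON.parse(event.body || '{}');

  // Transform the WordPress post into the delivery schema (field names are examples).
  const doc = {
    uuid: article.uuid,
    title: article.title,
    body: article.content,
    author: article.author,
    publishedAt: article.date,
  };

  await client.index({ index: 'articles', type: '_doc', id: doc.uuid, body: doc });

  return { statusCode: 200, body: JSON.stringify({ indexed: doc.uuid }) };
};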

What were we going to move?

The requirements were clear to us:

The team needed to move around half a million articles from the client’s old CMS into the new Content Delivery Stream. The content had to be available to display on the front-end and for CMS users to manage it. The articles had to show the same content as the old CMS and needed to be flexible enough to edit in WordPress.

Designing the Migration Process

The client gave us access to an ElasticSearch endpoint containing all the data to be migrated, so we could use this instead of querying their database directly. At first, we thought of migrating directly from their ElasticSearch to our delivery ElasticSearch so the content would be ready for external visitors of the website, but there was a problem with this approach: the content would not be available in WordPress to edit and manage. After seeing this problem, we did some research on fetching the content “on demand”, using a process similar to the ElasticPress plugin for WordPress. Because we lacked the time to develop this idea, we dropped it.

In the next iteration of the design process, we decided to insert the migrated content into WordPress. We already had a plugin that transformed the content and metadata into the schema that ElasticSearch required; we only needed to convert every article from the client’s ElasticSearch into the WordPress post schema.
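
As a rough illustration of that conversion, a small mapping function could turn a document from the client’s ElasticSearch into a WordPress REST API post payload. The source field names and the meta key below are assumptions.

// Illustrative mapping from a client ElasticSearch hit to a WordPress post payload.
function toWordPressPost(hit) {
  const source = hit._source;
  return {
    title: source.headline,        // assumed source field names
    content: source.body,
    date: source.published_at,
    status: 'publish',
    meta: {
      legacy_uuid: hit._id,        // keep the original ID so articles can be traced back
    },
  };
}

module.exports = { toWordPressPost };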

To do this, we thought about automating the ingestion from the client’s ElasticSearch into our WordPress instance. Our first choice was to use the Amazon Simple Queue Service (SQS) to throttle the content ingestion. The plan was simple, and this diagram was born:

To achieve this plan, we only needed an SQS producer to grab documents from ES and push them into an SQS queue, and a couple of SQS consumers to listen to the queue and insert the messages into WordPress. Our technology choice was NodeJS, especially because there are two easy-to-use SQS producer and consumer modules for NodeJS available on NPM.
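
A minimal sketch of the producer side, assuming the sqs-producer package from NPM (the exact import style varies between versions of the module, and the queue URL is a placeholder):

// Pushes transformed articles into the SQS queue used to throttle ingestion.
const { Producer } = require('sqs-producer');

const producer = Producer.create({
  queueUrl: process.env.MIGRATION_QUEUE_URL, // hypothetical queue URL
  region: 'us-east-1',
});

async function enqueueArticles(posts) {
  // One SQS message per article; the body carries the WordPress post payload.
  const messages = posts.map((post) => ({
    id: post.meta.legacy_uuid,
    body: JSON.stringify(post),
  }));
  await producer.send(messages);
}

module.exports = { enqueueArticles };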

While developing the SQS producer, we faced an important problem: ElasticSearch’s limit of 10,000 documents on from + size queries became a blocker. To solve this, we decided to use the Scroll API to access every document in chunks.

Another problem we faced when developing the producer was that, if the process died, the state of the articles wasn’t saved and everything needed to be ingested again. To avoid this, we started saving the UUID of the latest ingested document in a database as the process ran. Every time the process started, it checked the latest UUID in the database, so the Scroll API knew where to pick up the work.
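
Putting those two pieces together, the producer could page through the source ElasticSearch with the Scroll API while checkpointing the last UUID it ingested. This is only a sketch: the index name, the sortable uuid field, and the checkpoint helpers are assumptions.

// Sketch of the producer's read loop: scroll through the source ES in chunks,
// process each batch (e.g. enqueue it to SQS), and checkpoint the last UUID.
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({ host: process.env.SOURCE_ES_HOST });

async function runProducer(lastUuid, processBatch, saveCheckpoint) {
  let response = await client.search({
    index: 'articles',
    scroll: '5m',
    size: 500,
    body: {
      // Resume after the checkpointed document; assumes uuid is a sortable keyword field.
      query: lastUuid ? { range: { uuid: { gt: lastUuid } } } : { match_all: {} },
      sort: ['uuid'],
    },
  });

  while (response.hits.hits.length > 0) {
    await processBatch(response.hits.hits);
    const last = response.hits.hits[response.hits.hits.length - 1];
    await saveCheckpoint(last._source.uuid); // persist progress so a restart can resume
    response = await client.scroll({ scrollId: response._scroll_id, scroll: '5m' });
  }
}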

The consumer process was easy to develop. We created a NodeJS script that would listen to the queue, get data in chunks of ten, and try to insert the posts into WordPress. If a message failed fifteen times, SQS would send it to a dead-letter queue. To keep the process alive, we used PM2.
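
A sketch of one consumer process, assuming the sqs-consumer package and HTTP Basic Auth against the WordPress REST API (the endpoint and credentials are placeholders). The dead-letter behaviour, fifteen receives before a message moves to the dead-letter queue, is configured on the queue itself rather than in this code.

// Listens to the migration queue in batches of ten and inserts posts into WordPress.
const { Consumer } = require('sqs-consumer');
const axios = require('axios');

const app = Consumer.create({
  queueUrl: process.env.MIGRATION_QUEUE_URL,
  batchSize: 10, // pull messages in chunks of ten
  handleMessageBatch: async (messages) => {
    for (const message of messages) {
      const post = JSON.parse(message.Body);
      await axios.post(`${process.env.WP_URL}/wp-json/wp/v2/posts`, post, {
        auth: { username: process.env.WP_USER, password: process.env.WP_PASSWORD },
      });
    }
  },
});

app.on('error', (err) => console.error(err.message));
app.start();

A process like this can then be kept alive with PM2, for example with pm2 start consumer.js.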

After creating the SQS producer and consumers, we started developing the content transformers for the WordPress REST API. At first, we resolved every dependency inside WordPress, but we soon noticed requests were taking a long time: transforming and fetching data inside the CMS was a lot of work for WordPress and MySQL. After a tough afternoon of brainstorming, we decided to fetch all the dependencies in the SQS producer, moving that work to Node instead of WordPress to avoid request timeouts from the API.

After making that change in the producer, everything started going fine. We began testing the migration process and noticed that MySQL and PHP were taking too long to transform and insert data into the CMS.

We decided to use Amazon Aurora to scale up the database. Things started to look a little better after that, but the process was still going to take many days to finish: at around 50 articles per minute, half a million articles would need roughly a week of nonstop ingestion, so we decided to scale up the CMS as well.

We created a Terraform template to spin up a cluster of five identically sized, high-powered WordPress instances, all connected to Aurora. Every request passed through an Elastic Load Balancer (AWS ELB), which distributed the requests across the instances using round robin.

And then, at last, we had our first architecture design of the migration process:

Architecture of the content migration using SQS

We were so proud at that moment because everything started to flow super nicely! But we knew this could be improved and that we could speed up the process a little bit more. We decided to go to the next level by using Amazon Kinesis.

We also decided to change the SQS producer into a hybrid, supporting both SQS and Kinesis ingestion. For the consumer, we created a Lambda function triggered whenever a new event was ingested into the Kinesis stream. At that point, we could push around 150 events per minute, and guess what?! The load was too much for WordPress and requests started timing out. The solution was to vertically scale up the EC2 instances. At last, our migration process with Kinesis was finished.
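
For reference, the Kinesis-triggered consumer Lambda can be sketched roughly as below; records arrive base64-encoded, and the WordPress call mirrors the SQS consumer above (again, the endpoint and credentials are assumptions).

// Lambda triggered by the Kinesis stream: decode each record and insert it into WordPress.
const axios = require('axios');

exports.handler = async (event) => {
  for (const record of event.Records) {
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    const post = JSON.parse(payload);
    await axios.post(`${process.env.WP_URL}/wp-json/wp/v2/posts`, post, {
      auth: { username: process.env.WP_USER, password: process.env.WP_PASSWORD },
    });
  }
};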

Architecture of the content migration using Kinesis

At this point, everything was going exceptionally well: we were achieving extremely good times per request and handling a great number of requests per minute… Then everything started to get dark for us.

The Bumps in the Road

We thought we were out of trouble, but again, we were so wrong. The first bump was when we hit 100k articles and everything came tumbling down:

What an odd peak…

This surprised us; we never thought a peak like that could happen in the process, until we noticed something interesting:

Getting the dependencies needed to send a transaction to the Delivery Stream was taking too long. But why!? It turned out that the wp_postmeta table in WordPress had around 1.5 million rows, many of its values weren’t indexed, and MySQL needed a lot of time to resolve those queries. This problem was serious because we were facing a probable design problem in the way WordPress stores article metadata. To solve it, we installed the ElasticPress plugin to index all the content from WordPress into another ElasticSearch instance. After this change, everything started to run normally again!

Before ElasticPress and After ElasticPress

Another problem we faced was that a process in Kinesis finished abruptly.

Fortunately, this was just a configuration problem, solved easily by changing the stream’s data retention period in the AWS Console.

From 24 hours of retention to a week of retention
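
We made the change in the AWS Console, but for reference, the equivalent call through the AWS SDK for JavaScript looks roughly like this (the stream name is just an example):

// Extend the Kinesis stream's retention from the default 24 hours to a week (168 hours).
const AWS = require('aws-sdk');

const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

kinesis
  .increaseStreamRetentionPeriod({
    StreamName: 'content-migration-stream', // example stream name
    RetentionPeriodHours: 168,
  })
  .promise()
  .then(() => console.log('Retention period extended to one week'));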

Another problem we faced in our tests was that the number of requests per minute dropped to an incredibly slow rate once the database held around 1 million articles:

To solve this, the operations team suggested we use the HyperDB plugin for WordPress to easily manage the read and write replicas, since we were using AWS RDS Aurora as the database infrastructure. After configuring the plugin, the results were very satisfying:

HyperDB helped us manage Read and Write replicas & speed up the process

Time to Run the Data Migration

After all these bumps, we were finally able to run the data migration in the client’s infrastructure. We set everything up and started the process first thing in the morning. After around 10k articles, we started noticing slowness, timeouts, and very high CPU usage in the AWS Aurora cluster.

We ran the tests again on our own infrastructure and everything was fine, but the client’s infrastructure wasn’t working! Something fishy was going on. We were facing an interesting and unpredictable problem; so close to success, yet so far. We started debugging the process using New Relic, and we noticed something interesting in the client’s infrastructure:

Every request took almost 9 seconds!
126 instances!?

After some exhaustive debugging, the architecture developer found a bug in the scalability strategy. The problem was fixed and the number of instances started to look more consistent, but we still noticed slowness. Then, we found this:

Painfully long requests

For some strange reason, a specific SELECT was taking a lot of time, and the database CPU was peaking at 100% usage. This problem was interesting to solve because we needed to replicate it in our own infrastructure. We found that each post insert queried a table to update the term counts. We clearly didn’t need that for our CMS, so we made some query improvements and optimizations, but the problem was still there. We then decided to add a caching layer using Redis and the Redis Object Cache plugin, and the result was better than we expected.

Before Caching

MySQL CPU painfully peaking

After Caching

Every request started to take less than a second

Conclusion

A valuable lesson we learned through this process is that data migrations are not as easy as they sound. The whole experience helped us understand data migrations better and recognize how valuable they are to us. Right now, the solutions described in this article are working effectively for the content migration. I really recommend you try them and get involved in the development process for these kinds of projects.

 

Posted by sara on Thursday, March 15, 2018.
