Migrating to Nova Cells v2

On 15th October 2019, the Nectar Research Cloud finally migrated to Nova cells v2. This completed a long journey of carrying multiple patches on Nova to make cells v1 work.

First a little history.....

The Nectar Research Cloud began its journey with cells back in the Essex release. We had heard about this new thing called cells (now known as cells v1) and were aware Rackspace were using them in a forked version of Nova. Chris Behrens was the main developer of cells, so I got a conversation going and soon we were running a version of Nova from a personal branch on Chris's GitHub. This helped us grow the Nectar Research Cloud to its current scale and also suited our distributed architecture.

I think it was Grizzly where cells v1 was merged into the main Nova codebase. Finally we were back onto the mainline codebase... however, there were multiple things that didn't work, including security groups and aggregates, to name a few. That led me down the path of Nova development, and we managed to get most of these things working; unfortunately, by this time cells v1 had been marked as deprecated, so patches weren't accepted upstream. Back to a fork, and so the Nectar Nova cells v1 branch was born. It also started a community of cells v1 operators: we now had Nectar-developed code running at places like CERN and GoDaddy.

With cells v1 not being accepted into the mainstream Nova codebase, the upstream developers began working on a replacement, known as cells v2. Cells v2 was added to Nova in the Pike release, but it wasn't until Queens that multi-cell deployments were supported.

Upgrade time!

The move from cells v1 to cells v2 took approximately three months of planning and testing. We aimed to migrate with as little downtime as possible, and as a gradual process, because we had concerns about a big flick-of-the-switch upgrade.

Because cells v1 had two databases, API and compute, there was a concern that some records might not be in sync. A tool was written (thanks Jake!) and we discovered several things that we needed to sync up.
We managed to set up a separate cells v2 nova-api endpoint standing alongside our existing production cells v1 endpoint. This allowed us to test many aspects of the migration without impacting production services, and helped us uncover some bugs like firewall issues, hitting MySQL's max_connections limit, and other minor issues.
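The consistency tool itself isn't shown here, but its core idea, diffing instance records between the API and compute databases, can be sketched. Everything below is illustrative: the table and column names are assumptions, and in-memory SQLite stands in for the real MySQL databases.

```python
import sqlite3

def fetch_uuids(conn, table):
    """Return the set of instance UUIDs found in the given table."""
    return {row[0] for row in conn.execute(f"SELECT uuid FROM {table}")}

def diff_databases(api_conn, cell_conn):
    """Report instances present in one database but missing from the other."""
    api_uuids = fetch_uuids(api_conn, "instance_mappings")   # hypothetical table name
    cell_uuids = fetch_uuids(cell_conn, "instances")         # hypothetical table name
    return {
        "missing_in_cell": sorted(api_uuids - cell_uuids),
        "missing_in_api": sorted(cell_uuids - api_uuids),
    }

# Demo: in-memory SQLite standing in for the two databases.
api = sqlite3.connect(":memory:")
cell = sqlite3.connect(":memory:")
api.execute("CREATE TABLE instance_mappings (uuid TEXT)")
cell.execute("CREATE TABLE instances (uuid TEXT)")
api.executemany("INSERT INTO instance_mappings VALUES (?)", [("a",), ("b",), ("c",)])
cell.executemany("INSERT INTO instances VALUES (?)", [("b",), ("c",), ("d",)])

report = diff_databases(api, cell)
print(report)  # instance "a" has no cell record, "d" has no API record
```

Once a report like this exists, each discrepancy can be fixed up by hand or scripted, which is essentially what syncing the two databases came down to.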

The first thing we did was gradually redirect all GET HTTP requests through the cells v2 API endpoints.
[Chart: GET request response times for nova-api]

As you can see for GET requests, cells v2 is slower than cells v1. This was expected, as cells v1 talks to the core database, which lives in the same data centre as nova-api and runs on high-performance hardware. With cells v2, nova-api talks to all of the compute cells' databases, which are spread across the country (and in New Zealand), so latency becomes apparent.
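The post doesn't describe the mechanism used for the gradual redirect; one common approach is a routing rule keyed on the HTTP method, with a tunable fraction of read-only traffic sent to the new endpoint. A minimal sketch of that logic, where the endpoint URLs and rollout knob are illustrative assumptions rather than our actual setup:

```python
import random

# Hypothetical endpoints for the two nova-api deployments.
CELLS_V1 = "http://nova-api-v1.example.org:8774"
CELLS_V2 = "http://nova-api-v2.example.org:8774"

def pick_backend(method: str, v2_fraction: float) -> str:
    """Route a fraction of read-only (GET) traffic to the cells v2
    endpoint, leaving all writes on cells v1 until v2 is trusted."""
    if method == "GET" and random.random() < v2_fraction:
        return CELLS_V2
    return CELLS_V1

# Ramping v2_fraction from 0.0 to 1.0 gives the gradual cut-over;
# POST/PUT/DELETE keep hitting cells v1 regardless.
```

The same split is easy to express in a load balancer, which avoids touching clients at all; the point is simply that GETs are safe to divert first because they can't mutate state.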

[Chart: POST/PUT/DELETE request response times for nova-api]
The POST, PUT and DELETE response times ended up being approximately the same after switching over. You can see a minor increase before they flatten out; the increase was due to some scheduling and database issues (discussed below), but in the end it looks as if these requests are actually quicker in cells v2.

[Chart: response times for all nova-api requests]
GET requests make up over 90% of all traffic, so overall we do see slower response times. You can see things settling down towards the end, with response times going from ~0.4s to ~0.7s, almost double. We hope we can make some improvements there with database tuning and upgrading Nova to newer versions.

Overall, the migration from cells v1 to cells v2 was pretty smooth, and API downtime was kept to near zero. We had roughly an hour of scheduling issues, but once we had those sorted it has been very stable.

