A brief history of OVN migration
We are currently undertaking an upgrade of the Software Defined Networking (SDN) layer of the Nectar Research Cloud.
History
Nectar Cloud SDN is currently powered by MidoNet. We started using this project around 2016 to provide Advanced Networking capabilities. Unfortunately, in 2019 Sony purchased Midokura, the company that wrote MidoNet, and MidoNet became unmaintained. Luckily for us it is open source, so Nectar could continue using it, and we have been maintaining it ourselves since. However, MidoNet's feature set and integration drift further and further from the rest of the OpenStack components with each upgrade, and carrying MidoNet forever would be a pointless endeavour.
Fortunately, there is hope. The Open vSwitch team decided to work with OpenStack to create a project named OVN (Open Virtual Network). Open vSwitch is a popular virtual switch that runs on Linux and provides many networking capabilities. OVN builds on it to provide the L2 and L3 capabilities of an SDN, much like MidoNet does.
Over the years, OVN has gradually matured. It is now the driver of choice supported upstream. This means that by migrating to OVN we can keep Nectar's SDN up to date, and we no longer have to maintain MidoNet ourselves.
First Migration
We made our first migration attempt in 2020. After running OVN successfully in our development and test clouds, we tried to migrate our production cloud to OVN.
The first part of the migration was to import the existing data into OVN and let OVN create the equivalent resources (Networks, Ports, Routers, etc). Unfortunately, the import script hung after working on our data for a while.
Closer inspection revealed that the database didn't really like it. We tried hacking around for a bit, but not having a good background in OVS/OVN, we decided to open a bug and seek help upstream. We provided some fake data similar in size to our production data set. When the maintainers tried to replicate our issue, this was the memorable reply: "...8GB of ram and after about 3 hours it was finally killed due to running out of memory" 😀
There wasn't much we could do, so we abandoned our attempt and decided to wait until the performance improvements landed.
Second Migration
The second migration was attempted a year later, in 2021. This time it went much better - the performance fixes had landed and preliminary testing was encouraging. We managed to run OVN in our development and test clouds.
The whole of 2021 was spent preparing to run it on the production cloud. Remember the performance fixes? For us to use them, we had to upgrade the whole cloud to the versions with the fixes. While the performance fixes were in Neutron, the Compute nodes also needed to be upgraded, as they run a component that talks to the SDN databases (OVSDB) which ultimately provides networking to the virtual machines. This meant upgrading Nova from Train to Ussuri, and also upgrading the operating systems of all the Compute nodes (>500 nodes!) from Ubuntu 18.04 to 20.04. Kudos to the Nectar operators from different sites who pulled this off.
When all the upgrade and setup tasks were done, we were running OVN in a 'shadow' mode. This meant that all the resources existed in both MidoNet and OVN, but OVN was not in use. To switch over, we needed a 'Big Bang' migration day.
On the final migration day, we needed to do a few things all at once to move everyone to OVN:
- Stop all instances on MidoNet
- Stop the Neutron server so there would be no changes to the databases
- Run a final sync to make sure MidoNet and OVN looked exactly the same
- Flip a few flags in the databases
- Move the different Floating IP networks at the different sites from the MidoNet gateway to the OVN gateway
- Turn on the OVN driver in Neutron so that it would now handle (CRUD) the resources in OVN and not MidoNet
- Restart all the instances, which would now be connected via OVN (a rough post-flip sanity check is sketched below)
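
To give a sense of what "everything came back via OVN" should look like, here is a minimal sketch of the kind of post-flip sanity check one could run with openstacksdk. It is illustrative only, not our actual migration tooling, and the cloud name 'nectar' in clouds.yaml is an assumption.

```python
# Illustrative post-flip sanity check -- not our actual migration tooling.
# Assumes openstacksdk is installed and a cloud named 'nectar' is defined
# in clouds.yaml (both are assumptions for this sketch).
import openstack

conn = openstack.connect(cloud='nectar')

# After the instances are restarted on OVN, previously active ports
# should come back up rather than staying DOWN.
down_ports = [p for p in conn.network.ports() if p.status == 'DOWN']
print(f"{len(down_ports)} ports still DOWN after the flip")

# Routers and floating IPs should exist in the same numbers as before
# the migration; compare these counts against a pre-flip snapshot.
print(f"{sum(1 for _ in conn.network.routers())} routers")
print(f"{sum(1 for _ in conn.network.ips())} floating IPs")
```

A quick API-level count like this only catches gross mismatches; the real parity check happens between the Neutron database and the OVN northbound database during the final sync.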
Fingers crossed, we kicked off the Big Bang on 29 March 2022. All over the country, operators and network staff kicked into action migrating different parts of the cloud. On the core services side, we flipped the necessary switches and turned on OVN in Neutron.
Then everything fell over.
Neutron started throwing craploads of errors when the driver was turned on. The OVSDB servers were pegged at 100% CPU. Same for Neutron.
Sh*t
Looking at the errors, it seemed that Neutron, now with the OVN driver turned on, was trying to write to the OVSDBs. However, the DBs were unable to keep up with the volume of operations and were constantly rejecting the updates. In addition, because they were so busy, election timeouts were being hit and leader elections were triggered, further exacerbating the issue.
Unfortunately, ovsdb-server is single-threaded, so we couldn't throw more CPUs at the problem.
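
For readers who have not run clustered OVSDB before, the RAFT state that was churning here (role, term, current leader) can be inspected through ovs-appctl on the database's control socket. Below is a minimal sketch of the kind of check we were staring at; the control socket path and the exact output format are assumptions and vary between distributions and deployment tools.

```python
# Minimal sketch: peek at the RAFT state of the OVN Southbound OVSDB.
# The control socket path is an assumption and differs between distros.
import subprocess

SB_CTL = "/var/run/ovn/ovnsb_db.ctl"  # assumed control socket path

def cluster_status(ctl_socket: str, db: str = "OVN_Southbound") -> str:
    """Return the raw 'cluster/status' output for a clustered OVSDB."""
    return subprocess.run(
        ["ovs-appctl", "-t", ctl_socket, "cluster/status", db],
        capture_output=True, text=True, check=True,
    ).stdout

status = cluster_status(SB_CTL)

# Lines such as 'Role:', 'Term:' and 'Leader:' (format assumed here) show
# whether the node keeps flipping between leader/candidate/follower --
# constant churn means election timeouts are being hit, which is what we
# saw while the servers were pegged at 100% CPU.
for line in status.splitlines():
    if line.strip().startswith(("Role:", "Term:", "Leader:")):
        print(line.strip())
```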
We pondered for a while and waited to see if Neutron would recover (sometimes there is a thundering-herd problem when you turn everything off and on again across big systems). However, in this case it did not seem that Neutron/OVN would ever get into a stable state.
Rollback rollback!
After weighing things up, we decided to initiate a rollback. I had the unfortunate privilege of getting people all over Australia to undo their work. This is probably one memory I will never forget. 😮
The rollback was successful. Big thanks to all the operators who put in a full day's work. Unfortunately, it did not turn out the way I wanted. Now it was time for the walk of shame to the local pub, where I bought everyone a round and begged for forgiveness.
What went wrong?
In subsequent days, we discussed this migration attempt. Some things could clearly have been done better.
- Big bang. A big bang was chosen because it was the quickest and easiest. If it had succeeded, we would have achieved the migration with minimal effort and moved on to other tasks. All of us have unlimited tasks with limited time, so efficient use of our time is essential.
- However, a big bang is by definition not a controlled, gradual migration, and it does not allow us to see how things scale up with load.
- Testing. One of the suggestions was that we could have tested more. However, we can never fully replicate prod in test (unless we get two clouds of the same size, yay budget).
What went right?
It was not all doom and gloom; there were a few things that went right:
- Having a rollback plan. Rollback was a high priority for us.
- Executing the rollback plan. We had a one-day outage planned, and the outage stayed within the time communicated to our users. Although we did not achieve our migration goal, there was no unexpected disruption to users, and everything was back the way it was at the end of the outage.
- It gave us more ideas about what could go wrong and helped shape our future migration strategy. Yes, we are doing it again!
The End
A few days later, I saw this bug linked from the mailing list, which I suspect is what hit us. And I quote...
Basically, you can't successfully use RAFT unless you have python-ovs 2.17 (which hasn't been released to PyPI yet due to it accidentally breaking Neutron/ovsdbapp). ... during ovsdb-server snapshotting which happens every 100 txns, ovsdb-server currently does a leadership change which breaks the connection if you are connected to the leader. With out monitor-cond-since, the entire db is downloaded again on reconnect, which just completely destroys performance.
🍻