Posts

A brief history of OVN migration

We are currently undertaking an upgrade of the Software Defined Networking (SDN) for the Nectar Research Cloud.

History

The Nectar Cloud SDN is currently powered by MidoNet. We started using this project in around 2016 to provide Advanced Networking capabilities. Unfortunately, in 2019 Sony purchased Midokura, the company that wrote MidoNet, and MidoNet then became unmaintained. Luckily for us, it is open source, so Nectar could continue using it and we have been maintaining it since. However, the feature set and integration of MidoNet with the rest of the OpenStack components drifts further and further with each upgrade, and it would be a pointless endeavour to carry MidoNet forever. Fortunately, there is hope. The Open vSwitch team decided to work with OpenStack to create a project named OVN (Open Virtual Network). Open vSwitch is popular virtual switch software that runs on Linux and provides many networking capabilities. The OVN project will create software which will provide the L2...

How we lost all user data on our Jupyter Notebook service

On Tuesday the 21st of February we did some maintenance on a Kubernetes cluster that hosts our Jupyter notebook service. This maintenance resulted in all user data that wasn't actively being used being deleted. At the time of the maintenance, this was all of our users' data.

In preparation for moving our Kubernetes cluster to some different hardware, we needed to shrink our Kubernetes workers to allow for some more space on the underlying hypervisors. So, one by one, we drained and deleted each worker, then created a new, smaller worker and joined it back into the cluster. This all went smoothly and the JupyterHub service was unaffected, except when we needed to move the hub process. Unfortunately you can only run one hub, so there was about a minute where the web interface was down.

Once the rolling rebuild of our workers was complete, we made sure the service was working as expected. All looked fine, except we noticed the volumes that are attached to each user's pod were empty...
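The rolling rebuild described above is essentially the cordon-and-evict cycle that kubectl drain automates. Below is a rough sketch of that step using the official Kubernetes Python client (assuming a recent client release where V1Eviction is available); the node name is a placeholder, and kubectl drain adds safety checks around DaemonSets, PodDisruptionBudgets and local data that this sketch skips.

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    node = "worker-3"   # placeholder node name

    # Cordon: mark the node unschedulable so no new pods land on it.
    core.patch_node(node, {"spec": {"unschedulable": True}})

    # Evict every pod currently scheduled on the node via the eviction API.
    pods = core.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
    for pod in pods.items:
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace))
        core.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=eviction)

    # Once drained, the node can be deleted and a smaller replacement joined in.
    core.delete_node(node)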

State of the Cloud 2019

Now that 2020 is upon us, I thought it might be a good idea to generate some statistics about the Nectar Cloud for 2019.

Instances

In 2019, the Nectar Cloud ran a total of 70,371 instances.

VCPU time

These instances ran for a total of 9,203,375 days, 19 hours, 17 minutes and 59 seconds of VCPU time [1]. That is around 25,214 VCPU years! The mean VCPU time is about 130 days. The mode VCPU time is 365 days, which suggests there were lots of single-core instances running for the whole year.

Flavour

The most popular flavour was m2.large (4 VCPU). There were 26,750 such instances.

End

Statistics were generated from Gnocchi. Nectar logs the start/end times of each instance in Gnocchi, as well as a host of other data. As a Nectar user, you can use the Gnocchi API to access metrics for your resources. Let me know if this has been interesting, or if there are any other stats you want to see!

Footnote

1. VCPU time is (Number of VCPUs) * (Running time). For example, if an in...
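For anyone who wants to pull similar numbers themselves, here is a minimal sketch using python-gnocchiclient with a Keystone v3 session. The endpoint and credentials are placeholders, timestamps are assumed to come back as ISO-8601 strings with an explicit UTC offset, and the per-instance VCPU multiplier from footnote [1] is left out for brevity.

    from datetime import datetime, timezone

    from gnocchiclient.v1 import client as gnocchi_client
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    sess = session.Session(auth=v3.Password(
        auth_url="https://keystone.example.org:5000/v3",   # placeholder endpoint
        username="myuser", password="secret",
        project_name="myproject",
        user_domain_name="Default", project_domain_name="Default"))

    gnocchi = gnocchi_client.Client(session=sess)

    # Every Gnocchi resource carries started_at/ended_at timestamps, so instance
    # lifetimes can be summed straight from the resource listing (ended_at is
    # null while an instance is still running).
    now = datetime.now(timezone.utc)
    total_seconds = 0.0
    for inst in gnocchi.resource.list(resource_type="instance"):
        start = datetime.fromisoformat(inst["started_at"])
        end = datetime.fromisoformat(inst["ended_at"]) if inst["ended_at"] else now
        total_seconds += (end - start).total_seconds()

    print(f"total running time: {total_seconds / 86400:.1f} instance-days")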

Migrating to Nova Cells v2

On 15th October 2019, the Nectar Research Cloud finally migrated to Nova cells v2. This completed a long journey of running multiple patches on Nova to make cells v1 work.

First, a little history...

The Nectar Research Cloud began its journey with cells back in the Essex release. We heard about this new thing called cells (now known as cells v1) and were aware that Rackspace were using them in a forked version of Nova. Chris Behrens was the main developer of cells, so I got a conversation going and soon we were running a version of Nova from a personal branch on Chris's GitHub. This helped us scale the Nectar Research Cloud to the scale it is now and was also helpful for our distributed architecture. I think it was Grizzly where cells v1 was merged into the main Nova codebase. Finally we were back onto the mainline codebase... however, there were multiple things that didn't work, including security groups and aggregates, to name a few. That led me down the path of nova de...

Understanding telemetry services in Nectar cloud

Intro

The OpenStack Telemetry service is intended to reliably collect data on the utilization of the physical and virtual resources comprising deployed clouds, persist these data for subsequent retrieval and analysis, and trigger actions when defined criteria are met. The Ceilometer project used to be the only piece of the Telemetry service in OpenStack, but since it had accumulated too many functions, it was split into several subprojects: Aodh (alarm evaluation), Panko (event storage) and Ceilometer (data collection; see more about Ceilometer's history from an ex-PTL). Ceilometer's original storage and API functions have been deprecated and moved to the other projects. For the latest Telemetry architecture, refer to the OpenStack system architecture documentation.

In the Nectar cloud, we are using Ceilometer for metric data collection, Aodh for the alarm service, Gnocchi for the metric API service, and InfluxDB as the time-series database backend for Gnocchi. The InfluxDB backend drive...
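As a rough illustration of how those pieces fit together from a user's point of view, the sketch below reads Ceilometer-collected measures back out of Gnocchi and asks Aodh to raise an alarm on the same metric. It relies only on the standard REST endpoints (service types "metric" for Gnocchi and "alarming" for Aodh); the instance UUID, metric name, threshold and webhook URL are all placeholders.

    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    sess = session.Session(auth=v3.Password(
        auth_url="https://keystone.example.org:5000/v3",   # placeholder endpoint
        username="myuser", password="secret",
        project_name="myproject",
        user_domain_name="Default", project_domain_name="Default"))

    # Read measures that Ceilometer pushed into Gnocchi for one instance.
    measures = sess.get(
        "/v1/resource/instance/INSTANCE_UUID/metric/cpu_util/measures",
        endpoint_filter={"service_type": "metric"}).json()

    # Ask Aodh to fire when the same Gnocchi metric crosses a threshold.
    alarm = {
        "name": "cpu-high",
        "type": "gnocchi_resources_threshold",
        "gnocchi_resources_threshold_rule": {
            "metric": "cpu_util",
            "resource_type": "instance",
            "resource_id": "INSTANCE_UUID",
            "aggregation_method": "mean",
            "comparison_operator": "gt",
            "threshold": 80.0,
            "granularity": 300,
        },
        "alarm_actions": ["https://example.org/webhook"],   # placeholder action
    }
    sess.post("/v2/alarms", endpoint_filter={"service_type": "alarming"}, json=alarm)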

Passing entropy to virtual machines

Recently, while we were testing new images with Magnum, I found that the newest Fedora Atomic 29 images were taking a long time to boot up. A closer look using nova console-log revealed that they were getting stuck at boot, with the following in the console log:

[   12.220574] audit: type=1130 audit(1555723526.895:78): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-machine-id-commit comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   12.248050] audit: type=1130 audit(1555723526.906:79): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-journal-catalog-update comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 1061.103725] random: crng init done
[ 1061.108094] random: 7 urandom warning(s) missed due to ratelimiting
[ 1063.306231] IPv6: ADDRCONF(NETDEV_UP): eth0: li...
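The tell-tale sign is the jump from around 12 seconds to around 1061 seconds before "random: crng init done": the guest sat for roughly 17 minutes waiting for enough entropy to seed its random number generator. One common remedy in OpenStack (not necessarily the exact fix this post ends up with) is to tag the image so Nova attaches a virtio-rng device, letting the guest draw entropy from the host. A hedged sketch with python-glanceclient, using placeholder credentials and image ID:

    from glanceclient import Client as GlanceClient
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    sess = session.Session(auth=v3.Password(
        auth_url="https://keystone.example.org:5000/v3",   # placeholder endpoint
        username="myuser", password="secret",
        project_name="myproject",
        user_domain_name="Default", project_domain_name="Default"))

    glance = GlanceClient("2", session=sess)

    # hw_rng_model=virtio is a standard image property understood by Nova; with
    # it set, instances booted from the image get a virtio-rng device backed by
    # the hypervisor's entropy source.
    glance.images.update("IMAGE_UUID", hw_rng_model="virtio")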

Nectar Identity service now upgraded to Queens

We're pleased to announce that the Nectar Identity service (Keystone) has now been upgraded to the Queens release. Normally an upgrade like this wouldn't require a blog post, but there are a couple of significant changes that users should be aware of.

Keystone API v2.0 deprecation

The Keystone v2.0 API has actually been removed from the Queens release, but we're aware that many users are still using it. We are now actively requesting that any users still on the v2.0 API move over to the v3 API, which has been available since 2016. We plan to keep the v2.0 API running until April 2019, at which time it will no longer be available. See our Keystone v2.0 to v3 migration guide for how you can switch over.

Application Credentials

The long-awaited application credentials are finally available in the Queens release. Application credentials allow users to generate their own OpenStack credentials suitable for applications to authenticate to the Identity service, with...
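As a taste of the new feature, here is a minimal sketch of authenticating with an application credential through keystoneauth, assuming one has already been created (for example via the dashboard or the openstack application credential create command); the ID and secret are placeholders.

    from keystoneauth1 import session
    from keystoneauth1.identity.v3 import ApplicationCredential

    auth = ApplicationCredential(
        auth_url="https://keystone.example.org:5000/v3",   # placeholder endpoint
        application_credential_id="APP_CRED_ID",
        application_credential_secret="APP_CRED_SECRET",
    )
    sess = session.Session(auth=auth)

    # The resulting session can be passed to any OpenStack client library;
    # fetching a token is enough to prove the credential works.
    print(sess.get_token())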