
A DevOps Story of SQL Server Pipelines – C3 Post Trade

 

Hi there. This is Anthony Vandewerdt, Field CTO for Actifio, Asia Pacific. Today, I have several special guests with me. I have Kosala from our technical account management team in New Zealand and Joe Hassell, who is our field regional director for Actifio Asia Pacific. And our very special guest is Gareth Perry from C3 Post Trade. So Gareth, rather than steal your story, why don’t you tell it to us?

Thanks Anthony.

Hi, I’m Gareth. I’m one of the architects at C3 Post Trade. At C3 Post Trade we build and provide a number of SaaS offerings in the financial markets; we deal with large volumes of data and host our services from data centers around the world.

We’ve recently embarked on a journey to further improve our processes with regards to the protection, management, and automation of SQL Server resources within one of our platforms.

Today, I’m going to talk about that journey and the role Actifio plays in those pipelines.

We’re going to explore three key areas: protection and recovery of the production environment with regards to the SQL Server availability group, provisioning data for test environments, and the developer experience in the DevOps pipelines. Let’s begin.

So protection and recovery.

Firstly, a little bit about our environment. Within this production environment, there’s a SQL Server availability group managing 25 databases with around three terabytes of data. For resiliency the environment has two sites and two data centers on opposite sides of the planet. And as with many systems, the availability and resiliency of these databases are absolutely critical to the platform.

Some of the requirements we had going into this phase of the journey: I just mentioned how critical the databases are to the platform, so the entire solution must be robust and it must provide security around access to the data. It needs to provide a solid replication service that can handle the types of disruptions you might expect with data centers on different continents. It must be able to restore data to an exact point in time. It needs to provide structured management of the backup assets: fulls, diffs, logs, that sort of thing. And it has to allow flexible retention policies, for example, restoring any point in time over the last month and then perhaps daily snapshots prior to that.

Let’s take a look at the solution we have in place. Here we have our availability group spread over the two data centers, Center A and Center B. Twenty-five databases with the three terabytes sit on a primary server that has a synchronous link to the secondary in Center A and an asynchronous link to the secondary in Center B.

Now, the availability group is there to handle the instant failover scenario related to a hardware failure or perhaps a data center failure. The data protection and recovery aspect of the solution is where Actifio comes in. We’ve installed an Actifio Sky appliance in Center A, and this appliance is responsible for incremental backups from the secondary SQL node.

A second Actifio Sky appliance is installed in Center B, and the Actifio StreamSnap replication service continuously replicates the incremental backups over to Center B. This ensures both sides have the same history of the data.

Now, let’s look at our recovery scenario. This is for when there’s a data loss event and we need to restore the production database or databases to a prior point in time. Anyone who’s been there will know one of the biggest challenges is knowing the point in time to restore to, especially when you’re racing to meet an RTO. Actifio allows us to rapidly mount multiple copies of the databases at different points in time, so, for example, 10:45, 10:50, and 10:55, and this means the recovery team can review the state of the data in parallel. In this case, let’s say we decide 10:50 is a good option.

We can now use the Actifio Global Manager, which is a product for managing multiple appliances, to coordinate a mount of the 10:50 copy to all of the nodes across the two data centers at the same time.

Once this is complete, you’re able to bring the primary database online, and because the secondaries are at the same point in time, the same LSN, the availability group will sync and the databases will come online almost instantly.
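To make that triage step concrete, here is a minimal Python sketch of mounting several candidate recovery points in parallel. It assumes a REST-style API on the Sky appliance; the endpoint path, payload fields, appliance address, token handling, database name, and triage host names are all illustrative assumptions, not Actifio’s documented API.

```python
# Illustrative sketch only: the endpoint path, payload fields, and job
# model below are assumptions, not Actifio's documented API.
import concurrent.futures
import requests

APPLIANCE = "https://sky-a.example.internal"  # hypothetical appliance address
TOKEN = "session-token"                       # assumed to come from a prior login

def mount_point_in_time(db_name: str, timestamp: str, target_host: str) -> dict:
    """Request a point-in-time mount of a protected database to a host."""
    resp = requests.post(
        f"{APPLIANCE}/api/mount",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"application": db_name, "recoveryTime": timestamp, "host": target_host},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# Triage: mount the candidate recovery points in parallel so the recovery
# team can inspect the copies side by side while the RTO clock is running.
candidates = ["10:45", "10:50", "10:55"]
with concurrent.futures.ThreadPoolExecutor() as pool:
    jobs = {
        t: pool.submit(mount_point_in_time, "TradesDB",   # hypothetical DB name
                       f"2019-10-31 {t}:00", f"triage-host-{i}")
        for i, t in enumerate(candidates)
    }
    for t, job in jobs.items():
        print(t, "->", job.result().get("jobId"))
```

The point is simply that each mount is an independent, near-instant job, so triaging three recovery points costs little more than triaging one.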

It’s worth noting here that with this being the production environment, you do want Actifio to perform a clone copy of the data onto your premium storage. However, I know there’s a new feature called Mount and Migrate, which can bring the databases online straightaway and then perform that copy in the background. That sounds pretty cool, and we’re quite keen to have a look at that.

Some of the successes of this protection phase: Actifio integrates natively with SQL Server, performing incremental-forever backups, so for us there are no more chains of fulls and diffs, just that incremental forever. The StreamSnap replication service has proven really solid and self-managing, even across the planet, and we do get connection issues at times. Access to the protected data is secured by user accounts, with an audit trail of any jobs where anyone requests a mount or a clone of the data.

Actifio manages all the backup assets. In fact, it’s all based on changed blocks and points in time, so there aren’t really any fulls or diffs to even think about. And for us, the ability to easily and rapidly investigate databases at multiple points in time in parallel is something we really weren’t expecting to get out of the solution, which has been fantastic. All of this really adds up to much higher confidence in our ability to meet our RPO and RTO commitments for the SQL databases.

Next, let’s look at test data management, the next phase of the project. We have multiple full-size test environments that are frequently refreshed, daily and sometimes multiple times in the same day. They match production in almost every way, including the number and size of the databases, with sites across the same two data centers that we have in the production environment.

Our requirements for this phase, from a DevOps perspective: we need to provide recent copies of masked data from production, and this had to replicate the same availability group topology that we have in production. Because these environments refresh frequently, the restore of the availability group must complete in less than 30 minutes, the performance of the databases must be on par with what we see in production, and the solution must scale to support multiple instances of these environments running at any one point in time without affecting performance.

So let’s have a look at this setup. If we quickly review what we have in production: the SQL Server availability group split across the two data centers, with an Actifio Sky in Center A and an Actifio Sky in Center B, connected by StreamSnap.

Now, the environments themselves are fairly complex: many servers, web services, message brokers, UI, et cetera. And as mentioned, the SQL AG is spread across the data centers.

So for this, we use a product called Octopus Deploy to orchestrate the deployment process, and this starts off by preparing the environment. When it’s time to prep the SQL databases, the Actifio Sky appliances are coordinated across the two data centers to mount the same point in time to each of the nodes. The availability group is then synchronized and the masking process kicks in. Once that data masking is complete, Octopus Deploy continues to orchestrate the deployment in the same way it would perform this action on the production environment. And this whole process is replicated across additional test environments.
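As a rough illustration of that refresh sequence, here is a Python sketch of the orchestration order: mount one consistent point in time to every node, re-sync the availability group, mask, then hand off to Octopus Deploy. The node names, PowerShell script names, and the mount function body are hypothetical placeholders; only the ordering reflects the process described above.

```python
# Sketch of the test-environment refresh sequence; host names, script
# names, and the mount call are placeholders, only the ordering is real.
import subprocess

RECOVERY_POINT = "2019-10-31 06:00:00"  # assumed: a recent production point in time
NODES = {
    "center-a": ["sqltest-a1", "sqltest-a2"],  # hypothetical SQL node names
    "center-b": ["sqltest-b1"],
}

def actifio_mount(center: str, host: str, recovery_time: str) -> None:
    """Placeholder for the Actifio API call that mounts the databases."""
    print(f"[{center}] mounting {recovery_time} to {host}")

def refresh_environment() -> None:
    # 1. Mount the *same* point in time to every SQL node in both data
    #    centers so the availability group can sync from identical LSNs.
    for center, hosts in NODES.items():
        for host in hosts:
            actifio_mount(center, host, RECOVERY_POINT)

    # 2. Re-form the availability group, then run the in-house masking
    #    process (both script names are hypothetical).
    subprocess.run(["pwsh", "./Sync-AvailabilityGroup.ps1"], check=True)
    subprocess.run(["pwsh", "./Invoke-Masking.ps1"], check=True)

    # 3. Hand off to Octopus Deploy to deploy the rest of the stack, the
    #    same way it would deploy to production (assumed project/env names).
    subprocess.run(["octo", "deploy-release", "--project", "Platform",
                    "--deployTo", "Test"], check=True)

refresh_environment()
```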

Some of the successes for us from this phase of the implementation with Actifio. Managing assets: a traditional approach to restoring these databases would require quite a painful orchestration of fulls, diffs, and logs across all of the SQL nodes and the two data centers. With Actifio, we don’t deal with any of this; it’s just a case of requesting that point-in-time recovery.

Time to restore the availability group: using a traditional approach, restoring all databases to all nodes across the two data centers and the availability group takes five hours. Using Actifio, we’re now able to complete this process in around 15 minutes. In addition, it’s worth mentioning that the data we’re restoring could be within, say, 20 minutes of production, if required.

Disk consumption: with a traditional approach, the disk consumption to support the databases on all environments was 36 terabytes. Because of the way Actifio Sky manages the disks, we’re now using just four terabytes, which I know seems hard to believe, but that is what it is.

Now some of the fun stuff: developer experience in the CI pipelines, the final piece of the puzzle. First, our development environment and CI pipelines. We embrace a feature-branching methodology, and there are often many feature iterations in progress at any one point in time. We have a CI pipeline using GitLab and TeamCity to automate the build, the packaging, and thousands of unit, scenario, and integration tests off the back of commits to the Git repo. This pipeline also simulates upgrades and performance tests against realistic production workloads.

Our requirements for this phase were that all of the production data must be masked and validated. The CI pipelines cycle frequently, so the three-terabyte databases must refresh in less than five minutes. The database provisioning solution must scale to tens of developer and agent servers, and it’s really important that devs can easily mount point-in-time masked data from production. Like our other pipelines, the database ones must support automation via API, and critical to the success of this whole thing is that it must be stable and consistent.

So let’s have a look at a couple of example workflows. Here we have a developer working on a feature branch; when they’re ready, they push their changes up to Git. TeamCity identifies the change and sends work to one of the agents.

For those who don’t know, TeamCity is a build management and continuous integration server. The agent picks up the job, pulls the feature branch from Git, and then begins to build the code. Now, if the build succeeds, the pipeline can proceed to the database tests. The agent calls into Actifio via the Actifio API and requests a set of masked production databases from the Actifio LiveClone. The LiveClone is a way for us to create a set of pre-masked and validated databases that are ready for instant mount to any of the servers.

Actifio mounts the databases directly to the agent, and now the deploy process can continue by running the deployment and the database upgrade scripts. Databases are often complex with regards to upgrades, and we need to be confident that the upgrade scripts for a package of any of those feature branches will run successfully against the live environment. On success of those upgrade scripts, the pipeline then runs some performance tests, checking service actions against configured benchmarks; anything outside of what we expect is flagged, and this helps to identify performance issues against realistic data while still in the development process. Feedback from the tests is sent via TeamCity to the developer, and if all goes well for our developer, they’ll get a green light, and that always makes us happy. Yeah.
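Here is a minimal sketch of what that database step on a build agent might look like in Python. The endpoint paths, job-status model, image-set name, agent host name, and upgrade-script name are all assumptions for illustration; Actifio’s actual API will differ.

```python
# Minimal sketch of the database step a build agent might run; endpoint
# paths, the job-status model, and all names are assumptions.
import subprocess
import time
import requests

APPLIANCE = "https://sky-a.example.internal"  # hypothetical appliance address
TOKEN = "session-token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def request_masked_mount(agent_host: str) -> str:
    """Ask Actifio to mount the pre-masked LiveClone images to this agent."""
    resp = requests.post(
        f"{APPLIANCE}/api/liveclone/mount",  # hypothetical endpoint
        headers=HEADERS,
        json={"host": agent_host, "imageSet": "masked-prod"},  # assumed fields
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["jobId"]

def wait_for_job(job_id: str, poll_seconds: int = 10) -> None:
    """Poll the (hypothetical) job endpoint until the mount completes."""
    while True:
        status = requests.get(f"{APPLIANCE}/api/jobs/{job_id}",
                              headers=HEADERS, timeout=30).json()["status"]
        if status == "succeeded":
            return
        if status == "failed":
            raise RuntimeError(f"mount job {job_id} failed")
        time.sleep(poll_seconds)

job_id = request_masked_mount("build-agent-07")   # hypothetical agent name
wait_for_job(job_id)
# With the masked databases mounted locally, run the feature branch's
# upgrade scripts before the performance tests (script name assumed).
subprocess.run(["pwsh", "./Run-DatabaseUpgrade.ps1"], check=True)
```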

Now, let’s look at another important scenario. Our developer needs to perform some analysis that can only be done against a masked set of the production data. Accessing this data is often a painful challenge for a developer, especially in a state that’s valid for their feature branch modifications. However, with the automation we’ve put in place with Actifio, this has become quite a simple task. The developer initiates a process, similar to the CI pipeline, that calls into Actifio and requests the masked data to be mounted. Actifio mounts the data, making it available for the developer, and the developer can then perform the required debugging or analysis. If the developer needs to make code changes, they can do this and then run any upgrade scripts locally to ensure they’ll pass in the CI pipeline and, therefore, in the live deployment. If successful and happy, the developer pushes those changes to Git, and the CI pipeline kicks off again and returns that feedback to the developer.
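A small self-service wrapper, sketched below, shows how the same hypothetical mount endpoint from the CI example could be exposed to a developer’s workstation. The appliance address, host name, image-set name, and endpoint remain illustrative assumptions.

```python
# A small self-service wrapper a developer might run from a workstation;
# it reuses the same hypothetical mount endpoint as the CI sketch above.
import argparse
import requests

APPLIANCE = "https://sky-a.example.internal"  # hypothetical appliance address

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Mount masked production databases to this machine")
    parser.add_argument("--host", default="dev-workstation-01",
                        help="target host as registered with Actifio (assumed)")
    parser.add_argument("--token", required=True, help="API session token")
    args = parser.parse_args()

    resp = requests.post(
        f"{APPLIANCE}/api/liveclone/mount",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {args.token}"},
        json={"host": args.host, "imageSet": "masked-prod"},  # assumed fields
        timeout=60,
    )
    resp.raise_for_status()
    print("mount job submitted:", resp.json().get("jobId"))

if __name__ == "__main__":
    main()
```

Wrapping the call this way keeps the developer experience down to a single command, which matches the “kick it off and grab a coffee” workflow described below.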

So when required, a developer is now able to switch a feature branch, build the code, locally mount three terabytes of masked SQL data, upgrade the database, and be debugging on a stable and consistent environment within around ten minutes.

So, successes out of this phase of the journey: we’re seeing an average mount time of one minute 40 seconds for the three-terabyte databases, which is pretty incredible when you think that this data could be from any point in time over the last month. And as far as scale, we’ve had over 20 mounts active at the same time and seen no difference in the performance of those databases.

We’ve had Actifio perform over 30 three-terabyte mounts in one hour and, again, seen no real difference in performance from the appliance. And that’s by no means a limit; that’s just as hard as we’ve pushed it so far, and everything is behaving well. We’ve reduced our disk consumption by 20 terabytes, and the entire pipeline has proven to be very stable for us.

To summarize the journey and the implementation: the focus on our entire pipeline, from production protection to developer experience, has delivered some impressive efficiency gains and allows us to work at an agile and consistent cadence while maintaining the utmost quality. And Actifio plays a pivotal role in that.

And that’s our journey. Thank you.

Thank you very much for that. That was absolutely fascinating. One question I do have: what data masking software are you using? Is it your own home-built scripts, or do you actually have a specific tool?

Yes, that’s right. We do have our own process that we’ve been building up over many years. I do know there are tools out there that integrate with Actifio, but at this point in time we’ve been able to hook the existing process we already had into our pipeline.

Oh, that’s great. So you actually managed to reuse your existing intellectual capital, if you like, that you knew and trusted, without actually having to break it.

Yeah. That’s the one. Yep.

That’s fantastic. And how many did you say, 30 mounts an hour?

Yeah. That’s driven really by the pipeline. When you’re having a busy day and you’ve got many feature branches and many developers pushing changes into Git, you’ve got a lot of tests happening, so that’s what we’ve pushed it to. Like I was saying, it’s been solid; I’m sure it can do a lot more. Maybe we need to get the developers working a bit harder, see what it can really do.

I was just thinking, because the converse, of course, is the old world: how long would your developers have had to wait if instead an environment was being manually built, or built by a traditional restore? So the wait time there is completely gone.

Yeah. And it’s that consistency, that ability to get it in a known amount of time. For us, developers are really just triggering a process, sitting back, and watching the automation happen. For a lot of them it’s: switch a feature branch, kick the process off, go grab your coffee, and by the time you come back, it’s built and your databases are there. The most important part is that it’s consistent. Often people will have a big database, and they’ll change to another feature branch that doesn’t quite match the same state as the database, but that’s their best option. For us, just being able to switch and have everything consistent is key.

That’s fantastic. Well, thank you very much Gareth for spending the time to share this story with us, we really do appreciate it and we’re enjoying the journey with you and we hope to continue it for a long time to come.

Great. Thanks Anthony.

Learn more about managing SQL databases for test and dev, and about data protection.