
Cluster fails lead to success for fleet management giant: Five DevOps lessons learned


About two years ago, our DevOps team at ABAX decided to tackle Kubernetes to better scale up and roll out applications to support the dynamic nature of our global fleet management business. It’s been quite a journey, and we have both stumbled and succeeded along the way.

Perhaps, for context, we need to paint the picture of what ABAX does, so you can see how DevOps fits into the overall business. ABAX delivers three primary solutions: fleet tracking, electronic mileage logs, and equipment and vehicle control systems. The primary use case for these is to help customers prevent loss and theft through active vehicle tracking and monitoring. Collectively, we track approximately 120 million GSM and GPS ‘positions’ a day via our tracking units, with tens of thousands of customers and over 250,000 real-time connections to maintain.

Headquartered in Larvik, Norway, ABAX is a growing business, and there is a lot of demand for the infrastructure and development team to continually improve our solutions.

My team is primarily responsible for the deployment, availability, monitoring and management of all production customer systems. We started our journey with Kubernetes just over two years ago. Initially, we decided to host our Kubernetes solution on-premises, which we thought would be fine, but as it turned out, it wasn't. We soon realised we didn’t have the technical skills to support it, and we ran into a lot of deployment and QA issues very quickly.

Deployment of Kubernetes is hard because of the sheer complexity of the system. Without proper automation, it’s difficult to keep things consistent: configuration issues arise, and you’re left with near-constant debugging. In the best-case scenario, the cluster fails outright, and you are left figuring out how to get it running again. And while this sounds bad, the alternative is worse: a cluster with intermittent or partial failures, which is extremely difficult to troubleshoot.

After just a couple of attempts, we lost faith and trust in Kubernetes altogether. Our developers didn't trust IT, and IT didn't trust the developers – we had a big problem on our hands.

We never gave up, and today Kubernetes is central to our business success. Here are some of the valuable lessons we learned along the way as we moved from on-premises to Google Cloud.

#1 Build or buy tooling early on

After failing to get our first Kubernetes projects right, our team decided we needed a partner who could provide management support and help us with both our on-premises and cloud projects.

It was at this time that we began to look for a tool and found Rancher. Because it is open source, we could download and test it for free. As soon as we had deployed it, we began to re-establish our trust in Kubernetes. We were clear that Kubernetes hadn’t worked for us, but we soon realised that the problems we had experienced were down to a lack of experience, poor design decisions, and limited deployment automation. We frequently ran into configuration issues, and we burned a lot of hours learning how to use and troubleshoot the technology: understanding how ingress controllers can misbehave when combined with legacy environments, sorting through persistent storage issues where the VMware integration had no real way of doing backups, and living with ill-informed network design decisions that cost us features later on, with no way to go back once the cluster had been deployed.

Whenever we had a failure, we lost a Kubernetes cluster and had to start again. Each time, that was a week’s worth of work down the drain.

Eventually, we automated configuration and deployments, scripting routine steps in Bash and managing our infrastructure as code with tools like Terraform and Rancher, and we encountered far fewer stability and configuration errors.
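To give a flavour of what that looks like in practice, here is a minimal sketch of the idea rather than our actual scripts; the repository, directory and file names are made up for illustration.

    #!/usr/bin/env bash
    # Minimal illustrative sketch: provision a cluster from version-controlled
    # configuration instead of setting it up by hand.
    # Repository, directory and file names are hypothetical.
    set -euo pipefail

    git clone https://github.com/example-org/infrastructure.git
    cd infrastructure/kubernetes

    terraform init                       # download providers and modules
    terraform plan -out=cluster.tfplan   # preview changes before applying them
    terraform apply cluster.tfplan       # create or update the cluster

    # Apply the same manifests on every run, so nothing is configured by hand
    kubectl apply -f manifests/

The point is that the cluster’s state lives in files we can review and rerun, so rebuilding after a failure becomes a repeatable operation rather than a week of manual reconstruction.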

By the time we put a support contract in place, we had already muddled through the many issues we encountered during our initial six to nine months. It was a harsh realisation that we could have avoided the headaches and saved ourselves a ton of time if we had had the right tooling, practices, and support from the start.

So, the key takeaway is to start on the right foot with the right tooling and support. It’s a real time saver.

#2 Automate everything. If in doubt, automate anyway

This lesson is simple: automate whatever you can. You will save time and resources by automating manual tasks you do time and again.

At ABAX, we are a small group of only six people working with DevOps, a mix of developers and infrastructure specialists. This team services the requirements of about 30 to 50 developers, which means that if we're going to keep up with business requirements, we must automate the deployment of both the software and the infrastructure itself. It is intimidating, but the more you automate, the more confident you become, and the easier it gets.

Many companies are still working through the complexities of building binaries and executables by hand and then copying them over to a production server. It is exceptionally manual and cumbersome. Role-based access control, security, and monitoring can take weeks, or even months, when handled manually. Fortunately, we don’t work this way at ABAX.

As soon as we brought in an orchestration platform, we reduced the deployment process from weeks to days, and in some cases from days to minutes, by automating processes including code tests and integration tests as well as the configuration of firewalls and servers – all of which has dramatically sped up our time to production.
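As a rough sketch of what such an automated path to production can look like (not our actual pipeline; the image name, registry, namespace and test scripts are hypothetical placeholders), the manual copy-and-configure steps collapse into a single scripted run:

    #!/usr/bin/env bash
    # Illustrative deployment script: test, build, push and roll out in one go.
    # Image name, registry, namespace and test scripts are placeholders.
    set -euo pipefail

    IMAGE="registry.example.com/tracking-api:$(git rev-parse --short HEAD)"

    ./run-unit-tests.sh          # fail fast if the code tests don't pass
    ./run-integration-tests.sh   # then run the slower integration tests

    docker build -t "$IMAGE" .
    docker push "$IMAGE"

    # Update the running deployment and wait until the rollout is healthy
    kubectl -n production set image deployment/tracking-api tracking-api="$IMAGE"
    kubectl -n production rollout status deployment/tracking-api

When something like this runs on every change, the tests and the rollout happen the same way every time, which is where the weeks-to-minutes gains come from.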

Make no mistake; automation can be scary, especially when it relates to business-critical systems. At first, you feel like you are giving away control by putting it all in a piece of scripting software that you may or may not have done right. It is a lot easier to intervene when you do things manually. But automation done right is a game-changer.

#3 Leave room for specialisation: no one is an expert in everything

Some people think that a developer in a DevOps team should be able to do a little bit of everything, but in our experience, this isn’t always the best idea. Yes, some people are multi-talented and skilled, but others may be skilled developers who are terrible at setting up infrastructure. Systems are only as good as their weakest link, and bad infrastructure equates to issues later on.

A case in point: developers are generally good at writing software that runs effectively at a small scale, say 250 units, whereas our organisation needs a solution that handles 250,000 units. By involving DevOps early in the design process, we make sure the scalability strategy is viable, for example by relying on parallel processing.

There are currently two schools of thought in the DevOps movement. On the one hand, you have developers and IT on the same team, forced to cooperate and work together, which is how our team currently works. When first deploying this model, there may be friction among members, but over time it disappears, and ultimately bringing different perspectives and experiences to the table is a recipe for success. On the other hand, you have a developer-centric model where developers do Ops as well. The latter works well in a startup with a lot of developers and smaller systems, but as a system becomes even a little more complex, it’s probably not the best idea to give people with no operational experience the responsibility of managing it.

In our team, some of us do a little bit of everything, while others are more specialised, and this works well for us. My recommended approach is to embed an infrastructure specialist, a QA engineer, and a build engineer into a dev team. This requires a lot of manpower and money – it is not a cheap option. However, if I were building a team from scratch and had the capital to support it, that is what I would suggest a business do. For ABAX, this arrangement allows us to deliver a complete QA environment in a matter of minutes. As one of Europe’s fastest-growing telematics providers, we need our systems to scale at a moment's notice, so that speed is especially important.

One of the big drawbacks of embedding DevOps team members into developer teams is that it’s easy to lose the day-to-day connection with your peers, which can cause DevOps practices to stagnate and diverge. We deliberately set aside time to sync up regularly, so that any team member can step in and cover for another without spending a lot of time getting up to speed. We also use GitHub as a single hub for our infrastructure as code.

#4 Communication is key

It is an ongoing lesson for us, and we are far from perfect, but we work on communication all the time.

There are a lot of moving parts in a business. We work closely with our ABAX architecture team to ensure we meet their exceptional performance standards. When you have 250,000 vehicles reporting into a system, with positioning messages coming in every second, stability is key.

We also follow the methodology of blameless postmortems. You don't learn from situations where people are trying to avoid being blamed for things that have gone wrong. Meetings need to be constructive, and when we experience an issue, we need to cover off what went right, what went wrong, how we can replicate what went right, and how to avoid what went wrong. This helps us ascertain what we had to do manually to get it right and map out a plan to automate it in the future.

I would say that DevOps is 40 per cent application of technology and 60 per cent communication. Communication helps you remove barriers between different groups with different interests in the organisation. Meetings that turn into blame games are pointless. We have literally asked people to leave meetings when they are too preoccupied with placing blame. In DevOps, you need to fixate on how to do things better.

#5 Don’t overthink and don’t overcomplicate

I am a great believer in the KISS principle (Keep It Simple, Stupid), despite the irony of saying so as an ardent Kubernetes fan.

Over the past two years, we have worked on a lot of initiatives using containers. Some have worked, and some haven't. Regardless, we have accepted the fact that not everything we work on will work the first time. So, we learn from our mistakes and carry on.

I will use an example. A couple of months ago, we were working on a project building Windows containers to run some of our legacy applications in Kubernetes. We hoped that if this worked, the task of rewriting those applications could be left for later, since doing so immediately would place unnecessary strain on our stretched, operations-heavy team. However, the project didn't pan out too well, primarily because we felt the technology was too unstable to use in production. Besides which, it was getting too complicated, so we tossed it out. Shortly afterwards, we started a completely new initiative and are now drawing on the lessons and mistakes from the original project.

No project or initiative is ever a total loss, even when it has failed. Our Windows project was no different: only 20 to 40 per cent of the code and scripts were OS-specific, so we were able to reuse the remaining 60 to 80 per cent directly on other projects. On top of that, building and deploying Windows containers taught us about the inherent limitations of working with containers. Those lessons are now speeding up new projects and significantly simplifying how we manage our containers.

Keeping things simple makes it easier to scavenge projects for later use, and it has the added benefit of making errors easier to spot down the road.

Failures also demonstrate how vital a feedback culture is. Always analyse projects, no matter the outcome. Feedback loops link projects together, carrying lessons from one to the next. DevOps needs to work closely with the infrastructure and development teams, and your feedback loop needs to include input from all parties; otherwise, no one in the process will know where a mistake was made.

You can cut down on overthinking and overcomplicating things if you give one another feedback. Always create a culture where wins are celebrated, as this motivates teams and encourages improvement, and that ultimately drives the success not just of projects, but of the business as well.

Thomas Ornell, IT infrastructure engineer, ABAX