Machine Learning is often presented as the cutting edge of what we can currently achieve with technology. A lot of innovation and progress is coming from the fields of data science and machine learning, but much of the language around the topic is filled with jargon and aimed at an expert audience. As such, the whole concept can feel quite opaque. Having built the machine learning algorithm behind Infogrid’s smart building platform, I can answer one of the foundational questions: what do Machine Learning algorithms learn from, and what is their first lesson?
A Machine Learning platform is a bit like a car’s engine, with the pistons replaced by algorithms. No matter how good that engine is, it can’t run without fuel. For a Machine Learning platform, that fuel is data. When you are starting from scratch you need to create what is called ‘ground truth’: the core dataset on which everything else is based, or against which it is checked. Without a robust ground truth you won’t be able to trust the outputs of your engine; the quality of the ‘fuel’ is crucial. This is why the first lesson for an ML algorithm is always about developing an understanding of the world through observation, measurement and collection of real-world data. This grounds the whole algorithm in reality and allows it to extrapolate.
How do you create a ‘ground truth’?
There are a few ways to get the ground truth for your system. In some scenarios you may be able to find an existing dataset already available for free in the public domain or one that can be purchased. Some companies have already collected a lot of real-world data on their customers as part of the normal operation of their business. For example, a supermarket will have in-depth information on the shopping habits of members of their loyalty scheme. They could use that data to run a machine learning platform that, in theory, provides better deal recommendations, or delivers insights on changing trends in customer behavior. What do you do in a scenario where you are developing a ML platform where the data isn’t readily available? In that case, you have to run experiments to create the data yourself. This really puts the ‘science’ into data science and may come as a surprise to people who think you must be stuck behind a computer all day to create anything called an algorithm.
Infogrid provides a smart building platform which can automate a range of extremely time-intensive tasks, from checking air quality and virus risk in office spaces to monitoring for legionella risk in water pipes. The breadth of what Infogrid can do means that we have had to create more than one ground truth dataset.
The first dataset we needed to collect was based on understanding how people use offices. This was a critical first step, as we needed this data before we could provide any analysis of our customers’ offices. So, we installed a wide range of sensors in our own offices to create the ‘ground truth’ dataset based on how our team was using the workplace. For example, to collect data around desk occupancy we put a pressure sensor in each seat and a temperature sensor under the desk. This dual-sensor data collection let us understand how often people were sitting at their desks while weeding out any scenarios where a bag or a box was left on a chair. With two sets of sensors in action, we were also able to find more stories in the data than we had initially expected. The temperature sensors let us understand which areas within the office got hotter when bright sun came in through the windows and which remained cooler. With this data, we could figure out when we should roll down blinds and reduce our own heating and air con costs. All the data is anonymized and generalized and is used to give the algorithm a core truth of how people use an office, not to keep tabs on our own staff!
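To make the dual-sensor idea concrete, here is a minimal sketch of how readings from the two sensors might be combined. The function name and threshold values are illustrative assumptions, not Infogrid’s actual pipeline:

```python
# Hypothetical sketch: combine a seat pressure reading with an
# under-desk temperature reading to decide whether a desk is
# genuinely occupied. All thresholds are illustrative only.

PRESSURE_THRESHOLD = 5.0   # kg-equivalent load on the seat sensor
WARM_BODY_DELTA = 1.5      # degrees C above ambient under the desk

def desk_occupied(seat_pressure: float,
                  under_desk_temp: float,
                  ambient_temp: float) -> bool:
    """Return True only if both sensors agree someone is present.

    A bag or box left on the chair triggers the pressure sensor
    but produces no body heat, so it is filtered out here.
    """
    pressure_ok = seat_pressure > PRESSURE_THRESHOLD
    warmth_ok = (under_desk_temp - ambient_temp) > WARM_BODY_DELTA
    return pressure_ok and warmth_ok

# A person: both pressure and warmth are present.
print(desk_occupied(seat_pressure=60.0, under_desk_temp=24.0, ambient_temp=21.0))  # True
# A heavy bag: pressure, but no warmth.
print(desk_occupied(seat_pressure=8.0, under_desk_temp=21.2, ambient_temp=21.0))   # False
```

Requiring both signals to agree is what lets the labels in the ground truth dataset stay clean: either sensor alone would mislabel some situations.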
As Infogrid grows so do our capabilities, best seen in our development of legionella compliance. Legionella, for those of you who aren’t familiar, is a deadly pathogen that multiplies very quickly in warm, stagnant water. This kind of environment is often found in poorly maintained or under-used warm water plumbing. That's why facilities managers and building supervisors need to ensure that all hot water taps are regularly flushed and that the temperature of the hot water system remains above 47 degrees Celsius. Traditionally, legionella checks are performed manually: someone has to run each hot water tap in a building for around five minutes and measure the water temperature. This takes a lot of time and wastes a lot of water. It’s a process that was ripe for automation.
The way we use ML for this activity is slightly different to the desk occupancy example above. The aim has been to reduce cost and complexity: using a single heat sensor, we record the water temperature and work out when the tap was last used.
To achieve this we had to create a ground truth of how water temperature changes in a pipe when it is used. We again turned to real-world experiments and installed automatic tap controls in our office. This way we could tell when a tap was opened and for how long, and track the heat changes that took place when it was turned on. From this ground truth our ML algorithm can now tell, from just a heat sensor, when a tap was last used. Pretty neat!
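The core intuition can be sketched in a few lines: when a hot tap opens, the pipe temperature jumps sharply, so the most recent jump marks the last use. The threshold, the function, and the sample data below are illustrative assumptions, not the actual model trained on our ground truth:

```python
# Hypothetical sketch: infer the most recent tap use from a series
# of pipe-temperature readings. A sharp rise between consecutive
# readings is treated as the tap being opened. The threshold and
# the sample data are illustrative only.

RISE_THRESHOLD = 3.0  # degrees C between consecutive readings

def last_tap_use(timestamps, temps):
    """Return the timestamp of the most recent sharp temperature
    rise, or None if no tap use appears in the window."""
    last_use = None
    for i in range(1, len(temps)):
        if temps[i] - temps[i - 1] >= RISE_THRESHOLD:
            last_use = timestamps[i]
    return last_use

# One reading every 5 minutes; the tap is opened at minute 15.
timestamps = [0, 5, 10, 15, 20, 25]
temps = [21.0, 21.1, 21.0, 48.5, 47.9, 35.0]
print(last_tap_use(timestamps, temps))  # 15
```

A real system has to cope with noise, slow warm-ups and boiler cycles, which is exactly why the experimentally collected ground truth matters: it tells you what genuine tap events actually look like in the temperature trace.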
When you are doing these kinds of experiments, you need to be flexible and work out solutions to problems that you didn’t foresee in initial planning. For example, when installing a heat sensor you have to be mindful of how close to the boiler the sensor will be. If you are too close then the heat will likely be conducted along the metal from the boiler, rather than from the water within the pipe. There are a few ways you can mitigate an issue like this. The easiest thing to do is to move the sensors further away. You could also set up your system so that if a sensor has to be placed near the boiler you can tell the platform to account for the resulting heat disparity. Ultimately you want a system that can figure out whether a sensor is near the boiler and adjust accordingly without the need for human input. We are not quite there yet, but we’re close!
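As a toy illustration of the second mitigation, a correction for boiler proximity could be as simple as subtracting an estimated conduction bias that shrinks with distance. The function, coefficients and cut-off below are made-up assumptions for the sake of the example, not a real calibration:

```python
# Hypothetical sketch: compensate for heat conducted along the metal
# pipe when a sensor sits close to the boiler. The linear bias model
# and its coefficients are illustrative assumptions only.

def corrected_temp(raw_temp: float,
                   distance_to_boiler_m: float,
                   conduction_coeff: float = 4.0,
                   reference_distance_m: float = 2.0) -> float:
    """Subtract an estimated conduction bias that shrinks linearly
    with distance; beyond the reference distance we assume the
    conduction effect is negligible."""
    if distance_to_boiler_m >= reference_distance_m:
        return raw_temp
    bias = conduction_coeff * (1 - distance_to_boiler_m / reference_distance_m)
    return raw_temp - bias

# A sensor 0.5 m from the boiler reads high; remove the estimated bias.
print(corrected_temp(52.0, 0.5))  # 49.0
# A sensor 3 m away needs no correction.
print(corrected_temp(52.0, 3.0))  # 52.0
```

The fully automatic version mentioned above would learn whether such a correction is needed, and how big it should be, from the data itself rather than from a hand-entered distance.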
In the end, it is important to be scientific and rigorous when collecting ground truth datasets. We collect data across multiple sources, which gives us real confidence in its quality. It is also an ever-evolving process. As our platform expands, our existing ground truths increase in accuracy and complexity. And as we move into offering new services, we add new ground truths to our library. A lot of creativity and real-world testing goes into developing machine learning platforms, and you have to be conscious of the limitations of the data you collect, constantly working to ensure any weaknesses are bolstered over time.
The proof, ultimately, is in the pudding. If your ML system is doing what you intended it to do, then your ground truth is probably accurate enough. If you get odd outputs, it could be a warning sign that you need to go back to the drawing board and collect your ground data from scratch. An engine is only as good as the fuel you put in it, and that is just as true of an ML platform. Data scientists need to be rigorous about the data they use to build their systems, otherwise all outputs are compromised. Put the time into getting the ground truth right and you will be rewarded with a shiny machine learning platform!
Roger Nolan, CTO, Infogrid