How to get the most out of machine learning systems

The father of modern speech recognition, Frederick Jelinek, once famously said: 'Anytime a linguist leaves the group, the recognition rate goes up.' By this logic, if the domain experts – the phonologists, in his case – were replaced by pure engineers, the performance of the system would improve.

Would this theory apply to a system that relies heavily on machine learning? Do domain experts improve its performance, or is the system better off without them?

In a highly specialised domain such as the legal arena, where tasks are clear and well defined, technology is provided to support, augment and increase productivity. It is often the case that both supervised machine learning techniques (where labelled data is available) and unsupervised machine learning techniques (which work from raw data alone) will be used.

In the supervised setting, there is a task at hand and labelled data with which to train a machine learning system. This applies to many highly specialised domains where the data is unique, and the importance of data modelling and of subject matter experts should not be overlooked.
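
To make this concrete, here is a minimal sketch of what supervised learning on labelled legal text could look like, assuming scikit-learn and a handful of invented clause snippets. In practice the labelled data would come from subject matter experts and number in the thousands.

```python
# A minimal sketch of supervised learning on labelled legal text,
# using scikit-learn. The clause snippets and labels are invented
# purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled data: clause text paired with a clause type.
texts = [
    "This agreement shall terminate upon thirty days written notice.",
    "The supplier shall indemnify the buyer against all third-party claims.",
    "Either party may terminate for material breach.",
    "Liability under this clause is capped at the fees paid.",
]
labels = ["termination", "indemnity", "termination", "liability"]

# Train a simple text classifier: TF-IDF features plus logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Predict the clause type of an unseen sentence.
print(model.predict(["The buyer shall be indemnified for any losses."]))
```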

A system to scale up testing and training

It is often assumed that only giants such as Google or Facebook can afford the type of systems that can scale up testing and training. This is not the case, and it can pay dividends to have an internal system in place that allows engineers to quickly and easily test new hypotheses or implement new algorithms, ranging from simple Bayesian classifiers to more time-consuming deep learning models.
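
As an illustration, a lightweight experiment harness can be as simple as running every candidate model through the same cross-validation protocol. The sketch below assumes scikit-learn and uses a tiny invented dataset; the point is that swapping in a new algorithm costs a single line.

```python
# A minimal sketch of an internal experiment harness: swap in a new
# model, run the same evaluation, compare the scores. The data and
# the candidate models are illustrative, not a recommendation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled snippets (in practice, thousands of documents).
texts = [
    "terminate upon notice", "termination for convenience",
    "terminate for material breach", "notice of termination",
    "indemnify against claims", "indemnification obligations",
    "hold harmless and indemnify", "indemnity for losses",
] * 3
labels = (["termination"] * 4 + ["indemnity"] * 4) * 3

candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": SGDClassifier(loss="hinge", random_state=0),
}

# The same cross-validation protocol for every candidate makes the
# comparison cheap and repeatable.
for name, clf in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```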

On the topic of narrow domains, Andrew Ng, the chief scientist of Baidu, recently said: 'Most of the value of deep learning today is in narrow domains where you can get a lot of data.'

Data model and ontology

Earlier this year, Microsoft launched the disastrous chatbot Tay, which was supposed to be a clever experiment in artificial intelligence and machine learning. The bot was designed to 'engage and entertain people where they connect with each other online through casual and playful conversation', learning from the people it interacted with on social media. Within 24 hours the chatbot had been taken offline: innocent if awkward greetings such as 'humans are super cool!' had given way to far more concerning statements.

The moral of the story is to always be aware of the nature of your data. The most crucial factor in the success of a machine learning system is having an accurate model and a holistic view of the data you work with. This will, to a great extent, govern the choice of algorithms, the performance metrics of the system, and how users perceive it.

Therefore, a good ontology of the domain in which you work will immensely improve how you solve the task at hand.
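
One way to make such an ontology operational is to encode it explicitly and let it govern the label set used for annotation and training. The sketch below is illustrative only; the clause types and their grouping are invented.

```python
# A minimal sketch of making the domain ontology explicit in code,
# so the labels used for training are constrained by it.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ClauseType:
    name: str
    parent: str | None = None
    synonyms: list[str] = field(default_factory=list)

# Hypothetical legal ontology: clause types grouped under broader categories.
ONTOLOGY = {
    "termination": ClauseType("termination", parent="lifecycle",
                              synonyms=["termination for convenience"]),
    "indemnity": ClauseType("indemnity", parent="risk",
                            synonyms=["hold harmless"]),
    "liability_cap": ClauseType("liability_cap", parent="risk"),
}

def validate_label(label: str) -> str:
    """Reject labels that are not defined in the domain ontology."""
    if label not in ONTOLOGY:
        raise ValueError(f"Unknown clause type: {label!r}")
    return label

validate_label("indemnity")   # accepted
# validate_label("misc")      # would raise ValueError
```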

Error handling

The phrase 'garbage in, garbage out' has almost become synonymous with the world of supervised machine learning systems. Achieving great results is impossible when the data is disordered, inconsistently marked up or highly noisy.

The best way to get the data right from the start is to rely on subject matter experts throughout the preparation, correction and improvement of the data. This ensures there is a way to handle errors and mistakes in the pool of data, which in turn will result in a far greater level of performance.
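
A simple, concrete form this can take is a set of routine checks that flag suspect records for expert review before training. The sketch below uses invented records and checks only for missing and conflicting labels; a real pipeline would go further.

```python
# A minimal sketch of pre-training data checks: flag conflicting or
# missing labels so subject matter experts can review them. The
# records below are invented for illustration.
from collections import defaultdict

records = [
    {"text": "Either party may terminate for breach.", "label": "termination"},
    {"text": "Either party may terminate for breach.", "label": "liability"},
    {"text": "The supplier shall indemnify the buyer.", "label": "indemnity"},
    {"text": "Payment is due within 30 days.", "label": ""},
]

# Group labels by text to spot the same snippet annotated differently.
labels_by_text = defaultdict(set)
for rec in records:
    if rec["label"]:
        labels_by_text[rec["text"]].add(rec["label"])
    else:
        print(f"MISSING LABEL -> SME review: {rec['text']!r}")

for text, labels in labels_by_text.items():
    if len(labels) > 1:
        print(f"CONFLICT {sorted(labels)} -> SME review: {text!r}")
```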

Evaluation

It is important to have all bases covered. There needs to be a framework in place for evaluating the models. What is the ground truth? How do you ensure that the machine learning system is actually solving the problem? These are the types of questions that need to be asked. Generally, this information is available to engineers, but some of these questions and their answers will also surface for the real users.
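
As a minimal sketch, assuming scikit-learn and an expert-approved ground-truth set, the evaluation step can be as simple as comparing predictions against that set with per-class metrics.

```python
# A minimal sketch of evaluating predictions against a ground truth
# that subject matter experts have signed off on. The labels below
# are invented for illustration.
from sklearn.metrics import classification_report

ground_truth = ["termination", "indemnity", "termination", "liability",
                "indemnity", "termination"]
predictions  = ["termination", "indemnity", "liability", "liability",
                "termination", "termination"]

# Per-class precision, recall and F1 give engineers the detail they
# need; a single headline figure may be all that end users see.
print(classification_report(ground_truth, predictions, zero_division=0))
```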

Visualisation and user interface

Different levels of visualisation should be put in place for different types of users. Not every person needs to know every aspect of the system's performance and results, and this information can be adapted to the needs of different target groups. The success of the product will depend on how effectively the concerns of those groups are anticipated, understood and addressed.

By creating a domain/data model and involving subject matter experts throughout the development process, you give the machine learning system the best possible chance of success. Domain experts should see the product not as a threat, but as a natural extension of their abilities. Human expertise is not something that can be substituted.

Svetoslav Marinov, Head of the Gothenburg Machine Learning Team at Seal Software