
Six NLP trends technologists need to know

(Image credit: Flickr / Daniel Gasienica)

For the last several years, natural language processing (NLP) has taken the enterprise by storm. Its uses across industries, company sizes, and geographies have expanded rapidly, and for good reason. NLP can power customer service chatbots in retail, read and write news for the financial services industry, and glean important insights about patient populations in a healthcare setting, among other tasks. But despite all the benefits of NLP, we’re just beginning to tap its full potential.

As the adoption curve rises, it’s important to understand the trends fueling the NLP fire, and new research from Gradient Flow aims to do that. Conducted in collaboration with John Snow Labs, the new global survey offers a detailed analysis of the NLP technologies businesses are implementing, along with budgets, trends, widely used tools and cloud platforms, and use cases. Now in its second year, the research provides a benchmark to measure where we are and where we’re headed with regard to NLP.

The survey compared organizations with years of experience deploying NLP applications in production against those still exploring NLP, and responses from Technical Leaders against those of general practitioners, among other contrasting factors. Several key findings emerged. Here are the six main NLP trends technologists should keep in mind and why they matter.

NLP budgets remain robust: 60 percent of Tech Leaders indicated that their NLP budgets grew by at least 10 percent compared to 2020. A third (33 percent) of Tech Leaders indicated that their NLP budgets grew by at least 30 percent, and 15 percent reported their NLP budgets have more than doubled. The Covid-19 pandemic had major implications for IT investments last year, as many organizations focused on mission-critical technology and staying afloat during uncertain times. Consistently rising NLP budgets across the board indicate not only the health of the tech industry overall, but also a renewed focus on new tools and innovations to help propel NLP forward.

NER and document classification are the most popular applications of NLP: Tech Leaders singled out named entity recognition (NER) and document classification as the primary use cases for NLP. Another use case, entity linking / knowledge graphs, is gaining importance due to the rise of artificial intelligence (AI). Looking ahead, we can expect growth in Q&A and natural language generation use cases powered by large language prediction models and related open-source alternatives. De-identification is another use case that’s popular among highly regulated industries, such as healthcare and financial services, and will likely gain steam as businesses develop better data privacy practices.

Accuracy and customizability are key priorities: All users want high-accuracy tools that are easy to tune and customize. Tech Leaders echoed this sentiment, naming accuracy, followed by production readiness and scalability, as vital to NLP solutions. Because NLP projects involve pipelines, where the results from a previous task and pre-trained model are used downstream, accuracy is extremely important: an error made early in the pipeline carries through to every later step. Experienced users of NLP tools and libraries understand that they often need to tune and customize models for their specific domains and applications. For example, an NER model trained on news and media sources is likely to perform poorly when used in specific areas of healthcare or financial services. Essentially, NLP is not a one-size-fits-all technology.
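To see why accuracy matters so much in a pipeline, consider a rough back-of-the-envelope sketch. It assumes each stage makes errors independently of the others, which real pipelines rarely satisfy exactly, and the stage names and accuracy figures below are hypothetical illustrations, not survey data. Under that assumption, the chance that a document passes through every stage correctly is the product of the per-stage accuracies:

```python
from math import prod


def pipeline_accuracy(stage_accuracies):
    """Estimate end-to-end accuracy as the product of per-stage accuracies.

    Assumes errors at each stage are independent (a simplification).
    """
    return prod(stage_accuracies)


# Hypothetical per-stage accuracies, chosen only for illustration.
stages = {
    "tokenization": 0.99,
    "ner": 0.92,
    "entity_linking": 0.90,
}

end_to_end = pipeline_accuracy(stages.values())
print(f"Estimated end-to-end accuracy: {end_to_end:.3f}")  # prints 0.820
```

Even though every individual stage looks strong, the combined estimate falls to roughly 82 percent, which is one way to read the survey's emphasis on accuracy and domain-specific tuning.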

NLP libraries are gaining traction: As with last year’s survey, Spark NLP remains the most popular library, used by 31 percent of respondents and 41 percent of Tech Leaders. More than half (53 percent) stated they used at least one of the following NLP libraries popular within the Python ecosystem: Hugging Face, spaCy, Natural Language Toolkit (NLTK), Gensim, or Flair. Spark NLP was the most cited in both the healthcare and financial services industries, likely due to its focus on healthcare-specific models and its lack of data-sharing requirements, which is attractive to organizations that must adhere to strict laws and regulations regarding user data.

NLP cloud services are widely used, but cost-prohibitive: A large majority (83 percent) of respondents stated that they used at least one of the following NLP cloud services: AWS Comprehend, Azure Text Analytics, Google Cloud Natural Language AI, or IBM Watson NLU. Among mature-stage companies, meaning those that have had NLP models in production for at least two years, 78 percent stated they use at least one of the aforementioned NLP cloud services, with Google Cloud NLP ranking as the most used. Popularity and accessibility aside, Tech Leaders cited difficulty in tuning models and cost as the top two challenges when using NLP cloud services.

Text fields in databases, files, and online content are the data sources fueling NLP: The top three data sources for NLP projects are text fields in databases, files (PDFs, docx, etc.), and online content. When asked how they generate labeled data, about a third of Tech Leaders (32 percent) and about a third of respondents at organizations with a Mature NLP practice (35 percent) indicated that they use a text annotation solution. Another fifth of respondents from these two segments outsource data labeling to a dedicated team or service. These findings are consistent with last year’s survey results, and these are likely to remain key sources of data in years to come, especially for organizations just getting started with NLP.

While the findings from the 2021 NLP Survey were similar to the results of last year’s, it will be interesting to see how growth continues as the post-pandemic economy recovers along with IT spending. Despite mature organizations leading the way, it's likely the adoption curve will continue to rise, as more tools become available, lowering the barriers to entry. As usage widens, so too will the benefits NLP can bring to the enterprise and beyond.

Ben Lorica, NLP Industry Survey Co-Author, External Program Chair, Healthcare NLP Summit

Ben Lorica is the Chief Data Scientist at O'Reilly Media, and is the Program Director of both the Strata Data Conference and the Artificial Intelligence Conference.