DAVID MILWARD says, "Understanding the full meaning of language is still very challenging." He should know, as he's spent 25 years working on it, from Cambridge undergraduate to his current job as CTO of the 10-year-old British start-up Linguamatics.
It's not hard to explain what Linguamatics does. Its software extracts knowledge from unstructured text. What's difficult is to explain why it's different. Isn't that what a search engine does?
"With a traditional search engine you have to program in the things you're looking for," says Milward. "It's not very agile. We wanted to provide an agile system with Linguamatics so people could ask any question."
Milward became interested in natural language processing in 1986, when one of his early projects as a Cambridge computer science undergraduate was a question-answering system. After finishing his PhD he worked in Edinburgh on grammar, and on how to take in text, understand its structure, and come up with a meaningful representation as people spoke. At SRI International, back in Cambridge, he worked on applying text mining to the life sciences before co-founding Linguamatics with Roger Hale.
"Organisations are becoming more and more knowledge-driven," he says. "Similarly to scientific discovery, they build new things based on existing knowledge."
As new discoveries increasingly require cross-disciplinary work, "there's a danger of getting overwhelmed, so being able to do things in a more automated fashion becomes important."
Pharmaceutical companies were early adopters because the industry is so knowledge-driven; Linguamatics is also active in business intelligence. During the last election it mined 187,000 Twitter reactions to the third debate overnight to show each leader's approval rating.
"We found that although people don't use fully grammatical sentences, they do use grammatical constructions," says Milward. That relatively small set of linguistic patterns was enough for the company to identify what was being said.
To add to the challenge, humans use multiple words for the same concept, and dozens of acronyms must be disambiguated. The mass of journal papers, patent applications, company communications, regulatory filings, and proprietary drug testing results all vary in linguistic structure. But Milward's system sees the human use of "carcinoma", "tumour", and "neoplasm" and links them all to cancer. The result is the ability to ask a question like, "What genes are associated with breast cancer?" and get back a list of genes rather than a list of documents.
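For readers who want a concrete picture, here is a toy sketch of the idea in Python. It is not Linguamatics' method or software: the synonym table, the single "is associated with" pattern, the example sentences and the function names are all assumptions made up for illustration. It maps several words to one concept and returns gene names rather than documents.

```python
import re

# Hypothetical synonym table: several surface words map to one concept.
SYNONYMS = {
    "carcinoma": "cancer",
    "tumour": "cancer",
    "neoplasm": "cancer",
}

# Toy linguistic pattern (an assumption): "<GENE> is associated with <phrase>".
# Gene symbols are assumed to be runs of capital letters and digits.
PATTERN = re.compile(r"\b([A-Z][A-Z0-9]+) is associated with ([a-z ]+)")


def normalise(phrase):
    """Lower-case a phrase and replace each known synonym with its concept."""
    return " ".join(SYNONYMS.get(word, word) for word in phrase.lower().split())


def genes_for(concept, sentences):
    """Return gene symbols linked to the concept, in order of first mention."""
    found = []
    for sentence in sentences:
        for gene, phrase in PATTERN.findall(sentence):
            if concept in normalise(phrase) and gene not in found:
                found.append(gene)
    return found


# Illustrative sentences written for this example, not real mined text.
SENTENCES = [
    "BRCA1 is associated with breast carcinoma.",
    "BRCA2 is associated with breast neoplasm.",
    "CFTR is associated with cystic fibrosis.",
]

print(genes_for("breast cancer", SENTENCES))
```

The query names "breast cancer", yet neither matching sentence uses the word "cancer"; the synonym step is what connects them. A real system would need far more patterns, plus the acronym disambiguation the article mentions.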
"We're still trying to find areas where the machine can do well," he says. "We now have areas where we can get these precise answers to questions, but that doesn't mean we're saying we're replacing the scientist who can understand the context. We're just trying to give more aid to them."