NWO researcher develops a 'blacklist' of expressions

21 December 2009 NWO (Netherlands Organization for Scientific Research)

List helps computers understand expressions with more than one meaning

Computers might well be 'with it', but 'they haven't got a clue' about expressions. Dutch researcher Nicole Grégoire has come up with a solution to this problem: she has prepared a list of unpredictable word combinations that might, for instance, have a literal as well as a metaphorical meaning. The list is structured so that it can be used by many different computer systems. Now at last your car navigation system might one day understand that you really do want to 'throw it out of the window'.

The Dutch language has many combinations of words whose features cannot be explained by simply looking at the qualities of the individual words. The meaning of 'missing the boat', for instance, isn't always the same as 'being too late to catch the boat'. This type of word combination doesn't pose problems to people, but linguistic computer systems, such as speech recognition software or programmes that prepare automatic summaries, just don't recognise these expressions. This is because the meaning depends on the context: you can, of course, also literally miss the boat.

Grégoire prepared a list of about 5000 unpredictable word combinations. She divided them up into different classes on the basis of their structure. She also looked at the rules of singular and plural; for example, you can't 'take to your heel', just 'take to your heels', and 'take to those heels' doesn't work either. Grouping together various classes of word combinations can minimise the amount of manual work needed to incorporate the list into a computer system, and it means that the list can be used for many different systems.
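
The idea of grouping expressions into structural classes can be illustrated with a minimal Python sketch. It is not the format of Grégoire's list: the class labels (V_PREP_POSS_N, V_DET_N), the Entry record and the find_candidates function are all hypothetical, and the entries use the English examples from this article rather than Dutch ones. The sketch only shows how one matching rule per class might serve every expression in that class, and how a fixed plural such as 'heels' could be enforced.

```python
from dataclasses import dataclass

# Assumption: a small closed set of possessives is enough for this illustration.
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}

# Hypothetical structural classes. Each one describes the shape shared by many
# expressions: "verb" slots may inflect, "poss" slots accept any possessive,
# and "fixed" slots must appear exactly as listed (including their number).
PATTERN_CLASSES = {
    "V_PREP_POSS_N": ("verb", "fixed", "poss", "fixed"),
    "V_DET_N": ("verb", "fixed", "fixed"),
}


@dataclass(frozen=True)
class Entry:
    name: str
    pattern_class: str
    words: tuple  # one item per slot; verb slots hold a set of inflected forms


# Illustrative entries only, based on the examples given in the article.
LEXICON = [
    Entry(
        "take to one's heels",
        "V_PREP_POSS_N",
        (frozenset({"take", "takes", "took", "taken", "taking"}), "to", None, "heels"),
    ),
    Entry(
        "miss the boat",
        "V_DET_N",
        (frozenset({"miss", "misses", "missed", "missing"}), "the", "boat"),
    ),
]


def slot_matches(kind, word, token):
    """Check one token against one slot of a structural class."""
    if kind == "verb":
        return token in word          # any listed inflection is accepted
    if kind == "poss":
        return token in POSSESSIVES   # 'your', 'their', ... but not 'those'
    return token == word              # fixed slot: exact form, e.g. plural 'heels'


def find_candidates(sentence, lexicon=LEXICON):
    """Return the names of listed expressions that occur in the sentence."""
    tokens = [t.strip(".,!?;:'\"").lower() for t in sentence.split()]
    hits = []
    for entry in lexicon:
        kinds = PATTERN_CLASSES[entry.pattern_class]
        n = len(kinds)
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            if all(slot_matches(k, w, t) for k, w, t in zip(kinds, entry.words, window)):
                hits.append(entry.name)
                break
    return hits
```

Under these assumptions, find_candidates("They took to their heels.") reports the expression, while 'take to your heel' and 'take to those heels' are rejected. A match only flags a candidate: whether 'missed the boat' is meant literally or figuratively still depends on the context.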

Nicole Grégoire undertook part of her research within STEVIN, a long-term research and stimulation programme for Dutch language and speech technology, jointly financed by the Flemish and Dutch governments (Ministry of Education, Culture & Science, NWO and Ministry of Economic Affairs). The aim of the programme is to increase the innovative capacity of this sector while at the same time enhancing the position of Dutch in the modern world of information and communication. Grégoire's database is being distributed under the name 'DuELME' by the Centrale voor Taal- en Spraaktechnologie [Central Distribution Centre for Language and Speech Technology].

