Synonym discovery is important in a wide variety of concept-related tasks, such as entity/concept mining and industrial knowledge graph (KG) construction. It aims to determine whether two terms refer to the same concept. Existing methods rely on contexts or KGs, but they have two shortcomings: 1) relying solely on pre-trained embeddings easily produces false positive predictions, because correlated yet non-synonymous terms (such as antonyms) often share the same or similar contexts, so the pre-trained embeddings of correlated terms tend to be similar and are hard to distinguish; 2) KGs and contexts may be unavailable in some domains, hindering these models from generalizing to such fields.
To solve these problems, a research team led by Nan ZHENG published their new research on 15 June 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team proposed a context-free prompt learning model named ProSyno. To address the two aforementioned challenges, domain-independent word descriptions from Wiktionary are introduced into ProSyno as a semantic source. The rationale is twofold: 1) word descriptions in Wiktionary contain informative semantics that help distinguish highly correlated term pairs; 2) Wiktionary is the world's largest freely available dictionary, and its broad coverage ensures the model's capacity to transfer to various domains. Figure 1 depicts an example showing how a word description helps distinguish synonyms: the first description of "crabby" contains the word "irritable", which is highly correlated with the target term "feeling irritable", so the synonym relation between the term pair can be discovered easily. Specifically, a hierarchical semantic encoder is designed to extract semantic representations of words. However, a target word usually has several descriptions in Wiktionary. To obtain informative word representations from multiple descriptions, a dynamic matching mechanism is designed to weigh each description, and the descriptions of the word are then fused according to their matching degrees. To transfer knowledge from the foundation model to the synonym detection task, the team employs prompt learning to train the model. Prompt learning aligns the synonym discovery task with the pre-training task by converting inputs into an ordered sequence that the PLM can process, enabling the model to better leverage knowledge learned from large-scale data. Experimental results on four benchmarks demonstrate the effectiveness of ProSyno.
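The weighting-and-fusing step of the dynamic matching mechanism can be sketched in a few lines. This is a hypothetical illustration, not ProSyno's learned encoder: `fuse_descriptions`, the dot-product scoring, and the softmax weighting are assumptions that stand in for the model's trained matching function.

```python
import numpy as np

def fuse_descriptions(term_vec, desc_vecs):
    """Fuse several description vectors of one word into a single
    representation: score each description against the target term,
    softmax-normalize the scores into matching degrees, and combine
    the descriptions as a weighted sum. (Illustrative sketch only;
    the real matching function in ProSyno is learned.)"""
    scores = desc_vecs @ term_vec                     # one score per description
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # matching degrees
    return weights @ desc_vecs                        # weighted fusion
```

With this weighting, a description whose vector aligns with the target term dominates the fused representation, which is the intuition behind preferring the "irritable" sense of "crabby" in the Figure 1 example.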
Figure 2(a) shows the architecture of ProSyno, which consists of a hierarchical semantic encoder and a pattern mapper. The hierarchical semantic encoder encodes word descriptions from Wiktionary to obtain semantic representations of the target terms. The pattern mapper exploits large PLMs to determine the synonym relation between a concept term pair by wrapping the pair and their semantic representations into an ordered sequence that the PLM can process.
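The "ordered sequence" produced by the pattern mapper can be sketched as a prompt template. The template below, with soft prompt placeholders `[P1]`..`[Pn]` and a `[MASK]` token the PLM fills with a label word, is an illustrative assumption and not ProSyno's exact pattern.

```python
def build_prompt(term_a, term_b, n_prompt_tokens=4):
    """Wrap a term pair into a BERT-style input sequence.
    [P1]..[Pn] are placeholders for learnable soft-prompt tokens and
    [MASK] is the slot the PLM fills with a verbalizer word such as
    "synonym". (Hypothetical template for illustration only.)"""
    prompts = " ".join(f"[P{i + 1}]" for i in range(n_prompt_tokens))
    return f"[CLS] {term_a} [SEP] {term_b} [SEP] {prompts} [MASK] [SEP]"
```

For example, `build_prompt("crabby", "feeling irritable")` yields a single sequence containing both terms, the prompt slots, and the mask, which is the form a masked-language PLM was pre-trained to complete.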
In the research, they analyze the benefits reaped by prompts and test ProSyno with different semantic encoders. To further investigate why the dynamic matching mechanism works, they perform micro-level case studies. They compare random initialization with manual initialization of the task-oriented prompts: the former initializes the prompts randomly, sampling from a zero-mean Gaussian distribution with a standard deviation of 0.02, while the latter uses the embedding of "synonym" to initialize them. To study the effectiveness of different PLMs, they compare ProSyno with ProSyno-BERT, which replaces BioBERT with BERT-base. Besides, to analyze generalization to other datasets, they train ProSyno on one medical dataset to obtain ProSyno-AA, which is then employed to make predictions on the other two medical datasets and one general dataset without additional fine-tuning.
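The two prompt-initialization schemes compared above can be sketched as follows. The embedding dimension (768) and the number of prompt tokens (4) are assumptions for illustration, and the "synonym" embedding here is a placeholder vector rather than a real PLM embedding lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tokens = 768, 4  # assumed PLM hidden size and prompt length

# Random initialization: sample each prompt vector from a zero-mean
# Gaussian with standard deviation 0.02, as described in the study.
random_init = rng.normal(loc=0.0, scale=0.02, size=(n_tokens, dim))

# Manual initialization: copy the embedding of the word "synonym"
# into every prompt slot. In practice this vector would come from
# the PLM's embedding table; here it is a stand-in random vector.
synonym_embedding = rng.normal(size=dim)
manual_init = np.tile(synonym_embedding, (n_tokens, 1))
```

The design choice being probed is whether seeding the prompts with a task-related word gives the model a better starting point than uninformed noise.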
Future work can focus on exploring large language models, such as GPTs, to solve this task.
DOI: 10.1007/s11704-024-3900-z