A large amount of information is available in text documents but is difficult for computer programs to access. To detect complex information, it is often important to understand the relationships between words and entities in sentences. A relation can express, for instance, that a disease has a particular finding or that a drug can be used to treat a disease. An example of a relation is given in Figure 1. Relation extraction addresses the task of detecting relationships between entities in natural language text.
BACKGROUND: Insulin has traditionally been viewed as a last resort in the treatment of type 2 diabetes. (PMID=17257474)
Figure 1: Example sentence expressing the relation may-treat between insulin and type 2 diabetes.
Supervised learning approaches are the most successful at this task, as shown by various shared tasks and competitions (see e.g. (Segura-Bedmar et al., 2013)). Supervised learning is a general term for machine learning methods that use a fixed set of manually labelled instances (usually consisting of positive and negative examples) to train a classifier. Depending on the use case, different features are extracted from the training data to train the classifier. Features can be, for example, the words or part-of-speech tags around the two entities, or the dependency tree path between the entities. In many cases, supervised learning methods provide better results when a larger set of training examples is used. However, training data is not always available. Moreover, generating manually labelled data is usually time-consuming and expensive, and depending on the domain it may even require expert knowledge.
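To make the idea of context features more concrete, the following is a minimal sketch of a bag-of-words extractor for the words before, between, and after two entity mentions. The function names, the feature prefixes and the window size are assumptions made for illustration; this is not the feature set of any particular system, and part-of-speech tags and dependency paths are left out because they would require a parser.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def context_features(sentence, entity_a, entity_b, window=2):
    """Return word features before, between and after two entity mentions.

    Returns None if either entity is not found in the sentence.
    The feature naming scheme is a hypothetical choice for this sketch.
    """
    tokens = tokenize(sentence)

    def find(mention):
        # Locate the first occurrence of the mention as a token sequence.
        sub = tokenize(mention)
        for i in range(len(tokens) - len(sub) + 1):
            if tokens[i:i + len(sub)] == sub:
                return i, i + len(sub)
        return None

    span_a, span_b = find(entity_a), find(entity_b)
    if span_a is None or span_b is None:
        return None

    # Order the two spans by their position in the sentence.
    first, second = sorted([span_a, span_b])
    features = set()
    for tok in tokens[max(0, first[0] - window):first[0]]:
        features.add("before=" + tok)
    for tok in tokens[first[1]:second[0]]:
        features.add("between=" + tok)
    for tok in tokens[second[1]:second[1] + window]:
        features.add("after=" + tok)
    return features


sentence = ("Insulin has traditionally been viewed as a last resort "
            "in the treatment of type 2 diabetes.")
features = context_features(sentence, "insulin", "type 2 diabetes")
```

For the sentence from Figure 1, all features fall into the "between" group (for example `between=treatment`), because the two entities sit at the very start and end of the sentence.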
Distant supervision is a technique to overcome this problem by generating positive and negative training data automatically. Those instances are then used as input to train a relational classifier. According to Mintz et al. (2009) distant supervision is defined as follows:
“The distant supervision assumption is that if two entities participate in a relation, any sentence that contain those two entities might express that relation.”
Using a set of known facts for a relation (e.g. Aspirin may treat pain), the approach searches for sentences containing the entities of these facts and labels them with the relation of interest. In most cases, the set of known facts is taken from a relational knowledge base such as Freebase or UMLS. An excerpt of the UMLS relation may-treat is given in Table 1. In contrast to the sentence in Figure 1, however, automatically (distantly) labelled sentences might carry false labels, as the example in Figure 2 shows. According to Table 1, insulin can be used to treat type 2 diabetes, but the sentence in Figure 2 expresses something different. Nonetheless, distant supervision is able to generate large training data sets, and classifiers trained on this data provide reasonable results compared to classifiers trained on manually labelled data (see e.g. (Thomas et al. 2012)).
|Type 2 Diabetes
|Sulconazole Nitrate Cream
Table 1: Excerpt of the UMLS may-treat relation
Reactive species and early manifestation of insulin resistance in type 2 diabetes. (PMID=16448517)
Figure 2: Distantly labelled sentence, false positive
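The labelling step can be sketched in a few lines. The code below is a simplified illustration that assumes entities are matched by plain lowercase string matching; real systems typically rely on entity recognition and concept normalisation instead. The function names and the tuple layout of the output are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and join its alphanumeric tokens with single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def distant_label(sentences, known_facts, relation):
    """Label every sentence that mentions both entities of a known fact.

    known_facts is a list of (drug, disease) pairs. Matching is done on
    whole tokens, so that "pain" does not match "painting".
    """
    labelled = []
    for sentence in sentences:
        padded = " " + normalize(sentence) + " "
        for drug, disease in known_facts:
            has_drug = " " + normalize(drug) + " " in padded
            has_disease = " " + normalize(disease) + " " in padded
            if has_drug and has_disease:
                labelled.append((sentence, drug, disease, relation))
    return labelled


known_facts = [("Insulin", "Type 2 Diabetes"), ("Aspirin", "Pain")]
sentences = [
    "Insulin has traditionally been viewed as a last resort in the "
    "treatment of type 2 diabetes.",
    "Reactive species and early manifestation of insulin resistance "
    "in type 2 diabetes.",
]
labelled = distant_label(sentences, known_facts, "may-treat")
```

Both sentences receive the label may-treat, although only the first one expresses the relation. The second is the false positive from Figure 2, which illustrates why distantly labelled data is noisy.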
This article presents an alternative scenario, originally published in Roller and Stevenson (2015). Supervised methods tend to provide better results the more manually labelled training data is available. But what if only a small set of training instances is available? How does this affect the classification results? In the following, a method is presented that improves the results of a supervised classifier trained on only a small set of manually labelled instances by adding automatically labelled data. The method is tested in the biomedical domain on the detection of adverse-drug effects (ADE) in sentences. The ADE data set is described in detail in Gurulingappa et al. (2012) and Roller and Stevenson (2015). It contains 1644 annotated abstracts of medical publications. Each annotated abstract contains several sentences with a range of different adverse-drug effects.
For the experiment, only a small training data set of at most 200 medical abstracts is used, and the classifiers are evaluated on a larger set of 1444 abstracts. To examine the impact of small training data, the experiment starts with only one abstract. Then, the number of training abstracts is gradually increased to 200. For each training subset, positive and negative mentions of adverse-drug effects are extracted and then used as seeds to generate further training data automatically. From the sentence in Figure 3, for instance, the seed pair “hair loss”-“paroxetine” can be extracted. In the next step, these seed instances are used to generate distantly labelled training data from 2 million medical abstracts published in the Medline repository. With the seed instances and the automatic labelling process, it is possible to generate a much larger training data set than with manually labelled data alone. The resulting training data sizes are presented in Table 2.
Findings on discontinuation and rechallenge supported the assumption that the hair loss was a side effect of the paroxetine. (PMID=10442258)
Figure 3: Example of an adverse-drug effect
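The seed extraction step could look roughly like the sketch below. The input format (dictionaries with the keys `effects`, `drugs` and `ade_pairs`) is hypothetical, and so is the way negative seeds are derived here: the sketch assumes that a drug and an effect mentioned in the same sentence without an ADE annotation form a negative pair.

```python
from itertools import product


def extract_seeds(annotated_sentences):
    """Collect positive and negative (effect, drug) seed pairs.

    Assumption of this sketch: a co-occurring pair that is annotated as an
    adverse-drug effect is positive, any other co-occurring pair is negative.
    """
    positive, negative = set(), set()
    for sentence in annotated_sentences:
        gold = {(e.lower(), d.lower()) for e, d in sentence["ade_pairs"]}
        for effect, drug in product(sentence["effects"], sentence["drugs"]):
            pair = (effect.lower(), drug.lower())
            if pair in gold:
                positive.add(pair)
            else:
                negative.add(pair)
    # A pair annotated as positive anywhere is not kept as a negative seed.
    return positive, negative - positive


annotated = [
    # Mentions taken from the paroxetine sentence above.
    {"effects": ["hair loss"], "drugs": ["paroxetine"],
     "ade_pairs": [("hair loss", "paroxetine")]},
    # Invented entry, for illustration only: co-occurring mentions
    # that are not annotated as an adverse-drug effect.
    {"effects": ["nausea"], "drugs": ["paroxetine"], "ade_pairs": []},
]
positive_seeds, negative_seeds = extract_seeds(annotated)
```

The resulting seed pairs can then be matched against unlabelled sentences in the same way as known facts from a knowledge base, which produces distantly labelled positive and negative training instances.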
Table 2: ADE training data size
Table 2 shows that only a few distinct positive and negative seed pairs can be extracted when only a small number of abstracts is used for training. Likewise, a small set of seed abstracts contains only a small number of manually labelled positive and negative training instances. The seed instances, however, make it possible to generate a much larger number of automatically labelled training instances than are available in the manually labelled data.
For the following experiment, three classifiers are trained. First, a classifier is trained using only the manually labelled data (gold standard) as input (supervised). Next, a classifier is trained using only the automatically labelled data as input (distantly supervised). Finally, both data sets are merged and used as input for the third classifier (mixture model). For the experiment, a support vector machine with a shallow linguistic kernel is used (Giuliano et al. 2006). To ensure reliable results, the stepwise increase of the training data (from 1 to 200 abstracts) is repeated 5 times, each time with a different set of 200 abstracts. This means that each of the 5 evaluation rounds uses a different set of seed instances and, therefore, a different set of distantly labelled instances. The results are evaluated in terms of precision, recall and f-score and presented in Figure 4.
Figure 4: Effect of varying number of seed abstracts
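As a rough illustration of the setup, the sketch below assembles the three training sets and computes precision, recall and f-score for binary predictions. It assumes the balanced F1 variant of the f-score and leaves the classifier itself out, since the shallow linguistic kernel is not reproduced here. All function and key names are hypothetical.

```python
def build_training_sets(gold_instances, distant_instances):
    """Return the training data for the three classifiers."""
    gold, distant = list(gold_instances), list(distant_instances)
    return {
        "supervised": gold,               # manually labelled data only
        "distantly_supervised": distant,  # automatically labelled data only
        "mixture": gold + distant,        # both data sets merged
    }


def precision_recall_f1(gold_labels, predicted_labels):
    """Compute precision, recall and F1 for binary labels (truthy = positive)."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


training_sets = build_training_sets(["g1", "g2"], ["d1", "d2", "d3"])
precision, recall, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

In the toy evaluation there are two true positives, one false positive and one false negative, so precision, recall and F1 all come out at 2/3.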
Figure 4 shows that the supervised classifier achieves a low f-score when only a small set of manually labelled training data is used. Increasing the training data size improves the result. Interestingly, the distantly supervised classifier outperforms the supervised classifier up to a training set size of 100 abstracts. In this range, using more data (even though the data is noisy) leads to better results than using only a small set of manually labelled data. If both data sets are combined, further improvements are achieved. With an increasing number of training abstracts, the gap between the results of the mixture model and the supervised classifier becomes smaller. Eventually, with a further increase of the training data, the supervised classifier outperforms the mixture model (at around 300 abstracts, not shown in this experiment).
This article presented a method to detect relations in natural language when only a small set of manually labelled training data is available. The approach has been tested in the context of detecting adverse-drug effects in biomedical abstracts. It uses information from an existing training data set to automatically acquire new training data. Using this data, a relational classifier can be trained to detect and extract similar information in text documents. This classifier provides results comparable to those of a supervised classifier trained on a small gold standard. Furthermore, a mixture model using both manually labelled and distantly labelled data has been presented, which is able to outperform a classifier using only (a small set of) gold standard data. This result is notable since distantly labelled data tends to be much noisier than manually labelled data and therefore usually produces less accurate classifiers.
Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 2012. Text Mining and Natural Language Processing in Pharmacogenomics.
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, 2009.
Roland Roller and Mark Stevenson. Making the most of limited training data using distant supervision. In Proceedings of the BioNLP 2015 Workshop, Beijing, China, 2015.
Isabel Segura-Bedmar, Paloma Martínez, and Daniel Sánchez-Cisneros. The 1st DDI Extraction-2011 challenge task: Extraction of Drug-Drug Interactions from biomedical texts. In Proceedings of DDI Extraction-2011 challenge task, 2011.
Philippe Thomas, Illés Solt, Roman Klinger, and Ulf Leser. Learning Protein Protein Interaction Extraction using Distant Supervision. In Proceedings of Robust Unsupervised and Semi-Supervised Methods in Natural Language Processing, 2011.