Mark Craven and Johan Kumlien
Abstract
Considerable recent effort has gone into making
databases for molecular biology more accessible
and interoperable. However, information in text form, such as
MEDLINE records, remains a greatly underutilized source of biological
information. We have begun a research effort aimed at automatically
mapping information from text sources into structured representations,
such as knowledge bases. Our approach to this task is to use machine-learning
methods to induce routines for extracting facts from text. We
describe two learning methods that we have applied to this task,
a statistical text-classification method and a relational learning
method, and report our initial experiments in learning such information-extraction
routines. We also present an approach to decreasing the cost of
learning information-extraction routines by learning from "weakly"
labeled training data.