Data mining chemical patents

Close-up image of pills from a tin can.

Making drugs by synthesizing chemical compounds takes time. Now German researchers are developing a search algorithm to help chemists do this work faster. Image courtesy Richard Dunstan, stock.xchng.

Before a new drug can be used to treat ailments or disease, its chemical compounds have to be synthesized. This is one of the first crucial steps in making the drug.

This is, quite literally, joining one atom at a time to form new molecular structures. But developing chemicals can be laborious, with researchers having to manually sift through thousands of patents, documents, and images. This slows the speed at which medicinal compounds can reach the public. Now, German researchers have developed a prototype software tool that uses the parallel processing power of high-performance computers to automatically find the relevant data faster.

For some, chemical synthesis may be unexciting, but it’s the alchemy of modern science. A dangerous chemical in nature can be harnessed to help cure the most deadly human diseases. Last year, US researchers successfully synthesized palau’amine, a toxic compound produced by marine sponges, which was targeted for its antibiotic, anticancer, and antifungal properties.

Now, in Europe, chemists are working directly with academics to streamline their working process. Taros Chemicals, a German commercial chemical service provider, is working with Fraunhofer, Europe’s largest non-profit application-oriented research organization. Fraunhofer works in fields ranging from communication, energy, and the environment to health and security, and has developed revolutionary algorithms such as MP3 compression, which was standardized in the early 1990s.

A semantic method

Their latest algorithm is being developed at the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI) to help chemists at Taros Chemicals extract fundamental information from chemical literature when synthesizing chemical compounds.

Image of Unstructured Information Management Architecture workflow interface.

This image shows the UNICORE Rich Client software interface, which a chemist can use to drag and drop the UIMA pipelines to create a workflow. Image courtesy Alexander Klenner.

Questions in the mind of a chemist when starting this process can be: for a given base structure, are there any structure variants already mentioned in literature; and if so, are there any indications of their effects? Are structure variants protected by third-party rights or are they freely available? These straightforward questions currently require a complicated process to extract the answers from chemistry patent literature.

Researchers at Fraunhofer are creating algorithms similar to those used by Google, but even more complex. This is because the questions chemists need answered cannot be resolved by keyword analysis alone. Information has to be extracted and presented in a compact and structured way. The software being developed by SCAI automatically reads vast amounts of data to identify chemical structures, chemical names, drug names, relations between entities, and medical impacts.

Marc Zimmermann is deputy head of the SCAI bioinformatics department. He said, “we are writing software and algorithms for our customers and partners in the chemical and pharmaceutical industry. [Chemical synthesis] is a time consuming and error prone process to find the necessary information in old text books, lab journals, scientific papers, chemical patents, PhD thesis, databases, etc. The existing systems have different query interfaces, they are not cross linked, and they contain different levels of information. At the end of the process, the chemist has to read a lot of paper work and to manually extract the necessary information. That is where we want to help our users – we are building a service which tries to extract all this information directly from the PDFs.”

From the vertical to the horizontal

The project is called Unstructured Information Management Architecture using High-Performance Computers or UIMA-HPC. Unstructured Information Management Architecture is an open source standard, originally developed by IBM, for extracting large amounts of context-relevant data using semantic analysis tools. It links data via contextual meaning and adds structure to unstructured data.
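To make the idea of adding structure to unstructured data concrete, here is a minimal Python sketch of an annotation pipeline. It is not the UIMA API (UIMA itself is implemented in Java and C++); the `Annotation` and `Document` classes, the `regex_annotator` helper, and the deliberately naive patterns are all hypothetical, chosen only to show how successive annotators can attach typed, position-tagged labels to plain text.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # label type, e.g. "Formula" or "Effect" (hypothetical names)
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span
    text: str   # the covered text


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def regex_annotator(kind, pattern):
    """Build an annotator that labels every regex match with `kind`."""
    compiled = re.compile(pattern)

    def annotate(doc):
        for match in compiled.finditer(doc.text):
            doc.annotations.append(
                Annotation(kind, match.start(), match.end(), match.group())
            )
        return doc

    return annotate


def run_pipeline(doc, annotators):
    """Pass one document through each annotator in turn."""
    for annotate in annotators:
        doc = annotate(doc)
    return doc


# Toy patterns for illustration only; real chemical entity recognition
# is far more involved than two regular expressions.
PIPELINE = [
    regex_annotator("Formula", r"\b(?:[A-Z][a-z]?\d*){2,}\b"),
    regex_annotator("Effect", r"\b(?:antibiotic|anticancer|antifungal)\b"),
]

doc = run_pipeline(Document("C9H8O4 showed antifungal activity."), PIPELINE)
for a in doc.annotations:
    print(a.kind, a.begin, a.end, a.text)
```

Because every annotation keeps its character offsets, a later step can link an extracted entity back to the exact place in the source text where it was found.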

Zimmermann and his team are working with the German national research center FZ Jülich to develop new software tools that combine UIMA with the grid computing management software UNICORE (UNiform Interface to COmputing Resources) to create an on-demand service for chemists.

“The combination of technologies has started in April [2011]. So far, some basic results have been achieved and the commercial partners in the UIMA-HPC project are continuously evaluating the system and providing feedback. Our project is unique, in that we combine different established techniques into a large and flexible workflow which allows us to provide a solution for all kinds of use cases. The chemical patent mining use case is only a starting point,” said Zimmermann.

This data mining and analysis engine can, in theory, be used with any parallel computing architecture, from grids and HPC systems to smartphones. Zimmermann said, “actually our approach can be universally applied to any kind of hardware resource. The UIMA components can be installed on any local machine and make use of multicore architectures (even the new iPhone has two cores). UIMA is used by IBM on its famous Watson machine, one of the largest high-performance computers. By combining UIMA with UNICORE we are able to use resources from the Gauss Alliance - distributed clusters of grids and high-performance computers. Extending our methods to the cloud is in the focus of another project at SCAI.”

Spreading the data extraction process across parallel computing architectures makes the data mining faster, because each document can be processed independently of the others. This is known in academic circles as parallelizing, which is a bit of a tongue twister for the uninitiated.
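A rough sketch of what parallelizing looks like in practice, written in Python and using only the standard library. The `extract_effects` function and the sample documents are hypothetical stand-ins for a real extraction step. A thread pool is used here just to keep the sketch portable; for CPU-heavy extraction one would more likely swap in a process pool or, as in the project described here, hand the tasks to grid or HPC resources.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical extraction step: find a few effect terms in a document.
EFFECT = re.compile(r"\b(?:antibiotic|anticancer|antifungal)\b")


def extract_effects(document):
    """Return the sorted, de-duplicated effect terms found in one document."""
    return sorted(set(EFFECT.findall(document)))


def extract_all(documents, max_workers=4):
    """Run the extraction on every document using a pool of workers."""
    # Each document is independent of the others, so the work
    # splits cleanly into one task per document.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input documents.
        return list(pool.map(extract_effects, documents))


docs = [
    "Compound A has antibiotic and antifungal properties.",
    "No biological activity was reported.",
    "Compound B is an anticancer and antibiotic agent.",
]
results = extract_all(docs)
for doc, effects in zip(docs, results):
    print(effects, "<-", doc)
```

The key property is that no task needs the result of another, which is what lets the same job scale from a dual-core phone to a cluster.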

Zimmermann said, “we have hundreds of thousands of chemical documents to be processed. Because of the large computing power, we can use a large number of different extraction algorithms, try different parameter settings, compare results, and use the best ones. The extracted information goes directly in a data warehouse, which then allows for chemical knowledge mining.”
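The idea of trying several parameter settings and keeping the best can be sketched as a small grid search. Everything below is an assumption made for illustration: the article does not say how the SCAI team compares results, so this sketch scores each setting with the F1 measure against a tiny hand-labelled sample, and the `extract_formulas` function with its two parameters is a hypothetical toy extractor.

```python
import re
from itertools import product

# Hypothetical hand-labelled sample: the formulas a correct run should find.
SAMPLE = "Aspirin (C9H8O4) and caffeine (C8H10N4O2) were tested by IBM."
EXPECTED = {"C9H8O4", "C8H10N4O2"}


def extract_formulas(text, min_elements, require_digit):
    """Toy extractor with two tunable parameters (both are assumptions)."""
    pattern = r"\b(?:[A-Z][a-z]?\d*){%d,}\b" % min_elements
    hits = set(re.findall(pattern, text))
    if require_digit:
        # Drop candidates such as acronyms that contain no atom counts.
        hits = {h for h in hits if any(c.isdigit() for c in h)}
    return hits


def f1(found, expected):
    """Harmonic mean of precision and recall; 0.0 if nothing correct is found."""
    true_pos = len(found & expected)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(found)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)


def best_setting(text, expected, grid):
    """Score every parameter setting and return the best (score, params) pair."""
    scored = [(f1(extract_formulas(text, **params), expected), params)
              for params in grid]
    return max(scored, key=lambda pair: pair[0])


GRID = [{"min_elements": m, "require_digit": d}
        for m, d in product((2, 3), (False, True))]

score, params = best_setting(SAMPLE, EXPECTED, GRID)
print(score, params)
```

In this toy sample, settings that do not require a digit also pick up the acronym "IBM", so they score lower. With plenty of computing power, the same comparison can be run over many more algorithms and settings than a single machine could manage.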

The future users of the search engine have high hopes.

Image of annotated pdf that corresponds to a chemical structure report.

The software makes chemical patent information mining easier and helps link relevant information from various documents. The image on the left is an example of an annotated PDF, which corresponds to the chemical structure report on the right. The same chemical structure from the report (right) appears as a popup inside the PDF (left), with a link to the page from which it was extracted. Image courtesy Alexander Klenner.

“In the field of patents, most of this published knowledge isn't readily available with search engines. Even commercial solutions to do literature research won't provide a complete result. It's not possible to get an answer for a simple question. If this issue is solved with a setup to process the big amount of available literature and patents, one is able to get a clear picture about what is valid for his own targets of research,” said Alexander Piechot, CEO of Taros Chemicals.

100 years’ worth of data

The UIMA-HPC project still has a lot of work ahead. Its algorithms have to account for chemical synthesis patent documentation that has changed over more than 100 years of research. This means understanding how different languages, both human and scientific, have changed over time, including the ways chemical structures have been drawn, naming conventions, notations, and the scan quality of reports. To solve this problem, new extraction algorithms may need to be written for each language, or a translation process created.

The project’s research was presented at both the Cracow '11 Grid Workshop and the 7th German Conference on Chemoinformatics in early November 2011. The UIMA-HPC project has until 2014 to work out all the kinks. But the team is already steaming ahead, having successfully tested a prototype that is being used on a small collection of literature.
