Decoding Your Genes



Can Neural Networks Unravel The Secrets Of Our DNA?

Bethany Connolly

Jun 24 · 6 min read

I explore the impact of ML on the traditional sciences by summarising exciting new research papers. In this article I discuss another cool paper: “Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data” (Nature Communications, 10, 2449, 2019).

1| The Code Behind You

Every part of your body is a product of your DNA (deoxyribonucleic acid), a complex genetic code which describes exactly what your cells should be doing. We all know the famous double helix shape of a DNA molecule: it’s made of chemical units called ‘bases’ (Cytosine [C], Guanine [G], Adenine [A] and Thymine [T]) which are bonded together into beautiful coiling chains. This is a bit like binary computer code, but instead of a sequence of 1s and 0s, it’s Cs, Gs, As and Ts. The precise sequence of these bases encodes the genetic instructions for you, as well as for every other living organism on the planet.
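To make the binary analogy concrete, here is a tiny sketch. With four possible bases, each one fits in exactly two bits. The particular mapping below (A=00, C=01, G=10, T=11) is just an assumption for illustration; any assignment would work.

```python
# Hypothetical mapping: any assignment of the four bases to 2-bit codes would do.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence: str) -> str:
    """Turn a DNA string into a string of 0s and 1s, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in sequence.upper())


def decode(bits: str) -> str:
    """Invert encode(): read the bit string two characters at a time."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(encode("GATTACA"))          # 10001111000100
print(decode(encode("GATTACA")))  # GATTACA
```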

Figure: The double helix structure of DNA. Image from pixabay.com

A really important biological process is DNA methylation, in which a simple methyl group (CH3) is added to one of the normal DNA bases. The image below shows how small this change is. Although it looks tiny, it can have huge consequences for gene regulation, aging and even cancer. It’s even thought that these methylations could act as therapeutic targets for cancer treatment!

2| Mapping DNA Changes With Nanotech

It’s obviously really important that we are able to map out modifications in our genetic code, but it’s actually very difficult to do. Current techniques are noisy and give such poor resolution that there is a real drive for improvement. One recent idea is based on a really cool Nanopore technology. You can take a look at this great video showing exactly how it works, but basically an ionic current is passed through tiny nanopores embedded in a polymer membrane. As molecules of interest move through these pores, from one side of the membrane to the other, the current is disturbed in a characteristic way. So you put in DNA molecules and you get out a time series of changing electrical signals which can be used to identify the genetic code. This technique was recently used to sequence the entire genome of SARS-CoV-2, the virus that causes COVID-19, in just 7 hours!

Although this system is really good at getting the overall DNA sequence, it is harder to locate subtle methylations because interpreting the signal is context dependent: the current at any moment reflects several neighbouring bases sitting in the pore, not just one. Methylations can be located by comparing the Nanopore electrical signals of methylated and unmethylated DNA sequences.
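As a rough sketch of that comparison idea (not the method used in the paper), imagine we already have one mean current value per reference position for a control sample and for a sample of interest. The function name, the threshold and all the numbers below are made up for illustration.

```python
def flag_shifted_positions(control, sample, threshold):
    """Return the positions where the mean current differs by more than threshold."""
    if len(control) != len(sample):
        raise ValueError("both signals must cover the same positions")
    return [
        i for i, (c, s) in enumerate(zip(control, sample))
        if abs(c - s) > threshold
    ]


# Made-up mean currents (arbitrary units), one value per reference position.
control = [80.1, 95.3, 102.7, 88.4, 91.0]
sample = [80.3, 95.1, 108.9, 88.2, 91.2]

print(flag_shifted_positions(control, sample, threshold=3.0))  # [2]
```

A fixed threshold like this ignores the context dependence described above, which is part of why a learned model is attractive here.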

3| How Can ML Help?

If you’re looking for patterns in sequence data, RNNs (Recurrent Neural Networks) are a natural architecture to use. In case you’re not familiar, let’s take a quick look at how they work.

A typical ‘feed-forward’ neural network follows the process of applying randomly initialised weights and biases to an input to predict an output. When the generated output is compared to the target output an error (or ‘Loss’) is calculated. This loss is then propagated back through the network to update the weights and biases with the aim of improving the output. This process is repeated over and over again with different inputs during training until the network hopefully learns to generate an accurate output. This is just a very simple overview, but sums up the main principles.
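The loop described above can be sketched with the smallest possible ‘network’: a single weight and a single bias. This is only an illustration of the forward pass, loss and update cycle. The learning rate, epoch count and toy data (points on the line y = 2x + 1) are my own assumptions.

```python
import random


def train(data, lr=0.05, epochs=500, seed=0):
    """Fit output = w * x + b by repeatedly nudging w and b to reduce the error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # randomly initialised
    for _ in range(epochs):
        for x, target in data:
            output = w * x + b       # forward pass
            error = output - target  # gradient of loss = 0.5 * error**2 w.r.t. output
            w -= lr * error * x      # propagate the error back to the weight...
            b -= lr * error          # ...and to the bias
    return w, b


# Toy training set: points on the line y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = train(data)
print(round(w, 3), round(b, 3))
```

A real network stacks many such units with non-linear activations, but the train, compare, update rhythm is the same.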

What if you are not predicting a single output but rather a sequence of outputs? Conceptually, an RNN can be thought of as a connected sequence of feed-forward networks with information passed between them. The information being passed is the hidden state, which summarises all the previous inputs to the network. At each step of the RNN, the hidden state generated by the previous step is passed in along with the next input in the sequence. The step then returns an output as well as a new hidden state to be passed on again. This allows the RNN to retain a ‘memory’ of the sequence information it has seen so far, and makes RNNs great for understanding sequential data. You can check out a more mathematical description here.
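Here is a minimal sketch of that hidden-state hand-off, using a single number for the input, hidden state and output so the arithmetic stays readable. The weights are fixed, arbitrary values chosen for illustration; in a real RNN they would be learned.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, w_y=1.5):
    """One RNN step: mix the new input with the previous hidden state."""
    h_new = math.tanh(w_x * x + w_h * h_prev)  # updated 'memory'
    y = w_y * h_new                            # output read off the hidden state
    return y, h_new


def run_rnn(inputs):
    """Feed a sequence through the step function, carrying the hidden state along."""
    h = 0.0  # empty memory before the first input
    outputs = []
    for x in inputs:
        y, h = rnn_step(x, h)
        outputs.append(y)
    return outputs, h


outputs_a, h_a = run_rnn([1.0, 0.0, 0.0])
outputs_b, h_b = run_rnn([0.0, 0.0, 0.0])
print(h_a, h_b)
```

The two sequences end with the same two inputs, yet the final hidden states differ: the first one still ‘remembers’ the 1.0 it saw at the start.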

Let’s take a look at exactly how RNNs can decipher DNA modifications.

4| Unravelling DNA With RNNs

In the current paper, a new tool called ‘DeepMod’ was developed. This is a bidirectional RNN (it passes sequence information both forwards and backwards) built from long short-term memory (LSTM) units; check out a great summary of LSTMs here.
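To show what ‘bidirectional’ means in practice, here is a small sketch. For simplicity it swaps the LSTM cell for a plain tanh recurrent cell, and the weights are arbitrary illustrative values, so this is not DeepMod’s actual network, just the forwards-and-backwards idea.

```python
import math


def scan(inputs, w_x, w_h):
    """Run a plain tanh recurrent cell over the inputs, keeping every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs, fwd=(0.5, 0.8), bwd=(0.4, 0.7)):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = scan(inputs, *fwd)
    backward = scan(inputs[::-1], *bwd)[::-1]  # run right-to-left, then restore order
    return list(zip(forward, backward))


states = bidirectional([0.0, 0.0, 1.0])
print(states[0])
```

At position 0 the forward state is still zero because nothing interesting has arrived yet, while the backward state is already non-zero because it has ‘seen’ the 1.0 at the end of the sequence. For DNA this matters because the signal at one base can depend on bases on both sides of it.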

DeepMod takes a reference genetic code and a Nanopore electric signal as input. The ‘events’ in the electric signal (series of signal points generated by the Nanopore sequencer) are aligned with the DNA code in the reference. This was achieved using BWA-MEM, an alignment algorithm for matching DNA sequences with reference genomes. This algorithm is capable of matching DNA sequences up to megabases long!

The authors described each event in the input signal with a 7-feature vector: three signal features (the mean, the standard deviation and the number of signal points associated with the event) combined with a four-feature description of the DNA base (A, C, G or T). This acts as the input to the network (see its architecture below), which predicts whether the signal event is the result of a modified base.
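A sketch of how such a 7-feature vector could be put together is shown below. I am assuming the four base features are a one-hot encoding in A, C, G, T order and that the standard deviation is the population version; the function name, feature order and signal values are all hypothetical.

```python
from statistics import mean, pstdev

BASES = "ACGT"  # assumed one-hot order


def event_features(signal_points, base):
    """Build [mean, std, count, is_A, is_C, is_G, is_T] for one event."""
    one_hot = [1.0 if base == b else 0.0 for b in BASES]
    return [
        mean(signal_points),
        pstdev(signal_points),      # population standard deviation (an assumption)
        float(len(signal_points)),  # number of signal points in the event
    ] + one_hot


# Made-up signal points for an event aligned to a C in the reference.
features = event_features([1.0, 2.0, 3.0], "C")
print(features)
```

A sequence of these vectors, one per event, is the kind of input a recurrent network can then step through.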

The algorithm was first trained and optimised using data from E. coli bacteria DNA. A 21 unit LSTM with 3 hidden layers was found to achieve high accuracy while maintaining reasonable computational costs. Analysis on several different E. coli datasets showed strong results, with amazing single-base methylation mapping resolution. In addition, the network showed great precision (up to 0.99) at identifying which bases were methylated.
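In case the term is unfamiliar, precision is the fraction of bases the model called ‘methylated’ that really were methylated: true positives divided by all positive calls. A quick sketch with made-up labels (not the paper’s data):

```python
def precision(predicted, actual):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if true_pos + false_pos == 0:
        return 0.0  # no positive calls were made at all
    return true_pos / (true_pos + false_pos)


# Made-up labels: 1 = methylated, 0 = unmethylated.
predicted = [1, 1, 0, 1, 0]
actual = [1, 0, 0, 1, 1]
print(precision(predicted, actual))  # 2 correct calls out of 3 positive calls
```

Note that precision says nothing about methylated bases the model missed (the last position in this example), which is why it is usually reported alongside other metrics.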

5| Could This Map Human Genetic Code?

Yes! Even though DeepMod was trained only on bacterial DNA data, it was used to make accurate predictions of methylations in human DNA. This cross-species testing is really exciting because it shows that a model trained on one species can be used to accurately map DNA modifications in a species the model has never seen before. Perhaps the model could also be applied to loads of different species, and it would be really exciting to see these results too!

This is a great example of the power of neural networks to generalise to new data and speed up the rate of scientific learning. It’s also a great example of machine learning being applied to current scientific problems to generate an immediate and practical solution. The DeepMod code is now available online and the authors plan to maintain and update it for future users. The tool’s speed and accuracy in analysing human DNA modifications may help in the understanding and treatment of diseases like cancer, so this is a really cool result!

6| Final Thoughts

Despite these great results, as always there are a few things to bear in mind:

DeepMod was trained and tested on only 2 types of DNA methylation, but there are actually loads of different types. More testing is needed to know if this model can be used to locate the wide range of modifications that exist in real DNA.
The model was not tested on RNA (ribonucleic acid), DNA’s single-stranded cousin. This is an essential biological molecule for coding, decoding and gene expression, so it would be really interesting to see how the model fares with this task as well.
Finally, the model relies heavily on the alignment of the input signal with the reference DNA using BWA-MEM. If the alignment is poor, the model’s performance will suffer heavily, so this dependency needs to be kept in mind when preparing training data.

Overall this is a really promising practical application of neural networks, and if you enjoyed this brief summary I would encourage you to read the original paper for more in-depth detail about the DeepMod framework, training and validation process.
