Machine Learning in Genomics

Gene editing: Gene editing covers technologies that give scientists the ability to change an organism’s DNA. Among the available technologies, CRISPR-Cas has seen many advances. To use CRISPR, however, researchers must first select the right target sequence, which is challenging because editing outcomes are often unpredictable. ML models can predict the probability that a particular sequence will hit its intended target or an off-target region, reducing the cost, time and effort of identifying the right target sequence. Two such modelling approaches are already available (https://www.microsoft.com/en-us/research/project/crispr/).
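As a rough illustration of the target-selection step, the sketch below lists candidate 20-nt guide sites that sit immediately upstream of an NGG PAM (the motif required by the commonly used SpCas9 enzyme) on the forward strand, and counts mismatches between a guide and a potential off-target site. The function names are hypothetical, and the mismatch count is only a crude stand-in for the features an ML model would learn from; it is not the scoring method of the tools linked above.

```python
import re


def find_guides(seq, guide_len=20):
    """Return (start, protospacer, pam) for each forward-strand site followed by an NGG PAM."""
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


def mismatches(guide, site):
    """Count positions at which a guide and a candidate site of equal length differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    return sum(a != b for a, b in zip(guide, site))


# Hypothetical input: 20 A's followed by a TGG PAM.
candidates = find_guides("A" * 20 + "TGG")
```

A trained model would replace the raw mismatch count with a learned score, for example one that weights mismatches by their position in the guide.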
Pharmacogenomics: ML algorithms can be applied to predict the dose of a particular drug based on an individual’s genetic makeup, ethnicity, age, gender, concomitant medication and other medical conditions (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5296901/). Using a patient’s SNPs to predict individual disease risk is an essential element of precision medicine.
Identification of genomic variations:

A. SNP/InDel calling: SNP/InDel calling is the process of finding single bases or sequences of bases in NGS data that differ from the reference genome. SNPs and InDels can indicate health risks, which makes it possible to take preventive measures against diseases to which a genome may be susceptible.
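
The idea of comparing sequenced bases against the reference can be sketched with a naive majority-vote caller. This is a simplified illustration only: it assumes gapless alignments given as hypothetical (0-based start, sequence) pairs, uses arbitrary depth and allele-fraction thresholds, and handles SNPs but not InDels. Production callers rely on probabilistic or learned models instead.

```python
from collections import Counter, defaultdict


def call_snps(reference, reads, min_depth=3, min_fraction=0.8):
    """Call SNPs where the most common read base differs from the reference.

    reads: iterable of (start, sequence) gapless alignments, 0-based.
    Returns a list of (position, ref_base, alt_base, depth).
    """
    pileup = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        # Thresholds are illustrative assumptions, not recommended values.
        if depth >= min_depth and base != reference[pos] and support / depth >= min_fraction:
            calls.append((pos, reference[pos], base, depth))
    return calls


reference = "ACGTACGTAC"
reads = [(0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTAC")]  # all three carry T at position 4
snps = call_snps(reference, reads)  # [(4, 'A', 'T', 3)]
```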

B. Gene copy number calling: Copy number variation (CNV) has recently gained considerable interest as a type of genomic variation that plays an important role in disease susceptibility. Interest in CNVs has grown across a wide range of diseases, including Parkinson’s disease, Hirschsprung disease, diabetes mellitus, autism, Alzheimer’s disease, schizophrenia and cancer. CNV analysis has two main steps: 1) preprocessing and 2) segmentation. The input data are aligned short reads in BAM, SAM or pileup format. In the preprocessing step, biases and noise in the whole-exome sequencing (WES) data are eliminated or reduced; normalization and de-noising algorithms are the main components of this step. In the segmentation step, a statistical approach merges adjacent regions with similar read counts to estimate a CNV segment.
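
The two steps can be sketched on per-bin read counts as follows. This is a deliberately simple illustration under stated assumptions: preprocessing is reduced to median normalization on a log2 scale, and segmentation is a fixed-threshold merge of adjacent bins rather than a proper statistical method such as circular binary segmentation or a hidden Markov model. The function names, the pseudocount and the 0.4 threshold are all hypothetical choices.

```python
import math
from statistics import median


def normalize(counts, pseudocount=1.0):
    """Preprocessing: log2 ratio of each bin's read count to the sample median."""
    med = median(counts)
    return [math.log2((c + pseudocount) / (med + pseudocount)) for c in counts]


def classify(log_ratio, threshold=0.4):
    """Label a bin as gain, loss or neutral using an illustrative threshold."""
    if log_ratio >= threshold:
        return "gain"
    if log_ratio <= -threshold:
        return "loss"
    return "neutral"


def segment(log_ratios, threshold=0.4):
    """Segmentation: merge adjacent bins with the same call.

    Returns a list of (start_bin, end_bin, call, mean_log2_ratio).
    """
    segments = []
    for i, value in enumerate(log_ratios):
        call = classify(value, threshold)
        if segments and segments[-1]["call"] == call:
            segments[-1]["end"] = i
            segments[-1]["values"].append(value)
        else:
            segments.append({"start": i, "end": i, "call": call, "values": [value]})
    return [
        (s["start"], s["end"], s["call"], sum(s["values"]) / len(s["values"]))
        for s in segments
    ]


# Hypothetical read counts for ten consecutive bins.
counts = [100, 102, 98, 200, 205, 198, 101, 99, 50, 48]
cnv_segments = segment(normalize(counts))
```

On this toy input the bins with roughly doubled counts are merged into one gain segment and the bins with roughly halved counts into one loss segment.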
Predicting mRNA levels from genome sequence: Gene expression can be an indicator of a cellular state. Many studies have demonstrated the potential causal impact of differential gene expression on complex disease risk. ML algorithms can be used to predict mRNA expression levels to a substantial degree from features derived solely from genomic sequence.
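To make "features derived solely from genomic sequence" concrete, the sketch below turns a sequence into a vector of k-mer frequencies and feeds it to a linear predictor. Both function names are hypothetical, k-mer frequencies are just one possible feature choice, and the weights would have to be learned from measured expression data; none are supplied here.

```python
from itertools import product


def kmer_features(seq, k=3):
    """Return the frequency of every possible k-mer over ACGT, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def predict_expression(features, weights, bias=0.0):
    """Linear predictor; weights and bias are assumed to come from a fitted model."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


features = kmer_features("ACGTACGTTAGC", k=2)  # 16 dinucleotide frequencies
```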
Phenotype-to-genotype mapping: AI algorithms can significantly enhance the mapping of phenotype to genotype. To quote “DeepGestalt, a CNN-based facial image analysis algorithm, dramatically outperforms human dysmorphologists in this task and is precise enough to distinguish between molecular diagnoses that are mapped to the same clinical diagnosis (that is, distinct molecular forms of Noonan syndrome). When combined with genomic data, PEDIA, a genome interpretation system incorporating DeepGestalt, was able to use phenotypic features extracted from facial photographs to accurately prioritize candidate pathogenic variants for 105 different monogenic disorders across 679 individuals.”
Identification of novel gene signatures for different diseases: ML models can be trained on whole-genome sequence data available in public databases for different diseases and then used to predict novel gene signature markers (https://www.nature.com/articles/s41598-020-61298-3).
Computer vision: It can be used in clinical genomic testing. For example, deep learning of lung cancer histopathological images is able to identify cancer cells, determine their type, and predict what somatic mutations are present in the tumour.
Genotype-to-phenotype prediction: Common diseases are a result of a complex interplay between inherited genetic risk factors, environmental exposures, and behaviours. For example, several non-genetic risk factors are associated with breast cancer risk, including mammographic density, age at first birth, age at menarche, and age at menopause. Combining these non-genetic risk factors with genetic data significantly improves the accuracy of breast cancer risk models and can inform risk-based mammographic screening strategies. More generally, ML can integrate a variety of health data types and risk factors into comprehensive predictors of disease risk.
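One simple way to picture this combination is a logistic model that takes a polygenic score (a weighted sum of risk-allele dosages) together with non-genetic covariates. The sketch below is an illustration under assumptions: the function names, covariate names and every coefficient are hypothetical placeholders, not values from any published breast cancer model.

```python
import math


def polygenic_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


def combined_risk(prs, covariates, prs_weight, covariate_weights, intercept):
    """Logistic model combining a polygenic score with non-genetic covariates."""
    z = intercept + prs_weight * prs
    z += sum(covariate_weights[name] * value for name, value in covariates.items())
    return 1.0 / (1.0 + math.exp(-z))


# All numbers below are made up for illustration.
prs = polygenic_score([0, 1, 2], [0.1, 0.2, 0.3])
risk = combined_risk(
    prs,
    covariates={"mammographic_density": 0.6, "age_at_menarche": 12},
    prs_weight=1.0,
    covariate_weights={"mammographic_density": 0.8, "age_at_menarche": -0.05},
    intercept=-3.0,
)
```

In practice the weights would be estimated jointly from cohort data, and the model would need calibration before informing any screening decision.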
Agriculture: “Combining genomics and machine learning to build diagnostic tools aimed at predicting and preventing diseases in crops”.
DeepBind: A deep learning tool used to predict the binding sites of RNA-binding proteins (RBPs) using the primary sequence as well as the secondary and tertiary structural profiles. It requires RNA-seq data.
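Convolutional binding-site models typically start by one-hot encoding the sequence, sliding motif-detector filters along it, rectifying the responses and pooling them into a single score. The sketch below shows only that sequence-scanning step with one hand-written filter; in a real model the filter weights are learned during training and the pooled scores feed further network layers. The function names are hypothetical, U is treated as T so that RNA input works, and structural profiles are not modelled.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def one_hot(seq):
    """Encode a sequence as a list of 4-element rows; unknown bases become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        if base in BASE_INDEX:
            row[BASE_INDEX[base]] = 1.0
        rows.append(row)
    return rows


def motif_score(seq, motif):
    """Convolve one filter over the sequence, rectify, then max-pool.

    motif: list of 4-element weight rows (A, C, G, T), one row per motif position.
    """
    encoded = one_hot(seq)
    width = len(motif)
    responses = []
    for start in range(len(encoded) - width + 1):
        response = sum(
            motif[j][b] * encoded[start + j][b]
            for j in range(width)
            for b in range(4)
        )
        responses.append(max(0.0, response))  # rectification
    return max(responses) if responses else 0.0  # max pooling


# Hand-written filter that responds to the triplet ACG (illustrative only).
acg_filter = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
score = motif_score("TTACGTT", acg_filter)  # 3.0
```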