
High throughput Python pipeline to identify Horizontal Gene Transfer

Project Information

bioinformatics, biology, data-wrangling, genomics, github, python, workflow
Project Status: Halted
Project Region: CAREERS
Submitted By: Vinayak Mathur
Project Email: vm7027@cabrini.edu
Project Institution: Cabrini University
Anchor Institution: CR-Penn State
Project Address: 610 King of Prussia Road
IAD 224
Radnor, Pennsylvania. 19087

Mentors: Simon Delattre
Students: Arun Dash

Project Description

Project Description: This project investigates horizontal gene transfer (HGT), specifically in interactions between bacteriophages and their host bacteria. Biologically, this type of HGT occurs when a bacteriophage attaches to a bacterial cell and injects its genetic material, which can integrate into the host genome and take control of the bacterium to make copies of the phage. The main aim of the project is to develop an analysis pipeline, written in Python, that automatically generates a list of bacterial accession numbers from an input list of phage accession numbers. The current program uses BLAST to create this list.
The pipeline iterates through the input list and submits each phage accession number as a BLAST query against the NCBI database of bacterial genes. The top bacterial result for each phage query ID is stored and then used in turn as a query against the database of bacteriophage genes. If the phage result of this second search matches the original phage query ID, the pair is taken to indicate horizontal gene transfer. Running this analysis in an HPC environment accessed over SSH could significantly speed up data collection compared with the current pipeline or with manual searches on the NCBI website, where BLAST is available.
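
The reciprocal check described above can be sketched in a few lines of Python. This is only an illustration, not the code in the CSPpipeline repository: the function name `find_reciprocal_hits`, the two lookup callables, and the accession strings are all hypothetical, and the BLAST searches are replaced by small dictionaries so that the logic can be read and run without network access.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type: a function that takes a query accession and returns the
# accession of the top BLAST hit, or None when the search finds nothing.
TopHitFn = Callable[[str], Optional[str]]


def find_reciprocal_hits(
    phage_ids: Iterable[str],
    top_bacterial_hit: TopHitFn,
    top_phage_hit: TopHitFn,
) -> List[Tuple[str, str]]:
    """Return (phage, bacterium) pairs whose top hits point back at each other."""
    candidates = []
    for phage_id in phage_ids:
        # Step 1: phage query against the bacterial database.
        bacterial_id = top_bacterial_hit(phage_id)
        if bacterial_id is None:
            continue  # no bacterial hit, nothing to check
        # Step 2: bacterial hit as query against the phage database.
        if top_phage_hit(bacterial_id) == phage_id:
            candidates.append((phage_id, bacterial_id))
    return candidates


# Toy lookup tables standing in for real BLAST searches (made-up accessions).
forward = {"PHAGE_A": "BACT_1", "PHAGE_B": "BACT_2", "PHAGE_C": None}
reverse = {"BACT_1": "PHAGE_A", "BACT_2": "PHAGE_X"}

hgt_candidates = find_reciprocal_hits(
    ["PHAGE_A", "PHAGE_B", "PHAGE_C"], forward.get, reverse.get
)
print(hgt_candidates)
```

Passing the two searches in as callables keeps the matching logic separate from how BLAST is actually run, so the same function could be driven by remote queries, a local BLAST installation on an HPC node, or cached results.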

The current version of the pipeline is available here: https://github.com/genomesolver/CSPpipeline

Research goals: This research project has three major goals:
1) Identify instances of HGT in a large dataset of bacteriophage proteins: The list produced by the program enables more in-depth analysis of bacteriophage-mediated horizontal gene transfer.
2) Predict the likelihood of HGT: By developing a probabilistic classifier, we can attempt to predict the likelihood that a given clade of bacteria is affected by HGT, given the HGT status of the other members of the clade. This model could help establish the statistical significance of HGT occurrences among bacterial relatives and help identify cellular features specific to those groups that could explain their vulnerability to phage infection.
3) Functional analysis: A further aim is a Gene Ontology (GO) enrichment analysis to extract meaningful conclusions from this data. Because the current version of the pipeline generates a list of bacterial accession numbers that correspond to phage query IDs, that list can be processed to find GO terms in groups of genes regulated by the integration of bacteriophage nucleic acids. This analysis would help visualize and improve understanding of how phage infections disrupt the genetic network of the bacteria.
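
As an illustration of the third goal, one common way to test GO term enrichment is a one-sided hypergeometric test. The sketch below assumes that approach; the project description does not say which method would be used. The function names, gene labels, and term labels are hypothetical placeholders, and the annotations are toy data rather than real GO assignments.

```python
from math import comb
from typing import Dict, Set


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def go_enrichment(
    hit_genes: Set[str], annotations: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Return a p-value for every term seen in the annotated hit genes."""
    background = set(annotations)
    hits = hit_genes & background  # ignore hits without annotations
    terms = set().union(*(annotations[g] for g in hits))
    p_values = {}
    for term in terms:
        K = sum(1 for g in background if term in annotations[g])
        k = sum(1 for g in hits if term in annotations[g])
        p_values[term] = enrichment_p_value(k, len(hits), K, len(background))
    return p_values


# Toy annotations: six made-up genes and two made-up term labels.
annotations = {
    "g1": {"TERM_X"},
    "g2": {"TERM_X"},
    "g3": {"TERM_X", "TERM_Y"},
    "g4": {"TERM_Y"},
    "g5": {"TERM_Y"},
    "g6": {"TERM_Y"},
}
p_values = go_enrichment({"g1", "g2"}, annotations)
print(p_values)
```

A real analysis would test many terms at once and would therefore also need a multiple-testing correction, which this sketch leaves out.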

Additional Resources

Github Contributions: https://github.com/genomesolver/CSPpipeline
Wrap Presentation: 4
