Undergraduate Research in Computational Biology

Research Experiences for Undergraduates-Projects

Overview of Projects for Summer 2018

Computational Approaches to Study the Transmission and Pathogenesis of Mycobacterium ulcerans
Dr. Jordan’s research areas include microbial ecology and the transmission and pathogenesis of the environmental pathogen Mycobacterium ulcerans. It is not known how the organism is transmitted to humans, or which environmental circumstances lead to the production of mycolactone, a lipid toxin and the sole virulence determinant of M. ulcerans. These gaps in the knowledge base are important because M. ulcerans infection leads to a devastating skin disease known as Buruli ulcer, which impacts at least 33 countries, with the highest incidence in rural West Africa.
Potential projects for the REU fellows include: 1. Environmental screening for the presence and abundance of M. ulcerans among aquatic samples collected from Benin, West Africa. The objective of this work is to determine the presence and abundance of M. ulcerans among environmental samples collected from Buruli ulcer endemic and non-endemic aquatic habitats, in order to test the hypothesis that M. ulcerans resides and replicates within a specific niche within the aquatic habitat. To test this hypothesis, DNA will be isolated from preserved samples collected from aquatic habitats in Benin, West Africa. The isolated DNA will be subjected to semi-quantitative and quantitative PCR targeting M. ulcerans. Positive samples will be strain typed using Variable Number Tandem Repeat profiling and verified by amplicon sequencing and comparison against sequence databases using BLAST. We expect specific aquatic matrices (such as water filtrand, soil, or invertebrates) to be positive for M. ulcerans. We also expect positivity and concentration to be higher among samples collected from Buruli ulcer endemic habitats. 2. Impact of UV on mycolactone gene expression. The objective of this work is to determine whether mycolactone gene expression and production are modulated when M. ulcerans is subjected to UV. This objective will test the hypothesis that expression of genes responsible for mycolactone production is upregulated as a stress response. To test this hypothesis, M. ulcerans replicates will be grown to exponential phase, then placed into petri plates and subjected to UV for 0 seconds (control) and for durations ranging from 5 seconds to 5 minutes. The bacteria will be collected and serially diluted for plating to determine the impact of UV on M. ulcerans growth. Additionally, RNA will be isolated from the bacteria and, following verification of RNA integrity, converted to cDNA for RT-PCR targeting genes responsible for mycolactone production.
Modulation of gene expression will be analyzed using computational software in R. We expect mycolactone gene expression to be upregulated upon increased UV exposure. Data from both projects will be valuable for assessing the environmental niche of M. ulcerans, determining the mode of transmission of the pathogen to people, and identifying conditions for mycolactone production. Additionally, the methods described will allow the student to obtain or develop skills in molecular biology, data management, and data interpretation.
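As an illustration of how modulation of gene expression might be quantified, the sketch below computes a relative fold change with the widely used 2^-ddCt calculation. It assumes the RT-PCR is run quantitatively and that primer efficiency is close to 100%; the project description does not specify the method, and the project itself plans to use R, so this Python version and its Ct values are purely hypothetical.

```python
def fold_change(ct_target_treated, ct_ref_treated,
                ct_target_control, ct_ref_control):
    """Relative expression of a target gene by the 2^-ddCt calculation.

    Each argument is a cycle threshold (Ct) value. The reference gene is
    assumed to be stably expressed under both conditions.
    """
    # Normalize the target gene to the reference gene in each condition.
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    # Compare the UV-treated sample with the 0-second control.
    delta_delta = delta_treated - delta_control
    return 2 ** -delta_delta


# Hypothetical Ct values, not measured data.
print(fold_change(22.0, 18.0, 24.0, 18.0))  # 4.0, i.e. fourfold upregulation
```

A fold change above 1 would be consistent with upregulation after UV exposure, and a value below 1 with downregulation.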

Changes in Hemoglobin Expression in Response to Environmental Changes
The Hoffmann Lab is broadly interested in evolutionary genomics and molecular evolution. An overriding theme is to better understand the connection between the emergence of novel genes and the origins of biological innovations. Relating to this theme, the Hoffmann Lab 1. Explores the different mechanisms involved in the origin of new genes, 2. Assesses the forces underlying the retention and functional variation of these genes, and 3. Works to gain insight into the processes underlying intra- and inter-specific variation in the number and nature of genes in animal genomes. Current projects include studies of the evolution of animal gene families, the emergence of novel genes via gene and genome duplication, functional variation among paralogous members of a gene family, the evolution of small RNA repertoires, and the interplay between transposable elements and small RNAs. Dr. Hoffmann’s team pursues these questions using an integrative approach that involves combining bioinformatics and evolutionary genomics with perspectives from other disciplines such as molecular population genetics, cellular and structural biology, protein biochemistry and animal physiology that are brought by our collaborators.
The dual challenge of respiration (oxygen extraction and delivery) and ionoregulation is a poorly studied problem of particular physiological significance for basal aquatic vertebrates. This is particularly true of fish with both gills and an air-breathing organ (ABO) that tolerate a vast range of oxygen concentrations and salinities, such as gars. These species need to contend with extracting oxygen from media with different oxygen concentrations under a wide range of environmental conditions that probably change throughout an animal’s development. As such, the alligator gar (Atractosteus spatula), the most salinity tolerant species in the basal bony fishes with an ABO which also stands at the crux of basal vertebrate and teleost evolution, offers unique opportunities to better understand the dual physiological regulation of these systems. At the cellular and organismal level, vertebrate hemoglobins play a fundamental role in mediating responses to changes in oxygen availability, as this protein is in charge of delivering oxygen from respiratory organs to the cells of tissues to enable aerobic metabolism. Hemoglobins are the products of a gene family, and most fish synthesize different hemoglobin subunits throughout development and also in response to environmental changes. However, a clear understanding of the combined effects of developmental and environmental changes is lacking. Because of its ability to extract oxygen from water and air, its wide tolerance to changes in salinity and the availability of a high quality genome for a relatively closely related species, the alligator gar offers unique opportunities to study gene expression plasticity. Our work seeks to study this important model species through seeking answers to key questions in physiology and functional and evolutionary genomics related to a fundamental aspect of life on this planet: how organisms maintain a stable delivery of oxygen under varying conditions. 
Thus, as a first step toward understanding how gars deliver oxygen under different environmental conditions, and how exposure to physiological challenges early in development can influence responses at later stages, we are analyzing alligator gar transcriptomes to 1. Characterize the set of hemoglobins expressed at different stages of development, 2. Characterize changes in hemoglobin expression in response to changes in O2 availability and changes in salinity, and 3. Characterize changes in blood chemistry relative to oxygen binding affinity in response to changes in O2 availability and changes in salinity.
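The kind of comparison described in aims 1 and 2 can be sketched in a few lines. The example below assumes that normalized expression values (for instance, transcripts per million) are already available for each hemoglobin gene; the gene labels and numbers are made up for illustration and are not results from the lab.

```python
import math


def hemoglobin_shares(expression):
    """Fraction of total hemoglobin expression contributed by each gene."""
    total = sum(expression.values())
    if total == 0:
        raise ValueError("no hemoglobin expression detected")
    return {gene: value / total for gene, value in expression.items()}


def log2_fold_change(expr_before, expr_after, pseudocount=1.0):
    """Log2 ratio of expression between two conditions.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((expr_after + pseudocount) / (expr_before + pseudocount))


# Hypothetical expression values under normal and low oxygen.
normoxia = {"hb_alpha_1": 30.0, "hb_beta_1": 10.0}
hypoxia = {"hb_alpha_1": 15.0, "hb_beta_1": 45.0}

print(hemoglobin_shares(normoxia))
for gene in normoxia:
    print(gene, log2_fold_change(normoxia[gene], hypoxia[gene]))
```

Shares show which subunits dominate at a given stage or condition, while the log2 fold change shows how strongly each gene responds to the environmental change.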

Visualization of Genomic Data
Dr. Jankun-Kelly works in the area of information and scientific visualization. He has developed novel methods for visualization interfaces, interfaces for linked image browsing, models for visual exploration, and visual analysis tools for bioinformatics.
REU projects for bioinformatics will challenge students to work together with computer scientists and biology experts to solve complex problems via interactive computer graphics. While two examples of such projects are given, actual projects will be determined in collaboration with the application scientist and the student. 1. MSAVis, a multiple sequence alignment visualization system, has several feasible extensions that can be tackled in parallel by dedicated students; two are presented here. First, as it stands, MSAVis does not allow editing of protein sequences to test different alignment hypotheses; this is a feature of interest to its users. A student would add this functionality, which would involve modifying MSAVis’ interaction mechanisms and integrating MSAVis with sequence alignment software. Second, there are additional protein features that could be integrated, such as binding sites or information about secondary structure. Such a project would involve designing the visual metaphors for the added information and designing the interface to query the biological databases to extract it. 2. In this project, a web-based tool named GeneAtlas will be refined. GeneAtlas allows the efficient comparison of multiple gene expression samples (usually from species at different times in their life cycle). Additional interaction methods and visual metaphors could be explored to make this a tool with genuine impact on biological studies.
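To give a sense of the data behind an alignment visualization, the sketch below computes a simple per-column conservation score that a tool could map to color. This is not MSAVis code; the function name, the scoring rule (fraction of sequences sharing the most common residue), and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column by the share of its most common residue.

    `alignment` is a list of equal-length strings in which "-" marks a gap.
    Gaps count against conservation because the divisor is the number of
    sequences, not the number of residues.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment of three hypothetical protein fragments.
print(column_conservation(["MKV-", "MKI-", "MRVA"]))
```

Editing a sequence to test a different alignment hypothesis would change these scores, which is one way an interface could give immediate visual feedback.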

Genomics for Studying the Role of Polyamine Metabolism in Pneumococcal Virulence
The Nanduri lab routinely uses 1D LC ESI MS/MS to conduct global expression analysis, which can be applied to study the host (mouse) and pathogen (S. pneumoniae) responses in an intranasal challenge model of pneumonia. We also use single nucleotide resolution transcriptome mapping approaches such as RNA-seq to study global gene expression during infection. Both the mass spectrometry based proteomics and the RNA-seq approaches generate data that require bioinformatics analysis, utilizing available open source pipelines, to identify a list of genes/proteins that are differentially expressed in the host and pathogen during infection. Mass spectrometry data and RNA-seq data can also be utilized for genome structural annotation, i.e., defining the expressed elements and their boundaries in a genome sequence. Furthermore, the list of differentially expressed genes and proteins is not useful unless the corresponding biological information is retrieved and analyzed in the context of pathways and networks for knowledge discovery. All these aspects of conducting polyamine research in the pneumococcus are amenable to training in multiple aspects of bioinformatics and computational biology at the undergraduate level.
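One common way to place a list of differentially expressed genes in a pathway context is an over-representation test. The sketch below uses a one-sided hypergeometric test; this is offered as a generic example rather than the lab's pipeline, and the gene counts are invented.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, de_genes, overlap):
    """Probability of seeing at least `overlap` pathway genes among the
    differentially expressed (DE) genes by chance (hypergeometric tail)."""
    max_overlap = min(pathway_genes, de_genes)
    # Number of ways to draw the DE set with exactly k pathway genes in it.
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, de_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(total_genes, de_genes)


# Hypothetical counts: 2000 genes in the genome, 40 in the pathway,
# 100 DE genes, 8 of which belong to the pathway.
print(enrichment_p_value(2000, 40, 100, 8))
```

A small p-value suggests the pathway contains more differentially expressed genes than expected by chance; in practice such values would also be corrected for testing many pathways.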
Streptococcus pneumoniae (pneumococcus) is a human pathogen associated with the etiology of meningitis, pneumonia, bacteremia, bronchitis, sinusitis, and otitis media. Based on the structure of the capsule, more than 90 different serotypes of S. pneumoniae are described in the literature. Genome plasticity, serotype variability, and increasing antibiotic resistance confound efforts to control this pathogen. The availability of genome sequences for representative serotype strains, together with mouse models of disease, allows the identification of host-pathogen interactions that underlie disease, with the aim of developing therapeutic strategies. S. pneumoniae is a commensal in the nasopharynx. When the host is immunocompromised, this opportunistic pathogen invades sterile spaces such as the lungs, causing pneumonia, and when it reaches the blood it results in sepsis. As pneumococcus traverses through the nasopharynx to various anatomical locations in the human body, it has to adapt its metabolism to the host niche and also circumvent host defenses at each of these locations. Studying the intersection of pneumococcal metabolism with virulence during infection is expected to elucidate key pneumococcal genes/proteins involved in pathogenesis. Polyamines are polycationic aliphatic hydrocarbon compounds that are ubiquitous in living cells. Polyamines, such as putrescine, spermidine, and cadaverine, carry a net positive charge at physiological pH. The positive charge of polyamines helps maintain the conformation of negatively charged nucleic acids. Polyamines are involved in pathogen adaptation to growth in vivo, response to physiological stress, and modulation of host immune responses. Impaired polyamine transport and biosynthesis in S. pneumoniae TIGR4 render the bacterium incapable of surviving in mouse models of nasopharyngeal colonization, pneumonia, and sepsis.

High Performance Computing for Genome Sequencing and Assembly
Dr. Peterson’s research is focused on exploring the structure and evolution of eukaryotic and prokaryotic genomes using genomic, cytogenetic, molecular biology, and computational techniques. By elucidating and comparing the sequences of genes and repeat sequences from a diverse group of organisms, his lab is illuminating trends in molecular evolution and discovering sequences responsible for economic and adaptive traits. Such research accelerates agricultural plant/animal improvement through marker-aided selection strategies and/or genetic engineering. Additionally, we are investigating repetitive DNA sequences and their role in genome evolution. At present, the research organisms we are studying include cotton, conifer trees, nematode and arthropod pests, crocodilians, and bacteria with anti-fungal properties. Bioinformatics and high performance computing play a central role in this research, especially in genome assembly and analysis of massive nucleic acid datasets. We have been involved in large-scale genome sequencing/sequence analysis projects that have been published in journals such as Nature, Nature Biotechnology, and Science.
Increasingly, Dr. Peterson’s research has focused on the use of computational biology techniques to distill biological information from large, complex datasets. REU students working in the Peterson lab will be trained to use high performance computing (HPC) instruments to assemble and annotate genomes sequenced by his research team. Training will be tailored to each individual trainee based upon his/her familiarity, if any, with UNIX, HPC, and the genome/organism assigned. After becoming proficient in UNIX, undergraduate trainees will be taught to use modern open-source genome analysis scripts to explore test data sets. Once proficiency using the scripts has been demonstrated, trainees will be assigned a previously uncharacterized DNA sequence dataset to assemble and annotate. The sheer number of genomes sequenced by the IGBB (ca. 30 per year) means that there is no shortage of data for characterization/study.
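A first task after assembling a genome is often to summarize how contiguous the assembly is. The sketch below computes N50, a standard assembly statistic: the contig length at which contigs of that length or longer contain at least half of the assembled bases. It is a generic teaching example, not one of the lab's scripts, and the contig lengths are invented.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

A higher N50 generally indicates a less fragmented assembly, although it says nothing about whether the contigs are correct.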

High Throughput Maize Genomics
In the Corn Host Plant Resistance Research Unit of the USDA Agricultural Research Service, Dr. Warburton investigates the genetic basis of aflatoxin and A. flavus resistance in corn using genetic and genomic tools. We are currently working to identify and validate genetic sequences associated with resistance to the toxic fungus Aspergillus flavus. In the course of genetic analyses, the lab generates very large amounts of sequencing data. These data range from high coverage but low depth (Genotype by Sequencing data, GBS) to low coverage and high depth (one gene sequenced multiple times in multiple individuals). The data must all be stored, retrieved, and analyzed as efficiently as possible, and re-analyzed as new information comes to light. Changes in DNA sequences are correlated with changes in plant phenotypes via genome wide association studies (GWAS), candidate gene association analysis, and linkage mapping. Data storage and retrieval are computationally intensive, and help is often needed with programs to find the exact sequence variation we need as reliably and quickly as possible. This variation is typically used for association or linkage mapping studies. In addition, we are beginning to work with RNA and gene expression data. In contrast to the simpler changes in genetic sequence that we have dealt with in DNA data, RNA data has an added component: the number of copies of each RNA, or expression level, becomes very important and must also be stored, retrieved, and analyzed for each unique sequence. The added data associated with each genetic sequence requires computational skills to handle. Expertise in both databasing and programming is very useful in this work, and biological understanding helps ensure that data retrieval and analyses are working correctly.
There are many currently available online resources for DNA and RNA pipelines, covering data generation, storage, alignment, and analysis, but for any given project and species, these pipelines almost always need tweaking and manual curation.
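The distinction between coverage and depth mentioned above can be made concrete with a small calculation. The sketch below assumes that reads have already been aligned and are represented as half-open (start, end) intervals on a reference; the function name and the toy intervals are hypothetical, and real pipelines would use dedicated alignment tools rather than this loop.

```python
def coverage_stats(reads, reference_length):
    """Return (breadth of coverage, mean depth over covered positions).

    `reads` is a list of (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * reference_length
    for start, end in reads:
        # Clip each read to the reference and count it at every position.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    breadth = covered / reference_length
    mean_depth = sum(depth) / covered if covered else 0.0
    return breadth, mean_depth


# GBS-like pattern: most of the region touched, each position seen once.
print(coverage_stats([(0, 5), (5, 10)], 10))  # (1.0, 1.0)
# Single-gene pattern: a small region sequenced many times.
print(coverage_stats([(0, 2)] * 4, 10))       # (0.2, 4.0)
```

The first case corresponds to high coverage with low depth, and the second to low coverage with high depth.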

Genomic Dynamics of Populations
Dr. Welch is an evolutionary geneticist with two distinct research foci at present. The role of transcribed microsatellites as agents of adaptive change is being studied using the annual sunflower, Helianthus annuus, and RNAseq based methods. He is also investigating the population dynamics of small populations using Caribbean rock iguanas as a model system.
Projects for undergraduates will be designed both to generate usable data and to serve as complete introductions to hypothesis driven research. Projects will be focused on understanding the role microsatellites play in generating gene expression variance, and how that phenotypic variance is influenced by selection. For example, students involved in screening microsatellites for amplification and variability in sunflowers will be testing the hypothesis that transcribed microsatellites are under greater evolutionary constraint than are anonymous microsatellites. The prediction that follows from this hypothesis is that anonymous loci should harbor more variation than those that are transcribed.
Some students will learn how to perform fragment analysis, and basic computational biology associated with population genetic studies. Students collecting data on seed set and seed mass could test the hypothesis that variance in reproductive success varies across multiple populations. In this way, students are involved in meaningful research, and they are introduced to the entire process of science from hypothesis development to reporting. Students with more advanced computational skills will be afforded the opportunity to develop bioinformatics projects using our RNAseq data. For example, one REU student in the summer of 2015 studied variation in transcriptomes by comparing sequence similarity across six individuals. She followed up by demonstrating that transcribed microsatellites tend to be found in genes that are consistently transcribed rather than in spurious transcripts that are unique to individual sunflowers. She concluded that functional genes are relatively enriched with microsatellites.
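The prediction that anonymous loci harbor more variation than transcribed loci can be checked with a standard population genetic summary. The sketch below computes expected heterozygosity (gene diversity, without a small-sample correction) from allele calls such as those produced by fragment analysis. The allele sizes are invented for illustration and are not data from the sunflower project.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies.

    `alleles` is a list of allele calls (for example, fragment sizes in
    base pairs) pooled across the sampled individuals at one locus.
    """
    n = len(alleles)
    if n == 0:
        raise ValueError("at least one allele call is required")
    counts = Counter(alleles)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical fragment sizes at one transcribed and one anonymous locus.
transcribed = [150, 150, 150, 150]
anonymous = [150, 152, 150, 152]

print(expected_heterozygosity(transcribed))  # 0.0
print(expected_heterozygosity(anonymous))    # 0.5
```

Under the stated hypothesis, values for anonymous loci would tend to be higher than those for transcribed loci when compared across many loci.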