High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The continued goal of this project is to design and develop innovative compression-based algorithmic techniques for efficiently processing massive biological data. We will branch out beyond compressive search to address the imminent need to securely store and process large-scale genomic data in the cloud, as well as to gain insights from massive metagenomic data.

The key underlying observation is that genomic data is highly structured, exhibiting high degrees of self-similarity. In our previous granting period, we exploited its high redundancy and low fractal dimension to enable scalable compressive storage and acceleration for search of sequence data as well as other biological data types relevant to structural bioinformatics and chemogenomics. In this renewal, we will continue to capitalize on the structure (i.e., compressibility) of genomic data to: (i) overcome privacy concerns that arise in sharing sensitive human data (e.g., on the cloud); (ii) address new challenges, beyond search, with metagenomic data; and (iii) seek to widen the adoption of the previous and newly-proposed compressive algorithms for industry, research, and clinical use. We will demonstrate the utility of our compressive techniques to the characterization of human genomic and metagenomic variation.

We will collaborate with co-I Sahinalp's lab (Indiana University, Bloomington) on developing and applying these tools to high-throughput data sets including autism spectrum disorder (with Isaac Kohane and Evan Eichler) and cancer (with PCAWG, Pan Cancer Analysis of Whole Genomes), the microbiome (with Eric Alm and Jian Peng), as well as human variation analysis (GATK, with Eric Lander and Eric Banks).
The broad, long-term goal is to apply our compressive approach to massive biological data sets to elucidate the still obscure molecular landscape of diseases.

Successful completion of these aims will result in computational methods and tools that will significantly increase our ability to securely store, access and analyze massive data sets and will reveal fundamental aspects of genetic variation, as well as testable hypotheses for experimental investigations. Not only will all developed software be made publicly available, but as part of our integration aim, we will also ensure that the research community can make use of our innovations with minimal effort. Through our research collaborations, we will both build these tools and demonstrate their relevance to the characterization of human health and disease.