Statistician and R programmer

Resume posted by aiminy in Medical.
Desired salary: $90,000.00
Desired position type: Any
Location: Miami, Florida, United States


I am a statistician and R programmer with an interest in data science. I am looking for a home-based statistician and R programmer job.


Ph.D. in Bioinformatics and Computational Biology and Ph.D. minor in Computer Science, Iowa State University (2008), Ames, IA

MS in Statistics, Iowa State University (2007), Ames, IA

Computer science coursework, Iowa State University (2002), Ames, IA


Work Experience

Statistician                                                                                                       11/2015-

Sylvester Comprehensive Cancer Center

Biostatistics and Bioinformatics Core                                                             Miami, FL

Projects I am working on

  • Develop an R package for constructing enrichment networks that adjusts for bias from the number of exons and/or splice junctions in gene set enrichment analysis of RNA-Seq data
  • Develop an R package for processing and analyzing 5' UTR, 3' UTR and downstream of gene (DoGs) sequencing data
  • Develop pipelines for processing RNA-Seq, ChIP-Seq and ATAC-Seq data
  • Develop a pipeline for processing, aligning, comparing and visualizing predicted protein structure models
  • Develop a method for identifying regulatory elements of genes using ChIP-Seq data
  • Develop a pipeline for identifying and annotating somatic mutations in whole-exome sequencing data


Statistician                                                                                                       9/2013-11/2015

Cornell University                                                                                          Ithaca, NY

Projects I worked on

  • Participate actively in software development and database administration for the Sol Genomics Network project
  • Develop the yambase and zeabase databases, and contribute to the development of the cassavabase database
  • Develop parsers for raw phenotype, pedigree and Genotyping-By-Sequencing data from different breeding trials
  • Develop loaders for loading phenotype, pedigree and Genotyping-By-Sequencing data into the database
  • Develop an Identity-By-Descent (IBD)-based General Combining Ability (GCA) model for genomic prediction
  • Help implement a modified augmented design and integrate it, along with other experimental designs, into a database server
  • Instruct biologists in using the R statistical language and applying statistical methods to data analysis
  • Process RNA-Seq data and perform de novo transcriptome assembly with Trinity


Statistician                                                                                                       3/2012-8/2012

Program of Biostatistics and Biomathematics

Fred Hutchinson Cancer Research Center                                                      Seattle, WA

Projects I worked on: statistics and informatics support for identifying biomarkers for early cancer detection

  • Combine machine learning methods with search algorithms to identify biomarkers for early cancer detection
  • Apply a regularized multivariate regression method to study the relationship between protein expression profiles and gene expression profiles
  • Apply a Procrustes analysis procedure to compare protein expression profiles and gene expression profiles
  • Combine principal component analysis with gene ontology and pathway information to identify the set of biomarkers related to cancer status
  • Perform data management and quality control for clinical data from different types of cancer research


Statistician                                                                                                       12/2010-7/2011

AVEO Pharmaceuticals, Inc.                                                                          Cambridge, MA


Projects I worked on: statistics and bioinformatics support for identifying biomarkers in high-dimensional data in anti-cancer drug discovery research

  • Develop Bayesian statistical methods to identify biomarkers and gene regulatory networks related to drug response in high-dimensional data
  • Apply a penalized Cox regression model to identify biomarkers in high-dimensional data by integrating gene expression data with progression-free survival outcomes from phase I and phase II clinical trials of two drug candidates
  • Apply and implement a Procrustes analysis procedure to compare gene expression profiles between different data sets
  • Implement reference sample-based batch effect adjustment for microarray expression profile data
  • Develop a pipeline for bridging a public gene expression database with an in-house Postgres relational database
  • Provide statistical theory support for the biostatistics and bioinformatics methods used in data analysis for anti-cancer drug discovery


Statistician                                                                                                       2/2010-12/2010

Department of Biostatistics and Computational Biology

Dana-Farber Cancer Institute

Department of Biostatistics

Harvard School of Public Health                                                                    Boston, MA

Projects I worked on: developing statistical methods and informatics pipelines, and constructing a database for SNP array and next-generation sequencing data related to cancer research


  • Identify rare genetic variants using a Bayesian regression approach
  • Identify SNP markers for quantitative traits underlying WM blood cancer using a Bayesian regression method
  • Apply a Bayesian regression method to perform eQTL mapping using RNA-Seq data
  • Supply statistical and bioinformatics support for a family-based deep sequencing project related to WM blood cancer
  • Implement the DFCI-Genolytics project (a database project based on the LAMP stack: Linux, Apache HTTP Server, MySQL, PHP)


Statistician                                                                                                       9/2008-12/2009

Center for Integrated Animal Genomics

and Department of Animal Science, Iowa State University                            Ames, IA

Projects I worked on: applying and developing statistical methods for identifying SNP genetic markers, constructing predictive models, and building a database for managing different types of phenotype data in animal genomic selection

  • Moobase: a relational database for managing data in a genomic selection project on energy balance traits in dairy cattle
  • Rmoo: an R package for processing phenotype data in a genomic selection project on energy balance traits in dairy cattle. In this package, I developed a number of R functions for processing the raw data and imputed missing values for several traits (body weight, body condition score, feed intake, and the protein, fat and lactose content of milk) using a natural cubic spline method on a two-year longitudinal study
  • Identify the Markov boundary of SNP sets to investigate epistasis
  • Study SNP-SNP interaction using a filter-wrapper approach
  • Use high-density SNP markers to predict the genetic value of several traits by dimension reduction and machine learning methods, and compare the prediction results from these methods with those from a Bayesian method


Statistician                                                                                                       6/2002-8/2008

Baker Center for Bioinformatics and

Biological Statistics, Iowa State University                                                    Ames, IA

Projects I worked on: applying statistical methods and developing computational approaches for prediction, modeling and molecular dynamics of protein structures, as well as building a web server for a distance-weighted elastic network model

  • Develop a web server for B-factor calculation using a distance-weighted elastic network model
  • Study the effects of different superposition methods on the correspondence between experimental conformational changes and the motions generated from an elastic network model
  • Apply a principal component shaving method for clustering protein structures
  • Develop a tool for visualizing protein structure data
  • Develop a novel knowledge-based side-chain orientation potential for protein fold recognition


  • Solid knowledge of multivariate statistics, machine learning methods, Bayesian statistics, design of experiments, clinical statistics, survival analysis and statistical methods in bioinformatics
  • More than 10 years of working experience in predictive modeling using logistic regression, naive Bayes classifiers, tree-based methods, neural networks and support vector machines
  • Extensive experience in providing statistics and bioinformatics expertise for preparing presentations, manuscripts and grant proposals
  • Fluent in MATLAB, R/Bioconductor, SAS and S-PLUS
  • Experienced in the design of relational databases (MySQL, PostgreSQL)
  • Many years of experience using C++
  • Experienced in using VB, Java, FORTRAN (LAPACK, BLZPACK), OpenGL, Ajax, jQuery and JavaScript
  • Skilled in Bash shell scripting, Perl, HTML, PHP (LAMP), Python, Conda and Git version control
  • Experienced working in Windows, Linux/HPC and SGI environments
  • Many years of experience in applying and developing statistical methods for genomic prediction, genome-wide association studies and biomarker discovery
  • Many years of experience in developing scoring functions using knowledge-based potentials for native-like protein structure discrimination
  • Familiarity with using BWA, STAR, SAMtools, Picard, GATK and Trinity to analyze Next Generation Sequencing (NGS) data
  • Familiarity with statistical genetics and bioinformatics tools such as rrBLUP, GenSel, Beagle FastIBD, PLINK, Haploview, SeqPup, Phylip, VMD and ProFit