CNAnova is a stand-alone software package for identifying recurrent regions of copy number aberrations (CNAs) using SNP microarray data. It runs from the command line on Linux platforms and is composed of several modules written in the R programming language. The package archive, which includes the user manual, and a sample dataset are available for download.
We used 270 HapMap samples hybridized to Affymetrix SNP 6.0 arrays to simulate copy number changes, aiming to recreate different scenarios of cross-sample co-occurrence of copy number aberrations (CNAs). We used the following basic scenario. First, the frequency of CNAs and their amplitude were defined by creating CNAs that occur in 1-75% of the data and include from 10 to 300 probes, which, given the distribution of SNPs on Affymetrix SNP 6.0 arrays, is approximately equal to 3000-100,000 kb. The positional effect of CNAs, such as the occurrence of deletions in different exons of a gene, was simulated by shifting CNAs around the middle position in some samples and segments. The procedure was applied to chromosome 17 of the HapMap data, and 30 regions of recurrent copy number changes were simulated. The log-ratio values in each region were sampled from normal distributions with means of -0.55, 0.6, -1.1, 1.2, -1.6 and 1.7, to recreate losses and gains of 1, 2 and 3 copies, and with a constant variance of 0.31 inferred from the distribution of log-ratio values in the real CNVs of the HapMap data. Finally, to test the robustness of algorithms against outliers present within regions of recurrent CNAs, we selected 6 highly recurrent CNA regions and simulated non-significant log-ratio changes with mean (+/-) 0.2/0.3 in 4 samples that do not have CNAs in those regions.
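The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to generate the published data: the function name `simulate_cna`, the `max_shift` parameter and the choice to overwrite (rather than add to) the existing log-ratio values are hypothetical. Only the means and the variance of 0.31 come from the text.

```python
import math
import random

# Means of the log-ratio distributions given in the text, keyed by the
# number of copies lost (negative) or gained (positive).
CNA_MEANS = {-1: -0.55, 1: 0.6, -2: -1.1, 2: 1.2, -3: -1.6, 3: 1.7}
CNA_VARIANCE = 0.31  # constant variance stated in the text


def simulate_cna(log_ratios, start, n_probes, copy_change, max_shift=0, rng=None):
    """Return (new_log_ratios, segment_start) for one sample.

    Assumption: probes inside the segment are overwritten with draws from
    N(mean, 0.31). The segment start is shifted by a random offset in
    [-max_shift, max_shift] to mimic the positional effect.
    """
    if n_probes > len(log_ratios):
        raise ValueError("segment is longer than the probe vector")
    rng = rng or random.Random()
    sd = math.sqrt(CNA_VARIANCE)  # random.gauss expects a standard deviation
    shift = rng.randint(-max_shift, max_shift) if max_shift else 0
    # Keep the shifted segment inside the probe vector.
    seg_start = max(0, min(start + shift, len(log_ratios) - n_probes))
    out = list(log_ratios)
    for i in range(seg_start, seg_start + n_probes):
        out[i] = rng.gauss(CNA_MEANS[copy_change], sd)
    return out, seg_start
```

Calling `simulate_cna(sample, 400, 100, -1, max_shift=20)` on a subset of samples would, under these assumptions, produce a recurrent single-copy loss of 100 probes whose position varies slightly from sample to sample.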
Simulation Data Set Content
All simulated data is stored in the following file [SimulationData.tar.gz, size ~ 1Gb]. It contains files for three different simulation scenarios: (1) simulation of 1, 2 and 3 copy number changes across the full dynamic range of CNA frequency (3-70%), (2) simulation of only single copy number changes with low to median frequency of occurrence (<30%) and (3) simulation of 1 and 2 copy number changes, again with low to median frequency.
Each of the simulation scenarios comprises:
- Simulation_data_#.txt: log ratio values of the final simulated data
- CNAs_SD_for_Simulation_data_#.csv: standard deviation of probe intensities within each CNA region
- CNAs_means_for_Simulation_data_#.csv: mean of probe intensities within each CNA region
- CNAs_length_for_Simulation_data_#.csv: number of probes within each CNA region
- CNAs_positions_for_Simulation_data_#.csv: start position of each CNA region
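After unpacking the archive, a small helper can check that all five files of a scenario are present. The sketch below assumes that the `#` in the file names is replaced by the scenario number (1, 2 or 3) and that the files sit directly in one directory. Neither assumption is confirmed by the text, and the function names are hypothetical.

```python
from pathlib import Path

# File-name templates copied from the list above; "#" is the scenario placeholder.
TEMPLATES = {
    "log_ratios": "Simulation_data_#.txt",
    "sd": "CNAs_SD_for_Simulation_data_#.csv",
    "means": "CNAs_means_for_Simulation_data_#.csv",
    "lengths": "CNAs_length_for_Simulation_data_#.csv",
    "positions": "CNAs_positions_for_Simulation_data_#.csv",
}


def scenario_files(data_dir, scenario):
    """Map each file role to its expected path for one scenario (assumed 1-3)."""
    if scenario not in (1, 2, 3):
        raise ValueError("scenario is assumed to be 1, 2 or 3")
    return {
        role: Path(data_dir) / name.replace("#", str(scenario))
        for role, name in TEMPLATES.items()
    }


def missing_files(data_dir, scenario):
    """Return the roles whose expected file does not exist in data_dir."""
    paths = scenario_files(data_dir, scenario)
    return sorted(role for role, path in paths.items() if not path.is_file())
```

An empty list from `missing_files("SimulationData", 1)` would indicate that the first scenario is complete under the naming assumption above.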