Data formats and interpretation
Flatfiles Readable Ex Silico - version 1.0
These formats are intended to be both human-readable and easy to parse computationally. Files in these formats are distinguished by the characters 'fs1' in their filenames.
Genotype Data
The Affymetrix 500K SNP chip can yield approximately 4 GB of data per cohort, so this platform's genotype data have been partitioned by chromosome and sorted by SNP position. Given its smaller size, the Infinium 15K SNP chip data are presented as a single file per cohort.
Each file is tab-delimited and contains one genotype per line. Regardless of how the SNPs are organized, all assays are sorted by sample so that the file can be readily separated into sample blocks. Note also that all Affymetrix genotypes have been configured to the '+' strand of the SNP.
The following is a brief example of the genotype data format:
SNP        SAMPLE      GENOTYPE  SCORE
rs1234567  WTCCC12345  CC        0.9262
rs1234568  WTCCC12345  TC        0.8650
rs1234569  WTCCC12345  AA        0.9117
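Because the assays are sorted by sample, a genotype file can be split into sample blocks in a single pass. The following is a minimal Python sketch, assuming the file starts with the header line shown in the example. The function name `iter_sample_blocks` and the second sample ID in the test data are illustrative and not part of the data release.

```python
import csv
import io
from itertools import groupby


def iter_sample_blocks(handle):
    """Yield (sample_id, rows) for each run of consecutive lines sharing a SAMPLE value.

    Assumes a tab-delimited file whose first line is the header
    SNP, SAMPLE, GENOTYPE, SCORE, as in the example above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for sample, rows in groupby(reader, key=lambda row: row["SAMPLE"]):
        yield sample, [(r["SNP"], r["GENOTYPE"], float(r["SCORE"])) for r in rows]


# Illustrative data: the last line uses a made-up second sample ID.
example = (
    "SNP\tSAMPLE\tGENOTYPE\tSCORE\n"
    "rs1234567\tWTCCC12345\tCC\t0.9262\n"
    "rs1234568\tWTCCC12345\tTC\t0.8650\n"
    "rs1234567\tWTCCC12346\tCC\t0.9117\n"
)
blocks = list(iter_sample_blocks(io.StringIO(example)))
```

Grouping only consecutive lines relies on the documented sort order; a file that is not sorted by sample would yield the same sample in more than one block.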
Sample Support Files
All cohorts come with information describing each sample. These files are tab-delimited and contain each sample's gender, plate and well number, collection region, supplier code, and cohort. Where available, they may include ages of onset and recruitment. They are denoted 'samples' files, e.g.:
Infinium_20060727fs1_samples_58C.txt
The following is a brief example of a sample support file:
SAMPLE      GENDER*  COHORT  SUPPLIER  PLATE/WELL  REGION        AGE_ONSET+
WTCCC12345  2        58C     1958OC    12701b2     Southern      4
WTCCC12346  1        58C     1958OC    12701c2     Eastern       4
WTCCC12347  2        58C     1958OC    12701d2     Northwestern  4
*Females denoted 2, males denoted 1, undefined on manifest is denoted 0.
+Ages are in decades, with 0 for ages 0 to 9 inclusive, 1 for 10 to 19, etc.
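The two coded fields can be decoded with a small lookup and a little arithmetic. This is a hedged Python sketch; the helper names `decode_gender` and `decade_to_age_range` are hypothetical, and the codes are read as text, as they would be from a tab-delimited file.

```python
# Gender codes as described in the footnote above.
GENDER_LABELS = {"0": "undefined", "1": "male", "2": "female"}


def decode_gender(code):
    """Map a GENDER code ('0', '1' or '2') to a readable label."""
    return GENDER_LABELS[code.strip()]


def decade_to_age_range(decade):
    """Convert a decade code to an inclusive (low, high) age range in years."""
    d = int(decade)
    return (10 * d, 10 * d + 9)
```

For instance, the AGE_ONSET value 4 in the example rows corresponds to ages 40 to 49 inclusive.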
SNP Support Files
Each chip (Affymetrix 500K or Infinium 15K) comes with information describing each SNP assay. Each file is tab-delimited and contains each SNP's label (e.g. dbSNP rs#), chromosome, position, strand, type (if applicable), and alleles. Because the platforms use different methods, their SNP support files differ slightly. For example, Affymetrix uses at least 6 different probes for each allele of each SNP. Note also that all strands for Affymetrix have been reconfigured to '+'. These files can be found in the 'Common' area of Data Access and are denoted 'SNP' files, e.g.:
Infinium_20060727fs1_SNP.txt
Affx_20060707fs1_SNP.txt
The following is a brief example of an Infinium SNP support file:
SNP         CHROMOSOME  POSITION   STRAND  ALLELES  PROBE_A  PROBE_B
rs123456    6           26020613   +       A/T      ...      ...
rs1234567   10          44816126   -       C/T      ...      ...
rs12345678  8           186274531  +       G/A      ...      ...
- PROBE_A: Probe for the first allele
- PROBE_B: Probe for the second allele
The following is a brief example of an Affymetrix SNP support file:
SNP       CHROMOSOME  POSITION  STRAND  ALLELE  OFFSET  PROBE_STRAND  PROBE_SEQ
rs123456  7           78244234  +       G       3       f             AACAATTCGTTC...
rs123456  7           78244234  +       C       3       f             AACAATTCGTTC...
rs123456  7           78244234  +       G       -4      r             CCTATTTTATTT...
rs123456  7           78244234  +       C       -4      r             CCTATTTTATTT...
rs123456  7           78244234  +       G       -4      f             CGTTCACTCAAT...
rs123456  7           78244234  +       C       -4      f             CGTTCACTGAAT...
rs123456  7           78244234  +       C       -2      r             TATTTTATTTTA...
rs123456  7           78244234  +       G       -2      r             TATTTTATTTTA...
rs123456  7           78244234  +       G       1       r             TTTATTTTATTG...
rs123456  7           78244234  +       C       1       r             TTTATTTTATTC...
rs123456  7           78244234  +       C       0       r             TTTTATTTTATT...
rs123456  7           78244234  +       G       0       r             TTTTATTTTATT...
- ALLELE: SNP allele (A, C, G, or T) in the target (design) sequence
- OFFSET: Offset of the centre position of the probe from the SNP position
- PROBE_STRAND: A forward (f) design means that the actual probe synthesized on the array will hybridize to the forward strand of the design sequence. A reverse (r) design will hybridize to the reverse strand.
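Since the Affymetrix SNP support file holds one line per probe, a common first step is to collect the probes belonging to each allele of each SNP. The following is a rough Python sketch, assuming a header line with the column names shown in the example. The function name `probes_by_allele` is hypothetical, and the truncated probe sequences are copied as they appear in the example.

```python
import csv
import io
from collections import defaultdict


def probes_by_allele(handle):
    """Group probe lines by (SNP, ALLELE).

    Returns a dict mapping (snp, allele) to a list of
    (offset, probe_strand, probe_seq) tuples, with offset as an int.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["SNP"], row["ALLELE"])
        grouped[key].append((int(row["OFFSET"]), row["PROBE_STRAND"], row["PROBE_SEQ"]))
    return dict(grouped)


# First four probe lines of the example; sequences are truncated in the source.
example = (
    "SNP\tCHROMOSOME\tPOSITION\tSTRAND\tALLELE\tOFFSET\tPROBE_STRAND\tPROBE_SEQ\n"
    "rs123456\t7\t78244234\t+\tG\t3\tf\tAACAATTCGTTC...\n"
    "rs123456\t7\t78244234\t+\tC\t3\tf\tAACAATTCGTTC...\n"
    "rs123456\t7\t78244234\t+\tG\t-4\tr\tCCTATTTTATTT...\n"
    "rs123456\t7\t78244234\t+\tC\t-4\tr\tCCTATTTTATTT...\n"
)
grouped = probes_by_allele(io.StringIO(example))
```

Applied to the full twelve-line example, each of the two alleles would map to six probes.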
Note that, for some data sets on this site, the chromosome X data have been split into two 'chromosomes', 23 and 24, because the region not homologous with Y (23) needed to be treated differently from the pseudoautosomal region (24).
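When reading chromosome codes, it may help to translate the two X codes back into descriptive labels. This is a small Python sketch; the label strings and the name `describe_chromosome` are illustrative, and the mapping applies only to those data sets that use the 23/24 split.

```python
# Illustrative labels; only the code-to-region assignment comes from the note above.
X_REGION_BY_CODE = {
    "23": "X (region not homologous with Y)",
    "24": "X (pseudoautosomal region)",
}


def describe_chromosome(code):
    """Return a descriptive label for codes 23 and 24, and the code itself otherwise."""
    code = str(code).strip()
    return X_REGION_BY_CODE.get(code, code)
```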
Oxstats
Data from Affymetrix 500K experiments are also provided in a format which may be used directly with the software packages SNPTEST, IMPUTE and GTOOL. For further details, please see http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html.
Normalised signals
Quantile normalised signal data were generated from the Affymetrix intensity ('CEL') files and used as input to the Chiamo genotype calling program. Software to perform the normalisation is available (see Available software). The signal data are in tab-delimited plain text, one line per SNP, consisting of IDs, position, alleles and, for each sample, a pair of intensities: one for each of the two alleles.
The following is a brief example of a signal file.
AFFYID         RSID   pos    AlleleA  AlleleB  1234A1_A  1234A1_B  1234A2_A  1234A2_B  ...
SNP_A-0123456  rs001  10000  C        T        0.407238  1.366599  0.347438  0.922283  ...
SNP_A-0123457  rs002  20000  A        G        0.958866  1.084143  0.148448  1.534463  ...
SNP_A-0123458  rs003  30000  C        G        1.943426  0.291587  1.610764  0.061066  ...
Please note that these files may contain very long lines (as do those in OXSTATS format) and are not intended to be human-readable.
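Given the long lines, signal files are best read one SNP at a time. The following is a minimal Python sketch, assuming five leading annotation columns followed by adjacent `_A`/`_B` column pairs per sample, as in the example. The function name `iter_signals` is hypothetical, and the test data keep only the two samples shown in the example.

```python
import io


def iter_signals(handle):
    """Yield (annotation, intensities) per SNP line of a signal file.

    annotation maps the five leading column names to their values;
    intensities maps each sample ID to an (allele A, allele B) pair of floats.
    Assumes sample columns come in adjacent pairs named <sample>_A, <sample>_B.
    """
    header = handle.readline().rstrip("\n").split("\t")
    fixed, sample_cols = header[:5], header[5:]
    # Derive sample IDs by dropping the '_A' suffix from every first column of a pair.
    samples = [name[:-2] for name in sample_cols[0::2]]
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        annotation = dict(zip(fixed, fields[:5]))
        values = [float(v) for v in fields[5:]]
        intensities = dict(zip(samples, zip(values[0::2], values[1::2])))
        yield annotation, intensities


example = (
    "AFFYID\tRSID\tpos\tAlleleA\tAlleleB\t1234A1_A\t1234A1_B\t1234A2_A\t1234A2_B\n"
    "SNP_A-0123456\trs001\t10000\tC\tT\t0.407238\t1.366599\t0.347438\t0.922283\n"
    "SNP_A-0123457\trs002\t20000\tA\tG\t0.958866\t1.084143\t0.148448\t1.534463\n"
)
records = list(iter_signals(io.StringIO(example)))
```

Processing the file as a generator keeps only one SNP line in memory at a time, which matters when each line carries intensities for every sample in a cohort.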
Interpretation
Exclusions
The Affymetrix 500K data sets are generally provided 'as is' and should be viewed in conjunction with the exclusion lists provided. There is one list for the samples in each panel and one for the SNPs, each giving the reason(s) for omission.
For the genotypes called by BRLMM, it is recommended that those with a score > 0.5 be treated as no calls. For the Chiamo data, the recommended probability threshold for inclusion is 0.9 and above. See the consortium paper for further information on the Chiamo data.
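The two recommendations run in opposite directions: BRLMM calls are dropped above a score threshold, while Chiamo calls are kept at or above a probability threshold. This is a hedged Python sketch of both filters; the constant and function names are hypothetical.

```python
BRLMM_MAX_SCORE = 0.5         # scores strictly above this are treated as no calls
CHIAMO_MIN_PROBABILITY = 0.9  # probabilities of 0.9 and above are included


def is_called_brlmm(score):
    """True if a BRLMM genotype should be kept (score not greater than 0.5)."""
    return score <= BRLMM_MAX_SCORE


def is_called_chiamo(probability):
    """True if a Chiamo genotype should be kept (probability 0.9 or above)."""
    return probability >= CHIAMO_MIN_PROBABILITY
```

A score of exactly 0.5 is kept under the BRLMM rule, since only scores greater than 0.5 are described as no calls.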