Frequently Asked Questions
I have downloaded the "summary statistics". Are more data available?
Genotypic data for all 3000 WTCCC controls and 14 000 disease samples are available to qualified investigators. Access will be approved by the Consortium Data Access Committee (CDAC). For further details, see Access to WTCCC genotype data and samples.
How long will it take to process my application?
This will vary depending on the timing of the committee meetings, but we aim to process the application within two months.
I have confirmation from the Wellcome Trust that I am an "approved user". How may I access the data?
Getting access involves two stages. First, you need to create an account to access the secure website. Second, your account needs to be activated to allow access to the data (please note that this is currently a manual step). Send an e-mail to wtccc_admin@sanger.ac.uk to request this. There is no need to forward the approval letter from the CDAC.
I have registered to access the secure web site but am unable to login. Why is this?
We need to verify e-mail addresses as part of the registration process. An e-mail will have been sent to you containing a link, which must be opened in order to complete registration.
I have some questions regarding the various forms and agreements. Whom should I contact?
For questions about the access procedure and data use policies, please contact cdac@wellcome.ac.uk. For queries regarding the website and data only, please e-mail wtccc_admin@sanger.ac.uk. New queries should preferably be sent to this address rather than directly to a member of WTCCC staff.
I wish to access the data as part of a group of collaborators. Will each collaborator need to make a separate application?
A single application can be submitted, but the full contact details for each collaborator must be provided. If more than one Institution is involved, a separate signed Data Access Agreement must be submitted for each Institution.
I have already been granted access to genotype data for the controls. May I have access to genotype data for the cases too? What do I need to do?
Users previously granted access to the controls through the CDAC may have access to case data, without a separate application, by e-mailing cdac@wellcome.ac.uk. You may be asked to sign the most recent version of the Data Access Agreement. See Access to WTCCC genotype data and samples.
Are phenotypic data available for the disease samples?
The WTCCC has limited phenotype data on the disease samples: disease status, age, sex and broad geographical region within Britain. Access to additional phenotype data must be arranged directly with the relevant principal investigator. The principal investigators for each disease group are listed on the Overview page of the website. For the 1958 Birth Cohort controls, access is by application to the 1958 Oversight Committee. Further details can be found at British 1958 Birth Cohort.
Which genotype calling methods did the consortium use in its analysis?
The analysis in the consortium papers used genotypes derived using Chiamo (Affymetrix 500K) and GenCall (Infinium 15K).
What score thresholds should be used in selecting no calls for individual genotypes?
It is recommended these genotypes be discarded: probability < 0.9 (Chiamo); score > 0.5 (BRLMM); score < 0.15 (GenCall). Exclusion lists (indicating poorly performing assays and samples) and filtered data (with such data removed) are also available.
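As an illustration of how the thresholds above might be applied when filtering calls, here is a minimal Python sketch. The function name, the dictionary and the lower-case method labels are assumptions made for this example and are not part of any WTCCC software; only the three cut-offs come from the answer above.

```python
# Discard rules taken from the recommendation above. Each rule returns True
# when a genotype should be treated as a no call. Names are illustrative only.
DISCARD_RULES = {
    "chiamo": lambda probability: probability < 0.9,  # Chiamo: probability below 0.9
    "brlmm": lambda score: score > 0.5,               # BRLMM: score above 0.5
    "gencall": lambda score: score < 0.15,            # GenCall: score below 0.15
}


def is_no_call(method: str, value: float) -> bool:
    """Return True if a genotype with this score should be discarded."""
    try:
        rule = DISCARD_RULES[method.lower()]
    except KeyError:
        raise ValueError(f"unknown calling method: {method!r}") from None
    return rule(value)


if __name__ == "__main__":
    # Hypothetical scores, chosen only to exercise each rule.
    for method, value in [("Chiamo", 0.85), ("BRLMM", 0.3), ("GenCall", 0.1)]:
        verdict = "discard" if is_no_call(method, value) else "keep"
        print(method, value, verdict)
```

Note that the direction of the comparison differs between methods: for BRLMM a higher score means a less reliable call, whereas for Chiamo and GenCall a lower value does.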
I have successfully logged into the site but cannot find any data.
See the question "I have confirmation..." above.
What are CEL files?
CEL files contain the raw probe intensities from the Affymetrix chips. It is from these data that genotype calls are derived, using algorithms such as BRLMM and Chiamo.
Are the CEL files available to download?
Yes, but because of their size (the data for the 3000 controls are approximately 150 GB; the cases add a further 700 GB), separate arrangements exist for obtaining them. Currently, data for the controls are available from an rsync server at the Sanger Institute. Those for the cases can be obtained by supplying a hard drive. Please contact wtccc_admin@sanger.ac.uk. Note that the [unformatted] drive(s) should have a larger capacity than the data required: a 1 TB drive (or 2 x 500 GB) is sufficient for 850 GB of data. A USB2 drive formatted as FAT32 is preferred.
Have you got software to open CEL files? Why are they so big?
Affymetrix provides tools and libraries to manipulate the various data files, such as the Affymetrix Power Tools API written in C++. This code is available, under GPL, via the Affymetrix web site.
I work in the group of another researcher who has been granted access to the data. May I also have access?
If you are under the direct supervision of the approved user, it is not necessary for you to make a separate application. Your supervisor must alert the CDAC that you will be viewing the data, by email to cdac@wellcome.ac.uk. If you are not under the direct supervision of the approved user, it will be necessary for you to make a new application for access to the data.
Could you provide files in a format for use by plink?
Please see Data formats for descriptions of the formats currently exported. It is not our intention to provide data in too many different formats, mainly due to the size of the data: we anticipate bioinformaticians will have sufficient expertise in making the relevant conversions. Where this is not the case, the Sanger team may be able to help; contact wtccc_admin@sanger.ac.uk.
What software should I use to manipulate the data I have downloaded?
See also the previous question. The data are provided 'as is' in formats deemed to be appropriate. Users are expected to be able to handle the data they download. It should be stressed that some of these formats are designed to be processed computationally rather than read by eye or opened with, for example, standard office packages.
Is it possible to automate the downloading of files from the site?
The authentication system stores information in a cookie, so you could automate downloading with utilities such as "wget" and "curl". Once logged in, a command such as
wget --load-cookies COOKIES_FILE URL
should work. You will still have to run this once per file, taking the list of URLs from the web page.
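For those who prefer a script to repeated wget calls, the same cookie-based approach can be sketched in Python using only the standard library. This is a rough sketch, not a supported tool: it assumes the cookie file is in the Netscape format that wget's --load-cookies reads, and the function names, the cookies.txt file name and the example URL in the usage note are all hypothetical.

```python
import http.cookiejar
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def opener_from_cookies(cookies_file):
    """Build a URL opener that sends the cookies stored in cookies_file."""
    jar = http.cookiejar.MozillaCookieJar(cookies_file)
    # Keep session cookies too; a login cookie is often marked as such.
    jar.load(ignore_discard=True, ignore_expires=True)
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def local_name(url):
    """Derive a local file name from the last path component of the URL."""
    name = Path(urlsplit(url).path).name
    return name or "download"


def download_all(urls, cookies_file, dest_dir="."):
    """Fetch each URL in turn, one request per file, into dest_dir."""
    opener = opener_from_cookies(cookies_file)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / local_name(url)
        with opener.open(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)  # stream to disk in chunks
```

A call such as download_all(["https://example.org/files/chr1.txt.gz"], "cookies.txt", "wtccc_data") would then fetch each listed file; the list of URLs still has to be collected from the web page by hand.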
How do I download the data using ftp or sftp?
The individual level genotype data are currently only available via the web interface. There are no immediate plans to provide an alternative.
Where has the "Data Access" link gone?
Data available only to logged-in users now appear under "Registered access". Under that menu is a "Login" link, which is replaced by a list of pages upon successful login. "Data files" contains the secure data formerly accessed via "Data Access"; "Support files" contains the files previously listed under "Documents". If the login is successful (you do not get a page saying "Invalid account details") but you do not see the "Data files" link, the page may be accessed directly at https://www.wtccc.org.uk/cgi-bin/manager.