Triticum aestivum Assembly and Gene Annotation
Triticum aestivum (bread wheat) is a major global cereal grain essential to human nutrition. Wheat was one of the first cereals to be domesticated, originating in the Fertile Crescent around 7000 years ago. Bread wheat is hexaploid, with a genome size estimated at ~17 Gbp, composed of three closely related and independently maintained genomes that are the result of a series of naturally occurring hybridization events. The ancestral progenitor genomes are considered to be Triticum urartu (the A-genome donor) and an unknown grass thought to be related to Aegilops speltoides (the B-genome donor). This first hybridization event produced tetraploid emmer wheat (AABB, T. dicoccoides), which hybridized again with Aegilops tauschii (the D-genome donor) to produce modern bread wheat.
CS42 TGAC v1 Assembly
250 bp paired-end reads were generated from two CS42 libraries constructed using a PCR-free protocol. In total, 1.1 billion PE reads were generated, providing 32.78x coverage of the CS42 genome. The w2rap-contigger (based on DISCOVAR de novo) was used to assemble contigs. The contigger is available on GitHub and is fully described elsewhere. It utilises PCR-free libraries to reduce coverage bias, uses long 250 bp reads generated by the latest Illumina sequencing technology and retains the majority of the variation present in the reads when generating contigs. Multiple Nextera long mate-pair (LMP) libraries were generated for scaffolding, with insert sizes ranging from 2 to 12 Kb. LMP reads were processed using Nextclip and contigs were scaffolded using the SOAPdenovo2 prepare -> map -> scaffold pipeline.
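As a rough check on the coverage figure, depth can be estimated as total sequenced bases divided by genome size. The sketch below assumes that "1.1 billion PE reads" refers to read pairs (two 250 bp reads each) and uses the ~17 Gbp genome size estimate quoted above. Both are assumptions, and the function name is illustrative only.

```python
def sequencing_coverage(read_pairs: float, read_length_bp: int, genome_size_bp: float) -> float:
    """Estimate mean depth of coverage for a paired-end library."""
    # Assumption: each pair contributes two reads of equal length.
    total_bases = read_pairs * 2 * read_length_bp
    return total_bases / genome_size_bp


# Assumed inputs: 1.1 billion read pairs, 250 bp reads, ~17 Gbp genome.
estimate = sequencing_coverage(1.1e9, 250, 17e9)
print(f"{estimate:.1f}x")  # approximately 32.4x
```

Under these assumptions the estimate is about 32x, in line with the reported 32.78x. The small difference would be expected from rounding of the read count and genome size.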
Scaffolds were classified into chromosome-arm bins using arm-specific Chromosome Survey Sequence (CSS) reads. Scaffolds from 3B were not separated into short/long arm bins as individual arm datasets were not generated for this chromosome in the CSS project. The ‘sect’ method of KAT was used to compute kmer coverage over each scaffold using each CSS read set. Each non-repetitive kmer in a scaffold was scored proportionally to coverage on each CSS arm and scaffolds were classified using the following set of rules:
- Scaffolds with less than 10% of the kmers producing a vote were left as unclassified (marked as Chromosome arm “U”). These are mostly small and/or repetitive sequences.
- Scaffolds with a top score towards a CSS set at least double the second top score were classified to the highest scoring chromosome arm.
- Scaffolds with a top score towards a CSS set less than double the second top score were left as unclassified (marked as Chromosome arm “U”, but with the two top scores and CSS sets included in the sequence name). This category contains scaffolds that are classified as combinations of the two arms from the same chromosome, probably due to imprecise identification during flow-sorting. It also contains scaffolds from regions of the genome with specific flow-sorting biases, and assembly chimeras, which will all be investigated further.
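The three rules above can be sketched as a small decision function. This is a minimal illustration, not the pipeline's actual code: the function name, the score dictionary, and the format of the "U" label carrying the two top scores are all hypothetical.

```python
from typing import Dict


def classify_scaffold(arm_scores: Dict[str, float], voting_fraction: float) -> str:
    """Assign a scaffold to a chromosome arm from per-arm CSS kmer scores.

    arm_scores: hypothetical mapping of CSS arm name -> score.
    voting_fraction: fraction of the scaffold's kmers that produced a vote.
    """
    # Rule 1: fewer than 10% of kmers voting -> unclassified.
    if voting_fraction < 0.10 or not arm_scores:
        return "U"

    ranked = sorted(arm_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_arm, top_score = ranked[0]
    if top_score <= 0:
        return "U"
    if len(ranked) == 1:
        return top_arm
    second_arm, second_score = ranked[1]

    # Rule 2: top score at least double the runner-up -> assign to top arm.
    if top_score >= 2 * second_score:
        return top_arm

    # Rule 3: ambiguous -> unclassified, keeping both candidates in the label.
    # The label layout here is an assumption made for illustration.
    return f"U_{top_arm}_{top_score:g}_{second_arm}_{second_score:g}"


print(classify_scaffold({"1AL": 0.8, "1AS": 0.1}, voting_fraction=0.5))
print(classify_scaffold({"1AL": 0.5, "1AS": 0.4}, voting_fraction=0.5))
```

The first call satisfies the "at least double" rule and returns the top arm. The second is ambiguous between the two arms of the same chromosome and stays in "U" with both candidates recorded.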
Rather than using a simple length cutoff to include scaffolds in the final assembly, a content filter was applied to the scaffolds classified into each chromosome-arm bin in order to ensure short scaffolds containing unique content were not excluded from the assembly. Scaffolds were sorted by length, longest first. Scaffolds longer than 5 Kbp were automatically added to the assembly. Scaffolds between 500 bp and 5 Kbp were added from longest to shortest if at least 20% of the kmers in the scaffold were not already present in the assembly. Scaffolds shorter than 500 bp were excluded.
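The content filter can be sketched as a greedy pass over scaffolds sorted by length. The code below is an illustration under stated assumptions: the kmer size (k=31), the use of distinct kmers without reverse-complement handling, and the reading of the threshold as "at least 20% novel" are not specified in the text, and all names are hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct kmers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def content_filter(
    scaffolds: Dict[str, str],
    k: int = 31,                      # assumed kmer size
    long_cutoff: int = 5000,          # longer scaffolds are always kept
    short_cutoff: int = 500,          # shorter scaffolds are always dropped
    min_novel_fraction: float = 0.20,
) -> List[str]:
    """Return names of scaffolds kept for one chromosome-arm bin."""
    kept: List[str] = []
    seen: Set[str] = set()
    by_length = sorted(scaffolds.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if len(seq) < short_cutoff:
            break  # sorted longest first, so everything after is shorter too
        scaffold_kmers = kmers(seq, k)
        if len(seq) > long_cutoff:
            keep = True
        else:
            # Keep mid-length scaffolds only if enough of their kmers are new.
            novel = len(scaffold_kmers - seen)
            keep = bool(scaffold_kmers) and novel / len(scaffold_kmers) >= min_novel_fraction
        if keep:
            kept.append(name)
            seen |= scaffold_kmers
    return kept


# Toy example with scaled-down cutoffs so the behaviour is easy to follow.
toy = {"long": "ACGTACGTACGTAC", "dup": "ACGTACG", "novel": "GGGGCCC", "tiny": "AAA"}
print(content_filter(toy, k=3, long_cutoff=10, short_cutoff=4))
```

In the toy run, "dup" is dropped because all of its kmers already occur in "long", while "novel" is kept because its content is new. A production implementation would more likely use a dedicated kmer counter than Python sets.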
For assigned scaffolds, the arm assignment is included in the FASTA identifier. For unassigned scaffolds with more than 10% voting kmers, the highest and second highest vote is included in the FASTA identifier to indicate possible arms.
IWGSC gene predictions on the Chromosome Survey Sequence (CSS), PGSB/MIPS version 2.2 
Gene models were derived from the spliced alignment of publicly available wheat fl-cDNAs and the protein sequences of related grass species (barley, Brachypodium, rice and Sorghum). A large RNA-Seq dataset, covering different tissues and different developmental stages, was used to identify wheat-specific genes and additional splice variants. Redundant transcript structures from these different sources were merged.
A total of 99,386 protein-coding genes were predicted, with 193,667 transcripts and splice variants. To simplify display in the genome browser, splice variants are shown on a separate track, which is off by default.
IWGSC gene predictions on chromosome 3B, GDEC/INRA version 1.0 
The GDEC group at INRA annotated 5,326 protein-coding genes and 1,938 pseudogenes on chromosome 3B, and reported a transposable element content of 85%. An additional 251 gene models and 188 pseudogenes were annotated on un-anchored 3B scaffolds.
The GDEC gene set replaces the PGSB/MIPS gene set for the purpose of functional annotation and comparative genomics within Ensembl Plants. However, the PGSB/MIPS genes have been projected from the CSS onto the chromosome 3B assembly, and are shown on a separate track.
Chloroplast and mitochondrial genes
Triticeae-CAP predicted transcripts set - Krasileva et al. 
Predicted transcripts have been inferred from Exonerate alignments of wheat coding sequences (CDS) from two sets of transcripts: Triticum turgidum assembled RNA-seq data (Krasileva et al., Genome Biology 2013, 14:R66, Supplemental dataset 7) and a collection of publicly available wheat transcripts filtered to exclude pseudogenes, sequences shorter than 90 bp, and ORFs similar to those present in the T. turgidum set. The program findorf was used to predict the CDS within these transcripts as described in Krasileva et al. See the Triticeae-CAP project page for more information.
Repeat feature and non-coding RNA annotation
Non-coding RNA genes have been annotated using tRNAScan-SE (Lowe, T.M. and Eddy, S.R. 1997), RFAM (Griffiths-Jones et al. 2005), and RNAmmer (Lagesen K. et al. 2007) as part of our standard non-coding RNA annotation pipeline.
Wheat RNA-Seq, ESTs, and UniGene datasets have been aligned to the Triticum aestivum genome:
- 454 RNA-seq data were aligned using STAR, for the following ENA studies:
- Illumina RNA-seq data were aligned using STAR, for the following ENA study:
- Wheat UniGene cluster sequence data were aligned using Exonerate, following the standard Ensembl pipeline.
- All publicly available wheat EST data were aligned using STAR.
- TriFLDB sequences were aligned using STAR.
Analysis of the bread wheat genome using comparative whole genome shotgun sequencing - Brenchley et al. 
The wheat genome assemblies previously generated by Brenchley et al. (PMID:23192148) have also been aligned to the survey sequence, Brachypodium, barley and the wild wheat progenitors (Triticum urartu and Aegilops tauschii). Homoeologous variants inferred between the three wheat genomes (A, B, and D) are displayed in the context of the gene models of these five genomes.
Sequences of diploid progenitor and ancestral species permitted homoeologous variants to be classified into two groups: 1) SNPs that differ between the A and D genomes (where the B genome is unknown), and 2) SNPs that are the same between the A and D genomes, but differ in B.
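The two groups can be expressed as a simple rule on the bases observed in each subgenome. The sketch below is only an illustration of that rule: the function name and group labels are hypothetical, and it assumes that group 1 is assigned whenever A and D differ, regardless of whether a B base is available.

```python
from typing import Optional


def classify_homoeologous_snp(a_base: str, d_base: str, b_base: Optional[str]) -> str:
    """Place a homoeologous variant into one of the two groups described above."""
    # Group 1: A and D differ; the B state is treated as unknown (assumption).
    if a_base != d_base:
        return "group1_A_vs_D"
    # Group 2: A and D agree, and a known B base differs from them.
    if b_base is not None and b_base != a_base:
        return "group2_B_differs"
    # Anything else is not a homoeologous variant under these two rules.
    return "unclassified"


print(classify_homoeologous_snp("A", "G", None))   # group1_A_vs_D
print(classify_homoeologous_snp("A", "A", "T"))    # group2_B_differs
```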
The wheat gene alignments and the projected wheat SNPs are available on the Location view of the Triticum aestivum, Brachypodium distachyon and Hordeum vulgare genomes, as additional tracks under the "Wheat SNPs and alignments" section of the "Configure This page" menu.
Transcriptome assembly in diploid einkorn wheat Triticum monococcum - Fox et al. 
Genome-wide transcriptomes of two Triticum monococcum subspecies, the wild winter wheat T. monococcum ssp. aegilopoides (accession G3116) and the domesticated spring wheat T. monococcum ssp. monococcum (accession DV92), were constructed by generating de novo assemblies of RNA-Seq data derived from both etiolated and green seedlings. Assembled data are available from the Jaiswal lab and raw reads are available from INSDC projects PRJNA203221 and PRJNA195398.
The de novo transcriptome assemblies of DV92 and G3116 represent 120,911 and 117,969 transcripts, respectively. They were mapped to the bread wheat, barley and Triticum urartu genomes using STAR.
~504,092 SNP markers provided by CerealsDB, from the University of Bristol, were mapped to the TGACv1 assembly using an ungapped alignment model, keeping only matches with 100% coverage and 100% identity.
These SNPs can be part of the following platforms:
- The Axiom 820K SNP Array (contains ~820,000 SNP markers of which 504,092 have been mapped).
- The Axiom 35K SNP Array (contains 35,000 SNP markers of which 21,423 have been mapped).
Note that a SNP marker can be part of more than one platform.
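The filtering criteria for SNP marker mapping (ungapped, 100% coverage, 100% identity) can be sketched as a predicate over alignment hits. The record layout and field names below are hypothetical and are not taken from the actual mapping pipeline.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one marker-to-assembly alignment."""
    marker_id: str
    marker_length: int    # length of the marker sequence
    aligned_length: int   # number of marker bases covered by the alignment
    matches: int          # number of identical aligned bases
    gaps: int             # number of gap openings in the alignment


def passes_filter(hit: Hit) -> bool:
    """Keep only ungapped hits with 100% coverage and 100% identity."""
    full_coverage = hit.aligned_length == hit.marker_length
    full_identity = hit.matches == hit.aligned_length
    return hit.gaps == 0 and full_coverage and full_identity


hits = [
    Hit("m1", 100, 100, 100, 0),  # perfect, ungapped: kept
    Hit("m2", 100, 100, 99, 0),   # one mismatch: dropped
    Hit("m3", 100, 98, 98, 0),    # partial coverage: dropped
]
mapped = [h.marker_id for h in hits if passes_filter(h)]
print(mapped)
```

Criteria this strict mean that a marker with even a single mismatch against the assembly is left unmapped.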
Wheat sequence search v2.0 online
Full sequence-based searching of the wheat genome is now available within the standard Ensembl Genomes sequence search facilities (ENA search and BLAST). The previous custom wheat-only search has now been discontinued.
- International Wheat Genome Sequencing Consortium (IWGSC)
- URGI Wheat Portal
- GDEC Portal
- PGSB International Wheat Survey Genome Database
- PGSB 5x 454 Survey Wheat Genome Database
- Triticeae Genomics For Sustainable Agriculture resource page
- Triticeae-CAP at UC Davis
- Triticum monococcum resources from the Jaiswal Lab at Oregon State University
- TREP, the Triticeae Repeat Sequence Database
- TriFLDB, the triticeae full-length CDS database
- ENA study ERP000319: 454 pyrosequencing of the Triticum aestivum (bread wheat) genome to 5X coverage
- ENA study ERP001415: 454 sequencing of Triticum aestivum (bread wheat) cv. Chinese spring cDNA samples from a pool of tissues, from plants under drought stress and from circadian-sampled leaves
- ENA study ERP004505: Analysis of the bread wheat grain transcriptome reveals complex genome interplay in a hexaploid cereal
- Triticum aestivum ESTs at ENA
- Triticum aestivum Unigene cluster sequences at NCBI
- CerealsDB from the Functional Genomics Group at the University of Bristol
- Wheat Hapmap project
- A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome.
2014. Science. 345:1251788.
- A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome.
Chapman JA, Mascher M, Bulu A, Barry K, Georganas E, Session A, Strnadova V, Jenkins J, Sehgal S, Oliker L et al. 2015. Genome Biol. 16:26.
- Structural and functional partitioning of bread wheat chromosome 3B.
Choulet F, Alberti A, Theil S, Glover N, Barbe V, Daron J, Pingault L, Sourdille P, Couloux A, Paux E et al. 2014. Science. 345:1249721.
- Separating homeologs by phasing in the tetraploid wheat transcriptome.
Krasileva KV, Buffalo V, Bailey P, Pearce S, Ayling S, Tabbita F, Soria M, Wang S, International Wheat Genome Sequencing Consortium, Akhunov E et al. 2013. Genome Biol. 14:R66.
- Homoeolog-specific transcriptional bias in allopolyploid wheat.
Akhunova AR, Matniyazov RT, Liang H, Akhunov ED. 2010. BMC Genomics. 11:505.
- Analysis of the bread wheat genome using whole-genome shotgun sequencing.
Brenchley R, Spannagl M, Pfeifer M, Barker GL, D'Amore R, Allen AM, McKenzie N, Kramer M, Kerhornou A, Bolser D et al. 2012. Nature. 491:705-710.
- Genome interplay in the grain transcriptome of hexaploid bread wheat.
Pfeifer M, Kugler KG, Sandve SR, Zhan B, Rudi H, Hvidsten TR, Mayer KF, Olsen OA. 2014. Science. 345:1250091.
- TriFLDB: a database of clustered full-length coding sequences from Triticeae with applications to comparative grass genomics.
Mochida K, Yoshida T, Sakurai T, Ogihara Y, Shinozaki K. 2009. Plant Physiol. 150:1135-1146.
- De Novo Transcriptome Assembly and Analyses of Gene Expression during Photomorphogenesis in Diploid Wheat Triticum monococcum.
Fox SE, Geniza M, Hanumappa M, Naithani S, Sullivan C, Preece J, Tiwari VK, Elser J, Leonard JM, Sage A et al. 2014. PLoS ONE. 9:e96855.
- CerealsDB 2.0: an integrated resource for plant breeders and scientists.
Wilkinson PA, Winfield MO, Barker GL, Allen AM, Burridge A, Coghill JA, Edwards KJ. 2012. BMC Bioinformatics. 13:219.
- Characterization of polyploid wheat genomic diversity using a high-density 90000 single nucleotide polymorphism array.
Wang S, Wong D, Forrest K, Allen A, Chao S, Huang BE, Maccaferri M, Salvi S, Milner SG, Cattivelli L et al. 2014. Plant Biotechnol. J.
- Transcript-specific, single-nucleotide polymorphism discovery and linkage analysis in hexaploid bread wheat (Triticum aestivum L.).
Allen AM, Barker GL, Berry ST, Coghill JA, Gwilliam R, Kirby S, Robinson P, Brenchley RC, D'Amore R, McKenzie N et al. 2011. Plant Biotechnol. J. 9:1086-1099.
- A haplotype map of allohexaploid wheat reveals distinct patterns of selection on homoeologous genomes.
Jordan KW, Wang S, Lun Y, Gardiner LJ, MacLachlan R, Hucl P, Wiebe K, Wong D, Forrest KL, et al. 2015. Genome Biol. 16:48.
General information about this species can be found in Wikipedia.
| Assembly | IWGSC1+popseq, Nov 2014 |
| Golden Path Length | 6,483,288,884 |
| Genebuild method | Imported from IWGSC |
| Non coding genes | 9,993 |
| Small non coding genes | 9,971 |
| Long non coding genes | 14 |
| Misc non coding genes | 8 |
| T. turgidum RNA-seq alignments | 83,160 |
| T. aestivum RNA-seq alignments | 39,237 |