Supplementary material

NetMHCIIpan-2.0: Improved pan-specific HLA-DR predictions using a novel concurrent alignment and weight optimization training procedure

Here you will find the data sets used for training (quantitative peptide binding data) and evaluation (SYFPEITHI ligands and IEDB T cell epitope data) of the NetMHCIIpan-2.0 method.


Quantitative peptide binding data

The quantitative binding data are partitioned into five pairs of files to be used for cross-validation. For instance, the f000 file contains the training data and the c000 file the test data for the first cross-validation partition.

The format for each of the files is

AAAGAEAGKATTEEQ 0.0895297 DRB1_0101
AAAGAEAGKATTEEQ 0.0731308 DRB1_0901
AMLHWSLILPGIKAQ 0.604124 DRB1_0101
EGTKVTFHVEKGSNP 0.0130254 DRB3_0101
IPVFLQEALNIALVA 0 DRB1_0301
CIEYVTLNASQYANC 0.264118 DRB1_0301
CIEYVTLNASQYANC 0.356533 DRB1_0901
LHRVVLLESIAQFGD 0.261746 DRB3_0101
LRKAGKSVVVLNRKT 0 DRB1_0401
KYKFVRIQPGQTFSV 0.606066 DRB1_1501

where the first column gives the peptide, the second column the log50k-transformed binding affinity (i.e. 1 - log50k(aff), where aff is the affinity in nM and log50k is the logarithm to base 50,000), and the last column the HLA-DR allele.
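The three-column format above can be read with a few lines of Python. The following is a minimal sketch rather than part of the NetMHCIIpan-2.0 distribution: the names `BindingRecord` and `parse_binding_lines` are illustrative, and the code assumes the columns are separated by whitespace, as in the example lines.

```python
from typing import Iterable, Iterator, NamedTuple


class BindingRecord(NamedTuple):
    """One line of a binding data file (illustrative type, not from the source)."""
    peptide: str
    score: float  # log50k-transformed binding affinity
    allele: str


def parse_binding_lines(lines: Iterable[str]) -> Iterator[BindingRecord]:
    """Yield one record per non-empty line of 'peptide score allele'."""
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        peptide, score, allele = fields
        yield BindingRecord(peptide, float(score), allele)


# Two lines copied from the format example above.
sample = """AAAGAEAGKATTEEQ 0.0895297 DRB1_0101
IPVFLQEALNIALVA 0 DRB1_0301
"""
records = list(parse_binding_lines(sample.splitlines()))
```

To read a whole file, an open file handle can be passed in place of `sample.splitlines()`, since the function only iterates over lines.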

When classifying peptides into binders and non-binders, for instance to calculate AUC values, a threshold of 500 nM is used. This means that peptides with log50k-transformed binding affinity values greater than 0.426 are classified as binders.
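As a check on the numbers above, the transform and the 500 nM threshold can be sketched in Python. The function names are illustrative. Clamping the score to the range 0 to 1 is an assumption, suggested by the values of exactly 0 in the example data but not stated on this page.

```python
import math

MAX_AFFINITY_NM = 50000.0  # base of the log50k transform


def log50k_transform(aff_nm: float) -> float:
    """Return 1 - log50k(aff) for an affinity given in nM."""
    score = 1.0 - math.log(aff_nm) / math.log(MAX_AFFINITY_NM)
    # Assumption: scores are clamped to [0, 1]; the page does not state this.
    return min(1.0, max(0.0, score))


def score_to_affinity(score: float) -> float:
    """Invert the transform: affinity in nM for a given score."""
    return MAX_AFFINITY_NM ** (1.0 - score)


def is_binder(score: float, threshold_nm: float = 500.0) -> bool:
    """Classify a transformed score using an affinity threshold in nM."""
    return score > log50k_transform(threshold_nm)


threshold_score = log50k_transform(500.0)  # about 0.426
```

With these definitions, the example peptide AMLHWSLILPGIKAQ (score 0.604124 for DRB1_0101) is classified as a binder, while CIEYVTLNASQYANC (score 0.264118 for DRB1_0301) is not.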

f000 (Train data) c000 (Test data)
f001 (Train data) c001 (Test data)
f002 (Train data) c002 (Test data)
f003 (Train data) c003 (Test data)
f004 (Train data) c004 (Test data)
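For a cross-validation loop over the five partitions listed above, the train/test file pairs can be generated from the partition index. This is a minimal sketch: the function name `partition_files` and the directory name `data` are hypothetical, and the files are assumed to have been downloaded under the names shown above.

```python
from pathlib import Path
from typing import List, Tuple


def partition_files(data_dir: str, n_partitions: int = 5) -> List[Tuple[Path, Path]]:
    """Return (train, test) path pairs: (f000, c000), (f001, c001), ..."""
    base = Path(data_dir)
    return [(base / f"f{i:03d}", base / f"c{i:03d}") for i in range(n_partitions)]


# "data" is a hypothetical download directory.
pairs = partition_files("data")
```

Each pair can then be used to train on the first file and evaluate on the second.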

SYFPEITHI data

The SYF data set contains the source proteins of MHC ligands downloaded from the SYFPEITHI database. It contains data for 1164 HLA-DR ligands covering 28 different HLA-DR alleles. The format for each FASTA entry is

>HLA-DRB1_0101 AVDDVQYVDEIASVLTSQ
MKHHHHHHHSDYDIPTTENLYFQGSAAATGPSFWLGNETLKVPLALFALNRQRLCERLRK
NPAVQAGSIVVLQGGEETQRYCTDTGVLFRQESFFHWAFGVTEPGCYGVIDVDTGKSTLF
VPRLPASHATWMGKIHSKEHFKEKYAVDDVQYVDEIASVLTSQKPSVLLTLRGVNTDSGS
VCREASFDGISKFEVNNTILHPEIVECRVFKTDMELEVLRYTNKISSEAHREVMKAVKVG
MKEYELESLFEHYCYSRGGMRHSSYTCICGSGENSAVLHYGHAGAPNDRTIQNGDMCLFD
MGGEYYCFASDITCSFPANGKFTADQKAVYEAVLRSSRAVMGAMKPGVWWPDMHRLADRI
HLEELAHMGILSGSVDAMVQAHLGAVFMPHGLGHFLGIDVHDVGGYPEGVERIDEPGLRS
LRTARHLQPGMVLTVEPGIYFIDHLLDEALADPARASFFNREVLQRFRGFGGVRIEEDVV
VTDSGIELLTCVPRTVEEIEACMAGCDKAFTPFSGPK

The header of each FASTA entry gives the HLA-DR allele (here HLA-DRB1_0101) followed by the HLA-DR ligand; the sequence lines give the source protein.
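Entries in this format can be parsed, and the ligand located within its source protein, with a short Python sketch. The names `LigandEntry` and `parse_ligand_fasta` are illustrative. The code assumes every header consists of exactly two whitespace-separated fields, allele then ligand, as in the example above.

```python
from typing import Iterable, Iterator, List, NamedTuple


class LigandEntry(NamedTuple):
    """One FASTA entry (illustrative type, not from the source)."""
    allele: str
    ligand: str
    protein: str


def _make_entry(header: List[str], chunks: List[str]) -> LigandEntry:
    if len(header) != 2:
        raise ValueError(f"expected '>allele ligand' header, got {header!r}")
    return LigandEntry(header[0], header[1], "".join(chunks))


def parse_ligand_fasta(lines: Iterable[str]) -> Iterator[LigandEntry]:
    """Yield one entry per '>' header, joining the wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_entry(header, chunks)
            header, chunks = line[1:].split(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_entry(header, chunks)


# Excerpt of the example entry above: two of its sequence lines only,
# not the full source protein.
sample = """>HLA-DRB1_0101 AVDDVQYVDEIASVLTSQ
NPAVQAGSIVVLQGGEETQRYCTDTGVLFRQESFFHWAFGVTEPGCYGVIDVDTGKSTLF
VPRLPASHATWMGKIHSKEHFKEKYAVDDVQYVDEIASVLTSQKPSVLLTLRGVNTDSGS
"""
entry = next(parse_ligand_fasta(sample.splitlines()))
start = entry.protein.find(entry.ligand)  # 0-based offset, -1 if absent
```

The same function applies to any file that uses this header layout, since only the header convention is assumed.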

SYFPEITHI dataset

IEDB data

The IEDB data set contains the source proteins of MHC class II T cell epitopes downloaded from the IEDB database. It contains 1325 HLA-DR epitopes covering 42 different HLA-DR alleles. The format for each FASTA entry is

>HLA-DRB1_0101 AETPGCVAYIGISFLDQASQ
MKIRLHTLLAVLTAAPLLLAAAGCGSKPPSGSPETGAGAGTVATTPASSPVTLAETGSTL
LYPLFNLWGPAFHERYPNVTITAQGTGSGAGIAQAAAGTVNIGASDAYLSEGDMAAHKGL
MNIALAISAQQVNYNLPGVSEHLKLNGKVLAAMYQGTIKTWDDPQIAALNPGVNLPGTAV
VPLHRSDGSGDTFLFTQYLSKQDPEGWGKSPGFGTTVDFPAVPGALGENGNGGMVTGCAE
TPGCVAYIGISFLDQASQRGLGEAQLGNSSGNFLLPDAQSIQAAAAGFASKTPANQAISM
IDGPAPDGYPIINYEYAIVNNRQKDAATAQTLQAFLHWAITDGNKASFLDQVHFQPLPPA
VVKLSDALIATISS

The header of each FASTA entry gives the HLA-DR allele (here HLA-DRB1_0101) followed by the HLA-DR epitope; the sequence lines give the source protein.

IEDB dataset

References

Morten Nielsen, Sune Justesen, Ole Lund, Claus Lundegaard, and Soren Buus
NetMHCIIpan-2.0: Improved pan-specific HLA-DR predictions using a novel concurrent alignment and weight optimization training procedure