- 1. Quality Control
-
The repository of high-quality interactions contains only interactions from high-quality high-throughput (HT) studies, and interactions from small-scale studies that have been reported at least twice in the literature.
-
Because there are far fewer HT publications than small-scale studies, we manually inspect each HT study. We ensure that every HT experiment included in HINT validated its screen with orthogonal assays (e.g., co-immunoprecipitation). Experiments that do not validate their screens are not included. Please note that we cannot inspect (much less validate) individual interactions within each study; we can only assess each HT study by its reported validation results.
-
On the other hand, since it is impossible to manually check all small-scale studies, we require two independent publications to report the same interaction for it to be included in our dataset. While some interactions from dedicated small-scale studies have been validated multiple times in the same publication and are of high quality, a significant fraction of interactions from small-scale experiments are not easily reproducible. One main reason is that many small-scale publications started with a large-scale screen (e.g., pull-down mass spectrometry) for their proteins of interest, which often found dozens to hundreds of interactors. The authors then focused on only one or two of these interactions for detailed study. As a result, the remaining interactions contain many false positives. For interactions reported by only one publication, it is impossible to separate the high-quality ones from the rest. Therefore, to ensure high quality, we only include interactions reported by two independent publications in our dataset.
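The inclusion rule described above can be sketched roughly as follows. This is only an illustration, not the actual HINT pipeline: the function name `passes_quality_filter` and the `(pubmed_id, is_validated_ht)` tuple layout are hypothetical, and counting distinct PubMed IDs is used here as a simple stand-in for "independent publications".

```python
from typing import Iterable, Tuple


def passes_quality_filter(reports: Iterable[Tuple[str, bool]]) -> bool:
    """Decide whether an interaction would be kept under the stated rule.

    Each report is a (pubmed_id, is_validated_ht) pair (hypothetical layout):
    is_validated_ht is True when the publication is an HT study that passed
    manual inspection.
    """
    reports = list(reports)

    # One manually inspected, validated HT study is sufficient.
    if any(is_ht for _, is_ht in reports):
        return True

    # Otherwise require at least two distinct small-scale publications.
    # Assumption: distinct PubMed IDs approximate "independent" publications.
    small_scale_pmids = {pmid for pmid, is_ht in reports if not is_ht}
    return len(small_scale_pmids) >= 2
```

For example, an interaction reported only once in a small-scale study would be rejected, while the same interaction reported under two different PubMed IDs would be kept.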
-
Protein post-translational modifications (e.g., ubiquitination, sumoylation) are not considered protein-protein interactions.
- 2. Batch download file format
-
Each row represents one interaction and includes two UniProt IDs, two gene names, two ORF names (if available), two aliases (if available), and a publication list. Each publication entry consists of a PubMed ID, an evidence code, and whether the study is high-throughput. Multiple entries for the same item are separated by the pipe symbol (|).
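A minimal sketch of reading one such row is shown below. Several details are assumptions not stated above and should be checked against an actual download: that columns are tab-separated, that they appear in the order listed, that the three parts of a publication entry are joined by a colon, and that high-throughput entries are flagged with the string `HT`. The names `Publication`, `split_multi`, `parse_publications`, and `parse_row` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Publication:
    pubmed_id: str
    evidence_code: str
    high_throughput: bool


def split_multi(field: str) -> List[str]:
    """Split a pipe-separated field, dropping empty entries."""
    return [item for item in field.split("|") if item]


def parse_publications(field: str, sub_sep: str = ":",
                       ht_flag: str = "HT") -> List[Publication]:
    """Parse the publication list.

    Assumption: each entry looks like '<pmid><sub_sep><code><sub_sep><flag>'.
    """
    pubs = []
    for entry in split_multi(field):
        pmid, code, flag = entry.split(sub_sep)
        pubs.append(Publication(pmid, code, flag == ht_flag))
    return pubs


def parse_row(line: str, sep: str = "\t") -> Dict[str, object]:
    """Parse one interaction row (assumed column order, see lead-in)."""
    cols = line.rstrip("\n").split(sep)
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    return {
        "uniprot": (cols[0], cols[1]),
        "gene": (cols[2], cols[3]),
        "orf": (split_multi(cols[4]), split_multi(cols[5])),    # may be empty
        "alias": (split_multi(cols[6]), split_multi(cols[7])),  # may be empty
        "publications": parse_publications(cols[8]),
    }
```

Optional fields (ORF names, aliases) come back as empty lists when absent, so downstream code does not need to special-case missing values.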