About ClanTox

Toxins are proteins that are present in animal venom. They are extremely diverse in terms of sequence, structure and function. Toxins include several types of proteins, such as ion channel inhibitors, phospholipases and disintegrins.

Although they are diverse, many toxins share a common feature: they are small and extremely stable proteins. One of the major reasons for this stability is a large number of disulfide bridges. An evolutionary explanation for this structural stability is that toxins often need to travel through the bloodstream of the venom recipient in order to reach their intended final target.

Over the years it has become apparent that many toxins have homologs that do not act in venom. Some toxins have even been found to be expressed in non-venomous tissues. We refer to these toxin homologs as toxin-like proteins.

ClanTox is a classifier that, given a protein sequence, tries to predict whether it represents a toxin or a toxin-like protein. The output of ClanTox is reported as a pair of numbers: the mean score and the standard deviation.

To understand what these numbers signify, we must first explain the prediction process (see figure). Sequences given to ClanTox are first transformed into a vector based on various features extracted from the sequence. Next, this vector representation is sent to 10 sub-classifiers.

Each of these sub-classifiers has been trained on a set of ion channel inhibitors along with a random set of non-toxin proteins. As a result, each sub-classifier will give a slightly different prediction.

The labels P1, P2 and P3 are considered positive (i.e. either toxin or toxin-like), while the label N is considered negative (non-toxin). The mean score and standard deviation are simply the mean and standard deviation of all 10 predictions.
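
As an illustration of how the two reported numbers summarize the 10 predictions, here is a minimal Python sketch. The function name and the example scores are hypothetical, and the score scale is made up for the example. The text does not say whether the population or the sample standard deviation is used; the population form is assumed here.

```python
from statistics import mean, pstdev


def summarize_predictions(scores):
    """Return (mean, standard deviation) of the sub-classifier scores.

    Assumption: population standard deviation (pstdev). The source text
    does not specify which form ClanTox reports.
    """
    if not scores:
        raise ValueError("expected at least one sub-classifier score")
    return mean(scores), pstdev(scores)


# Hypothetical scores from 10 sub-classifiers for a single sequence.
example_scores = [0.9, 0.8, 1.0, 0.7, 0.9, 0.8, 1.0, 0.9, 0.8, 0.7]

mean_score, std_dev = summarize_predictions(example_scores)
print(f"mean score = {mean_score:.2f}, standard deviation = {std_dev:.2f}")
```

A small standard deviation means the sub-classifiers largely agree on the sequence, while a large one means their predictions are spread out.
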
For ease of use, we divide the scores into 4 general categories: class P3 ("Toxin-like"), class P2 ("Probably toxin-like"), class P1 ("Possibly toxin-like") and class N ("Probably not toxin-like"). These are general rules of thumb. What does this mean quantitatively? If you apply the classifier to a non-redundant set of ~30000 proteins (all SwissProt proteins <= 150 amino acids), 7% will be classified as class P1 or higher, 3.2% as class P2 or P3, and 1.8% as class P3. For a set of proteins of arbitrary lengths, the percentages are much lower.
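
The percentages quoted above are cumulative, so the share of proteins falling into each individual class can be obtained by subtraction. The short Python sketch below works through that arithmetic. It assumes that "class 1 or higher" means P1, P2 or P3 together, and it treats the set size as exactly 30000 although the text gives it only approximately. The variable names are illustrative.

```python
# Cumulative percentages quoted in the text
# (non-redundant SwissProt proteins of <= 150 amino acids).
at_least_p1 = 7.0  # classified as P1, P2 or P3
at_least_p2 = 3.2  # classified as P2 or P3
at_least_p3 = 1.8  # classified as P3

# Per-class percentages, obtained by subtracting adjacent cumulative values.
per_class = {
    "N": 100.0 - at_least_p1,
    "P1": at_least_p1 - at_least_p2,
    "P2": at_least_p2 - at_least_p3,
    "P3": at_least_p3,
}

set_size = 30000  # approximate size of the set, as stated in the text

for label, pct in per_class.items():
    count = round(set_size * pct / 100)
    print(f"{label}: {pct:.1f}% (roughly {count} proteins)")
```

Under these assumptions, about 93% of the short proteins fall into class N, with the remaining 7% split across P1, P2 and P3.
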