Hi,
BCI2000 uses a "Linear Classifier" in order to discriminate ("classify") between the two cases "P300 present" and "no P300 present".
In order to understand what that means, and how components interact with each other, it is important to understand the distinction between a "Classifier", and "Classifier Training".
A "Classifier", or "Classification algorithm", is an algorithm that sorts data points into "classes", based on a set of parameters. Typically, discrimination is done between two classes only. This is sufficient, because any more complex distinction may be broken up into a series of binary distinctions, or bits. It would be most natural for the two classes to be named "0" and "1", but they often have different labels, such as "-1" and "+1", or "P300" and "no P300".
Proper operation of a classifier very much depends on a proper choice of its parameters. These may be determined from prior information, or from observed data. The latter is called "Classifier Training". Typically, there exist many training methods for a given type of classifier, but not all training methods apply to all classifiers, so the distinction between classifier and training method is not always made explicit in the literature. Still, it is important to make that distinction for the sake of understanding.
Data vectors may be interpreted as points in a high-dimensional space, and often the two classes correspond to the centers of two distinct point clouds in that space. (A notable counterexample would be classes that correspond to concentric shells with different average radius.)
Typically, there is some overlap between the clouds, so it will not be possible to uniquely determine the cloud a single data vector belongs to. Rather, a classifier's result in general will be some measure of evidence in favor of cloud 1, and against cloud 0.
The simplest of all possible classifiers is the "Linear Classifier", which simply consists of taking the scalar product of a data vector with a weight vector of equal length. Parameters of the "Linear Classifier" are the entries of its weight vector, and correspond to b1..bn in your formula.
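As a minimal sketch in Python of what such a classifier computes (the function name, weights and data values are made up for illustration and are not taken from BCI2000):

```python
def linear_classifier(weights, data):
    """Return the scalar product of a weight vector and a data vector."""
    if len(weights) != len(data):
        raise ValueError("weight vector and data vector must have equal length")
    return sum(w * x for w, x in zip(weights, data))


# Hypothetical weights (b1..bn) and data vector, chosen only for illustration.
weights = [0.5, -1.0, 2.0]
data = [2.0, 1.0, 0.5]

# 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5 = 1.0
y = linear_classifier(weights, data)
print(y)
```

The single number y is the classifier's output; everything the classifier "knows" is contained in the weight vector.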
Linear classification is not only the simplest, but also the best possible classification algorithm in all cases where the following two conditions are met:
1) the distinction between classes corresponds to a distance between cloud centers in space,
2) each data point cloud follows a Gaussian ("normal") distribution, with both clouds having the same shape (covariance).
Moreover, a Gaussian distribution is the most reasonable default assumption in the absence of prior information about cloud shape. Thus, linear classification is the appropriate choice whenever one knows that 1) applies, and when there is no evidence against 2). Further, it will often be possible to choose a transformation of data points based on prior information, such that 1) applies, and 2) is met approximately -- in the counterexample above, replacing data points with their log-distance from the common center would be such a transformation.
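The transformation for the concentric-shell counterexample might be sketched as follows in Python. The shell data and the function name are hypothetical; the sketch assumes the common center is known and that no data point lies exactly on it (the logarithm of zero is undefined).

```python
import math


def log_distance(point, center):
    """Map a data point to the logarithm of its Euclidean distance from center."""
    return math.log(math.dist(point, center))


# Hypothetical data: two shells around a common center, radius about 1 and about 10.
center = (0.0, 0.0)
inner_shell = [(1.0, 0.0), (0.0, -1.1), (-0.9, 0.0)]
outer_shell = [(10.0, 0.0), (0.0, 9.0), (-11.0, 0.0)]

inner = [log_distance(p, center) for p in inner_shell]
outer = [log_distance(p, center) for p in outer_shell]

# After the transformation, the two classes differ in their position along a
# single axis, so a linear classifier (here: one weight) can separate them.
print(max(inner) < min(outer))
```

In the original coordinates, both shells share the same center, so no weight vector could tell them apart; after the transformation, condition 1) holds.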
As a result of its favorable properties, linear classification suits a very large number of classification problems.
In the case of P300 classification, after filtering out frequencies below 1 Hz, and in the absence of EEG artifacts, the assumption of Gaussianity will be met for the raw measurement samples. Thus, there exists no classification method that would be able to classify P300 responses better than linear classification does.
In BCI2000, the classifier is used to compute an individual y-value for each stimulus presentation. The P3Speller or StimulusPresentation application then compares y-values, and chooses the stimulus with the largest associated y-value. According to the theory of linear classification, this favors the stimulus for which there is the most evidence of having elicited a P300 response. [In the case of the P3Speller, the choice is a bit more complicated because there is no 1:1 correspondence between presentations, and choices, see:
http://www.bci2000.org/wiki/index.php/P ... _Selection]
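The comparison of y-values can be sketched in Python as below. This is only an illustration of the simple case with a 1:1 correspondence between stimuli and choices; the function name, stimulus labels, weights and data vectors are hypothetical and do not reflect actual BCI2000 code.

```python
def select_stimulus(weights, responses):
    """Compute one y-value per stimulus and return the stimulus with the largest one.

    responses maps a stimulus label to the data vector recorded for it.
    """
    scores = {
        stimulus: sum(w * x for w, x in zip(weights, vector))
        for stimulus, vector in responses.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical weights and per-stimulus data vectors.
weights = [1.0, 2.0]
responses = {
    "A": [0.1, 0.2],
    "B": [0.5, 1.0],
    "C": [0.3, -0.1],
}

best, scores = select_stimulus(weights, responses)
print(best)  # the stimulus with the largest y-value
```

Note that no threshold is involved: only the ordering of the y-values matters for the choice.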
The b in your equation is an offset that is typically chosen such that the result is zero whenever evidence is equally high (or equally low) for both classes. This is not done in BCI2000, because y-values are compared against each other, so the result is not affected by the choice of b. Thus, in BCI2000, it is always zero.
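To see why the offset does not matter when y-values are only compared against each other, here is a small Python check. It assumes your formula has the form y = b1*x1 + ... + bn*xn + b; the data vectors, weights and offsets are made up for illustration.

```python
def y_values(weights, offset, vectors):
    """Compute y = scalar product + offset for each data vector."""
    return [sum(w * x for w, x in zip(weights, v)) + offset for v in vectors]


# Hypothetical weights and data vectors, one vector per stimulus.
weights = [1.0, 2.0]
vectors = [[0.1, 0.2], [0.5, 1.0], [0.3, -0.1]]

winners = set()
for offset in (0.0, -3.0, 7.5):
    ys = y_values(weights, offset, vectors)
    # Index of the stimulus with the largest y-value for this offset.
    winners.add(ys.index(max(ys)))

# The same stimulus wins regardless of the offset.
print(winners)
```

Adding the same constant to every y-value shifts them all by the same amount, which leaves their ordering unchanged.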
For training a linear classifier, multiple methods exist.
The most basic algorithm for training a linear classifier is called "Linear Discriminant Analysis", which uses second-order statistics in order to determine classifier weights.
Thus, the quality of the resulting classifier depends on the quality of the covariance estimate, which in turn depends on the ratio of the squared number of weights to the number of training data points. In practice, for reasons of time and effort, it may be difficult or impossible to obtain the amount of data required for good classifier training when using plain LDA.
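As a rough illustration of plain LDA training, the following Python sketch uses the textbook formula: weights = (inverse of the pooled covariance) times (difference of class means). It is restricted to two dimensions so that the matrix inverse can be written out by hand, and all function names and data are hypothetical; it is not the training code used with BCI2000.

```python
def mean(points):
    """Component-wise mean of a list of 2D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def pooled_covariance(class0, class1):
    """Covariance estimate shared by both classes (2x2 matrix)."""
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for points in (class0, class1):
        m = mean(points)
        for p in points:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    cov[i][j] += d[i] * d[j]
    dof = len(class0) + len(class1) - 2
    return [[c / dof for c in row] for row in cov]


def lda_weights(class0, class1):
    """Weights = inverse pooled covariance times difference of class means."""
    m0, m1 = mean(class0), mean(class1)
    (a, b), (c, d) = pooled_covariance(class0, class1)
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


# Hypothetical training data: class 0 ("no P300") and class 1 ("P300").
class0 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class1 = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (3.0, 3.0)]

w = lda_weights(class0, class1)
print(w)
```

For n weights, the covariance matrix has n*(n+1)/2 independent entries that all need to be estimated from the training data, which is where the dependence on the squared number of weights comes from.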
Thus, training algorithms have been devised which try to make better use of training data than plain LDA. Stepwise LDA (SWLDA) tries to improve the quality of the covariance estimate by reducing the number of non-zero weights to a minimum. Starting with a subset of data dimensions, it determines for each dimension how much adding it to the set, or removing it from the set, would alter the quality of the estimate, and modifies the set accordingly.
Another popular training algorithm is called "Linear Support Vector Machine". In its basic form, it works by placing a plane between the two data clouds such that the margin, i.e. the smallest signed distance of any training data point from that plane, is maximized (taking the distance with reverse sign depending on class). Then, the weight vector is taken to be perpendicular to that plane, and the offset b is chosen such that a point on the plane has y = 0.
In the limit of a large number of Gaussian distributed training data points, the symmetry of the problem is such that the LDA algorithm will choose the line between cloud centers as the direction of the weight vector, whereas the SVM algorithm will choose a separating plane which is perpendicular to that line.
Thus, the weights resulting from SVM training will agree with those resulting from LDA training, up to a scaling factor, which is irrelevant for classification. Under the assumption of Gaussianity, as is made in the case of P300 detection, performance differences between LDA-trained and SVM-trained classifiers will therefore be due to improbable data points (outliers), or EEG artifacts, in the training data, and will vanish as recording quality and/or the amount of training data are increased.
HTH,
Juergen