bci2000.org BBS

Posted: **15 May 2013, 14:47**

Hi, I'm new here. I'm confused about the role of SWLDA in the P3Speller application. What I know about SWLDA is that it builds a linear equation (y = b + b1x1 + b2x2 + ...), where bn is a weight and xn is a brain coordinate, so that during online spelling, coordinates in the brain that respond when a false signal is detected as an ERP are suppressed (low bn), and coordinates where the right signal is detected as an ERP are amplified (high bn). If that is true, what is the role of the y variable? Do you have a clear explanation of this? How does the P300Classifier compute the weights from the .dat file?
Thanks.

Posted: **16 May 2013, 11:54**

Hi,

BCI2000 uses a "Linear Classifier" in order to discriminate ("classify") between the two cases "P300 present" and "no P300 present".

In order to understand what that means, and how components interact with each other, it is important to understand the distinction between a "Classifier", and "Classifier Training".
A "Classifier", or "Classification algorithm" is an algorithm that sorts data points into "classes", based on a set of parameters. Typically, discrimination is done between two classes only, which is also sufficient, as any more complex distinction may be broken up in a series of binary distinctions, or bits. It would be most natural for the two classes to be named "0" and "1", but they often have different labels, such as "-1" and "+1", or "P300" and "no P300".

Proper operation of a classifier very much depends on proper choice of its parameters. These may be determined from prior information, or from observed data. The latter is called "Classifier Training". Typically, there exist many training methods for a given type of classifier, but not all training methods apply to all classifiers, so the distinction is not always made explicitly in the literature. Still, making it is important for the sake of understanding.

Data vectors may be interpreted as points in a high-dimensional space, and often the two classes correspond to the centers of two distinct point clouds in that space. (A notable counterexample would be classes that correspond to concentric shells with different average radius.)
Typically, there is some overlap between the clouds, so it will not be possible to uniquely determine the cloud a single data vector belongs to. Rather, a classifier's result in general will be some measure of evidence in favor of cloud 1, and against cloud 0.

The simplest of all possible classifiers is the "Linear Classifier", which simply consists of the scalar (dot) product of a data vector with a weight vector of equal length. Parameters of the "Linear Classifier" are the entries of its weight vector, and correspond to b1..bn in your formula.
Linear classification is not only the simplest, but also the best possible classification algorithm in all cases where the following two conditions are met:
1) the distinction between classes corresponds to a distance between cloud centers in space,
2) each data point cloud follows a Gaussian ("normal") distribution, with the same covariance for both clouds.
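
As a minimal sketch of what "scalar product of a data vector with a weight vector" means, here is a plain-Python version. This is not BCI2000 code; the function name and the numbers are made up for illustration.

```python
def linear_classifier(weights, x, offset=0.0):
    """Return y = offset + w1*x1 + w2*x2 + ... + wn*xn."""
    if len(weights) != len(x):
        raise ValueError("weight vector and data vector must have equal length")
    return offset + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical weight vector and data vector with three entries each.
weights = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]
y = linear_classifier(weights, x)  # 0.5*2.0 - 1.0*1.0 + 2.0*0.5
```

A larger y is read as more evidence for class 1, a smaller y as more evidence for class 0.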

Moreover, a Gaussian distribution is the most reasonable default assumption in the absence of prior information about cloud shape. Thus, linear classification is the appropriate choice whenever one knows that 1) applies, and when there is no evidence against 2). Further, it will often be possible to choose a transformation of data points based on prior information, such that 1) applies, and 2) is met approximately -- in the counterexample above, replacing data points with their log-distance from the common center would be such a transformation.

As a result of its favorable properties, linear classification suits a very large number of classification problems.

In the case of P300 classification, after filtering out frequencies below 1 Hz, and in the absence of EEG artifacts, the assumption of Gaussianity will be met for the raw measurement samples. Thus, under these assumptions, there exists no classification method that would be able to classify P300 responses better than linear classification does.

In BCI2000, the classifier is used to compute an individual y-value for each stimulus presentation. The P3Speller or StimulusPresentation application then compares y-values, and chooses the stimulus with the largest associated y-value. According to the theory of linear classification, this favors the stimulus which has most evidence for it to have elicited a P300 response. [In the case of the P3Speller, the choice is a bit more complicated because there is no 1:1 correspondence between presentations, and choices, see:
http://www.bci2000.org/wiki/index.php/P ... _Selection]

The b in your equation is an offset that is typically chosen such that the result is zero whenever evidence is equally high (or equally low) for both classes. This is not done in BCI2000, because y-values are compared against each other, so the result is not affected by the choice of b. Thus, in BCI2000, it is always zero.
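
To illustrate why the offset does not matter when y-values are only compared against each other, here is a small sketch in plain Python. The function names, weights, and epochs are hypothetical and only serve the illustration.

```python
def dot(weights, x):
    """Scalar product of two equally long sequences."""
    return sum(w * xi for w, xi in zip(weights, x))


def pick_stimulus(weights, epochs, offset=0.0):
    """Return the index of the epoch with the largest y = offset + w.x."""
    scores = [offset + dot(weights, x) for x in epochs]
    return max(range(len(scores)), key=scores.__getitem__)


# Three hypothetical stimulus responses with two features each.
weights = [1.0, 2.0]
epochs = [[0.1, 0.2], [1.0, 0.5], [0.3, -0.4]]

# The same stimulus wins regardless of the offset, because the offset
# shifts all y-values by the same amount.
winner_without_offset = pick_stimulus(weights, epochs, offset=0.0)
winner_with_offset = pick_stimulus(weights, epochs, offset=123.4)
```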

Multiple methods exist for training a linear classifier.
The most basic algorithm for training a linear classifier is called "Linear Discriminant Analysis" (LDA), which uses second-order statistics (class means and the data covariance) in order to determine classifier weights.
Thus, the quality of the resulting classifier depends on the quality of the covariance estimate, which in turn depends on the ratio of the squared number of weights to the number of training data points. In practice, for reasons of time and effort, it may be difficult or impossible to obtain the amount of data required for a good classifier training when using plain LDA.
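
For a feel of how LDA turns second-order statistics into weights, here is a sketch for the two-dimensional case only, using the textbook formula w = S_w^-1 (m1 - m0) with the pooled within-class scatter matrix S_w inverted by hand. It is an illustration under these assumptions, not what P300Classifier does internally; all names and data are made up.

```python
def mean_2d(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def lda_weights_2d(class0, class1):
    """Return LDA weights w = S_w^-1 (m1 - m0) for two 2-D point clouds."""
    m0, m1 = mean_2d(class0), mean_2d(class1)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for points, m in ((class0, m0), (class1, m1)):
        for px, py in points:
            dx, dy = px - m[0], py - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("scatter matrix is singular; more data is needed")
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    # Inverse of a symmetric 2x2 matrix, applied to the mean difference.
    return [(syy * d0 - sxy * d1) / det, (-sxy * d0 + sxx * d1) / det]


# Two hypothetical clouds that differ only along the first dimension.
class0 = [(0, 0), (1, 0), (0, 1), (1, 1)]
class1 = [(3, 0), (4, 0), (3, 1), (4, 1)]
w = lda_weights_2d(class0, class1)
```

With n weights, the scatter matrix has n*n entries to estimate, which is why the amount of training data needed grows quickly with the number of weights.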

Thus, training algorithms have been devised which try to make better use of training data than plain LDA. Stepwise LDA (SWLDA) tries to improve the quality of the covariance estimate by reducing the number of non-zero weights to a minimum. Starting with a subset of data dimensions, it determines for each dimension how much adding it to, or removing it from, the set would alter the quality of the estimate, and modifies the set accordingly.

Another popular training algorithm is called "Linear Support Vector Machine" (SVM), and works by placing a plane between the two data clouds such that the margin, i.e. the distance from that plane to the closest training data points of either class, is maximized. Then, the weight vector is taken to be perpendicular to that plane, and the offset b is chosen such that a point on the plane has y = 0.

In the limit of a large number of Gaussian distributed training data points, the symmetry of the problem is such that the LDA algorithm will choose the line between cloud centers as the direction of the weight vector, whereas the SVM algorithm will choose a separating plane which is perpendicular to that line.
Thus, the weights resulting from SVM training will agree with those resulting from LDA training, up to a scaling factor, which is irrelevant for classification. Under the assumption of Gaussianity, as it is taken in case of P300 detection, performance differences between LDA-trained vs SVM-trained classifiers will therefore be due to improbable data points (outliers), or EEG artifacts, in the training data, and will vanish as recording quality and/or the amount of training data are increased.

HTH,
Juergen

Posted: **26 May 2013, 15:22**

Many thanks for the reply, but I am still confused about the role.

This is what I have so far:
- We record one training session (a spelling run) and get a .dat file; call it example.dat.
- We run P300Classifier on example.dat and get one parameter file; call it example.prm.
- When I open it, it contains a matrix (in sparse representation) with 4 columns: input channel, input element (bin), output channel, and weight.

So, from my equation (y = b1x1+b2x2+...+bnxn), b1 is the weight, x1 is the input channel, and y is the output channel. Is that true?

Something I explored is one of the P300Classifier parameters, Max Model Features.
When I set it to 1, P300Classifier produces a prm file that contains seven rows with the same input channel, the same output channel, and the same weight, but different input elements (bins). So what is the role of the input element (bin)?

And why does P300Classifier generate seven consecutive input elements (bins)?

When I set Max Model Features to 2, P300Classifier produces a prm file that contains fourteen rows, and so on, with the number of rows increasing. But in some cases the number of rows drops (by seven input elements), and eventually the prm file stops changing even when Max Model Features is increased further. What does that mean? (I think this is SWLDA at work.)

SWLDA is an iterative procedure that adds the most significant predictor variables, provided they have a p-value < 0.10 (forward regression), and then removes the least significant predictor variables that have a p-value > 0.15. I really don't understand what that means, or how this algorithm relates to P300Classifier. When SWLDA is performed, weights are generated. Does that mean SWLDA builds a filter, so that coordinates that do not show a P300 are suppressed and coordinates that show a P300 are amplified?

For example:

max model features : 1

1 50 1 -10
1 51 1 -10
1 52 1 -10
1 53 1 -10
1 54 1 -10
1 55 1 -10
1 56 1 -10
---------------------------------------
max model features : 2

1 50 1 -10
1 51 1 -10
1 52 1 -10
1 53 1 -10
1 54 1 -10
1 55 1 -10
1 56 1 -10
1 71 1 7.90624
1 72 1 7.90624
1 73 1 7.90624
1 74 1 7.90624
1 75 1 7.90624
1 76 1 7.90624
1 77 1 7.90624
---------------------------------------
max model features : 3

1 64 1 2.01708
1 65 1 2.01708
1 66 1 2.01708
1 67 1 2.01708
1 68 1 2.01708
1 69 1 2.01708
1 70 1 2.01708
2 50 1 -10
2 51 1 -10
2 52 1 -10
2 53 1 -10
2 54 1 -10
2 55 1 -10
2 56 1 -10
2 71 1 5.68911
2 72 1 5.68911
2 73 1 5.68911
2 74 1 5.68911
2 75 1 5.68911
2 76 1 5.68911
2 77 1 5.68911
---------------------------------

What is the role of the weight?
What does -10 mean?
Why do the weights of input elements (bins) 71-77 decrease (from 7.90624 to 5.68911)?

Really sorry for the long question, but I have not found a clear explanation in the many references I checked.

Is there any correlation between SWLDA and R-squared?

Regards,
Antonius

Posted: **26 May 2013, 22:37**

I think you're missing one key detail.

The P3Speller does not classify continuous data. It classifies segments of data. If your segment is about 800 ms long and you have a sampling rate of 256 samples per second then each event related potential (ERP) will have ~204 samples. If you have 16 channels then you will have a total of 3264 features per ERP. SWLDA will operate on those 3264 features to determine which ones are useful in classifying an ERP as attended vs non-attended.

Usually we downsample the ERP before classification so we have something more like 12 samples * 16 channels = 192 features.
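
The arithmetic above can be checked with a few lines of Python. The helper name is made up, and the decimation factor of 17 is only an assumption chosen so that 204 samples reduce to 12 per channel; the actual factor depends on your configuration.

```python
def n_features(epoch_ms, sampling_rate_hz, n_channels, decimation=1):
    """Number of classifier features for one ERP segment."""
    # Whole samples that fit into the segment: 800 ms at 256 Hz gives 204.
    samples = (epoch_ms * sampling_rate_hz) // 1000
    # Samples kept per channel after downsampling (assumed integer factor).
    samples_per_channel = samples // decimation
    return samples_per_channel * n_channels


full_resolution = n_features(800, 256, 16)               # 204 * 16
downsampled = n_features(800, 256, 16, decimation=17)    # 12 * 16
```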

So, when you asked

"So, from my equation (y = b1x1+b2x2+...+bnxn), b1 is the weight, x1 is the input channel, and y is the output channel. Is that true?"

No, that is not true. x1 is the input channel at a specific time point. x2 might be the same channel at a different time point. Thus you get a y-value for each stimulus (i.e., row or column flash) but you do not get a continuous y-value that is linearly dependent on the continuous input. Then, you average your y-values for column A, for column B, ..., for row E, and for row F. Find the column and row with the largest y-value and their intersection is the attended letter.
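
Here is a rough sketch of that procedure in plain Python, assuming classifier rows in the four-column form quoted earlier (input channel, input element, output channel, weight) with 1-based indices, and epochs stored as epoch[channel][element]. The function names, the "row..."/"col..." labels, and the data are made up, the output channel column is ignored for simplicity, and this is not the actual BCI2000 implementation.

```python
from collections import defaultdict


def score_epoch(classifier_rows, epoch):
    """Compute y for one segment: sum of weight * sample at (channel, element)."""
    y = 0.0
    for in_channel, in_element, _out_channel, weight in classifier_rows:
        y += weight * epoch[in_channel - 1][in_element - 1]
    return y


def select_cell(classifier_rows, presentations):
    """Average y per stimulus label, then pick the best row and best column.

    presentations is a list of (label, epoch); labels start with "row" or "col".
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, epoch in presentations:
        sums[label] += score_epoch(classifier_rows, epoch)
        counts[label] += 1
    averages = {label: sums[label] / counts[label] for label in sums}
    best_row = max((l for l in averages if l.startswith("row")), key=averages.get)
    best_col = max((l for l in averages if l.startswith("col")), key=averages.get)
    return best_row, best_col


# Hypothetical classifier with two non-zero weights, and 2-channel, 2-sample epochs.
classifier_rows = [(1, 2, 1, 1.0), (2, 1, 1, -0.5)]
presentations = [
    ("rowA", [[0.0, 1.0], [0.0, 0.0]]),
    ("rowB", [[0.0, 0.0], [2.0, 0.0]]),
    ("col1", [[0.0, 0.2], [0.0, 0.0]]),
    ("col2", [[0.0, 3.0], [0.0, 0.0]]),
]
selected = select_cell(classifier_rows, presentations)
```

The attended letter is then the one at the intersection of the selected row and column.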

The way the P300Classifier works is largely based on Dean Krusienski's work. I think this is the paper:
http://www.ncbi.nlm.nih.gov/pubmed/17124334

I also like a recent paper by Jason Farquhar and Jez Hill that might be seen as a follow-up to Dean's work:
http://www.ncbi.nlm.nih.gov/pubmed/23250668

If you like Jason and Jez's approach but you don't quite understand regularized linear regression then I suggest you take a look at this free online course about Machine Learning:
https://www.coursera.org/course/ml

Posted: **27 May 2013, 10:36**

"Is there any correlation between SWLDA and R-squared?"

The R-squared, or "Determination Coefficient", is the percentage of total data variance that is explained by the difference between classes. In the usual derivation, LDA is described to maximize the ratio of "between-class-scatter" S_b to "within-class-scatter" S_w, which is also called "Fisher Ratio". Now, total data variance is the sum of these two scatter values, and the explained variance is identical to S_b, so the R-squared is equal to S_b/(S_b+S_w). Using a little algebra, it is easy to see that the R-squared will be maximized at the same time as the Fisher ratio. Thus, LDA is an algorithm to determine the linear combination of data dimensions for which the R-squared is maximized.
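
The "little algebra" can be spelled out as follows, writing F for the Fisher ratio S_b/S_w (the symbol F is introduced here only for convenience):

```latex
\[
F = \frac{S_b}{S_w}, \qquad
R^2 = \frac{S_b}{S_b + S_w}
    = \frac{S_b / S_w}{S_b / S_w + 1}
    = \frac{F}{1 + F}, \qquad
\frac{dR^2}{dF} = \frac{1}{(1 + F)^2} > 0 .
\]
```

Since R-squared is a strictly increasing function of F, any weight vector that maximizes one also maximizes the other.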

LDA makes the implicit assumption that the maximum R-squared over the training data will also be the expected maximum R-squared over the set of possible unseen data.
In the absence of prior information about the set of unseen data, one cannot do better than that, so no algorithm will be able to outperform plain LDA in that case.

As soon as prior information is available, it will be possible to introduce a meaningful distinction between the set of training data, and the set of expected unseen data, and to ignore all information in the training data that does not fit that prior information. Thus, for a limited amount of training data, the training result will then be better than that of plain LDA, because the estimated R-squared will be less noisy, and thus its maximum over the training data will, on average, be closer to its maximum over all possible input data.

Compared to plain LDA, SWLDA adds the assumption that only a subset of data dimensions contains useful information, and exploits that assumption in order to improve the ratio of data points to classifier parameters.

In a P300 application, the SWLDA assumption will be true because the classifier sees the entire epoch of around 1s, whereas the P300 itself will be present during about 1/4 of that time. In other words, 3/4 of data point dimensions will not carry information about the presence of a P300 at all. Thus, using SWLDA rather than plain LDA will improve the result in a similar way as if the amount of training data had been increased by a factor of 4.

"Regularization" is a more general concept of introducing prior information into the training algorithm. Basically, the idea is to encode prior information into a mathematical formula such that the formula's result is an estimate of the plausibility of a given training result. Then, the formula is added to the target function, which gives more plausible results an advantage over less plausible ones. E.g., in order to encode the SWLDA assumption into a mathematical formula, one might count the number of weights exceeding some relative threshold, and subtract it from the R-squared prior to maximization. In general, such ad-hoc regularization will not work well, however, and it will be necessary to use additional prior information in the form of a theoretical framework. If the resulting target function is a meaningful expression in such a framework, chances are good to obtain an actual improvement compared to the non-regularized algorithm.

HTH,
Juergen

Posted: **07 Jul 2014, 08:44**

Hello

I am confused about using SWLDA as a classifier for my problem. I am using it on speech signals of different letters of the alphabet.

For example, we recorded 3 letters:
Alphabet A
Alphabet B
Alphabet C

After feature extraction, we have an 8x6 matrix of training data and an 8x6 matrix of testing data.
Then I used SWLDA as the classifier: SWLDA=stepwisefit(train',Truelabel,'penter',.00001,'premove',0.00005,'display','on');
where train is the 8x6 training data and Truelabel is a 6x1 vector like [1 1 2 2 3 3].

The output SWLDA is an 8x1 vector, which is then multiplied by the 8x6 testing data: hat=SWLDA'*test .
Now hat is 1x6. So how can I find the accuracy for each of A, B, and C?

Thank you

Posted: **07 Jul 2014, 10:19**

Does this have anything to do with P300? My answer below assumes that it does not. If you are using the P300 then please tell me because my answer would be quite different. Also, please explain your task in more detail because I do not understand how you can use the P300 for what you described.

stepwisefit is a linear regression technique. It really only applies when y is a continuous variable or when y is binary (0's and 1's, or -1's and 1's). It doesn't make sense to do a linear regression when you have 3 classes: is Alphabet C somehow 3 times greater than Alphabet A but only 1.5 times greater than Alphabet B?

You need a multi-class LDA.
This is from a paragraph on the LDA Wikipedia page:

If classification is required, instead of dimension reduction, there are a number of alternative techniques available. For instance, the classes may be partitioned, and a standard Fisher discriminant or LDA used to classify each partition. A common example of this is "one against the rest" where the points from one class are put in one group, and everything else in the other, and then LDA applied. This will result in C classifiers, whose results are combined. Another common method is pairwise classification, where a new classifier is created for each pair of classes (giving C(C − 1)/2 classifiers in total), with the individual classifiers combined to produce a final classification.

Searching for 'multiclass lda' on the MATLAB File Exchange turns up a few potential solutions.
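
To show the "one against the rest" idea and the per-class accuracy asked about, here is a sketch in plain Python. To keep it short, each binary classifier is a simplified linear discriminant (difference of class means with the threshold at the midpoint, no covariance whitening), which is an assumption made for this example and not full LDA or SWLDA. All names and data are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_one_vs_rest(samples, labels):
    """Train one simplified linear discriminant per class: class c vs. all others."""
    classifiers = {}
    for c in sorted(set(labels)):
        inside = [s for s, l in zip(samples, labels) if l == c]
        outside = [s for s, l in zip(samples, labels) if l != c]
        m_in, m_out = mean_vector(inside), mean_vector(outside)
        w = [a - b for a, b in zip(m_in, m_out)]
        midpoint = [(a + b) / 2 for a, b in zip(m_in, m_out)]
        classifiers[c] = (w, -dot(w, midpoint))  # offset puts y = 0 at the midpoint
    return classifiers


def predict(classifiers, x):
    """Pick the class whose one-vs-rest classifier gives the largest y."""
    return max(classifiers, key=lambda c: dot(classifiers[c][0], x) + classifiers[c][1])


def accuracy_per_class(classifiers, samples, labels):
    """Fraction of correctly classified samples, separately for each true class."""
    result = {}
    for c in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == c]
        hits = sum(1 for s in members if predict(classifiers, s) == c)
        result[c] = hits / len(members)
    return result


# Hypothetical 2-feature data: classes 1, 2, 3 stand for Alphabet A, B, C.
train = [[0, 0], [1, 0], [10, 0], [10, 1], [0, 10], [1, 10]]
train_labels = [1, 1, 2, 2, 3, 3]
test = [[0.5, 0.5], [9, 1], [1, 9]]
test_labels = [1, 2, 3]

classifiers = train_one_vs_rest(train, train_labels)
accuracy = accuracy_per_class(classifiers, test, test_labels)
```

Accuracy should be computed on the testing data, not on the training data, and with only two samples per class the estimate will be very coarse.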

SWLDA Role in P300
