Hi, I'm new here. I'm confused about the role of SWLDA in the P3Speller application. As far as I know, SWLDA produces a linear equation (y = b + b1x1 + b2x2 + ...), where bn is a weight and xn is a brain coordinate, so that during online spelling, coordinates in the brain that respond when a false signal is detected as an ERP are suppressed (low bn), and coordinates that respond when the right signal is detected as an ERP are amplified (high bn). If that is true, what is the role of the y variable? Do you have a clear explanation of this? How does the P300Classifier derive the weights from the .dat file?
Thanks.
SWLDA Role in P300
Re: SWLDA Role in P300
Hi,
BCI2000 uses a "Linear Classifier" in order to discriminate ("classify") between the two cases "P300 present" and "no P300 present".
In order to understand what that means, and how components interact with each other, it is important to understand the distinction between a "Classifier", and "Classifier Training".
A "Classifier", or "Classification algorithm", is an algorithm that sorts data points into "classes", based on a set of parameters. Typically, discrimination is done between two classes only, which is also sufficient, as any more complex distinction may be broken up into a series of binary distinctions, or bits. It would be most natural for the two classes to be named "0" and "1", but they often have different labels, such as "-1" and "+1", or "P300" and "no P300".
Proper operation of a classifier very much depends on proper choice of its parameters. These may be determined from prior information, or from observed data. The latter is called "Classifier Training". Typically, there exist many training methods for a given type of classifier, but not all training methods apply to all classifiers, so the distinction is not always explicitly made in the literature. Still, making it is important for the sake of understanding.
Data vectors may be interpreted as points in a high-dimensional space, and often the two classes correspond to two distinct point clouds in that space, each with its own center. (A notable counterexample would be classes that correspond to concentric shells with different average radius.)
Typically, there is some overlap between the clouds, so it will not be possible to uniquely determine the cloud a single data vector belongs to. Rather, a classifier's result in general will be some measure of evidence in favor of cloud 1, and against cloud 0.
The simplest of all possible classifiers is the "Linear Classifier", which simply consists of computing the scalar (dot) product of a data vector with a weight vector of equal length. Parameters of the "Linear Classifier" are the entries of its weight vector, and correspond to b1..bn in your formula.
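To make the formula concrete, here is a minimal sketch of a linear classifier in Python. The function name, the weights, and the data values are made up for illustration; this is not BCI2000 code.

```python
def linear_classifier(weights, features, offset=0.0):
    """Return y = offset + sum(w_i * x_i) for a single data vector."""
    if len(weights) != len(features):
        raise ValueError("weight and data vectors must have equal length")
    return offset + sum(w * x for w, x in zip(weights, features))

# Hypothetical 4-dimensional example: two weights are zero, so the
# corresponding data dimensions do not contribute to y at all.
weights = [0.0, 2.0, -1.0, 0.0]
epoch = [5.0, 1.5, 0.5, -3.0]
y = linear_classifier(weights, epoch)
print(y)  # 2.0 * 1.5 - 1.0 * 0.5 = 2.5
```

The result y is a single number per data vector: a measure of evidence for one class over the other, not a filtered copy of the signal.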
Linear classification is not only the simplest, but also the best possible classification algorithm in all cases where the following two conditions are met:
1) the distinction between classes corresponds to a distance between cloud centers in space,
2) each data point cloud follows a Gaussian ("normal") distribution, with both clouds sharing the same covariance.
Moreover, a Gaussian distribution is the most reasonable default assumption in the absence of prior information about cloud shape. Thus, linear classification is the appropriate choice whenever one knows that 1) applies, and when there is no evidence against 2). Further, it will often be possible to choose a transformation of data points based on prior information, such that 1) applies, and 2) is met approximately. In the counterexample above, replacing data points with their log-distance from the common center would be such a transformation.
As a result of its favorable properties, linear classification suits a very large number of classification problems.
In case of P300 classification, after filtering out frequencies below 1 Hz, and in the absence of EEG artifacts, the assumption of Gaussianity will be met for the raw measurement samples. Thus, there exists no classification method that would be able to classify P300 responses better than linear classification does.
In BCI2000, the classifier is used to compute an individual y-value for each stimulus presentation. The P3Speller or StimulusPresentation application then compares y-values, and chooses the stimulus with the largest associated y-value. According to the theory of linear classification, this favors the stimulus which has the most evidence for having elicited a P300 response. [In the case of the P3Speller, the choice is a bit more complicated because there is no 1:1 correspondence between presentations and choices, see:
http://www.bci2000.org/wiki/index.php/P ... _Selection]
The b in your equation is an offset that is typically chosen such that the result is zero whenever evidence is equally high (or equally low) for both classes. This is not done in BCI2000, because y-values are compared against each other, so the result is not affected by the choice of b. Thus, in BCI2000, it is always zero.
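A small sketch of why the offset does not matter when y-values are only compared against each other. The scores and the function name are hypothetical.

```python
def pick_stimulus(y_values):
    """Return the index of the stimulus with the largest y-value."""
    return max(range(len(y_values)), key=lambda i: y_values[i])

# Hypothetical y-values for four stimuli, computed with offset b = 0.
y_values = [0.3, 1.7, -0.4, 0.9]

# The same y-values as they would look with an offset b = 5.
shifted = [y + 5.0 for y in y_values]

# Adding the same constant to every y-value cannot change which one is largest.
print(pick_stimulus(y_values), pick_stimulus(shifted))
```

Both calls select the same stimulus, which is why b can be left at zero.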
For training a linear classifier, multiple methods exist.
The most basic algorithm for training a linear classifier is called "Linear Discriminant Analysis" (LDA), which uses second-order statistics, i.e., estimates of the data means and covariances, in order to determine classifier weights.
Thus, the quality of the resulting classifier depends on the quality of the covariance estimate, which in turn depends on the ratio of the squared number of weights to the number of training data points. In practice, for reasons of time and effort, it may be difficult or impossible to obtain the amount of data required for good classifier training when using plain LDA.
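As an illustration of plain LDA training, here is a minimal sketch for two-dimensional data, using the textbook formula w = S_w^{-1} (m1 - m0), where S_w is the pooled within-class scatter matrix (proportional to the covariance estimate). All names and data points are invented for the example; the actual P300Classifier implementation is not shown here and may differ.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def train_lda(class0, class1):
    """Return LDA weights w = S_w^{-1} (m1 - m0) for two 2-D point clouds."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("singular scatter matrix: more training data needed")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]

# Hypothetical clouds that differ only along the first dimension.
no_p300 = [(0, 0), (1, 0), (0, 1), (1, 1)]
p300 = [(3, 0), (4, 0), (3, 1), (4, 1)]
w = train_lda(no_p300, p300)
print(w)  # only the first weight is non-zero
```

With few training points relative to the number of dimensions, the scatter matrix becomes noisy or singular, which is the practical problem described above.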
Thus, training algorithms have been devised which try to make better use of training data than plain LDA. Stepwise LDA (SWLDA) tries to improve the quality of the covariance estimate by reducing the number of non-zero weights to a minimum. Starting with a subset of data dimensions, it determines for each dimension how much including/removing it into/from the set would alter the quality of the estimate, and modifies the set accordingly.
Another popular training algorithm is called "Linear Support Vector Machine" (SVM), and works by spanning a plane between the two data clouds such that its distance to the nearest training data points of either class (the margin) is maximized. Then, the weight vector is taken to be perpendicular to that plane, and the offset b is chosen such that a point on the plane has y = 0.
In the limit of a large number of Gaussian distributed training data points, the symmetry of the problem is such that the LDA algorithm will choose the line between cloud centers as the direction of the weight vector, whereas the SVM algorithm will choose a separating plane which is perpendicular to that line.
Thus, the weights resulting from SVM training will agree with those resulting from LDA training, up to a scaling factor, which is irrelevant for classification. Under the assumption of Gaussianity, as it is made in the case of P300 detection, performance differences between LDA-trained and SVM-trained classifiers will therefore be due to improbable data points (outliers), or EEG artifacts, in the training data, and will vanish as recording quality and/or the amount of training data are increased.
HTH,
Juergen
Re: SWLDA Role in P300
Many thanks for the reply, but I am still confused about the role.
I concluded that:
- We record one training data set (a spelling run) and get a .dat file; call it example.dat.
- We run P300Classifier on example.dat and get one parameter file; call it example.prm.
- When I opened it, it contained a matrix (in sparse representation) with 4 columns: input channel, input element (bin), output channel, and weight.
So, from my equation (y = b1x1+b2x2+...+bnxn), b1 is the weight, x1 is the input channel, and y is the output channel. Is that true?
Something I explored is one of the P300Classifier parameters, Max Model Features.
When I set it to 1, P300Classifier produces a prm file that contains seven rows with the same input channel, the same output channel, and the same weight, but different input elements (bins). So, what is the role of the input element (bin)?
And why does P300Classifier generate seven consecutive input elements (bins)?
When I set Max Model Features to 2, P300Classifier produces a prm file that contains fourteen rows, and so on with further increases. In some cases, though, there is a reduction (seven input elements (bins)), and eventually the prm file stays the same even when Max Model Features is increased further. What does that mean? (I think this is SWLDA being performed.)
SWLDA is an iteration that includes the most significant predictor variables that have a p-value < 0.10 (forward regression), and after that removes the least significant predictor variables that have a p-value > 0.15. I really don't know what that means, or what the connection is between this algorithm and P300Classifier. When SWLDA is performed, weights are generated. Does that mean SWLDA makes a filter, so that the coordinates that do not show a P300 are suppressed and the coordinates that show a P300 are amplified?
For example:
max model features : 1
1 50 1 10
1 51 1 10
1 52 1 10
1 53 1 10
1 54 1 10
1 55 1 10
1 56 1 10

max model features : 2
1 50 1 10
1 51 1 10
1 52 1 10
1 53 1 10
1 54 1 10
1 55 1 10
1 56 1 10
1 71 1 7.90624
1 72 1 7.90624
1 73 1 7.90624
1 74 1 7.90624
1 75 1 7.90624
1 76 1 7.90624
1 77 1 7.90624

max model features : 3
1 64 1 2.01708
1 65 1 2.01708
1 66 1 2.01708
1 67 1 2.01708
1 68 1 2.01708
1 69 1 2.01708
1 70 1 2.01708
2 50 1 10
2 51 1 10
2 52 1 10
2 53 1 10
2 54 1 10
2 55 1 10
2 56 1 10
2 71 1 5.68911
2 72 1 5.68911
2 73 1 5.68911
2 74 1 5.68911
2 75 1 5.68911
2 76 1 5.68911
2 77 1 5.68911

What is the role of the weight?
What does 10 mean?
Why are the weights of input elements (bins) 71-77 reduced (from 7.90624 to 5.68911)?
Sorry for the long question, but I haven't found a clear explanation in the many references I checked.
Is there any correlation between SWLDA and R-squared?
Regards,
Antonius
Re: SWLDA Role in P300
I think you're missing one key detail.
The P300Speller does not classify continuous data. It classifies segments of data. If your segment is about 800 ms long and you have a sampling rate of 256 samples per second, then each event-related potential (ERP) will have ~204 samples. If you have 16 channels, then you will have a total of 3264 features per ERP. SWLDA will operate on those 3264 features to determine which ones are useful in classifying an ERP as attended vs non-attended.
Usually we downsample the ERP before classification, so we have something more like 12 samples * 16 channels = 192 features.
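The arithmetic above can be written out as a short sketch. The variable names are mine; the numbers are the ones from the paragraph.

```python
segment_ms = 800          # length of one ERP segment
sampling_rate_hz = 256    # samples per second
n_channels = 16

# 0.8 s * 256 samples/s = 204.8, i.e. 204 whole samples per channel.
samples_per_erp = segment_ms * sampling_rate_hz // 1000

# Every (channel, time point) pair is one feature, i.e. one x_n in the formula.
features_full = samples_per_erp * n_channels

# After downsampling to 12 samples per channel.
features_downsampled = 12 * n_channels

print(samples_per_erp, features_full, features_downsampled)
```

So each x_n in the linear equation is one channel at one time point within the segment, and SWLDA decides which of these features receive a non-zero weight.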
So, when you asked:
"so, from my equation (y = b1x1+b2x2+...+bnxn), b1 is the weight, x1 is the input channel, and y is output channel. Is that true?"
No, that is not true. x1 is the input channel at a specific time point. x2 might be the same channel at a different time point. Thus you get a y-value for each stimulus (i.e., row or column flash), but you do not get a continuous y-value that is linearly dependent on the continuous input. Then, you average your y-values for column A, for column B, ..., for row E, and for row F. Find the column and row with the largest y-value, and their intersection is the attended letter.
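Here is a minimal sketch of that selection step, assuming each flash has already been reduced to one y-value by the classifier. The function name, the small 2x3 matrix, and the y-values are invented for illustration and are not taken from BCI2000.

```python
from statistics import mean

def select_letter(matrix, flashes):
    """Pick the character at the intersection of the best row and column.

    matrix:  list of rows, each a list of characters
    flashes: list of (kind, index, y) with kind being "row" or "col"
    """
    scores = {}
    for kind, index, y in flashes:
        scores.setdefault((kind, index), []).append(y)
    # Average the y-values over all repetitions of the same row/column flash.
    averaged = {key: mean(ys) for key, ys in scores.items()}
    best_row = max((k for k in averaged if k[0] == "row"), key=averaged.get)[1]
    best_col = max((k for k in averaged if k[0] == "col"), key=averaged.get)[1]
    return matrix[best_row][best_col]

# Hypothetical 2x3 speller matrix, each row/column flashed twice.
matrix = [["A", "B", "C"],
          ["D", "E", "F"]]
flashes = [
    ("row", 0, 0.1), ("row", 1, 1.0), ("col", 0, 0.0),
    ("col", 1, 0.2), ("col", 2, 1.5), ("row", 0, 0.2),
    ("row", 1, 1.4), ("col", 0, 0.3), ("col", 1, 0.1),
    ("col", 2, 0.9),
]
print(select_letter(matrix, flashes))  # row 1 and column 2 win: "F"
```

Averaging over repeated flashes is what makes the selection robust even though a single y-value is noisy.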
The way the P300Classifier works is largely based on Dean Krusienski's work. I think this is the paper:
http://www.ncbi.nlm.nih.gov/pubmed/17124334
I also like a recent paper by Jason Farquhar and Jez Hill that might be seen as a follow-up to Dean's work:
http://www.ncbi.nlm.nih.gov/pubmed/23250668
If you like Jason and Jez's approach but you don't quite understand regularized linear regression then I suggest you take a look at this free online course about Machine Learning:
https://www.coursera.org/course/ml
Re: SWLDA Role in P300
"Are there any correlation between SWLDA with R-squared?"
The R-squared, or "Determination Coefficient", is the fraction of total data variance that is explained by the difference between classes. In the usual derivation, LDA is described as maximizing the ratio of "between-class scatter" S_b to "within-class scatter" S_w, which is also called the "Fisher Ratio". Now, total data variance is the sum of these two scatter values, and the explained variance is identical to S_b, so the R-squared is equal to S_b/(S_b+S_w). Using a little algebra, it is easy to see that the R-squared will be maximized at the same time as the Fisher ratio. Thus, LDA is an algorithm to determine the linear combination of data dimensions for which the R-squared is maximized.
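The "little algebra" can be spelled out as follows, writing F for the Fisher ratio (the symbol F is my shorthand, not part of the original notation):

```latex
F = \frac{S_b}{S_w}, \qquad
R^2 = \frac{S_b}{S_b + S_w}
    = \frac{S_b / S_w}{S_b / S_w + 1}
    = \frac{F}{1 + F}, \qquad
\frac{\mathrm{d}R^2}{\mathrm{d}F} = \frac{1}{(1 + F)^2} > 0 .
```

Since R^2 is a strictly increasing function of F, any weight vector that maximizes the Fisher ratio also maximizes the R-squared.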
LDA makes the implicit assumption that the maximum R-squared over the training data will also be the expected maximum R-squared over the set of possible unseen data.
In the absence of prior information about the set of unseen data, one cannot do better than that, so no algorithm will be able to outperform plain LDA in that case.
As soon as prior information is available, it will be possible to introduce a meaningful distinction between the set of training data and the set of expected unseen data, and to ignore all information in the training data that does not fit that prior information. Thus, for a limited amount of training data, the training result will then be better than that of plain LDA, because the estimated R-squared will be less noisy, and thus its maximum over the training data will, on average, be closer to its maximum over all possible input data.
Compared to plain LDA, SWLDA adds the assumption that only a subset of data dimensions contains useful information, and exploits that assumption in order to improve the ratio of data points to classifier parameters.
In a P300 application, the SWLDA assumption will be true because the classifier sees the entire epoch of around 1 s, whereas the P300 itself will be present during about 1/4 of that time. In other words, 3/4 of data point dimensions will not carry information about the presence of a P300 at all. Thus, using SWLDA rather than plain LDA will improve the result in a similar way as if the amount of training data had been increased by a factor of 4.
"Regularization" is a more general concept of introducing prior information into the training algorithm. Basically, the idea is to encode prior information into a mathematical formula such that the formula's result is an estimate of the plausibility of a given training result. Then, the formula is added to the target function, which gives more plausible results an advantage over less plausible ones. E.g., in order to encode the SWLDA assumption into a mathematical formula, one might count the number of weights exceeding some relative threshold, and subtract it from the R-squared prior to maximization. In general, such ad-hoc regularization will not work well, however, and it will be necessary to use additional prior information in the form of a theoretical framework. If the resulting target function is a meaningful expression in such a framework, chances are good to obtain an actual improvement compared to the non-regularized algorithm.
HTH,
Juergen

Re: SWLDA Role in P300
Hello
I am confused about using SWLDA as a classifier for my problem. I am using it for speech signals of different alphabets.
For example, we recorded 3 alphabets:
Alphabet A
Alphabet B
Alphabet C
After feature extraction, we have an 8x6 matrix for training data and an 8x6 matrix for testing.
Then I used the SWLDA classifier: SWLDA=stepwisefit(train',Truelabel,'penter',.00001,'premove',0.00005,'display','on');
where train is the 8x6 training data and Truelabel is a 6x1 vector like [1 1 2 2 3 3].
The output SWLDA is an 8x1 vector, which is then multiplied by the testing data, an 8x6 matrix: hat=SWLDA'*test .
Now hat is a 1x6 vector. So how can I find the accuracy for each of A, B, and C?
Thank you
Re: SWLDA Role in P300
Does this have anything to do with P300? My answer below assumes that it does not. If you are using the P300 then please tell me because my answer would be quite different. Also, please explain your task in more detail because I do not understand how you can use the P300 for what you described.
stepwisefit is a linear regression technique. It really only applies when y is a continuous variable or when y is binary (0's and 1's, or -1's and 1's). It doesn't make sense to do a linear regression when you have 3 classes: is Alphabet C somehow 3 times greater than Alphabet A but only 1.5 times greater than Alphabet B?
You need a multiclass LDA.
This is from a paragraph in the LDA Wikipedia page:
If classification is required, instead of dimension reduction, there are a number of alternative techniques available. For instance, the classes may be partitioned, and a standard Fisher discriminant or LDA used to classify each partition. A common example of this is "one against the rest" where the points from one class are put in one group, and everything else in the other, and then LDA applied. This will result in C classifiers, whose results are combined. Another common method is pairwise classification, where a new classifier is created for each pair of classes (giving C(C − 1)/2 classifiers in total), with the individual classifiers combined to produce a final classification.
Doing a search for 'multiclass lda' on the MATLAB File Exchange yielded a few potential solutions.
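To illustrate the "one against the rest" idea and how per-class accuracy could be computed, here is a rough Python sketch. Note the assumptions: as a stand-in for a full LDA, each binary classifier simply uses the difference of the two group means as its weight vector (which is what LDA reduces to when the covariance is the identity), and scores of the different classifiers are compared directly. All names and data are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_one_vs_rest(samples, labels):
    """Train one binary linear scorer per class (simplified LDA stand-in)."""
    models = {}
    for c in sorted(set(labels)):
        inside = [x for x, l in zip(samples, labels) if l == c]
        rest = [x for x, l in zip(samples, labels) if l != c]
        m_in, m_rest = centroid(inside), centroid(rest)
        w = [a - b for a, b in zip(m_in, m_rest)]
        # Offset chosen so the midpoint between both means scores zero.
        b = -sum(wi * (a + r) / 2 for wi, a, r in zip(w, m_in, m_rest))
        models[c] = (w, b)
    return models

def predict(models, x):
    """Return the class whose classifier gives the largest score."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

def per_class_accuracy(models, samples, labels):
    """Fraction of correctly classified test samples, per true class."""
    hits, totals = {}, {}
    for x, l in zip(samples, labels):
        totals[l] = totals.get(l, 0) + 1
        hits[l] = hits.get(l, 0) + (predict(models, x) == l)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical 2-D features for three well-separated classes.
train_x = [(0, 0), (0, 1), (5, 0), (5, 1), (0, 5), (1, 5)]
train_y = ["A", "A", "B", "B", "C", "C"]
test_x = [(0.2, 0.5), (5.1, 0.4), (0.5, 5.2)]
test_y = ["A", "B", "C"]

models = train_one_vs_rest(train_x, train_y)
accuracy = per_class_accuracy(models, test_x, test_y)
print(accuracy)
```

Accuracy per class is then just the number of test samples of that class that were assigned the right label, divided by the number of test samples of that class.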