mRMR FAQ
1. Basic questions
Q1.1 What is mRMR?
Q1.2 What are "MID" and "MIQ"?
Q1.3 How to use mRMR?
Q1.4 Where to download the mRMR software and source codes?
Q1.5 What is the correct format of my input data?
Q1.6 How should I understand the results of mRMR?
Q1.7 How to cite/acknowledge mRMR?
Q1.9 Damage & other risks & disclaimer?
2. How to use the online version
Q2.1 Where is the online version of mRMR?
Q2.2 Is it true that the online version only considers the mutual information based mRMR?
Q2.3 What are the input/parameters of the online program?
Q2.4 What should be the input file and its format for the online version?
Q2.5 What is the meaning of the output of the online program?
3. How to use the C/C++ version
Q3.1 Where is the C/C++ version of mRMR?
Q3.2 Can I use the C/C++ version for Linux, Mac, Unix, or other *nix machines?
Q3.3 Can I use the C/C++ version on Windows machines?
Q3.4 Is it true that the C/C++ version only considers the mutual information based mRMR?
Q3.5 What are the input/parameters of the C/C++ program?
Q3.6 What should be the input file and its format for the C/C++ version?
Q3.7 What is the meaning of the output of the C/C++ program?
Q3.8 Are the C/C++ program and the online version the same?
Q3.9 Any help information available when I try to run the C/C++ version?
Q3.10 The C/C++ binary program hangs. Why?
4. How to use the Matlab version
Q4.1 Where is the Matlab version of mRMR?
Q4.2 Can I use the Matlab versions for Linux, Mac, Unix, Windows, or other machines?
Q4.3 Is it true that the released Matlab version only considers the mutual information based mRMR?
Q4.4 What are the input/parameters of the Matlab version?
Q4.5 What is the output of the Matlab version?
Q4.6 What are mrmr_mid_d, mrmr_miq_d, and mrmr_mibase_d?
Q4.7 Do the Matlab version and C/C++ version produce the same results?
5. Handle discrete/categorical and continuous variables
Q5.1 How does mRMR handle continuous variables?
Q5.2 My variables are continuous or have as many as 500/1000 discrete/categorical states. Can I use mRMR?
Q5.3 What is the best way to discretize the data?
Q5.4 Why your README file mentions "Results with the continuous variables are not as good as with Discrete variables"?
6. Other questions
Q6.1 A typo in Eq. (8) in your JBCB05 paper (and also the CSB03 paper)?
Q6.2 Can I ask questions and what is your contact?
1. Basic questions
Q1.1 What is mRMR?
A. It means minimum-Redundancy-Maximum-Relevance feature/variable/attribute selection. The goal is to select a feature subset that best characterizes the statistical property of a target classification variable, subject to the constraint that the selected features are mutually as dissimilar to each other as possible, while each is marginally as similar to the classification variable as possible. We showed several different forms of mRMR, where "relevance" and "redundancy" were defined using mutual information, correlation, t-test/F-test, distances, etc.
Importantly, for mutual information, we showed that the method to detect mRMR features also searches for a feature set whose features jointly have the maximal statistical "dependency" on the classification variable. This "dependency" term is defined using a new form of the high-dimensional mutual information.
The mRMR method was first developed as a fast and powerful feature "filter". We then also showed a method to combine mRMR with "wrapper" selection methods. These methods have produced promising results on a range of datasets in many different areas.
Q1.2 What are "MID" and "MIQ"?
A. MID and MIQ represent the Mutual Information Difference and Mutual Information Quotient schemes, respectively, for combining the relevance and redundancy terms that are defined using Mutual Information (MI). They are the two most commonly used mRMR schemes.
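To illustrate how the two schemes differ, here is a minimal Python sketch of greedy mRMR selection on discrete data. It is not the released C/C++ or Matlab code. It assumes the incremental form in which a candidate's relevance is its mutual information with the class variable and its redundancy is its mean mutual information with the features already selected. The function names, the 0-based indexes, and the small constant guarding the MIQ division are choices made for this example only.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr(features, target, k, scheme="MID"):
    """Greedy mRMR; returns 0-based column indexes in selection order.

    features: list of columns, each a list of discrete values.
    """
    relevance = [mutual_information(col, target) for col in features]
    # The first pick is simply the most relevant feature.
    selected = [max(range(len(features)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(features)):
        best, best_score = None, -math.inf
        for j in range(len(features)):
            if j in selected:
                continue
            redundancy = sum(
                mutual_information(features[j], features[i]) for i in selected
            ) / len(selected)
            if scheme == "MID":
                score = relevance[j] - redundancy  # difference
            else:
                # MIQ quotient; the 1e-12 guard is an assumption of this sketch.
                score = relevance[j] / (redundancy + 1e-12)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
target = [x + y for x, y in zip(a, b)]
# Column 1 duplicates column 0, so it is selected last.
print(mrmr([a, list(a), b], target, k=3))  # [0, 2, 1]
```

In this toy data all three columns are equally relevant on their own, yet the duplicate of the first column is pushed to the end under both MID and MIQ because it adds only redundancy.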
Q1.3 How to use mRMR?
A. There are three ways. 1) Prepare your data and run our online program at the web site http://research.janelia.org/peng/proj/mRMR . 2) Download the precompiled C/C++ version (binary) and run it on your own machine. 3) Download the Matlab versions (binary plus source codes) and run them on your own machine. You can find the software download links at our web site, too.
Q1.4 Where to download the mRMR software and source codes?
A. You can download different versions at this web site, or follow the links to download the Matlab versions. The Matlab version contains all key source codes, including the mRMR algorithm (in Matlab) and the mutual information computation toolbox (in C/C++, which can be compiled as Matlab mex functions).
Q1.5 What is the correct format of my input data?
A. See the answers for the respective mRMR versions below.
Q1.6 How should I understand the results of mRMR? For example, I ran it on a small 3-variable data set & it gave me an output like 2 1 3. Does that mean 2 is the least statistically dependent & 3 is the most dependent? That is, 2 is the most relevant & 3 is the least relevant?
A. That means the first selected feature is 2, and the last is 3. That also means the combination of 2 and 1 is better than the combination of 2 and 3. That also means 2 is the best feature if you only want one feature, and "2 and 1" is the best combination if you want two features.
However, this does NOT mean 3 is the least relevant or most dependent. If you select features by considering only the relationship between each individual feature and the target class variable, ignoring the relationships among the features, you may find that 3 is more relevant than 1. Even so, the combination of 2 and 1 is better than that of 2 and 3.
Q1.7 How to cite/acknowledge mRMR?
A. We will appreciate it if you appropriately cite the following papers:
[TPAMI05] Hanchuan Peng, Fuhui Long, and Chris Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005.
This paper presents a theory of mutual information based feature selection. It demonstrates the relationship of four selection schemes: maximum dependency, mRMR, maximum relevance, and minimal redundancy. It also gives the combination scheme of "mRMR + wrapper" selection and mutual information estimation of continuous/hybrid variables.
[JBCB05] Chris Ding and Hanchuan Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, Vol. 3, No. 2, pp. 185-205, 2005.
This paper presents a comprehensive suite of experimental results of mRMR for microarray gene selection under many different conditions. It is an extended version of the CSB03 paper.
[CSB03] Chris Ding and Hanchuan Peng, "Minimum redundancy feature selection from microarray gene expression data," Proc. 2nd IEEE Computational Systems Bioinformatics Conference (CSB 2003), pp. 523-528, Stanford, CA, Aug, 2003.
This paper presents the first set of mRMR results and different definitions of relevance/redundancy terms.
[IS05] Hanchuan Peng, Chris Ding, and Fuhui Long, "Minimum redundancy maximum relevance feature selection," IEEE Intelligent Systems, Vol. 20, No. 6, pp. 70-71, November/December, 2005.
A short invited essay that introduces mRMR and demonstrates the importance of reducing redundancy in feature selection.
[Bioinfo07] Jie Zhou and Hanchuan Peng, "Automatic recognition and annotation of gene expression patterns of fly embryos," Bioinformatics, Vol. 23, No. 5, pp. 589-596, 2007.
One application of mRMR in selecting good wavelet image features.
Q1.8 Copyright, license & conditions of use?
A. The mRMR software packages can be downloaded and used, subject
to the following conditions: Software and source code Copyright (C) 2000-2007
Written by Hanchuan Peng. These software packages are copyright under the
following conditions: Permission to use, copy, and modify the software and
their documentation is hereby granted to all academic and not-for-profit
institutions without fee, provided that the above copyright notice and this
permission notice appear in all copies of the software and related
documentation and our publications (TPAMI05, JBCB05, CSB03, etc.) are
appropriately cited. Permission to distribute the software or modified or
extended versions thereof on a not-for-profit basis is explicitly granted,
under the above conditions. However, the right to use this software by
companies or other for profit organizations, or in conjunction with for profit
activities, and the right to distribute the software or modified or extended
versions thereof for profit are NOT granted except by prior arrangement and
written consent of the copyright holders. For these purposes, downloads of the
source code constitute "use" and downloads of this source code by for
profit organizations and/or distribution to for profit institutions is
explicitly prohibited without the prior consent of the copyright holders. Use
of this source code constitutes an agreement not to criticize, in any way, the
code-writing style of the author, including any statements regarding the extent
of documentation and comments present. The software is provided "AS-IS"
and without warranty of any kind, expressed, implied or otherwise, including
without limitation, any warranty of merchantability or fitness for a particular
purpose. In no event shall the authors be liable for any special, incidental,
indirect or consequential damages of any kind, or any damages whatsoever
resulting from loss of use, data or profits, whether or not advised of the
possibility of damage, and on any theory of liability, arising out of or in
connection with the use or performance of these software packages.
Q1.9 Damage & other risks & disclaimer?
A. See the detailed disclaimer and conditions in the answer to Q1.8. In short, we will NOT be liable for any damage of any kind, or loss of data, resulting from your use of the released software. Use it entirely at your own risk.
2. How to use the online version
Q2.1 Where is the online version of mRMR?
A. At the web site http://research.janelia.org/peng/proj/mRMR .
Q2.2 Is it true that the online version only considers the mutual information based mRMR?
A. Yes. Since the mutual information based mRMR produces the most promising results and is the most widely used, the online program only uses mutual information to define the relevance and redundancy of features.
Q2.3 What are the input/parameters of the online program?
A. You need an input file, of course (some people just clicked "Submit job" without specifying anything…). You also need to choose how the relevance and redundancy terms should be combined (i.e. MID or MIQ), how many features you want to select, the property of your variables (categorical or continuous), and how you want to discretize your data in case you have continuous data.
Q2.4 What should be the input file and its format for the online version?
A. It should be a CSV (comma-separated values) file, where each row is a sample and each column is a variable/attribute/feature. Make sure your data is separated by commas, not by blank spaces or other characters! The first row must be the feature names, and the first column must be the classes for samples. You may download a testing example data set here, which is the microarray data of lung cancer (7 classes). In this sample data set, each variable/feature/column has been discretized into 3 states, encoded as the digits "-2", "0" and "2".
You may use other integers (such as -1, 0, 1) for the categorical/discrete states you define yourself, but never use letters or combinations of digits and letters (such as "10v"). Try not to use strange states such as 1001 or 10000, as the program will use these values to guess a reasonable amount of memory to allocate. For example, if each variable has only 5 states, then try to use -2,-1,0,1,2, or 0,1,2,3,4, but NEVER use something like "-10000, -1000, 0, 1000, 10000"! (Note: The released version was only designed for obviously meaningful inputs.) More examples can be found at the mRMR web site, too.
Your data can contain continuous values, except in the first column (which is the class variable) and the first row (which is the header). In this case, you can ask mRMR to do the discretization for you. See FAQ part 5.
If you have variables that are continuous or have many categorical states (e.g. several hundred), you may want to read more about how mRMR handles continuous variables. See FAQ part 5.
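As a rough aid for checking a pre-discretized input file against the rules above, the following Python sketch flags the common mistakes: rows with the wrong number of comma-separated fields, non-integer values, and feature columns with many or very large states. The function name and the limits max_states and max_abs are hypothetical values chosen for illustration, not thresholds used by the mRMR program. The check does not apply to files with continuous values that you intend mRMR to discretize.

```python
import csv
import io


def check_discrete_csv(text, max_states=5, max_abs=100):
    """Return a list of problems found in a pre-discretized mRMR input CSV.

    max_states and max_abs are illustrative limits, not values taken from mRMR.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, data = rows[0], rows[1:]
    problems = []
    for row_no, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(
                f"data row {row_no}: {len(row)} fields, expected {len(header)}"
            )
    if problems:
        return problems
    for col, name in enumerate(header):
        values = [row[col].strip() for row in data]
        try:
            states = {int(v) for v in values}
        except ValueError:
            problems.append(f"column {name!r}: contains a non-integer value")
            continue
        if col == 0:
            continue  # first column is the class; only require integers here
        if len(states) > max_states:
            problems.append(f"column {name!r}: {len(states)} distinct states")
        if any(abs(s) > max_abs for s in states):
            problems.append(f"column {name!r}: state value is too large")
    return problems


good = "class,gene1,gene2\n1,-2,0\n2,0,2\n1,2,-2\n"
bad = "class,gene1,gene2\n1,10v,0\n2,0,10000\n"
print(check_discrete_csv(good))  # []
for problem in check_discrete_csv(bad):
    print(problem)
```

A file that uses blank spaces instead of commas is read as a single column, so it is reported as containing a non-integer value.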
Q2.5 What is the meaning of the output of the online program?
A. The meanings of most parts of the output are intuitive: the program automatically compares the features selected using the conventional maximum relevance (MaxRel) method and mRMR. Suppose you ask the program to select 50 features for you; you can then also truncate the results and use the first 20 or 30 features. You can also test the classification accuracy using the first K features, where K=1,…,50 in this case. In this way, you can see with what number of features you get satisfactory cross-validation classification accuracy. This method was used in our papers, too.
The first column is the order of features selected. The second column is the actual indexes of the features. The third column includes the respective feature names extracted from your input file.
The last column is just the best score in the process of selecting the *current* best feature. Indeed, it is the value of "relevance - redundancy" for MID (or "relevance / redundancy" for MIQ) for the currently selected feature. Because for classification all selected features will be used together, this score does not indicate anything about classification, so you do not need to use it.
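The idea of testing the first K selected features for K=1,2,… can be sketched in Python as follows. This is only an illustration: the 1-nearest-neighbour classifier with Hamming distance and the leave-one-out scheme are stand-ins chosen to keep the example dependency-free, not the classifiers or the cross-validation protocol used in the papers, and the function names are hypothetical.

```python
def loo_accuracy(rows, labels, columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    rows: list of samples (lists of discrete values).
    columns: 0-based indexes of the features to use.
    Distance is the number of used features on which two samples differ.
    """
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum(row[c] != rows[j][c] for c in columns),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def accuracy_by_prefix(rows, labels, ranking):
    """Accuracy using the first K ranked features, for K = 1..len(ranking)."""
    return [
        loo_accuracy(rows, labels, ranking[:k])
        for k in range(1, len(ranking) + 1)
    ]


rows = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
labels = [0, 0, 1, 1]
# Suppose the selection order returned for this data was columns 0 then 1.
print(accuracy_by_prefix(rows, labels, [0, 1]))  # [1.0, 1.0]
```

Plotting or tabulating the returned list against K shows where the accuracy levels off, which is one way to decide how many of the selected features to keep.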
3. How to use the C/C++ version
Q3.1 Where is the C/C++ version of mRMR?
A. Follow the download links at the web site http://research.janelia.org/peng/proj/mRMR .
Q3.2 Can I use the C/C++ version for Linux, Mac, Unix, or other *nix machines?
A. Yes. There are precompiled versions. If you cannot find one for your machine, send me email and I will try to compile one for you if possible. You can also compile from the source codes directly by running "make -f mrmr.makefile".
Q3.3 Can I use the C/C++ version on Windows machines?
A. You should be able to compile and run it using MinGW and MSYS. It probably also works with Cygwin, but this is untested. Or you may simply consider using the Matlab version or the online version.
Q3.4 Is it true that the C/C++ version only considers the mutual information based mRMR?
A. Yes. Since the mutual information based mRMR produces the most promising results and is the most widely used, the released C/C++ program only uses mutual information to define the relevance and redundancy of features.
Q3.5 What are the input/parameters of the C/C++ program?
A. You need an input file with the same format explained in Q2.4. You also need to choose how the relevance and redundancy terms should be combined (i.e. MID or MIQ), how many features you want to select, the property of your variables (categorical or continuous), and how you want to discretize your data in case you have continuous data. Just type "mrmr" or the appropriate program name for the help information.
Q3.6 What should be the input file and its format for the C/C++ version?
A. The same as for the online version. See Q2.4.
Q3.7 What is the meaning of the output of the C/C++ program?
A. The same as for the online version. See Q2.5.
Q3.8 Are the C/C++ program and the online version the same?
A. Yes. The online program just provides you a convenient way to run mRMR on relatively small datasets. If you have big datasets, try to use the C/C++ version or the Matlab version.
Q3.9 Any help information available when I try to run the C/C++ version?
A. Just run "mrmr" (or another appropriate program name if you got a special one from me) and it will print the default help information on how to run it. You can also see some examples if you use the web-based online program, which displays the command used at the top of the result page.
Q3.10 The C/C++ binary program hangs. Why?
A. This program does NOT hang unless you feed it an inappropriate input file. For example, some people use variables with hundreds of categorical states or continuous values, which cannot yield a meaningful mutual information estimation in many cases. The released mRMR binary versions use mutual information for discrete variables (if you are interested in mutual information estimation for continuous variables, you can find the formula in the TPAMI05 paper).
If you have continuous data with a big dynamic range (say, from 1 to 1000), the binary version of the mRMR program treats each variable as one with 1000 categories, and thus the computation of mutual information takes a long time to run. That is why the program appears to "hang".
I suggest you pre-threshold (discretize) your data in your own preferred way, in case you do not want to set the thresholds at mean +/- std.
4. How to use the Matlab version
Q4.1 Where is the Matlab version of mRMR?
A. Follow the download links at the web site http://research.janelia.org/peng/proj/mRMR , where you will find the Matlab versions released to MatlabCentral.
For the most commonly used machines, i.e. Linux, Mac, and Windows, the simplest way is to download the mRMR Matlab version with the precompiled mutual information toolbox. Or, you can download the mRMR codes and the mutual information toolbox separately. The mutual information toolbox was also written in C/C++ and can be recompiled as mex functions for Matlab running on other platforms. These codes are essentially similar to the mutual information computation in the C/C++ version of mRMR.
Of course, you will find the mutual information toolbox useful for purposes other than mRMR. For the mRMR case, you will need to add this toolbox to the Matlab path.
Q4.2 Can I use the Matlab versions for Linux, Mac, Unix, Windows, or other machines?
A. Yes.
Q4.3 Is it true that the released Matlab version only considers the mutual information based mRMR?
A. Yes. If you want to use other variants of mRMR, such as the correlation, distance, or t-test/F-test based ones, you can simply replace the mutual information computing function with another measure, for example the corrcoef(.) function in Matlab for correlation. All these variants are described in the JBCB05/CSB03 papers.
Q4.4 What are the input/parameters of the Matlab version?
A. Three inputs: D, F, and K. D is an m*n array of your candidate features, where each row is a sample and each column is a feature/attribute/variable. F is an m*1 vector specifying the classification variable (i.e. the class information for each of the m samples). K is the maximal number of features you want to select.
D must be pre-discretized as a categorical array, i.e. containing only integers. F must be categorical, indicating the class information.
Q4.5 What is the output of the Matlab version?
A. The indexes of the first K features selected. This corresponds to the second column produced in the C/C++ version.
Q4.6 What are mrmr_mid_d, mrmr_miq_d, and mrmr_mibase_d?
A. The first, mrmr_mid_d, is MID. The second, mrmr_miq_d, is MIQ. The third, mrmr_mibase_d, is maximum relevance selection (for comparison). See Q1.2 for more explanations. You can also read our papers for a comparison of MID and MIQ.
Q4.7 Do the Matlab version and C/C++ version produce the same results?
A. Theoretically yes. In practice there may be slight differences in some cases, as they use different floating-point precisions (i.e. one uses double and the other uses float). However, we have not observed any dramatic differences between these two versions.
5. Handle discrete/categorical and continuous variables
Q5.1 How does mRMR handle continuous variables?
A. mRMR is a framework in which relevance & redundancy terms are combined. Mutual information, which is used most of the time, is a useful way to define these two terms, but other options exist. There are three ways for mRMR to handle continuous variables.
(1) Use the t-test / F-test (bi-class/multiclass) as the relevance measure and the correlation among variables as the redundancy measure. Other scores such as distances can also be considered. See the CSB03 & JBCB05 papers for details.
(2) Use mutual information of discrete variables. This requires first discretizing the variables/features. We have chosen to discretize them using mean +/- alpha*std (alpha = 1, 0, 2, or 0.5). The choice of alpha will have some influence on the actual features selected (more precisely, on the ordering of the features you select; if you select several more, you may find that many of them are the same, although possibly in a different order). This is actually a very robust way to select features. See the TPAMI05, CSB03, JBCB05 papers for details.
(3) Use mutual information for continuous variables. The mutual information can be estimated using Parzen windows. See the TPAMI05 paper for the formulas. The Parzen window computation is more expensive than the discrete version of the mutual information computation.
See Q1.7 for more information on these papers.
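For option (1), the two ingredients can be sketched in Python as below. The sketch assumes the textbook one-way ANOVA F-statistic as the relevance score and the absolute Pearson correlation as the redundancy score. Whether these match the papers' definitions term by term has not been checked here, and the function names are hypothetical. Both functions divide by zero for a constant feature, which a real implementation would need to guard against.

```python
import math
from collections import defaultdict


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a continuous feature against class labels."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class variance: note the squared deviation of each class mean.
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    ) / (k - 1)
    # Pooled within-class variance.
    within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    ) / (n - k)
    return between / within


def abs_correlation(x, y):
    """Absolute Pearson correlation between two continuous features."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(var_x * var_y)


feature = [1, 2, 3, 5, 6, 7]
classes = [0, 0, 0, 1, 1, 1]
print(f_statistic(feature, classes))           # 24.0
print(abs_correlation([1, 2, 3], [3, 2, 1]))   # 1.0
```

These two scores can then be combined by difference or by quotient in the same greedy fashion as the mutual information based schemes.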
Q5.2 My variables are continuous or have as many as 500/1000 discrete/categorical states. Can I use mRMR?
A. You can either use mRMR for continuous variables (based on correlation, Parzen-window mutual information estimation, etc.) or first discretize and use mRMR for discrete variables.
Q5.3 What is the best way to discretize the data?
A. In our experiments we have discretized data based on their mean values and standard deviations. We thresholded at mean ± alpha*std, where the alpha value usually ranges from 0.5 to 2. Typically 2, 3, or no more than 5 states for each variable will produce satisfactory results that are quite robust too. But you need to decide what would be the most meaningful way to discretize your data.
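A minimal Python sketch of the mean ± alpha*std thresholding, producing three states per variable, could look like this. The state codes -1/0/1, the use of the population standard deviation, and the handling of values that fall exactly on a threshold are assumptions of this example; the released programs may differ on each point.

```python
import statistics


def discretize(values, alpha=1.0):
    """Map continuous values to -1, 0, 1 using thresholds mean ± alpha*std."""
    mean = statistics.mean(values)
    # Population std is an assumption; sample std is an equally plausible choice.
    std = statistics.pstdev(values)
    low, high = mean - alpha * std, mean + alpha * std
    # Values exactly on a threshold are assigned to the middle state.
    return [-1 if v < low else 1 if v > high else 0 for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(discretize(values, alpha=1.0))  # [-1, -1, 0, 0, 0, 0, 0, 0, 1, 1]
print(discretize(values, alpha=0.5))  # [-1, -1, -1, -1, 0, 0, 1, 1, 1, 1]
```

A smaller alpha narrows the middle band, so more samples land in the outer states. With alpha = 0 the two thresholds coincide at the mean and the result is essentially binary.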
Q5.4 Why does your README file mention "Results with the continuous variables are not as good as with Discrete variables"?
A. We showed some comparison results in our papers (see Q1.7) on both discrete and continuous cases using the same datasets; typically the discretized results are better. There are several reasons, e.g. discretization will often lead to a more robust classification.
6. Other questions
Q6.1 A typo in Eq. (8) in your JBCB05 paper (and also the CSB03 paper)?
A. Yes, there is a typo in the JBCB05 paper (and also the CSB03 paper): in Eq. (8) on page 189, the term (gk-g) should be squared, i.e. (gk-g)^2.
Q6.2 Can I ask questions and what is your contact?
A. Yes, you can ask questions if you cannot find answers above. My contact is Hanchuan (dot) Peng (at) gmail (dot) com.