mRMR FAQ

A. It means minimum-Redundancy-Maximum-Relevance feature/variable/attribute selection. The goal is to select a feature subset that best characterizes the statistical property of a target classification variable, subject to the constraint that these features are mutually as dissimilar to each other as possible, but marginally as similar to the classification variable as possible. We showed several different forms of mRMR, where "relevance" and "redundancy" were defined using mutual information, correlation, t-test/F-test, distances, etc.

Importantly, for mutual information, we showed that the method to detect mRMR features also searches for a feature set whose features jointly have the maximal statistical "dependency" on the classification variable. This "dependency" term is defined using a new form of the high-dimensional mutual information.

The mRMR method was first developed as a fast and powerful feature "filter". We then also showed a method to combine mRMR and "wrapper" selection methods. These methods have produced promising results on a range of datasets in many different areas.

Q1.2 What are "MID" and "MIQ"?

A. MID and MIQ represent the Mutual Information Difference and Quotient schemes, respectively, to combine the relevance and redundancy that are defined using Mutual Information (MI). They are the two most used mRMR schemes.
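For orientation, the two schemes can be written in their incremental (one feature at a time) form as below. The notation is chosen here for illustration and is an assumption rather than a quotation from the papers: c is the class variable, S is the set of features already selected, x_j is a candidate feature not yet in S, and I(.;.) is mutual information.

```latex
\text{MID:}\quad \max_{x_j \notin S}\;\Big[\, I(x_j; c) \;-\; \frac{1}{|S|}\sum_{x_i \in S} I(x_j; x_i) \Big]

\text{MIQ:}\quad \max_{x_j \notin S}\;\Big[\, I(x_j; c) \;\Big/\; \frac{1}{|S|}\sum_{x_i \in S} I(x_j; x_i) \Big]
```

In both cases the first term is the relevance of the candidate and the averaged sum is its redundancy with the features already chosen.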

Q1.3 How to use mRMR?

A. There are three ways. 1) Prepare your data and run our online program at the web site http://research.janelia.org/peng/proj/mRMR . 2) Download the precompiled C/C++ version (binary) and run it on your own machine. 3) Download the Matlab versions (binary plus source code) and run them on your own machine. You can find the software download links at our web site, too.

Q1.4 Where to download the mRMR software and source codes?

A. You can download different versions at this web site, or follow the links to download the Matlab versions. The Matlab version contains all key source code, including the mRMR algorithm (in Matlab) and the mutual information computation toolbox (in C/C++, which can be compiled as Matlab mex functions).

Q1.5 What is the correct format of my input data?

A. See the answers to the respective mRMR versions below.

A. For an output that lists the features in the order 2, 1, 3, that means the first selected feature is 2, and the last is 3.

That also means the combination of 2 and 1 is better than the combination of 2 and 3.

That also means 2 is the best feature if you only want one feature. And "2 and 1" is the best combination if you want two features.

However, this does NOT mean 3 is the least relevant or most dependent. If you select features without considering the relationship between all the features, but only between individual features and the target class variable, you may find that 3 is more relevant than 1. However, the combination of 2 and 1 is better than that of 2 and 3.
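The ordering logic described above can be reproduced on a toy problem. The following is a minimal Python sketch of greedy mRMR selection for discrete data, not the released implementation: the function names, the toy data, and the small constant in the MIQ denominator are all assumptions made for illustration. The toy columns are built so that feature 3 is individually more relevant than feature 1 but is fully determined by feature 2.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(columns, target, k, scheme="MID"):
    """Greedy mRMR; returns 0-based column indexes in selection order.

    columns: list of feature columns, each a list of discrete states
    target:  list of class labels
    scheme:  "MID" (relevance - redundancy) or "MIQ" (relevance / redundancy)
    """
    relevance = [mutual_information(col, target) for col in columns]
    # The first feature is simply the most relevant one.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(columns)):
        best, best_score = None, float("-inf")
        for j in range(len(columns)):
            if j in selected:
                continue
            # Redundancy: mean mutual information with the features chosen so far.
            redundancy = sum(
                mutual_information(columns[j], columns[i]) for i in selected
            ) / len(selected)
            if scheme == "MID":
                score = relevance[j] - redundancy
            else:
                # Tiny constant (an assumption of this sketch) avoids division by zero.
                score = relevance[j] / (redundancy + 1e-12)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


# Hypothetical toy data: 8 samples, 4 classes, 3 discrete features.
feature1 = [0, 0, 0, 0, 0, 0, 1, 1]
feature2 = [0, 0, 1, 1, 2, 2, 2, 2]
feature3 = [0, 0, 0, 0, 1, 1, 1, 1]
classes = [0, 0, 1, 1, 2, 2, 3, 3]
columns = [feature1, feature2, feature3]

order = mrmr(columns, classes, 3, "MID")
print([i + 1 for i in order])  # [2, 1, 3]

# Ranking by relevance alone puts feature 3 ahead of feature 1.
relevance = [mutual_information(col, classes) for col in columns]
print(sorted(range(1, 4), key=lambda i: -relevance[i - 1]))  # [2, 3, 1]
```

Here feature 3 has relevance 1.0 bit against about 0.81 bits for feature 1, yet it adds nothing once feature 2 is known, so both MID and MIQ pick feature 1 second.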

Q1.7 How to cite/acknowledge mRMR?

A. We would appreciate it if you cite the following papers appropriately:

[TPAMI05] Hanchuan Peng, Fuhui Long, and Chris Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp.1226-1238, 2005. [PDF]

This paper presents a theory of mutual information based feature selection. It demonstrates the relationship of four selection schemes: maximum dependency, mRMR, maximum relevance, and minimal redundancy. It also gives the combination scheme of "mRMR + wrapper" selection and mutual information estimation of continuous/hybrid variables.

[JBCB05] Chris Ding, and Hanchuan Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, Vol. 3, No. 2, pp.185-205, 2005. [PDF]

This paper presents a comprehensive suite of experimental results of mRMR for microarray gene selection under many different conditions. It is an extended version of the CSB03 paper.

[CSB03] Chris Ding, and Hanchuan Peng, "Minimum redundancy feature selection from microarray gene expression data," Proc. 2nd IEEE Computational Systems Bioinformatics Conference (CSB 2003), pp.523-528, Stanford, CA, Aug, 2003. [PDF]

This paper presents the first set of mRMR results and different definitions of relevance/redundancy terms.

[IS05] Hanchuan Peng, Chris Ding, and Fuhui Long, "Minimum redundancy maximum relevance feature selection," IEEE Intelligent Systems, Vol. 20, No. 6, pp.70-71, November/December, 2005. [PDF]

A short invited essay that introduces mRMR and demonstrates the importance of reducing redundancy in feature selection.

[Bioinfo07] Jie Zhou, and Hanchuan Peng, "Automatic recognition and annotation of gene expression patterns of fly embryos," Bioinformatics, Vol. 23, No. 5, pp. 589-596, 2007. [PDF]

One application of mRMR in selecting good wavelet image features.

Q1.8 License information?

A. The mRMR software packages can be downloaded and used, subject to the following conditions: Software and source code Copyright (C) 2000-2007 Written by Hanchuan Peng. These software packages are copyright under the following conditions:

Permission to use, copy, and modify the software and their documentation is hereby granted to all academic and not-for-profit institutions without fee, provided that the above copyright notice and this permission notice appear in all copies of the software and related documentation and our publications (TPAMI05, JBCB05, CSB03, etc.) are appropriately cited. Permission to distribute the software or modified or extended versions thereof on a not-for-profit basis is explicitly granted, under the above conditions.

However, the right to use this software by companies or other for profit organizations, or in conjunction with for profit activities, and the right to distribute the software or modified or extended versions thereof for profit are NOT granted except by prior arrangement and written consent of the copyright holders. For these purposes, downloads of the source code constitute "use" and downloads of this source code by for profit organizations and/or distribution to for profit institutions is explicitly prohibited without the prior consent of the copyright holders. Use of this source code constitutes an agreement not to criticize, in any way, the code-writing style of the author, including any statements regarding the extent of documentation and comments present.

The software is provided "AS-IS" and without warranty of any kind, expressed, implied or otherwise, including without limitation, any warranty of merchantability or fitness for a particular purpose. In no event shall the authors be liable for any special, incidental, indirect or consequential damages of any kind, or any damages whatsoever resulting from loss of use, data or profits, whether or not advised of the possibility of damage, and on any theory of liability, arising out of or in connection with the use or performance of these software packages.

Q1.9 Damage & other risks & disclaimer?

A. See the detailed disclaimer and conditions in the answer to Q1.8. In short, we will NOT be liable for any damage of any kind, or loss of data, resulting from your use of the released software. It is all at your own risk.


2. How to use the online version

Q2.1 Where is the online version of mRMR?

A. The web site http://research.janelia.org/peng/proj/mRMR .

Q2.2 Is it true that the online version only considers the mutual information based mRMR?

A. Yes. Since the mutual information based mRMR produces the most promising results and is the most widely used, the online program only uses mutual information to define relevance and redundancy of features.

Q2.3 What are the input/parameters of the online program?

A. You need an input file, of course (some people just clicked "Submit job" without specifying anything…). You also need to choose how the relevancy and redundancy terms should be combined (i.e. MID or MIQ), how many features you want to select, the property of your variables (categorical or continuous), and how you want to discretize your data in case you have continuous data.

Q2.4 What should be the input file and its format for the online version?

A. It should be a CSV (comma-separated values) file, where each row is a sample and each column is a variable/attribute/feature. Make sure your data are separated by commas, not by blank spaces or other characters! The first row must be the feature names, and the first column must be the classes for samples. You may download a testing example data set here, which is the microarray data of lung cancer (7 classes). In this sample data set, each variable/feature/column has been discretized into 3 states, encoded as the digits "-2", "0" and "2". You may use other integers (such as -1, 0, 1) for categorical/discrete states that you define yourself, but never use letters or combinations of digits and letters (such as "10v"). Try not to use strange states such as 1001 or 10000, as the program will use these values to guess a reasonable amount of memory to allocate. For example, if each variable has only 5 states, then try to use -2,-1,0,1,2, or 0,1,2,3,4, but NEVER use something like "-10000, -1000, 0, 1000, 10000"! (Note: The released version was only designed for the obviously meaningful inputs.) More examples can be found at the mRMR web site, too.

Your data can contain continuous values, except the first column (which is the class variable) and first row (which is the header). In this case, you can ask mRMR to do discretization for you. See FAQ part 5.

If you have variables that are continuous or have many categorical states (e.g. several hundred), you may want to read more about how mRMR handles continuous variables. See FAQ part 5.
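Before submitting a file, it can help to check it against the layout rules above. The following is a minimal Python sketch for a file that has already been discretized; it is not part of the mRMR release. The function name `check_mrmr_csv` and the cutoff `max_abs_state=100` for "strange" states are hypothetical choices made for illustration.

```python
import csv
import io


def check_mrmr_csv(text, max_abs_state=100):
    """Return a list of problems found in pre-discretized CSV text.

    Assumes the layout described above: a header row of feature names,
    then one sample per row with the class in the first column.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return ["need a header row plus at least one sample row"]
    problems = []
    width = len(rows[0])
    # Line numbers are 1-based; line 1 is the header.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(
                f"line {line_no}: expected {width} columns, found {len(row)}"
            )
            continue
        for col_no, cell in enumerate(row, start=1):
            try:
                value = int(cell.strip())
            except ValueError:
                problems.append(
                    f"line {line_no}, column {col_no}: {cell!r} is not an integer"
                )
                continue
            if abs(value) > max_abs_state:
                problems.append(
                    f"line {line_no}, column {col_no}: state {value} is unusually large"
                )
    return problems


good = "class,g1,g2\n1,-2,0\n2,2,0\n"
bad = "class,g1,g2\n1,10v,0\n2,10000,0\n3 4 5\n"
print(check_mrmr_csv(good))       # []
print(len(check_mrmr_csv(bad)))   # 3
```

The three problems reported for the second text are the mixed state "10v", the state 10000, and the last row, which is separated by spaces instead of commas. A file with continuous values that you want the program to discretize would need a looser check than this one.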

Q2.5 What is the meaning of the output of the online program?

A. The meanings of most parts of the output are intuitive: the program automatically compares the features selected using the conventional maximum relevance (MaxRel) method and those selected using mRMR. Suppose you ask the program to select 50 features for you; you can then also truncate the results and use only the first 20 or 30 features. You can also test the classification accuracy using the first K features, where K=1,…,50 in this case. In this way, you can see how many features you need to reach a satisfactory cross-validation classification accuracy. This method was used in our papers, too.

The first column is the order of features selected. The second column is the actual indexes of the features. The third column includes the respective feature names extracted from your input file.

The last column is just the best score in the process of selecting the *current* best feature, i.e. the value of "relevancy - redundancy" for MID, or "relevancy / redundancy" for MIQ, for the currently selected feature. Because all selected features are used together for classification, this score does not indicate anything about classification, so it is NOT important for you to use.
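The "first K features" test mentioned above can be sketched as follows. This is a minimal Python illustration, not the procedure used in the papers: the classifier is a plain 1-nearest-neighbour rule with leave-one-out cross-validation, chosen only because it needs no external library, and the function names and toy data are hypothetical.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    rows:     list of samples, each a tuple of discrete feature values
    features: 0-based indexes of the features to use
    Distance is the number of selected features on which two samples differ.
    """
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, None
        for j, other in enumerate(rows):
            if j == i:
                continue  # the held-out sample may not vote for itself
            d = sum(row[f] != other[f] for f in features)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def accuracy_curve(rows, labels, ranked):
    """Accuracy using the first K ranked features, for K = 1..len(ranked)."""
    return [loo_accuracy(rows, labels, ranked[:k]) for k in range(1, len(ranked) + 1)]


# Hypothetical toy data: 8 samples, 3 features, 4 classes.
rows = [(0, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 0),
        (0, 2, 1), (0, 2, 1), (1, 2, 1), (1, 2, 1)]
labels = [0, 0, 1, 1, 2, 2, 3, 3]
ranked = [1, 0, 2]  # 0-based feature indexes in selection order

print(accuracy_curve(rows, labels, ranked))  # [0.75, 1.0, 1.0]
```

In this toy case the curve flattens after two features, which is the kind of point at which you might decide to truncate the selected list.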


3. How to use the C/C++ version

Q3.1 Where is the C/C++ version of mRMR?

A. Follow the download links at the web site http://research.janelia.org/peng/proj/mRMR .

Q3.2 Can I use the C/C++ version for Linux, Mac, Unix, or other *nix machines?

A. Yes. There are precompiled versions. If you cannot find one for your machine, send me an email and I will try to compile one for you if possible. You can also compile from the source code directly by running “make -f mrmr.makefile”.

Q3.3 Can I use the C/C++ version on Windows machines?

A. You should be able to compile and run it using MinGW and MSYS. It will probably also work under Cygwin, but this is untested. Or you may want to simply consider using the Matlab version or the online version.

Q3.4 Is it true that the C/C++ version only considers the mutual information based mRMR?

A. Yes. Since the mutual information based mRMR produces the most promising results and is the most widely used, the released C/C++ program only uses mutual information to define relevance and redundancy of features.

Q3.5 What are the input/parameters of the C/C++ program?

A. You need an input file with the same format explained in Q2.4. You also need to choose how the relevancy and redundancy terms should be combined (i.e. MID or MIQ), how many features you want to select, the property of your variables (categorical or continuous), and how you want to discretize your data in case you have continuous data. Just type "mrmr" or the appropriate program name for the help information.

Q3.6 What should be the input file and its format for the C/C++ version?

A. The same as for the online version. See Q2.4.

Q3.7 What is the meaning of the output of the C/C++ program?

A. The same as for the online version. See Q2.5.

Q3.8 Are the C/C++ program and the online version the same?

A. Yes. The online program just provides you a convenient way to run mRMR on relatively small datasets. If you have big datasets, try to use the C/C++ version or the Matlab version.

Q3.9 Any help information available when I try to run the C/C++ version?

A. Just run "mrmr" (or other appropriate program name if you get a special one from me) and it will print the default help information on how to run it. You can also see some examples if you use the web-based online program, which displays the command used at the top of the result page.

Q3.10 The C/C++ binary program hangs. Why?

A. This program does NOT hang unless you feed it an inappropriate input file. For example, some people use variables with hundreds of categorical states or continuous values, which cannot yield a meaningful mutual information estimation in many cases. The released mRMR binary versions use mutual information for discrete variables (if you are interested in the mutual information estimation for continuous variables, you can find the formula in the TPAMI05 paper).

If you have continuous data with a big dynamic range (say, from 1 to 1000), the binary version of the mRMR program treats each variable as one with 1000 categories, and thus the computation of mutual information takes a long time to run. That is why the program appears to "hang".

I suggest you pre-threshold (discretize) your data using your own favorite method, in case you don't like setting the threshold as mean+/- std.
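To get a feeling for why a wide value range is costly, the following Python sketch counts the cells of a joint histogram for a pair of variables. It assumes, purely for illustration, that each variable is given one histogram bin per integer between its minimum and maximum value; how the released program actually sizes its tables is not stated here, and the function names are hypothetical.

```python
def value_span(column):
    """Number of integer states between the smallest and largest value, inclusive."""
    return max(column) - min(column) + 1


def joint_table_cells(col_a, col_b):
    """Cells in a joint histogram that has one bin per integer in each value range."""
    return value_span(col_a) * value_span(col_b)


# Three states encoded as -2, 0, 2: a 5 x 5 table per pair of variables.
print(joint_table_cells([-2, 0, 2], [-2, 0, 2]))  # 25

# Values ranging from 1 to 1000: a 1000 x 1000 table per pair of variables.
print(joint_table_cells([1, 1000], [1, 1000]))    # 1000000
```

Under this assumption the cost per pair grows with the square of the value range, and mRMR evaluates many such pairs, so compact state codes make a large difference.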


4. How to use the Matlab version

Q4.1 Where is the Matlab version of mRMR?

A. Follow the download links at the web site http://research.janelia.org/peng/proj/mRMR ; you will find the Matlab versions released to MatlabCentral.

For most commonly used machines, i.e. Linux, Mac, and Windows, the simplest way is to download the mRMR Matlab version with the precompiled mutual information toolbox. Or, you can download the mRMR code and the mutual information toolbox separately. The mutual information toolbox was also written in C/C++ and can be recompiled as mex functions for Matlab running on other platforms. This code is essentially similar to the mutual information computation in the C/C++ version of mRMR.

Of course, you will find the mutual information toolbox useful for purposes other than mRMR. For the mRMR case, you will need to set the Matlab path to this toolbox.

Q4.2 Can I use the Matlab versions for Linux, Mac, Unix, Windows, or other machines?

A. Yes.

Q4.3 Is it true that the released Matlab version only considers the mutual information based mRMR?

A. Yes. If you want to use other variants of mRMR, such as those based on correlation, distance, or t-test/F-test, you can simply replace the mutual information computing function with another measure, for example the corrcoef(.) function in Matlab for the correlation variant. All these variants have been described in the JBCB 2005/CSB 2003 papers.
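As an illustration of what such a replacement measure computes, here is a minimal Python sketch of a correlation-based redundancy term between two features. It is an analogue of calling corrcoef(.) on two columns, not code from the release; the function name is hypothetical, and taking the absolute value (so that strongly anti-correlated features also count as redundant) is an assumption of this sketch.

```python
from math import sqrt


def abs_pearson(xs, ys):
    """Absolute Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation information
    return abs(cov) / sqrt(vx * vy)


print(abs_pearson([1, 2, 3], [2, 4, 6]))  # 1.0
print(abs_pearson([1, 2, 3], [1, 1, 1]))  # 0.0
```

A value near 1 marks a pair of features as highly redundant and a value near 0 as nearly unrelated, which is the role mutual information between features plays in the released version.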

Q4.4 What are the input/parameters of the Matlab version?

A. Three arrays, D, F, and K. D is an m*n array of your candidate features, where each row is a sample and each column is a feature/attribute/variable. F is an m*1 vector, specifying the classification variable (i.e. the class information for each of the m samples). K is the maximal number of features you want to select.

D must be pre-discretized as a categorical array, i.e. containing only integers. F must be categorical, indicating the class information.

Q4.5 What is the output of the Matlab version?

A. The indexes of the first K features selected. This corresponds to the second column produced in the C/C++ version.

Q4.6 What are mrmr_mid_d, mrmr_miq_d, and mrmr_mibase_d?

A. The first, mrmr_mid_d, is MID. The second, mrmr_miq_d, is MIQ. The third, mrmr_mibase_d, is maximum relevance selection (for comparison). See Q1.2 for more explanations. You can also read our papers for a comparison of MID and MIQ.

Q4.7 Do the Matlab version and C/C++ version produce the same results?

A. Theoretically yes. In practice there may be slight differences in some cases, as they use different floating-point precisions (i.e. one uses double and the other uses float). However, we have not observed any dramatic differences between these two versions.

5. Handling discrete/categorical and continuous variables

Q5.1 How does mRMR handle continuous variables?

A. mRMR is a framework where the relevance & redundancy terms should be combined. Mutual information, which is used most of the time, is a useful method to define these two terms, but other options exist. There are three ways for mRMR to handle continuous variables.

(1) Use t-test / F-test (bi-class/multiclass) for relevance measure and the correlation among variables as redundancy measure. Other scores such as distances can also be considered. See the CSB03 & JBCB05 papers for details.

(2) Use mutual information of discrete variables. This requires first discretizing the variables/features. We have chosen to discretize them using mean ± alpha*std (alpha = 0, 0.5, 1, or 2). The choice of alpha will have some influence on the actual features selected (more precisely, on the ordering of the features you select; if you select several more, you may find that many of them are the same, although possibly in a different order). This is actually a very robust way to select features. See the TPAMI05, CSB03, JBCB05 papers for details.

(3) Use mutual information for continuous variables. The mutual information can be estimated using Parzen windows. See the TPAMI05 paper for the formulas. The computation of Parzen window is more expensive than the discrete version of mutual information computation.

See Q1.7 for more information on these papers.

Q5.2 My variables are continuous or have as many as 500/1000 discrete/categorical states. Can I use mRMR?

A. You can either use mRMR for continuous variables (based on correlation, Parzen-windows mutual information estimation, etc) or first discretize and use the mRMR for discrete variables.

Q5.3 What is the best way to discretize the data?

A. In our experiments we have discretized data based on their mean values and standard deviations. We thresholded at mean ± alpha*std, where the alpha value usually ranges from 0.5 to 2. Typically 2, 3, or no more than 5 states for each variable will produce satisfactory results that are quite robust, too. But you need to decide what would be the most meaningful way to discretize your data.
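The thresholding rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released code: the function name `discretize` and the state codes -1, 0, 1 are hypothetical, and the sketch uses the population standard deviation (whether the release uses the population or the sample form is not stated here).

```python
from statistics import mean, pstdev


def discretize(values, alpha=1.0):
    """Map continuous values to -1, 0, 1 using thresholds at mean ± alpha*std."""
    m, s = mean(values), pstdev(values)
    lo, hi = m - alpha * s, m + alpha * s
    return [-1 if v < lo else 1 if v > hi else 0 for v in values]


values = [1.0, 2.0, 3.0, 4.0, 10.0]
print(discretize(values))             # [0, 0, 0, 0, 1]
print(discretize(values, alpha=0.5))  # [-1, -1, 0, 0, 1]
```

The two calls show how the choice of alpha changes the resulting states for the same variable: a smaller alpha narrows the middle band and pushes more values into the outer states.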

Q5.4 Why does your README file mention "Results with the continuous variables are not as good as with Discrete variables"?

A. We showed some comparison results in our papers (see Q1.7) on both discrete and continuous cases using the same datasets. Typically the discretized results are better. There are several reasons, e.g. discretization will often lead to a more robust classification.


6. Other questions

Q6.1 Is there a typo in Eq. (8) in your JBCB05 paper (and also the CSB03 paper)?

A. Yes, there is a typo in the JBCB05 paper (and also the CSB03 paper). In Eq. (8) on page 189, the term (gk-g) should be squared, i.e. (gk-g)^2.
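For orientation, the textbook one-way F-statistic has a squared term in exactly this position. The form below uses standard notation, which is assumed here rather than copied from the paper: K classes, n_k samples in class k, n samples in total, \bar{g}_k the mean of the feature in class k, \bar{g} its overall mean, and \sigma_k^2 its variance within class k.

```latex
F = \frac{\sum_{k=1}^{K} n_k \,(\bar{g}_k - \bar{g})^2 \,/\, (K-1)}{\sigma^2},
\qquad
\sigma^2 = \frac{\sum_{k=1}^{K} (n_k - 1)\,\sigma_k^2}{n - K}
```

Without the square, the numerator would sum signed deviations of the class means, which cancel instead of measuring between-class spread.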

Q6.2 Can I ask questions and what is your contact?

A. Yes, you can ask questions if you cannot find answers above. My contact is Hanchuan (dot) Peng (at) gmail (dot) com.
