Mc 2

From Free Software Directory
Revision as of 08:21, 12 April 2011 by WikiSysop


Converts text documents into a vector space model

MC is a C++ program that creates vector-space models from text documents for use in text mining applications. It provides an efficient multi-threaded implementation that can process very large document collections. The MC program:

1. Recursively descends directories, finding text files.
2. Processes files selectively through full regular expression matching of file names.
3. Builds a sparse matrix of word/token counts. The particular sparse matrix format used is documented separately.
4. Treats any user-specified text format (such as email addresses or URLs) as a single token, through regular expression matching or a FLEX definition.
5. Prunes the vocabulary by word length and frequency.
6. Excludes user-specified stop words.
7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, or lfx scaling schemes.
8. Writes all data structures to disk in the Compressed Column Storage format.

The application does not have English parsing or part-of-speech tagging facilities, and it lacks complete documentation.
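To make steps 3, 7, and 8 more concrete, here is a minimal sketch of how a word-by-document count matrix might be held in Compressed Column Storage and then length-normalized. It is not MC's actual code: the names CcsMatrix, build_ccs, and normalize_columns are hypothetical, and it assumes that each column is one document and that a scheme such as txn means raw term frequency, no global weighting, and L2 normalization of each document vector (the common SMART-style reading of the three letters).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical container (not MC's own type): a word-by-document matrix in
// Compressed Column Storage, with one column per document.
struct CcsMatrix {
    std::vector<std::size_t> col_ptr;  // entries of column j: col_ptr[j] .. col_ptr[j+1]-1
    std::vector<std::size_t> row_idx;  // word id (row) of each stored entry
    std::vector<double> values;        // count or weight of each stored entry
};

// Build the count matrix from already-tokenized documents.
// Word ids are assigned in order of first appearance and recorded in `vocab`.
CcsMatrix build_ccs(const std::vector<std::vector<std::string>>& docs,
                    std::map<std::string, std::size_t>& vocab) {
    CcsMatrix m;
    m.col_ptr.push_back(0);
    for (const auto& doc : docs) {
        std::map<std::size_t, double> counts;  // keeps row indices sorted
        for (const auto& tok : doc) {
            // emplace leaves the map unchanged if the word is already known.
            auto it = vocab.emplace(tok, vocab.size()).first;
            counts[it->second] += 1.0;
        }
        for (const auto& kv : counts) {
            m.row_idx.push_back(kv.first);
            m.values.push_back(kv.second);
        }
        m.col_ptr.push_back(m.values.size());
    }
    return m;
}

// Scale every document column to unit Euclidean length
// (the assumed meaning of the trailing "n" in txn).
void normalize_columns(CcsMatrix& m) {
    assert(!m.col_ptr.empty());
    for (std::size_t j = 0; j + 1 < m.col_ptr.size(); ++j) {
        double sum_sq = 0.0;
        for (std::size_t k = m.col_ptr[j]; k < m.col_ptr[j + 1]; ++k)
            sum_sq += m.values[k] * m.values[k];
        if (sum_sq == 0.0) continue;  // empty document: nothing to scale
        const double norm = std::sqrt(sum_sq);
        for (std::size_t k = m.col_ptr[j]; k < m.col_ptr[j + 1]; ++k)
            m.values[k] /= norm;
    }
}
```

Only nonzero counts are stored, so memory grows with the number of distinct words per document rather than with the full vocabulary size, which is what makes this layout practical for large collections.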


version 2.29 (stable)
released on 9 November 2004



License: GPLv2
Verified by: Janet Casey
Verified on: 2 July 2001

Leaders and contributors

James Fan (Maintainer)

Resources and communication

Software prerequisites

Required to build: STL
Required to build: pthread library
Required to build: FLEX

This entry (in part or in whole) was last reviewed on 20 February 2017.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the page “GNU Free Documentation License”.

The copyright and license notices on this page only apply to the text on this page. Any software or copyright-licenses or other similar notices described in this text has its own copyright notice and license, which can usually be found in the distribution or license text itself.