MC
MC is a C++ program that creates vector-space models from text documents for use in text mining applications. MC provides an efficient multi-threaded implementation that can process very large document collections.
The MC program:
1. Recursively descends directories, finding text files.
2. Processes files selectively through full regular expression matching of file names.
3. Builds a sparse matrix of word/token counts. The particular sparse matrix format used is given here.
4. Treats any user-specified text format (such as email addresses or URLs) as a single token, through regular expression matching or a FLEX definition.
5. Prunes the vocabulary by word length and frequency.
6. Excludes user-specified stop words.
7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, lfx scaling schemes.
8. Writes all data structures to disk in the Compressed Column Storage format.
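Steps 3 and 8 above can be illustrated with a minimal sketch of a word-by-document count matrix held in Compressed Column Storage. This is not MC's own code: the names CcsMatrix and build_ccs are hypothetical, and the sketch assumes one column per document, one row per word, and word ids assigned in sorted order. It keeps raw counts only and omits pruning, stop words, and the scaling schemes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical container for a word-by-document matrix in
// Compressed Column Storage (assumed layout: columns are documents).
struct CcsMatrix {
    std::vector<double> values;        // nonzero counts, stored column by column
    std::vector<std::size_t> row_idx;  // row (word id) of each nonzero value
    std::vector<std::size_t> col_ptr;  // col_ptr[j] is where column j starts;
                                       // the final entry equals the nonzero count
    std::vector<std::string> vocab;    // word belonging to each row id
};

// Build the matrix from documents that are already split into tokens.
CcsMatrix build_ccs(const std::vector<std::vector<std::string>>& docs) {
    CcsMatrix m;
    std::map<std::string, std::size_t> word_id;

    // First pass: collect the vocabulary; std::map keeps it sorted.
    for (const auto& doc : docs)
        for (const auto& w : doc) word_id.emplace(w, 0);

    // Assign consecutive row ids in sorted word order.
    for (auto& entry : word_id) {
        entry.second = m.vocab.size();
        m.vocab.push_back(entry.first);
    }

    // Second pass: count words per document and append one column each.
    m.col_ptr.push_back(0);
    for (const auto& doc : docs) {
        std::map<std::size_t, double> counts;  // row id -> count, ascending
        for (const auto& w : doc) counts[word_id.at(w)] += 1.0;
        for (const auto& rc : counts) {
            m.row_idx.push_back(rc.first);
            m.values.push_back(rc.second);
        }
        m.col_ptr.push_back(m.values.size());
    }

    assert(m.col_ptr.size() == docs.size() + 1);
    return m;
}
```

In this layout, the nonzero entries of document j occupy positions col_ptr[j] up to (but not including) col_ptr[j + 1] in both values and row_idx, so only the words that actually occur in a document take up space.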
The application has no English parsing or part-of-speech tagging facilities, and its documentation is incomplete.
Last updated 29 Nov, 2007
About
Leadership
- James Fan - Maintainer
Requirements
- FLEX (Build Prerequisite)
- STL (Build Prerequisite)
- pthread library (Build Prerequisite)
Versions
2.19
2.19 stable released 2001-06-26
- Released: 26 Jun, 2001
- Code Maturity: Stable
- Source Archive: http://www.cs.utexas.edu/users/jfan/dm/src/
- Licenses: GPLv2
- Interfaces: Command Line
User Community and Support
User README available in HTML format from http://www.cs.utexas.edu/users/jfan/dm/README.html



