Difference between revisions of "Mc 2"

From Free Software Directory
(Created page with "{{Entry |Name=mc |Short description=Converts text documents into a vector space model |Full description=MC is a C++ program that creates vector-space models from text documents t...")
 
(Changed URLs to webarchive removed gnu)
Line 3: Line 3:
 
|Short description=Converts text documents into a vector space model
 
|Short description=Converts text documents into a vector space model
 
|Full description=MC is a C++ program that creates vector-space models from text documents that can be used for text mining applications. MC provides an efficient multi-threaded implementation that can process very large document collections. The MC program: 1. Recursively descends directories, finding text files. 2. Processes files selectively through full regular expression matching of file names. 3. Builds a sparse matrix of word/token counts. The particular sparse matrix format used is given here. 4. Processes any user-specified text formats (email addresses or URLs) as whole tokens through regular expression matching or a FLEX definition. 5. Prunes the vocabulary by word length and frequency. 6. Excludes user-specified stop words. 7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, lfx scaling schemes. 8. Writes all data structures to disk in the Compressed Column Storage format. The application does not have English parsing or part-of-speech tagging facilities, or complete documentation.
 
|Full description=MC is a C++ program that creates vector-space models from text documents that can be used for text mining applications. MC provides an efficient multi-threaded implementation that can process very large document collections. The MC program: 1. Recursively descends directories, finding text files. 2. Processes files selectively through full regular expression matching of file names. 3. Builds a sparse matrix of word/token counts. The particular sparse matrix format used is given here. 4. Processes any user-specified text formats (email addresses or URLs) as whole tokens through regular expression matching or a FLEX definition. 5. Prunes the vocabulary by word length and frequency. 6. Excludes user-specified stop words. 7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, lfx scaling schemes. 8. Writes all data structures to disk in the Compressed Column Storage format. The application does not have English parsing or part-of-speech tagging facilities, or complete documentation.
 +
|Homepage URL=http://web.archive.org/web/20090921192233/http://www.cs.utexas.edu/users/jfan/dm/
 
|User level=none
 
|User level=none
|Status=Vanished
+
|Keywords=data-mining,text-mining,vector-space-model,bag-of-words
|Component programs=
+
|Version identifier=2.29
|Homepage URL=http://www.cs.utexas.edu/users/jfan/dm/
+
|Version date=2004/11/09
|VCS checkout command=
+
|Version status=stable
|Computer languages=C++
+
|Version download=https://web.archive.org/web/20090723195805/http://www.cs.utexas.edu/users/jfan/dm/src/mc.src.2.29.tar.gz
|Documentation note=User README available in HTML format from http://www.cs.utexas.edu/users/jfan/dm/README.html
+
|Last review by=BABA200
|Paid support=
+
|Last review date=2017/02/20
|IRC help=
 
|IRC general=
 
|IRC development=
 
|Related projects=
 
|Keywords=data mining,text mining,vector space model,bag of words,MC
 
|Is GNU=y
 
|Last review by=James Fan
 
|Last review date=2008-04-30
 
 
|Submitted by=Database conversion
 
|Submitted by=Database conversion
 
|Submitted date=2011-04-01
 
|Submitted date=2011-04-01
|Version identifier=2.19
+
|Status=
|Version date=2001-06-26
+
|Is GNU=No
|Version status=stable
+
|License verified date=2001-07-02
|Version download=http://www.cs.utexas.edu/users/jfan/dm/src/
+
}}
 +
{{Project license
 +
|License=GPLv2
 +
|License verified by=Janet Casey
 
|License verified date=2001-07-02
 
|License verified date=2001-07-02
|Version comment=2.19 stable released 2001-06-26
 
 
}}
 
}}
 
{{Person
 
{{Person
 +
|Real name=James Fan
 
|Role=Maintainer
 
|Role=Maintainer
|Real name=James Fan
 
 
|Email=jfan@cs.utexas.edu
 
|Email=jfan@cs.utexas.edu
 
|Resource URL=
 
|Resource URL=
}}
 
{{Resource
 
|Resource audience=Bug Tracking,Developer,Support
 
|Resource kind=E-mail
 
|Resource URL=mailto:jfan@cs.utexas.edu
 
 
}}
 
}}
 
{{Software category
 
{{Software category
 
|Database=administration
 
|Database=administration
 
|Interface=command-line
 
|Interface=command-line
 +
|Programming-language=C++
 
|Works-with=database
 
|Works-with=database
}}
 
{{Project license
 
|License=GPLv2
 
|License verified by=Janet Casey
 
|License verified date=2001-07-02
 
 
}}
 
}}
 
{{Software prerequisite
 
{{Software prerequisite
Line 61: Line 47:
 
|Prerequisite description=FLEX
 
|Prerequisite description=FLEX
 
}}
 
}}
 +
{{Featured}}

Revision as of 15:14, 20 February 2017



MC

https://web.archive.org/web/20090921192233/http://www.cs.utexas.edu/users/jfan/dm/
Converts text documents into a vector space model.

MC is a C++ program that creates vector-space models from text documents for use in text mining applications. MC provides an efficient multi-threaded implementation that can process very large document collections. The MC program:

1. Recursively descends directories, finding text files.
2. Processes files selectively through full regular expression matching of file names.
3. Builds a sparse matrix of word/token counts. The particular sparse matrix format used is given here.
4. Processes any user-specified text formats (such as email addresses or URLs) as whole tokens through regular expression matching or a FLEX definition.
5. Prunes the vocabulary by word length and frequency.
6. Excludes user-specified stop words.
7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, lfx scaling schemes.
8. Writes all data structures to disk in the Compressed Column Storage format.

The application does not have English parsing or part-of-speech tagging facilities, or complete documentation.
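
The counting and storage steps (3 and 8) can be illustrated with a minimal sketch. This is not MC's own code: the names CcsMatrix and build_ccs are hypothetical, plain whitespace splitting stands in for MC's FLEX-based tokenizer, and the choice of one column per document is an assumption made for the example. Pruning, stop words, and weighting are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a word-by-document count matrix held in
// Compressed Column Storage (CCS). Assumption: one column per document.
struct CcsMatrix {
    std::vector<std::string> vocabulary;  // row id -> word
    std::vector<double> values;           // nonzero counts, column by column
    std::vector<std::size_t> row_index;   // row id of each stored value
    std::vector<std::size_t> col_ptr;     // column j spans [col_ptr[j], col_ptr[j+1])
};

// Builds the matrix from in-memory documents. Tokens are split on
// whitespace only, which is a simplification for illustration.
CcsMatrix build_ccs(const std::vector<std::string>& documents) {
    // First pass: collect the vocabulary; std::map keeps it sorted.
    std::map<std::string, std::size_t> word_id;
    for (const std::string& doc : documents) {
        std::istringstream in(doc);
        std::string word;
        while (in >> word) word_id.emplace(word, 0);
    }

    CcsMatrix m;
    std::size_t next_id = 0;
    for (auto& entry : word_id) {
        entry.second = next_id++;  // assign row ids in sorted word order
        m.vocabulary.push_back(entry.first);
    }

    // Second pass: count words per document and append one column each.
    m.col_ptr.push_back(0);
    for (const std::string& doc : documents) {
        std::map<std::size_t, double> counts;  // ordered by row id
        std::istringstream in(doc);
        std::string word;
        while (in >> word) counts[word_id.at(word)] += 1.0;

        for (const auto& cell : counts) {
            m.row_index.push_back(cell.first);
            m.values.push_back(cell.second);
        }
        m.col_ptr.push_back(m.values.size());
    }

    // CCS invariant: one pointer per column plus a final end marker.
    assert(m.col_ptr.size() == documents.size() + 1);
    return m;
}
```

Only the nonzero counts are stored, so memory grows with the number of distinct words per document rather than with vocabulary size times collection size, which is what makes the format practical for large collections.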





Licensing

License  Verified by  Verified on  Notes

Verified by  Bendikker
Verified on  17 January 2019




Leaders and contributors

Contact(s)  Role
James Fan   Maintainer


Resources and communication

Software prerequisites

Kind               Description
Required to build  pthread library
Required to build  STL
Required to build  FLEX




Entry




















Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the page “GNU Free Documentation License”.

The copyright and license notices on this page only apply to the text on this page. Any software or copyright-licenses or other similar notices described in this text has its own copyright notice and license, which can usually be found in the distribution or license text itself.