PHP Classes

PHP Word Frequency Analysis: Extract text frequent terms of two or more words

Ratings: 55%
Unique user downloads: Total: 488, This week: 1
Download rankings: All time: 5,890, This week: 560 (up)
Version: frequency-analyzer 1
License: GNU General Publi...
PHP version: 5
Categories: Algorithms, PHP 5, Text processing
Description 

Author

This class can extract frequent terms of two or more words from a text.

It can parse a given text and extract terms made of individual words or multiple words.

A list of words to exclude can also be given.
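
The package documentation describes the exclusion list as being applied to both sides of a term, in the manner of trim(). The following is a minimal sketch of that idea, not the class's actual code; the function name trimExcludedWordsSketch and the sample words are hypothetical.

```php
<?php
// Hypothetical helper, not part of frequentTermsAnalyzer: strips excluded
// words from both ends of a term, the way trim() strips characters from
// both ends of a string. Excluded words in the middle are kept.
function trimExcludedWordsSketch(array $termWords, array $excludedWords)
{
    // Flip to word => index so membership is a cheap isset() check.
    $excluded = array_flip($excludedWords);

    // Drop excluded words from the left side.
    while ($termWords && isset($excluded[reset($termWords)])) {
        array_shift($termWords);
    }
    // Drop excluded words from the right side.
    while ($termWords && isset($excluded[end($termWords)])) {
        array_pop($termWords);
    }
    return $termWords;
}

// Prints "stock of the market"
echo implode(' ', trimExcludedWordsSketch(
    array('the', 'stock', 'of', 'the', 'market', 'is'),
    array('the', 'of', 'is')
)), "\n";
```

Only the leading "the" and the trailing "is" are removed here; the inner "of the" survives because it sits between two words that are not excluded.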

Innovation Award
PHP Programming Innovation award nominee
October 2014
Number 8


Prize: One copy of the Zend Studio
Analyzing how often words appear in a text is a useful way to determine the most relevant topics that the text covers.

This class provides an interesting solution for computing the frequency of expressions made of one or more words in a text document.

Manuel Lemos
Name: Alejandro Mitrou <contact>
Classes: 2 packages by
Country: Argentina
Age: 50
All time rank: 286238 in Argentina
Week rank: 416 (up), 3 in Argentina (up)
Innovation award
Nominee: 2x

Example

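<?php

include('frequentTermsAnalyzer.php');

// Build a general corpus from several unrelated topics. It is only used to
// learn which single words are too common to be meaningful.
// A space is appended after each file so that the last word of one file
// does not merge with the first word of the next.
$generalTextFile  = file_get_contents('data/wikipedia_new_york_city.txt') . ' ';
$generalTextFile .= file_get_contents('data/wikipedia_social_media.txt') . ' ';
$generalTextFile .= file_get_contents('data/wikipedia_personal_finance.txt') . ' ';
$generalTextFile .= file_get_contents('data/wikipedia_barbicue.txt');

$candidates = explode(' ', vacuumCleaner($generalTextFile));
$analyzer = new frequentTermsAnalyzer($candidates);

// The frequent single words become the exclusion list for the next step.
$excludedWords = $analyzer->getFrequentWords();
#print_r($excludedWords);

// Analyze one particular text, excluding the common words found above.
$dataFile = 'data/wikipedia_personal_finance.txt';
$particularTextFile = file_get_contents($dataFile);

$candidates = explode(' ', vacuumCleaner($particularTextFile));
$analyzer = new frequentTermsAnalyzer($candidates, $excludedWords);
$compoundWords = $analyzer->getCompoundTerms();

print "Processing data file: " . $dataFile . "\n";
print_r($compoundWords);

// Normalizes a text to lowercase words separated by single spaces.
function vacuumCleaner($str){
    $str = strtolower($str);
    // Turn line breaks and tabs into spaces first, so that words on adjacent
    // lines are not glued together when other characters are removed.
    $str = preg_replace('/\s+/', ' ', $str);
    $str = preg_replace('/[^a-z ]/', '', $str);
    // Removing characters can leave double spaces; collapse them and trim the ends.
    return trim(preg_replace('/ +/', ' ', $str));
}
?>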
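
The documentation for getFrequentWords() gives a default threshold of 0.01, meaning a word must make up at least 1% of the text to count as frequent. The snippet below is a minimal sketch of that kind of relative-frequency filter, written only to illustrate the idea. The function name frequentWordsSketch is hypothetical and the class itself may compute this differently.

```php
<?php
// Hypothetical helper, not part of frequentTermsAnalyzer: returns the words
// whose share of all tokens is at least $threshold (0.01 means 1%).
function frequentWordsSketch(array $words, $threshold = 0.01)
{
    $total = count($words);
    if ($total === 0) {
        return array();
    }

    $frequent = array();
    // array_count_values() maps each distinct word to its number of occurrences.
    foreach (array_count_values($words) as $word => $count) {
        if ($count / $total >= $threshold) {
            $frequent[] = (string) $word;
        }
    }
    return $frequent;
}

$tokens = explode(' ', 'the cat and the dog and the bird');
print_r(frequentWordsSketch($tokens, 0.25));
```

With eight tokens and a threshold of 0.25, only "the" (3 of 8) and "and" (2 of 8) qualify. On real text a much lower threshold, such as the documented 0.01, is more realistic, and it usually needs tuning to the size of the corpus.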


Details

Frequent Terms Analyzer

This simple library discovers terms composed of two or more words that appear a significant number of times within a given text.

About the sample script

The provided sample script discovers compound terms, given an occurrence probability threshold. It also removes grammar elements such as prepositions, articles, verbs and isolated letters from the beginning and end of each term.

Please take into consideration that, in order to improve its accuracy, the library should be trained with more sample data covering as many diverse topics as possible. This would improve the identification of common words that are generally unspecific and not part of a compound noun (e.g. has, been, a, the, this).

All training texts were taken from Wikipedia because they are openly licensed.

Constructor and public methods

1. public function __construct(&$termsArray, &$excludedWords = array())

Receives the array containing each word from the text as an element. It can also receive a list of words to exclude from both sides of each term, much as the trim() function strips characters from both ends of a string.

2. public function getFrequentWords($threshold = 0.01)

Obtains a list of frequent single-word terms. By default, a word must account for at least 1% of the text to be considered frequent. You might need to fine-tune the threshold value according to your available data.

3. public function getCompoundTerms($threshold = 0.001)

Obtains a list of compound terms, each with as many words as the library is able to find occurring in at least 0.1% of the text (the default threshold). Once again, you might need to fine-tune the threshold value according to your available data.

Further development

This first version solves the proposed problem and it's good enough to serve my current needs with an acceptable execution time. That said, there are at least a couple of things to improve:

1. Even though I pass the big data structures by reference, the algorithm uses a huge amount of memory compared to the original data size. This is mainly caused by the use of arrays to hold each word as an element, which in PHP takes quite a lot of memory. A better solution could be to traverse a source string instead of storing single words in an array.

2. Once two-word common terms have been discovered, the method analysis1() traverses the original data once again in search of three-word common terms, without considering previous results. Initially I planned to use the collected information incrementally during the process, but this will require a future release to be completed.

Please feel free to improve this small library as much as you like.

License

This development is released under the GPL license model. If it’s useful to you, just use it, leaving a link to Alejandro Mitrou [www.WiseTonic.com] in your acknowledgement page and/or within your documentation.

This software is provided as is, without warranty of any kind, express or implied.
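
The first item under Further development suggests traversing a source string instead of storing every word in an array. The following is a rough sketch of that direction, under the assumption that the text has already been cleaned to lowercase words separated by spaces. It walks the string with strtok() and counts two-word terms whose edges are not excluded words. The function name countTwoWordTermsSketch is hypothetical and this is not how the class is implemented.

```php
<?php
// Hypothetical sketch, not the library's code: counts two-word terms by
// walking the cleaned string with strtok() instead of exploding it into
// an array of words first.
function countTwoWordTermsSketch($text, array $excludedWords = array())
{
    // Flip to word => index so membership is a cheap isset() check.
    $excluded = array_flip($excludedWords);
    $counts = array();
    $previous = null;

    $word = strtok($text, ' ');
    while ($word !== false) {
        // A term may neither start nor end with an excluded word.
        if ($previous !== null
            && !isset($excluded[$previous])
            && !isset($excluded[$word])) {
            $term = $previous . ' ' . $word;
            $counts[$term] = isset($counts[$term]) ? $counts[$term] + 1 : 1;
        }
        $previous = $word;
        $word = strtok(' ');
    }

    arsort($counts); // most frequent terms first
    return $counts;
}

$text = 'new york city is big and new york city is loud';
print_r(countTwoWordTermsSketch($text, array('is', 'and')));
```

Only the current and previous word are held at any time, so the word array disappears, although the table of counted terms still grows with the number of distinct pairs. Dividing each count by the total number of words would give a relative frequency that can be compared against a threshold such as 0.001.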

Screenshots
  • sampleImage.png

Files
File                              Role      Description           Access
data (4 files)
frequentTermsAnalyzer.php         Class     Class file
README.frequentTermsAnalyzer      Doc.      Brief documentation   Without login
testInTextFile.php                Example   Test script           Without login

Files / data
File                              Role      Description           Access
wikipedia_barbicue.txt            Data      Data file             Without login
wikipedia_new_york_city.txt       Data      Data file             Without login
wikipedia_personal_finance.txt    Data      Data file             Without login
wikipedia_social_media.txt        Data      Data file             Without login

User Ratings (all time)
Utility: 75%
Consistency: 66%
Documentation: 58%
Examples: 58%
Tests: -
Videos: -
Overall: 55%
Rank: 1869