I recently released a sentiment analysis PHP class under the GPL licence. It analyses the sentiment of text and also matches text against previously analysed phrases that are positive, negative or neutral. In other words, it learns from your input and becomes more accurate over time. Below I outline the general concept.
Simple example
    include('sentiment_analyser.class.php');

    $sa = new SentimentAnalysis();
    $sa->initialize();
    $sa->analyse("Thank you. This was the best customer service I have ever received.");
    $score = $sa->return_sentiment_rating();
    var_dump($score);
Concept
This class serves three purposes:
- Estimate the sentiment for a string based on emotion words, booster words, emoticons and polarity changers
- Allow you to save analysed data into positive, negative or neutral datasets
- Identify if we have any phrase matches on previously analysed positive, negative and neutral phrases
Should there be any high-quality phrase matches, they take precedence over the sentiment analysis, and the phrase match rating is returned instead.
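The precedence rule can be pictured roughly as follows. This is only a sketch and not the class's actual code: the function name `sketch_final_rating`, the shape of the match arrays, the string labels and the 90% "high quality" cut-off are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: prefer a strong phrase match over the word-based rating.
// $phraseMatches is assumed to be a list of ['label' => ..., 'similarity' => ...].
function sketch_final_rating(string $analysisRating, array $phraseMatches, float $highQuality = 90.0): string
{
    $best = null;
    foreach ($phraseMatches as $match) {
        $isHighQuality = $match['similarity'] >= $highQuality;
        if ($isHighQuality && ($best === null || $match['similarity'] > $best['similarity'])) {
            $best = $match; // keep the strongest qualifying match
        }
    }
    // Fall back to the sentiment analysis result when nothing qualifies.
    return $best !== null ? $best['label'] : $analysisRating;
}
```

With this sketch, a text rated "negative" by the word analysis would still come back as "positive" if it closely matched a stored positive phrase.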
Sentiment Analysis
Strings are broken into tokenised arrays of single words. These words are analysed against TXT files that contain emotion words with ratings, emoticons with ratings, booster words with ratings and possible polarity changers.
A score is then calculated based on this analysis, and this forms the “Sentiment analysis score”.
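To make the idea concrete, here is a minimal sketch of word-based scoring. It is not the class's real formula, which this post does not spell out. The lexicons are small hypothetical arrays instead of TXT files, and I assume that a booster strengthens the next emotion word while a polarity changer flips its sign.

```php
<?php
// Hypothetical in-memory lexicons; the real class reads its ratings from TXT files.
$emotionWords     = ['thank' => 2, 'best' => 3, 'good' => 2, 'terrible' => -3];
$boosterWords     = ['very' => 1, 'really' => 1];
$emoticons        = [':)' => 2, ':(' => -2];
$polarityChangers = ['not', 'never'];

function sketch_sentiment_score(string $text, array $emotion, array $boosters, array $emoticons, array $negators): int
{
    // Split on whitespace so emoticons such as ":(" survive as tokens.
    $tokens = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $score = 0;
    $boost = 0;     // pending booster strength
    $flip  = false; // pending polarity change

    foreach ($tokens as $raw) {
        if (isset($emoticons[$raw])) {
            $score += $emoticons[$raw];
            continue;
        }
        $word = trim($raw, ".,!?;:\"()"); // strip surrounding punctuation
        if (in_array($word, $negators, true)) {
            $flip = true;
            continue;
        }
        if (isset($boosters[$word])) {
            $boost += $boosters[$word];
            continue;
        }
        if (isset($emotion[$word])) {
            $rating = $emotion[$word];
            // Assumed rule: a booster pushes the rating further from zero.
            $rating += $rating > 0 ? $boost : -$boost;
            if ($flip) {
                $rating = -$rating;
            }
            $score += $rating;
        }
        // Modifiers only apply to the word that directly follows them.
        $boost = 0;
        $flip  = false;
    }
    return $score;
}

// "not very good" contributes -(2 + 1) and ":(" contributes -2.
echo sketch_sentiment_score("This was not very good :(", $emotionWords, $boosterWords, $emoticons, $polarityChangers), "\n"; // prints -5
```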
Phrase Analysis
This function is key to identifying whether the phrase in question can be compared to phrases that we have analysed and stored before. It uses Levenshtein distance to measure how far phrases of 4 to 10 words are from the phrases already in the dataset. We also use PHP’s similar_text as a second check on proximity.
This means that the more phrases we have analysed previously, the better the dataset becomes, and the more accurately new phrases can be scored against historical data.
- The phrase is broken up into n-grams of each length
- The array is reverse sorted so we compare 10-word phrases first, then 9, and so on
- Phrases are matched against positive, negative and neutral phrases in the relevant TXT files
- Only matches that meet the minimum levenshtein_min_distance and similiarity_min_distance are kept
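The steps above can be sketched as follows. This is an illustration under assumptions, not the class's own implementation. The function names, the in-memory dataset and the two threshold parameters (`$maxDistance`, `$minSimilarity`) are hypothetical stand-ins, and I assume a match must stay within a maximum Levenshtein distance and reach a minimum similar_text percentage.

```php
<?php
// Build word n-grams from longest to shortest (10 words down to 4 by default).
function sketch_ngrams(string $text, int $min = 4, int $max = 10): array
{
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);
    $grams = [];
    for ($n = min($max, $count); $n >= $min; $n--) {
        for ($i = 0; $i + $n <= $count; $i++) {
            $grams[] = implode(' ', array_slice($words, $i, $n));
        }
    }
    return $grams;
}

// Compare every n-gram with stored phrases and keep only the close ones.
// $dataset is assumed to look like ['positive' => [...], 'negative' => [...], 'neutral' => [...]].
function sketch_phrase_matches(string $text, array $dataset, int $maxDistance = 3, float $minSimilarity = 85.0): array
{
    $matches = [];
    foreach (sketch_ngrams($text) as $gram) {
        foreach ($dataset as $label => $phrases) {
            foreach ($phrases as $phrase) {
                // Note: before PHP 8.0, levenshtein() rejects strings longer than 255 characters.
                $distance = levenshtein($gram, $phrase);
                similar_text($gram, $phrase, $percent); // $percent is set by reference
                if ($distance <= $maxDistance && $percent >= $minSimilarity) {
                    $matches[] = [
                        'label'      => $label,
                        'ngram'      => $gram,
                        'phrase'     => $phrase,
                        'distance'   => $distance,
                        'similarity' => $percent,
                    ];
                }
            }
        }
    }
    return $matches;
}

// Hypothetical stored phrases; the class keeps these in TXT files.
$dataset = [
    'positive' => ['the best customer service i have ever received'],
    'negative' => ['worst service i have ever had'],
];
$found = sketch_phrase_matches('This was the best customer service I have ever recieved', $dataset);
```

Here the misspelt "recieved" still matches the stored positive phrase, because the 8-word n-gram is only two character edits away from it.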