Very quick note: Opera 11.60 is officially released today. Download it, try it, love it.
Posts written/thought up whilst travelling to and from work, some take 80 minutes and some don't.
06 December 2011
Opera 11.60
30 November 2011
WordWhacker V0.4
Station | Published | Actual |
Birmingham New Street | 07:10 | 07:10 |
London Euston | 08:30 | 08:32 |
Public sector workers are striking today over pensions. I expected the trains to be a disaster, but we left New Street on time and got into Euston only slightly late, so the day hasn’t started too badly at least.
Anyway, as previously covered, I have been trying to prove that there are some words and phrases that are equivalent to others in some way.
v0.4
Back from holiday and refreshed, I realised that to get better results I’d also need to look at combinations of words. For the results to make sense, though, I’d need to take account of the types of word going into the phrase: two adjectives in a row do not make sense, for instance, whereas an adjective followed by a noun does.
Some more searching brought me to WordNet, which has word lists broken down into four files, .adj, .adv, .noun and .verb, which unsurprisingly contain lists of adjectives, adverbs, nouns and verbs respectively.
I updated the application some more, changing the control of the application slightly to cope with the word lists and adding some protection so as not to process every possible combination of every word pair that exists, which could lead to huge files and massive computation times. The flow of control was updated to the following:
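As a rough illustration of that rule, the allowed pairings can be written down as a small lookup table. The class and enum names here are made up for the example; the two pairings (adjective then noun, verb then adverb) are the ones the application ends up generating.

```java
import java.util.EnumMap;
import java.util.Map;

class PhrasePatterns {
    enum WordType { ADJ, ADV, NOUN, VERB }

    // First word type -> the word type that is allowed to follow it.
    private static final Map<WordType, WordType> ALLOWED =
            new EnumMap<WordType, WordType>(WordType.class);
    static {
        ALLOWED.put(WordType.ADJ, WordType.NOUN);  // adjective followed by noun
        ALLOWED.put(WordType.VERB, WordType.ADV);  // verb followed by adverb
    }

    /** Returns true only for the pairings listed in ALLOWED. */
    static boolean isAllowed(WordType first, WordType second) {
        return ALLOWED.get(first) == second;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(WordType.ADJ, WordType.NOUN)); // true
        System.out.println(isAllowed(WordType.ADJ, WordType.ADJ));  // false
    }
}
```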
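To give a feel for how such a list can be boiled down to plain words, here is a minimal sketch. It assumes each useful line starts with the word followed by a space, and that phrases are joined with underscores; the sample lines are invented for the example and are not real WordNet data. In practice the lines would come from a BufferedReader over one of the four files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordListReader {
    /** Keeps the first space-separated token of every line that starts with a letter. */
    static List<String> extractWords(List<String> lines, boolean includePhrases) {
        List<String> words = new ArrayList<String>();
        for (String line : lines) {
            String word = line.split(" ")[0];
            if (!word.matches("^[a-zA-Z].*")) {
                continue; // header, indented or blank line
            }
            if (!includePhrases && word.contains("_")) {
                continue; // phrase such as far_and_away
            }
            words.add(word);
        }
        return words;
    }

    public static void main(String[] args) {
        // Invented sample lines, only for illustration.
        List<String> sample = Arrays.asList(
                "  1 some header text", "happy a 1", "far_and_away r 1", "", "joy n 2");
        System.out.println(extractWords(sample, false)); // [happy, joy]
        System.out.println(extractWords(sample, true));  // [happy, far_and_away, joy]
    }
}
```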
- Set the charset to use
- Set the text case, which could be:
  - Camel - i.e. capitalise the first letter of each word
  - Lower - i.e. all words are processed as being lower case
  - Both - creates word lists containing both lower and camel case
- Set whether phrases should be included - the source files contain phrases like a_good_deal, far_and_away or to_the_contrary; these can be excluded
- Set a list of words that should be included in the list of word pairs - for instance, to target happiness and joy, “happiness, joy” would be entered
- Generate the results
Here is the Java code to achieve the above:
package words;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.commons.csv.CSVPrinter;

/**
 * Class to generate numerical values for words and compare equivalence to other words.
 *
 * @author a
 */
public class WordWhackerV04 {

    public enum Charset { ASCII, UNICODE, POSITIONAL }

    public enum TextCase { CAMEL, LOWER, BOTH }

    public static Map<Character,Integer> letters = new HashMap<Character,Integer>();
    public static Map<Character,Integer> asciiletters = new HashMap<Character,Integer>();

    static {
        asciiletters.put('A', 65);asciiletters.put('B', 66);asciiletters.put('C', 67);
        asciiletters.put('D', 68);asciiletters.put('E', 69);asciiletters.put('F', 70);
        asciiletters.put('G', 71);asciiletters.put('H', 72);asciiletters.put('I', 73);
        asciiletters.put('J', 74);asciiletters.put('K', 75);asciiletters.put('L', 76);
        asciiletters.put('M', 77);asciiletters.put('N', 78);asciiletters.put('O', 79);
        asciiletters.put('P', 80);asciiletters.put('Q', 81);asciiletters.put('R', 82);
        asciiletters.put('S', 83);asciiletters.put('T', 84);asciiletters.put('U', 85);
        asciiletters.put('V', 86);asciiletters.put('W', 87);asciiletters.put('X', 88);
        asciiletters.put('Y', 89);asciiletters.put('Z', 90);
        asciiletters.put('a', 97); asciiletters.put('b', 98); asciiletters.put('c', 99);
        asciiletters.put('d', 100);asciiletters.put('e', 101);asciiletters.put('f', 102);
        asciiletters.put('g', 103);asciiletters.put('h', 104);asciiletters.put('i', 105);
        asciiletters.put('j', 106);asciiletters.put('k', 107);asciiletters.put('l', 108);
        asciiletters.put('m', 109);asciiletters.put('n', 110);asciiletters.put('o', 111);
        asciiletters.put('p', 112);asciiletters.put('q', 113);asciiletters.put('r', 114);
        asciiletters.put('s', 115);asciiletters.put('t', 116);asciiletters.put('u', 117);
        asciiletters.put('v', 118);asciiletters.put('w', 119);asciiletters.put('x', 120);
        asciiletters.put('y', 121);asciiletters.put('z', 122);
        asciiletters.put(' ', 32); // ASCII space is 32 decimal (0x20)

        letters.put('a', 1); letters.put('b', 2); letters.put('c', 3);
        letters.put('d', 4); letters.put('e', 5); letters.put('f', 6);
        letters.put('g', 7); letters.put('h', 8); letters.put('i', 9);
        letters.put('j', 10);letters.put('k', 11);letters.put('l', 12);
        letters.put('m', 13);letters.put('n', 14);letters.put('o', 15);
        letters.put('p', 16);letters.put('q', 17);letters.put('r', 18);
        letters.put('s', 19);letters.put('t', 20);letters.put('u', 21);
        letters.put('v', 22);letters.put('w', 23);letters.put('x', 24);
        letters.put('y', 25);letters.put('z', 26);letters.put(' ', 0);
    }

    public Map<String,Integer> adjs = new HashMap<String,Integer>();
    public Map<String,Integer> advs = new HashMap<String,Integer>();
    public Map<String,Integer> nouns = new HashMap<String,Integer>();
    public Map<String,Integer> verbs = new HashMap<String,Integer>();

    BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));

    public Charset useCharset = Charset.ASCII;
    private TextCase useTextCase = TextCase.BOTH;
    private boolean includePhrases = false;
    private static final int MAX_RESULTS = 1024;//Max rows in Excel is 65536
    private int resultCount = 0;

    /**
     * @param args a {@link java.lang.String}[] of program arguments
     */
    public static void main(String[] args) {
        WordWhackerV04 whacker = new WordWhackerV04();
        whacker.driveApp();
    }

    /**
     * Utility method to control the flow of the application
     */
    private void driveApp() {
        String input = "";
        try {
            this.changeCharset();
            this.changeTextCase();
            this.changeIncludePhrases();
            this.createWordlists();
            while(!"0".equals(input)) {
                System.out.println("1. Generate equivalences");
                System.out.println("0. Exit");
                input = stdin.readLine();
                if("1".equalsIgnoreCase(input)) {
                    this.generateEquivalence();
                }
            }
            System.exit(0);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set the charset for seeding
     */
    private void changeCharset() {
        String input = "";
        try {
            System.out.println("Please enter the ID of the charset to use [ASCII]");
            System.out.println("1. ASCII");
            System.out.println("2. Unicode");
            System.out.println("3. Positional");
            input = stdin.readLine();
            if("1".equalsIgnoreCase(input)) {
                this.useCharset = Charset.ASCII;
            } else if("2".equalsIgnoreCase(input)) {
                this.useCharset = Charset.UNICODE;
            } else if("3".equalsIgnoreCase(input)) {
                this.useCharset = Charset.POSITIONAL;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set whether phrases should be included
     */
    private void changeIncludePhrases() {
        String input = "";
        try {
            System.out.println("Please set whether to include phrases [No]");
            System.out.println("1. Yes");
            System.out.println("2. No");
            input = stdin.readLine();
            if("1".equalsIgnoreCase(input)) {
                this.includePhrases = true;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to choose which case of characters to use.
     */
    private void changeTextCase() {
        String input = "";
        try {
            System.out.println("Please set the text case to process to use [BOTH]");
            System.out.println("1. Camel");
            System.out.println("2. Lower");
            System.out.println("3. Both");
            input = stdin.readLine();
            if("1".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.CAMEL;
            } else if("2".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.LOWER;
            } else if("3".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.BOTH;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Read in each word list file from the dict directory and store it in a Map
     */
    private void createWordlists() {
        // File(parent, child) picks the right path separator for the platform
        File dictDir = new File(System.getProperty("user.dir"), "dict");
        if(dictDir.exists()) {
            File[] files = dictDir.listFiles(new FilenameFilter() {
                @Override
                public boolean accept(File dir, String name) {
                    // skip generated results and notes, keep the word lists
                    return !name.endsWith(".csv") && !name.endsWith(".txt");
                }
            });
            if(files != null) { // listFiles returns null if dict is not a directory
                for(File file : files) {
                    createStrippedList(file);
                }
            }
        }
    }

    /**
     * Method to generate the equivalent words and phrases
     */
    private void generateEquivalence() {
        String input = "";
        try {
            resultCount = 0; // start every run with a fresh result count
            System.out.println("Enter specific words separated by , (comma): ");
            input = stdin.readLine();
            List<String> specificWords = null;
            if(input.length() > 0) {
                String[] inSpecWords = input.split(",");
                specificWords = new ArrayList<String>(inSpecWords.length*2);
                for(String rawWord : inSpecWords) {
                    // "happiness, joy" leaves a leading space on the second word
                    String specWord = rawWord.trim();
                    if(specWord.isEmpty()) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        specificWords.add(specWord.toLowerCase());
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        specificWords.add(String.format("%s%s",
                                Character.toUpperCase(specWord.charAt(0)),
                                specWord.substring(1).toLowerCase()));
                    }
                }
            }
            System.out.println("Please enter the text to generate equivalence for: ");
            input = stdin.readLine();
            int stringValue = getWordValue(input);
            File outFile = new File(input+".csv");
            final FileWriter writer = new FileWriter(outFile);
            final CSVPrinter printer = new CSVPrinter(writer);
            try {
                this.saveMatchingWords(printer, stringValue);
                this.saveMatchingAdjNoun(printer, specificWords, stringValue);
                this.saveMatchingVerbAdv(printer, specificWords, stringValue);
            } catch(MaxResultsReachedException mrre) {
                System.out.println(mrre.getMessage());
            } finally {
                writer.close(); // release the output file
            }
            System.out.println(outFile.getName()+" file created");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method for saving the matching words to a given value of source word
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingWords(CSVPrinter printer, int stringValue)
            throws MaxResultsReachedException {
        this.appendRowData(printer,getKeysByValue(adjs,stringValue),"adj",stringValue);
        this.appendRowData(printer,getKeysByValue(advs,stringValue),"adv",stringValue);
        this.appendRowData(printer,getKeysByValue(nouns,stringValue),"noun",stringValue);
        this.appendRowData(printer,getKeysByValue(verbs,stringValue),"verb",stringValue);
    }

    /**
     * Method to generate Adjective-Noun pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matches if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingAdjNoun(CSVPrinter printer, List<String> specificWords,
            int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,adjs,nouns,specificWords,stringValue,"adj-noun");
    }

    /**
     * Method to generate Verb-Adverb pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matches if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingVerbAdv(CSVPrinter printer, List<String> specificWords,
            int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,verbs,advs,specificWords,stringValue,"verb-adv");
    }

    /**
     * Method to save the matching words
     *
     * @param printer pointer to the CSV file to write to
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void saveMatchingPair(CSVPrinter printer, Map<String, Integer> map1,
            Map<String, Integer> map2, List<String> specificWords, int stringValue,
            String type) throws MaxResultsReachedException {
        if(specificWords != null && specificWords.size()>0) {
            for(String specificWord : specificWords) {
                if(map1.containsKey(specificWord)) {
                    // pair the specific word (first position) with every word in map2
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map1.get(specificWord));
                    processWordPairs(printer, stringValue,tmpMap,map2,type);
                }
                if(map2.containsKey(specificWord)) {
                    // pair every word in map1 with the specific word (second position)
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map2.get(specificWord));
                    processWordPairs(printer, stringValue,map1,tmpMap,type);
                }
            }
        } else {
            processWordPairs(printer, stringValue,map1,map2,type);
        }
    }

    /**
     * Method for processing word pairs
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void processWordPairs(CSVPrinter printer, int stringValue,
            Map<String, Integer> map1, Map<String, Integer> map2, String type)
            throws MaxResultsReachedException {
        Set<Map.Entry<String, Integer>> map1Vals = map1.entrySet();
        for(Map.Entry<String, Integer> entry : map1Vals) {
            if(entry.getValue() < stringValue) { //only process if less than
                // the second word has to make up the remainder exactly
                int remVal = stringValue - entry.getValue();
                Set<String> map2Vals = getKeysByValue(map2,remVal);
                for(String map2Val : map2Vals) {
                    appendRowData(printer, entry.getKey() + " " + map2Val,
                            type,entry.getValue()+remVal);
                }
            }
        }
    }

    /**
     * Iterates through a set of matches, writing each as a row in the results csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 set of values to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, Set<String> col1, String col2,
            int stringValue) throws MaxResultsReachedException {
        for(String value : col1) {
            appendRowData(printer, value, col2, stringValue);
        }
    }

    /**
     * Writes the result as a row in the output csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 result to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, String col1, String col2,
            int stringValue) throws MaxResultsReachedException {
        if(resultCount<MAX_RESULTS) {
            printer.println(new String[]{col1,col2,Integer.toString(stringValue)});
            resultCount++;
        } else {
            throw new MaxResultsReachedException("Maximum number of results reached");
        }
    }

    /**
     * Utility method to get all the matching Keys of a Map by the given value
     *
     * @param <K> the key object
     * @param <V> the value object
     * @param map a Map to search through
     * @param value the value to search for
     *
     * @return a set of keys which match the given value
     */
    private <K, V> Set<K> getKeysByValue(Map<K, V> map, V value) {
        Set<K> keys = new HashSet<K>();
        for (Entry<K, V> entry : map.entrySet()) {
            if (entry.getValue().equals(value)) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    /**
     * Method to strip the source word list according to the options entered by the user.
     * This will take out phrases and populate the maps with word numeric values.
     *
     * @param file a link to the source word list file.
     */
    private void createStrippedList(File file) {
        BufferedReader bufferedStream = null;
        try {
            Map<String, Integer> mapPointer = this.getWordMap(file);
            if(mapPointer == null) {
                return; // not one of the .adj/.adv/.noun/.verb files
            }
            bufferedStream = new BufferedReader(
                    new InputStreamReader(
                            new FileInputStream(file)));
            String line = "";
            while((line = bufferedStream.readLine()) != null) {
                String word = getWord(line);
                if(word.matches("^[a-zA-Z].*")) {
                    if(!this.includePhrases && word.contains("_")) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        word = word.toLowerCase();
                        mapPointer.put(word, this.getWordValue(word));
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        word = String.format("%s%s",Character.toUpperCase(word.charAt(0)),
                                word.substring(1).toLowerCase());
                        mapPointer.put(word, this.getWordValue(word));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if(bufferedStream != null) {
                try {
                    bufferedStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Method to get the actual word from a source file, this extracts only
     * the pertinent bit from the source line
     *
     * @param sentence a String sentence to process
     *
     * @return the first word in the sentence.
     */
    private static String getWord(String sentence) {
        String[] items = sentence.split(" ");
        return items[0];
    }

    /**
     * Method to get a Map of Strings to equivalent values based on a given source file
     *
     * @param file a pointer to a source word list file
     *
     * @return a Map of words to numeric values, or null for an unknown file type.
     */
    private Map<String, Integer> getWordMap(File file) {
        Map<String, Integer> tmpMap = null;
        if(file.getName().endsWith(".adj")) {
            tmpMap=this.adjs;
        } else if(file.getName().endsWith(".adv")) {
            tmpMap=this.advs;
        } else if(file.getName().endsWith(".noun")) {
            tmpMap=this.nouns;
        } else if(file.getName().endsWith(".verb")) {
            tmpMap=this.verbs;
        }
        return tmpMap;
    }

    /**
     * Method to return the numeric value for a given word
     *
     * @param word a {@link java.lang.String} containing the word
     * @return an int representing the word's numeric value
     */
    private int getWordValue(String word) {
        int returnable = 0;
        char[] chars = word.toCharArray();
        for(char theChar : chars) {
            Integer charValue = null;
            switch(useCharset) {
                case ASCII:
                    charValue = asciiletters.get(theChar);
                    break;
                case UNICODE:
                    charValue = Character.getNumericValue(theChar);
                    break;
                case POSITIONAL:
                    charValue = letters.get(Character.toLowerCase(theChar));
                    break;
                default:
                    break;
            }
            if(charValue != null) { // characters missing from the map add nothing
                returnable = returnable + charValue;
            }
        }
        return returnable;
    }

    /**
     * Exception defined as inner class
     */
    private class MaxResultsReachedException extends Exception {
        private static final long serialVersionUID = 1L;

        public MaxResultsReachedException(String message) {
            super(message);
        }
    }
}
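The pair search in the listing above calls getKeysByValue for each first word, which walks the whole second map every time. One possible alternative, sketched here and not part of the application, is to group the second map by value once so that each remainder lookup is a single map access. The class and method names are invented for the example and it assumes Java 8 or later; the word values are the ASCII sums of the words shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairFinder {
    /** Groups words by numeric value: value -> all words with that value. */
    static Map<Integer, List<String>> indexByValue(Map<String, Integer> wordValues) {
        Map<Integer, List<String>> index = new HashMap<>();
        for (Map.Entry<String, Integer> e : wordValues.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return index;
    }

    /** Finds "first second" pairs whose values add up to target. */
    static List<String> findPairs(Map<String, Integer> first,
                                  Map<Integer, List<String>> secondIndex, int target) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : first.entrySet()) {
            int remainder = target - e.getValue();
            if (remainder <= 0) {
                continue; // first word already uses up the whole value
            }
            for (String second : secondIndex.getOrDefault(remainder, Collections.<String>emptyList())) {
                pairs.add(e.getKey() + " " + second);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Integer> adjs = new TreeMap<>();   // ASCII sums
        adjs.put("Lovely", 635);
        adjs.put("Downy", 529);
        Map<String, Integer> nouns = new TreeMap<>();
        nouns.put("Wax", 304);
        nouns.put("Nest", 410);
        // 939 is the ASCII sum of "Happiness"
        System.out.println(findPairs(adjs, indexByValue(nouns), 939));
    }
}
```

The trade-off is one extra map held in memory per word type in exchange for not rescanning the second list on every iteration.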
The flow through the application prompts the user to enter their choices and then generates the output file. In the example below we can see that more than 1024 possibilities (the value of MAX_RESULTS) would have been generated:
Please enter the ID of the charset to use [ASCII]
1. ASCII
2. Unicode
3. Positional
1
Please set the text case to process to use [BOTH]
1. Camel
2. Lower
3. Both
1
Please set whether to include phrases [No]
1. Yes
2. No
2
1. Generate equivalences
0. Exit
1
Enter specific words that should appear separated by , (comma):
Please enter the text to generate equivalence for:
Happiness
The maximum number of results has been reached
Happiness.csv file created
1. Generate equivalences
0. Exit
0
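A cap like MAX_RESULTS is easiest to reason about when the counter belongs to a single run, so that choosing "Generate equivalences" a second time starts counting from zero again. The following is only a sketch of that idea; ResultLimiter is a made-up helper name, not something from the application.

```java
class ResultLimiter {
    private final int maxResults;
    private int count = 0;

    ResultLimiter(int maxResults) {
        this.maxResults = maxResults;
    }

    /** Counts the row and returns true while under the cap; returns false once it is reached. */
    boolean tryAccept() {
        if (count >= maxResults) {
            return false;
        }
        count++;
        return true;
    }

    /** Call at the start of each run so earlier runs do not eat into the cap. */
    void reset() {
        count = 0;
    }

    int count() {
        return count;
    }

    public static void main(String[] args) {
        ResultLimiter limiter = new ResultLimiter(2);
        System.out.println(limiter.tryAccept()); // true
        System.out.println(limiter.tryAccept()); // true
        System.out.println(limiter.tryAccept()); // false
        limiter.reset();
        System.out.println(limiter.tryAccept()); // true
    }
}
```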
Generating on Happiness gives many options for output. Most are taken up by single matching words, but the result set does cover some pairs. Here is an example of the output CSV file:
Tenacious adj 939
Unlimited adj 939
Wished-for adj 939
Gainfully adv 939
Excitedly adv 939
Certainly adv 939
Orchestra noun 939
Whitetail noun 939
Foresight noun 939
Implement verb 939
Orientate verb 939
Recapture verb 939
Lovely Wax adj-noun 939
Downy Nest adj-noun 939
Fit Snoopy adj-noun 939
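The 939 in the last column can be checked by hand: it is the sum of the ASCII codes of the letters. A small stand-alone check, assuming only letters are counted (so the hyphen in Wished-for and the space between the two words of a pair add nothing); the class name is made up for the example:

```java
class AsciiSumDemo {
    /** Adds up the ASCII codes of the letters A-Z and a-z, ignoring everything else. */
    static int asciiSum(String text) {
        int sum = 0;
        for (char c : text.toCharArray()) {
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) {
                sum += c;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // H72 + a97 + p112 + p112 + i105 + n110 + e101 + s115 + s115
        System.out.println(asciiSum("Happiness"));                // 939
        System.out.println(asciiSum("Wished-for"));               // 939
        System.out.println(asciiSum("Lovely") + asciiSum("Wax")); // 635 + 304 = 939
    }
}
```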
During my investigations I have found some very funny combinations, as profanities and negative words were not removed from the dictionaries. So, while I have been able to prove that my friend’s company is “Perpetual Happiness” (positional), “Righteous Happiness” (ASCII), and “Phenomenal Happiness” (Unicode), I have also seen some results that are far from complimentary.
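The three schemes give quite different totals for the same word, which is why each one throws up a different match. A short comparison, restricted to plain Latin letters and using a made-up class name, is sketched below. One thing worth knowing is that the Unicode option relies on Character.getNumericValue, which for the letters A-Z in either case returns 10 to 35 and is not the character's code point.

```java
class CharsetComparison {
    private static boolean isLatinLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    /** Sum of ASCII codes, e.g. 'H' = 72, 'a' = 97. */
    static int asciiValue(String word) {
        int sum = 0;
        for (char c : word.toCharArray()) {
            if (isLatinLetter(c)) sum += c;
        }
        return sum;
    }

    /** Sum of alphabet positions, a = 1 ... z = 26, case ignored. */
    static int positionalValue(String word) {
        int sum = 0;
        for (char c : word.toCharArray()) {
            if (isLatinLetter(c)) sum += Character.toLowerCase(c) - 'a' + 1;
        }
        return sum;
    }

    /** Sum of Character.getNumericValue, a/A = 10 ... z/Z = 35. */
    static int numericValue(String word) {
        int sum = 0;
        for (char c : word.toCharArray()) {
            if (isLatinLetter(c)) sum += Character.getNumericValue(c);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(asciiValue("Happiness"));      // 939
        System.out.println(positionalValue("Happiness")); // 107
        System.out.println(numericValue("Happiness"));    // 188
    }
}
```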
At this point I have stopped developing the script as it has achieved what I wanted it to. There are four builds I have in mind for the future, all with a learning aim:
- Add the code to a git repo and push it to GitHub
- Make the processing distributed using Hadoop or similar; this should enable me to create multiple-word sentences
- Turn this into a webapp using Spring so that I can learn more about this framework
- Move the Hadoop version of the webapp into the cloud using Amazon/Cloud Foundry or similar
Who knows, if I get another 80 minutes of free time I might implement them.