06 December 2011

Opera 11.60

Very quick note: Opera 11.60 is officially released today. Download it, try it, love it.

30 November 2011

WordWhacker V0.4

Station                Published  Actual
Birmingham New Street  07:10      07:10
London Euston          08:30      08:32

Public sector workers are striking today over pensions. I expected the trains to be a disaster, but we left New Street on time and got into Euston only slightly late, so the day hasn’t started too badly at least.

Anyway, as previously covered, I have been trying to prove that there are some words and phrases that are equivalent to others in some way.

v0.4

Back from holiday and refreshed, I realised that to get better results I’d also need to look at combinations of words. For the results to make sense, though, I’d need to take account of the types of word going into the phrase: for instance, two adjectives do not make sense together, whereas an adjective followed by a noun does.

Some more searching brought me to WordNet, which has word lists broken down into four files, .adj, .adv, .noun and .verb, which unsurprisingly contain lists of adjectives, adverbs, nouns and verbs respectively.
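As a small illustration of how the four extensions can drive the rest of the processing, here is a minimal sketch that maps a file name to the kind of word it holds. The WordType enum and the file names in the usage note are hypothetical and are not part of the application; the only thing taken from the post is the set of four extensions.

```java
/** Hypothetical helper: maps a word-list file name to the kind of word it holds. */
enum WordType {
    ADJECTIVE(".adj"), ADVERB(".adv"), NOUN(".noun"), VERB(".verb");

    private final String extension;

    WordType(String extension) {
        this.extension = extension;
    }

    /** Returns the matching type, or null when the extension is not recognised. */
    static WordType fromFileName(String fileName) {
        for (WordType type : values()) {
            if (fileName.endsWith(type.extension)) {
                return type;
            }
        }
        return null;
    }
}
```

With this sketch, WordType.fromFileName("index.noun") would give NOUN, while an unrelated name such as "notes.txt" gives null, so the caller can skip files it does not understand.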

I updated the application some more, changing its control flow slightly to cope with the word lists and adding some protection so as not to process every possible combination of every word pair that exists, which could lead to huge files and massive computation times. The flow of control was updated to the following:

  1. Set the charset to use
  2. Set the text case, could be:
    1. Camel - i.e. capitalise the first letter of each word
    2. Lower - i.e. all words are processed as being lower case
    3. Both - creates word lists containing both lower and camel case
  3. Set whether phrases should be included - the source files contain phrases like a_good_deal, far_and_away or to_the_contrary, these can be excluded
  4. Set a list of words that should be included in the list of word pairs - for instance to target on happiness and joy “happiness, joy” would be entered
  5. Generate the results
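
Steps 2 and 3 can be looked at in isolation before reading the whole program. The following is only a sketch under my own naming (WordVariantsSketch and variants are hypothetical); it assumes, as the full listing does, that "camel" means capitalising just the first letter of an entry and that phrases are recognised by the underscore in them.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of steps 2 and 3: text-case variants and phrase filtering. */
final class WordVariantsSketch {

    enum TextCase { CAMEL, LOWER, BOTH }

    /** Returns the forms of a raw dictionary entry that would go on to be scored. */
    static List<String> variants(String raw, TextCase textCase, boolean includePhrases) {
        List<String> out = new ArrayList<String>();
        // Entries such as a_good_deal are phrases; skip them unless asked for.
        if (raw.isEmpty() || (!includePhrases && raw.contains("_"))) {
            return out;
        }
        String lower = raw.toLowerCase();
        if (textCase == TextCase.LOWER || textCase == TextCase.BOTH) {
            out.add(lower);
        }
        if (textCase == TextCase.CAMEL || textCase == TextCase.BOTH) {
            // Only the first letter of the entry is capitalised.
            out.add(Character.toUpperCase(lower.charAt(0)) + lower.substring(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("happiness", TextCase.BOTH, false));
        System.out.println(variants("a_good_deal", TextCase.LOWER, false));
    }
}
```

Choosing Both doubles the size of each word list, which is worth bearing in mind alongside the limit on the number of results.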

Here is the Java code to achieve the above:

package words;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.commons.csv.CSVPrinter;

/**
 * Class to generate numerical values for words and compare equivalence to other words.
 *
 * @author a
 */
public class WordWhackerV04 {

    public enum Charset {
        ASCII, UNICODE, POSITIONAL
    }

    public enum TextCase {
        CAMEL, LOWER, BOTH
    }

    public static Map<Character,Integer> letters = new HashMap<Character,Integer>();
    public static Map<Character,Integer> asciiletters = new HashMap<Character,Integer>();
    static {
        asciiletters.put('A', 65);asciiletters.put('B', 66);asciiletters.put('C', 67);
        asciiletters.put('D', 68);asciiletters.put('E', 69);asciiletters.put('F', 70);
        asciiletters.put('G', 71);asciiletters.put('H', 72);asciiletters.put('I', 73);
        asciiletters.put('J', 74);asciiletters.put('K', 75);asciiletters.put('L', 76);
        asciiletters.put('M', 77);asciiletters.put('N', 78);asciiletters.put('O', 79);
        asciiletters.put('P', 80);asciiletters.put('Q', 81);asciiletters.put('R', 82);
        asciiletters.put('S', 83);asciiletters.put('T', 84);asciiletters.put('U', 85);
        asciiletters.put('V', 86);asciiletters.put('W', 87);asciiletters.put('X', 88);
        asciiletters.put('Y', 89);asciiletters.put('Z', 90);

        asciiletters.put('a', 97); asciiletters.put('b', 98); asciiletters.put('c', 99);
        asciiletters.put('d', 100);asciiletters.put('e', 101);asciiletters.put('f', 102);
        asciiletters.put('g', 103);asciiletters.put('h', 104);asciiletters.put('i', 105);
        asciiletters.put('j', 106);asciiletters.put('k', 107);asciiletters.put('l', 108);
        asciiletters.put('m', 109);asciiletters.put('n', 110);asciiletters.put('o', 111);
        asciiletters.put('p', 112);asciiletters.put('q', 113);asciiletters.put('r', 114);
        asciiletters.put('s', 115);asciiletters.put('t', 116);asciiletters.put('u', 117);
        asciiletters.put('v', 118);asciiletters.put('w', 119);asciiletters.put('x', 120);
        asciiletters.put('y', 121);asciiletters.put('z', 122);asciiletters.put(' ', 32);

        letters.put('a', 1); letters.put('b', 2); letters.put('c', 3);
        letters.put('d', 4); letters.put('e', 5); letters.put('f', 6);
        letters.put('g', 7); letters.put('h', 8); letters.put('i', 9);
        letters.put('j', 10);letters.put('k', 11);letters.put('l', 12);
        letters.put('m', 13);letters.put('n', 14);letters.put('o', 15);
        letters.put('p', 16);letters.put('q', 17);letters.put('r', 18);
        letters.put('s', 19);letters.put('t', 20);letters.put('u', 21);
        letters.put('v', 22);letters.put('w', 23);letters.put('x', 24);
        letters.put('y', 25);letters.put('z', 26);letters.put(' ', 0);
    }

    public Map<String,Integer> adjs = new HashMap<String,Integer>();
    public Map<String,Integer> advs = new HashMap<String,Integer>();
    public Map<String,Integer> nouns = new HashMap<String,Integer>();
    public Map<String,Integer> verbs = new HashMap<String,Integer>();

    BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
    public Charset useCharset = Charset.ASCII;
    private TextCase useTextCase = TextCase.BOTH;
    private boolean includePhrases = false;
    private static final int MAX_RESULTS = 1024;//Max rows in Excel is 65536
    private int resultCount = 0;

    /**
     * @param args a {@link java.lang.String}[] of program arguments
     */
    public static void main(String[] args) {
        WordWhackerV04 whacker = new WordWhackerV04();
        whacker.driveApp();
    }

    /**
     * Utility method to control the flow of the application
     */
    private void driveApp() {
        String input = "";
        try {
            this.changeCharset();
            this.changeTextCase();
            this.changeIncludePhrases();
            this.createWordlists();

            while(!"0".equals(input)) {
                System.out.println("1. Generate equivalences");
                System.out.println("0. Exit");

                input = stdin.readLine();

                if("1".equalsIgnoreCase(input)) {
                    this.generateEquivalence();
                }
            }
            System.exit(0);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set the charset for seeding
     */
    private void changeCharset() {
        String input = "";
        try {
            System.out.println("Please enter the ID of the charset to use [ASCII]");
            System.out.println("1. ASCII");
            System.out.println("2. Unicode");
            System.out.println("3. Positional");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.useCharset = Charset.ASCII;
            } else if("2".equalsIgnoreCase(input)) {
                this.useCharset = Charset.UNICODE;
            } else if("3".equalsIgnoreCase(input)) {
                this.useCharset = Charset.POSITIONAL;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set whether phrases should be included
     */
    private void changeIncludePhrases() {
        String input = "";
        try {
            System.out.println("Please set whether to include phrases [No]");
            System.out.println("1. Yes");
            System.out.println("2. No");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.includePhrases = true;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to choose which case of characters to use.
     */
    private void changeTextCase() {
        String input = "";
        try {
            System.out.println("Please set the text case to process to use [BOTH]");
            System.out.println("1. Camel");
            System.out.println("2. Lower");
            System.out.println("3. Both");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.CAMEL;
            } else if("2".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.LOWER;
            } else if("3".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.BOTH;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Read in a file and store in a Map
     */
    private void createWordlists() {
        File dictDir = new File(System.getProperty("user.dir"), "dict"); //portable path separator
        if(dictDir.exists()) {
            File[] files = dictDir.listFiles(new FilenameFilter() {
                @Override
                public boolean accept(File dir, String name) {
                    boolean accept = false;
                    if(!name.endsWith(".csv") && !name.endsWith(".txt")) {
                        accept = true;
                    }
                    return accept;
                }
            });
            for(File file : files) {
                createStrippedList(file);
            }
        }
    }

    /**
     * Method to generate the equivalent words and phrases
     */
    private void generateEquivalence() {
        String input = "";
        try {
            System.out.println("Enter specific words separated by , (comma): ");
            input = stdin.readLine();

            List<String> specificWords = null;
            if(input.trim().length() > 0) {
                //tolerate spaces around the commas, e.g. "happiness, joy"
                String[] inSpecWords = input.trim().split("\\s*,\\s*");
                specificWords = new ArrayList<String>(inSpecWords.length*2);
                for(String specWord : inSpecWords) {
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        specificWords.add(specWord.toLowerCase());
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        specificWords.add(String.format("%s%s",
                                                        Character.toUpperCase(
                                                                specWord.charAt(0)),
                                                        specWord.substring(1)
                                                                .toLowerCase()));
                    }
                }
            }

            System.out.println("Please enter the text to generate equivalence for: ");
            input = stdin.readLine();

            int stringValue = getWordValue(input);
            File outFile = new File(input+".csv");
            resultCount = 0; //each run gets its own MAX_RESULTS allowance
            FileWriter writer = new FileWriter(outFile);
            final CSVPrinter printer = new CSVPrinter(writer);
            try {
                this.saveMatchingWords(printer, stringValue);
                this.saveMatchingAdjNoun(printer, specificWords, stringValue);
                this.saveMatchingVerbAdv(printer, specificWords, stringValue);
            } catch(MaxResultsReachedException mrre) {
                System.out.println(mrre.getMessage());
            } finally {
                writer.close(); //flush and release the output file
            }
            System.out.println(outFile.getName()+" file created");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method for saving the matching words to a given value of source word
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingWords(CSVPrinter printer, int stringValue)
                                                       throws MaxResultsReachedException {
        this.appendRowData(printer,getKeysByValue(adjs,stringValue),"adj",stringValue);
        this.appendRowData(printer,getKeysByValue(advs,stringValue),"adv",stringValue);
        this.appendRowData(printer,getKeysByValue(nouns,stringValue),"noun",stringValue);
        this.appendRowData(printer,getKeysByValue(verbs,stringValue),"verb",stringValue);
    }

    /**
     * Method to generate Adjective-Noun pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matched if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingAdjNoun(CSVPrinter printer, List<String> specificWords,
                                      int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,adjs,nouns,specificWords,stringValue,"adj-noun");
    }

    /**
     * Method to generate Verb-Adverb pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matched if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingVerbAdv(CSVPrinter printer, List<String> specificWords,
                                      int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,verbs,advs,specificWords,stringValue,"verb-adv");
    }

    /**
     * Method to save the matching words
     *
     * @param printer pointer to the CSV file to write to
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void saveMatchingPair(CSVPrinter printer, Map<String, Integer> map1,
                                  Map<String, Integer> map2, List<String> specificWords,
                                  int stringValue, String type)
                                                       throws MaxResultsReachedException {
        if(specificWords != null && specificWords.size()>0) {
            for(String specificWord : specificWords) {
                if(map1.containsKey(specificWord)) {
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map1.get(specificWord));
                    processWordPairs(printer, stringValue,tmpMap,map2,type);
                }
                if(map2.containsKey(specificWord)) {
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map2.get(specificWord));
                    processWordPairs(printer, stringValue,map1,tmpMap,type);
                }
            }
        } else {
            processWordPairs(printer, stringValue,map1,map2,type);
        }
    }

    /**
     * Method for processing word pairs
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void processWordPairs(CSVPrinter printer, int stringValue,
                                  Map<String, Integer> map1, Map<String, Integer> map2,
                                  String type) throws MaxResultsReachedException {
        Set<Map.Entry<String, Integer>> map1Vals = map1.entrySet();
        for(Map.Entry<String, Integer> entry : map1Vals) {
            if(entry.getValue() < stringValue) { //only process if less than
                int remVal = stringValue - entry.getValue();
                Set<String> map2Vals = getKeysByValue(map2,remVal);
                for(String map2Val : map2Vals) {
                    appendRowData(printer, entry.getKey() + " " + map2Val,
                                  type,entry.getValue()+remVal);
                }
            }
        }
    }

    /**
     * Iterates through a set of matches, writing each as a row in the results csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 set of values to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, Set<String> col1, String col2,
                                      int stringValue) throws MaxResultsReachedException {
        for(String value : col1) {
            appendRowData(printer, value, col2, stringValue);
        }
    }

    /**
     * writes the result as a row in the output csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 result to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, String col1, String col2,
                                      int stringValue) throws MaxResultsReachedException {
        if(resultCount<MAX_RESULTS) {
            printer.println(new String[]{col1,col2,Integer.toString(stringValue)});
            resultCount++;
        } else {
            throw new MaxResultsReachedException("Maximum number of results reached");
        }
    }

    /**
     * Utility method to get all the matching Keys of a Map by the given value
     *
     * @param <K> the key object
     * @param <V> the value object
     * @param map a Map to search through
     * @param value the value to search for
     *
     * @return a set of keys which match the given value
     */
    private <K, V> Set<K> getKeysByValue(Map<K, V> map, V value) {
        Set<K> keys = new HashSet<K>();
        for (Entry<K, V> entry : map.entrySet()) {
            if (entry.getValue().equals(value)) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    /**
     * Method to strip the source word list according to the options entered by the user.
     * This will take out phrases and populate the maps with word numeric values.
     *
     * @param file a link to the source word list file.
     */
    private void createStrippedList(File file) {
        BufferedReader bufferedStream = null;
        try {
            Map<String, Integer> mapPointer = this.getWordMap(file);
            if(mapPointer == null) {
                return; //not one of the .adj, .adv, .noun or .verb files
            }
            bufferedStream = new BufferedReader(
                             new InputStreamReader(
                             new FileInputStream(file)));
            String line = "";
            while((line = bufferedStream.readLine()) != null) {
                String word = getWord(line);
                if(word.matches("^[a-zA-Z].*")) {
                    if(!this.includePhrases && word.contains("_")) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        word = word.toLowerCase();
                        mapPointer.put(word, this.getWordValue(word));
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        word = String.format("%s%s",Character.toUpperCase(word.charAt(0)),
                                                    word.substring(1).toLowerCase());
                        mapPointer.put(word, this.getWordValue(word));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if(bufferedStream != null) {
                try {
                    bufferedStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Method to get the actual word from a source file, this extracts only
     * the pertinent bit from the source line
     *
     * @param sentence a String sentence to process
     *
     * @return the first word in the sentence.
     */
    private static String getWord(String sentence) {
        String[] items = sentence.split(" ");
        return items[0];
    }

    /**
     * Method to get a Map of Strings to equivalent values based on a given source file
     *
     * @param file a pointer to a source word list file
     *
     * @return a Map of words to numeric values.
     */
    private Map<String, Integer> getWordMap(File file) {
        Map<String, Integer> tmpMap = null;
        if(file.getName().endsWith(".adj")) {
            tmpMap=this.adjs;
        } else if(file.getName().endsWith(".adv")) {
            tmpMap=this.advs;
        } else if(file.getName().endsWith(".noun")) {
            tmpMap=this.nouns;
        } else if(file.getName().endsWith(".verb")) {
            tmpMap=this.verbs;
        }
        return tmpMap;
    }

    /**
     * Method to return the numeric value for a given word
     *
     * @param word a {@link java.lang.String} containing the word
     * @return an int representing the word's numeric value
     */
    private int getWordValue(String word) {
        int returnable = 0;
        char[] chars = word.toCharArray();
        for(char theChar : chars) {
            Integer charValue = null;
            switch(useCharset) {
                case ASCII:
                    charValue = asciiletters.get(theChar);
                break;
                case UNICODE:
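                    //note: getNumericValue maps the letters a-z (either case) to 10-35,
                    //it does not return the Unicode code point
                    charValue = Character.getNumericValue(theChar);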
                break;
                case POSITIONAL:
                    charValue = letters.get(Character.toLowerCase(theChar));
                break;
                default:
                break;
            }
            if(charValue != null) {
                returnable = returnable + charValue;
            }
        }
        return returnable;
    }

    /**
     * Exception defined as inner class
     */
    private class MaxResultsReachedException extends Exception {
        private static final long serialVersionUID = 1L;

        public MaxResultsReachedException(String message) {
            super(message);
        }
    }
}
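
One design point worth noting: getKeysByValue walks the whole of the second map for every word in the first map. A possible alternative, which is not what the listing does, is to group the second list by value once and then look partners up directly. The sketch below shows that idea; PairIndexSketch, indexByValue and pairs are hypothetical names, and the scores in main are the ASCII sums of the camel case words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of pair matching with a value index instead of a scan per word. */
final class PairIndexSketch {

    /** Groups words by score so that a lookup by value is a single map access. */
    static Map<Integer, List<String>> indexByValue(Map<String, Integer> scores) {
        Map<Integer, List<String>> index = new HashMap<Integer, List<String>>();
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            List<String> bucket = index.get(e.getValue());
            if (bucket == null) {
                bucket = new ArrayList<String>();
                index.put(e.getValue(), bucket);
            }
            bucket.add(e.getKey());
        }
        return index;
    }

    /** Finds "first second" pairs whose two scores add up to the target. */
    static List<String> pairs(Map<String, Integer> first,
                              Map<Integer, List<String>> secondIndex, int target) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : first.entrySet()) {
            if (e.getValue() < target) { // same guard as processWordPairs
                List<String> partners = secondIndex.get(target - e.getValue());
                if (partners != null) {
                    for (String partner : partners) {
                        out.add(e.getKey() + " " + partner);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> adjs = new HashMap<String, Integer>();
        adjs.put("Lovely", 635);
        adjs.put("Fit", 291);
        Map<String, Integer> nouns = new HashMap<String, Integer>();
        nouns.put("Wax", 304);
        nouns.put("Nest", 410);
        System.out.println(pairs(adjs, indexByValue(nouns), 939));
    }
}
```

The index costs one extra pass over the second list, after which each lookup no longer depends on the size of that list.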

The flow through the application prompts the user to enter their choices and then generates the output file. In the example below we can see that more than 1024 possibilities (the value of MAX_RESULTS) would have been generated:

Please enter the ID of the charset to use [ASCII]
1. ASCII
2. Unicode
3. Positional
1
Please set the text case to process to use [BOTH]
1. Camel
2. Lower
3. Both
1
Please set whether to include phrases [No]
1. Yes
2. No
2
1. Generate equivalences
0. Exit
1
Enter specific words that should appear separated by , (comma): 

Please enter the text to generate equivalence for: 
Happiness
The maximum number of results has been reached
Happiness.csv file created
1. Generate equivalences
0. Exit
0

Generating on Happiness gives many options for output. Most are taken up by single matching words, but the result set does cover some pairs. Here is an example of the output CSV file:

Tenacious   adj          939
Unlimited   adj          939
Wished-for  adj          939
Gainfully   adv          939
Excitedly   adv          939
Certainly   adv          939
Orchestra   noun         939
Whitetail   noun         939
Foresight   noun         939
Implement   verb         939
Orientate   verb         939
Recapture   verb         939
Lovely Wax  adj-noun     939
Downy Nest  adj-noun     939
Fit Snoopy  adj-noun     939
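
A few of these rows can be checked by hand or with a few lines of code. The sketch below simply adds up the ASCII codes of the letters and skips everything else, which mirrors how the run above scored them: a hyphen has no entry in the ASCII map, and a pair is scored as the sum of its two words, so the space between them adds nothing. SampleRowCheck and asciiLetterSum are hypothetical names of mine.

```java
/** Sketch: re-computes the ASCII score for some of the rows shown above. */
final class SampleRowCheck {

    /** Sums the character codes of the ASCII letters in the text, ignoring anything else. */
    static int asciiLetterSum(String text) {
        int total = 0;
        for (char c : text.toCharArray()) {
            boolean isLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (isLetter) {
                total += c;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] rows = {"Happiness", "Tenacious", "Wished-for",
                         "Lovely Wax", "Downy Nest", "Fit Snoopy"};
        for (String row : rows) {
            // Each of these comes out at 939.
            System.out.println(row + " " + asciiLetterSum(row));
        }
    }
}
```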

During my investigations I have found some very funny combinations, as profanities and negative words were not removed from the dictionaries. So, while I have been able to prove that my friend’s company is “Perpetual Happiness” (positional), “Righteous Happiness” (ASCII), and “Phenomenal Happiness” (Unicode), I have also seen some results that are far from complimentary.
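
The three charset options give different answers because they score the same letters differently. As a rough sketch (ScoringSchemesSketch and its method names are mine, and it only looks at ASCII letters to keep things short), this is how one word comes out under each option. The Unicode option follows what the code actually calls, Character.getNumericValue, which gives a to z the values 10 to 35 regardless of case.

```java
/** Sketch comparing the three scoring options on a single word. */
final class ScoringSchemesSketch {

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    /** ASCII option: sum of the character codes, so case matters. */
    static int ascii(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            if (isAsciiLetter(c)) {
                total += c;
            }
        }
        return total;
    }

    /** Positional option: a=1 ... z=26, case ignored. */
    static int positional(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            if (isAsciiLetter(c)) {
                total += Character.toLowerCase(c) - 'a' + 1;
            }
        }
        return total;
    }

    /** Unicode option as coded: Character.getNumericValue, a=10 ... z=35, case ignored. */
    static int numeric(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            if (isAsciiLetter(c)) {
                total += Character.getNumericValue(c);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String word = "Happiness";
        System.out.println("ASCII      " + ascii(word));      // 939
        System.out.println("Positional " + positional(word)); // 107
        System.out.println("Unicode    " + numeric(word));    // 188
    }
}
```

Since the positional and Unicode scores differ by exactly 9 per letter, those two options agree for single words of equal length, but can disagree once pairs with different letter counts are compared.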

At this point I have stopped developing the script, as it has achieved what I wanted it to. There are four builds I have in mind for the future, all with a learning aim:

  1. Add the code to a git repo and add to GitHub
  2. Make the processing distributed using Hadoop or similar; this should enable me to create multiple-word sentences
  3. Turn this into a webapp using Spring so that I can learn more about this framework
  4. Move the Hadoop version of the webapp into the cloud using Amazon/Cloud Foundry or similar.

Who knows, if I get another 80 minutes of free time I might implement them.