Getting a Set of strings from a text file

exercise No. 198

Getting a text's set of words.

Q:

Consider a text file foo.txt containing:

A simple collection of words.
   Some words may appear multiple times.

We would like to retrieve a comma-separated list of all words contained in the file, excluding duplicates:

of, multiple, collection, simple, words, may, Some, times, A, appear

The subsequent rules shall apply:

  • Arbitrary combinations of white space and the characters .,:;?!" shall be treated as word delimiters and are otherwise to be ignored.

  • The order of appearance in the generated result does not matter.

  • Duplicates such as “words” in the current example shall appear only once in the output.

Hints:

  1. Your application shall read its input from a given file name provided as a command line argument. Provide appropriate error messages if:

    • The user enters either no argument at all or more than one command line argument.

    • The file in question cannot be read.

    You may reconsider the section called “Exercises” regarding file read access.

  2. Splitting input text lines at word delimiters .,:;?!" or white space characters may be achieved by means of split(...) and the regular expression String regex = "[ \t\"!?.,'´`:;]+";. The + sign denotes a succession of one or more characters from the set consisting of the blank and \t\"!?.,'´`:;. Note that this set covers only the blank and the tab as white space and additionally treats the quote characters ' ´ ` as delimiters.

    Thus the text "That's it. Next try" will be split into the string array {"That", "s", "it", "Next", "try"}.

  3. Write a JUnit test which reads from a given input file and compares its result with a hard-coded set of expected strings.
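
The splitting described in hint 2 can be tried out in isolation. The following is a minimal sketch and not part of the original exercise: the class name SplitSketch and the method wordsOf are made up for illustration, and the acute accent ´ is written as a Unicode escape so that the source does not depend on the file encoding. Since the second line of foo.txt starts with white space, split(...) yields an empty leading element there, which the sketch filters out.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class SplitSketch {

  // Delimiter set from hint 2: blank, tab and some punctuation characters.
  // The acute accent is given as a Unicode escape (assumption: same character as in the hint).
  public static final String REGEX = "[ \t\"!?.,'\u00b4`:;]+";

  // Hypothetical helper: returns the distinct words of a single line.
  // A LinkedHashSet is used only to keep the printed order predictable.
  public static Set<String> wordsOf(final String line) {
    final Set<String> words = new LinkedHashSet<>();
    for (final String word : line.split(REGEX)) {
      if (!word.isEmpty()) { // drop the empty element caused by leading delimiters
        words.add(word);
      }
    }
    return words;
  }

  public static void main(String[] args) {
    // prints [That, s, it, Next, try]
    System.out.println(Arrays.toString("That's it. Next try".split(REGEX)));

    // prints Some, words, may, appear, multiple, times
    System.out.println(String.join(", ", wordsOf("   Some words may appear multiple times.")));
  }
}
```

Collecting the words of all lines into one common set rather than one set per line removes duplicates across lines as well.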

A:

The input file smalltest.txt may be used to define a JUnit test:

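import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.junit.Assert; // JUnit 4 is assumed here
import org.junit.Test;

// Test method, to be placed inside a test class.
@Test
public void testWordSet() throws IOException {

  // Expected words; comparing sets ignores the order of elements.
  final Set<String> expectedStrings = new HashSet<>(Arrays.asList(
      "A", "simple", "collection", "of", "words",
      "Some", "may", "appear", "multiple", "times"));

  final TextFileHandler tfh = new TextFileHandler("smalltest.txt");
  // assertEquals reports both sets on failure, unlike assertTrue(a.equals(b)).
  Assert.assertEquals(expectedStrings, tfh.getWords());
}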
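
The class TextFileHandler used by the test is not shown here. One possible shape is sketched below under stated assumptions: the constructor reads the file line by line, getWords() returns the collected set, and main(...) performs the argument checks from hint 1. Only the constructor argument and the getWords() method are taken from the test; field names, messages and exit codes are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class TextFileHandler {

  // Delimiters from hint 2; the acute accent is written as a Unicode escape.
  private static final String REGEX = "[ \t\"!?.,'\u00b4`:;]+";

  // A HashSet stores each word at most once, which removes duplicates.
  private final Set<String> words = new HashSet<>();

  // Reads the given file line by line and collects all words.
  public TextFileHandler(final String fileName) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
      String line;
      while (null != (line = reader.readLine())) {
        for (final String word : line.split(REGEX)) {
          if (!word.isEmpty()) { // lines starting with a delimiter yield an empty first element
            words.add(word);
          }
        }
      }
    }
  }

  public Set<String> getWords() {
    return words;
  }

  public static void main(String[] args) {
    if (1 != args.length) { // neither zero nor several arguments are accepted
      System.err.println("Usage: TextFileHandler <file name>");
      System.exit(1);
    }
    try {
      System.out.println(String.join(", ", new TextFileHandler(args[0]).getWords()));
    } catch (final IOException e) { // also covers FileNotFoundException
      System.err.println("Unable to read file '" + args[0] + "': " + e.getMessage());
      System.exit(1);
    }
  }
}
```

Because a HashSet does not define an iteration order, the printed list may differ in order from run to run or between Java versions, which is in line with the rule that the order of appearance does not matter.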