Tutorial 042 Word Frequency in Text Files


The following program builds on those in the previous few tutorials to give an analysis of word frequency in a chosen text file. As an example I have used the program given here to analyse an article saved from the web as a text file. The article was about the Cretan diet, rich in olive oil and supposedly very healthy.

Here is part of the output from the program:

 Full Pathname of this text file :

C:\Documents and Settings\Richard\My Documents\Cretan Diet from the Web TEXT.txt

  Number of words extracted = 1697

               Sorting Words using *Quicksort*...

That took 0.08secs to sort 1697 words

Words arranged alphabetically with frequency in brackets :

.......some omitted.....
cheesecake(1)
cheesemaker(1)
cheeses(1)
cherries(1)
chestnuts(1)
chicken(2)
chickens(1)
chickpeas(1)
chips(1)
choice(2)
chops(1)
chortopites(1)
cold(1)
collards(1)
combination(1)
combinations(1)
combine(1)
commodity(1)
common(1)
commonly(1)
comparisons(1)
complicated(1)
conditions(1)
conduct(1)
considering(1)
consumed(2)
consumption(1)
content(1)
........................
Besides(1)
Braising(1)
Cheese(1)
Cretan(4)
Cretans(4)
Crete(2)
........................
pigs(1)
places(1)
plan(1)
plate(1)
plate-breaking(1)
plenty(1)
point(1)
pomegranate(1)
popular(1)
pork(1)
portion(1)
potato(1)
potatoes(3)
pots(1)
precious(1)
prefer(1)
preference(1)
preparation(2)
prepared(3)
presented(1)
preserved(1)
pretty(2)
price(1)
prime(1)
process(1)
produce(1)
producers(2)
productions(1)
..........................
zucchini(3)


       763 different words counted but e.g. <You> and <you> each count

Press <SPACE> to see results by frequency
65 of : and
64 of : the
40 of : of
36 of : or
32 of : to
28 of : are
28 of : with
27 of : in
25 of : is
18 of : on
16 of : like
15 of : from
15 of : not
13 of : it
12 of : for
11 of : but
10 of : they
10 of : vegetables
10 of : you
9 of : fresh
9 of : here
9 of : olive

Press <SPACE> for next screen


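Such an analysis has rather more application in a literary context, but even here you can see a familiar linguistic pattern emerging: a handful of short words (and, the, of, or, to) account for a large share of the text!
(NB as it is presently set up, the program ignores one-letter words such as "a" and "I", so they are neither counted nor printed.)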
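
The output above counts <You> and <you> separately because words are compared exactly as they were typed. If you would rather they counted together, one possible approach is to fold each word to lower case before storing it. This is only a sketch: FNlower$ is a name invented here and is not part of the listing.

```bbcbasic
      REM Sketch only: FNlower$ is a hypothetical helper, not in the original listing
      PRINT FNlower$("You")
      PRINT FNlower$("Plate-Breaking")
      END
      :
      DEF FNlower$(a$)
      LOCAL k,c,b$
      IF a$="" THEN =""
      FOR k=1 TO LEN(a$)
        c=ASC(MID$(a$,k,1))
        REM Capital letters A-Z are codes 65-90; adding 32 gives a-z
        IF c>64 AND c<91 THEN c+=32
        b$+=CHR$(c)
      NEXT k
      =b$
```

To try it in the program, you would change word$(n)=word$ in the main loop to word$(n)=FNlower$(word$); the sorting and counting then need no changes.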

Having got to this point, understanding the programming itself should not present too many problems. The algorithm is basically:
  1. Open the chosen text file using the by-now well-rehearsed routines
  2. Read the file a byte at a time, picking out and storing the words
  3. Sort the words into alphabetical order
  4. Go through the list and count the number of instances of each word
  5. Print the alphabetical list of words with frequencies added, keeping note of the maximum frequency met.
  6. Print out the list again but this time in order of decreasing frequency of the words (alphabetic for each frequency)
You can now analyse some of your own word-processing documents and see which words you most spray around.

Don't forget you can save a "Word" file as plain text for use by this program.
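
Steps 4 and 5 of the outline rely on the list being sorted first, so that equal words sit next to each other and each word can be counted in a single pass. Here is a minimal sketch of that idea on its own. The sample words and the names w$(), count and k are made up for illustration, and unlike the full listing it works directly on the array rather than through rank().

```bbcbasic
      REM Sketch only: counts runs of equal neighbours in an already sorted list
      DIM w$(7) : REM w$(7) is left empty and acts as an end marker
      FOR k=1 TO 6
        READ w$(k)
      NEXT k
      DATA and,and,and,olive,olive,the
      :
      count=1
      FOR k=1 TO 6
        IF w$(k)=w$(k+1) THEN
          REM Same word again, so keep counting
          count+=1
        ELSE
          REM Last of its kind: print the word with its total and start afresh
          PRINT w$(k)+"("+STR$(count)+")"
          count=1
        ENDIF
      NEXT k
      END
```

This prints and(3), olive(2) and the(1), one to a line.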


Listing :

      REM : Analyses the frequency of words in any text file
      REM : Richard Weston, 15th July 2003
      MODE 8
      t=25000 :REM word capacity
      instance=1
      maxinstance=1 :REM every word found occurs at least once
      DIM word$(t),rank(t)
      COLOUR1
      PRINT'" Press <SPACE> to choose a text file whose vocabulary you wish to extract"
      G=GET
      OFF
      :
      DIM of% 75, ff% 18, fn% 255
      !of% = 76
      of%!4 = @hwnd%
      of%!12 = ff%
      of%!28 = fn%
      of%!32 = 256
      of%!52 = 6
      $ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
      :
      SYS "GetOpenFileName", of% TO result%
      IF result% THEN filename$ = FNnulterm$(fn%) ELSE PRINT'" No file chosen" : END
      COLOUR7
      PRINT'" Full Pathname of this text file :"
      COLOUR2
      PRINT'filename$
      :
      fnum=OPENIN filename$
      IF fnum=0 THEN PRINT "No ";filename$;" data": END
      :
      n=0
      COLOUR7
      REPEAT
        finished=FALSE
        word$=""
        REPEAT
          temp=BGET#fnum :REM Read byte
          PROCprocess
        UNTIL finished OR EOF#fnum
        IF LEN(word$)>1 THEN
          n+=1
          PRINTTAB(1,8)" Number of words extracted = ";n
          word$(n)=word$
        ENDIF
      UNTIL  EOF#fnum
      CLOSE#fnum
      :
      DIM freq(n)
      newword=0
      COLOUR9
      PRINT'TAB(15)"Sorting Words using *Quicksort*..."'
      COLOUR7
      :
      TIME=0
      :
      PROCsort
      :
      T=TIME/100
      COLOUR11
      PRINT"That took ";T;"secs to sort ";n;" words"
      COLOUR13
      PRINT'"Words arranged alphabetically with frequency in brackets :"'
      COLOUR7
      FOR i=n TO 1 STEP -1
        :
        PROCmore_to_come
        :
        IF moretocome=FALSE THEN
          newword+=1
          freq(rank(i))=instance
          rec$=word$(rank(i)) + "(" + STR$(instance) + ") "
          instance=1
          IF VPOS>30 THEN
            COLOUR1
            PRINT'"Press <SPACE> for next screen"
            COLOUR7
            G=GET
            CLS
          ENDIF
          PRINT rec$
        ENDIF
      NEXT i
      PRINT'
      COLOUR9
      PRINT newword;" different words counted but e.g. <You> and <you> each count"
      COLOUR3
      PRINT'"Press <SPACE> to see results by frequency"
      G=GET
      COLOUR7
      PROCresultsbyfreq
      COLOUR2
      PRINT'" Press <SPACE> to go again..."
      G=GET
      RUN
      END
      :
      DEF FNnulterm$(P%)
      LOCAL A$
      WHILE ?P% <> 0
        A$ += CHR$?P%
        P% += 1
      ENDWHILE
      = A$
      :
      DEF PROCprocess
      REM A letter (A-Z, a-z) or a hyphen (45) extends the current word;
      REM any other byte, including tab, CR and LF, ends it
      IF (temp>64 AND temp<91) OR (temp>96 AND temp<123) OR temp=45 THEN
        word$+=CHR$(temp)
      ELSE
        finished=TRUE
      ENDIF
      ENDPROC
      :
      :
      DEF PROCmore_to_come
      IF word$(rank(i))=word$(rank(i-1)) THEN
        moretocome=TRUE
        instance+=1
        IF instance>maxinstance THEN maxinstance=instance
      ELSE moretocome=FALSE
      ENDIF
      ENDPROC
      :
      DEF PROCsort:LOCAL I
      FOR I = 1 TO t
        rank(I)=I
      NEXT
      PROCquicksort(1,n)
      ENDPROC
      :
      DEF PROCquicksort(low,high)
      LOCAL left,right,it$,dummy
      left=low:right=high
      it$=word$(rank((low+high)DIV 2))
      REPEAT
        IF word$(rank(left))>it$ THEN
          REPEAT left=left+1
          UNTIL word$(rank(left))<=it$
        ENDIF
        IF word$(rank(right))<it$ THEN
          REPEAT right=right-1
          UNTIL word$(rank(right))>=it$
        ENDIF
        IF left<=right THEN
          dummy=rank(left)
          rank(left)=rank(right)
          rank(right)=dummy
          left=left+1
          right=right-1
        ENDIF
      UNTIL left>right
      IF right>low THEN PROCquicksort(low,right)
      IF left<high THEN PROCquicksort(left,high)
      ENDPROC
      :
      DEF PROCresultsbyfreq
      FOR j = maxinstance TO 1 STEP -1
        FOR i = n TO 1 STEP -1
          IF freq(rank(i))=j THEN
            IF VPOS>30 THEN
              COLOUR1
              PRINT'"Press <SPACE> for next screen"
              COLOUR7
              G=GET
              CLS
            ENDIF
            PRINT; j;" of : ";word$(rank(i))
          ENDIF
        NEXT i
      NEXT j
      ENDPROC
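
PROCprocess in the listing decides, one byte at a time, whether a character belongs to the current word. To see that rule in isolation, here is a small sketch that applies the same letter-or-hyphen test to a string instead of a file. The sample sentence and the names t$, w$, k and c are invented for this example.

```bbcbasic
      REM Sketch only: the letter-or-hyphen rule applied to a string, not a file
      t$="Plate-breaking is popular, but not in Crete!"
      w$=""
      FOR k=1 TO LEN(t$)+1
        REM One step past the end MID$ gives "" and ASC gives -1, which ends the last word
        c=ASC(MID$(t$,k,1))
        IF (c>64 AND c<91) OR (c>96 AND c<123) OR c=45 THEN
          w$+=CHR$(c)
        ELSE
          REM As in the listing, one-letter words are dropped
          IF LEN(w$)>1 THEN PRINT w$
          w$=""
        ENDIF
      NEXT k
      END
```

This prints Plate-breaking, is, popular, but, not, in and Crete, one to a line; the comma, spaces and exclamation mark simply end whichever word is in progress.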



