Tutorial 042 Word Frequency in Text Files
The following program builds on those in the previous few tutorials to give
an analysis of word frequency in a chosen text file. As an example I have
used the program given here to analyse an article saved from the web as a
text file. The article was about the Cretan diet, rich in olive oil and supposedly
very healthy.
Here is part of the output from the program:
Full Pathname of this text file :
C:\Documents and Settings\Richard\My Documents\Cretan Diet from the Web TEXT.txt
Number of words extracted = 1697
Sorting Words using *Quicksort*...
That took 0.08secs to sort 1697 words
Words arranged alphabetically with frequency in brackets :
.......some omitted.....
cheesecake(1)
cheesemaker(1)
cheeses(1)
cherries(1)
chestnuts(1)
chicken(2)
chickens(1)
chickpeas(1)
chips(1)
choice(2)
chops(1)
chortopites(1)
cold(1)
collards(1)
combination(1)
combinations(1)
combine(1)
commodity(1)
common(1)
commonly(1)
comparisons(1)
complicated(1)
conditions(1)
conduct(1)
considering(1)
consumed(2)
consumption(1)
content(1)
........................
Besides(1)
Braising(1)
Cheese(1)
Cretan(4)
Cretans(4)
Crete(2)
........................
pigs(1)
places(1)
plan(1)
plate(1)
plate-breaking(1)
plenty(1)
point(1)
pomegranate(1)
popular(1)
pork(1)
portion(1)
potato(1)
potatoes(3)
pots(1)
precious(1)
prefer(1)
preference(1)
preparation(2)
prepared(3)
presented(1)
preserved(1)
pretty(2)
price(1)
prime(1)
process(1)
produce(1)
producers(2)
productions(1)
..........................
zucchini(3)
763 different words counted but e.g. <You> and <you> each count
Press <SPACE> to see results by frequency
65 of : and
64 of : the
40 of : of
36 of : or
32 of : to
28 of : are
28 of : with
27 of : in
25 of : is
18 of : on
16 of : like
15 of : from
15 of : not
13 of : it
12 of : for
11 of : but
10 of : they
10 of : vegetables
10 of : you
9 of : fresh
9 of : here
9 of : olive
Press <SPACE> for next screen
Such an analysis is more useful in a literary context, but even here you can
see certain linguistic principles emerging: short function words such as "and",
"the" and "of" dominate the count.
(NB as presently set up, the program ignores one-letter words.)
Having got to this point, understanding the programming itself should not
present too many problems. The algorithm is basically:
- Open the chosen text file using the by now well-rehearsed routines
- Read the file a byte at a time, identifying and storing the words
- Sort the words into alphabetical order
- Go through the sorted list and count the number of instances of each word
- Print the alphabetical list of words with frequencies added, keeping
note of the maximum frequency met
- Print out the list again, but this time in order of decreasing frequency
of the words (alphabetical within each frequency)
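The counting step works because sorting puts identical words next to each other, so each word's frequency is simply the length of its run. The following is a minimal sketch of that idea only, not the tutorial's own routine: the array w$(), the counters and the sample words are made up for illustration.

```bbcbasic
REM Sketch: count runs of equal words in an already-sorted list
REM (w$(), n%, count% and the sample words are illustrative assumptions)
n%=6
DIM w$(n%+1) :REM w$(n%+1) stays empty and acts as an end marker
FOR i%=1 TO n%
  READ w$(i%)
NEXT
DATA "and","and","diet","oil","oil","oil"
count%=1
FOR i%=1 TO n%
  IF w$(i%)=w$(i%+1) THEN
    count%+=1
  ELSE
    REM The run has ended, so report the word and start a new count
    PRINT w$(i%)+"("+STR$(count%)+")"
    count%=1
  ENDIF
NEXT
```

With the sample data this should print and(2), diet(1) and oil(3), one per line. The full listing does the same job from the other end of the list and through the rank() array, but the principle is the same.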
You can now analyse some of your own word-processing documents and see
which words you most spray around.
Don't forget you can save a "Word" file as plain text for use by this program.
Listing :
REM : Analyses the frequency of words in any text file
REM : Richard Weston, 15th July 2003
MODE 8
t=25000 :REM word capacity
instance=1
maxinstance=1 :REM every stored word occurs at least once
DIM word$(t),rank(t)
COLOUR1
PRINT'" Press <SPACE> to choose a text file whose vocabulary you wish to extract"
G=GET
OFF
:
DIM of% 75, ff% 18, fn% 255
!of% = 76
of%!4 = @hwnd%
of%!12 = ff%
of%!28 = fn%
of%!32 = 256
of%!52 = 6
$ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
:
SYS "GetOpenFileName", of% TO result%
IF result% filename$ = FNnulterm$(fn%)
COLOUR7
PRINT'" Full Pathname of this text file :"
COLOUR2
PRINT'filename$
:
fnum=OPENIN filename$
IF fnum=0 THEN PRINT "No ";filename$;" data": END
:
n=0
COLOUR7
REPEAT
finished=FALSE
word$=""
REPEAT
temp=BGET#fnum :REM Read byte
PROCprocess
UNTIL finished OR EOF#fnum :REM also stop if the file ends mid-word
IF LEN(word$)>1 THEN
n+=1
PRINTTAB(1,8)" Number of words extracted = ";n
word$(n)=word$
ENDIF
UNTIL EOF#fnum
CLOSE#fnum
:
DIM freq(n)
newword=0
COLOUR9
PRINT'TAB(15)"Sorting Words using *Quicksort*..."'
COLOUR7
:
TIME=0
:
PROCsort
:
T=TIME/100
COLOUR11
PRINT"That took ";T;"secs to sort ";n;" words"
COLOUR13
PRINT'"Words arranged alphabetically with frequency in brackets :"'
COLOUR7
FOR i=n TO 1 STEP -1
:
PROCmore_to_come
:
IF moretocome=FALSE THEN
newword+=1
freq(rank(i))=instance
rec$=word$(rank(i)) + "(" + STR$(instance) + ") "
instance=1
IF VPOS>30 THEN
COLOUR1
PRINT'"Press <SPACE> for next screen"
COLOUR7
G=GET
CLS
ENDIF
PRINT rec$
ENDIF
NEXT i
PRINT'
COLOUR9
PRINT newword;" different words counted but e.g. <You> and <you> each count"
COLOUR3
PRINT'"Press <SPACE> to see results by frequency"
G=GET
COLOUR7
PROCresultsbyfreq
COLOUR2
PRINT'" Press <SPACE> to go again..."
G=GET
RUN
END
:
DEF FNnulterm$(P%)
LOCAL A$
WHILE ?P% <> 0
A$ += CHR$?P%
P% += 1
ENDWHILE
= A$
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp>96 AND temp<123 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp=45 THEN word$+=CHR$(temp)
REM^ hyphen
IF temp=10 THEN finished=TRUE
IF temp>31 AND temp<65 THEN finished=TRUE
IF temp=45 THEN finished=FALSE
IF temp>90 AND temp<97 THEN finished=TRUE
IF temp>122 THEN finished=TRUE
ENDPROC
:
:
DEF PROCmore_to_come
REM Compare the current word with the next one in the scan
IF word$(rank(i))=word$(rank(i-1)) THEN
moretocome=TRUE
instance+=1
IF instance>maxinstance THEN maxinstance=instance
ELSE
moretocome=FALSE
ENDIF
ENDPROC
:
DEF PROCsort:LOCAL I
FOR I = 1 TO t
rank(I)=I
NEXT
PROCquicksort(1,n)
ENDPROC
:
DEF PROCquicksort(low,high)
LOCAL left,right,it$,dummy
left=low:right=high
it$=word$(rank((low+high)DIV 2))
REPEAT
IF word$(rank(left))>it$ THEN
REPEAT left=left+1
UNTIL word$(rank(left))<=it$
ENDIF
IF word$(rank(right))<it$ THEN
REPEAT right=right-1
UNTIL word$(rank(right))>=it$
ENDIF
IF left<=right THEN
dummy=rank(left)
rank(left)=rank(right)
rank(right)=dummy
left=left+1
right=right-1
ENDIF
UNTIL left>right
IF right>low THEN PROCquicksort(low,right)
IF left<high THEN PROCquicksort(left,high)
ENDPROC
:
DEF PROCresultsbyfreq
FOR j = maxinstance TO 1 STEP -1
FOR i = n TO 1 STEP -1
IF freq(rank(i))=j THEN
IF VPOS>30 THEN
COLOUR1
PRINT'"Press <SPACE> for next screen"
COLOUR7
G=GET
CLS
ENDIF
PRINT ;j;" of : ";word$(rank(i))
ENDIF
NEXT i
NEXT j
ENDPROC
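One detail of the listing is worth drawing out: the sort never moves the strings in word$() themselves. PROCquicksort only swaps entries in rank(), and every later access goes through word$(rank(i)). Here is a minimal sketch of that indirection, using a simple exchange sort in place of Quicksort to keep it short. The names w$() and r%() and the sample words are made up for illustration.

```bbcbasic
REM Sketch: sort an index array rather than the strings themselves
REM (w$(), r%() and the sample words are illustrative assumptions)
n%=4
DIM w$(n%), r%(n%)
FOR i%=1 TO n%
  READ w$(i%)
  r%(i%)=i% :REM start with the identity ordering
NEXT
DATA "olive","cheese","potato","bread"
REM Simple exchange sort on r%() (not Quicksort, for brevity)
FOR i%=1 TO n%-1
  FOR j%=i%+1 TO n%
    IF w$(r%(j%))<w$(r%(i%)) THEN
      temp%=r%(i%)
      r%(i%)=r%(j%)
      r%(j%)=temp%
    ENDIF
  NEXT
NEXT
REM w$() is untouched; reading it through r%() gives ascending order
FOR i%=1 TO n%
  PRINT w$(r%(i%))
NEXT
```

This should print bread, cheese, olive and potato in that order. Swapping numbers is cheaper than swapping strings, and because the original positions are preserved, a second array such as freq() can be indexed by the same rank values.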