Tutorial 037 Converting a Text File directly to a Vocabulary List
By combining the functions of the programs given in Tutorials 34, 35 and 36,
the following program lets you extract the vocabulary from any text file.
Using the following text :
When Buddha became enlightened he acquired knowledge
of the Four Truths:
the fact of pain or evil, and that everything is suffering;
that the cause of suffering is desire;
that the suppression of desire brings about the suppression of suffering;
and that to suppress suffering you must adhere to the Noble Eightfold
Path of right intention, speech, action, right livelihood, effort,
mindfulness and right concentration.
....we get the following output :
Press <SPACE> to choose a text file whose vocabulary you wish to extract
Full Pathname of this text file :
C:\Documents and Settings\Desktop\Buddha.txt
Number of words extracted = 65
Sorting Words using *Quicksort*...
That took 1.25secs to sort 25000 memory locations (only 65 used)
If necessary press <SHIFT> to scroll down to see the rest
Buddha Eightfold Four Noble Path Truths When about acquired action adhere and
became brings cause concentration desire effort enlightened everything evil
fact he intention is knowledge livelihood mindfulness must of or pain right
speech suffering suppress suppression that the to you
41 different words counted but e.g. <You> and <you> each count
Press<SPACE> to go again...
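Of the 65 words extracted, only 41 are distinct. The comparison is
case-sensitive, so <You> and <you> would count as two different words, and
single-letter words are not extracted at all. As given, the program can handle
up to 25,000 extractable words in the text file; to raise the limit, change
the value of t at the top of the listing. The listing should need no further
explanation beyond that given in the previous three tutorials.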
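As a rough illustration of how the distinct count comes about: once the words
are sorted, any repeats sit next to each other, so it is enough to compare each
word with its neighbour. The sketch below is not the tutorial's own routine.
It assumes a small hand-typed list; the array w$(), the counter distinct% and
the DATA values are made up for this example.

```basic
REM Sketch only: count the distinct entries in an already-sorted list
REM w$() and the DATA values are hypothetical sample data
DIM w$(6)
FOR k% = 1 TO 6
  READ w$(k%)
NEXT
distinct% = 0
FOR k% = 1 TO 6
  REM w$(0) is never filled and stays empty, so the first word always counts
  IF w$(k%) <> w$(k%-1) THEN distinct% += 1
NEXT
PRINT "Words = 6, distinct = "; distinct%
END
DATA and,and,of,that,that,that
```

With this sample data the sketch should report 3 distinct words out of 6.
String comparison is case-sensitive, which is why <You> and <you> would be
treated as different entries.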
If you have a Word file whose vocabulary you would like to extract,
load the file into Word, then save it as a text file as follows:
- Click "File"
- Click "Save As"
- Edit the file name, for instance by adding TEXT to the existing filename
- Where it says "Save as Type" click the downward pointing arrow
- Choose "Text Only with Line breaks"
- Click Save
- Find the new text file in the directory in which you saved it
- Double click it to see the result
- You are now ready to extract the vocab using this tutorial's program
Have fun! (and please email me with any problems/bugs/etc.)
Listing :
REM : Converts a text file to vocab list
REM : Best used with pure prose text
REM : Richard Weston, 1st July 2003
MODE 8
t=25000 :REM word capacity
DIM word$(t),rank(t)
COLOUR1
PRINT'" Press <SPACE> to choose a text file whose vocabulary you wish to extract"
G=GET
OFF
:
DIM of% 75, ff% 18, fn% 255
!of% = 76
of%!4 = @hwnd%
of%!12 = ff%
of%!28 = fn%
of%!32 = 256
of%!52 = 6
$ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
:
SYS "GetOpenFileName", of% TO result%
IF result% filename$ = FNnulterm$(fn%)
COLOUR7
PRINT'" Full Pathname of this text file :"
COLOUR2
PRINT'filename$
:
fnum=OPENIN filename$
IF fnum=0 THEN PRINT "No ";filename$;" data": END
:
n=0
COLOUR7
REPEAT
finished=FALSE
word$=""
REPEAT
temp=BGET#fnum :REM Read byte
PROCprocess
UNTIL finished OR EOF#fnum :REM also stop if the file ends in mid-word
IF LEN(word$)>1 THEN
n+=1
PRINTTAB(1,8)" Number of words extracted = ";n
word$(n)=word$
ENDIF
UNTIL EOF#fnum
CLOSE#fnum
:
newword=0
VDU14
COLOUR9
PRINT'TAB(15)"Sorting Words using *Quicksort*..."'
COLOUR7
:
TIME=0
:
PROCsort
:
T=TIME/100
COLOUR11
PRINT"That took ";T;"secs to sort ";t;" memory locations (only ";n;" used)"
PRINT'"If necessary press <SHIFT> to scroll down to see the rest"'
COLOUR7
FOR i=t-1 TO 1 STEP -1
pos=POS:L=LEN(word$(rank(i)))
:
PROCcheck_already_met
:
IF alreadymet=FALSE THEN
newword+=1
IF L < (80-pos) THEN
PRINTword$(rank(i))+" ";
ELSE
PRINT'word$(rank(i))+" ";
ENDIF
ENDIF
NEXT i
PRINT'
COLOUR9
PRINT newword;" different words counted but e.g. <You> and <you> each count"
COLOUR2
PRINT'" Press<SPACE> to go again..."
G=GET
RUN
END
:
DEF FNnulterm$(P%)
LOCAL A$
WHILE ?P% <> 0
A$ += CHR$?P%
P% += 1
ENDWHILE
= A$
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp>96 AND temp<123 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp=45 THEN word$+=CHR$(temp)
REM^ hyphen
IF temp<32 THEN finished=TRUE :REM control characters (LF, CR, TAB) end a word
IF temp>31 AND temp<65 THEN finished=TRUE
IF temp=45 THEN finished=FALSE
IF temp>90 AND temp<97 THEN finished=TRUE
IF temp>122 THEN finished=TRUE
ENDPROC
:
:
DEF PROCcheck_already_met
alreadymet=FALSE
IF word$(rank(i))=word$(rank(i+1)) THEN
alreadymet=TRUE
ENDIF
ENDPROC
:
DEF PROCsort:LOCAL I
FOR I = 1 TO t
rank(I)=I
NEXT
PROCquicksort(1,t)
ENDPROC
:
DEF PROCquicksort(low,high)
LOCAL left,right,it$,dummy
left=low:right=high
it$=word$(rank((low+high)DIV 2))
REPEAT
IF word$(rank(left))>it$ THEN
REPEAT left=left+1
UNTIL word$(rank(left))<=it$
ENDIF
IF word$(rank(right))<it$ THEN
REPEAT right=right-1
UNTIL word$(rank(right))>=it$
ENDIF
IF left<=right THEN
dummy=rank(left)
rank(left)=rank(right)
rank(right)=dummy
left=left+1
right=right-1
ENDIF
UNTIL left>right
IF right>low THEN PROCquicksort(low,right)
IF left<high THEN PROCquicksort(left,high)
ENDPROC
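
The word-splitting rule used by PROCprocess can also be tried on a string
instead of a file. The following is only a sketch under stated assumptions:
FNisletter and the variable names text$, c% and k% are invented for the
illustration, and the sample sentence is taken from the Buddha text above.
Letters and hyphens build up a word, any other character ends it, and
single-letter words are dropped, as in the listing.

```basic
REM Sketch only: split a string into words using the same character rules
REM FNisletter, text$, c% and k% are hypothetical names for this example
text$ = "the Noble Eightfold Path of right intention, speech, action"
word$ = ""
FOR k% = 1 TO LEN(text$)
  c% = ASC(MID$(text$, k%, 1))
  IF FNisletter(c%) OR c% = 45 THEN
    REM A-Z, a-z or hyphen: keep building the current word
    word$ += CHR$(c%)
  ELSE
    REM anything else ends the word; keep it only if longer than 1 letter
    IF LEN(word$) > 1 THEN PRINT word$
    word$ = ""
  ENDIF
NEXT
REM the string may end in mid-word, so flush whatever is left
IF LEN(word$) > 1 THEN PRINT word$
END
:
DEF FNisletter(c%) = (c% >= 65 AND c% <= 90) OR (c% >= 97 AND c% <= 122)
```

Each word found is printed on its own line. The final check after the loop
matters because the text may stop without a closing space or full stop.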