Tutorial 037 Converting a Text File directly to a Vocabulary List


By amalgamating the functions of the programs given in Tutorials 34, 35 and 36, the following program lets you extract the vocabulary from any text file.

Using the following text :

When Buddha became enlightened he acquired knowledge of the Four Truths:
the fact of pain or evil, and that everything is suffering;
that the cause of suffering is desire;
that the suppression of desire brings about the suppression of suffering;
and that to suppress suffering you must adhere to the Noble Eightfold
Path of right intention, speech, action, right livelihood, effort,
mindfulness and right concentration.

...we get the following output :



 Press <SPACE> to choose a text file whose vocabulary you wish to extract

 Full Pathname of this text file :

C:\Documents and Settings\Desktop\Buddha.txt

  Number of words extracted = 65

               Sorting Words using *Quicksort*...

That took 1.25secs to sort 25000 memory locations (only 65 used)

If necessary press <SHIFT> to scroll down to see the rest

Buddha Eightfold Four Noble Path Truths When about acquired action adhere and
became brings cause concentration desire effort enlightened everything evil
fact he intention is knowledge livelihood mindfulness must of or pain right
speech suffering suppress suppression that the to you

        41 different words counted but e.g. <You> and <you> each count

 Press<SPACE> to go again...


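You can see that of the 65 words extracted, only 41 were distinct. The comparison is case-sensitive, so <You> and <you> would count as two different words, and because the sort is by character code, capitalised words are listed before lower-case ones. As given, the program can handle up to 25,000 extractable words in the text file (the value of t in the listing); single characters such as "a" and "I" are not extracted. The listing probably needs no further explanation beyond that given in the previous three tutorials.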
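The distinct-word count relies on the list being sorted first, so that repeated words end up next to each other. The following is only a minimal sketch of that idea, not part of the program itself; the six sample words and the names w$() and distinct% are made up for illustration.

```bbcbasic
REM Sketch: counting distinct words in a list that is already sorted
REM The sample words and variable names are illustrative assumptions
DIM w$(6)
FOR i%=1 TO 6
  READ w$(i%)
NEXT
distinct%=0
FOR i%=1 TO 6
  REM A word is new if it differs from its predecessor (w$(0) is empty)
  IF w$(i%)<>w$(i%-1) THEN distinct%+=1
NEXT
PRINT ;distinct%;" distinct words out of 6"
END
DATA You,desire,desire,suffering,suffering,you
```

Here "You" and "you" are counted separately, just as in the main program, because the string comparison is case-sensitive.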


If you have a Word file whose vocabulary you would like to extract, load the file into Word and then save it as a text file as follows :
  1. Click "File"
  2. Click "Save As"
  3. Edit the file name, for instance by adding TEXT to the existing filename
  4. Where it says "Save as Type" click the downward pointing arrow
  5. Choose "Text Only with Line breaks"
  6. Click Save
  7. Find the new text file in the directory in which you saved it
  8. Double click it to see the result
  9. You are now ready to extract the vocab using this tutorial's program

Have fun! (and please email me with any problems/bugs/etc.)

Listing :

      REM : Converts a text file to vocab list
      REM : Best used with pure prose text
      REM : Richard Weston, 1st July 2003
      MODE 8
      t=25000 :REM word capacity
      DIM word$(t),rank(t)
      COLOUR1
      PRINT'" Press <SPACE> to choose a text file whose vocabulary you wish to extract"
      G=GET
      OFF
      :
      DIM of% 75, ff% 18, fn% 255
      !of% = 76
      of%!4 = @hwnd%
      of%!12 = ff%
      of%!28 = fn%
      of%!32 = 256
      of%!52 = 6
      $ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
      :
      SYS "GetOpenFileName", of% TO result%
      IF result% THEN filename$ = FNnulterm$(fn%) ELSE PRINT '" No file chosen" : END
      COLOUR7
      PRINT'" Full Pathname of this text file :"
      COLOUR2
      PRINT'filename$
      :
      fnum=OPENIN filename$
      IF fnum=0 THEN PRINT "No ";filename$;" data": END
      :
      n=0
      COLOUR7
      REPEAT
        finished=FALSE
        word$=""
        REPEAT
          temp=BGET#fnum :REM Read byte
          PROCprocess
        UNTIL finished OR EOF#fnum :REM EOF test in case the file ends without a terminating character
        IF LEN(word$)>1 THEN
          n+=1
          PRINTTAB(1,8)" Number of words extracted = ";n
          word$(n)=word$ 
        ENDIF
      UNTIL  EOF#fnum
      CLOSE#fnum
      :

      newword=0
      VDU14
      COLOUR9
      PRINT'TAB(15)"Sorting Words using *Quicksort*..."'
      COLOUR7
      :
      TIME=0
      :
      PROCsort
      :
      T=TIME/100
      COLOUR11
      PRINT"That took ";T;"secs to sort ";t;" memory locations (only ";n;" used)"
      PRINT'"If necessary press <SHIFT> to scroll down to see the rest"'
      COLOUR7
      REM rank() is in descending order (unused blanks last), so step backwards for A to Z
      FOR i=t-1 TO 1 STEP -1
        pos=POS:L=LEN(word$(rank(i)))
        :
        PROCcheck_already_met
        :
        IF alreadymet=FALSE THEN
          newword+=1
          IF L < (80-pos) THEN
            PRINTword$(rank(i))+" ";
          ELSE
            PRINT'word$(rank(i))+" ";
          ENDIF
        ENDIF
      NEXT i
      PRINT'
      COLOUR9
      PRINT newword;" different words counted but e.g. <You> and <you> each count"
      COLOUR2
      PRINT'" Press<SPACE> to go again..."
      G=GET
      RUN
      END
      :
      DEF FNnulterm$(P%)
      LOCAL A$
      WHILE ?P% <> 0
        A$ += CHR$?P%
        P% += 1
      ENDWHILE
      = A$
      :
      DEF PROCprocess
      IF temp>64 AND temp<91 THEN
        word$+=CHR$(temp)
      ENDIF
      :
      IF temp>96 AND temp<123 THEN
        word$+=CHR$(temp)
      ENDIF
      :
      IF temp=45 THEN  word$+=CHR$(temp)
      REM^ hyphen
      IF temp=10 THEN finished=TRUE
      IF temp>31 AND temp<65 THEN finished=TRUE
      IF temp=45 THEN finished=FALSE
      IF temp>90 AND temp<97 THEN finished=TRUE
      IF temp>122 THEN finished=TRUE
      ENDPROC
      :
      :
      DEF PROCcheck_already_met
      alreadymet=FALSE
      IF word$(rank(i))=word$(rank(i+1)) THEN
        alreadymet=TRUE
      ENDIF
      ENDPROC
      :
      DEF PROCsort:LOCAL I
      FOR I = 1 TO t
        rank(I)=I
      NEXT
      PROCquicksort(1,t)
      ENDPROC
      :
      DEF PROCquicksort(low,high)
      LOCAL left,right,it$,dummy
      left=low:right=high
      it$=word$(rank((low+high)DIV 2))
      REPEAT
        IF word$(rank(left))>it$ THEN
          REPEAT left=left+1
          UNTIL word$(rank(left))<=it$
        ENDIF
        IF word$(rank(right))<it$ THEN
          REPEAT right=right-1
          UNTIL word$(rank(right))>=it$
        ENDIF
        IF left<=right THEN
          dummy=rank(left)
          rank(left)=rank(right)
          rank(right)=dummy
          left=left+1
          right=right-1
        ENDIF
      UNTIL left>right
      IF right>low THEN PROCquicksort(low,right)
      IF left<high THEN PROCquicksort(left,high)
      ENDPROC
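As the output notes, <You> and <you> each count as a separate word. If you would rather treat them as the same word, one possible approach is to fold each word to lower case before storing it. The following is a minimal sketch under that assumption; FNlower$ is a hypothetical helper that does not appear in the listing above.

```bbcbasic
REM Sketch: folding a word to lower case before it is stored
REM FNlower$ is a hypothetical helper, not part of the original program
PRINT FNlower$("Buddha")
PRINT FNlower$("Noble-Eightfold")
END
:
DEF FNlower$(a$)
LOCAL i%,c%,b$
FOR i%=1 TO LEN(a$)
  c%=ASC(MID$(a$,i%,1))
  REM Codes 65 to 90 are A to Z; adding 32 gives a to z
  IF c%>64 AND c%<91 THEN c%+=32
  b$+=CHR$(c%)
NEXT
=b$
```

To try it, add the function to the end of the program and change the line word$(n)=word$ to word$(n)=FNlower$(word$). Bear in mind that proper nouns such as Buddha would then be listed in lower case too.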


