Tutorial 036 Converting Text Files to DATA Word Lines

As promised in Tutorial 034, here is a way to prepare a text file for vocabulary extraction. The program identifies and extracts the words, discards the commas, full stops and other punctuation, and then writes each word to another text file as a DATA line.

To use a small example, the text :

CHARLES DARWIN
THE ORIGIN OF SPECIES, 1859

Born at Shrewsbury in 1809, Darwin's father was a leading physician and his
mother was a daughter of Josiah Wedgwood.

When processed by the program becomes :

DATA CHARLES
DATA DARWIN
DATA THE
DATA ORIGIN
DATA OF
DATA SPECIES
DATA Born
DATA at
DATA Shrewsbury
DATA in
DATA Darwin
DATA father
DATA was
DATA leading
DATA physician
DATA and
DATA his
DATA mother
DATA was
DATA daughter
DATA of
DATA Josiah
DATA Wedgwood


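You can see that only words made of upper case and lower case letters have survived. The digits (1859, 1809) and the punctuation are discarded, and single-letter fragments such as "a" and the "s" left over from "Darwin's" are rejected. Hyphens inside words are kept.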
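
As a rough sketch of those rules (not part of the original program), the same test can be tried on a string held in memory rather than on a file. FNkeep and the variable names are my own, and unlike PROCprocess in the listing this version treats every byte that is not a letter or a hyphen as the end of a word :

      REM Sketch: apply the tutorial's rules to a string instead of a file
      text$="Born at Shrewsbury in 1809, Darwin's father was a leading physician"
      word$=""
      FOR i%=1 TO LEN(text$)+1
        REM One extra pass with a SPACE flushes the final word
        IF i%>LEN(text$) THEN c%=32 ELSE c%=ASC(MID$(text$,i%,1))
        IF FNkeep(c%) THEN
          word$+=CHR$(c%)
        ELSE
          REM End of a word: keep it only if it has two or more characters
          IF LEN(word$)>1 THEN PRINT "DATA ";word$
          word$=""
        ENDIF
      NEXT i%
      END
      :
      DEF FNkeep(c%)
      REM TRUE for A-Z, a-z and the hyphen; anything else ends a word
      IF c%>=65 AND c%<=90 THEN =TRUE
      IF c%>=97 AND c%<=122 THEN =TRUE
      IF c%=45 THEN =TRUE
      =FALSE

Run on its own, this should print DATA Born, DATA at, DATA Shrewsbury, DATA in, DATA Darwin and so on, matching the corresponding lines of the example above.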

These DATA lines can be copied and pasted into the program given in Tutorial 034 (See below at ++++)
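
The program of Tutorial 034 keeps reading words until it meets a * so the pasted list needs a final DATA * line. As a small sketch, assuming a made-up file name demo-vocab-DATA.txt and a helper PROCwriteline that is not in the original listing, the end marker could be written to the file along with the words :

      REM Sketch: write DATA lines plus the "*" end marker to a text file
      X=OPENOUT("demo-vocab-DATA.txt")
      IF X=0 THEN PRINT "Cannot create file": END
      PROCwriteline(X,"DATA CHARLES")
      PROCwriteline(X,"DATA DARWIN")
      PROCwriteline(X,"DATA *")
      CLOSE#X
      END
      :
      DEF PROCwriteline(chan%,line$)
      PRINT#chan%, line$ : REM in BBC BASIC for Windows this ends the string with a CR
      BPUT#chan%, 10     : REM the LF completes a CR+LF line ending
      ENDPROC

In BBC BASIC for Windows PRINT# ends a string with a carriage return only, which is why the listing below follows each PRINT# with BPUT#X, 10.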

Listing :
     
      REM : Converts a text file to vocab list in DATA statements
      REM : in a text file called progname-vocab-DATA.txt
      REM : Richard Weston, 30th June 2003
      MODE 8
      COLOUR1
      PRINT'" Press <SPACE> to choose a text file that you wish to convert to DATA statements"
      G=GET
      :
      DIM of% 75, ff% 18, fn% 255
      !of% = 76
      of%!4 = @hwnd%
      of%!12 = ff%
      of%!28 = fn%
      of%!32 = 256
      of%!52 = 6
      $ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
      :
      SYS "GetOpenFileName", of% TO result%
      IF result% filename$ = FNnulterm$(fn%)
      COLOUR2
      PRINT'" Full Pathname of this text file :"
      PRINT'filename$
      :
      fnum=OPENIN filename$
      IF fnum=0 THEN PRINT "No ";filename$;" data": END
      :
      L=LEN(filename$)
      new$=LEFT$(filename$,(L-4))
      A$=new$+"-vocab-DATA.txt"
      X=OPENOUT(A$)
      COLOUR3
      PRINT'" X, the output channel = ";X
      n=0
      COLOUR7
      REPEAT
        finished=FALSE
        word$=""
        REPEAT
          temp=BGET#fnum :REM Read byte
          PROCprocess
        UNTIL finished OR EOF#fnum : REM also stop if the file ends mid-word
        IF LEN(word$)>1 THEN
          n+=1
          PRINTTAB(1,10)"Number of words extracted = ";n
          text$=word$
          PRINT#X, "DATA "+text$
          BPUT#X, 10 : REM: 10 is ASCII for a line feed
        ENDIF
      UNTIL  EOF#fnum
      CLOSE#fnum
      CLOSE#X
      :
      COLOUR4
      PRINT'"You can now look for your new text file :"
      COLOUR5
      PRINT' A$
      COLOUR14
      PRINTTAB(1,28)"To see the new file RIGHT-click on the icon then click OPEN"
      COLOUR9
      PRINT'" Press <SPACE> to see text files"
      G=GET
      PRINTTAB(0,VPOS-1)SPC(40)
      SYS "GetOpenFileName", of% TO result%
      END
      :
      DEF FNnulterm$(P%)
      LOCAL A$
      WHILE ?P% <> 0
        A$ += CHR$?P%
        P% += 1
      ENDWHILE
      = A$
      :
      DEF PROCprocess
      IF temp>64 AND temp<91 THEN
        word$+=CHR$(temp)
      ENDIF
      :
      IF temp>96 AND temp<123 THEN
        word$+=CHR$(temp)
      ENDIF
      :
      IF temp=45 THEN  word$+=CHR$(temp)
      REM^ hyphen
      IF temp=10 THEN finished=TRUE
      IF temp>31 AND temp<65 THEN finished=TRUE
      IF temp=45 THEN finished=FALSE
      IF temp>90 AND temp<97 THEN finished=TRUE
      IF temp>122 THEN finished=TRUE
      ENDPROC


Annotated Listing :
     
      REM : Converts a text file to vocab list in DATA statements
      REM : in a text file called progname-vocab-DATA.txt
      REM : Richard Weston, 30th June 2003
      MODE 8
      COLOUR1
      PRINT'" Press <SPACE> to choose a text file that you wish to convert to DATA statements"
      G=GET
      :
      DIM of% 75, ff% 18, fn% 255 *** ff% here needs to be at least 18 - that's the number of characters in $ff% below
      !of% = 76
      of%!4 = @hwnd%
      of%!12 = ff%
      of%!28 = fn%
      of%!32 = 256
      of%!52 = 6
      $ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0 *** text files only allowed!!
      :
      SYS "GetOpenFileName", of% TO result%
      IF result% filename$ = FNnulterm$(fn%)
      COLOUR2
      PRINT'" Full Pathname of this text file :"
      PRINT'filename$
      :
      fnum=OPENIN filename$ *** OPENS the file you've just chosen
      IF fnum=0 THEN PRINT "No ";filename$;" data": END
      :
      L=LEN(filename$)
      new$=LEFT$(filename$,(L-4)) *** strips off the suffix .txt from the file name
      A$=new$+"-vocab-DATA.txt" *** adds on an identifying tag to show the file has been processed
      X=OPENOUT(A$)  *** Prepares to write the new file to disc
      COLOUR3
      PRINT'" X, the output channel = ";X
      n=0 *** used to count the words extracted
      COLOUR7
      REPEAT *** Processes the text file
        finished=FALSE *** word being extracted not yet finished
        word$="" *** start with a null string
        REPEAT
          temp=BGET#fnum : REM Read byte
          PROCprocess *** Examines each byte and decides what to do with it... retain or ditch
        UNTIL finished OR EOF#fnum *** word now finished - end symbol found, such as a SPACE, comma or full stop, or the file has ended
        IF LEN(word$)>1 THEN *** reject empty strings and single-letter words
          n+=1
          PRINTTAB(1,10)"Number of words extracted = ";n
          text$=word$
          PRINT#X, "DATA "+text$ *** The guts of the program!
          BPUT#X, 10 : REM: 10 is ASCII for a line feed
        ENDIF
      UNTIL  EOF#fnum  *** end of  input file
      CLOSE#fnum  *** close input file
      CLOSE#X *** close output file
      :
      COLOUR4
      PRINT'"You can now look for your new text file :"
      COLOUR5
      PRINT' A$
      COLOUR14
      PRINTTAB(1,28)"To see the new file RIGHT-click on the icon then click OPEN"
      COLOUR9
      PRINT'" Press <SPACE> to see text files"
      G=GET
      PRINTTAB(0,VPOS-1)SPC(40) *** blanks the line " Press <SPACE> to see text files"
      SYS "GetOpenFileName", of% TO result% *** Reopens the file dialog so you can see and open the new file just produced
      END
      :
      DEF FNnulterm$(P%) *** Reads the null-terminated string stored at memory address P%
      LOCAL A$
      WHILE ?P% <> 0
        A$ += CHR$?P%
        P% += 1
      ENDWHILE
      = A$
      :
      DEF PROCprocess
      IF temp>64 AND temp<91 THEN *** Capital letters A,B,C...
        word$+=CHR$(temp) *** add letter to word
      ENDIF
      :
      IF temp>96 AND temp<123 THEN *** Lower Case letters a,b,c...
        word$+=CHR$(temp) *** add letter to word
      ENDIF
      :
      IF temp=45 THEN  word$+=CHR$(temp) *** retain hyphens
      REM^ hyphen
      IF temp=10 THEN finished=TRUE *** line feed: end of a line of text
      IF temp>31 AND temp<65 THEN finished=TRUE *** SPACE, !, ", # etc and the digits 0-9
      IF temp=45 THEN finished=FALSE *** the hyphen (45) lies in that range, so undo the line above for it
      IF temp>90 AND temp<97 THEN finished=TRUE *** [, \, ] etc
      IF temp>122 THEN finished=TRUE *** {, |, }, etc
      ENDPROC
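
For reference, here is the file-dialog part of the program on its own, with each field of the Windows OPENFILENAME structure named in a REM. The field names are the standard Windows ones rather than anything in the original listing, and the final PRINT line is only there so that the fragment can be run by itself :

      REM Sketch: the same file-dialog set-up with each field named
      DIM of% 75, ff% 18, fn% 255
      !of%   = 76     : REM lStructSize: size of the structure in bytes
      of%!4  = @hwnd% : REM hwndOwner: the program's own window
      of%!12 = ff%    : REM lpstrFilter: address of the filter string
      of%!28 = fn%    : REM lpstrFile: buffer that receives the full pathname
      of%!32 = 256    : REM nMaxFile: size of that buffer in bytes
      of%!52 = 6      : REM Flags: 2 (overwrite prompt) + 4 (hide read-only box)
      $ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
      SYS "GetOpenFileName", of% TO result%
      IF result% THEN PRINT FNnulterm$(fn%) ELSE PRINT "No file chosen"
      END
      :
      DEF FNnulterm$(P%)
      LOCAL A$
      WHILE ?P% <> 0
        A$ += CHR$?P%
        P% += 1
      ENDWHILE
      = A$

The filter string holds a description and a pattern, each ended by CHR$0, with one more CHR$0 to close the list. That is where the 18 characters counted for ff% come from.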


 ++++ In the example given above, when the DATA lines are appended to the program of Tutorial 034, with a final DATA * line added to mark the end of the list, we get :

      REM: Quicksort words
      REM : Virgil's Vocab
      REM: Richard Weston after "Simon"
      REM: 26th June 2003
      MODE8:OFF
      t=0
      newword=0
      VDU14
      COLOUR9
      PRINTTAB(15)"Sorting Words using *Quicksort*..."'
      COLOUR7
      REPEAT
        t+=1
        READ word$
      UNTIL word$="*"
      DIM word$(t),rank(t)
      RESTORE
      FOR i=1 TO t-1
        READ word$(i)
      NEXT i
      TIME=0
      :
      PROCsort
      :
      T=TIME/100
      COLOUR11
      PRINT"That took ";T;"secs to sort ";t;" words"
      PRINT'"Press <SHIFT> to scroll down to see the rest"'
      COLOUR7
      FOR i=t-1 TO 1 STEP -1
        pos=POS:L=LEN(word$(rank(i)))
        :
        PROCcheck_already_met
        :
        IF alreadymet=FALSE THEN
          newword+=1
          IF L < (80-pos) THEN
            PRINTword$(rank(i))+" ";
          ELSE
            PRINT'word$(rank(i))+" ";
          ENDIF
        ENDIF
      NEXT i
      PRINT'
      COLOUR9
      PRINT newword;" different words counted but e.g. <You> and <you> each count"
      COLOUR2
      PRINT'" Press<SPACE> to go again..."
      G=GET
      RUN
      END
      :
      DEF PROCcheck_already_met
      alreadymet=FALSE
      IF word$(rank(i))=word$(rank(i+1)) THEN
        alreadymet=TRUE
      ENDIF
      ENDPROC
      :
      DEF PROCsort:LOCAL I
      FOR I = 1 TO t
        rank(I)=I
      NEXT
      PROCquicksort(1,t)
      ENDPROC
      :
      DEF PROCquicksort(low,high)
      LOCAL left,right,it$,dummy
      left=low:right=high
      it$=word$(rank((low+high)DIV 2))
      REPEAT
        IF word$(rank(left))>it$ THEN
          REPEAT left=left+1
          UNTIL word$(rank(left))<=it$
        ENDIF
        IF word$(rank(right))<it$ THEN
          REPEAT right=right-1
          UNTIL word$(rank(right))>=it$
        ENDIF
        IF left<=right THEN
          dummy=rank(left)
          rank(left)=rank(right)
          rank(right)=dummy
          left=left+1
          right=right-1
        ENDIF
      UNTIL left>right
      IF right>low THEN PROCquicksort(low,right)
      IF left<high THEN PROCquicksort(left,high)
      ENDPROC
      :
      DATA CHARLES
      DATA DARWIN
      DATA THE
      DATA ORIGIN
      DATA OF
      DATA SPECIES
      DATA Born
      DATA at
      DATA Shrewsbury
      DATA in
      DATA Darwin
      DATA father
      DATA was
      DATA leading
      DATA physician
      DATA and
      DATA his
      DATA mother
      DATA was
      DATA daughter
      DATA of
      DATA Josiah
      DATA Wedgwood
      DATA *


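The text output of this program is the vocab list, sorted in ASCII order (so capitals come before lower case letters), with the repeated word "was" shown only once :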
         
Born CHARLES DARWIN Darwin Josiah OF ORIGIN SPECIES Shrewsbury THE Wedgwood and
at daughter father his in leading mother of physician was
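
The sort puts capitals before lower case letters and, as the program's own message says, <You> and <you> each count as different words. If that is not wanted, one possible tweak (my assumption, not part of either tutorial) is to fold each word to lower case before it is written as a DATA line. FNlower$ is a hypothetical helper :

      REM Sketch: fold a word to lower case
      PRINT FNlower$("DARWIN")
      PRINT FNlower$("Shrewsbury")
      END
      :
      DEF FNlower$(w$)
      LOCAL i%, c%, out$
      REM A FOR loop always runs at least once, so deal with "" first
      IF w$="" THEN =""
      FOR i%=1 TO LEN(w$)
        c%=ASC(MID$(w$,i%,1))
        IF c%>=65 AND c%<=90 THEN c%+=32 : REM A-Z becomes a-z
        out$+=CHR$(c%)
      NEXT i%
      =out$

This should print darwin and shrewsbury. In the converter it could be used as text$=FNlower$(word$) just before the PRINT#X line.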

