Tutorial 036 Converting Text files to DATA word lines
As promised in Tutorial 034, here is a way to prepare a text file for
vocabulary extraction. Words are identified and extracted, whilst commas,
full stops and other punctuation are discarded. The words are then written
to another text file as DATA lines.
To use a small example, the text :
CHARLES DARWIN
THE ORIGIN OF SPECIES, 1859
Born at Shrewsbury in 1809, Darwin's father was a leading physician and his
mother was a daughter of Josiah Wedgwood.
when processed by the program, becomes :
DATA CHARLES
DATA DARWIN
DATA THE
DATA ORIGIN
DATA OF
DATA SPECIES
DATA Born
DATA at
DATA Shrewsbury
DATA in
DATA Darwin
DATA father
DATA was
DATA leading
DATA physician
DATA and
DATA his
DATA mother
DATA was
DATA daughter
DATA of
DATA Josiah
DATA Wedgwood
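You can see that only words made of upper case and lower case letters have
survived; the numbers and punctuation are gone. The program also keeps
hyphens and rejects single-letter words, which is why the "a" and the "s" of
"Darwin's" do not appear.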
These DATA lines can be copied and pasted into the program given in Tutorial
034 (See below at ++++)
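The extraction rule can be tried without any file handling. What follows is a minimal sketch, not the author's listing: it applies the same character tests to a string held in memory. The names sample$ and PROCsplit are illustrative and do not appear in the program below.

```bbcbasic
REM Sketch only: split a string into words using the tutorial's rules
REM sample$ and PROCsplit are illustrative names, not part of the original
sample$ = "Born at Shrewsbury in 1809, Darwin's father was a leading physician"
PROCsplit(sample$)
END
:
DEF PROCsplit(t$)
LOCAL i%, c%, w$
t$ += CHR$(10) : REM trailing line feed so that the last word is flushed
FOR i% = 1 TO LEN(t$)
  c% = ASC(MID$(t$, i%, 1))
  IF (c% > 64 AND c% < 91) OR (c% > 96 AND c% < 123) OR c% = 45 THEN
    REM letter or hyphen: add it to the word being built
    w$ += CHR$(c%)
  ELSE
    REM line feed or any other printable character ends the word
    IF c% = 10 OR c% > 31 THEN
      IF LEN(w$) > 1 THEN PRINT "DATA "; w$
      w$ = ""
    ENDIF
  ENDIF
NEXT
ENDPROC
```

Run on the Darwin sentence, this should print the same words as the example above, from Born to physician, with the single letters left out.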
Listing :
REM : Converts a text file to vocab list in DATA statements
REM : in a text file called progname-vocab-DATA.txt
REM : Richard Weston, 30th June 2003
MODE 8
COLOUR1
PRINT'" Press <SPACE> to choose a text file that you wish to convert to DATA statements"
G=GET
:
DIM of% 75, ff% 18, fn% 255
!of% = 76
of%!4 = @hwnd%
of%!12 = ff%
of%!28 = fn%
of%!32 = 256
of%!52 = 6
$ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0
:
SYS "GetOpenFileName", of% TO result%
IF result% filename$ = FNnulterm$(fn%)
COLOUR2
PRINT'" Full Pathname of this text file :"
PRINT'filename$
:
fnum=OPENIN filename$
IF fnum=0 THEN PRINT "No ";filename$;" data": END
:
L=LEN(filename$)
new$=LEFT$(filename$,(L-4))
A$=new$+"-vocab-DATA.txt"
X=OPENOUT(A$)
COLOUR3
PRINT'" X, the output channel = ";X
n=0
COLOUR7
REPEAT
finished=FALSE
word$=""
REPEAT
temp=BGET#fnum : REM Read byte
PROCprocess
UNTIL finished
IF LEN(word$)>1 THEN
n+=1
PRINTTAB(1,10)"Number of words extracted = ";n
text$=word$
PRINT#X, "DATA "+text$
BPUT#X, 10 : REM: 10 is ASCII for a line feed
ENDIF
UNTIL EOF#fnum
CLOSE#fnum
CLOSE#X
:
COLOUR4
PRINT'"You can now look for your new text file :"
COLOUR5
PRINT' A$
COLOUR14
PRINTTAB(1,28)"To see the new file RIGHT-click on the icon then click OPEN"
COLOUR9
PRINT'" Press <SPACE> to see text files"
G=GET
PRINTTAB(0,VPOS-1)SPC(40)
SYS "GetOpenFileName", of% TO result%
END
:
DEF FNnulterm$(P%)
LOCAL A$
WHILE ?P% <> 0
A$ += CHR$?P%
P% += 1
ENDWHILE
= A$
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp>96 AND temp<123 THEN
word$+=CHR$(temp)
ENDIF
:
IF temp=45 THEN word$+=CHR$(temp)
REM^ hyphen
IF temp=10 THEN finished=TRUE
IF temp>31 AND temp<65 THEN finished=TRUE
IF temp=45 THEN finished=FALSE
IF temp>90 AND temp<97 THEN finished=TRUE
IF temp>122 THEN finished=TRUE
ENDPROC
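The inner REPEAT loop in the listing stops only when PROCprocess sets finished, so a file whose last byte is a letter gives it no end symbol to stop on. One possible safeguard, offered as a sketch and not as part of the original program, is to test EOF# in the inner loop as well. The file name eof-demo.txt and the names in%, out% and PROCclassify are made up for this demonstration.

```bbcbasic
REM Sketch: word extraction with an end-of-file guard on the inner loop
REM "eof-demo.txt", in%, out% and PROCclassify are illustrative names
test$ = "hello world" : REM note: no full stop or line feed at the end
out% = OPENOUT("eof-demo.txt")
FOR i% = 1 TO LEN(test$)
  BPUT#out%, ASC(MID$(test$, i%, 1))
NEXT
CLOSE#out%
:
in% = OPENIN("eof-demo.txt")
IF in% = 0 THEN PRINT "Cannot open file" : END
REPEAT
  word$ = "" : finished = FALSE
  REPEAT
    temp = BGET#in%
    PROCclassify
  UNTIL finished OR EOF#in% : REM the extra EOF# test is the safeguard
  IF LEN(word$) > 1 THEN PRINT "DATA "; word$
UNTIL EOF#in%
CLOSE#in%
END
:
DEF PROCclassify
REM Same tests as PROCprocess, folded into one IF
IF (temp > 64 AND temp < 91) OR (temp > 96 AND temp < 123) OR temp = 45 THEN
  word$ += CHR$(temp)
ELSE
  IF temp = 10 OR temp > 31 THEN finished = TRUE
ENDIF
ENDPROC
```

With the guard in place the final word is still written out, because the LEN(word$) test runs after the inner loop has ended for either reason.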
Annotated Listing :
REM : Converts a text file to vocab list in DATA statements
REM : in a text file called progname-vocab-DATA.txt
REM : Richard Weston, 30th June 2003
MODE 8
COLOUR1
PRINT'" Press <SPACE> to choose a text file that you wish to convert to DATA statements"
G=GET
:
DIM of% 75, ff% 18, fn% 255 *** ff% here needs to be at least 18 - that's the number of characters in $ff% below
!of% = 76
of%!4 = @hwnd%
of%!12 = ff%
of%!28 = fn%
of%!32 = 256
of%!52 = 6
$ff% = "Text Files"+CHR$0+"*.txt"+CHR$0+CHR$0 *** text files only allowed!!
:
SYS "GetOpenFileName", of% TO result%
IF result% filename$ = FNnulterm$(fn%)
COLOUR2
PRINT'" Full Pathname of this text file :"
PRINT'filename$
:
fnum=OPENIN filename$ *** OPENS the file you've just chosen
IF fnum=0 THEN PRINT "No ";filename$;" data": END
:
L=LEN(filename$)
new$=LEFT$(filename$,(L-4)) *** strips off the suffix .txt from the file name
A$=new$+"-vocab-DATA.txt" *** adds on an identifying tag to show the file has been processed
X=OPENOUT(A$) *** Prepares to write the new file to disc
COLOUR3
PRINT'" X, the output channel = ";X
n=0 *** used to count the words extracted
COLOUR7
REPEAT *** Processes the text file
finished=FALSE *** word being extracted not yet finished
word$="" *** start with a null string
REPEAT
temp=BGET#fnum : REM Read byte
PROCprocess *** Examines each byte and decides what to do with it... retain or ditch
UNTIL finished *** word now finished - end symbol found, such as a SPACE, comma or full stop
IF LEN(word$)>1 THEN *** reject single-letter words and empty strings
n+=1
PRINTTAB(1,10)"Number of words extracted = ";n
text$=word$
PRINT#X, "DATA "+text$ *** The guts of the program!
BPUT#X, 10 : REM: 10 is ASCII for a line feed
ENDIF
UNTIL EOF#fnum *** end of input file
CLOSE#fnum *** close input file
CLOSE#X *** close output file
:
COLOUR4
PRINT'"You can now look for your new text file :"
COLOUR5
PRINT' A$
COLOUR14
PRINTTAB(1,28)"To see the new file RIGHT-click on the icon then click OPEN"
COLOUR9
PRINT'" Press <SPACE> to see text files"
G=GET
PRINTTAB(0,VPOS-1)SPC(40) *** blanks the line <<PRINT'" Press <SPACE> to see text files">>
SYS "GetOpenFileName", of% TO result% *** Opens the FileName Window so you can see and open your new file produced
END
:
DEF FNnulterm$(P%) *** Reads the null-terminated string stored at memory address P%
LOCAL A$
WHILE ?P% <> 0
A$ += CHR$?P%
P% += 1
ENDWHILE
= A$
:
DEF PROCprocess
IF temp>64 AND temp<91 THEN *** Capital letters A,B,C...
word$+=CHR$(temp) *** add letter to word
ENDIF
:
IF temp>96 AND temp<123 THEN *** Lower Case letters a,b,c...
word$+=CHR$(temp) *** add letter to word
ENDIF
:
IF temp=45 THEN word$+=CHR$(temp) *** retain hyphens
REM^ hyphen
IF temp=10 THEN finished=TRUE *** end of a line of text
IF temp>31 AND temp<65 THEN finished=TRUE *** SPACE, !, ", # etc
IF temp=45 THEN finished=FALSE *** hyphen: cancels the line above, a hyphen does not end a word
IF temp>90 AND temp<97 THEN finished=TRUE *** [, \, ] etc
IF temp>122 THEN finished=TRUE *** {, |, }, etc
ENDPROC
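The naming of the output file can be checked in isolation. This small sketch uses a made-up path, and FNoutname is an illustrative name that is not in the listing. Like the listing, it assumes the chosen file name ends in a four-character suffix such as .txt.

```bbcbasic
REM Sketch: how the output file name is built (the path is a made-up example)
filename$ = "C:\texts\darwin.txt"
PRINT FNoutname(filename$)
END
:
DEF FNoutname(f$)
REM Drop the last four characters (".txt") and add the identifying tag
= LEFT$(f$, LEN(f$) - 4) + "-vocab-DATA.txt"
```

Because the new name keeps the original folder, the DATA file appears alongside the text file it was made from.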
++++ In the example given above, when the DATA lines are appended to the
program of Tutorial 034, followed by the closing DATA * line that the program
uses to detect the end of the list, we get :
REM: Quicksort words
REM : Virgil's Vocab
REM: Richard Weston after "Simon"
REM: 26th June 2003
MODE8:OFF
t=0
newword=0
VDU14
COLOUR9
PRINTTAB(15)"Sorting Words using *Quicksort*..."'
COLOUR7
REPEAT
t+=1
READ word$
UNTIL word$="*"
DIM word$(t),rank(t)
RESTORE
FOR i=1 TO t-1
READ word$(i)
NEXT i
TIME=0
:
PROCsort
:
T=TIME/100
COLOUR11
PRINT"That took ";T;"secs to sort ";t;" words"
PRINT'"Press <SHIFT> to scroll down to see the rest"'
COLOUR7
FOR i=t-1 TO 1 STEP -1
pos=POS:L=LEN(word$(rank(i)))
:
PROCcheck_already_met
:
IF alreadymet=FALSE THEN
newword+=1
IF L < (80-pos) THEN
PRINTword$(rank(i))+" ";
ELSE
PRINT'word$(rank(i))+" ";
ENDIF
ENDIF
ENDIF
NEXT i
PRINT'
COLOUR9
PRINT newword;" different words counted but e.g. <You> and <you> each count"
COLOUR2
PRINT'" Press <SPACE> to go again..."
G=GET
RUN
END
:
DEF PROCcheck_already_met
alreadymet=FALSE
IF word$(rank(i))=word$(rank(i+1)) THEN
alreadymet=TRUE
ENDIF
ENDPROC
:
DEF PROCsort:LOCAL I
FOR I = 1 TO t
rank(I)=I
NEXT
PROCquicksort(1,t)
ENDPROC
:
DEF PROCquicksort(low,high)
LOCAL left,right,it$,dummy
left=low:right=high
it$=word$(rank((low+high)DIV 2))
REPEAT
IF word$(rank(left))>it$ THEN
REPEAT left=left+1
UNTIL word$(rank(left))<=it$
ENDIF
IF word$(rank(right))<it$ THEN
REPEAT right=right-1
UNTIL word$(rank(right))>=it$
ENDIF
ENDIF
IF left<=right THEN
dummy=rank(left)
rank(left)=rank(right)
rank(right)=dummy
left=left+1
right=right-1
ENDIF
UNTIL left>right
IF right>low THEN PROCquicksort(low,right)
IF left<high THEN PROCquicksort(left,high)
ENDPROC
:
DATA CHARLES
DATA DARWIN
DATA THE
DATA ORIGIN
DATA OF
DATA SPECIES
DATA Born
DATA at
DATA Shrewsbury
DATA in
DATA Darwin
DATA father
DATA was
DATA leading
DATA physician
DATA and
DATA his
DATA mother
DATA was
DATA daughter
DATA of
DATA Josiah
DATA Wedgwood
DATA *
Whose text output is the alphabetical vocab list :
Born CHARLES DARWIN Darwin Josiah OF ORIGIN SPECIES
Shrewsbury THE Wedgwood and
at daughter father his in leading mother of physician was
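The Tutorial 034 program relies on the final DATA * line to find out how many words there are before it dimensions its arrays. The following sketch shows that count-then-RESTORE pattern on its own, with three sample words. The names n%, w$ and list$() are illustrative, and the sketch assumes at least one word comes before the sentinel.

```bbcbasic
REM Sketch: count DATA items up to a "*" sentinel, then read them into an array
REM n%, w$ and list$() are illustrative names
n% = 0
REPEAT
  READ w$
  n% += 1
UNTIL w$ = "*"
n% -= 1 : REM do not count the sentinel itself
DIM list$(n%)
RESTORE : REM go back to the first DATA item
FOR i% = 1 TO n%
  READ list$(i%)
NEXT
PRINT ;n%; " words read; first = "; list$(1); ", last = "; list$(n%)
END
:
DATA Born, at, Shrewsbury
DATA *
```

The file written by the converter holds only the word lines, so the DATA * line has to be added by hand after pasting them in.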