Tutorial 034 "Virgil's Vocab" - Sorting Words using Quicksort

In Tutorial 032 I gave you a program to count the words in a text file. The following program orders words (alphabetically) that it finds in its own DATA statements. (In a subsequent tutorial I will show you how to automate the production of the required data statements, by stripping out the words from any text file, which could be a whole "booksworth" downloaded from the net).

For example, the following is part of the output produced by processing the 20,957 words stripped out from a text file of the same translation of  VIRGIL'S GEORGICS mentioned in Tutorial 032 . Here's the first "screensworth" :


               Sorting Words using *Quicksort*...

That took 1.06secs to sort 20,957 words

Press <SHIFT> to scroll down to see the rest


About Above Abydos Acanthus According Acerrae Achelous Acheron Achilles Acorns
Acte Aegypt Aegyptus Aesculus Aether Aethiop Afric Again Against Ah Alack
Alburnus Alcinous Alders Alive All Allays Allotted Allure Alone Along Alpheus
Alps Alternately Amazement Amerian Amid Amidst Aminaean Among Amphrysian
Amyclae Amyclaean Amythaon An And Anio Aonian Apollo Appear Apples Aquarius
Arabs Arcady Arched Arctos Arcturus Are Arethusa Arethuse Argitis Aristaeus
Armed Arms Around As Ascanian Ascra Asia Asian Asilus Assaracus Assyrian At Ate
Athos Athwart Atlas Attend Aurora Ausonian Auster Autumn Avenged Avernian Ay
Aye Bacchanalian Bacchus Bactria Baffled Baked Bakes Balearic Bare Barred
Barren Bathed Batter Be Bears Beasts Bees Before Begins Begrudge Behind Behold
Belated Belgian Benacus Bend Beneath Beroe Beside Bespake Besport Best Besteads
Betwixt Bide Binding Bisaltic Black Blesses Blest Bloom Blush Bones Boon Bootes
Boreas Borne Both Brass Brave Breathed Breed Breeds Bridge Brief Bright Brim
Bring Bristle Britain Broad Brooks Brother Brought Bruised Brushing Build
Bumastus Burn Burst Bursts Busiris But Butcher By Caerulean Caesar Caicus
Calabrian Came Camilli Can Canopus Capua Caressing Carpathian Carry Cast
Castalia Cattle Caucasus Cayster Cecropian Cecrops Cedar Celeus Centaurs
Centaury Ceos Ceraunian Cerberus Ceres Challenge Chalybs Change Chaonian Chaos
Charmed Chestnut Chestnuts Chides Chiefly Chiron Choose Churned Ciconian Cinyps
Cithaeron Clamour Clanian Clasped Claws Clear Clear-orbed Clings Clio Clitumnus
Close-pent Closes Clutch Clutching Clymene Coal-black Cocytus Coeus Come Comes
Community Confide Congeals Conspicuous Convulsed Corn-heap Corroded Corycus
Could Counted Country Crams Cretan Cries Cross-wise Crouch Crustumian Curb
Curetes Cyclopes Cyclops Cydippe Cyllarus Cyllenian Cynthus Cyrene Cytisus
Cytorus Dacian Dance Danube Dark Dark-eddying Dark-threatening Darken Dash
Daughter Dauntless Dawn Dawned Dead Deal Death-cold Decii Deemed Deep .....


and the last (21st!) screen comes up as:

walls walnut walnut-fruit wan wand wander wandered wandereth wandering wane
wanton war war-god war-horse war-tried war-trump ward wardship ware warlike
warm warmth warn warned warneth warning warnings warp warrior warriors wars was
wash washes waste wastes watch watches water water-courses water-side
water-snake water-trough water-worms waters watery wave wave-top wavers waves
waving wax wax-flower waxed waxen waxes way way-wildered wayfarer we weakling
wealth weapons wearers weary weave webs wed wedges weed weeds weedy weeping
weeps weevil weigh-up weight weighty-wise welcome welcomed welfare well
well-beloved well-drilled well-known well-nigh well-springs wend went wept were
west westward wet what whate whatso wheat-ear wheaten wheel wheel-spokes wheels
whelps when whenas whence where whereby wherefore wherefrom whereof whereso
whereto wherewith wherries whether whets whetting whey which while whilst whirl
whirled whirling whirls whirlwind whirr whit white white-glistening whiten
whitened whiteness whither whittled who whole wholesome wholly whom whose why
wicker-ware wide wide-spread wield wielder wife wild wiles will willed willing
willow willow-bands willow-branches willow-leaves willow-scythe willows wills
wilt wilt- win wind winding windless window-slits winds windy wine wine-cup
wine-god wine-infuriate wine-juice wine-press winepress wines winged winging
wings winnow winnowing winnowing-time wins winter winter-suns winters wintry
wise wit with withal withholds withies within without witless woebegone wolf
wolf-kin wolves womb won wonder wonderment wondrous wondrous-pale wont wonted
wood wooded woodland woods wooed wool wool-clad woolly word words work world
worm worse worser worships worst-hued worsted worthy wot would wouldst wound
wounds wove woven wrap wrath wreak wreathe wreathed wreathes wreaths wretched
wretches wrong wrongs wrought yawn yawns ye yea yeaning year yearly yearning
years yellow yellow-haired yellowing yet yew-tree yews yield yielding yields
yoke yon yore you young younglings your youth youthful zeal zoned zones


      5738 different words counted but -- <You> and <you> each count

 Press<SPACE> to go again...


You can see its done a fantastic job of producing an index or vocab list, call it what you will, and it indictaes what makes Virgil Virgil (not forgetting his translators!). You will also see that capital letters have higher priority than lower case ones, and thus all proper nouns and first words in the verse lines appear before the more "normal" ( lowercase-only) words.

The program avoids printing the same word more than once, when it occurs repeatedly in the text. For instance you expect a lot of "the"s etc. At present there may be a  capitalised version and a non-capitalised version of some words. I leave it as an exercise, for you, to solve that problem!

The following program listing gives a version of the above but with only 44 words in DATA,  to illustrate the action; the sort is so fast it takes less than 0.01sec to run, and there's no need here for the scrolling facility. Its also small enough to run on the free (trial) version of BB4W...


Output :
               Sorting Words using *Quicksort*...

That took 0secs to sort 44 words

Press <SHIFT> to scroll down to see the rest

Maecenas Of Or Such What are bees beneath cattle-keeping cornfield elm for how
is it makes marry meet my or pains patient proof serves smile sod star steer
tend the themes thrifty to trial turn vine what with

        38 different words counted but e.g. <You> and <you> each count

 Press<SPACE> to go again...


In the following you can just read down the words in the DATA statements to see the story starting ...

Listing :
    
      REM: Quicksort words
      REM : Virgil's Vocab
      REM: Richard Weston after "Simon"
      REM: 26th June 2003
      MODE8:OFF
      t=0
      newword=0
      VDU14
      COLOUR9
      PRINTTAB(15)"Sorting Words using *Quicksort*..."'
      COLOUR7
      REPEAT
        t+=1
        READ word$
      UNTIL word$="*"
      DIM word$(t),rank(t)
      RESTORE
      FOR i=1 TO t-1
        READ word$(i)
      NEXT i
      TIME=0
      :
      PROCsort
      :
      T=TIME/100
      COLOUR11
      PRINT"That took ";T;"secs to sort ";t;" words"
      PRINT'"Press <SHIFT> to scroll down to see the rest"'
      COLOUR7
      FOR i=t-1 TO 1 STEP -1
        pos=POS:L=LEN(word$(rank(i)))
        :
        PROCcheck_already_met
        :
        IF alreadymet=FALSE THEN
          newword+=1
          IF L < (80-pos) THEN
            PRINTword$(rank(i))+" ";
          ELSE
            PRINT'word$(rank(i))+" ";
          ENDIF
        ENDIF
      NEXT i
      PRINT'
      COLOUR9
      PRINT newword;" different words counted but e.g. <You> and <you> each count"
      COLOUR2
      PRINT'" Press<SPACE> to go again..."
      G=GET
      RUN
      END
      :
      DEF PROCcheck_already_met
      alreadymet=FALSE
      IF word$(rank(i))=word$(rank(i+1)) THEN
        alreadymet=TRUE
      ENDIF
      ENDPROC
      :
      DEF PROCsort:LOCAL I
      FOR I = 1 TO t
        rank(I)=I
      NEXT
      PROCquicksort(1,t)
      ENDPROC
      :
      DEF PROCquicksort(low,high)
      LOCAL left,right,it,dummy
      left=low:right=high
      it$=word$(rank((low+high)DIV 2))
      REPEAT
        IF word$(rank(left))>it$ THEN
          REPEAT left=left+1
          UNTIL word$(rank(left))<=it$
        ENDIF
        IF word$(rank(right))<it$ THEN
          REPEAT right=right-1
          UNTIL word$(rank(right))>=it$
        ENDIF
        IF left<=right THEN
          dummy=rank(left)
          rank(left)=rank(right)
          rank(right)=dummy
          left=left+1
          right=right-1
        ENDIF
      UNTIL left>right
      IF right>low THEN PROCquicksort(low,right)
      IF left<high THEN PROCquicksort(left,high)
      ENDPROC
      :
      DATA What
      DATA makes
      DATA the
      DATA cornfield
      DATA smile
      DATA beneath
      DATA what
      DATA star
      DATA Maecenas
      DATA it
      DATA is
      DATA meet
      DATA to
      DATA turn
      DATA the
      DATA sod
      DATA Or
      DATA marry
      DATA elm
      DATA with
      DATA vine
      DATA how
      DATA tend
      DATA the
      DATA steer
      DATA What
      DATA pains
      DATA for
      DATA cattle-keeping
      DATA or
      DATA what
      DATA proof
      DATA Of
      DATA patient
      DATA trial
      DATA serves
      DATA for
      DATA thrifty
      DATA bees
      DATA Such
      DATA are
      DATA my
      DATA themes
      DATA "*"


Annotated Listing :
     
      REM: Quicksort words
      REM : Virgil's Vocab
      REM: Richard Weston after "Simon"
      REM: 26th June 2003
      MODE8:OFF
      t=0
      newword=0
      VDU14 *** Paged Mode on
      COLOUR9
      PRINTTAB(15)"Sorting Words using *Quicksort*..."'
      COLOUR7
      REPEAT
        t+=1 *** totals up the words found
        READ word$
      UNTIL word$="*" *** end marker
      DIM word$(t),rank(t) *** store the data in these arrays
      RESTORE *** sends DATA pointer back to the start of the DATA
      FOR i=1 TO t-1    *** (t-1) avoids the asterisk being counted as a word
        READ word$(i)
      NEXT i
      TIME=0 *** sets the time to zero at this point in the program
      :
      PROCsort
      :
      T=TIME/100 ***TIME gives the time in one hundredths of a second
      COLOUR11
      PRINT"That took ";T;"secs to sort ";t;" words"
      PRINT'"Press <SHIFT> to scroll down to see the rest"'
      COLOUR7
      FOR i=t-1 TO 1 STEP -1 *** read backwards or z comes first
        pos=POS : L=LEN(word$(rank(i))) *** needed to see if there's room on the line to print another word
        :
        PROCcheck_already_met  *** checks to see if word already met and printed
        :
        IF alreadymet=FALSE THEN
          newword+=1 *** count the new words
          IF L < (80-pos) THEN *** 80 characters per line in MODE 8
            PRINTword$(rank(i))+" "; ***add a space to separate the words printed"
          ELSE
            PRINT ' word$(rank(i))+" "; *** the ' prints a new line, because the next word won't fit on the present line
          ENDIF
        ENDIF
      NEXT i
      PRINT'
      COLOUR9
      PRINT newword;" different words counted but e.g. <You> and <you> each count"
      COLOUR2
      PRINT'" Press<SPACE> to go again..."
      G=GET
      RUN
      END
      :
      DEF PROCcheck_already_met
      alreadymet=FALSE
      IF word$(rank(i))=word$(rank(i+1)) THEN *** see if the previous word in the sorted list is the same
        alreadymet=TRUE
      ENDIF
      ENDPROC
      :
      DEF PROCsort:LOCAL I
      FOR I = 1 TO t
        rank(I)=I
      NEXT
      PROCquicksort(1,t)
      ENDPROC
      :
      DEF PROCquicksort(low,high) *** This PROC has been adapted to handle strings instead of numbers....
      LOCAL left,right,it,dummy
      left=low:right=high
      it$=word$(rank((low+high)DIV 2)) *** eg it becomes it$ ; value becomes word$
      REPEAT
        IF word$(rank(left))>it$ THEN
          REPEAT left=left+1
          UNTIL word$(rank(left))<=it$
        ENDIF
        IF word$(rank(right))<it$ THEN
          REPEAT right=right-1
          UNTIL word$(rank(right))>=it$
        ENDIF
        IF left<=right THEN
          dummy=rank(left)
          rank(left)=rank(right)
          rank(right)=dummy
          left=left+1
          right=right-1
        ENDIF
      UNTIL left>right
      IF right>low THEN PROCquicksort(low,right)
      IF left<high THEN PROCquicksort(left,high)
      ENDPROC
      :
      *** These DATA were produced automatically as a separate text file (using a program which I give in a later tutorial)
      *** They were then pasted onto the end of the modified, amalgamated  programs given in Tutorials 032 and 033...

      DATA What
      DATA makes
      DATA the
      DATA cornfield
      DATA smile
      DATA beneath
      DATA what
      DATA star
      DATA Maecenas
      DATA it
      DATA is
      DATA meet
      DATA to
      DATA turn
      DATA the
      DATA sod
      DATA Or
      DATA marry
      DATA elm
      DATA with
      DATA vine
      DATA how
      DATA tend
      DATA the
      DATA steer
      DATA What
      DATA pains
      DATA for
      DATA cattle-keeping
      DATA or
      DATA what
      DATA proof
      DATA Of
      DATA patient
      DATA trial
      DATA serves
      DATA for
      DATA thrifty
      DATA bees
      DATA Such
      DATA are
      DATA my
      DATA themes
      DATA "*"




Richard Weston's Homepage