HTM Strip

HTMSTRIP is a program written to strip HTML codes from batches of .HTM or .HTML files stored in one subdirectory. The corresponding text files are stored in the same subdirectory.

HTMSTRIP is NOT in the Public Domain. The ownership of this program is here asserted by the author, John White, of Wokingham, England.

Please note that HTMSTRIP was designed particularly to strip Latin texts obtained in HTML form from the Latin Library at http://www.theLatinLibrary.com/, and contains some extra coding for this purpose. This extra coding should not affect its normal operation for most other purposes.

System Requirements

The HTMSTRIP program requires 2.5 Mbytes of RAM to function properly. If you are using HTMSTRIP as part of the Blitz Latin package, then your computer will certainly be able to operate HTMSTRIP safely, provided that you close down Blitz Latin first.

How to use HTMSTRIP

1. Place all your HTM/HTML files to be stripped into one convenient folder/subdirectory. Let’s call it ‘C:\HTM’.
2. Run HTMSTRIP from a different folder! Click on the menu option FILE/OPEN and navigate to C:\HTM. Open this folder (it will appear empty, even though your HTM files are present). Click on OK. You have now set the directory to search, and the program has generated a file ‘DIRHTM.TXT’ listing all the relevant HTM files to strip.
3. Click on EDIT/STRIP (CTRL+S). This reads in and processes all the files listed in ‘DIRHTM.TXT’. You do NOT have to re-create DIRHTM.TXT with FILE/OPEN each time. Regenerate the file list only when you change the files in ‘C:\HTM’.
4. The process is completely automatic.
5. The output files are saved as file.TXT in the folder ‘C:\HTM’. Where a filename contains more than one extension, the extra dots are replaced with underscores. That is, a file named “abcdef.ghi.HTM” (or .HTML) will be saved as “abcdef_ghi.TXT”.
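
The naming rule in step 5 can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not HTMSTRIP's own code; the function name output_name is hypothetical.

```python
import os


def output_name(filename: str) -> str:
    """Hypothetical helper: derive the .TXT name for an .HTM/.HTML file."""
    stem, ext = os.path.splitext(filename)  # splits off the last extension only
    if ext.lower() not in (".htm", ".html"):
        raise ValueError(f"not an HTM/HTML file: {filename}")
    # Any dots left in the stem are replaced with underscores.
    return stem.replace(".", "_") + ".TXT"


print(output_name("abcdef.ghi.HTM"))
```

With the example from step 5, “abcdef.ghi.HTM” maps to “abcdef_ghi.TXT”.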

How HTMSTRIP works

1. The complete .HTM file is read into memory. For those very rare HTML files that exceed 1 MByte a warning will be given and processing will be skipped on that file. Such files are better converted to text with your web browser.
2. All HTML codes encased within < and > are removed, as are all items encased within [ and ]. (The latter are not HTML codes, but are usefully removed from virtually all text).
3. The surviving text is now searched for HTML codes beginning with an ampersand (&) and terminating with a semi-colon (;). These codes are replaced with the correct items according to the following table:
   &gt; or &raquo;  =  >
   &lt; or &laquo;  =  <
   &amp;            =  &
   &#abc;           =  (chr)abc
   &quot;           =  “
   &nbsp;           =  space
   &tilde;          =  ~
   &aelig;          =  ae
   &auml;           =  a
   &euml;           =  e
   &iuml;           =  i
   &ouml;           =  o
   &uuml;           =  u
   &grave;          =  [ignored]
   &acute;          =  [ignored]
   &circ;           =  [ignored]
   Other ampersand codes are ignored.
4. The modified text is now searched line by line (between carriage returns) to remove all lines that contain nothing, or only control codes and/or spaces.
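
Steps 2 to 4 above can be approximated by the following Python sketch. It is not HTMSTRIP's actual code, and it rests on several assumptions: the codes marked “[ignored]” are dropped from the text, ampersand codes not in the table are left unchanged, &quot; yields a plain double quote, and the 1 MByte size check of step 1 is left out. The names ENTITIES and strip_html are hypothetical.

```python
import re

# Replacement table from step 3. Entries mapped to "" are the codes the
# table marks "[ignored]"; dropping them is an assumption of this sketch.
ENTITIES = {
    "gt": ">", "raquo": ">", "lt": "<", "laquo": "<", "amp": "&",
    "quot": '"', "nbsp": " ", "tilde": "~", "aelig": "ae",
    "auml": "a", "euml": "e", "iuml": "i", "ouml": "o", "uuml": "u",
    "grave": "", "acute": "", "circ": "",
}

TAG = re.compile(r"<[^>]*>")          # anything between < and >
BRACKETED = re.compile(r"\[[^\]]*\]")  # anything between [ and ]
ENTITY = re.compile(r"&(#[0-9]+|[A-Za-z]+);")


def _decode(match):
    code = match.group(1)
    if code.startswith("#"):
        number = int(code[1:])
        # Numeric code: emit that character; leave out-of-range codes alone.
        return chr(number) if number < 0x110000 else match.group(0)
    # Assumption: codes missing from the table stay in the text unchanged.
    return ENTITIES.get(code, match.group(0))


def strip_html(text: str) -> str:
    text = TAG.sub("", text)          # step 2: remove <...>
    text = BRACKETED.sub("", text)    # step 2: remove [...]
    text = ENTITY.sub(_decode, text)  # step 3: replace ampersand codes
    # Step 4: keep only lines holding something besides spaces/control codes.
    kept = [
        line for line in text.splitlines()
        if any(ch > " " and ch != "\x7f" for ch in line)
    ]
    return "\n".join(kept)


sample = "<p>Gallia est[1] omnis &amp; divisa&nbsp;in partes tres</p>\n   \n<br>\n"
print(strip_html(sample))
```

Tags are removed before the ampersand codes are replaced, as in the steps above, so a < or > produced from &lt; or &gt; survives into the output instead of being mistaken for part of a tag.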

Example HTML file to test

The included file “testhtml.htm” provides an example of the type of HTML file that HTMSTRIP can handle. The file is a mixture of problems taken from different real HTML files, so it is a demanding test of HTMSTRIP. View this file first in a web browser so that you can see what is intended. Then place this file in its own sub-directory and convert it to text with HTMSTRIP. The expected result (“testhtml.txt”, stored in the same sub-directory as “testhtml.htm”) is shown in the program’s help file.

Registration/Licensing of HTMSTRIP

HTMSTRIP is supplied as part of the Blitz Latin package. Although HTMSTRIP is intended to be used on its own, without Blitz Latin running, it will only operate if the licence for Blitz Latin is still valid (up to 10 free uses, and then a licence must be purchased). Consult the documentation for Blitz Latin in order to register the product (and HTMSTRIP).

Click shop to buy Blitz Latin, and download to try it for free.