[K12OSN] OT -- Not LTSP, but Linux Scripting Question

Tue Mar 22 14:44:01 UTC 2005

I suggest you create two scripts: one to check the text file for errors (spaces, 
numbers that aren't three digits, whatever) and another to do the pdftk work once you 
have a clean text file.  I do a lot of this kind of thing, and I've found it much more 
efficient to verify the file first, fix any problems, and then run the batch process.  
The alternative is one script that merges 50 lines and then hits a problem.  It either 
skips that line and reports it, or it bails out, in which case you have to fix the 
problem and start over, except that you don't want to redo the lines that succeeded on 
the first run, so you have to work out where the error was... and so on; you get the 
idea.  So, with that in mind, I wrote two quick & dirty perl scripts that should do most 
of what you want.  It could probably be done in a shell script, but it would be harder 
(which is how perl came about).

Script 1 looks for errors in the text file, as you described:

#!/usr/bin/perl -w
# script written for Kevin Squire on the K12LTSP list
# make sure no lines have any spaces and all page numbers are three digits

$errors = 0;
while (<>) {
   chomp;
   if ($_ =~ / /) {                      # check for spaces
      print "$_ has a space in it\n";
      $errors++;
   }
   (@array) = split(',',$_);
   # field 0 is the student name, so start checking at field 1
   for ($i=1; $i <= $#array; $i++) {
     if ($array[$i] !~ /^\d{3}$/) {      # check for numbers that aren't 3 digits
       print "Field $i on line $_ is not three digits\n";
       $errors++;
     }
   }
   # Uncomment the next line if you want to display a running tally of the errors
   # print "Errors is $errors\n";
}
($errors > 0) && print "There were $errors errors\n";

########### end script1 ################
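If you want line numbers in the messages, here is a minimal sketch of the same checks using Perl's built-in $. variable, which holds the current input line number.  The sub name check_line is made up for illustration, and the extra checks (missing name, missing page numbers) are my own assumptions about what counts as an error, so adjust to taste.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a list of problems found in one line of the text file.
# An empty list means the line is clean.  (check_line is a hypothetical name.)
sub check_line {
    my ($line) = @_;
    my @problems;
    push @problems, "contains whitespace" if $line =~ /\s/;
    my ($name, @pages) = split /,/, $line, -1;   # -1 keeps trailing empty fields
    push @problems, "has no student name" unless defined $name && length $name;
    push @problems, "has no page numbers" unless @pages;
    for my $i (0 .. $#pages) {
        push @problems, "page field " . ($i + 1) . " ('$pages[$i]') is not three digits"
            unless $pages[$i] =~ /^[0-9]{3}\z/;
    }
    return @problems;
}

# Only read input when a file name was given on the command line.
if (@ARGV) {
    my $errors = 0;
    while (my $line = <>) {
        chomp $line;
        for my $problem (check_line($line)) {
            print "line $.: $problem\n";   # $. is the current input line number
            $errors++;
        }
    }
    print "There were $errors errors\n" if $errors;
    exit($errors ? 1 : 0);
}
```

The non-zero exit status on errors means a wrapper shell script could refuse to go on to the merge step until the file is clean.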

Run this against your text file ('script1 textfile.txt') and it will print each error 
it finds along with the offending line.  You could add a line counter to make the 
errors easier to locate.  Fix the errors, run the script again, and repeat until you 
get no errors.  Then run script 2:

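#!/usr/bin/perl -w

# Prefix of the single-page files produced by the pdftk burst step
# (squire_001.pdf, squire_002.pdf, ...); change it to match the teacher.
$prefix = "squire";

while (<>) {
   $sourcefiles = "";
   chomp;
   (@array) = split(',',$_);
   for ($i=1; $i <= $#array; $i++) {
     # field 0 is the student name; fields 1..N are the page numbers
     $sourcefiles = $sourcefiles." ".$prefix."_".$array[$i].".pdf";
   }
   print "The input string will be $sourcefiles\n";
   # system("/path/to/pdftk $sourcefiles cat output $array[0].pdf");
}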

For each line in the text file, this will split up the fields, create the pdf file 
names, and put it all into one string for use with the pdftk command.  I have the last 
line commented out because you should do a dry run with this first to make sure the 
$sourcefiles string will be what you want.  I don't have pdftk so I couldn't really test 
it, but the print command on the penultimate line will show what will be passed to 
pdftk.  HTH
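Once the dry run looks right, a slightly more careful version of the merge step might look like the sketch below.  It is untested against a real pdftk and rests on some assumptions: pdftk is on your PATH, the single-page files are named <prefix>_<page>.pdf in the current directory, and each output file is named after the student.  The names build_command, $prefix and $dryrun are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumptions for illustration: pdftk is on the PATH, the single-page files
# are named <prefix>_<page>.pdf, and the output is <student>.pdf.
my $prefix = "squire";
my $dryrun = 1;          # set to 0 to actually call pdftk

# Build the pdftk argument list for one line such as "smithJ,001,002".
sub build_command {
    my ($prefix, $line) = @_;
    my ($student, @pages) = split /,/, $line;
    my @sources = map { "${prefix}_$_.pdf" } @pages;
    return ("pdftk", @sources, "cat", "output", "$student.pdf");
}

# Only read input when a file name was given on the command line.
if (@ARGV) {
    while (my $line = <>) {
        chomp $line;
        my @cmd = build_command($prefix, $line);
        print "line $.: @cmd\n";
        next if $dryrun;
        # The list form of system() bypasses the shell; it returns 0 on success.
        system(@cmd) == 0
            or die "line $.: pdftk failed (status $?), stopping\n";
    }
}
```

Stopping at the first failure, with the line number in the message, tells you exactly which lines of the text file have already been merged.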

Petre

Kevin Squire wrote:
> First, I apologize for the OT nature of the post, but I am sure many of
> you will know / have done something like this.  Also, I really did not
> know where else to post the question.  If you know somewhere better,
> feel free to let me know. :-)
> 
> The Asst. Prin. has asked me to do something very tedious (I did set myself
> up for it, but I could use the "brownie points"), and I need some help
> with the script that I am writing to make it less tedious.  I have done
> a fair bit of scripting, but nothing this advanced, so I need some help.
> 
> Some general info:  Each teacher right now has a single MS Word document
> with every one of his/her students' progress reports (i.e. I have one
> file called squire_pr.doc that is 116 pgs for 56 students).  The AP
> wants them to be a single document (2 or 3 pgs) per student. He does not
> care if they stay in .doc format or not, as long as they still look the
> same.  
> 
> So I have taken my squire_pr.doc and printed it to PDF (squire_pr.pdf)
> so that I could use a program called pdftk (
> http://freshmeat.net/projects/pdftk/ ) and split it up into 116 single
> page documents (each one called squire_###.pdf). Then I can use the same
> program to join the appropriate pages back together again (so
> squire_001.pdf and squire_002.pdf becomes smithJ_04-05_mid.pdf).
> 
> I want to put together a script that will automate this stuff (to a
> certain point).  The teacher sends me two files, the 1 large pdf file
> and a text file with student name and the page numbers of the PDF file
> that make that student's report.  Usually it will be 2 pages, but
> sometimes it will be 3 or maybe even 4.  The text file would look
> something like:
> 
> smithJ,001,002
> mouseM,003,004
> gatesW,005,006,007
> 
> I have already done the basics on the script -- setting up variables,
> assigning directories, making sure the correct files exist already, etc.
>  But I don't know how to (1) get the script to read from the text file,
> (2) verify that the text file has no spaces and all numbers are in ###
> format, (3) assign a variable to each field in the text file, (4) repeat for
> every line in the text file.
> 
> Some info/example/notes from my script:
> =============================================
>   $inputfile is the 1 big PDF file
>   $tempdir/$teachername_%03d.pdf part creates a bunch of single PDF's
>        with the name squire_001.pdf, squire_002.pdf, etc.
>   $parsefile is the text file with student names, and page numbers 
>        from the PDF that make up their report
> ==========
> pdftk $inputfile burst $tempdir/$teachername_%03d.pdf
> 
> # Now the hard part :-)
> # Need to read the $parsefile and verify that:
> #   there are no spaces and that all numbers are in ### format
> #   if not just give an error of $parsefile has error (adding a line
> #   number would be nice but not necessary)
> # and then assign the following:
> #   $studentname from field 1
> #   $stupg1      from field 2
> #   $stupg2      from field 3
> #   $stupg3      from field 4 for those that have 4 fields
> #   $stupg4      from field 5 for those that have 5 fields
> 
> # then run the command 'pdftk INPUTFILES cat output COMBINEDFILE'
> # for every single line in the text file
> # where INPUTFILES would be $tempdir/$teachername_$stupgN.pdf where
> # N could be 1,2,3 or 4 depending on what was found in text file
> 
> NOTE -- sorry this got so long, I hope it all makes sense.  And thank
> you in advance for your effort. 
> 
>