Perl tutorial

Perl is a quick, easy scripting language that can be used to perform otherwise complicated tasks in very simple ways.  Some of its main strengths are file manipulation (e.g. reading files, rearranging columns) and searching for patterns within a file.  It's also much better at math than C-shell scripting.


Creating and Running a Perl Script

To use Perl, open a text editor and at the top type:
#!/usr/bin/perl -w
The #!/usr/bin/perl tells the operating system where the Perl interpreter is located (so the file can be run as Perl code), and the -w is an optional flag that tells Perl to print warnings about questionable code, such as using a variable before it has been given a value.

Once you have your program written in the file, save it as something like myprogram.pl and on the Unix command line, type chmod +x myprogram.pl (chmod means "change mode" and the +x says make it executable).  To run the program, then, just type ./myprogram.pl (or plain myprogram.pl if the current directory is in your path).
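
As a first test of the steps above, here is a minimal sketch of a complete script.  The file name hello.pl and the variable names are just examples, not anything Perl requires.  It uses "use warnings;", which has much the same effect as the -w flag, and "use strict;", which makes Perl insist that variables be declared with "my".

```perl
#!/usr/bin/perl
# hello.pl (hypothetical file name)
use strict;
use warnings;

# A scalar variable holds a single value, here a string.
my $name = "world";

# Variables inside double quotes are replaced by their values.
my $message = "Hello, $name!";

print "$message\n";
```

After saving it, chmod +x hello.pl followed by ./hello.pl should print the greeting.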

Some of the key things to learn:
- variables: $scalar, @array, $array[$i]  (<-- this last one is an element of an array, which is itself a scalar, so it gets a $ out front.)
- reading from the command line (whether arguments are entered while running the program or prompted for later)
- if/else syntax - pretty standard
- filehandles, reading in files line by line with a while statement, and the default variable $_ where the lines are stored
- pattern matching and grabbing parts of the patterns (e.g. a particular column, or something that starts with 'astr' or whatever) into variables
- search and replace - basically just pattern matching with an extra piece that tells it to replace what it's found with something else
- subroutines - ask me about these later if you like
- system command - interface with Unix anytime you want!
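
A few of the items above can be tried without any input files.  The following is only a small sketch: the data, the variable names, and the subroutine name average are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalars hold one value; arrays hold an ordered list of scalars.
my @stars  = ("Vega", "Sirius", "Deneb");
my $second = $stars[1];        # one element is a scalar, so it takes a $
my $count  = scalar @stars;    # number of elements in the array

# Pattern matching: parentheses capture parts of the match into $1, $2, ...
my $line = "astro101   Tuesday   room 12";
my ($course, $day);
if ($line =~ /^(astr\S*)\s+(\S+)/) {
    ($course, $day) = ($1, $2);
}

# Search and replace: s/pattern/replacement/ edits the variable in place.
my $title = "astro 101 is fun";
$title =~ s/astro/astronomy/;

# A subroutine receives its arguments in the array @_.
sub average {
    my @values = @_;
    return 0 unless @values;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / @values;
}
my $mean = average(2, 4, 9);

print "$second is star 2 of $count\n";
print "$course meets on $day\n";
print "$title\n";
print "mean = $mean\n";
```

Note that the captured values are only copied out of $1 and $2 inside the if, so they are used only when the match actually succeeded.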




The Ultimate Example!  Contains (just about) all of the above!

#!/usr/bin/perl -w
# readwrite.pl
#
# Description: This program reads in $numfiles number of files called in1.txt,
# in2.txt, etc., line by line and writes them out again into files out1.txt,
# out2.txt, etc. in new directories called dir1, dir2, etc.  It switches two of
# the columns of the original file when writing out the new file.
#
# Instructions: The files to be read in should be called in#.txt, where # is a
# number beginning at 1, and the new files will be called out#.txt.
#
# cmcgleam, 9/28/05


# Run the subroutine main - which, here, is really just the entire program.  
&main;

sub main
{

    # Declare variables.  "my" makes them local to the enclosing block.
    my($numlines, $numfiles);

    # The user can enter the number of files on the command line when they run
    # the program, or else they will be prompted to enter it.

    # $#ARGV is the index of the last element in the array @ARGV that's
    # automatically created when the user enters arguments when running the
    # program.  The array begins at 0, so $#ARGV + 1 is the number of
    # arguments.  For example, here, the user would type "readwrite.pl 3" if
    # there are three files to read in and write out.  

    if ($#ARGV +1 != 1)
      {
         print "Enter the number of files you wish to read in: \n";
         $numfiles = <STDIN>;
         chomp($numfiles);    # remove the newline from the end of the input
      }
    else
      {
          $numfiles = $ARGV[0];
      }

   
    # Read in the files line by line, snatch the line entries into variables,
    # and print them out into new files, switching around columns 2 and 3.
    for ($i=1; $i<=$numfiles; $i++)
     {
       
        # We'll count the number of lines in the file.  Initialize this to 0.
        $numlines = 0;

       
          # Tell the program where your files are, and set what's called a
          # filehandle that you can use for referencing the files in the future.
          # These can be called anything; here I call them FILE and OUTFILE.
          # Note the use of Unix commands for file input and output, and also
          # the system command, which can be used in general whenever you
          # want to execute a Unix command.

        open(FILE, "more in$i.txt |") or die "Cannot read in$i.txt: $!\n";
        system("mkdir dir$i");
        open(OUTFILE, "> dir$i/out$i.txt") or die "Cannot write dir$i/out$i.txt: $!\n";

     
         # The while (<FILE>) statement tells Perl to read the file line by
         # line, for as long as there are lines left.  It does the stuff in
         # { } to each line before moving on to the next.
        while (<FILE>)
         {
    
                # PATTERN MATCHING
                # =~ means find the specified pattern in $_, the default variable
                # which currently refers to the line you're on in the file.
                # / is put at the beginning and end of your complete search pattern
                # ^ means at the beginning of the line
                # \ starts a special sequence such as \s, or makes the next
                #   character literal (\. matches an actual period)
                # \s matches whitespace (spaces, tabs)
                # \S matches non-whitespace
                # * means zero or more times
                # + means one or more times
                # . matches any character except a newline (at the end, I have
                #   .* which means any character, zero or more times)
                # parentheses ( ) put the thing that's matched into a variable.
                #   these automatically number themselves, $1, $2, $3... etc.

                # Find at least 4 columns and save the first 4 into $1-$4.
                # If a line has fewer than 4 columns, the match fails and
                # nothing is written out for that line.
             if ($_ =~ /^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+).*/)
               {
                 # Print to the outfile, switching two columns.
                 print OUTFILE "$1 $3 $2 $4 \n";
               }

             $numlines ++;
        
          }

  
        # This will print to the command line.  You can insert variable names
        # into a print statement and Perl will replace them with the actual
        # variable values.  The \n means put a "newline" after the sentence.
       print "There are $numlines lines in file in$i.txt. \n";

  
       # Close the input and output files.
       close(FILE);
       close(OUTFILE);
    }

}
 




The same code, without all the comments:

#!/usr/bin/perl -w
# readwrite2.pl
#
# Description: This program reads in $numfiles number of files called in1.txt,
# in2.txt, etc., line by line and writes them out again into files out1.txt,
# out2.txt, etc. in new directories called dir1, dir2, etc.  It switches two of
# the columns of the original file when writing out the new file.
#
#
# Instructions: The files to be read in should be called in#.txt, where # is a
# number beginning at 1, and the new files will be called out#.txt.
#
# cmcgleam, 9/28/05


&main;

sub main
{
    my($numlines, $numfiles);   
   
    if ($#ARGV +1 != 1)
     {
        print "Enter the number of files you wish to read in: \n";
        $numfiles = <STDIN>;
        chomp($numfiles);
     }
    else
     {
        $numfiles = $ARGV[0];
     }


    for ($i=1; $i<=$numfiles; $i++)
     {   
        $numlines = 0;

        open(FILE, "more in$i.txt |") or die "Cannot read in$i.txt: $!\n";
        system("mkdir dir$i");
        open(OUTFILE, "> dir$i/out$i.txt") or die "Cannot write dir$i/out$i.txt: $!\n";

        while (<FILE>)
        {      
            if ($_ =~ /^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+).*/)
            {
                print OUTFILE "$1 $3 $2 $4 \n";
            }

            $numlines ++;
        }

        print "There are $numlines lines in file in$i.txt. \n";
        close(FILE);
        close(OUTFILE);
     }

}
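
For comparison, the same read, switch, and write steps can also be done with Perl's own file functions instead of Unix commands.  This is only a sketch of one alternative, not a replacement for the program above: the subroutine name switch_columns_in_file is made up, and it assumes the same in#.txt and dir#/out#.txt naming.  It opens files with the three-argument form of open, keeps the filehandles in "my" variables, and uses Perl's built-in mkdir.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path with columns 2 and 3
# switched, and return the number of lines read from the input file.
sub switch_columns_in_file {
    my ($in_path, $out_path) = @_;

    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!\n";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!\n";

    my $numlines = 0;
    while (my $line = <$in>) {
        $numlines++;
        # Lines with fewer than 4 columns are counted but not written out.
        if ($line =~ /^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/) {
            print {$out} "$1 $3 $2 $4\n";
        }
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!\n";
    return $numlines;
}

# Usage, following the naming above (in1.txt goes to dir1/out1.txt).
# This part only runs when a file count is given on the command line.
if (@ARGV == 1) {
    for my $i (1 .. $ARGV[0]) {
        mkdir "dir$i" unless -d "dir$i";
        my $n = switch_columns_in_file("in$i.txt", "dir$i/out$i.txt");
        print "There are $n lines in file in$i.txt.\n";
    }
}
```

Opening the input file directly means a missing file stops the program with a clear message, and checking for the directory first lets the script be run more than once without mkdir complaining.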