Add-Innovation Home

About Add-Innovation

Contact Us

  SAS Institute logo.    Reduce fields read on Merged datasets.


When input files need to be merged and they are already sorted by a common key, a great deal of space can be saved by merging them using a data view. The downside of merging data held in different files is that subsetting cannot be performed against combinations of data held in different files until those files are merged. Usually this would not cause an undue amount of extra processing, but if a high number of additional fields need to be read from each file, it would be advantageous to be able to merge within the same step.
For example, the Type 63 - VRUF2 IOLAIRE file consists of a single line per account, each having up to 100 products held in a repeating portion of the account. If these accounts needed to be subset to only include live accounts, then it would need to be merged with another file to add on the live / final indicator. These files could be merged efficiently (in terms of space) by using a data view, but all of the repeating sections of the product section would need to be read regardless of whether the account was live or final.
A more desirable way of merging the files would be to read the two files in a single step using two infile statements, as in the following example ;

Data Accts(Drop=T63Acc Active Offset LivFinal);
   Retain T63Acc;
   Infile GenBase;
   Input @28 LivFinal $1.@;
   If LivFinal=’L’;
   Input @1 DistAcc $10.@;
   Infile Type63 End=EOF63;
   Do Until(T63Acc>=DistAcc or EOF63);
      Input / @3 T63Acc $10.@@;
   End;
   If T63Acc=DistAcc;
   Input @71 Active pd2.@;
   Do OffSet=77 To 77+(Active-1)*24 By 24;
      Input @(Offset+10) ProdCode $6.@;
      Output;
   End;

In the above example, the generic base file is searched for the first live account. Once found the Type 63 file is searched within the Do Until ... End loop until the same or next account is read. The rest of the record is only read if the district and account codes match. The double ampersand on the input statement for T63Acc is required to hold the input line pointer at the current place in the Type 63 file ready for the next loop. If that district / account does not match the account returned from the generic base file, then it may match the next record, so the value needs to be retained and the line pointer position held. Since the double ampersand is used, the ‘/’ (read next line) operator also needs to be used.

A more sophisticated example is shown in the example below.  In this example, accounts were read from ten generations of the Call Guard detail files. For an account to be included in the output file, it had to be live in all ten generations, and either the account number needed to be defined in the $KEEPAC. format or the outer part of the postcode needed to be included in the $OUTERP format. Once matched, the total minutes from the type 10 segment (a repeating segment) needed to be calculated and the result held in an individual array element per generation (i.e. per month).
The same principle applied above can be used, although the code is a little more complicated since ten input files are read in the same step. However, since the same principle is applied to each input file each of which has the same structure (almost !), a macro can be used to reduce the code listed. Unfortunately some additional care needs to be taken since the last two generations have a slightly different record layout. The code for this step is as follows ;

%Macro Grabit;                                                         
Data Storage.CallG(Drop=NumT10Sg Offset LivFinal ThisAc1-ThisAc9 Mins);
   Array totmin(10) 8;                                                 
   Retain ThisAc1-ThisAc9 '          ';                                
   Infile CallG10;                                                     
   Input @36 LivFinal $1.@;                                            
   If LivFinal='L';                                                    
   Input @1  DistAcc $10.                                              
         @19 PostCode $4.@;                                            
   If Put(DistAcc,$KeepAc.) or Put(PostCode,$OuterP.);                 
   Input @37 NumT10Sg pd3.@;                                           
   Do Offset=57 To 57+(NumT10Sg-1)*44 By 44;                           
      Input @Offset Mins pd7.2 @;                                      
      totmin(10)=Sum(totmin(10),Mins);                                 
   End;                                                                
%Do MonRef=9 %To 1 %By -1;                                             
   Infile CallG&MonRef;                                                
   Do Until(ThisAc&MonRef ge DistAcc);                                 
      Input / @1  ThisAc&MonRef $10.@@;                                
   End;                                                                
   If ThisAc&MonRef=DistAcc;                                           
   %if &MonRef<3 %Then Input @35 LivFinal $1.@@;;                      
   %if &MonRef>2 %Then Input @36 LivFinal $1.@@;;                      
   If LivFinal='L';                                                    
   %if &MonRef<3 %Then Input @36 NumT10Sg pd3.@@;;                     
   %if &MonRef>2 %Then Input @37 NumT10Sg pd3.@@;;                     
   %if &MonRef<3 %Then Do Offset=56 To 56+(NumT10Sg-1)*44 By 44;;      
   %if &MonRef>2 %Then Do Offset=57 To 57+(NumT10Sg-1)*44 By 44;;      
      Input @Offset Mins pd7.2 @@;                                     
      totmin(&MonRef)=Sum(totmin(&MonRef),Mins);                       
   End;                                                                
%End;                                                                  
Run;                                                                   
%Mend;                                                                 
%GrabIt;

In this example a different value of district and account needs to be maintained for each input file, hence the variable names THISAC1 - THISAC9 defined by macro statements. The district and account record is only output at the end of the data step, so if it is not found or is not live, at any point in the 10 GDG generations, then it will be deleted.