Re: nawk: out of space in tostring on ...



On Aug 3, 9:38 am, Ed Morton <mor...@xxxxxxxxxxxxxx> wrote:
Mensur wrote:
On Aug 3, 3:18 am, Juergen Kahrs <Juergen.KahrsDELETET...@xxxxxxxxx>
wrote:

Mensur wrote:

On Aug 2, 2:17 pm, Ed Morton <mor...@xxxxxxxxxxxxxx> wrote:

Mensur wrote:

I have a flat file with over 3 million rows. File is being processed
with nawk program that does some merging on the file and outputs the
result in a different file. After processing around 2.5 million rows,
nawk fails with:
nawk: out of space in tostring on 8273|8273_2|2000747680|35|35-44|
35-44|N|N|N|X|N|N|X|NJ||1630676556|Y|1|7243|||||||||||||||||||||
What can cause this error? Looking at the source code, the error is
caused by failed C malloc function to obtain memory. I am using awk
arrays to do the merging - I am not sure if that has anything to do
with it.

It is not surprising that you script fails when
processing large files. It accumulates all lines
in the memory of the AWK intepreter:

{

...

# Add currently read line to list of records that belong to same
client
++DATA_SIZE
DATA[DATA_SIZE] = CURRENT_LINE
}

Try to avoid storing all lines.
Ed's recommended using gawk. Using gawk when processing
large files is generally a good idea. gawk avoids memory
limitations where most other implementations _have_ limitations.

It is strange to me that it is happening because I do reset the
DATA_SIZE varibale to 0 after accumulating all records belonging to
one client. So in essence, I allocate maybe at most 50 array elements
(lines in the file) in memory and keep overwriting them. Therefore,
the memory should not grow more than it takes to store those 50 lines.
However, I looked at memory usage while the process is running, and it
keep growing constantly.

I would love to use gawk, but like I mentioned, our compnay does not
have it installed and I have no say so in that.

Can you come up with a small, complete script that reproduces the
problem and post that?

Ed.- Hide quoted text -

- Show quoted text -

Here is the whole script. Like I said, it is really long. Again, thank
you guys for helping.


BEGIN{
# Set initial values of variables
OFS=FS;
ERROR_CD = 0;
CIN_POS = -1;
MG_ID_POS = -1;
CG_ID_POS = -1;
AMS_CG_ID_POS = -1;
AMS_FIRST_NAME_POS = -1;
CURR_CIN_ID = -1;
PREV_CIN_ID= - 1;
DATA_SIZE = 0;
DATA[1] = "";
LOOKUP_CIN_ID = -1;
LOOKUP_MG_ID = -1;
LOOKUP_CG_ID = -1;

# Process header record to read in header record, find positions of
variables, output header record and read in next line
if (process_header_record() != 0)
{
log_message("Error inherited from process_header_record
function.")
ERROR_CD = 1;
exit 1
}
}
#######################################################################
# MAIN PROCESSING
#######################################################################
{
# IF error occured in BEGIN section, detect it here
if (ERROR_CD != 0)
{
log_message("Error occured in BEGIN section. Exiting ....")
exit 1
}

# Skip empty lines
if (trim($0) == "")
{
log_message("Found empty record: " FNR ". Skipping.")
next
}

# Log which record we are processing
if (FNR % 10000 == 3)
{
log_message("Processing record: " FNR)
}

# Get line that was read in
CURRENT_LINE = $0

# Get current CINERGY ID
CURR_CIN_ID = $CIN_POS

# If Cinergy IDs do not match, we have read all records that belong
to a single client - MERGE
if (PREV_CIN_ID != CURR_CIN_ID)
{
# IF either MG ID or CG ID are included on the file, we will use
lookup file to find
# MG ID with earliest establishment date and CG ID with hjighest
assets
if (MG_ID_POS != -1 || CG_ID_POS != -1)
{
# Merge using lookup file
if(dedup_merge() != 0)
{
log_message("Error inherited from dedup_merge function.")
ERROR_CD = 1
exit 1
}
}
else
{
# Merge by picking first valid value for each column
if(merge_first_valid_value() != 0)
{
log_message("Error inherited from merge_first_valid_value
function.")
ERROR_CD = 1
exit 1
}
}

# Reset Cinergy ID
PREV_CIN_ID = CURR_CIN_ID
}

# Add currently read line to list of records that belong to same
client
++DATA_SIZE
DATA[DATA_SIZE] = CURRENT_LINE
}
# END SECTION
END{
# Detect error from MAIN section
if (ERROR_CD != 0)
{
log_message("Error occured in MAIN section. Exiting...")
exit 1
}

# Merge last client
log_message("Merging last client...")

# IF either MG ID or CG ID are included on the file, we will use
lookup file to find
# MG ID with earliest establishment date and CG ID with hjighest
assets
if (MG_ID_POS != -1 || CG_ID_POS != -1)
{
if(dedup_merge() != 0)
{
log_message("Error inherited from dedup_merge function.")
exit 1
}
}
else
{
# Merge by picking first valid value for each column
if(merge_first_valid_value() != 0)
{
log_message("Error inherited from merge_first_valid_value
function.")
exit 1
}
}
}

###########################################################################################################
#
# Function name: process_header_record
#
# Purpose: Function reads in the header record, extracts
field positions from the header, outputs header
# record to an output file. Then it reads another
record, initilized DATA array, DATA_SIZE
# and CURR_CIN_ID and PREV_CIN_ID variables.
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: $0, NF, FNR are reset
#
###########################################################################################################
function process_header_record()
{
# Explicitly read in header record from campaign file
RETURN_CODE = getline
if (RETURN_CODE <= 0)
{
log_message("There are no more campaign records to read or error
reading campaign file!")
return 1
}

# Get position of Cinergy Client Id field
CIN_POS = get_field_pos("Cinergy Client Id")
if (CIN_POS == -1)
{
log_message("Error trying to get position of Cinergy Client Id")
return 1
}

# Get position of Marketing Group Identifier
MG_ID_POS = get_field_pos("Marketing Group Identifier")
if (MG_ID_POS == -1)
{
log_message("MG ID not present on the file.")
}

# Get position of Client Group ID
Marketing Group Identifier
CG_ID_POS = get_field_pos("Client Group ID")
if (CG_ID_POS == -1)
{
log_message("CG ID not present on the file.")
}

# Get position of AMS Client Group ID
AMS_CG_ID_POS = get_field_pos("AMS Client Group ID")
if (AMS_CG_ID_POS == -1)
{
log_message("AMS CG ID not present on the file.")
}

# Get position of AMS Crew First Name
AMS_FIRST_NAME_POS = get_field_pos("AMS Crew First Name")
if (AMS_FIRST_NAME_POS == -1)
{
log_message("AMS Manager First Name not present on the file.")
}

# Output header record
print $0 > OUT_FILE

# Read in first client record from campaign file
RETURN_CODE = getline
if (RETURN_CODE <= 0)
{
log_message("There are no more campaign records to read or error
reading campaign file!")
return 1
}

# Set variables
CURR_CIN_ID = $CIN_POS
PREV_CIN_ID = $CIN_POS
DATA_SIZE=1
DATA[1] = $0
}

###########################################################################################################
#
# Function name: get_field_pos
#
# Purpose: Function retruns the position of field FIELD_TEXT
from $0 or -1 if the field doesn't exist.
#
# Parameters: FIELD_TEXT
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: $0 must be set
#
###########################################################################################################
function get_field_pos(FIELD_TEXT)
{
# Loop for each field in $0 and find position of field with text
FIELD_TEXT
for (i = 1; i <= NF; ++i)
{
# If field exists, return position
if (trim($i) == FIELD_TEXT)
{
return i
}
}
return -1
}

###########################################################################################################
#
# Function name: find_first_non_empty_line
#
# Purpose: Function returns first non-empty line in DATA
array or empty string if all lines are
# empty.
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: DATA and DATA_SIZE must be set
#
###########################################################################################################
function find_first_non_empty_line()
{
# Find first non-empty line
for (i = 1; i <= DATA_SIZE; ++i)
{
if (DATA[i] != "")
{
return DATA[i]
}
}
return ""
}

###########################################################################################################
#
# Function name: merge_first_valid_value
#
# Purpose: Merge columns in DATA by taking the first value
that is not empty, UNKNOWN or Default.
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: DATA and DATA_SIZE must be set
#
###########################################################################################################
function merge_first_valid_value()
{
# If AMS is included, take a first valid AMS client
if (AMS_FIRST_NAME_POS != -1)
{
# Find first valid AMS
if (dedup_field_by_valid_value(AMS_FIRST_NAME_POS) != 0)
{
log_message("Error inherited from dedup_field_by_valid_value
function when deduping on AMS.")
return 1
}
}

# Read in and set first line
FIRST_LINE = find_first_non_empty_line();
if (FIRST_LINE == "")
{
log_message("Error reading line. All lines are empty!")
return 1
}
$0 = FIRST_LINE

# Get number of fields in the line
FIELD_COUNT = NF

# Look for each field in the line and replace any fields that have
empty, unknown or default value
# from remaining records belonging to the same client
for (i = 1; i <= FIELD_COUNT; ++i)
{
# Need to always set the first line as we are changing $0 down
in the code
$0 = FIRST_LINE

# If field is empty, UNKNOWN or default, find a replacement
if (trim($i) == "" || tolower($i) == "unknown" || tolower($i) ==
"default")
{
# Set field to empty value in case no replacement is found
$i = ""
FIRST_LINE = $0

# Find a replacement from remaining records
for (j = 1; j <= DATA_SIZE; ++j)
{
# If replacement is empty, keep going
if (DATA[j] == "")
{
continue
}
$0 = DATA[j]

# Find a replacement
if (trim($i) != "" && tolower($i) != "unknown" &&
tolower($i) != "default")
{
# Perfrom replace
REPLACEMENT_FIELD=$i
$0 = FIRST_LINE
$i = REPLACEMENT_FIELD
FIRST_LINE = $0
break;
}
}
}
}

# Once the replacements are made, output record
output(FIRST_LINE)

# Reset data size so it can be used again
DATA_SIZE=0
}


###########################################################################################################
#
# Function name: dedup_merge
#
# Purpose: Merges client records using lookup file
#
# Parameters: None
#
# Return: Function returns 0 on success and 1 on failure
#
# Side effects: DATA and DATA_SIZE must be set
#
###########################################################################################################
function dedup_merge()
{
# Get next lookup values
LAST_PROCESSED_LOOKUP_ID = LOOKUP_CIN_ID
if (get_next_lookup_values() != 0)
{
log_message("Error inherited from get_next_lookup_values
function.")
log_message("Last processed LOOKUP_CINERGY_ID is: "
LAST_PROCESSED_LOOKUP_ID)
log_data_records()
return 1
}

# If Cinergy IDs do not match, we are out of sync
if (PREV_CIN_ID != LOOKUP_CIN_ID)
{
log_message("Cinergy ID from campaign file: " PREV_CIN_ID "
does not match lookup file Cinergy ID: " LOOKUP_CIN_ID)
log_message("Last processed LOOKUP_CINERGY_ID is: "
LAST_PROCESSED_LOOKUP_ID)
log_message("This ERROR can occur for 4 different reasons:")
log_message("1. Lookup file returned by the stored procedure
contains duplicate Cinergy Ids which in turn means that campaign
contains recipients with same Cinergy Ids (Campaign may contain Leads,
Purged or Expired clients). THIS IS SCENARIO IS VERY LIKELY. In this
case modify the campaign to exclude those clients.")
log_message("2. Output format of the file contains a date
(client birth date or other date) that points to the date dimesnion
but the date value is past 12/31/2010. THIS IS SCENARIO IS VERY
LIKELY. ")
log_message("3. Lookup file returned by the stored procedure
contains Cinergy Id that is not present in the export file.")
log_message("4. Export file contains Cinergy Id that is not
present in the Lookup file returned by the stored procedure.")


log_data_records()
return 1
}

# Dedup on MG
if (LOOKUP_MG_ID != "" && MG_ID_POS != -1)
{
if (dedup_field_by_value(MG_ID_POS, LOOKUP_MG_ID, "MG ID") != 0)
{
log_message("Error inherited from dedup_field_by_value
function when deduping on MG.")
log_data_records()
return 1
}
}

# Dedup on CG
if (LOOKUP_CG_ID != "" && CG_ID_POS != -1)
{
if (dedup_field_by_value(CG_ID_POS, LOOKUP_CG_ID, "CG ID") != 0)
{
log_message("Error inherited from dedup_field_by_value
function when deduping on CG.")
log_data_records()
return 1
}
}

# Merge other records by picking first valid value
if (merge_first_valid_value() != 0)
{
log_message("Error inherited from merge_first_valid_value.")
return 1
}

return 0
}

###########################################################################################################
#
# Function name: trim
#
# Purpose: Trims string
#
# Parameters: None
#
# Return: Trimmed string
#
# Side effects: None
#
###########################################################################################################
function trim(string)
{
sub(/^[ \t]+/, "", string); sub(/[ \t]+$/, "", string) #trim left
and right
return string
}


###########################################################################################################
#
# Function name: log_message
#
# Purpose: Logs message to a log file
#
# Parameters: None
#
# Return: None
#
# Side effects: None
#
###########################################################################################################
function log_message(string)
{
"date +%m/%d/%Y_%H:%M:%S" | getline dt
close("date +%m/%d/%Y_%H:%M:%S")
print dt " " string >> LOG_FILE
}


###########################################################################################################
#
# Function name: output
#
# Purpose: outputs string to output file
#
# Parameters: None
#
# Return: None
#
# Side effects: None
#
###########################################################################################################
function output(string)
{
print string >> OUT_FILE
}


###########################################################################################################
#
# Function name: get_next_lookup_values
#
# Purpose: retrieved Cinergy ID, MG ID and CG ID from lookup
file
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: Populates LOOKUP_CIN_ID, LOOKUP_MG_ID,
LOOKUP_CG_ID variables
#
###########################################################################################################
function get_next_lookup_values()
{
# Read line from lookup file
RETURN_CODE = getline LINE < LOOKUP_FILE
if (RETURN_CODE <= 0)
{
log_message("There are no more lookup records to read or error
reading lookup file!")
return 1
}

# get number of fields from read line and split it into fileds
FIELD_COUNT = split(LINE, F, FS)
if (FIELD_COUNT != 3)
{
# Invalid format of lookup file
log_message("Record on lookup file: " LINE " does not have 3
fields!!!")
return 1
}

# Assign values
LOOKUP_CIN_ID = F[1]
LOOKUP_MG_ID = F[2]
LOOKUP_CG_ID = F[3]

return 0
}


###########################################################################################################
#
# Function name: dedup_field_by_value
#
# Purpose: Removes array elements (DATA) that at field
position FIELD_POS do not match FIELD_VALUE.
# After, the removal is done, at least one array
element must remain.
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: Modifies DATA array, modifies $0
#
###########################################################################################################
function dedup_field_by_value(FIELD_POS, FIELD_VALUE, FIELD_NAME)
{
# Go through each array element
VALUE_FOUND = "false"
COUNTER = 0
for (i = 1; i <= DATA_SIZE; ++i)
{
# Get array element
LINE = DATA[i]
$0 = LINE

# If field is not empty and doesn't match FIELD_VALUE, remove it
(set it to empty string)
if (trim($FIELD_POS) != "" && $FIELD_POS != FIELD_VALUE)
{
DATA[i] = ""
++COUNTER
}
else if ($FIELD_POS == FIELD_VALUE)
{
VALUE_FOUND = "true"
}
}

# if all records were removed - error condition
if (VALUE_FOUND != "true")
{
log_message("All records were removed while searching for field
value: " FIELD_VALUE " of field: " FIELD_NAME " at position: "
FIELD_POS)
return 1
}

return 0
}


###########################################################################################################
#
# Function name: log_data_records
#
# Purpose: outputs records that currently being processed to
log file.
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: None
#
###########################################################################################################
function log_data_records()
{
log_message("Currently processing following client records:")
for (i = 1; i <= DATA_SIZE; ++i)
{
log_message(DATA[i])
}
}


###########################################################################################################
#
# Function name: dedup_field_by_valid_value
#
# Purpose: Removes array elements (DATA) that at field
position FIELD_POS are empty, UNKNOWN or
# Default. After valid field is found, remove
remaining elements.
#
# Parameters: None
#
# Return: 0 if success, 1 if failure
#
# Side effects: Modifies DATA array, modifies $0
#
###########################################################################################################
function dedup_field_by_valid_value(FIELD_POS)
{
# loop for each array element
FOUND="false"
for (i = 1; i <= DATA_SIZE; ++i)
{
# Get array element
LINE = DATA[i]
$0 = LINE

# If array element is not empty, no valid field value is found
so far, and this field is valid - mark field as found
if (DATA[i] != "" && FOUND != "true" && trim($FIELD_POS) != ""
&& tolower($FIELD_POS) != "unknown" && tolower($FIELD_POS) !=
"default")
{
FOUND = "true"
continue
}

# If array element is not empty, and we already found a valid
value, remove array element
if (DATA[i] != "" && FOUND == "true" && trim($FIELD_POS) != "")
{

DATA[i] = ""
}
}

return 0
}








.



Relevant Pages

  • Re: nawk: out of space in tostring on ...
    ... with nawk program that does some merging on the file and outputs the ... who have a unique id but can have multiple rows per client (think ... every client in list file there is an entry in the lookup file. ... ACCOUNT TYPE #1 | Account balance for ACCOUNT TYPE #1 ...
    (comp.lang.awk)
  • Re: Exception.StackTrace how can i serialize it to client app
    ... the string returned by StackTrace on client site is empty. ... The workaround I have done for my exceptions is serialization StackTrace ... On server site in GetObjectDataI am serializing it. ...
    (microsoft.public.dotnet.framework.remoting)
  • Re: AND, IF, and SUM IF Functions
    ... also note that if C2 is empty and any entry in A is empty than that ... and then for sheet2 to automatically total the hours according to the ... client and under the right month in which the services were performed. ...
    (microsoft.public.excel.worksheet.functions)
  • FMP8.5 Hiding an object
    ... There is one section of the layout that the client would like to be ... It doesn't need to move up, just create an "empty" space. ... the object doesn't disappear completely [it's fields and boxes and the boxes ...
    (comp.databases.filemaker)
  • Re: Exception.StackTrace how can i serialize it to client app
    ... The server's stack trace should come along automatically from the server to ... When you say the stack trace is empty, ... the stack trace from the client side (including all the proxy methods and ...
    (microsoft.public.dotnet.framework.remoting)

Loading