Embedding AWK in a C Program

This article introduces an embedded AWK interpreter that can be called from a C/C++ program.

Introduction #

AWK is a language of considerable influence and prestigious lineage. For the shortest (and funniest) of introductions, check out the comic above by fellow Montrealer Julia Evans. AWK is an easy-to-use scripting language created by Alfred Aho (of Dragon Book fame), Peter Weinberger, and Brian Kernighan (of K&R fame). With such illustrious parents it’s not surprising that AWK quickly became one of the most popular scripting languages. It also promoted innovative concepts, like associative arrays, long before they became mainstream.

AWK is at its best when you have some data that you have to “massage” to fit into a database. Of course, you could run the data through AWK before passing it to the database engine, but if you want to use a small embedded database engine like SQLite and keep everything in the same program, it would be nice to have the scripting language embedded in your C/C++ program. I couldn’t find an AWK interpreter that could be embedded in C, so I decided to make my own. This article describes the resulting code. A passing familiarity with AWK helps in following along but it’s not required. If you want to brush up on your AWK skills, you can check this tutorial.

Sample Usage #

Let’s start with a short example as an “executive summary” of what you can do with the code.

The popular Unix wc program counts the lines, words and bytes in a file. Here is a stripped-down implementation using my AWK library:

#include <awklib.h>

int main (int argc, char **argv)
{
  AWKINTERP* interp = awk_init (NULL);			//initialize an interpreter object
  awk_setprog (interp,							//set the AWK program
    "{wc += NF; bc += length($0)}\n"
    "END {print NR, wc, bc, ARGV[1]}");
  awk_compile (interp);							//compile the program
  if (argc > 1)
    awk_addarg (interp, argv[1]);				//add our argument as an interpreter argument
  awk_exec (interp);							//execute AWK code
  awk_end (interp);								//cleanup everything
}

The program outlines the basic steps you have to follow to use the AWK library:

  • First you must create an interpreter object. All the other API functions take an interpreter as the first argument.
  • You pass a program to the interpreter. As we will see there are more ways of passing programs to the interpreter, but this is the simplest one.
  • The program must be “compiled”. The interpreter doesn’t really compile it but it produces a syntax tree that will be used in the execution phase.
  • You can add input files to be processed by the AWK program. In this case we add argv[1], the first argument that was received by the C program.
  • The AWK program gets executed. By default, the output is sent to stdout.
  • In the end, the interpreter object is deleted.

Design Decisions #

The first question I had to answer was which version of the AWK code to use. There are many variants of awk: gawk, mawk, etc. In the end I decided to use Dr. Kernighan’s original onetrueawk. The code can be found at: https://github.com/onetrueawk/awk and it gives the sensation of being in a computer science museum. Just an example: one of the test files seems to be the etc/passwd file of a Unix system. It goes like this:

/dev/rrp3:

17379	mel
16693	bwk	me
16116	ken	him	someone else
...
8895	dmr

Assuming bwk stands for Brian W. Kernighan, I’ll let you figure out who ken and dmr might be 😊. For added antique flavor, /dev/rrp3: indicates the file resided on a DEC RP03, RP04 or RP06. The RP06 was a huge disk for its time, with 128MB of data. Yay!

Given the historical quality of the code, I would have liked to keep it as a pure C project. Unfortunately, this was not possible mainly due to error handling issues. For a stand-alone program it is perfectly acceptable to bail out in case of error; an embedded interpreter doesn’t have this option. My solution was to wrap large portions of code in try...catch blocks. This idea of keeping it as a C product, at least on the outside, is the reason why I didn’t organize the API as a C++ object.
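The idea of wrapping code in try...catch blocks while keeping a C interface can be sketched as follows. This is only an illustration of the pattern, not the library's actual code: demo_divide, divide_internal and the DEMO_ error codes are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical error codes; the real library defines its own.
enum { DEMO_OK = 1, DEMO_ERR_RUNTIME = -1, DEMO_ERR_UNKNOWN = -2 };

// Stand-in for interpreter internals: code that used to bail out of the
// whole program on error now throws an exception instead.
static int divide_internal (int a, int b)
{
  if (b == 0)
    throw std::runtime_error ("division by zero");
  return a / b;
}

// C-callable entry point. No exception is allowed to escape past this
// function; every failure is turned into a negative return code.
extern "C" int demo_divide (int a, int b, int *result)
{
  assert (result != nullptr);
  try {
    *result = divide_internal (a, b);
    return DEMO_OK;
  }
  catch (const std::exception&) {
    return DEMO_ERR_RUNTIME;   // a known failure, reported as an error code
  }
  catch (...) {
    return DEMO_ERR_UNKNOWN;   // anything else is still contained
  }
}
```

A C caller sees only an ordinary function returning a status code, which is why the public interface can stay C even though the implementation relies on C++ exceptions.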

One part that had to be completely rewritten was the function call mechanism.

API Description #

The API is not very large; I’ve tried to keep it to a minimum and the plan is to add new functions only if really needed. It is structured around an opaque AWKINTERP object representing the AWK language interpreter. Its life cycle follows a series of irreversible state transitions: initialization, program compilation, program execution and destruction.
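One way to picture the "irreversible state transitions" is a tiny state machine that only ever moves forward. The sketch below is an assumption about the general idea, not the library's implementation: the Stage names and the Lifecycle type are hypothetical, and the article does not say whether the real interpreter allows skipping a stage or running a program twice.

```cpp
#include <cassert>

// Hypothetical stage names; the library keeps its real state hidden
// inside the opaque AWKINTERP object.
enum class Stage { Initialized, Compiled, Executed, Destroyed };

// A transition is acceptable only if it moves strictly forward.
constexpr bool can_advance (Stage from, Stage to)
{
  return static_cast<int>(to) > static_cast<int>(from);
}

struct Lifecycle {
  Stage stage = Stage::Initialized;

  // Moves to the requested stage. Returns false and leaves the current
  // stage unchanged when the request would go backwards or stay in place.
  bool advance (Stage to)
  {
    if (!can_advance (stage, to))
      return false;
    stage = to;
    return true;
  }
};
```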

Access to interpreter’s variables is done through an awksymb structure:

struct awksymb {	
   const char *name;        //variable name
   const char *index;       //array index
   unsigned int flags;      //variable type flags
   double fval;             //numerical value
   char *sval;              //string value
};

The same structure is used to pass parameters to an AWK callable C function (see awk_addfunc API call). In AWK, variables don’t have a defined type or, better said, they are all strings and sometimes get converted to numbers if they are needed for a numerical operation. In the awksymb structure, the flags member indicates if the symbol is a string, in which case the sval member is valid, or a number with the value in fval.

Arrays are also special in AWK. As I said before, all arrays are associative and are “indexed” by a character string. If the flag AWKSYMB_ARR is set in the flags member, the variable is an array and the index member represents the array index.
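To make the role of the flags member concrete, here is a small sketch that formats an awksymb for display. The structure is copied from the listing above; the flag names appear in this article, but their numeric values below are assumptions made up for the example (the real ones come from awklib.h), and describe is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Structure as shown in the article.
struct awksymb {
  const char *name;        // variable name
  const char *index;       // array index
  unsigned int flags;      // variable type flags
  double fval;             // numerical value
  char *sval;              // string value
};

// ASSUMPTION: the names come from the article, the values are invented here.
constexpr unsigned int AWKSYMB_NUM = 0x01;
constexpr unsigned int AWKSYMB_STR = 0x02;
constexpr unsigned int AWKSYMB_ARR = 0x04;

// Builds a printable description, consulting flags to decide which
// members of the structure are meaningful.
std::string describe (const awksymb& s)
{
  std::string out = s.name;
  if ((s.flags & AWKSYMB_ARR) && s.index)
    out += std::string ("[") + s.index + "]";   // array element
  out += " = ";
  if (s.flags & AWKSYMB_STR)
    out += std::string ("\"") + (s.sval ? s.sval : "") + "\"";
  else if (s.flags & AWKSYMB_NUM) {
    char buf[32];
    std::snprintf (buf, sizeof buf, "%g", s.fval);
    out += buf;
  }
  else
    out += "(unset)";
  return out;
}
```

With these assumed values, an element apple of an array color holding the string red is described as color[apple] = "red".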

Following is a brief description of each API function.

awk_init #

AWKINTERP* awk_init (const char **vars);

The function initializes a new AWK interpreter object. It takes an array of variable definitions with the same format as the -v command-line arguments of the stand-alone AWK interpreter. The array is terminated with a NULL pointer.

awk_setprog #

int awk_setprog (AWKINTERP* pi, const char *prog);

Set the program text for an interpreter. This function can be called only once for an interpreter.

awk_addprogfile #

int awk_addprogfile (AWKINTERP* pi, const char *progfile);

Adds the content of a file as an AWK program. The functionality is equivalent to the -f switch on the command line of the stand-alone interpreter. Just like the -f switch, this function can be called repeatedly to add multiple programs.

awk_compile #

int awk_compile (AWKINTERP* pi);

Compiles the AWK language program(s) that have been specified using awk_setprog or awk_addprogfile functions.

awk_addarg #

int awk_addarg (AWKINTERP* pi, const char *arg);

Add a new argument to the interpreter. The argument can be an input file name or a variable definition, if it has the syntax var=value. Arguments can be added at any time before starting execution of the AWK program.

Example:

AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print pass+1 \"-\" NR, $0}");
awk_compile (pi);
awk_addarg (pi, "infile.txt");
awk_addarg (pi, "pass=1");
awk_addarg (pi, "infile.txt");
awk_exec (pi);

The output is (assuming infile.txt has 25 lines):

1 - 25
2 - 25
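How an argument gets classified as a variable definition rather than a file name is not shown in this article. As a rough sketch, the usual AWK convention is that an operand counts as an assignment when it starts with a letter or underscore, continues with letters, digits or underscores, and is then followed by an equals sign. The helper name looks_like_assignment is hypothetical, and the library's own test may differ in details.

```cpp
#include <cassert>
#include <cctype>

// Returns true if arg has the shape name=value, where name is a valid
// AWK identifier. Anything else would be treated as an input file name.
bool looks_like_assignment (const char *arg)
{
  assert (arg != nullptr);
  const unsigned char *p = reinterpret_cast<const unsigned char *>(arg);

  // An identifier cannot start with a digit or punctuation.
  if (!std::isalpha (*p) && *p != '_')
    return false;

  // Skip over the rest of the identifier.
  while (std::isalnum (*p) || *p == '_')
    ++p;

  // The identifier must be immediately followed by '='.
  return *p == '=';
}
```

Under this rule "pass=1" is an assignment while "infile.txt" and "data/file=a.txt" are file names, because the text before the equals sign in the last one is not a valid identifier.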

awk_exec #

int awk_exec (AWKINTERP* pi);

Executes a compiled program. The function returns the value passed to the exit statement or a negative error code if something went wrong. If a program terminates without an exit statement, the returned value is 0. Small negative values should be considered reserved for error conditions.

Example:

AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print NR, $0}");
awk_compile (pi);
awk_addarg (pi, "infile.txt");
awk_exec (pi);

awk_run #

int awk_run (AWKINTERP* pi, const char *progfile);

This function combines the calls to awk_setprog, awk_compile and awk_exec into a single call.

If a program terminates without an exit statement, the returned value is 0. Otherwise the function returns the value specified in the exit statement. Small negative values should be considered reserved for error conditions. If the program requires any arguments, they can be added using awk_addarg function before calling awk_run.

Example:

AWKINTERP *pi = awk_init (NULL);
awk_addarg (pi, "infile.txt");
awk_run (pi, "{print NR, $0}");

awk_end #

void awk_end (AWKINTERP* pi);

Releases all memory allocated by the interpreter object.

awk_setinput #

int awk_setinput (AWKINTERP* pi, const char *fname);

Forces the interpreter to read input from a file. By default, an interpreter reads from stdin. This function redirects the input to another file.

awk_infunc #

void awk_infunc (AWKINTERP* pi, inproc fn);

Replaces the input function with a user-defined function. The default input function is getc or fgetc. The inproc function has the same signature as getc:

typedef int (*inproc)();

and it returns the next character or EOF if there are no more characters.

Here is an example of how to use AWK to process some in-memory data:

std::istrstream instr{
  "Record 1\n"
  "Record 2\n"
};

AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print NR, $0}");
awk_compile (pi);
awk_infunc (pi, []()->int {return instr.get (); });
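Because inproc is a plain function pointer, a lambda can be passed only if it captures nothing, so the data source has to be a global or static object. As an alternative to the deprecated std::istrstream, an input function can walk a string in memory directly. The following is a sketch only: mem_data, mem_pos and mem_getc are hypothetical names, and only the inproc typedef is taken from the article.

```cpp
#include <cassert>
#include <cstdio>   // for EOF

// Signature quoted in the article.
typedef int (*inproc)();

// Hypothetical in-memory source. It lives at file scope because a
// function with this signature cannot receive any context argument.
static const char *mem_data = "Record 1\nRecord 2\n";
static const char *mem_pos = mem_data;

// getc-like reader: returns the next character as a non-negative value,
// or EOF once the terminating null character is reached.
int mem_getc ()
{
  if (*mem_pos == '\0')
    return EOF;
  return static_cast<unsigned char>(*mem_pos++);
}
```

A function like mem_getc has the right type to be handed to awk_infunc in place of the lambda.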

awk_setoutput #

int awk_setoutput (AWKINTERP* pi, const char *fname);

Redirect interpreter output to a file. By default, the interpreter output goes to stdout. Using this function, you can redirect it to a different file.

Example:

AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "BEGIN {print \"Output redirected\"}");
awk_compile (pi);
awk_setoutput (pi, "results.txt");
awk_exec (pi);

awk_outfunc #

void awk_outfunc (AWKINTERP* pi, outproc fn);

Replaces the output function with a user-defined function. The outproc function signature is:

typedef int (*outproc)(const char *buf, size_t len);

Example:

std::ostringstream out;
int strout (const char *buf, size_t sz)
{
  out.write (buf, sz);
  return out.bad () ? -1 : 1;
}
//...
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "BEGIN {print \"Output redirected\"}");
awk_compile (pi);
awk_outfunc (pi, strout);

awk_getvar #

int awk_getvar (AWKINTERP *pi, awksymb* var);

Retrieves the value of an AWK variable. The function returns 1 if successful or a negative error code otherwise.

If the variable is an array and the index member is NULL, the function returns AWK_ERR_ARRAY error code.

For string variables, the AWKSYMB_STR flag is set and the function allocates the memory needed for the string by calling malloc. The user has to release the memory by calling free.

Example:

AWKINTERP *pi = awk_init (NULL);
awksymb var{ "NR" };

awk_setprog (pi, "{print NR, $0}\n");
awk_compile (pi);
awk_getvar (pi, &var);
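Since string values returned by awk_getvar are allocated with malloc and must be released with free, C++ callers may want to hand the pointer to a smart pointer right away. The sketch below shows one way to do that; malloc_string, take_string and demo_fetch_string are hypothetical names, and demo_fetch_string merely stands in for an API call that returns a malloc-allocated string.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Deleter that releases memory with free, matching allocation by malloc.
struct FreeDeleter {
  void operator() (char *p) const { std::free (p); }
};
using malloc_string = std::unique_ptr<char, FreeDeleter>;

// Stand-in for a C function that returns a malloc-allocated string,
// the way awk_getvar is described as filling in the sval member.
char *demo_fetch_string ()
{
  const char *text = "hello";
  char *copy = static_cast<char *>(std::malloc (std::strlen (text) + 1));
  if (copy)
    std::strcpy (copy, text);
  return copy;
}

// Takes ownership of raw, copies it into a std::string and frees the
// original buffer when owner goes out of scope.
std::string take_string (char *raw)
{
  malloc_string owner (raw);
  return owner ? std::string (owner.get ()) : std::string ();
}
```

In real use the argument to take_string would be the sval member of the awksymb filled in by awk_getvar.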

awk_setvar #

int awk_setvar (AWKINTERP *pi, awksymb* var);

Changes the value of an AWK variable. The function takes a pointer to an awksymb structure with information about the variable. The user must set the flags member of the awksymb structure to indicate which values are valid (string or numerical). In addition, for array members, the user must specify the index and set the AWKSYMB_ARR flag.

If the variable does not exist, it is created.

Example:

AWKINTERP *pi = awk_init (NULL);
awksymb v{ "myvar", NULL, AWKSYMB_NUM, 25.0 };
awk_setprog (pi, "{myvar++; print myvar}\n");
awk_compile (pi);
awk_setvar (pi, &v);
awk_exec (pi);  //output is "26"

awk_addfunc #

Adds a user defined function to the interpreter.

int awk_addfunc (AWKINTERP *pi, const char *name, awkfunc fn, int nargs);

Parameters:

  • pi - pointer to an interpreter object
  • name - function name
  • fn - pointer to the function (an awkfunc; the example below shows the expected signature)
  • nargs - number of function arguments

The function returns 1 if successful or a negative error code otherwise.

External user-defined functions can be called from AWK code just like any AWK user-defined function. The nargs parameter specifies the expected number of parameters but, like with any AWK function, the number of actual arguments can be different. The interpreter will provide null values for any missing parameters.

The function can return a value by setting it into the ret variable and setting the appropriate flags. String values must be allocated using malloc.

awk_addfunc should be called only after the AWK program has been compiled.

Example:

void fact (AWKINTERP *pi, awksymb* ret, int nargs, awksymb* args)
{
  int prod = 1;
  for (int i = 2; i <= args[0].fval; i++)
    prod *= i;
  ret->fval = prod;
  ret->flags = AWKSYMB_NUM;
}
//...
awk_setprog (pi, " BEGIN {n = factorial(3); print n}");
awk_compile (pi);
awk_addfunc (pi, "factorial", fact, 1);
awk_exec (pi);
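A function that returns a string follows the same shape as fact, except that the result must be allocated with malloc. The sketch below is written under several assumptions: the flag values are invented for the example, to_upper is a hypothetical function, and the article does not spell out how missing arguments are represented or who releases the returned string (presumably the interpreter takes ownership).

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <cstring>

struct AWKINTERP;          // opaque interpreter type, as in the article

// Structure as shown in the article.
struct awksymb {
  const char *name;
  const char *index;
  unsigned int flags;
  double fval;
  char *sval;
};

// ASSUMPTION: the names come from the article, the values are invented here.
constexpr unsigned int AWKSYMB_NUM = 0x01;
constexpr unsigned int AWKSYMB_STR = 0x02;

// Returns an upper-case copy of the first argument. A missing or
// non-string argument is treated as an empty string.
void to_upper (AWKINTERP *pi, awksymb *ret, int nargs, awksymb *args)
{
  (void)pi;   // the interpreter handle is not needed here
  const char *src = "";
  if (nargs > 0 && (args[0].flags & AWKSYMB_STR) && args[0].sval)
    src = args[0].sval;

  std::size_t len = std::strlen (src);
  char *dst = static_cast<char *>(std::malloc (len + 1));
  if (!dst)
    return;   // allocation failed: leave ret untouched

  for (std::size_t i = 0; i < len; i++)
    dst[i] = static_cast<char>(std::toupper (static_cast<unsigned char>(src[i])));
  dst[len] = '\0';

  ret->sval = dst;            // malloc-allocated, as the API requires
  ret->flags = AWKSYMB_STR;
}
```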

Final thoughts #

The source code has been compiled with Visual Studio 2019. There is also a small makefile for gcc. The syntax analyzer uses YACC, so you will need a YACC compiler if you want to do a full rebuild. However, I have included the files generated by YACC (ytab.cpp and ytab.h) so you can build it even if you don’t have a YACC compiler.

This concludes the presentation of my embedded AWK interpreter. It can be easily incorporated into a C/C++ program and communicates well with the host program: the host can access any interpreter variable and the interpreter can call external functions defined by the host program. Size-wise, the interpreter is very small. You can expect an overhead of about 100 KB, which is a decent number when compared with other interpreters (Lua takes about twice as much).

I will continue to improve the embedded AWK interpreter. If you want to contribute to this project or just get the latest version you can find it at: https://github.com/neacsum/awk.

History #

6-Apr-2020 Initial version