Embedding AWK in a C Program #
This article introduces an embedded AWK interpreter that can be called from a C/C++ program.
Introduction #
AWK is a language of considerable influence and prestigious lineage. For the shortest (and funniest) of introductions check the above comic of fellow Montrealer Julia Evans. AWK is an easy to use scripting language created by Alfred Aho (of Dragon Book fame), Peter Weinberger, and Brian Kernighan (of K&R fame). With such illustrious parents it’s not surprising that AWK quickly became one of the most popular scripting languages. It also promoted innovative concepts, like associative arrays, way before they became mainstream.
AWK is at its best when you have some data that you have to “massage” to fit into a database. Of course you could run the data through AWK before passing it to the database engine but if you want to use a small embedded database engine like SQLITE and you want to have everything in the same program, would be nice to have the scripting language embedded in your C/C++ program. I couldn’t find an AWK interpreter that could be embedded in C so I decided to make my own. This article describes the resulting code. A passing familiarity with AWK helps in following along but it’s not required. If you want to brush-off your AWK skills you can check this tutorial.
Sample Usage #
Let’s start with a short example as an “executive summary” of what you can do with the code.
The popular Unix wc program counts the lines, words and bytes in a file. Here is a stripped-down implementation using my AWK library:
#include <awklib.h>
int main (int argc, char **argv)
{
AWKINTERP* interp = awk_init (NULL); //initialize an interpreter object
awk_setprog (interp, //set the AWK program
"{wc += NF; bc += length($0)}\n"
"END {print NR, wc, bc, ARGV[1]}");
awk_compile (interp); //compile the program
if (argc > 1)
awk_addarg (interp, argv[1]); //add our argument as an interpreter argument
awk_exec (interp); //execute AWK code
awk_end (interp); //cleanup everything
}
The program outlines the basic steps to have to follow to use the AWK library:
- First you must create an interpreter object. All the other API functions take an interpreter as the first argument.
- You pass a program to the interpreter. As we will see there are more ways of passing programs to the interpreter, but this is the simplest one.
- The program must be “compiled”. The interpreter doesn’t really compile it but it produces a syntax tree that will be used in the execution phase.
- You can add input files to be processed by the AWK program. In this case we add
argv[1]
, the first argument that was received by the C program. - The AWK program gets executed. By default, the output is sent to stdout.
- In the end, the interpreter object is deleted.
Design Decisions #
The first question I had to answer was what version of AWK code should I use. There are many variants of awk: gawk, mawk, etc. In the end I decided to use Dr. Kernighan’s original onetrueawk. The code can be found at: https://github.com/onetrueawk/awk and it gives the sensation of being in a computer science museum. Just an example: one of the test files seems to be the etc/passwd
file of a Unix system. It goes like this:
/dev/rrp3:
17379 mel
16693 bwk me
16116 ken him someone else
...
8895 dmr
Assuming bwk stands for Brian W. Kernighan, I’ll let you figure who ken and dmr might be 😊. For added antique flavor, /dev/rrp3: indicates the file was residing on a DEC RP03, RP04 or RP06. The RP06 were huge disks for their time with 128MB of data. Yey!
Given the historical quality of the code, I would have liked to keep it as a pure C project. Unfortunately, this was not possible mainly due to error handling issues. For a stand-alone program it is perfectly acceptable to bail out in case of error; an embedded interpreter doesn’t have this option. My solution was to wrap large portions of code in try...catch
blocks. This idea of keeping it as a C product, at least on the outside, is the reason why I didn’t organize the API as a C++ object.
One part that had to be completely rewritten was the function call mechanism.
API Description #
The API is not very large; I’ve tried to keep it to a minimum and the plan is to add new functions only if really needed. It is structured around an opaque AWKINTERP
object representing the AWK language interpreter. Its life cycle follows a series of irreversible state transitions: initialization, program compilation, program execution and destruction.
Access to interpreter’s variables is done through an awksymb
structure:
struct awksymb {
const char *name; //variable name
const char *index; //array index
unsigned int flags; //variable type flags
double fval; //numerical value
char *sval; //string value
};
The same structure is used to pass parameters to an AWK callable C function (see awk_setfunc
API call).
In AWK, variables don’t have a defined type or, better said, they are all strings and sometimes get converted to numbers if they are need for a numerical operation. In the awksymb
structure, the flags
member indicates if the symbol is a string, in which case sval
member is valid, or a number with the value in fval
.
Arrays are also special in AWK. As I said before, all arrays are associative and are “indexed” by a character string. If the flag AWK_ARRAY
is set in the flags
member, the variable is an array and the index
member represents the array index.
Following is a brief description of each API function.
awk_init
#
AWKINTERP* awk_init (const char **vars);
The function initializes a new AWK interpreter object. It takes an array of variable definitions with the same format as the -v
command-line arguments of stand-alone AWK interpreter. The array is terminated with a NULL string.
awk_setprog
#
int awk_setprog (AWKINTERP* pi, const char *prog);
Set the program text for an interpreter. This function can be called only once for an interpreter.
awk_addprogfile
#
int awk_addprogfile (AWKINTERP* pi, const char *progfile);
Adds the content of a file as AWK program. The functionality is equivalent with the -f
switch on the command line of stand-alone interpreter. Just like the -f
switch, this function can be called repeatedly to add multiple programs.
awk_compile
#
int awk_compile (AWKINTERP* pi);
Compiles the AWK language program(s) that have been specified using awk_setprog
or awk_addprogfile
functions.
awk_addarg
#
int awk_addarg (AWKINTERP* pi, const char *arg);
Add a new argument to the interpreter. The argument can be an input file name or a variable definition, if it has the syntax var=value
. Arguments can be added at any time before starting execution of the AWK program.
Example:
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print pass+1 \"-\" NR, $0}");
awk_compile (pi);
awk_addarg (pi, "infile.txt");
awk_addarg (pi, "pass=1");
awk_addarg (pi, "infile.txt");
The output is (assuming infile.txt
has 25 lines):
1 - 25
2 - 25
awk_exec
#
int awk_exec (AWKINTERP* pi);
Execute a compiled program. The function returns the value returned by exit statement or a negative error code if something went wrong. If a program terminates without an exit statement, the returned value is 0. Small negative values should be considered reserved for error conditions.
Example:
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print NR, $0}");
awk_compile (pi);
awk_addarg (pi, "infile.txt");
awk_exec (pi);
awk_run
#
int awk_run (AWKINTERP* pi, const char *progfile);
This function combines in one call the calls to awk_setprog
, awk_compile
and awk_exec
functions.
If a program terminates without an exit statement, the returned value is 0. Otherwise the function returns the value specified in the exit statement. Small negative values should be considered reserved for error conditions. If the program requires any arguments, they can be added using awk_addarg
function before calling awk_run
.
Example:
AWKINTERP *pi = awk_init (NULL);
awk_addarg (pi, "infile.txt");
awk_run (pi, "{print NR, $0}");
awk_end
#
void awk_end (AWKINTERP* pi);
Releases all memory allocated by the interpreter object.
awk_setinput
#
int awk_setinput (AWKINTERP* pi, const char *fname);
Forces interpreter to read input from a file. By default, an interpreter reads from stdin. This function redirects the input to another file.
awk_infunc
#
void awk_infunc (AWKINTERP* pi, inproc fn);
Change the input function with a user-defined function. The default input function is getc
or fgetc
. The inproc
function has the same signature as getc
:
typedef int (*inproc)();
and it returns the next character or EOF if there are no more characters.
Here is an example of how to use AWK to process some in-memory data:
std::istrstream instr{
"Record 1\n"
"Record 2\n"
};
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "{print NR, $0}");
awk_compile (pi);
awk_infunc (pi, []()->;int {return instr.get (); });
awk_setoutput
#
int awk_setoutput (AWKINTERP* pi, const char *fname);
Redirect interpreter output to a file. By default, the interpreter output goes to stdout. Using this function, you can redirect it to a different file.
Example:
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "BEGIN {print \"Output redirected\"}");
awk_compile (pi);
awk_setoutput (pi, "results.txt");
awk_exec (pi);
awk_outfunc
#
void awk_outfunc (AWKINTERP* pi, outproc fn);
Change the output function with a user-defined function. The outproc
function signature is:
typedef int (*outproc)(const char *buf, size_t len);
Example:
std::ostringstream out;
int strout (const char *buf, size_t sz)
{
out.write (buf, sz);
return out.bad ()? - 1 : 1;
}
//...
AWKINTERP *pi = awk_init (NULL);
awk_setprog (pi, "BEGIN {print \"Output redirected\"}");
awk_compile (pi);
awk_outfunc (pi, strout);
awk_getvar
#
int awk_getvar (AWKINTERP *pi, awksymb* var);
Retrieves the value of an AWK variable. The function returns 1 if successful or a negative error code otherwise.
If the variable is an array and the index
member is NULL, the function returns AWK_ERR_ARRAY
error code.
For string variables, the AWKSYMB_STR
flag is set and the function allocates the memory needed for the string by calling malloc
. The user has to release the memory by calling free
.
Example:
AWKINTERP *pi = awk_init (NULL);
awksymb var{ "NR" };
awk_setprog (pi, "{print NR, $0}\n");
awk_compile (pi);
awk_getvar (pi, &var);
awk_setvar
#
int awk_setvar (AWKINTERP *pi, awksymb* var);
Changes the value of an AWK variable. The function takes a pointer to an awksymb
structure with information about the variable. The user must set the flags
member of the awksymb
structure to indicate which values are valid (string or numerical). In addition, for array members, the user must specify the index and set the `AWKSYMB_ARR flag.
If the variable does not exist, it is created.
Example:
AWKINTERP *pi = awk_init (NULL);
awksymb v{ "myvar", NULL, AWKSYMB_NUM, 25.0 };
awk_setprog (pi, "{myvar++; print myvar}\n");
awk_compile (interp);
awk_compile (pi);
awk_setvar (pi, &v);
awk_exec (pi); //output is "26"
awk_addfunc
#
Adds a user defined function to the interpreter.
int awk_addfunc (AWKINTERP *pi, const char *name, awkfunc fn, int nargs);
Parameters:
pi
- pointer to an interpreter objectname
- function namefn
- pointer to function. See [awkfunc] for prototype.nargs
- number of function arguments
The function returns 1 if successful or a negative error code otherwise.
External user-defined functions can be called from AWK code just like any AWK user-defined function. The nargs
parameter specifies the expected number of parameters but, like with any AWK function, the number of actual arguments can be different. The interpreter will provide null values for any missing parameters.
The function can return a value by setting it into the ret
variable and setting the appropriate flags. String values must be allocated using malloc
.
It should be called only after the AWK program has been compiled.
Example:
void fact (AWKINTERP *pi, awksymb* ret, int nargs, awksymb* args)
{
int prod = 1;
for (int i = 2; i <= args[0].fval; i++)
prod *= i;
ret->fval = prod;
ret->flags = AWKSYMB_NUM;
}
//...
awk_setprog (pi, " BEGIN {n = factorial(3); print n}");
awk_compile (pi);
awk_addfunc (pi, "factorial", fact, 1);
awk_exec (pi);
Final thoughts #
The source code has been compiled with VisualStudio 2020. There is also a small makefile for gcc. The syntax analyzer uses YACC so you will need a YACC compiler if you want to do a full rebuild. I have included however the files generate by YACC (ytab.cpp
and ytab.h
) so you can build it even if you don’t have a YACC compiler.
This concludes the presentation of my embedded AWK interpreter. It can be easily incorporated into a C/C++ program and has a good communication with host program. The host can access any interpreter variable and the interpreter can call external functions defined by host program. Size-wise, the interpreter is very small. You can expect an overhead of about 100 KB which is a decent number when compared with other interpreters (Lua takes about twice as much).
I will continue to improve the embedded AWK interpreter. If you want to contribute to this project or just get the latest version you can find it at: https://github.com/neacsum/awk.
History #
6-Apr-2020 Initial version