Advanced grepping through directory trees with binary data

When reverse engineering stuff you often get a directory tree with a whole bunch of files (both binaries and text files) and you want to quickly find all occurrences of keywords you are interested in. Typical examples are program directories of applications, extracted apps or the root filesystem of an embedded system.

When reverse engineering Linux-based firmware images you typically start by extracting the root filesystem (or initrd) so that you can analyze the userspace programs, scripts and configuration files. There are already some good tutorials ([1] [2] [3]) and tools like binwalk and firmware-mod-kit which automate many steps of finding and extracting the root filesystem from a binary firmware image. However, once you've got the root filesystem, you are often faced with a whole bunch of files and it can be quite difficult to find the interesting stuff to analyze. For instance, you may find a juicy configuration variable in /etc and want to find all references to it in the firmware.

The standard grep utility does a good job at searching text files, but it isn't nearly as useful for binary files, which may still contain the keyword you are looking for. By default grep only reports whether the keyword is present in a binary file; it doesn't display the context around the match (as it does for text files). Forcing grep to treat binaries as text files using the -a option doesn't solve the problem either, since grep will then output all the binary data between the newlines surrounding the match, and you probably don't want to see this binary data in your terminal.

But luckily there are a lot of useful standard tools available on a Linux system and you can cleverly combine them to overcome this limitation. I've come up with the following command for grepping through directory trees:

find . -type f -print0|xargs -0 strings -a --print-file-name|grep -i -E ':.*your_keyword_here'|less -S

The find command just searches the current directory for files and prints the filenames to standard output separated by null bytes. Using a null byte instead of a newline makes sure that the pipeline doesn't fail if filenames in the tree contain special characters such as spaces or newlines. The filter "-type f" makes sure that it only finds regular files and not directories, symlinks, devices or unix domain sockets, which may exist in your directory as well and would cause problems with the following tools.
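As a quick illustration (using throwaway files in a temporary scratch directory, names chosen for the example), a filename containing spaces survives the -print0/-0 pairing as a single argument, while the -type f filter drops the subdirectory:

```shell
# Set up a scratch directory with an awkward filename (illustrative only).
dir=$(mktemp -d)
printf 'hello' > "$dir/file with spaces.txt"
mkdir "$dir/subdir"            # a directory, filtered out by -type f

# Null-separated: xargs sees exactly one argument despite the spaces,
# so exactly one line is printed.
count=$(find "$dir" -type f -print0 | xargs -0 -n1 printf '%s\n' | wc -l)
echo "$count"

rm -rf "$dir"
```

With newline-separated output and plain xargs, the same filename would have been split into three arguments.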

The output of find is piped to xargs, which will call the command strings for all files found by the find command. The option "-0" tells xargs that the input is separated by null bytes instead of newlines. The program strings looks through each file and outputs all sequences of at least 4 printable characters. Since grep processes the output of strings and not the actual files, grep can't show the filename of a match (as it does when using grep to recursively search in a directory). Since you typically want to know which files your search results are in, you can use the option --print-file-name of strings so that each output line contains the filename as well. The -a option of strings tells it to scan the whole file and not only certain sections of ELF files.
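The core of what strings does by default (keep runs of at least 4 printable characters) can be approximated with standard tools; the following is a rough sketch, not a full replacement (it ignores tabs, the -a/ELF-section logic and the filename prefix):

```shell
# Build a small binary-ish sample file: text fragments surrounded by raw bytes.
f=$(mktemp)
printf 'admin_password\000\001\002ab\003config_path\n' > "$f"

# Replace every non-printable byte with a newline, then keep only runs of
# at least 4 printable characters -- roughly what strings does by default.
tr -c '[:print:]' '\n' < "$f" | grep -E '.{4,}'

rm -f "$f"
```

The two-character run "ab" is discarded, just as strings would discard it with the default minimum length of 4.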

The next step is to use grep to filter the output of strings in order to search for a specific keyword. If you want a case-sensitive search, remove the -i option of grep. Using the pattern ':.*' before the actual keyword makes sure that grep won't flood your search results with all strings of a file whenever the filename (which is prepended by the --print-file-name option of strings) already contains the keyword you are searching for.
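The effect is easy to demonstrate on simulated "strings --print-file-name" output (the filenames and strings below are made up for the example): without the ':.*' prefix every line from a file whose name contains the keyword matches; with it, only lines whose string content contains the keyword remain.

```shell
# Simulated "filename: string" output. The first file's *name* contains
# the keyword "passwd", but only one actual string does.
sample='passwd_tool: usage message
passwd_tool: version 1.0
config.bin: root_passwd=secret'

# Naive pattern: all three lines match because of the filename prefix.
naive=$(echo "$sample" | grep -c -i 'passwd')

# Anchored after the filename separator: only the real hit remains.
anchored=$(echo "$sample" | grep -c -i -E ':.*passwd')

echo "$naive $anchored"
```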

Last but not least I recommend piping the results to less -S so that less will only use one line of the screen per result. This makes the results easier to interpret especially if you have really long lines in the results (which occasionally happens with firmware images) and you don't want to have a hundred lines of wrapped text for one single search result. You can still see the full output lines by scrolling horizontally in less (or just use the search function of less to navigate to the actual keyword).

The search can take some time especially for large directory trees. In that case you can easily speed up the process by saving the output of strings to a file:

find . -type f -print0|xargs -0 strings -a --print-file-name > /tmp/strings.txt

This intermediate result can then be used for many searches:

cat /tmp/strings.txt|grep -i -E ':.*your_keyword_here'|less -S

A test with the 2.1 GB /usr/lib/ directory on my notebook created a 1.2 GB strings.txt, and searching this file takes around 10 seconds, provided the file is still cached in memory.

The same commands are also useful for other reversing targets such as program directories, extracted apps or even web applications (which may include binary files like sqlite databases).

If you expect other character encodings such as utf16 (which is quite common for Windows applications), you will need to use the -e option of strings. The following command tries 8-bit characters (ascii/utf8), 16-bit little-endian (utf16) and 32-bit little-endian (utf32); for big-endian targets you would use b and B instead of l and L:

for enc in S l L;do find . -type f -print0|xargs -0 strings -a -e $enc --print-file-name;done > /tmp/strings.txt
cat /tmp/strings.txt|grep -i -E ':.*your_keyword_here'|less -S
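The extra passes are needed because UTF-16LE stores ASCII characters with interleaved NUL bytes, so the default 8-bit scan walks right past them. A quick illustration (iconv is assumed to be available; dropping the NULs is only a crude stand-in for what strings -e l does):

```shell
# Encode a keyword as UTF-16LE: every ASCII char is followed by a NUL byte.
f=$(mktemp)
printf 'SecretToken' | iconv -f UTF-8 -t UTF-16LE > "$f"

# An 8-bit scan sees only single-character runs and finds nothing.
eight_bit=$(tr -c '[:print:]' '\n' < "$f" | grep -c -E '.{4,}' || true)

# Removing the interleaved NUL bytes recovers the text.
utf16=$(tr -d '\0' < "$f" | grep -c 'SecretToken')

echo "$eight_bit $utf16"

rm -f "$f"
```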