Many Intel ME 11.x modules are stored in Flash memory in compressed form . Two different compression methods are used: LZMA and Huffman encoding . LZMА can be decompressed using publicly available tools , but reversing Huffman encoding is a much more difficult challenge. Unpacking of Huffman encoding in ME is implemented in hardware, so writing the equivalent code in software is a far from trivial task—assuming it can be accomplished at all.
Previous ME versions
By reviewing sources from the unhuffme project , it is easy to see that previous versions of ME had two sets of Huffman tables, with each set containing two tables. The existence of two different tables (one for code and the other for data) is probably due to the fact that the code and data have very different statistical properties.
Other observed properties include:
- The range of codeword lengths is different among the sets of tables (7–19 versus 7–15 bits inclusive).
- Every sequence encodes a whole number of bytes (from 1 to 15 inclusive).
- Both sets use Canonical Huffman Coding  (which allows quickly determining the length of a codeword during unpacking).
- Within a given set, the lengths of encoded values for any codeword are the same in both tables (code table and data table).
We can assume that Huffman tables for ME 11.x have retained the latter three properties. So to fully recover the tables, what we need to find is:
- Range of lengths of codewords
- Shape (boundaries of values of codewords of the same length)
- Lengths of encoded sequences for each codeword
- Encoded values for each codeword in both tables
Splitting compressed data into pages
To learn about individual modules, we can start with what we already know about the internal structures of the firmware .
By looking at the Lookup Table, which is a part of the Code Partition Directory, we can determine which modules are affected by Huffman encoding, where their unpacked data starts, and what size the module will be once unpacked.
We can parse the Module Attributes Extension for a specific module to extract the size of the compressed/unpacked data and SHA256 value for the unpacked data.
A cursory look at several ME 11.x firmwares shows that the size of data after Huffman decoding is always a multiple of the size of a page (4096 == 0x1000 bytes). The start of the packed data contains an array of four-byte whole-number values. The number of array elements corresponds to the number of pages in the unpacked data.
For example, for a module measuring 81920 == 0x14000 bytes, the array will occupy 80 == 0x50 bytes and will consist of 20 == 0x14 elements.
The two most significant bits of each of the Little Endian values contain the table number (0b01 for code, 0b11 for data). The remaining 30 bits store the offset of the start of the compressed page relative to the end of the offset array. In the screenshot above, we see information describing 20 pages:
Offsets for the packed data for each page are arranged in ascending order; the size of the packed data for each page does not come into play directly. In the example above, the packed data for each particular page starts at a boundary that is a multiple of 64 = 0x40 bytes, with unused regions filled with zeroes. But judging by the other modules, we see that such alignment is not mandatory. This gives reason to suspect that the unpacker stops when the amount of data in the unpacked page reaches 4096 bytes.
Since we know the total size of the packed module (from the Module Attributes Extension), we can split the packed data into separate pages and work with each page individually. The start of the packed page is defined in the array of offsets; the page size is defined by the offset of the start of the next page or total size of the module. The packed data may be padded with an arbitrary number of bits (theoretically these bits could have any value, but in practice are usually zeros).
As seen in the screenshot, the last compressed page (starting at offset 0xFA40) consists of byte 0xB2 == 0b10110010, which is followed by 273 bytes with the value 0xEC == 0b11101100, and then zeroes. Since the bit sequence 11101100 (or 01110110) is repeated 273 times, we can assume that it encodes 4096/273 == 15 identical bytes (most likely with the value 0x00 or 0xFF). In this case, the bit sequence 10110010 (or 1011001) encodes 4096-273*15 == 1 byte.
This is consistent with the idea that each code sequence encodes a whole number of bytes (from 1 to 15 inclusive). But such an approach will not suffice for completely recovering the Huffman tables.
Finding the Rosetta Stone
As shown previously , identically named modules in different versions of ME 11 firmware can be compressed with different algorithms. If we take the Module Attributes Extension for identically named modules that have been compressed with LZMA and Huffman encoding, and then extract the SHA256 value for each module version, we find that there is no LZMA/Huffman module pair that has the same hash values.
But one should remember that for modules compressed with LZMA, SHA256 is usually computed from compressed data. If we calculate SHA256 for modules after LZMA decompression, a large number of pairs appears. Each of these module pairs yields several pairs of matching pages, both with Huffman encoding and in unpacked form.
Shape, Length, Value
Having a large set of pages in both compressed and uncompressed form (separately for code and for data) allows recovering all of the code sequences used in those pages. The methods needed for this task combine linear algebra and search optimization. While it would be likely feasible to create a rigorous mathematic model taking all the relevant constraints into account, because this is a one-time task it was simpler to do part of the job manually and part with automated methods.
The main thing is to at least approximate the shape (points of change of code sequence lengths). For example, 7-bit sequences have values from 1111111 to 1110111, 8-bit sequences have values from 11101101 to 10100011, and so on. Since Canonical Huffman Coding is used, if we know the shape, we then can predict the length of the next code sequence (the shortest sequence consists of only ones, the longest one consists of only zeroes—the lower the value, the longer the sequence).
Not knowing the exact size of the compressed data, we can discard all trailing sequences consisting only of null bits. These sequences are the longest, and it is unlikely that the rarest code sequence would occur in the final position.
When each compressed page can be represented as a set of code sequences, we can start determining the lengths of the values that are encoded by them. The lengths of encoded values for each page should add up to 4096 bytes. With this knowledge, we can set up a system of linear equations: the unknowns are the lengths of encoded values, the coefficients are the number of times a particular code sequence is found in the compressed page, and the constant equals 4096. Code and data pages can both be \”plugged in\” at the same time, since for identical code sequences, the lengths of encoded values should be the same.
Once we have enough pages (and equations), Gaussian elimination provides the one valid solution. And once we have the uncompressed plaintext, length of each value, and their order, we easily derive which sequence codes for which value.
After running these methods on all pages for which we had both encoded and plaintext equivalents, we could recover up to 70% of sequences from the code table and 68% from the data table. Lengths were now known for approximately 92% of sequences. Meanwhile, the shape remained a bit of a mystery: in some places, either one value of shorter length or two values of a longer length could have been present, making it impossible to determine the boundary until one of the values is encountered in the unpacked data.
With this knowledge in hand, we can proceed to recover values for code sequences for which we do not have the matching plaintext.
If we have a sequence of unknown length, another row is added to our system of equations and we can quickly determine its length. But if we don\’t have the plaintext, how can we determine the value?
Verification and brute force to the rescue
Fortunately for us, the metadata contains the SHA256 value for the unpacked module. So if we correctly guess all unknown code sequences on all the pages that make up a module, the SHA256 value should match the value from the Module Attributes Extension.
When the total length of unknown sequences is 1 or 2 bytes, simple brute-forcing is enough to \”crack the code.\” This method can also work with 3 or even 4 unknown bytes (especially if they are located close to the end of the module), but brute-forcing can take several hours or days on a single CPU core (that said, computation can easily be split among multiple cores or machines). No attempts were made to brute-force 5 or more bytes.
This method was able to recover several more code sequences (and several modules). This left only the modules in which the total number of unknown bytes exceeded the capabilities of brute force.
Since many modules are only slightly different from each other, we can apply several heuristics in order to decode more sequences by analyzing the differences.
Use of second Huffman table
Since the unpacker has two Huffman tables, the packer tries to compress data with both tables, discarding the version that takes up more space. So the division of code versus data is not clear-cut. And if a part of a page changes, the other table may be more efficient (yield a smaller result). So by looking at other versions of the same module, we can find identical fragments that were compressed by the other table, thus recovering unknown bytes.
When a single code sequence is found many times in a single module or in multiple modules (such as in code and in data), it is often easy to figure out the constraints governing the unknown values.
Constants and tables with offsets
The module data often contains constants and offsets (such as for text strings and functions). Constants may be module-unique or shared among modules (for example, hashing and encryption functions). Meanwhile, offsets are unique to each module version but must refer to the same (or very similar) fragments of data or code. As a result, the values can easily be recovered.
String constants from open-source libraries
Some fragments of ME code were obviously borrowed from open-source projects (such as wpa_supplicant), which makes it easy to reconstruct fragments of text strings based on context and available source code.
Code from open-source libraries
If we look at the source code and find the text of a function whose compiled code contains several unknown bytes, we can simulate the compiler to guess which values fit.
Similar functions in other module versions
Since versions of the same module may be only slightly different, sometimes we can find the equivalent function in another version of the module and, based on the code, figure out what the unknown bytes should mean.
Similar functions in prior ME versions
When code has not been taken from public sources, if we have a fragment that is unknown in all available versions of a module and is not found in any other module (which was the case with the amt module), we can find the same place in previous versions (such as in ME 10), pick apart the logic of the function, and then see how it works in the unknown spot in ME 11.x.
Finishing the job
Starting with the modules containing the highest percentage of known sequences, and combining the described heuristics, we gradually were able to improve our coverage of the Huffman tables (each time testing our work against the SHA256 hash). Modules that originally contained dozens of unknown sequences now had only a few remaining. The process looked to be soon complete—except for that pesky amt.
As the largest module (around 2 megabytes, or 30% of the total), amt contains many codewords that are not found in any other module but occur in all versions of amt. We were highly confident of several sequences, but the only way to be sure would be to guess all of them correctly (so as to match the SHA256 hash). Thanks to the invaluable assistance of Maxim Goryachy, we were able to bring down this barrier as well. We could now unpack any module in the firmwares available to us that had been compressed with Huffman encoding.
New firmware versions appeared over time, containing all-new codewords. But in each case, one of the above heuristics succeeded in solving the module\’s mysteries, therefore further improving our coverage of the tables.
As of mid-June 2017, we were successful in recovering approximately 89.4% of sequences for the code table and 86.4% for the data table. But the chances of successfully obtaining 100% table coverage in a reasonable period of time based on analysis of new modules were slim at best.
On June 19, user IllegalArgument published Huffman table fragments on GitHub , covering 80.8% of the code table and 78.1% of the data table. Most likely, the author (or authors) used a similar approach based on analysis of available firmware versions. Thus the published tables did not provide any new information.
On July 17, Mark Ermolov and Maxim Goryachy made a breakthrough and could now find plaintext values for any compressed data. We prepared four compressed pages (two pages for each of the Huffman tables), recovering all 1,519 sequences for both tables .
In the process, one quirk came to light. In the Huffman table for data, the value 00410E088502420D05 corresponds to both 10100111 (8 bits long) and 000100101001 (12 bits long). This is a clear case of redundancy, most likely caused by an oversight.
The final shape of the data is as follows:
Author: Dmitry Sklyarov, Head of Application Research Unit, Positive Technologies