Wednesday, March 14, 2007

Pattern matching HTML

One sometimes stumbles across web security systems that have an HTML pattern matching engine at their core. At that point the snake oil alarm should go off. Filtering or pattern matching HTML for any security purpose is incredibly hard, if not impossible. Take for example the following HTML snippet:

<BDO DIR="rtl">rat</BDO>'.'<BDO DIR="rtl">parc<HRM></BDO>

Can anybody tell me, just by looking at it, what a browser would display? And if you can, then tell me what you would see in a browser if I change a single character of the above:

<BDO DIR="rtl">rat</BDO>'o'<BDO DIR="rtl">parc<HRM></BDO>
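To make the gap between source and display concrete, here is a minimal Python sketch run against the first snippet. It assumes a hypothetical source-level filter (the `BLOCKLIST` regex) and a very rough stand-in for a browser (the `VisualText` class, which only reverses plain text inside `<bdo dir="rtl">` and ignores unknown tags). Both names are made up for illustration, and real bidirectional rendering is considerably more involved than a string reversal.

```python
import re
from html.parser import HTMLParser

SNIPPET = '<BDO DIR="rtl">rat</BDO>\'.\'<BDO DIR="rtl">parc<HRM></BDO>'

# Hypothetical blocklist pattern, standing in for a filter that inspects the raw source.
BLOCKLIST = re.compile(r"crap", re.IGNORECASE)


class VisualText(HTMLParser):
    """Rough approximation of displayed text: reverses text inside <bdo dir="rtl">."""

    def __init__(self):
        super().__init__()
        self.parts = []         # finished pieces of text, in display order
        self.rtl_buffer = None  # collects text while inside <bdo dir="rtl">

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "bdo" and (dict(attrs).get("dir") or "").lower() == "rtl":
            self.rtl_buffer = []
        # Unknown tags such as <hrm> are silently ignored, much as a browser would.

    def handle_endtag(self, tag):
        if tag == "bdo" and self.rtl_buffer is not None:
            self.parts.append("".join(self.rtl_buffer)[::-1])
            self.rtl_buffer = None

    def handle_data(self, data):
        if self.rtl_buffer is not None:
            self.rtl_buffer.append(data)
        else:
            self.parts.append(data)


def visible_text(html):
    """Return the approximate on-screen text for an HTML fragment."""
    parser = VisualText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print("filter matches source:   ", bool(BLOCKLIST.search(SNIPPET)))
    print("approximate display:     ", visible_text(SNIPPET))
    print("filter matches display:  ", bool(BLOCKLIST.search(visible_text(SNIPPET))))
```

Under these assumptions the regex finds nothing in the source, yet the text it was meant to catch shows up once the direction override is applied. A filter would have to reproduce this browser behavior, along with every other one, before its verdict meant anything.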

Part of the problem is that most browsers are purposely lax when parsing HTML: a lot of pages out there were broken, i.e. did not have valid (compliant) HTML, and the burden of displaying them anyway was put on the browsers. For example, leaving out the HTML or BODY tag is not a problem, as the previous examples show.

The other reason is that HTML is trying to be too many things at once: it got too popular too fast, and many special interest groups decided to piggyback their needs on top of it. The result is a (markup) language that is like an art piece: open to interpretation.

All that, and we haven't even begun to talk about DHTML. So if you are considering a regex engine for your next web security product, please, pretty please, don't.
