Towards Multi-Modal Multi-Document Understanding Capabilities in Foundation Models