Application-aware on-line failure recovery for extreme-scale HPC environments