The abundance of user-generated multi-modal content, including videos and images, makes it a convenient reference and source of information. However, consuming this large corpus of data can take several hours. In particular, for authors and content creators, abstracting information from videos and representing it in a textual format is a tedious task. The challenges multiply with the diversity and variety introduced when several videos are associated with a given query or topic of interest. We present Videos2Doc, a machine-learning-based framework for automated document generation from a collection of procedural videos. Videos2Doc enables author-guided document generation for users seeking authoring assistance, and an easy consumption experience for those who prefer text or documents over videos. Our proposed interface allows users to specify several visual and semantic preferences for the output document, enabling the generation of custom documents and webpage templates from a given set of inputs. Empirical and qualitative evaluations establish the utility of Videos2Doc as well as its superiority over current benchmarks. We believe Videos2Doc will ease the task of making multimedia accessible by automating its conversion to alternate presentation modes.