Name: SORT IT: Build a PDF Processor
Start: 2020-04-28T15:00:00Z
End: 2020-04-28T15:00:41.000Z
Location: BrightTALK
Rating: 4.75

Presented by

Adam Jelley, Data Scientist

About this talk

As the world becomes ever more digital, many businesses need to process documents automatically. In this webinar, we’ll walk through an example end-to-end project for extracting, classifying and summarising PDF documents, and show how you can combine cutting-edge open-source technologies with your own in-house expertise and requirements to build your own PDF processor with Dataiku DSS.

PDF2Image (https://pypi.org/project/pdf2image/)
Tesseract OCR (https://tesseract-ocr.github.io/tessdoc/Home.html)
Pytesseract (https://pypi.org/project/pytesseract/#description)
The Plugin Store (https://www.dataiku.com/product/plugins/)
The Text Summarisation Plugin (https://www.dataiku.com/product/plugins/text-summarization/)
scikit-learn 20 Newsgroups Dataset (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#)
"Surprising Findings in Document Classification" (https://towardsdatascience.com/surprising-findings-in-document-classification-7a79e30f1666)
Webinar (tomorrow): How to Reduce Data Labelling Costs (+ Increase Data Quality) With Active Learning (https://www.brighttalk.com/webcast/17108/394533?utm_campaign=channel-feed&utm_source=brighttalk-portal&utm_medium=web)
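
For readers who want a feel for the extraction step before watching, the following is a minimal sketch of how PDF2Image and Pytesseract are commonly combined: each page is rendered to an image and then passed through OCR. It is not taken from the talk. The function name `extract_text` and its `convert` and `ocr` parameters are hypothetical, introduced here so the two libraries can be swapped out. The defaults assume that pdf2image (with Poppler) and pytesseract (with the Tesseract binary) are installed.

```python
from typing import Callable, Iterable, Optional


def extract_text(
    pdf_path: str,
    convert: Optional[Callable[[str], Iterable[object]]] = None,
    ocr: Optional[Callable[[object], str]] = None,
) -> str:
    """Render each PDF page to an image, OCR it, and join the page texts.

    `convert` and `ocr` are hypothetical hooks (not part of either library);
    when omitted, pdf2image and pytesseract are used.
    """
    if convert is None:
        # Assumption: the pdf2image package and a Poppler install are available.
        from pdf2image import convert_from_path
        convert = convert_from_path
    if ocr is None:
        # Assumption: the pytesseract package and the Tesseract binary are available.
        import pytesseract
        ocr = pytesseract.image_to_string

    pages = convert(pdf_path)
    # One OCR call per page; pages are separated by a blank line.
    return "\n\n".join(ocr(page).strip() for page in pages)
```

Calling `extract_text("report.pdf")` (a placeholder file name) would return the recognised text as a single string, which could then be fed to a classifier or summariser. Keeping the converter and OCR function as parameters is only one possible design; it makes the step easy to try out without either dependency installed.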