Progress
Thus far, we have scanned three college catalogs for each college type:
- For liberal arts colleges, we scanned Carleton College, Lewis and Clark College, and Reed College
- For public colleges, we scanned UCLA, Georgia State University, and San Francisco State University
- For HBCUs, we scanned Howard University, Hampton University, and Tuskegee University
- For community colleges, we scanned Compton Community College, Tri-County Community College, and Windward Community College
For each college, we separated our scans into five sections:
- Scans containing general information on the college
- Scans containing information on the Art curriculum
- Scans containing information on the Chemistry curriculum
- Scans containing information on the English curriculum
- Scans containing information on the Political Science curriculum
After scanning all of the colleges, we performed OCR using Adobe Acrobat. We are now verifying that the text from our scans has been adequately recognized so that we can use it for sentiment analysis.
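One quick way to spot-check OCR quality is to compare the OCR output for a page against a manual transcription of the same page. The sketch below uses Python's standard `difflib` for a rough similarity ratio; the sample strings and the 90% threshold are invented placeholders, not project data.

```python
# A minimal sketch of OCR spot-checking: compare OCR output against a
# hand-made transcription of the same page. Sample text and the ~0.90
# threshold are hypothetical illustrations.
import difflib

def ocr_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Return a rough similarity ratio (0.0 to 1.0) between OCR output
    and a manual transcription, ignoring case and extra whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return difflib.SequenceMatcher(
        None, normalize(ocr_text), normalize(ground_truth)
    ).ratio()

ocr = "Carle+on College was founded in 1866,"
truth = "Carleton College was founded in 1866."
score = ocr_accuracy(ocr, truth)
print(f"similarity: {score:.2f}")  # pages below ~0.90 could be flagged for re-scanning
```

Pages that score well below the threshold would be candidates for re-scanning or manual transcription.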
Problems
Originally, we had planned to scan the box of microfiche located in the digital archives which contained information on gifts given to the college. After discussing with Nat Wilson, director of the digital archives, it became apparent that we wouldn’t be able to remediate the contents of the fiche due to privacy restrictions.
We went back to the drawing board and decided to compile microfiche containing college catalogs. Our intention is to collect catalogs from a variety of college types (liberal arts colleges, public colleges, HBCUs, and community colleges) and compare sentiment analysis results across the types.
For the emulator, we wanted to use Forge.js to create the hotspots, but it proved too confusing, so we are using HTML's image map functionality instead.
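The image map approach pairs an `<img>` element with a `<map>` whose `<area>` elements define clickable hotspot rectangles. The snippet below generates that markup with a small Python helper so it can be tested; the image name, map name, frame pages, and coordinates are made-up examples, not our actual emulator layout.

```python
# A rough sketch of the HTML image map approach: an <img> with a usemap
# attribute pointing at a <map> whose <area> elements define clickable
# hotspot rectangles. All names and coordinates here are hypothetical.
def hotspot_area(href: str, x1: int, y1: int, x2: int, y2: int) -> str:
    """Build one rectangular <area> hotspot for an image map."""
    return (f'<area shape="rect" coords="{x1},{y1},{x2},{y2}" '
            f'href="{href}" alt="{href}">')

areas = [
    hotspot_area("frame01.html", 0, 0, 120, 90),    # first catalog frame
    hotspot_area("frame02.html", 120, 0, 240, 90),  # second catalog frame
]
image_map = (
    '<img src="fiche-grid.png" usemap="#fiche">\n'
    '<map name="fiche">\n' + "\n".join(areas) + "\n</map>"
)
print(image_map)
```

Clicking a hotspot then loads the page for that frame of the fiche, which is all the emulator needs without any JavaScript framework.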
Tools and Techniques
- We have used Crazy Frog, the machine on 4th libe, to scan our microfiche. Crazy Frog is hooked up to a Windows computer in the library which visualizes the fiche using ScanPro 2000 software.
- We used Adobe Acrobat DC Pro to perform OCR on our scans.
- We are building the emulator from scratch with HTML and CSS.
Deliverables
- Thursday, 3/3:
    - Verify scans are adequate for OCR
    - Identify sentiment analysis software to use
        - A potential candidate is the Python package lingmotif
    - Make tutorial for class
        - Idea: how to use Crazy Frog to scan microfiche
    - Create function skeletons for emulator operations
        - Print (download), Rotate, Zoom In/Out
- Tuesday, 3/8:
    - Finish scanning and running OCR on all catalogs
    - Complete the Print, Rotate, and Zoom functions
    - Manually transcribe the introductions from Carleton, Howard, UCLA, and Windward
    - Test OCR using Tesseract, ABBYY, and OmniPage
- Thursday, 3/10:
    - Upload OCR analysis to the website
    - Upload transcriptions to the website
- End of term:
    - Complete final writeup of research
    - Add finishing touches to the UI
References:
- Microfiche: https://en.wikipedia.org/wiki/Microform
- Adobe: https://helpx.adobe.com/document-cloud/help/using-ocr-exportpdf.html
- Tesseract: https://programminghistorian.org/en/lessons/retired/OCR-with-Tesseract-and-ScanTailor
- ABBYY: https://www.abbyy.com/
- Omnipage: https://www.kofax.com/products/omnipage
- Forge.js: https://forgejs.org/
Team Fiche, this is a solid plan that shows adaptability and an inspiring interest in tackling the medium from multiple angles. I love that you’ll have an emulator alongside the remediated text and subsequent analysis to give readers a sense of the whole analog to digital process.
As you move to the analysis stage, try to nail down hypotheses for each step. What do you expect sentiment analysis of the different institutions’ introductions to show? Different departments? Be open to other forms of text analysis if sentiment falls short. Simple word clouds could be enlightening.