First off, Drupal is made for lots of content -- 16,000 articles is no problem. Taxonomy, views, search -- so much of what you need for handling big content is baked into the Drupal platform. Add Apache Solr for faceted search, and you have a scalable, flexible web publishing platform. 

From a developer perspective, Drupal is a pleasure to work with. Getting content out of PDF, though -- that's not so fun.

The problem is not so much reading the text -- plenty of free and open source tools can extract text from a PDF. (Though drop caps and pull quotes do complicate things.) The trick is extracting it in the right order, so the content still makes sense. Following columns of text to get articles in proper reading order is tough; telling columnar text apart from table text can be tougher still.
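To see why column order trips up naive extraction, here's a toy sketch (the file name and the column boundary are made up for illustration): a two-column page read line by line interleaves the two stories, while slicing by character position first keeps each column intact.

```shell
# Simulate a two-column page as plain text (a stand-in for naive PDF output).
printf '%s\n' \
  'Article A line 1    Article B line 1' \
  'Article A line 2    Article B line 2' > page.txt

# Naive left-to-right reading mixes the two articles together:
cat page.txt

# Layout-aware extraction slices by x-position first. Here the column
# boundary falls at character 21 -- a per-layout guess you'd tune per PDF.
cut -c1-20 page.txt | sed 's/ *$//'   # left column: Article A, in order
cut -c21- page.txt                    # right column: Article B, in order
```

Real tools make the same kind of per-layout guess, which is why you need to test against samples of your actual PDFs.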

Since many of these issues are layout-specific, you really need to look at samples of the PDFs you will be working with. Beyond that, here's what I'd suggest. 

The slides below talk about how this all worked for OOSKAnews, a publisher of specialized water industry news. For the OOSKAnews project, we used both Xpdf and ABBYY FineReader, depending on the article layout. Once we had the text, a set of custom sed scripts converted it to individual articles for import into Drupal.
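The actual scripts were specific to each publication's layout, but here's a toy sketch of the splitting step, assuming each article in the extracted text starts with a marker line (the "ARTICLE:" marker, the sample headlines, and the file names are all made up for illustration; csplit here is the GNU coreutils version):

```shell
# Stand-in for the text extracted from one PDF issue, with a marker
# line at the start of each article.
printf '%s\n' \
  'ARTICLE: Water prices rise' \
  'Body text one.' \
  'ARTICLE: New pipeline approved' \
  'Body text two.' > issue.txt

# Split into one file per article: break before each marker line,
# repeat for the whole file (-s quiet, -z drop the empty leading piece).
csplit -s -z -f article- issue.txt '/^ARTICLE:/' '{*}'

ls article-*
```

From there, each article file is small enough to clean up with sed and feed to a Drupal import.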

It wasn't easy, but in the end it worked great. 

Need help doing this kind of thing? Have a better way to do it? Then by all means let me know in the comments -- or drop us a line.