NLP Resources
Contents
- Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other
- Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
- SGML/XML
- Dictionaries
- Lexical/morphological resources
- Courses, Syllabi, and other Educational Resources
- Mailing lists
- Other stuff on the Web: General, IR, IE/Wrappers, People, Societies
Tools
Machine Translation systems
Instructions
- Building a baseline statistical phrase MT system: Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system: NAACL 2006 WMT or 2009 WMT.
Freely downloadable
- EGYPT system: System from 1999 JHU workshop. Mainly of historical interest.
- GIZA++ and mkcls: Franz Och. C++. GPL.
- Thot: Phrase-based model building kit.
- Phramer: An open-source Java statistical phrase-based MT decoder.
- Moses: A new open-source phrase-based MT decoder with functionality beyond Pharaoh.
- Syntax Augmented Machine Translation via Chart Parsing: Andreas Zollmann and Ashish Venugopal.
Free, but getting them requires hassle
- Pharaoh decoder: Philip Koehn, ISI.
- MTTK: Machine Translation Tool Kit. Deng and Byrne.
Part of Speech Taggers
Freely downloadable
- Stanford POS tagger: Loglinear tagger in Java (by Kristina Toutanova).
- hunpos: An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Pre-compiled models available. Runs on Linux, Mac OS X, and Windows.
- MBT: Memory-based Tagger. Based on TiMBL.
- TreeTagger: A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
- SVMTool: POS tagger based on SVMs (uses SVMlight). LGPL.
- ACOPOST (formerly ICOPOST): Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
- MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger. Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
- fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
- mu-TBL: An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things, by Torbjörn Lager. Web demo also available. Prolog.
- YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
- QTAG Part of speech tagger: An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files.
- The TOSCA/LOB tagger: Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from a historical perspective, and for software sharing in academia more generally. LOB tag set.
- The venerable Brill's Transformation-based learning Tagger: A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.
- Original Xerox Tagger: A Common Lisp HMM tagger available by ftp.
- Lingua-EN-Tagger: Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
Free, but require registration
- TATOO: The ISSCO tagger. HMM tagger. Need to register to download.
- PoSTech Korean morphological analyzer and tagger: Online registration.
- TnT - A Statistical Part-of-Speech Tagger: Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.
Usable by email or on the web, but not distributed freely
- Memory-based tagger: From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
- Birmingham tagger: Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
- CLAWS tagger: The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words, though; this site is more of an advertisement for licensing the real thing, which is available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
- The AMALGAM tagger: The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
- Xerox XRCE MLTT Part Of Speech Taggers: Tags any of 14 languages (European and Arabic), online on the web.
- Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.
Not free
- Lingsoft: Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
- Conexor: Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
- Xerox: Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
- Infogistics: Infogistics, an Edinburgh spinoff, has a tagging and NP/Verb group chunker available commercially, including an evaluation version.
No longer available
- LT POS and LT TTT: The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.
NP chunking
Downloadable
- YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
- Mark Greenwood's Noun Phrase Chunker: A Java reimplementation of Ramshaw and Marcus (1995).
- fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
Generic sequence models
Downloadable
- CRF++: Generic CRF-based model in C++. Open source. By the author of YamCha.
- Carafe: Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.
- FreeLing: A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.
Parsers
Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.
Semantic Parsers
Downloadable
- ASSERT: PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
- Shalmaneser: FrameNet-based, by Katrin Erk.
- Tree Kernels in SVMlight by Alessandro Moschitti: A general package, but it has particularly been used for SRL.
Named Entity Recognition
Downloadable
- Stanford Named Entity Recognizer: A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
- LingPipe: Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
- YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
Coreference (Anaphora) Resolution
Downloadable
- BART: A Beautiful Anaphora Resolution Toolkit. Java. By Yannick Versley and many others. Apache with GPL components.
- Guitar: Java. GPL.
Language modeling toolkits
Downloadable
- IRSTLM Toolkit: Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi et al.
- CMU-Cambridge Statistical Language Modeling toolkit
Downloadable, but requires registration
- The SRI Language Modeling toolkit by Andreas Stolcke is another good system for building language models, freely available for research purposes.
Not yet classified
- Lextools: A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/
Friendly concordancing and text analysis tools
- Wordsmith Tools (Mike Scott): The thing to get if you are working in the Windows world.
Text summarization tools
- A prototype Java Summarisation applet (System Quirk)
- MEAD: A public domain portable multi-document summarization system. (Dragomir Radev and others.)
Other
Downloadable
- Tilburg University's TiMBL: Tilburg's Memory Based Learner by Walter Daelemans et al. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications.
- Time Expression taggers: TIMEX2 standard taggers (site at Mitre).
- NLTK: An open source Python package for NLP application development with tools such as tokenization, POS tagging and parsers, by Ed Loper and Steven Bird.
- Ted Pedersen's code: Ngram Statistics Package (Perl code that implements Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information); Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats; and semantic distances derived via WordNet.
- ISIP tools: The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
- Mem: A Perl implementation of Generalized and Improved Iterative Scaling by Hugo WL ter Doest.
- Automorphology: A system (for Windows) for automatically learning the morphological forms of words in a corpus, by John Goldsmith.
- Wordnet: Wordnet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages and an Italian/English/Spanish MultiWordNet, and there's now a site for Global Wordnet. (See also Mappings between WordNet versions and Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)
- Penn XTAG project: A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
- Dan Melamed's Assorted Tools: A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
- MULTEXT: Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis veronis@univ-aix.fr. Some stuff including a multilingual text editor is downloadable. MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.
- Naive Bayes algorithm: Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
- HDDI: Text Data Mining API from Lehigh University.
- Emdros: a text database engine for linguistic analysis and research
- Chasen: Japanese morphological analyzer. Descendant of JUMAN.
Free, but require registration
- Stuttgart's IMS Corpus Workbench (CWB): A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
- Gate: University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
- MITRE's Alembic Workbench: A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
- SNoW: SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).
Unsure
- INTEX: A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein silberz@ladl.jussieu.fr
The Penn Tools page collects information on a variety of NLP systems, many of which are available externally.
Corpora
Large collections aimed at the NLP community
- LDC (Linguistic Data Consortium) and its catalogue by year: Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
- European Language Resources Association and its catalogue: Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
- ICAME (International Computer Archive of Modern English): Sells various corpora (including Brown and London-Lund). Information on corpora is available on the web, by sending the message "help" to fileserv@nora.hd.uib.no, or by ftp to nora.hd.uib.no. Also, manuals for these corpora.
- Reuters @NIST: Reuters corpora are now distributed by NIST.
- TRACTOR: TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
- CLR (Consortium for Lexical Research): Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.
- OTA (Oxford Text Archive): Provides mainly literary texts. Has a bright new website. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
- Leipzig Corpora Collection: Sentence collections in MySQL database for 17 mainly European languages.
- BNC (British National Corpus): A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.
- European Corpus Initiative Multilingual Corpus I (ECI/MCI): A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. Need to sign a license agreement available at either the WWW site. Also available from the LDC.
- Survey of English Usage: At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
- International Corpus of English (ICE): Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
- Corpora held by Lancaster University: This link provides its own annotations.
- The European Language Activity Network: Promises a uniform query language for accessing corpora in all EU languages, but isn't quite there yet.
- Talkbank: Rich video and transcripts.
Particular languages
English
English language corpora available from the sites above are not repeated here.
- Corpora by Geoffrey Sampson's team: The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
- Michigan Corpus of Academic Spoken English (MICASE): 1.7 million words from 1997-2001.
- Penn-Helsinki Parsed Corpus of Middle English: A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
- Corpus of Professional, Spoken American-English (CPSA): 2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).
- Lancaster Parsed Corpus
- Dialogue Diversity Corpus (Bill Mann)
- American National Corpus
Chinese
English language corpora available from the sites above are not repeated here.
- The Lancaster Corpus of Mandarin Chinese (LCMC): By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.
Multilingual
- JRC-Acquis: A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.
- EMILLE/CIIL: Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.
- OPUS: An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
- World Health Organization Computer Assisted Translation page: Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
- Searchable Canadian Hansard French-English parallel texts (1986-1993): From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal.
- European Union web server: Parallel text in all EU languages. (In particular try European legislation.)
- TELRI CD-ROMs: Parallel and other text in Central and Eastern European languages.
Bosnian
- The Oslo Corpus of Bosnian Texts.
Czech
- Parallel Czech-English: Literature translations in Czech and English.
- Czech National Corpus project: SYN2000. 100 million words of contemporary Czech.
French
- Association des Bibliophiles Universels: Various French literary works.
- American and French Research on the Treasury of the French Language (ARTFL): 150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).
German
- COSMAS Corpus: Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
- NEGRA Corpus: Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Available free of charge to academics.
Russian
- Russian National Corpus: 150 million words, 5 million words POS-tagged, some in dependency treebank.
- Library of Russian Internet Libraries: Various literary works.
Slovene
- Slovene-English parallel corpus: 1 M words, free to download + on-line concordances.
- Coming soon: Slovene reference corpus of 100 M words.
Spanish and Portuguese
- Tycho Brahe Parsed Corpus of Historical Portuguese: Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
- Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese. It's not clear what their availability is.
- The CUMBRE corpus: Contact Professor Aquilino Sánchez.
- The CRATER Spanish corpus (morphosyntactically tagged telecommunication manuals) is available by ftp.
- Corpus resources for Portuguese: In total about 70 million words, available free, from various sources (newswire, etc.).
- Folha de S. Paulo newspaper: 4 annual CD-ROMs with full text.
- COMPARA: Portuguese-English parallel corpus. (In general, various resources at Linguateca site.)
- See also under ELRA, above.
Swedish
- Spraakdata, Department of Swedish, Göteborgs University: Has various searchable part of speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.
Treebanks
Name | Language | Size | Availability | Comments
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG)
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material.
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures.
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch.
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus.
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech, syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines.
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts.
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependencies analyses are available (some are free on the web, others via a license agreement) | An under construction Bulgarian HPSG treebank.
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax.
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus.
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format.
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s.
- Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
- Syntactic Spanish Database (SDB): University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
- CKIP Chinese Treebank (Taiwan): Based on Academia Sinica corpus. (There's also a 100 sentence Chinese treebank at U. Maryland.)
- LDC Korean Treebank.
- Dublin-Essex Treebank project: Deriving Linguistic Resources from Treebanks.
Discourse
CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.
Resources for Word Sense Disambiguation
- The Senseval web site: Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
- Ted Pedersen's code: Includes various WSD systems.
- SenseClusters: Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
- Evocation WordNet synset similarity judgments: Judgments on how similar the meanings of synsets are and how common they are in the BNC, from Jordan Boyd-Graber.
Literature
There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:
Entirely or mainly English
- Alex: A Catalogue of Electronic Texts on the Internet. Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
- Wiretap Electronic Text Archive: Extensive and good quality. Still in the gopher age, though.
- The On-line Books Page: The index here only covers books in English, but there are lots of links to other collections of material in all languages.
- Project Gutenberg: The oldest and largest project to get out-of-copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)
- The Electronic Text Center of the University of Virginia: Large collection of SGML text, mainly in English, but also in other major languages.
- Center for Electronic Texts in the Humanities: Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
- Oxford Electronic Text Library Editions: Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
- Coreference annotated texts: From University of Wolverhampton (R. Mitkov, C. Barbu et al.).
Acquisition data
- CHILDES database: Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.
SGML/XML
- Robin Cover's SGML/XML Web Page: This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones usi