What is a Corpus?

Simply put, a corpus is a collection. In the context of Natural Language Processing (NLP), the term refers to a collection of texts compiled for a specific purpose.

In NLP, corpora are used for text mining, deep learning, machine learning and artificial intelligence processes. Depending on your needs, these processes may require a monolingual corpus or multilingual corpora.

At Starlang, our team of linguists turns raw text data into sifted, sorted and sometimes annotated corpora. To do so, we process large amounts of plain text such as customer feedback and comments, business files and documents, web page content, and trending keywords.

How do we build a corpus?

We start with raw data, our linguists process it, and the result is a domain-specific corpus.
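
The flow from raw data to a domain-specific corpus can be sketched in a few lines of Python. This is only a minimal illustration, not Starlang's actual pipeline: the function names, the sample sentences and the list of domain terms are all assumptions made for the example, and the substring filter stands in for work that linguists do by hand.

```python
import re


def clean(text: str) -> str:
    """Collapse runs of whitespace and strip the ends of a raw text."""
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(raw_texts, domain_terms):
    """Sift and sort raw texts into a small domain-specific corpus.

    Hypothetical helper: real corpus building involves linguistic review,
    not just string matching.
    """
    cleaned = {clean(t) for t in raw_texts}  # clean and drop exact duplicates
    cleaned.discard("")                      # drop empty entries
    # Keep texts that mention at least one domain term (naive substring match).
    # Note: str.lower() is not Turkish-aware (it maps "I" to "i", not "ı").
    kept = [t for t in cleaned if any(term in t.lower() for term in domain_terms)]
    return sorted(kept)


# Illustrative raw data: a duplicated banking comment, an off-topic one, an empty one.
raw = [
    "  Kredi faiz oranı   çok yüksek. ",
    "Kredi faiz oranı çok yüksek.",
    "Bugün hava güzel.",
    "",
]
corpus = build_corpus(raw, domain_terms={"kredi", "faiz"})
print(corpus)
```

Even this toy version shows why string handling alone falls short: matching on surface forms misses inflected words and cannot tell which meaning of a word is intended.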

Why do we need linguists to build a corpus?

Natural Language Processing (NLP) is an involved and multifaceted task that requires input from experts in both computational science and linguistics. Some steps of NLP, such as machine learning, rely more on coders and software engineers, while others, such as building a corpus, rely more on linguists. The reason for this division of labour is that human languages pose various challenges for NLP processes.

We work with linguists while building corpora in order to protect the semantic integrity of your data.

Various repositories offer dictionaries that can be integrated into NLP projects for purposes such as corpus building. Yet these dictionaries fail to capture the intended meaning of words. As a result, given the morphologically rich typology and intricate semantics of Turkish, they cannot process your text data accurately and consistently. That is why a team of seasoned linguists processes your raw data, and why we base our NLP operations on the unique characteristics of Turkish.

We offer accurate and consistent data processing in order to protect the semantic integrity of texts and accelerate your NLP processes.

We deliver a processed (and, if requested, annotated) corpus, so that you don’t spend time on pre-processing, data sorting and similar operations. As a result, you can run text mining, machine learning and artificial intelligence processes faster and achieve better results.

We add domain-specific words and terminology to your corpus.

Turkish has more than 50,000 base forms, but not all of them are present in a given data set. Moreover, a significant portion of these forms have more than one meaning. That is why employing a dictionary or corpus that includes all of these forms and their meanings leads to ambiguous and often incoherent results.

To ensure that your NLP processes provide accurate analyses and meaningful results, our team of linguists includes the relevant terminology, domain-specific words and their context-specific meanings in your corpus.
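
To make the ambiguity problem concrete, here is a small sketch in Python. The two lexicons are hypothetical and far from exhaustive: they simply contrast a general-purpose dictionary that lists several senses of a base form with a finance-domain lexicon that keeps only the sense relevant to that domain. The names `GENERAL_SENSES`, `FINANCE_SENSES`, `candidate_senses` and `is_ambiguous` are assumptions made for this example.

```python
# Hypothetical general-purpose dictionary: several senses per base form.
GENERAL_SENSES = {
    "yüz": ["face", "hundred", "to swim"],
    "kur": ["exchange rate", "to set up"],
}

# Hypothetical finance-domain lexicon: only the sense that matters in the domain.
FINANCE_SENSES = {
    "yüz": ["hundred"],
    "kur": ["exchange rate"],
}


def candidate_senses(base_form, lexicon):
    """Return every sense the lexicon lists for a base form (empty if unknown)."""
    return lexicon.get(base_form, [])


def is_ambiguous(base_form, lexicon):
    """A base form is ambiguous when the lexicon offers more than one sense."""
    return len(candidate_senses(base_form, lexicon)) > 1


for word in ("yüz", "kur"):
    print(word, "general:", candidate_senses(word, GENERAL_SENSES))
    print(word, "finance:", candidate_senses(word, FINANCE_SENSES))
```

With the general dictionary, both words remain ambiguous and a downstream model has to guess. With the domain lexicon, each word maps to a single sense, which is the effect a curated, domain-specific corpus aims for.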

We make sure that your corpus is always available.

We deliver a domain-specific corpus (and/or dictionary) built in accordance with the unique needs of your organization. You can therefore incorporate your corpus into your text mining, deep learning, machine learning and artificial intelligence projects at your own pace, whenever you wish.

Do you need a domain-specific corpus?