Latent Dirichlet Allocation

We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.


Introduction
The application of algorithmic and computational techniques and methods to literature and humanities studies has lately resulted in the emergence of a novel research field termed digital humanities. It should come as no surprise that natural language processing and text mining techniques have come to play a central part in this emerging field, and it is exactly in this context that the present work should be placed.
In this article, we demonstrate the capabilities of unsupervised text mining techniques in revealing useful and meaningful information about the structure of prose literature works. Our exposition aims at simplicity and clarity of the general methods used, so as to be of introductory merit to an uninitiated reader. We have chosen Thomas Pynchon's novel V. as our example, which should be familiar to Orbit readers, as it is well known that the novel exhibits a highly heterogeneous structure, with two minimally intersecting storylines running in parallel. Our purpose is to provide adequate evidence that the results of the computational analysis indeed converge to the already known answer. In other words, here we aim to legitimize the use of such techniques in the eyes of the uninitiated, and possibly skeptical (or even suspicious) scholar, by verifying that they confirm the existing critical readings of the novel. Nevertheless, on the way, we have come upon a slight revision of the accepted division of the novel between the two storylines, as we explain in detail in Section 2.
Trying to clarify further, and to avoid possible misunderstandings regarding the scope of the present study: this article is written from the point of view of a data scientist, and our strategic objective is a) to convince Pynchon scholars that there is indeed merit in using such techniques to aid the critical analysis, and b) possibly to help initiate their application in critical problems and questions yet unanswered, or even not yet posed. We will pause here, to come back to this discussion in the final section of the article.
A work like the present one can easily grow to an inconvenient (and possibly threatening) length and complexity, if one attempts to take notions like "rigour" and "completeness" at face value, and thus tries to introduce in detail all the technical concepts involved. As our stated objective is to provide a convincing demonstration for the uninitiated reader, we deliberately choose not to go down this path: hence, we mainly introduce the relevant concepts in an intuitive manner, just enough to facilitate the reader's smooth engagement with the main findings. In every case, appropriate references are cited, which the interested reader can consult for delving further into the computational techniques employed.
The general structure of the rest of this article is as follows: in Section 2 we provide a brief overview of the novel, and we frame more precisely our research question; the basic framework of our computational approach is introduced in Section 3; in Sections 4-7 we present our computational experiments and findings, introducing also the relevant concepts in the above stated manner; Section 8 concludes with a comprehensive discussion regarding the interpretation of our findings, the limitations of our approach, and some suggestions for possible future work.

Overview of the novel
As already mentioned, Thomas Pynchon's novel V. consists of two minimally intersecting storylines running in parallel, a fact that is rather universally recognized in the relevant critical literature:

Unsupervised text mining methods for literature analysis: a case study for Thomas Pynchon's V.

Dualism structures Pynchon's first novel, V., a multifaceted work stretched between two picaresque plots. The first plot involves Benny Profane […] Profane wanders the "streets" of the present - a period of several months in 1955 and 1956. His motion […] frames the other episodes in the novel. Profane's travels intersect those of Herbert Stencil […] In contrast to Profane's, Stencil's movements have purpose: He is searching for manifestations of a mysterious female called V., who he believes has appeared at various social and political junctures since the turn of the century.

[…] intercalates sections within a linear narrative set in 1956 in order to broaden the scope of that narrative. The two main sequences which alternate with each other and thus establish one of the novel's rhythms, are the latter which centres on a character called Benny Profane and takes place mainly in New York, and a series of historical chapters which spread from 1898 to 1943. The historical sections are linked by the search of one Herbert Stencil for a mysterious figure called V.

"[T]here are two main threads or plots to [the novel's] structure, threads that begin far apart from each other but ultimately intersect and interweave, forming a "V" in the plot itself. One storyline of the book details the life and adventures of Benny Profane and is set in the mid 1950s; the other line of the book describes Herbert Stencil's quest for V. "herself," and includes most of the key, calamitous events of the twentieth century."

Somewhat to our surprise, despite this universal agreement regarding the existence of two different storylines in the novel, it seems that there has never been an attempt to map each chapter exclusively to one and only one storyline. Indeed, to the best of our knowledge, the closest one has come to such a distinction is a relevant table in David Seed's The Fictional Labyrinths of Thomas Pynchon, which we reproduce in Table 1 below.

[Table 1: chapters of the novel, after David Seed's The Fictional Labyrinths of Thomas Pynchon.]

As can be seen from Table 1, most of the chapters are indeed "pure", in the sense that they belong to one and only one storyline; nevertheless, there are four chapters (2, 4, 5, and 7) that seem to contain elements from both storylines. As we intend to use the individual chapters of the novel as our "base units", we need a way to obtain a one-to-one mapping of chapters to storylines. So, we put forward the following "operational" definition for mapping individual chapters to each of the two storylines:

If a chapter takes place in the novel's present and involves Benny Profane, it belongs to the Profane storyline, irrespective of what other nested stories it may contain; otherwise, it belongs to the V. storyline.
We will strongly argue that the above definition, irrespective of its "operational" potential for the purposes of our study, is indeed a natural and intuitive one, and the most probable answer once the relevant question of chapters-to-storylines mapping has been posed. Also, as we shall see, our computational results justify this choice ex post facto.

The above definition leaves us with 11 chapters in the Profane storyline, and 6 chapters (including the epilogue) in the V. storyline.
This chapter division, along with some other relevant information, is summarized in Table 2. With the novel structure as depicted in Table 2, our research question can now be stated as follows: given the heterogeneous nature of the narrative as imposed by the two different storylines, can we construct relatively simple unsupervised text mining techniques that can reveal structural heterogeneities at the chapter level? In other words, can we come up with relatively simple algorithms that can distinguish between the two storylines, so as to group the corresponding chapters separately in a meaningful way? And if yes, how stable and consistent can such groupings be with varying methods, algorithms, parameterizations, and preprocessing tasks applied?
In answering the above questions, and in what follows, we must keep in mind two things: first, it is naturally and intuitively expected that the 11 chapters of the Profane storyline bear a greater degree of similarity among them regarding word usage than with the V. storyline chapters; at the same time, the 6 chapters of the V. storyline are not expected to bear an analogous degree of similarity among them, as their narrative intra-diversity is much wider than that of the Profane chapters.
These observations will serve as a means of model validation, in order to justify or not the results produced.

Basic framework and concepts of the computational approach
With the exception of Section 6, all results reported here are based on the bag-of-words assumption, i.e. we simply count individual words and compute word frequencies, without taking into account combinations of words or any other higher-order semantic structure of the text.
The limitations of such an approach are apparent, but, as already stated, our purpose here is to keep the techniques used as simple as possible, in order to demonstrate their power and applicability in a most elementary setting; several ways by which this assumption can be relaxed and extended are discussed in Section 8. In accordance with the quantitative text analysis framework and the relevant terminology, we consider the novel as a document collection, where the individual documents are indeed the book chapters.
Based on the above-mentioned bag-of-words assumption, the simplest approach to quantifying the content of a document is to compute the frequencies of the individual words (terms) contained in it, and then represent the document as the weighted set of these terms, with the weights being the computed term frequencies. This is called (simple) term-frequency (TF) weighting, and it is indeed a valid document representation approach. Nevertheless, we can do somewhat better, the rationale being as follows: we would like to give more weight to terms that appear very frequently in only one (or a subset) of our documents, as these terms are possibly the exact ones that mostly signify the differences in the content of our documents. This leads to the term-frequency/inverse-document-frequency (TF-IDF) weighting, which can be shown to possess the following qualitative properties:

1. It is highest for a term that occurs many times within a small number of documents in the collection (thus lending high discriminative power to those documents).

2. It is lower for a term that occurs fewer times in a document, or occurs in many documents of the collection.

3. It is lowest for a term that appears in virtually all documents of the collection.
We employ both TF and TF-IDF weighting schemes in our experiments, indicating our choice each time.
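As a rough illustration of the two weighting schemes, the following minimal Python sketch computes TF and TF-IDF weights for a toy collection. It is not the code used for the experiments reported here: the function names, the toy "chapters", and the specific variant idf(t) = log(N / df(t)) are assumptions made for this example only.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each term within a single document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def tf_idf(documents):
    """TF-IDF weights for a list of tokenised documents.

    Assumes the common variant idf(t) = log(N / df(t)), where N is the
    number of documents and df(t) the number of documents containing t.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        tf = term_frequencies(tokens)
        weights.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        )
    return weights


# Toy "chapters" (hypothetical, not taken from the novel).
chapters = [
    "profane street street night".split(),
    "stencil malta night".split(),
    "profane street night girl".split(),
]
weights = tf_idf(chapters)
```

In this toy collection the term "night" appears in every document and therefore receives a zero weight, while a term confined to a single document, such as "malta", receives a comparatively high one, in line with the three qualitative properties listed above.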
The end result of the above "text quantification" process is the representation of a document as a mere list of numbers, where the list length is equal to the number of different terms contained in the document, and the list entries are the (TF or TF-IDF) weights of each individual term.

Stacking these lists together for all documents in a given collection, we get the term-document matrix, which exhibits how exactly the various terms in a collection are distributed among its constituting documents.
Having effectively transformed the text of the novel into a term-document matrix (with the documents being the individual chapters), i.e. a matrix of numbers, as described above, we can now process it, using several quantitative and computational techniques appropriate for our purpose.
Of fundamental importance in what follows (indeed, in almost every approach to the quantitative analysis of text) are the notions of similarity and distance. Informally speaking, the similarity between two data objects (i.e. two documents, in our case) is a numerical measure of the degree to which the two objects are alike. In analogy, the dissimilarity is a measure of the degree to which two data objects are different. Often, the term distance is used as a synonym for dissimilarity.
Applied to our case, it should be obvious from the above definitions that the lower the distance between two documents (expressed as entries in a term-document matrix), the more similar these two documents are, while a higher distance denotes a greater dissimilarity between two documents.
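To make the notions of distance and similarity concrete, here is a minimal Python sketch of three standard measures (the Euclidean and Manhattan distances and the cosine similarity) applied to weight vectors. The vectors are hypothetical and the function names are chosen for this example; this is an illustration only, not the implementation behind the reported results.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical term weights of three documents over the same four terms.
chapter_a = [3.0, 0.0, 1.0, 2.0]
chapter_b = [2.0, 0.0, 1.0, 3.0]
chapter_c = [0.0, 4.0, 0.0, 1.0]
```

Here `chapter_a` and `chapter_b` use the same terms with similar weights, so their distance is small and their cosine similarity is high, whereas `chapter_c` lies far from both under either distance.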

There are several different measures which can be used in order to quantify the distance between data objects, and the choice is usually dictated by the specific problem at hand. The Euclidean distance is a generalisation of our usual notion of distance between two points in our everyday, 3-dimensional space. The cosine similarity is frequently used for text and document analysis; however, as it has been shown to exhibit a very high, almost perfect negative correlation with the Euclidean distance (i.e. the higher the cosine similarity between two objects, the lower their Euclidean distance), and since we utilize the Euclidean distance in what follows, we do not employ the cosine similarity here. Going into the details of the different distance functions employed is clearly beyond the scope of the present article; the relevant list is shown in Table 3 (Section 4) below, and more technical details can be found in the provided references.

In what follows, unless otherwise mentioned, the typical text preprocessing tasks of stop word and punctuation removal have been applied. As this is a literary text, we did not perform word stemming. We also found that conversion to lowercase (or the lack of it) occasionally affects the results, so we keep it as a parameter for experimentation. As a kind of kick-off, and before proceeding to our main results, in Fig. 1 we present a wordcloud visualization of the 300 most frequent terms in the whole novel. It can be seen that, excluding the character names "Profane", "Stencil", and "Pig", the most frequently occurring terms are "time", "night", "street", "girl", and "eyes".

Figure 1: The 300 most frequent terms in the whole novel.

Hierarchical clustering
Intuitively speaking, clustering refers to the grouping of similar objects together, whereas the groups (clusters) thus produced are thought of as being meaningful, useful, or both.
Cluster analysis has a rather long history in various fields of physical and social sciences, including the quantitative analysis of documents and texts. There are several different types and methods for clustering data; here we will restrict the discussion to what is termed as agglomerative hierarchical clustering, which is the type of clustering most often used for this kind of text analysis.

A hierarchical clustering is a set of nested clusters that are organized as a tree, and frequently visualized as a tree-like diagram called a dendrogram. Usually, the leaves of the tree are singleton clusters of individual data objects, which, the reader is reminded, in our case are the individual book chapters. Agglomerative means that the procedure starts with the single data objects as the (trivial) individual clusters and, at each step, merges the closest pair of clusters, according to the particular distance function (see Section 3) used. It should be intuitively obvious, even from this short discussion, that book chapters that are "close" together are expected to be found in the same branch of the corresponding dendrogram visualizations, and "away" from other chapters, with which they are less similar.
Given a particular distance function, there are several different hierarchical clustering methods available, depending on how exactly the distance between clusters is defined: in the single linkage method, this distance is defined as the distance between the closest two points that are in different clusters, while in the complete linkage method it is the distance between the farthest two points in different clusters; UPGMA and WPGMA methods stand for "unweighted/weighted pair group method using arithmetic averages" respectively, while Ward's method uses a somewhat different cluster distance measure, involving the increase in the variance of the distances between the data objects and the cluster centroids, when the two clusters are merged.
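The agglomerative procedure and three of the linkage definitions above (single, complete, and UPGMA-style averaging) can be sketched in a few lines of Python. This is a deliberately naive illustration under our own assumptions: the function name, the dictionary-based distance table, and the toy labels are all hypothetical, and Ward's method and WPGMA are not covered.

```python
from itertools import combinations


def agglomerative(labels, dist, linkage="complete"):
    """Naive agglomerative clustering; returns the list of merges.

    dist maps frozenset({a, b}) to the distance between items a and b.
    linkage is "single", "complete", or "average" (UPGMA).
    """
    def cluster_distance(c1, c2):
        pairwise = [dist[frozenset((a, b))] for a in c1 for b in c2]
        if linkage == "single":
            return min(pairwise)      # closest two points
        if linkage == "complete":
            return max(pairwise)      # farthest two points
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [(label,) for label in labels]  # start from singletons
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical items placed on a line; distances are absolute differences.
positions = {"p1": 0.0, "p2": 1.0, "v1": 10.0, "v2": 12.0}
dist = {
    frozenset((a, b)): abs(positions[a] - positions[b])
    for a, b in combinations(positions, 2)
}
merges = agglomerative(list(positions), dist, linkage="complete")
```

The returned merge history is exactly the information a dendrogram displays: which clusters were joined, and at what distance. In the toy data the two "p" items and the two "v" items are merged first, and the two resulting groups are joined only at the last step.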

At this point, we would like to urge the reader not to let himself or herself be discouraged by the introduced technical terminology, which serves only for making the present article self-sufficient regarding terms and definitions: we argue that the rest of this section can be safely read and comprehended intuitively, without any direct reference to the technical definitions given above.
That said, in the rest of this section, we present hierarchical clustering results involving six (6) different distance functions between data objects (book chapters), each one of them tried with five (5) different clustering methods. We stress that, to the best of our knowledge, this is far from typical in similar studies of literary texts, where the clustering experiments are usually limited to the Euclidean or cosine distance functions, with Ward's or complete linkage clustering methods, and usually with no justification provided for the choice of a particular distance function or clustering method. The rationale for employing such a wide variety of clustering approaches in our study is discussed and justified in the final section of the article.
To begin with, the results of hierarchical chapter clustering using Ward's method with the Manhattan distance are shown in Fig. 2 below.
Figure 2: Hierarchical clustering of chapters using Ward's method with the Manhattan distance. The six V. storyline chapters stand out clearly on the left branch of the dendrogram. The picture is very similar using simple term frequency (TF) weighting without conversion to lowercase.
In Fig. 2, the six chapters of the V. storyline (3, 7, 9, 11, 14, and 17) are clearly grouped together and "away" from the chapters of the Profane storyline, which are also themselves grouped together. The clustering of Fig. 2 is produced by weighting the terms according to their TF-IDF count (see Section 3) with lowercase conversion, but the picture is qualitatively very similar with simple TF weighting and preservation of the uppercase characters. Two more clustering examples with very similar results are shown in Figs. 3 and 4, using different clustering methods, distance functions, and term weighting.

Figure 3: Hierarchical clustering of chapters using the UPGMA method with the Canberra distance. The six V. chapters stand grouped together in the rightmost branch of the dendrogram (14, 3, 7, 9, 11, 17). Changing the weighting to TF-IDF or converting to lowercase gives practically identical results.
Figure 4: Hierarchical clustering using the complete linkage method with Euclidean distance. Again, the six V. storyline chapters can be seen to occupy their own dedicated (left) branch of the dendrogram.
The results of our thorough experiments with hierarchical clustering are summarized in Table 3 below. The tick symbols mean that, for the particular combination of clustering method (columns) and distance function (rows), we were always able to find a hierarchical clustering similar to that of Figs. 2-4 above, by varying the term weighting (TF or TF-IDF), the percentage of sparse terms removed, and, rarely, the conversion or not to lowercase. When using the Canberra and binary distance functions with the single linkage clustering method, we had again a dendrogram branch consisting exclusively of five chapters of the V. storyline, but Chapter 14 was not included. For the UPGMA method with the Euclidean distance, we had the case that Chapter 16 was grouped together with the six V. storyline chapters, again in a dedicated branch of the corresponding dendrogram with no other chapters of the Profane storyline.
From the results shown in Table 3 and Figs. 2-4 above, it is apparent that the clustering algorithms employed are capable of capturing the heterogeneities among the book chapters in a robust and consistent way, across a rather wide spectrum of settings, approaches, and parameters.

Graph visualizations
Utilizing the distance calculations produced as a part of the clustering approaches presented in the previous section, we are able to come up with a different, ad hoc visualization technique that can highlight the book structure from an alternative viewpoint. The idea behind it is simple: we visualize the chapters as nodes in a graph; we apply a certain threshold to the distance measures, so that if the distance between two chapters is lower than this threshold, we connect these two chapters with a link; otherwise, if the distance between two chapters is greater than this threshold (i.e. if two chapters are more dissimilar according to the particular distance function used), we do not connect them. That way, we expect to get a graph where the most similar chapters will be connected to each other, and disconnected from the other ones. By applying this idea to the Euclidean distance function between our chapters, we get the graph shown in Fig. 5.
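The thresholding idea can be sketched as follows in Python. This is a minimal illustration under our own assumptions rather than the code behind the figures: the function names, the dictionary-based distance table, the toy labels, and the choice of a strict "lower than" comparison at the threshold are all hypothetical.

```python
from itertools import combinations


def threshold_graph(labels, dist, threshold):
    """Link two items whenever their distance is below the threshold."""
    graph = {label: set() for label in labels}
    for pair, d in dist.items():
        if d < threshold:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Groups of nodes that are mutually reachable through links."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical items on a line; distances are absolute differences.
positions = {"A": 0.0, "B": 1.0, "C": 2.5, "X": 10.0, "Y": 20.0}
dist = {
    frozenset((a, b)): abs(positions[a] - positions[b])
    for a, b in combinations(positions, 2)
}
graph = threshold_graph(list(positions), dist, threshold=2.0)
components = connected_components(graph)
```

With this toy data, A, B, and C form one connected body (A and C are linked only through B), while X and Y remain isolated "islands". Raising or lowering the threshold changes how many links survive, so in practice it has to be tuned by inspection.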

3. Chapter 16 is connected with the main body of the Profane storyline only through Chapter 13 (the main characters are already in Malta), and it stands out naturally as a kind of terminal (or a cape!).
Looking at the terminal-like (or cape-like) depiction of Chapter 16 in Fig. 5, we cannot help but recall the actual ending of the chapter (of the whole Profane storyline, in fact), with Benny Profane running towards the literal edge of Malta:

Later, out in the street, near the sea steps she inexplicably took his hand and began to run. The buildings in this part of Valletta, eleven years after war's end, had not been rebuilt. The street, however, was level and clear. Hand in hand with Brenda whom he'd met yesterday, Profane ran down the street. Presently, sudden and in silence, all illumination in Valletta, houselight and streetlight, was extinguished. Profane and Brenda continued to run through the abruptly absolute night, momentum alone carrying them toward the edge of Malta, and the Mediterranean beyond.

As with hierarchical clustering, our results here seem also to be robust and persistent under different settings and parameterizations: Fig. 6 shows a similar graph visualization, this time with the Manhattan distance. Despite some differences, most notably the non-connection of Chapter 1 with the main body of the Profane storyline, the similarity between the two figures is striking. Trying to get a similar visualization with the Canberra distance, we were in for a surprise, as can be seen in Fig. 7. After checking for errors, and while still considering not including Fig. 7 here, we came up with a controversial claim, which we put forward for debate: in response to a (highly unlikely) question posed by a (highly unlikely) fictitious candidate reader of the novel, "since I keep on hearing about the highly heterogeneous structure of the novel, it should be possible to read roughly half of the book and still be able to grasp the most out of it; now, which chapters should I read?", we claim that the connected chapters in Fig. 7 (i.e. the V. storyline chapters minus Chapter 14, framed by the first and the last of the Profane storyline chapters) constitute a possible valid answer.

Normalised compression distance
The normalised compression distance (NCD) is a relatively recent method, proposed by Cilibrasi and Vitányi, for computing the distance between generic data objects based on compression. The method has deep roots in information theory, particularly in the concept of Kolmogorov complexity.

Surprisingly enough, the method has yet to find its way into the standard text mining toolbox and it remains rather underexploited. An application to literature analysis was included already in the original NCD paper, where a perfect hierarchical clustering of five classic Russian authors (Dostoyevsky, Gogol, Turgenev, Tolstoy, and Bulgakov) is reported, based on three or four original texts per author; interestingly enough, when fed with English translations of works by the same authors, the resulting clusters were biased by the respective translators. In Cilibrasi and Vitányi's words, "it appears that the translator superimposes his characteristics on the texts, partially suppressing the characteristics of the original authors", a rather well-known truth regarding literature translation, which nevertheless the method was able to independently re-discover, based on fairly simple quantitative measures.
The choice of the particular data compressor to be used is the only free parameter of the NCD method. Cebrián et al. have performed a thorough, independent performance test of the method using three different compressors, namely bzip2, gzip, and PPMZ, which in turn are example implementations of the three main types of compression algorithms, i.e. block-sorting, Lempel-Ziv, and statistical, respectively. Following their findings and recommendations, we do not use the gzip compressor here due to file size concerns; also, since PPMZ implementations are not common, we have used instead the LZMA compressor, which has an acceptable file size region identical to that of PPMZ and suitable for our data. The specific implementations employed are the bz2 and pylzma Python libraries.
It should be stressed that, in stark contrast to all the other methods used in this paper, the NCD method does not rely on the bag-of-words assumption; also, the input text files are fed to the algorithm "as-is", without any kind of preprocessing. That way, and by its nature, the NCD method is able to capture higher-order information included in the text, which by definition goes beyond the reach of all the other methods employed here.
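For orientation, the NCD of two byte strings x and y is defined by Cilibrasi and Vitányi as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length and xy is the concatenation of the two inputs. The following minimal Python sketch implements this formula. Note the assumptions: it uses the standard-library `lzma` module in place of pylzma, the function name `ncd` is ours, and the two inputs are synthetic toy data rather than chapters of the novel.

```python
import bz2
import lzma
import random


def ncd(x: bytes, y: bytes, compress=bz2.compress) -> float:
    """Normalised compression distance between two byte strings."""
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic toy inputs (assumptions for this example only):
# a highly repetitive text and a block of pseudo-random bytes.
repetitive = b"the street at night " * 400
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(8000))

same_bz2 = ncd(repetitive, repetitive)             # bzip2 compressor
diff_bz2 = ncd(repetitive, noise)
same_lzma = ncd(repetitive, repetitive, lzma.compress)
diff_lzma = ncd(repetitive, noise, lzma.compress)
```

An object compared with itself yields a small NCD, because the second copy adds almost nothing to the compressed length, whereas two unrelated inputs yield a value close to 1. Applied to raw chapter files, the pairwise NCD values can be fed directly to a clustering or graph-building step.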
Since the NCD is essentially a distance measure, it can itself be used for constructing hierarchical clusterings; indeed, this is the principal use of the method as suggested by its creators. Nevertheless, here we choose to use it in order to construct graph visualizations similar to those in Section 5 above. Figs. 8 and 9 depict such graphs, built using the LZMA and bzip2 compressors respectively. All the characteristics already met in Figs. 5 and 6 of Section 5 are again present here, and should by now look familiar: a main connected body consisting of the Profane storyline chapters; the V. storyline chapters as islands; the "gateway" function of Chapter 13; the "terminal" function of Chapter 16; and even the loose integration of Chapters 1 and 15 into the main cluster of the Profane storyline (Chapters 2, 4, 5, 6, 8, 10, and 12). We notice that the connection between Chapters 11 and 17 of the V. storyline shown in Fig. 9 is a meaningful one, as both chapters take place in Malta, with Fausto Maijstral as a central figure.

A simple topic model
Probabilistic topic modeling is a family of algorithms that aims to automatically discover and extract thematic information from (usually large) corpora of text documents. Without going into the technical details here, for our purposes it suffices to say that, according to the topic modeling approach, each document in a collection consists of several topics in different proportions, whereas the topic set itself is common for the whole document collection. The method has found applications in the analysis of political texts, as well as in meta-analyses of scientific papers published in academic journals, ranging from automatic tagging and labeling to location and identification of specific research trends as they evolve in time.

Here we will use one of the simplest and most basic approaches in topic modeling, namely the latent Dirichlet allocation (LDA), as implemented in the R package topicmodels. We will treat the whole book as our collection, with the documents being the individual chapters. Again, we aim for simplicity and clarity of demonstration rather than a rigorous treatment using more complex and sophisticated techniques. In what follows, the reader has to keep in mind that topic modeling, in contrast with the other techniques employed here, is a probabilistic method, and as such it is expected to give non-identical results for various runs of the algorithms with different initializations ("seeds") of the software random number generator involved.
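To give a feeling for what "a document as a mixture of topics" means, the following Python sketch simulates the generative story assumed by LDA: topic proportions for a document are drawn from a Dirichlet distribution, and each word is then produced by first picking a topic and then picking a term from that topic. It is an illustration of the model only, not of the fitting procedure in topicmodels; the two toy topics, the Dirichlet parameters, and all function names are assumptions made for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalised Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story.

    topics is a list of dicts mapping term -> probability, one per topic.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        # Pick a topic for this word, then a term from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        terms = list(topics[z])
        probs = [topics[z][t] for t in terms]
        words.append(rng.choices(terms, weights=probs)[0])
    return theta, words


# Two hypothetical topics over a tiny vocabulary.
topics = [
    {"street": 0.5, "night": 0.3, "girl": 0.2},
    {"malta": 0.6, "eyes": 0.4},
]
rng = random.Random(42)  # fixed seed, so the run is repeatable
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=20, rng=rng)
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topics and the per-document proportions. Changing the seed changes the sampled document, which mirrors the run-to-run variability mentioned above.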
In the elementary LDA approach, the number of topics has to be predefined by the researcher. Following the suggestions of Grün and Hornik, we tried to determine the optimal number of topics by running the algorithm for a range of possible topic numbers and computing the log-likelihood of each resulting model. Unfortunately, when repeated for several different random seeds, this approach gave an optimal number of topics in the range of 18-23. Clearly, the number of topics should not be greater than, or even equal to, the number of documents (we did confirm that: running the algorithm for 18 topics gave the trivial result of assigning each chapter to one unique topic).
With the likelihood approach unsatisfactory, we tried to determine a reasonable number of topics ad hoc, based on our prior knowledge about the novel: we thought that this number should not be less than the number of the V. storyline chapters (6), and it should not be greater than roughly half the total number of chapters (8-9). That way, with some trial and error with numbers of topics between 6 and 9, we were able to come up with a fitted LDA model of seven (7) topics. Our results are shown in Table 4.
Table 4: Chapter assignment to topics, as produced by a 7-topic model. V. storyline chapters are denoted in bold. Asterisks denote chapters that were assigned to more than one topic with comparable probabilities.
Recall that in principle, according to the topic modeling approach: (i) a topic can be part of more than one document and (ii) a document can consist of one or more topics in some proportions. From Table 4, we can see that the topics discovered by the LDA algorithm are "pure" with regard to the two different storylines, i.e. there are no topics belonging to both. Moreover, the vast majority of our chapters are also "pure", in the sense that they consist of a single topic, with the exception of Chapters 5 and 15, which consist of two topics each.
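The step from per-document topic proportions to a chapter-to-topic assignment can be sketched as below. This is only an illustration: the proportions are invented, the function name is ours, and the `tolerance` used to decide when two probabilities count as "comparable" is a hypothetical parameter, not a value taken from our experiments.

```python
def assign_topics(proportions, tolerance=0.1):
    """Assign each document to its most probable topic(s).

    A document is assigned to every topic whose probability lies within
    `tolerance` of its highest topic probability.
    """
    assignments = {}
    for doc, probs in proportions.items():
        best = max(probs)
        assignments[doc] = [
            k for k, p in enumerate(probs) if best - p <= tolerance
        ]
    return assignments


# Invented topic proportions for two documents over three topics.
proportions = {
    "chapter_a": [0.95, 0.03, 0.02],  # dominated by a single topic
    "chapter_b": [0.48, 0.50, 0.02],  # two topics of comparable weight
}
assignments = assign_topics(proportions)
```

Under these invented numbers, `chapter_a` is "pure" (a single topic), while `chapter_b` is assigned to two topics, which is the situation an asterisk would mark in a table of assignments.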
As already said, topic modeling is a probabilistic method, and the results shown in Table 4 are just the output of the algorithm for a specific random seed. But repeating the experiment 10 times with different random seeds, we kept on getting the same qualitative result, i.e. three topics assigned exclusively to the six V. storyline chapters, and four topics for the Profane chapters, with no mixing between the storylines, although the specific grouping of chapters to topics can be quite different.
For illustrative purposes, Table 5 shows the 10 most probable terms for each of the three topics of the V. storyline, as depicted in Table 4.

Table 5: The 10 most probable terms for each of the three topics found in the V. storyline chapters, as shown in Table 4. Notice the presence of the most frequent terms, as shown in the wordcloud of Fig. 1 above. Uppercase letters have been manually restored where appropriate for the convenience of the reader.
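Extracting the most probable terms of a topic amounts to sorting that topic's term distribution. The Python sketch below illustrates the idea; the function name `top_terms`, the four-word vocabulary and the probabilities are invented for the example and do not reproduce the contents of Table 5.

```python
from typing import List, Sequence


def top_terms(beta: Sequence[Sequence[float]], vocabulary: Sequence[str], n: int = 10) -> List[List[str]]:
    """Return the n most probable terms for each topic.

    beta[t][i] is the probability of term vocabulary[i] under topic t.
    """
    result = []
    for row in beta:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Invented topic-term probabilities over a toy vocabulary.
vocabulary = ["stencil", "malta", "profane", "maijstral"]
beta = [
    [0.50, 0.30, 0.05, 0.15],
    [0.10, 0.05, 0.80, 0.05],
]
terms = top_terms(beta, vocabulary, n=2)
print(terms)
```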
Once again, the results prove to be rather robust and consistent, and not the outcome of some fine-tuning: we performed some limited experiments with 8 topics; most of the time, the results were again qualitatively similar to those in Table 5, but occasionally Chapter 16 alone of the Profane storyline would be grouped into the same topic as Chapter 17 of the V. storyline. This is rather justifiable, as both chapters take place in Malta with Fausto Maijstral as a central figure (recall from Table 3 above that Chapter 16 was also misgrouped in some of our clustering experiments).

Unsupervised text mining methods for literature analysis: a case study for Thomas Pynchon's V.

Discussion and future work
It is a well-known fact among data mining practitioners 52 that unsupervised methods in general, and clustering in particular, can be like looking for patterns in the star-filled night sky: one will always be able to come up with some meaningful-looking ones, as the results of the ancient Greeks' vivid imagination still testify. 53 Nevertheless, the convergence of the results produced by a number of different approaches provides a kind of safety against this mental trap, especially if the subject approaches are based on several non-overlapping assumptions and techniques. 54 And this is exactly what we report here: we have utilized a wide range of techniques and algorithms, both deterministic and probabilistic, including different term weighting schemes, different clustering methods and distance functions, varying parameterizations where applicable (e.g. for the Minkowski and NCD distances), ad hoc visualization techniques, with and without the bag-of-words assumption, and with several levels of text preprocessing, ranging from application of all standard preprocessing operators up to no preprocessing at all. Our results converge convincingly in revealing the heterogeneous structure of the novel at the chapter level.
It is not quite clear to us how (and if) such results can be of merit for the critic or the literary theorist. At the end of the day, we could easily imagine one arguing that we have just spent enormous amounts of human and computing power, just to reveal something that was already known in the first place. Of course, such arguments cannot withstand any serious criticism: if we are to embark on any genuine journey towards the quantitative analysis of our literary heritage, we must first test our tools and methods, explore their range of applicability, and map their limitations; and there is hardly any better way of doing so than checking their outputs against already known facts, in order to gauge and calibrate their relevance and suitability. From this point of view, we consider the work presented here a successful demonstration.
There are several directions in which the present study can be extended. Among the first, one could imagine dropping the bag-of-words assumption. There are already some relevant tools available: limiting the discussion to topic modeling, extensions of the basic approach and hybrid models have been proposed 55 that can capture higher-order semantic structure and both short- and long-range dependencies between words in a document; some of these tools are also available as a free toolbox for Matlab. 56 Even the elementary LDA model is in principle readily applicable to more complex approaches, involving building blocks of n-grams or even paragraphs. 57
An implicit characteristic of the present work is that it was implemented using just general purpose data analysis software, which, despite the functionality of some dedicated add-on packages, is perhaps still quite limited for this kind of study. We plan to undertake similar investigations in the future, utilizing freely available software that is dedicated to document analysis, such as MALLET. 58 In any case, however, the access to a considerable body of existing algorithms and the flexibility that are provided by a general purpose software tool such as R are extremely valuable and should not be underestimated.
By now, computer-assisted content analyses for literary works are not uncommon and, perhaps unsurprisingly, a good lot of them focus upon the Shakespearean corpus. 59 We choose to conclude the present study quoting Jonathan Hope, one of the pioneers in the field of digital scholarship on Shakespeare:

We perform digital analysis on literary texts not to answer questions, but to generate questions. The questions digital analysis can answer are generally not 'interesting' in a humanist sense: but the questions digital analysis provokes often are. And these questions have to be answered by 'traditional' literary methods. 60

Or, in the words of Stephen Ramsay:

If text analysis is to participate in literary critical endeavor in some manner beyond fact-checking, it must endeavor to assist the critic in the unfolding of interpretive possibilities. We might say that its purpose should be to generate further "evidence," though we do well to bracket the association that term holds in the context of less methodologically certain pursuits. The evidence we seek is not definitive, but suggestive of grander arguments and schemes. 61

We would be very happy if the present work could serve as a trigger, in order to initiate more quantitative studies on the work of "Thomas Pynchon, the greatest, wildest and most infuriating author of his generation". 62 In the meanwhile, we will delve further into the research paths proposed by Franco Moretti and Stephen Ramsay, trying to prepare ourselves against the day.

5. The distinction between supervised and unsupervised methods is a standard one in the field of data mining. Unsupervised methods aim "to identify patterns in the data that extend our knowledge and understanding of the world that the data reflects", without the existence of a "specific target variable that we are attempting to model" (Williams, p. 175); they are usually associated with what we call "descriptive" approaches, and they do not depend on any particular modeling input (hence "unsupervised"). In contrast, with the "predictive" approaches and the corresponding supervised methods, one tries to predict a specific target variable which has been previously defined as such in the modeling (hence "supervised"); an example of supervised methods in the quantitative analysis of text would be to try to assign a piece of text of unknown authorship to one of the authors in a predefined and limited list, once the algorithm has been previously "trained" with known texts of the subject authors (the target variable here being a "class label" attached to the text, with its author's name). Only unsupervised methods are employed in the present study.
6. Stephen Ramsay comments on "quantitative analysis [as] chief among […] those activities that are usually seen as anathema to the essential goal of literary criticism" (Ramsay, p. 57
11. We will confess that, when commencing with the present study, we were erroneously certain that such an unambiguous mapping of chapters-to-storylines was already in place.
12. David Cowart, trying to construct a timeline-chronology of avatars and congeners of the character V., implicitly ends up with a collection of V. storyline chapters that is identical to ours, i.e. chapters 3, 7, 9, 11, 14, and the Epilogue (Cowart,.
13. As David Seed notes, "There has been a tacit agreement among critics that the historical chapters tend to be richer and more varied than those set in 1956" (Seed, p. 72). He also comments on "the astonishing variety of tone and effects which Pynchon manages", and "the local richness of these [historical] chapters" (Seed, p. 87).
14. "[In] the bag of words model, the exact ordering of the terms in a document is ignored but the number of occurrences of each term is material. We only retain information on the number of occurrences of each term. Thus, the document "Mary is quicker than John" is, in this view, identical to the document "John is quicker than Mary". Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content." (Manning et al., p. 117, emphasis in the original).
15. In our (desperate) attempt not to get too technical, we choose to quote from a source that appeals to humanities readers rather than to quantitative scientists. Nevertheless, even in this case, it seems that we cannot avoid the explicit use of an equation… The framework of the following discussion (and the relevant document collection) is Virginia Woolf's novel The Waves, and the assumed distinct "documents" in the collection are not the chapters (nonexistent here), but the individual characters' monologues: "Let tf equal the number of times a word occurs within a single document. So, for example, if the word "a" occurred 194 times in one of the monologues, the value of tf would be 194. A term frequency list is therefore the set of tf values for each term within that speaker's vocabulary. Such lists are not without utility for certain applications […].
[If] we modulate the term frequency based on how ubiquitous the term is in the overall set of speakers, we can diminish the importance of terms that occur widely in the other speakers […] and raise the importance of terms that are peculiar to a speaker. Tf-idf accomplishes this using the notion of an inverse document frequency: Let N equal the total number of documents and let df equal the number of documents in which the target term appears. We have six speakers. If the term occurs only in one speaker, we multiply tf by six over one; if it occurs in all speakers, we multiply it by six over six. Thus, a word that occurs 194 times, but in all documents, is multiplied by a factor of one (six over six). A word that occurs in one document, but nowhere else, is multiplied by a factor of six (six over one)." (Ramsay,
18. Stephen Ramsay goes to great lengths to argue that such text transformations, however distant they may initially seem from the scholarly tradition of 'close reading', can actually be seen as a natural part of it: "Any reading of a text that is not a recapitulation of that text relies on a heuristic of radical transformation. The critic who endeavors to put forth a "reading," puts forth not the text, but a new text in which the data has been paraphrased, elaborated, selected, truncated, and transduced. This basic property of critical methodology is evident not only in the act of "close reading," but in the more ambitious project of thematic exegesis. In the classroom, one encounters the professor instructing his or her students to turn to page 254, and then to page 16, and finally to page 400. They are told to consider just the male characters, or just the female ones, or to pay attention to the adjectives, the rhyme scheme, images of water, or the moment in which Nora Helmer confronts her husband. The interpreter will set a novel against the background of the Jacobite Rebellion, or a play amid the historical location of the theater. He or she will view the text through the lens of Marxism, or psychoanalysis, or existentialism, or postmodernism. In every case, what is being read is not the "original" text, but a text transformed and transduced into an alternative vision, in which, as Wittgenstein put it, we "see an aspect" that further enables discussion and debate." (Ramsay, p. 16).
25. In text analysis jargon, "stop words" refer to extremely common words that are so frequently used that they become trivial and non-significant for the analysis. As Rajaraman & Ullman note: "Our first guess might be that the words appearing most frequently in a document are the most significant. However, that intuition is exactly opposite of the truth. The most frequent words will most surely be the common words such as "the" or "and", which help build ideas but do not carry any significance themselves. In fact, the several hundred most common words in English (called stop words) are often removed from documents before any attempt to classify them." (Rajaraman & Ullman, p. 8
38. Recall that the V. storyline chapters ("historical") "tend to be richer and more varied than those set in 1956" (Seed, p. 72); see also endnote 13 above.
39. We do not claim, of course, that the chapter ending is actually detected (let alone "proved") in Fig. 5; we simply notice this as a rather playful, but worth-mentioning, coincidence.
Tan et al., p. 532: "almost every clustering algorithm will find clusters in a data set, even if that data set has no natural cluster structure". See also endnote 54 below.
53. Ironically enough, Pynchon himself seems to warn against such a stance; in the words of a character in the book, Dudley Eigenvalue: "In a world such as you inhabit, Mr. Stencil, any cluster of phenomena can be a conspiracy." (Pynchon, p. 154