\documentclass[letterpaper]{article}
% \keywords{ Mapping and visualization of knowledge ; semantic network ;
% co-word analysis ; paradigmatic proximity, asymmetric proximity measure}

%% Email address of the communicating author is required.
%% You may list email addresses of all other authors, separately.
%\email{david.chavalarias@polytechnique.edu}
% Put your short thanks below. For long thanks/acknowledgements,
%% please go to the last acknowledgement section.


%The first author is supported by NSF grant xx-xxxx}
\setcounter{minitocdepth}{1}
\dominitoc


% Displays a table of contents... it is a bit buggy, so uncomment only if needed.
%\tableofcontents
\maketitle




\centerline{\scshape David Chavalarias }
\medskip
{\footnotesize
%% please put the address of the second author
\centerline{Institut des Systèmes Complexes de Paris Ile-de-France \& CREA, CNRS - Ecole Polytechnique}
\centerline{ ISCPIF, 57-59 rue Lhomond, 75005, Paris, France}
} % Enter the first author's name and address:
\centerline{\scshape Jean-Philippe Cointet }
\medskip
{\footnotesize
%% please put the address of the first author
\centerline{CREA, CNRS - Ecole Polytechnique, 1 rue Descartes, 75005, Paris, France}
\centerline{TSV, INRA, 65 av de Brandebourg, 94205, Ivry, France}

} %% Do not forget to end the {\footnotesize by the sign }

\bigskip

\begin{abstract}
We propose for the first time to reconstruct a science phylogeny made of discrete dynamics of scientific fields.
\end{abstract}

\dominitoc
\section*{Introduction}
\label{sec:introduction}
We face a real challenge in coping with the rapidly changing nature of science. First, the millions of papers published every year make it clearly impossible for anybody to have an exhaustive knowledge of all the important breakthroughs and developments in every field of science. This issue is made even more critical by the continuous acceleration of scientific production, which threatens every scholar with \textit{information overload} (the volume of publications per year has doubled over the last 12 years). Second, although science is not carved in marble and would better be defined as an ever-changing enterprise \cite{Hull-1988}, a lively debate has been taking place for more than 10 years around the shift toward a new regime of knowledge production following the transformation of the nature of the research process. According to \cite{nowotny2001rts}, science is said to have recently entered a new mode in which knowledge is generated within a wider context of application, giving full place to trans-disciplinarity, defined as the circulation of tools, theoretical perspectives, and people.
% The rest of the sentence is not very clear, to be clarified: opening of scientific communities frontiers which are getting even more virtualized than before. Scientists are circulating into various agoras

%First, in ?Mode 2? knowledge, scientific ?peers? can no longer be
%reliably identified, because there is no longer a stable taxonomy
%of codified disciplines from which ?peers? can be drawn. Second,
%reductionist forms of quality control can not easily be applied to
%much more broadly-framed research questions; the research ?game?
%is being joined by more and more players ? not simply a wider and
%more eclectic range of ?producers?, but also orchestrators, brokers,
%disseminators, and users. Third, and most disturbingly, clear and
%unchallengable criteria, by which to determine quality, may no longer
%be available. Instead, we must learn to live with multiple definitions

Whatever the causes of such transformations, the frontiers of science indeed appear to be changing ever faster and becoming blurred as fields and sub-fields cross-fertilize, grow or die. There is an urge to \textit{map}.
%\MISSING{I would also add elements of sociology of science, namely something like: boundaries have blurred between disciplines (Nowotny-style, \`a la Dubucs) but also between science and society (in the manner of Calon, Woolgar, etc.)}
%\MISSING{Nowotny or others on the advent of a type III science.}

%While usual frontiers between disciplines are being redefined through new collaborations between domains and an unprecedented rapprochement of ideas and tools, the dynamics of science have been shown to exhibit strong interactions with the social: artefacts / human/non-human boundaries collapse - how to recollect these blurred boundaries? Things circulate! --> Repeats the previous paragraph, doesn't it?


This evolution of the organisation of science has been accompanied by a deep revolution in the processes of production and dissemination of science. Over the last decades, the transfer of scientific publication activity toward electronic media, such as online journals or electronic archives, has completely changed the way we interact with scientific productions. We have immediate access to almost all published articles, even the most recent, sometimes even before their publication through public preprint archives. Moreover, everybody can make their scientific production public, even without the support of a journal. All these productions are indexed, sometimes in full text, which makes it possible to run accurate queries on the whole scientific production. We have switched from a strongly hierarchical scheme of knowledge dissemination, where journals were the intermediary between scientists, to network-like patterns of knowledge circulation.

%%

First, from a pragmatic perspective, scholars should be able to find their way through the millions of papers published every year, either to get an idea of the place of their research in the global picture of science, or to rely more accurately on previous items of knowledge to solve current problems in science, especially when exploring bibliography or selecting a set of citations
\cite{Garfield:1979p2208}.
%Citations and references thus operate within a jointly
%cognitive and moral framework. In their cognitive aspect, they are designed to pro
%vide the historical lineage of knowledge and to guide readers of new work to sources
%they may want to check or draw upon for themselves. In their moral aspect, they are
%designed to repay intellectual debts in the only form in which this can be done:
%through open acknowledgment of them. Such repayment is no minor normative re-
%quirement. That is plain from the moral and sometimes legal sanctions visited upon
%those judged to have violated the norm through the kinds of grand and petty in-
%tellectual larceny which we know as plagiarism. (Karl Marx testifies to the possible
%depth of commitment to the norm: for him, plagiarism was the one altogether
%detestable crime against private property, as witness his preface to the first edition of
%Capital and his further thunderings on the subject throughout that revolutionary
\FTR{also add the notion of building one's bibliography, drawing on Merton: the bibliography serves to arrange the building blocks of knowledge on which one constructs a new edifice that answers a new question - the multiplication of publications may endanger this edifice.}

Second, it is important to get an accurate picture of how science evolves, which new directions of research appear, and which configurations become obsolete. The same questions can be raised both at the level of scientists, who are continuously opting for new research directions and forming new and changing collaborative configurations \cite{cambrosio2004mcw}, and at the level of science policy makers and managers of scientific organizations, who need to know how scientific domains are actually connected, \emph{e.g.} in order to design their funding policies optimally.

\par Of course, these problems are not new. Philosophers of science have long theorized about the conceptual structure of science and have proposed many (often conflicting) descriptions and explanations of scientific change and revision (\cite{popper1963Conjectures}, \cite{Kuhn-1970}, \cite{Bonaccorsi2008Search}).

%Sociologists have good case studies on the building, structure and dissipation of such or such scientific network \FTR{add refs or remove this sentence. Sociologists are likely to insist on the ... social aspects, which we do not address, so it may indeed be risky}.

However, thanks to the digitization of scientific databases, a large-scale quantitative approach to the structure of science and its evolution is possible for the first time, while new methods for the reconstruction and visualization of large-scale science dynamics are being developed.

Electronic archives and other scientific databases are indeed a real opportunity to gain insight into scientific production and its evolution. The counterpart, however, is that this massive access to millions of scientific papers requires specific methods to handle the global picture. To make sense of this huge mass of heterogeneous data about science, tools from data mining (in the wide sense) are required in order to identify patterns or \textit{meso structures} that make sense to us(ers) (\textit{e.g.} scientific fields or ``paradigms''). One of the major challenges of scientometrics is to deliver to scientists pictures of the knowledge landscape they face in their everyday work.

In this paper, we propose methods to reconstruct phylogenies of science. These methods will improve our global understanding of science evolution and pave the way toward the development of new tools for our daily interactions with its productions. In the long term, these methods should be able to corroborate or falsify models of science evolution.

As a case study, the paper presents a first reconstruction of a phylogeny of the scientific communities concerned with network studies in biological and medical research.

\section{State of the art and rationale}

\subsection{Scientometrics: from statics to dynamics}

Scientometrics is a young science that took off in the late seventies, fostered by the development of electronic scientific databases and the increasing power of computers. Quoting from one of the main journals in the field, ``scientometrics is concerned with the quantitative features and characteristics of science. Emphasis is placed on investigations in which the development and mechanism of science are studied by statistical mathematical methods\footnote{Scientometrics, Elsevier}''. Science mapping is one of the main activities of scientometrics. Maps are most often built upon co-occurrence data, with the assumption that the more often two words co-appear in the same article, the more they are related and the closer they should appear on the map. These co-occurrence data can be of different natures: co-authorship networks, %\cite{newman2004who},
co-citation networks,% \cite{Small1973Citations1},
or co-word networks (in titles, abstracts or full-texts).% \cite{Callon1983From}.

%\COR{not necessarily needed in the end, nor the references above, we give quite a few later on...} \MISSING{recontextualize with the help of the geographer's articles on the notion of science mapping and what it can bring}

The two most prominent kinds of analysis in scientometrics have been the analysis of (co-)citation networks\footnote{For example, in co-citation analysis, two articles are linked if they are cited together by a sufficiently high number of papers.} and of co-word networks. In citation analysis, maps represent clusters of papers where important papers should occupy a central position on the map (\cite{Small1973Citations}, \cite{Small1974Structure}).

In co-word analysis, higher-level structures are derived by analyzing word co-occurrence patterns in texts (\cite{Callon1983From}, \cite{Callon1986Mapping}). The link between two words has a strength that reflects their alleged similarity.%Co-word measures come in two forms: proximity and inclusion measures that differ in the formula they use for the normalization.
Generated maps represent clusters of terms that aim at reflecting domains of science.

Both techniques have their own drawbacks \cite{Noyons2001Bibliometric}. Co-citation studies can be biased by the loss of relevant papers, the inclusion of irrelevant papers or the time lag between the emergence of specialities and their appearance in science maps. %Citing behavior can also be biased by consideration outside the scientific scope.%\COR{I do not understand the last sentence}
Co-word techniques may also suffer from an inappropriate choice of the initial set of terms to be mapped or from the existence of fads in the use of terms among scientists. But the main objection made to co-word analysis is that words can be ambiguous or can have several meanings. Since co-word analysis does not take into account the context of terms in the source articles, little information is conveyed about their real meaning. Given that most scientometrics studies so far were based on non-overlapping clustering methods, co-word maps were bound for a long time to assign a unique meaning to a word, which decreases the overall significance of the map.

Last, a large part of the utility of science maps, for theorists (history and philosophy of science), users (scientists) and policy makers alike, lies in their capacity to give meaning to the evolution of science: what are the emergent fields, the continuities and the main paradigmatic shifts, and from which scientific fields does a new field inherit its intellectual background? There is thus an important concern about reconstructing these dynamics in such a way that fields of knowledge can be tracked through time. From the theoretical point of view, this entails that the core object for representing science evolution is a \emph{phylogenetic network}, while most scientometrics studies focus on static maps.

Today the drawbacks of co-word analysis are about to be overcome. The availability of very large databases of terms, citations and reference indexes results in a massive statistical effect that increases the robustness of studies and removes the biases associated with small sample effects. Recent advances in data mining and new methods from complex network analysis make it possible to perform hierarchical overlapping clustering on large worldwide databases. % (\COR{also cite boyack here}\cite{pala et et autres}). Yes, but do we have published references?
This makes it possible to handle multiple contexts of terms and to take into account different meanings or uses of a term. Last but not least, new information visualization techniques, especially those coming from network analysis, help to make the maps more understandable and interactive, and consequently more useful for end-users (scientists or policy makers).

In the following, we will focus on co-word analysis and propose methods for the automated reconstruction of science phylogenies that capitalize on these advances. The central question will thus be: \textit{How can we reconstruct science dynamics through automated bottom-up analysis of scientific publications?}

%\COR{I no longer know}\MISSING{There is a Latour citation \cite{Latour:1991p2213} lying around here, I no longer know why... }


\subsection{Tracking meso-dynamics}

Terms can be used by different communities with different meanings. This means that the appropriate level for tracking science evolution is not the dynamics of single terms but the evolution of sets of terms that contextualize each other's meaning. These sets of terms can be called \emph{scientific fields} or \emph{epistemic fields}. One term can thus belong to different scientific fields, which technically entails the use of clustering methods that allow clusters to overlap.

In \cite{coint08multi} we proposed new science mapping methods based on an asymmetric proximity measure, previously introduced in \cite{chava:scien}, that meet these constraints. In this approach, the units of the maps are overlapping clusters of words. Information about clusters and their articulation can be assessed from their components, and clusters can be automatically labeled according to what one wishes to put forward: specific terms to extract emerging topics, generic terms for a quick overview, etc.

In this paper we will address the question of science dynamics. One of the most essential features of science evolution is the way new associations between terms are made and how these new associations change the composition of scientific fields. These changes in the use of terms are the main visible evidence of shifts in scientific activity. Sets of terms are the adequate level to study the cross-fertilization of different fields of science, the circulation of concepts through domains, bursts of activity in a given branch, and so on. They are widely used by scientists to define, with a few keywords, their research, the topics of a journal or the scope of a conference. We will call the dynamics of science studied at the level of sets of terms the \textit{meso-dynamics} of science. Reconstructing these meso-dynamics is equivalent to finding a matching function between the clusters of science maps at successive periods of time.

The answer to this problem is far from straightforward. A scientific field, represented by a cluster $C$ at a given period of time, can undergo several kinds of transformations in its composition that will entail a different representation in the next periods: it can grow, shrink, merge with another field, split or die. Consequently, two successive maps can have very different sets of scientific fields. Scientific fields can all be different but nevertheless share some terms and potentially share the same scientific background. A scientific field can have several ``offspring'' at the next period and its conceptual legacy can come from several domains of investigation from the previous period. The reconstruction of these inheritance patterns will be very useful to get a global overview of the activity and evolution of large scientific domains.

\subsection{The phylogenetic network of science}
Drawing an analogy with biology, we might consider scientific fields as ``species''. %evolutionary history.
In biology, a phylogenetic tree represents the evolutionary history of species or organisms. Such trees are usually built upon sequenced genes or genomic data. Different methods are used to infer phylogenetic relationships \cite{Nei:1996p2205}. The most straightforward methods are clustering methods (like the Fitch-Margoliash method) based on a measure of genetic distance between species. Other methods, like maximum likelihood or Bayesian inference, rely on specific biological hypotheses regarding evolutionary dynamics (like the maximum parsimony hypothesis) in order to infer the tree that best fits the observed data \cite{Huelsenbeck:1997p2204}.

More importantly for our study, these methods of phylogenetic tree reconstruction have been criticized as too limited when considering complications such as a reticulate evolutionary history featuring horizontal gene transfers or genetic recombinations. When confronted with these hybridization events, one needs to switch from phylogenetic trees to \textit{phylogenetic networks}.

Contrary to biologists, who may have some prior knowledge about the mutation rates of certain genes, we will take an agnostic perspective and will ignore any possible mechanisms or organizing principles guiding science evolution, even if some authors have suggested that conceptual change in science could be led by evolutionary mechanisms similar to those acting in biological systems \cite{hull2001sas}. We will thus rely on simpler methods like distance matrix methods.
We can also expect science evolution to be populated with many ``hybridization events''. Science lineage does not progress linearly, cross-fertilization of domains is commonplace, and it would be misleading to give an idealized picture of science made of a simple tree continuously growing into finer and finer specialities.

Since we have no global objective function that we could optimize, we cannot rely on a parsimony principle like minimizing the number of mutations. We have to adopt a local approach. Reconstructing the phylogenetic network of science then amounts to answering this simple question: given a scientific field $C^t$ %$\Im$
at time $t$ and a ``homology'' matrix $\mathcal{H}(t)$ between the sets of fields at two consecutive times $t-1$ and $t$, from which fields at time $t-1$ does $C^t$ derive its conceptual legacy?

\section{Methodology: from publications to phylogenetic networks}
We will now detail the whole methodology used to go from raw data to phylogeny mapping. It unfolds as follows:
\begin{enumerate}
\item Selection of a target scientific domain and an associated data source,
\item Delineation of a set of terms used by scientists of the target domain and indexation of these terms in the database,
\item Mapping of the target domain, %Multi-level: I am removing this because we no longer present the multi-level approach here
\item Inter-temporal matching between scientific fields and phylogeny reconstruction.
\end{enumerate}

It is important to highlight the fact that, in order to propose scalable methods on raw data, we will assume that the proxy for science evolution is the indexes of science databases, as they are already built by search engines. This method will thus meet the constraint of working with aggregated co-occurrence data of terms in articles. Other methods bring interesting complementary perspectives on the dynamics of epistemic communities but rely on more detailed data sets (like author-based data for example \cite{roth:latt}). % \MISSING{do we have other refs?}

\subsection{From corpus to data}
Co-word analysis critically depends on the initial set of terms chosen for the study and can be biased by the ``indexer effect'' (\cite{Whittaker1989Creativity},\cite{Callon1986Putting},\cite{He1999Knowledge}). This effect can have several origins: the terms selected by the indexers are too general, specific terms have been omitted from the list, or the indexer puts the wrong emphasis in keywording. We chose a semi-automatic method that takes advantage both of powerful automated parsing of large corpora and of expert skills to minimize this effect.

In the case study presented here, we targeted the question of \textit{networks} in biological research. We chose PubMed-MedLine as the data source since it covers most of the publications in biology (more than 17M references), while titles and abstracts of articles are freely available. We then chose a few concepts related to network-based approaches (network, evolvable, evolvability, hub, feedback) and retrieved all the papers mentioning at least one of these terms in MedLine (about 2.4M references). We then indexed these 2.4M abstracts with their date of publication and retrieved all n-grams\footnote{Key phrases with exactly n terms.} with $n\leq3$ and a number of occurrences higher than $100^{\frac{1}{n}}$ over the whole period (\emph{e.g.} the term \emph{protein interaction network} has to appear in at least 5 references to be included in our set of candidate keywords). Stop words were discarded. This list of terms was then checked by science historians to further discard uninformative terms, which finally led to a set $\mathcal{L}$ of 834 terms (given in SI).
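
As a quick check of this selection rule, and reading the occurrence threshold as $100^{\frac{1}{n}}$ (a reading that is consistent with the trigram example above), the threshold for each n-gram length is:
$$ 100^{\frac{1}{1}} = 100, \qquad 100^{\frac{1}{2}} = 10, \qquad 100^{\frac{1}{3}} \approx 4.64 . $$
Since occurrence counts are integers, a trigram must appear at least 5 times. If ``higher than'' is read strictly, a bigram must appear at least 11 times and a single word at least 101 times.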

These terms were then indexed from 1950 to 2008 in the 2.69M retrieved abstracts to build the co-occurrence matrix $\mathcal{M}$ giving all co-occurrences for terms in $\mathcal{L}$ from 1950 to 2008. $\mathcal{M}_t(i,j)$ gives the number of articles published during the year $t$ which mentioned both terms $i$ and $j$ in their abstract.

\subsection{Static (multi-level) map reconstruction}
Scientometrics has defined a great number of measures based on co-occurrence data that capture the degree of similarity or proximity between two terms through the analysis of simple co-occurrence statistics \cite{He1999Knowledge}. Among others, we can mention two indexes introduced early in scientometrics: the inclusion index $\frac{n_{ij}}{\min(n_i,n_j)}$ \cite{Callon1986Qualitative} and the proximity index $\frac{n_{ij}^2}{n_i n_j}$. Here, $n_i$ (respectively $n_j$ and $n_{ij}$) is the number of articles mentioning the term $i$ (respectively $j$, and both $i$ and $j$).
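
As an illustration with hypothetical counts (chosen only for the example, not taken from the corpus), suppose that $n_i=100$, $n_j=1000$ and $n_{ij}=80$. Then
$$ \frac{n_{ij}}{\min(n_i,n_j)}=\frac{80}{100}=0.8, \qquad \frac{n_{ij}^2}{n_i n_j}=\frac{6400}{100000}=0.064 . $$
Both values are unchanged when the roles of $i$ and $j$ are exchanged, so neither index can tell which of the two terms is the more specific one.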

Further measures were later introduced. However, these measures, by summarizing the relation between two terms with a single number, fail to convey information about the level of use of a term: given two terms $i$ and $j$, is one more specific or more generic than the other? Is $i$ more specific in the sense that it tends to be used by a sub-community of the community using $j$?

Our hypothesis is that asymmetrical relations between terms are essential information to gain insight into the overall structure of science (fields and subfields), and that they can be captured from co-occurrence analysis once we have defined a proper asymmetric measure on terms.

In \cite{chava:scien}, we introduced a new proximity measure called ``paradigmatic proximity'', which has the advantage of conveying information about the relative position of two terms from the point of view of specificity. We will base our case study on a variant of this measure: %since, as we shall see, it enables to built multi-scale maps with adaptive labeling of the clusters. We will thus consider the measure :
$\Proxm^t(i,j)=\left(\left(\frac{n_{ij}^t}{n_i^t}\right)^{\alpha}\left(\frac{n_{ij}^t}{n_j^t}\right)^{1/\alpha}\right)^{\max(\alpha,\frac{1}{\alpha})}$\footnote{$n_i$ (respectively $n_j$ and $n_{ij}$) is the number of articles mentioning the term $i$ (respectively $j$, and both $i$ and $j$).}. However, the methods for phylogeny reconstruction we introduce here can be performed with alternative proximity measures.
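
To see the asymmetry at work, consider hypothetical counts $n_i=100$, $n_j=1000$, $n_{ij}=80$ and an arbitrary choice $\alpha=2$ (both are assumptions made only for illustration; time superscripts are omitted). Reading the measure as $\left(\left(\frac{n_{ij}}{n_i}\right)^{\alpha}\left(\frac{n_{ij}}{n_j}\right)^{1/\alpha}\right)^{\max(\alpha,\frac{1}{\alpha})}$, we have $\max(\alpha,\frac{1}{\alpha})=2$ and
$$ \Proxm(i,j)=\left(0.8^{2}\times 0.08^{1/2}\right)^{2}=0.8^{4}\times 0.08\approx 0.033, \qquad \Proxm(j,i)=\left(0.08^{2}\times 0.8^{1/2}\right)^{2}=0.08^{4}\times 0.8\approx 3.3\times 10^{-5} . $$
With these values the measure is thus much larger from the rarer term $i$ toward the more frequent term $j$ than in the reverse direction. For $\alpha=1$ the two directions coincide and the expression reduces to the symmetric proximity index $\frac{n_{ij}^2}{n_i n_j}$.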

The proximity measure chosen transforms a co-occurrence matrix $\mathcal{M}$ into a proximity matrix $\mathcal{P}$, which in our case is asymmetric ($\mathcal{P}_{1/\alpha}^T(i,j)=\Proxm^T(j,i)$). This matrix defines a directed weighted graph on the set of terms $\mathcal{L}$ that can be further analyzed with clustering methods to detect informative patterns. In our case, patterns will represent domains of science defined by sets of strongly related terms that contextualize each other, some being more specific, others more generic. Several clustering methods have been proposed and extensively tested: k-means clustering \cite{Small,Zitt}, Self-Organized Maps \cite{Skupin:2004p2187}, and methods based on information flows \cite{Rosvall:2008p909}. In our case, in order to keep the information conveyed by the asymmetry of $\mathcal{P}$ and to allow overlapping clusters for the reasons mentioned previously, we chose to take as the units of our clustering the directed cliques of the graph \cite{palla:dir}, one of the recent and convincing approaches that produce overlapping clusters on directed graphs.

This clustering step aims at reducing the complexity of the micro-level data by representing the whole set of terms through a limited number of clusters, which aggregate valuable information about the way terms are distributed among scientific fields.
Keeping in mind that this is a general workflow independent of the clustering used, one can interpret this last step as a classification step. Given a set of terms $\mathcal{L}$ and a period $T$, we categorize them into a set of (possibly overlapping) scientific fields $\mathcal{C} = \{C_i\}_{i \in I}$.

\subsection{Mapping science}
After this clustering operation, the next step is to give an insight into the articulation of the different scientific fields. This will provide a global view of the scientific landscape defined by our initial set of terms.

Our paradigmatic proximity measure $\Proxm$ can naturally be extended to proximity between clusters by averaging the proximity between terms of two clusters computed over the period $T$ as follows:

$$ \Proxm^{T,2}(C_a,C_b)=\frac{1}{\mid C_a \mid} \sum_{i \in C_a}\left(\frac{1}{\mid C_b \mid}\sum_{j\in C_b}\Proxm^T(i,j)\right)\label{intercluster}$$


It is important to note that two clusters can be close with respect to $\Proxm^{T,2}$ even if they do not share any terms, as long as the terms they contain are themselves close.
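
For instance, for two hypothetical clusters $C_a=\{1,2\}$ and $C_b=\{3,4\}$ (term indices chosen only for illustration), the definition expands to
$$ \Proxm^{T,2}(C_a,C_b)=\frac{1}{4}\left(\Proxm^T(1,3)+\Proxm^T(1,4)+\Proxm^T(2,3)+\Proxm^T(2,4)\right), $$
that is, the mean of the term-level proximities over all $\mid C_a \mid \mid C_b \mid$ ordered pairs. Since $\Proxm^T(i,j)$ and $\Proxm^T(j,i)$ generally differ, $\Proxm^{T,2}(C_a,C_b)$ and $\Proxm^{T,2}(C_b,C_a)$ generally differ as well.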

This defines a weighted directed graph on the set of clusters that can be drawn with network visualization tools. The labeling of each node of the map finds here a natural solution since we can characterize, within a cluster, each term on a specificity / genericity dimension (\textit{cf.} \cite{coint08multi} for details). Depending on what one is looking for, one can label the clusters with the most generic terms, the most specific ones, and so on.

\subsection{Inter-temporal matching function}

%\noteperso{movie is made of successing photographs, previous work, measure pseudo-inclusion, carte des sciences multi-niveau, how to }
The next step in phylogeny reconstruction is to achieve inter-temporal matching between communities. Given a field, we wish to find the field or union of fields from which it inherits. %% rather speak of conceptual legacy. Align with the earlier text. I am willing to try to rephrase this part.
We assume that the time scale of scientific fields evolution is slow enough to allow simple similarity measures between two close periods to track the meso-dynamics of a given field.
%We assume that the field evolution is slow enough to allow to track it. that a basic similarity measure should enable to match communities together.

We thus seek to find the field \emph{or} combination of fields that is most similar to it and therefore the most likely match.
%We first define a similarity measure $\delta$ between fields.
One of the most straightforward measures is a Jaccard similarity measure\footnote{This function is the inverse of the ``transformation index'' introduced for similar purposes by Callon in \cite{Callon:1991p2209}} on field terms, hereafter denoted $\delta$. Given two fields $C_1$ and $C_2$ that can be defined as sets of terms, %$A =(t_i)_{i\subset I}$ and $B% = (t_j)_{j\subset J}$
$\delta(C_1,C_2)=\frac{| C_1 \cap C_2 |}{| C_1 \cup C_2 |}$. $\delta$ can be interpreted as the probability that a term belonging to $C_1 \cup C_2$ also belongs to $C_1 \cap C_2$. This is simply a measure of the overlap between $C_1$ and $C_2$.


Given a conceptual field $C_l^{t+1}\in\{C_b^{t+1}\}_{b \in B}$ at time $t+1$, its ``fathers'' $\mathcal{F}_l^{t+1}$ are chosen among the set of paradigmatic fields of the previous period $\{C_a^t\}_{a \in A}$ as: $$ \mathcal{F}^{t+1}_l = \displaystyle{ \argmax_{K \subset A} \left(\delta\left(\bigcup_{k \in K} C^{t}_k ,C^{t+1}_l \right)\right)}$$ % I swapped them to have t and t+1 in order

With the Jaccard similarity measure we can write:
$$ \mathcal{F}^{t+1}_l = \displaystyle \argmax_{K \subset A}\frac{\left| \left(\bigcup_{k \in K} C^{t}_k \right) \cap C^{t+1}_l \right|}{\left| \left(\bigcup_{k \in K} C^{t}_k \right) \cup C^{t+1}_l \right|}$$


%otherwise one may prefer to adopt other kind of measure like a cost/benefit tradeoff function balancing on one side the possible benefits provided by one field (or union of fields at $t$) by the ratio of overlap obtained with the field at $t+1$ versus the ``cost'' of such a matching provided by the total size (in number of items) of the candidate field: that is $\delta(A,B) = \frac{|A \bigcap B|}{|A|| B|}$.(OR other formulas...) % To be pruned ...
%\COR{we should still check that this diagram does not look too much like Palla's diagram, I fear it does....}
Figure \ref{intertemp_match} illustrates the matching procedure. We plotted two successive sub-networks with the same set of nodes at two time steps. The two successive periods present distinct cluster sets: $A$ and $B$ at time $t$, and $C$ and $D$ at time $t+1$. Note that one node belongs to two different clusters at time $t+1$. The aim is to determine from which fields or union of fields $C$ and $D$ may descend. It is straightforward to check that field $A$ is the closest to cluster $C$ (that is, $\mathcal{F}^{t+1}_C = A$). Even if two nodes were removed from $A$ %(nodes $1$ and $5$)
while one node was added, %(node $2$)
the similarity between $A$ and $C$ ($\delta(A,C)=\frac{2}{5}$) is still the best possible and offers the best matching. The case of $D$ is more delicate since three cases are possible: $D$ may inherit from $A$, $B$ or $A\cup B$. Computing the similarities for each case we get: $\delta(D,A) = \frac{2}{8}$, $\delta(D,B)=\frac{3}{6}$ and finally $\delta(D,A \cup B)=\frac{5}{7}$. We thus conclude that $D$ most likely inherits from the merging of the two preceding fields $A$ and $B$, that is $\mathcal{F}^{t+1}_D = A\cup B$.
\begin{figure}
\centering
%\hspace{-2cm}
\includegraphics[width=0.5\linewidth]{image/phyloex.eps}
\caption{Inter-temporal fields matching.}\label{intertemp_match}
\end{figure}

%\subsection{discussion about the distance}

%\noteperso{intertemporal matching: find the fathers that best explain (in the sense of a distance to be specified) a given offspring.}

%d=Jaccard
As it would be incorrect to match two fields that have very few terms in common merely because no better matching is available, we define a threshold $\delta_0$, a minimum amount of similarity below which no matching is retained. As we shall see, activity patterns in the phylogeny (areas of activity burst, areas with emergent fields, dying branches, etc.) are robust to variations of $\delta_0$, provided that $\delta_0$ does not get too close to 0 or 1.
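
One possible way to make the role of the threshold explicit, assuming that $\delta_0$ is applied to the best similarity that can be reached for a given field (this formulation is a sketch under that assumption, not necessarily the exact rule used in the implementation), is to write
$$\delta^{*}_l=\max_{K \subset A,\, K\neq\emptyset}\delta\left(\bigcup_{k \in K} C^{t}_k, C^{t+1}_l\right)$$
and to keep the fathers of $C^{t+1}_l$ defined above only if $\delta^{*}_l \geq \delta_0$, the field being considered as having no father otherwise. With the values of the example of figure \ref{intertemp_match} and a hypothetical threshold $\delta_0=0.5$, cluster $D$ would keep $A \cup B$ as father since $\delta(D,A\cup B)=\frac{5}{7}\approx 0.71 \geq 0.5$, whereas cluster $C$ would be left without father since $\delta(A,C)=\frac{2}{5}=0.4 < 0.5$.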

%\noteperso{we will be able to spot dynamical patterns in the evolution of the thing. Look for ruptures with many deaths and births}
%\noteperso{discuss the temporal resolution and its consequence on the definition of the threshold.}

\subsection{Validation}
As stated before, the aim of phylogeny reconstruction is to discover patterns and regularities in Science evolution. Given this objective, we defined two benchmarks for this reconstruction: theoretical validation and empirical validation.

Theoretical validation is related to the robustness of the detected patterns regarding the parameters of the model ($\delta_0$ in our case). Detected patterns should be robust to parameter change if we want them to be significant.

Empirical validation concerns the adequacy between the reconstructed scientific fields and the actual production of scientific communities. To reflect the activity of a scientific community, scientific fields should be composed of terms that are indeed mentioned together in the literature. We thus checked, for each cluster identified in 2.2, that a significant number of papers mention all the terms of the cluster in their full text. Moreover, a cluster composed of very common terms (\textit{e.g.} \{disease, molecule, cell, division\}) is less informative than a cluster composed of more specific terms (\textit{e.g.} \{cancer, dna damage, apoptosis, checkpoint\}). This nuance is captured by the notion of self-information \cite{shannon1948mathematical} conveyed by the observation of an event composed of independent items $a_1, \ldots, a_n$ which have probabilities $p_1, \ldots, p_n$ of being observed individually. Self-information is then defined by $I(a_1,\ldots,a_n)=\sum_{i=1}^{n}-\log(p_i)$. These two constraints can be synthesized into the \textit{empirical quality} of a cluster $C$, defined as the product of its self-information with the normalized number $\frac{n_C}{N}$ of papers mentioning all the terms of $C$ in their full text: $$Q_e(C)= \frac{n_C}{N} \sum_{i \in C}-\log\left(\frac{n_i}{N}\right),$$ where $n_i$ is the number of papers mentioning term $i$ and $N$ is the total number of papers in the reference corpus.
%In the following case study, we will take as a reference corpus all the journal sources indexed by scirus.com, which covers more than 20 millon references in most scientific domains.
In the example presented here, clusters with null empirical quality were removed from the phylogeny and patterns in the distribution of empirical quality have been further studied. The empirical quality could also be used as a parameter to filter phylogenies so as to display most informative scientific fields.
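
To illustrate how the two factors of $Q_e$ interact, consider the following hypothetical figures (they are chosen for the sake of the arithmetic and are not measured on our corpus; we also assume base-10 logarithms, the base only rescaling $Q_e$ by a constant factor, and we read $n_i$ as the number of papers mentioning term $i$). Let $N=10^6$ and let two clusters of four terms each be mentioned in full by the same number of papers, $n_C=500$. If the terms of the first cluster are common, with $n_i = 10^5, 10^5, 10^4, 10^4$, its self-information is $1+1+2+2=6$ and
$$Q_e = \frac{500}{10^6}\times 6 = 3\cdot 10^{-3}.$$
If the terms of the second cluster are more specific, with $n_i = 10^4, 10^3, 10^3, 10^3$, its self-information is $2+3+3+3=11$ and
$$Q_e = \frac{500}{10^6}\times 11 = 5.5\cdot 10^{-3}.$$
For an equal number of supporting papers, the cluster made of specific terms is thus rated higher, while any cluster whose terms never appear together in a paper ($n_C=0$) has a null empirical quality.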

\subsection{Qualifying clusters}
Relevance is not a binary judgment but rather lies on a continuum, potentially multidimensional, reflecting what is looked for: well-recognized domains of investigation, emergent domains, interdisciplinary domains, etc. Empirical quality is one of the indexes that make it possible to qualify the identified scientific fields. We further studied two other indexes that help to give meaning to science evolution.
\begin{itemize}
\item \textbf{Density.} One of the first indexes introduced to assess the evolution of scientific fields is the density of a field \cite{callon91coword}. ``It characterizes the strength of the links that tie the words making up the cluster together. The stronger these links are, the more the research problems corresponding to the cluster constitute a coherent and integrated whole. It could be said that density provides a good representation of the cluster's capacity to maintain itself and to develop over the course of time in the field under consideration.'' It is computed by: $$D(C)=\frac{1}{Card(C)}\sum_{(w,w')\in C^2,\, w\neq w'} P_1(w,w').$$
\item \textbf{Pseudo-inclusion index.} In \cite{coint08multi} we showed how two coordinates can be assigned to each component of a cluster $C$ to qualify its degree of specificity and genericity relative to the other terms of the cluster. The \textsl{genericity index} indicates to what extent a term $w$ contextualizes $C$. It is defined by: $$I_g^\alpha(w)=\frac{1}{card(C)}\sum_{w'\in C}P_{\min(\alpha,\frac{1}{\alpha})}(w,w')$$
The \textsl{specificity index} indicates to what extent $w$ is specific to $C$ and is defined by:
$$I_s^\alpha(w)=\frac{1}{card(C)}\sum_{w'\in C} P_{\max(\alpha,\frac{1}{\alpha})}(w,w')$$
Since our goal is to find clusters where all terms are satisfying contexts or well contextualized by other terms in the cluster, we defined the \emph{pseudo-inclusion index} of a cluster: $$I_{\subset}^\alpha(C)=\min_{w \in C}\frac{1}{2}(I_s^\alpha(w)+I_g^\alpha(w))$$ This index indicates the degree of structuration of $C$. As we will see, the pseudo-inclusion index provides some perspective in the interpretation of science dynamics.
\end{itemize}
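
As a toy illustration of these two indexes (all values below are hypothetical), consider a cluster $C=\{a,b,c\}$ with proximities assumed symmetric, $P_1(a,b)=0.3$, $P_1(a,c)=0.2$ and $P_1(b,c)=0.1$. Assuming that the sum runs over ordered pairs, each pair is counted twice and
$$D(C)=\frac{1}{3}\times 2\times(0.3+0.2+0.1)=0.4 .$$
Suppose moreover that the genericity and specificity indexes of the three terms are $(I_g,I_s)=(0.6,0.2)$ for $a$, $(0.3,0.5)$ for $b$ and $(0.1,0.3)$ for $c$. The half-sums are $0.4$, $0.4$ and $0.2$, so that the pseudo-inclusion index of $C$ is $\min(0.4,0.4,0.2)=0.2$: it is driven by the least well integrated term of the cluster, here $c$.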

Along with empirical quality, these two indexes will be useful to filter science maps and focus on particular parts of the phylogeny. Note that whereas pseudo-inclusion and density can be computed without supplementary information, empirical quality requires additional queries to a corpus database. One issue will thus be to see to what extent the first two indexes can be used as proxies to estimate the empirical quality.

\subsection{Software}
We used the Words Evolution software\footnote{http://sciencemapping.com/WE} to process and visualize the phylogenies. This software is interfaced with network visualization tools (Gephi\footnote{http://Gephi.org} or GraphViz\footnote{http://www.graphviz.org}) and clustering software (CFinder\footnote{http://www.cfinder.org}).

\section{Results}
\subsection{Mapping network studies in biology}
We performed phylogeny reconstruction on the MedLine database, focusing on research in biological and biomedical fields related to network studies. After building a database for a set $\mathcal{L}$ of 834 terms according to the methodology explained in 2.1, we generated maps over four-year sliding time windows from 1987 to 2007.

As an example, a detail of the map obtained for the period 2004-2007 is given in figure \ref{map2003-2007}. We extracted all directed cliques of at least 4 terms based on the term list $\mathcal{L}$. We labeled the clusters with their most generic terms and further simplified the map by merging clusters with the same label. The size of a node is related to the density of the cluster; when a node is made of merged clusters, this value is averaged over the set of merged clusters. The value of the link between two sets of merged clusters is the maximum of the inter-cluster similarity over all pairs of clusters.
We filtered the map by thresholding the cliques on empirical quality \COR{how? with respect to which parameter?} so as to display a maximum of 100 labels representing the most relevant fields, and put a threshold on links to keep the \COR{ more significant ones PQ strongest ones}. Figure \ref{sciencemap} displays a portion of the map of science over the period 2004-2007 \COR{repeats the above, but never mind...}. This map is the context of network research in biology \COR{same here}. We \COR{ Clusters were labelled with the most generic terms PQ use the labeling with most generic terms } \COR{ NOTHING PQ to be understandable for a general audience}. %To get a more focused figure, we can plot only the scientific fields mentioning the term network (fig. \ref{sciencemapnetgen}).
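
The two merging rules above can be written compactly. Denoting by $M$ and $M'$ two sets of merged clusters and by $\sigma(C,C')$ the inter-cluster similarity (the symbols $M$, $M'$, $\sigma$, $\bar{D}$ and $w$ are introduced here for illustration only), the size of node $M$ is driven by $$\bar{D}(M)=\frac{1}{|M|}\sum_{C \in M} D(C)$$ and the weight of the link between $M$ and $M'$ is $$w(M,M')=\max_{C \in M,\, C' \in M'} \sigma(C,C').$$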

\begin{figure}
\centering
\includegraphics[width=5in]{image/BionetMap.pdf}
\caption{Portion of a map of the scientific fields related to network studies in biology. Clusters are labeled by their two most generic terms. The size of the text and of the bubbles maps the density of the cluster. The inset gives a detail of a cluster. \tiny Visualized with Gephi.org}\label{sciencemap}
\label{map2003-2007}
\end{figure}

%\begin{figure}
% \includegraphics[width=5in]{image/MapNetworkGene.pdf}
% \caption{Macroscopic map of the biological fields mentioning network studies. One can zoom into subfields as illustrated in the inset to get further information. Clusters are labelled by their two most generic terms.}\label{sciencemapnetgen}
%\end{figure}
\subsection{Phylogenetic Patterns}

We reconstructed the phylogeny of the domains related to networks studies in biology over the period 1987-2007. Releasing all constraints on the phylogeny except that we required the fields to have at least four elements and a non null empirical quality, the phylogenetic network is made of 7759 nodes.

Within this network, we observed a significant positive correlation between the pseudo-inclusion index and the empirical quality. The Pearson coefficient $r$ lies within the 95\% confidence interval $[0.14;0.19]$, the probability of obtaining a correlation as large as the observed value by chance being $p=4\cdot 10^{-39}$. Between the pseudo-inclusion index and the number of papers per cluster we get $0.28<r<0.32$ and $p=0$. To a lesser extent, there is a significant positive correlation between the density index and the empirical quality ($0.03<r<0.08$, $p=4\cdot 10^{-6}$) as well as with the number of papers per cluster ($0.16<r<0.21$, $p=0$).
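
As a rough consistency check, and assuming that such intervals are obtained with the usual Fisher transformation on the $n=7759$ fields of the network (the method and the point estimate used below are assumptions made for the sake of the illustration), the 95\% confidence interval of a correlation $r$ is $$\tanh\left(\mathrm{arctanh}(r) \pm \frac{1.96}{\sqrt{n-3}}\right).$$ With $n=7759$, the half-width on the transformed scale is $1.96/\sqrt{7756}\approx 0.022$; for a hypothetical point estimate $r=0.165$ this gives approximately $[0.143;0.187]$, in line with the interval reported for the pseudo-inclusion index and the empirical quality.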

We categorized the fields according to their position in the phylogenetic network: aborted (no father, no child), newborn (no father, some children), adult (with father(s) and son(s)) and dying fields (with some father(s) but no child). Note that a cluster may belong to different categories depending on the value of $\delta_0$. The distribution of scientific fields over these categories is particularly interesting. Plotting the variations of the fields' empirical quality, density and pseudo-inclusion indexes against this categorization (fig. \ref{AgeQual}), we found very clear patterns for $\delta_0 \in [0.3, 0.6]$: aborted, newborn and dying fields tend to have weaker indexes than adult fields, with aborted fields having slightly lower values for their indexes than newborns.
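
These four categories can be restated in terms of the number of fathers $f(C)$ and the number of sons $s(C)$ of a field $C$ in the phylogenetic network obtained for a given $\delta_0$ (the notations $f$ and $s$ are introduced here for convenience only):
$$\begin{array}{lll}
\mbox{aborted:} & f(C)=0, & s(C)=0,\\
\mbox{newborn:} & f(C)=0, & s(C)\geq 1,\\
\mbox{adult:} & f(C)\geq 1, & s(C)\geq 1,\\
\mbox{dying:} & f(C)\geq 1, & s(C)=0.
\end{array}$$
If, as we assume here, raising $\delta_0$ only removes links from the network, a field that is adult for a low threshold may become newborn, dying or aborted for a higher one.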

The dependency of the mean density, pseudo-inclusion and empirical quality indexes on the position of the fields in the phylogeny suggests trends in the ``life cycle'' of scientific fields: these indexes grow while a new field emerges, and then lose their strength when it begins to be neglected by the community. However, density and pseudo-inclusion are completely different ways of characterizing scientific fields. On the one hand, fields with a high pseudo-inclusion index usually have terms with a large spectrum of specificity and genericity, which means that they are likely to contain very specific terms with few occurrences. These terms have a high probability of being new concepts in the field or new objects of study. Their presence in the phylogeny is then often correlated with a high rate of branching. On the other hand, fields with a high density index rather correspond to well-structured scientific domains with an a priori lower rate of conceptual renewal.

Further studies based on different databases will confirm or refute the relevance of these patterns as general patterns of science evolution. However, these regularities already open perspectives for the detection of emergent or dying fields on the basis of indexes computed on co-occurrence data.

\begin{figure}\centering
\includegraphics[width=0.31\linewidth]{image/AgeQual.pdf}
\includegraphics[width=0.31\linewidth]{image/AgeDensity.pdf}
\includegraphics[width=0.31\linewidth]{image/AgeInclus.pdf}
\caption{\COR{what is on the y-axis?}}\label{AgeQual}
\end{figure}


Besides, the fact that aborted fields tend to be of lower quality suggests a methodology to optimally adjust $\delta_0$ in order to obtain the most informative phylogeny (in the sense of the empirical quality). Indeed, the ratio between the mean quality of fields belonging to the phylogeny and the mean quality of aborted fields is always higher than 1, and reaches its maximum around the value $\delta_c=0.33$. For this value, connected fields in the phylogeny, \textit{i.e.} fields that have at least one father or one son, are, on average, almost twice as informative as aborted fields.
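
In symbols (the notation below is introduced for illustration only), let $\bar{Q}_{\mathrm{conn}}(\delta_0)$ be the mean empirical quality of the connected fields and $\bar{Q}_{\mathrm{ab}}(\delta_0)$ that of the aborted fields for a given threshold. The criterion then reads
$$R(\delta_0)=\frac{\bar{Q}_{\mathrm{conn}}(\delta_0)}{\bar{Q}_{\mathrm{ab}}(\delta_0)}, \qquad \delta_c=\arg\max_{\delta_0} R(\delta_0),$$
with $R(\delta_0)>1$ for all the thresholds considered and $R(\delta_c)$ close to 2 for $\delta_c=0.33$.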

%\begin{figure}\center
%\includegraphics[width=0.5\linewidth]{image/RatioQ.pdf}
%\end{figure}

Inheritance patterns can be studied by classifying fields according to their number of sons in the phylogenetic network. While most fields have less than 2 sons, with 44\% having only one successor, almost 14\% have at least 3 children (cf. fig.~\ref{Outstats}-a). Again, the distribution of the different indexes as a function of the number of children is very instructive. Figure~\ref{outvs} shows that, on average, the maximum of density (cf. fig.~\ref{Outstats}-b) and pseudo-inclusion is obtained for fields that have only one son. Again, this observation holds for a large range of $\delta_0$.
The synthesis of all these results suggests that relatively young branches of science are generally bushy, with fields having many children. This corresponds to an intense exploration of new directions of research. Older fields generally have a much more linear evolution with a lower rate of conceptual change.
This pattern can clearly be observed in figure~\ref{fullphylo}, which represents, for $\delta_0=\delta_c$, the subpart of the phylogenetic network composed of the fields with the highest empirical quality and at least four terms. Recent branches have also been removed to meet editorial constraints. We can also notice that there is much more hybridization between scientific fields in the domain of formal methods and tools than in the branches corresponding to topics in biology. This transversal domain is also over-represented because the targeted theme is itself a transversal methodology.

\begin{figure}\centering
\includegraphics[width=0.8\linewidth]{image/Outstats.pdf}
\caption{\COR{state which value of $\delta_0$ was chosen for figure a; I suppose the result depends on it?}
\COR{I would separate these two figures and put b together with its companions of figure 6}}\label{Outstats}
\end{figure}


\begin{figure}\centering
\includegraphics[width=0.45\linewidth]{image/OutPaper.pdf}
\includegraphics[width=0.45\linewidth]{image/OutIncl.pdf}
\caption{\COR{I moved these two figures here} \COR{is the OutPaper figure really the empirical quality?}}\label{outvs}
\end{figure}

Details of the phylogeny are also very informative. Figure \ref{Cancer} represents the fields of more than five terms for which at least one term contains the words ``cancer'' or ``tumor''. On this partial phylogeny, we can clearly see three distinct sets of branches with very different characteristics. Two sets are quite bushy and deal with \emph{cancer} and \emph{DNA} issues on one side, and \emph{cancer, tumor and proliferation} issues on the other side. They appear to have increased their interactions in recent years around the concepts of \emph{apoptosis}, \emph{suppressor} and \emph{cell cycle}. The third set has very linear branches and concerns the relations between \emph{tumor} and the \emph{immune system}. These three sets are also quite distinct in terms of the range of their density and pseudo-inclusion indexes. Whereas the bushy branches tend to have a higher pseudo-inclusion index than the linear ones, revealing a higher rate of conceptual renewal, they also have a lower density index, indicating that they should be more recent. The study of the evolution of the pseudo-inclusion index along these branches reveals that this index increases along most of the branches, although its growth rate decreases with time. When \COR{relaxing PQ releasing} the constraints on the empirical quality threshold and on the number of terms in clusters, these characteristics of the three sets of branches are preserved, although the branches turn out to be older than they appear in this partial phylogeny, the upper part of the phylogeny having been pruned in the thresholding \COR{process}.

\begin{figure}
\centering
\includegraphics[width=.8\linewidth]{image/PhyloFull.jpg}
\caption{Extract of the full phylogeny of domains related to networks studies in biology and medical research ($\sim$1400 clusters). We kept fields \COR{made} of more than four terms, set a threshold on the empirical quality and removed shortest branches for editorial purposes. The color map\COR{s} the pseudo-inclusion index of the fields.}\label{fullphylo}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=1\linewidth]{image/PhyloCancer2.pdf}
\caption{Part of sub-phylogenetic network related \COR{to PQ with} cancer studies. The size of a square is proportional to the pseudo-inclusion index of the field and the color, from blue to red maps the growth rate of this index. Note that the pseudo-inclusion index along the branches is increasing along most of the branches although its growth rate is decreasing with time. Fields are labeled with their most generic term, except for the beginning of a branch or for the most recent period, where all terms are displayed. The labels of inter-period arrows indicate which terms have been lost or gained between two periods.}\label{Cancer}
\end{figure}


\subsection{Discussion}
The seminal work of Callon et al. \cite{callon91coword} \COR{was PQ constitutes} the first \COR{attempt PQ tentative} to quantify the evolution \COR{of} scientific fields through co-word analysis, \COR{monitoring inter alia the evolution of the density of clusters PQ looking among other at the evolution of the density of clusters}. Our work proposes the first automated methods \COR{for reconstructing PQ to reconstruct} the entire phylogeny of a domain of science and is clearly in line with their approach. We extended their work in several ways, trying to go beyond the most classical limitations of scientometrics that have been expressed \COR{hitherto PQ since}:
\begin{enumerate}
\item \textbf{Coverage: } Our methods can cover the largest bibliographic database available. Nowadays, online publishers cover between 30 and 40 million articles, which represent a significant part of worldwide scientific literature. We gave an example on a case study based on MedLine (14M papers) covering most medical and biological research.
\item \textbf{Ambiguity: } Contrary to \cite{callon91coword} and most subsequent works, we used overlapping clustering algorithms in order to ensure that we can handle ambiguity in the use of terms and \COR{I do not quite understand the false negatives - avoid noise/mistakes in attributing terms to its clusters?} avoid false negatives in the classification of terms into clusters.
\item \textbf{Asymmetry and multi-level mapping: } Following previous work \cite{chava:scien}, we based our clustering algorithm on an asymmetric proximity measure in order to fully reflect the organization of science into domains and sub-domains. This asymmetry makes it possible to highlight the internal structure of clusters, allowing automatic labeling \cite{coint08multi}. This entails a multi-level mapping that proposes multiple viewpoints on the phylogeny \COR{according to PQ in function of} the required degree of specificity. We also introduced \COR{a measure of PQ an indicator of} fields structuration, the \textit{pseudo-inclusion index}, that is based on this new \COR{proximity} measure and we showed that \COR{ the pseudo-inclusion index appears to be very informative when assessing PQ this index is an important indicator of} the evolution of a field of research.
\item \textbf{Validation: } Complementary to \cite{Heal86AnExp} who \COR{suggested to validate science maps with both ``endogenous'' and ``exogenous'' criteria PQ proposed to the integration of both ``internal'' and ``external'' validation of science maps}, we proposed an \textit{empirical validation} (confrontation with real data) that could be articulated with the two previous ones in future work. We defined the \textit{empirical quality} of a cluster as the degree of its \COR{ quantity of information?? PQ informativeness} relative to the associated literature and showed that the pseudo-inclusion index was positively correlated with the empirical quality. The density, on the other hand, was only weakly correlated.
\item \textbf{Dynamics: } Our study took advantage of the availability of diachronic data to reconstruct the phylogeny of scientific fields, and takes into account multiple filiations, contrary to what could have been done in other related fields like social group evolution \cite{Palla:2007p229}. This structure revealed strong \COR{and robust} patterns \COR{which appear to highlight strong regularities PQ that seem to express strong regularities} in science evolution.
\end{enumerate}

This approach opens promising new perspectives both from theoretical and applicative point\COR{s} of view. While we tried to show that these approaches of science dynamics reconstruction are close to being able to corroborate or falsify theories in epistemology and science studies, we can also expect that they will considerably renew the way we interact with science \COR{especially when navigating large-scale electronic databases PQ and in particular within electronic archives}. Moreover, the methodology presented here is not specific to scientific corpora and \COR{may PQ can} be applied to a wide range of co-occurrence data \COR{like, online communities} from Web 2.0, patents database, \COR{folksonomies PQ tags}, \COR{ or even experimental data like} micro-array data\COR{,etc. PQ and so on.}


\subsection*{Acknowledgements}
The authors warmly thank Jean-Paul Gaudilli\`{e}re and Christophe Bonneuil for their help in selecting the list of terms. This research has been supported by the FP7 PATRES project, the Paris Île-de-France Complex Systems Institute and the École Polytechnique.


\MISSING{Your acknowledgements}

\bibliographystyle{plain}
%\bibliographystyle{authordate2}
\bibliography{phylogeny}


\end{document}

\section{Remarques}
We can define two quantities characterizing a
word $w$ within this cluster : the genericity index $i_g$ and
the specificity index $i_s$ defined as follows:
\begin{description}
\item\textsl{The genericity index} defines to what extent the cluster $C$ is a good neighborhood for the word $w$ with respect to the paradigmatic proximity $\Proxm$. When $\alpha<1$, it is the mean of the ``out-paradigmatic proximity'' of $w$ and is defined by: $$I_g(w)=\frac{1}{card(C)}\sum_{w'\in
C}P_{\min(\alpha,\frac{1}{\alpha})}(w,w')$$
\item\textsl{The specificity index} provides the extent to which the word $w$ is specific to the
cluster
$C$ with respect to the paradigmatic
proximity $\Proxm$ considered (\textsl{i.e.} is $w$ relevant for the terms in $C$?). It is the mean of the ``in-paradigmatic proximity'' of $w$,
from term $w'\in C$ to $w$, and is defined as: $$I_s(w)=\frac{1}{card(C)}\sum_{w'\in
C}P_{\min(\alpha,\frac{1}{\alpha})}(w',w)$$ Note that since $\Proxm(w',w)=P_{\frac{1}{\alpha}}(w,w')$, this index can be rewritten: $$I_s(w)=\frac{1}{card(C)}\sum_{w'\in
C}P_{\max(\alpha,\frac{1}{\alpha})}(w,w')$$
\end{description}


\bigskip
\bigskip
$I_\subset(C_a)=\frac{1}{2}\frac{1}{\mid C_a \mid^2}\sum_{i \in C_a} \sum_{j \in C_a} \left(\Proxm(i,j)+ P_{\frac{1}{\alpha}}(i,j)\right)$


\bigskip
$I_\subset(i)=\frac{I_s(i)+ I_g(i)}{2} $

\bigskip
$I_\subset(C_a)=\frac{1}{\mid C_a \mid}.{\sum_{i \in C_a}\frac{I_s(i)+ I_g(i)}{2} }$

\bigskip
$\Proxm(i,j)=\left(\left(\frac{n_{ij}^t}{n_i^t}\right)^{\alpha}\left(\frac{n_{ij}^t}{n_j^t}\right)^{1/\alpha}\right)^{\min(\alpha,\frac{1}{\alpha})}$


\bigskip
General remarks
The size of the nodes on the map (mean of the internal links) is what is called density in \cite{callon91coword} (a good reference for dynamics).
For dynamics, see also \cite{Noyons2001Bibliometric}, which stresses the importance of dynamics.




\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{image/Inc+DensSurQual.pdf}
\caption{Relation Qual - Density and Inclu}\label{intertemp_match}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{image/Year-Index.pdf}
\caption{Relation Qual - Density and Inclu}\label{intertemp_match}
\end{figure}

\begin{figure}\centering
\includegraphics[width=0.5\linewidth]%width=8cm, height=6cm%,trim=0 10 0 20
{image/AgePapers.pdf}
\caption{}\label{}
\end{figure}

\begin{figure}\centering
\includegraphics[width=0.5\linewidth]%width=8cm, height=6cm%,trim=0 10 0 20
{image/AgeInclus.pdf}
\caption{}\label{}
\end{figure}


\begin{figure}\centering
\includegraphics[width=0.5\linewidth]%width=8cm, height=6cm%,trim=0 10 0 20
{image/AgeDensity.pdf}
\caption{}\label{outDens}
\end{figure}

\begin{figure}\centering
\includegraphics[width=0.5\linewidth]{image/OutPaper.pdf}
\caption{}\label{}
\end{figure}

\begin{figure}\centering
\includegraphics[width=0.5\linewidth]{image/OutIncl.pdf}
\caption{}\label{}
\end{figure}



\begin{figure}\centering
\includegraphics[width=0.5\linewidth]{image/OutDegreeDiffIncl.pdf}
\caption{}\label{}
\end{figure}


\begin{figure}\centering
\includegraphics[width=0.8\linewidth]{image/FatherSonQ.pdf}
\caption{}\label{}
\end{figure}

%\begin{figure}[h!]
%\includegraphics[angle=90,width=8cm, height=20cm]%,trim=0 10 0 20]]
%%{image/1971-3-2004-2007-_2.pdf} % a rather heavy image, but it is the right phylogeny!!!
%\caption{ Inter-temporal fields matching.}\label{intertemp}
%\end{figure}

\MISSING{explain the color codes - say that graphviz does the work, explain the labels and the arrows. Add different examples and a web appendix (a large vertical one...)}

We can observe activity patterns (i.e. discrete dynamics) and measure some indexes about the dynamics: epistemological rupture / arrival of the internet; structural changes in the organization of scientific communities: atomization...

We can observe the history of one term in all its context (several meanings).

We can try to correlate the activity of communities with meso-level dynamics (can we detect a field which is going to produce a lot of siblings?).

\noteperso{we are not trying to reconstruct the phylogeny (as a natural phenomenon) but a phylogeny that is inherently limited by our methods; but if there is an explosion, then there will surely be traces of kerosene left to analyze.}




\cite{Chen:2006p2223}: tracking at the word level but not at the cluster level.

\noteperso{
Authors have long insisted on the importance of dynamics (find the reference again). Recently, a few references have addressed dynamical studies, but they:
- use non-overlapping clustering
- focus on the micro level or on the macro level (not the meso level)
- offer no synthetic visualization (movies of evolving networks, e.g. \cite{Leydesdorff2008Dynamic})
}


\noteperso{in scientometrics, one compares maps that are reconstructions at different moments; the challenge is to capture the kinematics/dynamics of science.}

\noteperso{dynamics required for everyone (policy maker)}

\COR{actually I think this section could very well be diluted into the state of the art, by moving a few references up and describing them briefly...}

\subsection{Comparison with related work}

\subsection{Perspectives \& applications}

%\subsection{enhancement of methods}

likelihood with specificity index or frequencies.


emergent domains,
predictability? extrapolation? policy makers.
electronic archives: specific search engine. Time machine.


Contributor(s) to this page: davidchavalarias.
Page last modified on Tuesday 14 April 2009, 11:43:50 CET by davidchavalarias.
