Amaro, Valeria (2017) Machine Learning based Probability Density Functions of photometric redshifts and their application to cosmology in the era of dark Universe. [PhD thesis]
Item Type: PhD thesis
Language: English
Title: Machine Learning based Probability Density Functions of photometric redshifts and their application to cosmology in the era of dark Universe
Creators: Amaro, Valeria (vlaraie81@gmail.com)
Date: 11 December 2017
Number of Pages: 159
Institution: Università degli Studi di Napoli Federico II
Department: dep06
PhD Programme: phd028
PhD Cycle: 30
PhD Programme Coordinator: Capozziello, Salvatore (capozzie@na.infn.it)
Tutors: Longo, Giuseppe (email UNSPECIFIED); Brescia, Massimo (email UNSPECIFIED)
Uncontrolled Keywords: Machine Learning; photometric redshifts; probability density function; weak lensing; cosmology
MIUR Scientific-Disciplinary Sectors: Area 02 - Physical Sciences > FIS/05 - Astronomy and Astrophysics
Date Deposited: 17 Jan 2018 09:28
Last Modified: 19 Mar 2019 11:43
URI: http://www.fedoa.unina.it/id/eprint/12226

Abstract
The advent of wide, multiband, multiepoch digital surveys of the sky has pushed astronomy
into the big data era. Instruments such as the Large Synoptic Survey Telescope (LSST) are
in fact capable of producing up to 30 Terabytes of data per night. Such data streams imply
that data acquisition, reduction, analysis and interpretation cannot be performed with
traditional methods, and that automatic procedures need to be implemented. In other words,
Astronomy, like many other sciences, needs to adopt what has been defined as the fourth
paradigm of modern science: the so-called "data driven" science or "Knowledge Discovery
in Databases" (KDD), which follows the three older paradigms of theory, experimentation
and simulation. By "Knowledge Discovery" or "Data Mining" we mean the extraction of
useful information from very large amounts of data using automatic or semi-automatic
techniques based on Machine Learning, i.e. on algorithms built to teach machines how to
perform specific tasks typical of the human brain.
This methodological revolution has led to the birth of the new discipline of Astroinformatics,
which, besides the algorithms used to extract knowledge from data, also covers the proper
acquisition and storage of the data, their pre-processing and analysis, and their distribution
to the community of users.
This thesis takes place within the framework defined by this new discipline, since it describes
the implementation and application of a new machine learning method to the evaluation
of photometric redshifts for the large samples of galaxies produced by ongoing and future
digital surveys of the extragalactic sky. Photometric redshifts (described in Section 1.1) are
in fact crucial for a huge variety of fundamental topics, such as: constraining the dark matter
and dark energy content of the Universe, mapping the galaxy color-redshift relationships,
classifying astronomical sources, and reconstructing the Large Scale Structure of the
Universe through weak lensing, to quote just a few. Therefore, it comes as no surprise that
in recent years a plethora of methods capable of calculating photo-z's has been implemented,
based either on the fitting of template models or on empirical explorations of the photometric
parameter space. Among the latter, many are based on machine learning, but only a few
allow the characterization of the results in terms of a reliable Probability Distribution
Function (PDF).
In fact, machine learning based techniques, while on the one hand not explicitly dependent
on physical priors and capable of producing accurate photo-z estimates within the
photometric ranges covered by a spectroscopic training set, are on the other hand not easy
to characterize in terms of a photo-z PDF, due to the fact that the analytical relation mapping
the photometric parameters onto the redshift space is virtually unknown. In the course of
my thesis I contributed to the design, implementation and testing of the innovative procedure
METAPHOR (Machine-learning Estimation Tool for Accurate PHOtometric Redshifts),
capable of providing reliable PDFs of the error distribution for empirical techniques.
METAPHOR is implemented as a modular workflow whose internal engine for photo-z
estimation makes use of the MLPQNA neural network (Multi Layer Perceptron with Quasi
Newton learning rule), with the possibility to easily replace the specific machine learning
model chosen to predict photo-z's, together with an algorithm for the calculation of PDFs
for individual sources as well as for stacked samples of objects.
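As a purely illustrative aid, the following Python sketch shows one common way an empirical per-source PDF of this kind can be assembled: the trained regressor is re-run on many perturbed copies of the source photometry, and the resulting estimates are binned in redshift. The Gaussian perturbation scheme, the function names and the binning parameters are assumptions made for the example, not the actual METAPHOR implementation.

```python
import numpy as np

def photoz_pdf(model, mags, mag_errs, n_real=100,
               z_min=0.0, z_max=1.0, bin_width=0.01, seed=None):
    """Illustrative per-source photo-z PDF: perturb the observed magnitudes
    within their quoted errors, re-run the trained regressor on each
    realization, and histogram the resulting estimates in redshift bins.
    `model` is any trained regressor with a scikit-learn style `predict`
    method (a hypothetical stand-in for the actual photo-z engine)."""
    rng = np.random.default_rng(seed)
    # Gaussian perturbation of the photometry (assumed scheme)
    realizations = mags + rng.normal(0.0, mag_errs, size=(n_real, len(mags)))
    z_est = model.predict(realizations)          # one photo-z per realization
    edges = np.arange(z_min, z_max + bin_width, bin_width)
    counts, _ = np.histogram(z_est, bins=edges)
    return edges, counts / counts.sum()          # normalized binned PDF
```

A stacked PDF for a whole object sample can then be obtained as the normalized sum of the individual PDFs.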
More in detail, my work in this context consisted of: i) the creation of software modules
providing some of the functionalities of the entire method, aimed at obtaining and analyzing
the results on all the datasets used so far (see the list of publications) and for the EUCLID
contest (see below); ii) the refinement of the algorithms used to improve some workflow
facilities; and iii) the debugging of the whole procedure. The first application of METAPHOR
was in the framework of the second internal Photo-z challenge of the Euclid consortium: a
contest among different teams, aimed at establishing the best SED fitting and/or empirical
methods to be included in the official data flow processing pipelines for the mission. The
contest lasted from September 2015 until the end of January 2016, and concluded with the
release of the results on the participants' performances in mid-May 2016.
Finally, the original workflow has been improved by adding further statistical estimators, in
order to better quantify the significance of the results. Through a comparison of the results
obtained by METAPHOR and by the SED template fitting method Le-Phare on SDSS-DR9
(Sloan Digital Sky Survey - Data Release 9) data, we verified the reliability of our PDF
estimates using three different self-adaptive techniques, namely MLPQNA, Random Forest
and the standard K-Nearest Neighbors models.
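For illustration, one widely used estimator of PDF calibration is the Probability Integral Transform (PIT): the value of each source's cumulative PDF at its spectroscopic redshift, which is uniformly distributed when the PDFs are well calibrated. The sketch below is a generic example, not necessarily one of the estimators adopted in the thesis.

```python
import numpy as np

def pit_values(pdfs, edges, z_spec):
    """Probability Integral Transform: for each source, the cumulative
    PDF evaluated at the spectroscopic redshift. `pdfs` has one row per
    source, normalized over the redshift bins defined by `edges`."""
    cdfs = np.cumsum(pdfs, axis=1)               # per-source CDFs
    idx = np.searchsorted(edges, z_spec) - 1     # bin holding each z_spec
    idx = np.clip(idx, 0, pdfs.shape[1] - 1)
    return cdfs[np.arange(len(z_spec)), idx]
```

A flat histogram of PIT values indicates well-calibrated PDFs; a U-shaped histogram signals overly narrow PDFs, while a centrally peaked one signals overly broad PDFs.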
In order to further explore ways to improve the overall performance of photo-z methods, I
also contributed to the implementation of a hybrid procedure based on the combination of
the SED template fitting estimates obtained with Le-Phare and of METAPHOR, using as
test data those extracted from the ESO (European Southern Observatory) KiDS (Kilo Degree
Survey) Data Release 2.
Still in the context of the KiDS survey, I was involved in the creation of a catalogue of
ML photo-z's and relative PDFs for the KiDS-DR3 (Data Release 3) survey, widely and
exhaustively described in de Jong et al. (2017). A further work on KiDS DR3 data, Amaro et
al. (2017), has been submitted to MNRAS. The main goal of this last work is a deeper
analysis of the photo-z PDFs obtained using different methods, two machine learning models
(METAPHOR and ANNz2) and one based on SED fitting techniques (BPZ), through a direct
comparison of both cumulative (stacked) and individual PDFs. The comparison has been
made by discriminating between quantitative and qualitative estimators, and by using a
special dummy PDF as a benchmark to assess their capability to measure the quality of error
estimation and their invariance with respect to any type of error source. In fact, it is well
known that, in the absence of systematics, there are several factors affecting the photo-z
reliability, such as the photometric and internal errors of the methods, as well as statistical
biases. For the first time, we implemented a ML based procedure capable of taking into
account also the intrinsic photometric uncertainties.
By modifying the METAPHOR internal mechanism, I derived a dummy PDF method in
which each individual PDF, called dummy, consists of a single number, e.g. 1 (the maximum
probability), associated with the redshift bin of chosen accuracy into which the single photo-z
estimate for that source falls; all the other redshift bins of a dummy PDF are characterized
by a probability identically equal to zero. Due to its intrinsic invariance to different sources
of error, the dummy method makes it possible to compare PDF methods independently of
the statistical estimator adopted.
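A minimal sketch of the dummy PDF construction just described, assuming an illustrative redshift range and bin width:

```python
import numpy as np

def dummy_pdf(z_est, z_min=0.0, z_max=1.0, bin_width=0.01):
    """Dummy PDF as described above: probability 1 in the single redshift
    bin containing the point photo-z estimate, 0 in all other bins."""
    edges = np.arange(z_min, z_max + bin_width, bin_width)
    pdf = np.zeros(len(edges) - 1)
    # locate the bin containing the point estimate (clipped to the range)
    idx = np.clip(np.digitize(z_est, edges) - 1, 0, len(pdf) - 1)
    pdf[idx] = 1.0
    return edges, pdf
```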
The results of this comparison, along with a discussion of the statistical estimators, have
allowed us to conclude that, in order to assess the objective validity and quality of any photo-z
PDF method, a combined set of statistical estimators is required.
Finally, a natural application of photo-z PDFs is the measurement of Weak Lensing (WL),
i.e. the weak distortion of galaxy images due to the inhomogeneities of the Large Scale
Structure of the Universe (LSS, made up of voids, filaments and halos) along the line of
sight. The distortion of the galaxy shapes (ellipticities) due to the presence of matter between
the observer and the lensed sources is evaluated through the tangential component of the
shear. The Excess Surface Density (ESD, i.e. a measurement of the density distribution of
the lenses) is proportional to the tangential shear through a geometrical factor, which takes
into account the angular diameter distances among observer, lens, and lensed galaxy source.
Such distances in the geometrical factor are measured through photometric redshifts or,
better, through their full posterior probability distributions.
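In standard weak lensing notation, the relation just described can be written as (a textbook form, reported here only to make the role of the distances explicit):

$$\Delta\Sigma(R) = \Sigma_{\mathrm{crit}}\,\gamma_t(R), \qquad
\Sigma_{\mathrm{crit}} = \frac{c^2}{4\pi G}\,\frac{D_s}{D_l\,D_{ls}},$$

where $\gamma_t$ is the tangential shear and $D_l$, $D_s$ and $D_{ls}$ are the angular diameter distances to the lens, to the source, and between lens and source, respectively: it is through these distances that the photometric redshifts, or better their full PDFs, enter the measurement.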
Up to now, such distributions have been measured with template fitting methods: our
Machine Learning based METAPHOR has been employed to carry out a preliminary
comparative study of the WL ESD with respect to the SED fitter results. Furthermore, a
comparison between the ESD estimates obtained by using the METAPHOR PDFs and those
obtained from photo-z point estimates has been performed. The outcome of the WL study is
very promising, since we found that the use of point estimates and of the relative PDFs leads
to indistinguishable results, at least to the required accuracy. Most importantly, we found a
similar trend in the ESD results when comparing our Machine Learning method with a
template fitter, despite all the limitations of Machine Learning techniques (incompleteness
of the training dataset, low reliability of results extrapolated outside the knowledge base),
which become particularly relevant in WL studies.