Amaro, Valeria (2017) Machine Learning based Probability Density Functions of photometric redshifts and their application to cosmology in the era of dark Universe. [Doctoral thesis]

Item Type: Doctoral thesis
Language: English
Title: Machine Learning based Probability Density Functions of photometric redshifts and their application to cosmology in the era of dark Universe
Creators: Amaro, Valeria (email: vlaraie81@gmail.com)
Date: 11 December 2017
Number of Pages: 159
Institution: Università degli Studi di Napoli Federico II
Department: dep06
Doctorate: phd028
Doctoral cycle: 30
Doctoral programme coordinator: Capozziello, Salvatore (email: capozzie@na.infn.it)
Tutors:
Longo, Giuseppe (email: UNSPECIFIED)
Brescia, Massimo (email: UNSPECIFIED)
Uncontrolled Keywords: Machine Learning;photometric redshifts;probability density function;weak lensing;cosmology
MIUR scientific-disciplinary sectors: Area 02 - Physical sciences > FIS/05 - Astronomy and astrophysics
Date Deposited: 17 Jan 2018 09:28
Last Modified: 19 Mar 2019 11:43
URI: http://www.fedoa.unina.it/id/eprint/12226

Abstract

The advent of wide, multiband, multi-epoch digital surveys of the sky has pushed astronomy into the big data era. Instruments such as the Large Synoptic Survey Telescope (LSST) are in fact capable of producing up to 30 Terabytes of data per night. Such data streams imply that data acquisition, reduction, analysis and interpretation cannot be performed with traditional methods and that automatic procedures need to be implemented. In other words, astronomy, like many other sciences, needs to adopt what has been called the fourth paradigm of modern science: the so-called "data driven" paradigm, or "Knowledge Discovery in Databases" (KDD), which follows the three older paradigms of theory, experimentation and simulation. By "Knowledge Discovery" or "Data Mining" we mean the extraction of useful information from a very large amount of data using automatic or semi-automatic techniques based on Machine Learning, i.e. on algorithms built to teach machines how to perform specific tasks typical of the human brain. This methodological revolution has led to the birth of the new discipline of Astroinformatics, which, besides the algorithms used to extract knowledge from data, also covers the proper acquisition and storage of the data, their pre-processing and analysis, as well as their distribution to the community of users. This thesis is set within the framework defined by this new discipline, since it describes the implementation and application of a new machine learning method for the evaluation of photometric redshifts for the large samples of galaxies produced by ongoing and future digital surveys of the extragalactic sky.
Photometric redshifts (described in Section 1.1) are essential for a wide variety of fundamental topics, such as constraining the dark matter and dark energy content of the Universe, mapping the galaxy color-redshift relationships, classifying astronomical sources, and reconstructing the Large Scale Structure of the Universe through weak lensing, to quote just a few. It therefore comes as no surprise that in recent years a plethora of methods for calculating photo-z's has been implemented, based on template model fitting and/or on empirical explorations of the photometric parameter space. Among the latter, many are based on machine learning, but only a few allow the results to be characterized in terms of a reliable Probability Density Function (PDF). On the one hand, machine learning based techniques do not depend explicitly on physical priors and can produce accurate photo-z estimates within the photometric ranges covered by a spectroscopic training set. On the other hand, they are not easy to characterize in terms of a photo-z PDF, because the analytical relation mapping the photometric parameters onto the redshift space is virtually unknown. In the course of my thesis I contributed to the design, implementation and testing of the innovative procedure METAPHOR (Machine-learning Estimation Tool for Accurate PHOtometric Redshifts), capable of providing reliable PDFs of the error distribution for empirical techniques. METAPHOR is implemented as a modular workflow. Its internal engine for photo-z estimation uses the MLPQNA neural network (Multi Layer Perceptron with Quasi Newton learning rule), with the possibility of easily replacing the specific machine learning model chosen to predict photo-z's, and it includes an algorithm for calculating PDFs both for individual sources and for stacked samples of objects.
In more detail, my work in this context has been: i) the creation of software modules providing some of the functionalities of the entire method, aimed at obtaining and analyzing the results on all the datasets used so far (see the list of publications) and for the Euclid contest (see below); ii) fixing the algorithms in order to improve some workflow facilities; and iii) the debugging of the whole procedure. The first application of METAPHOR was in the framework of the second internal Photo-z challenge of the Euclid consortium: a contest among different teams, aimed at establishing the best SED fitting and/or empirical methods to be included in the official data flow processing pipelines for the mission. This contest lasted from September 2015 until the end of January 2016, and it concluded with the release of the results on the participants' performances in the middle of May 2016. Finally, the original workflow has been improved by adding other statistical estimators in order to better quantify the significance of the results. Through a comparison of the results obtained by METAPHOR and by the SED template fitting method Le-Phare on the SDSS-DR9 (Sloan Digital Sky Survey - Data Release 9), we verified the reliability of our PDF estimates using three different self-adaptive techniques, namely MLPQNA, Random Forest and the standard K-Nearest Neighbors models. In order to further explore ways to improve the overall performance of photo-z methods, I also contributed to the implementation of a hybrid procedure based on the combination of SED template fitting estimates obtained with Le-Phare and of METAPHOR, using as test data those extracted from the ESO (European Southern Observatory) KiDS (Kilo Degree Survey) Data Release 2. Also in the context of the KiDS survey, I was involved in the creation of a catalogue of ML photo-z's and related PDFs for the KiDS-DR3 (Data Release 3) survey, described in full in de Jong et al. (2017).
A further work on KiDS DR3 data, Amaro et al. (2017), has been submitted to MNRAS. The main aim of this last work is a deeper analysis of photo-z PDFs obtained using different methods, two based on machine learning (METAPHOR and ANNz2) and one based on SED fitting techniques (BPZ), through a direct comparison of both cumulative (stacked) and individual PDFs. The comparison has been made by discriminating between quantitative and qualitative estimators and by using a special dummy PDF as a benchmark to assess their capability to measure the quality of error estimation and their invariance with respect to any type of error source. In fact, it is well known that, in the absence of systematics, there are several factors affecting photo-z reliability, such as photometric errors and internal errors of the methods, as well as statistical biases. For the first time we implemented an ML based procedure that also takes into account the intrinsic photometric uncertainties. By modifying the METAPHOR internal mechanism, I derived a dummy PDF method in which each individual PDF, called dummy, is made up of a single number, e.g. 1 (the maximum probability), associated with the redshift bin of chosen accuracy in which the only photo-z estimate for that source falls. All the other redshift bins of a dummy PDF have a probability identically equal to zero. Due to its intrinsic invariance to different sources of error, the dummy method makes it possible to compare PDF methods independently of the statistical estimator adopted. The results of this comparison, along with a discussion of the statistical estimators, have allowed us to conclude that, in order to assess the objective validity and quality of any photo-z PDF method, a combined set of statistical estimators is required.
Finally, a natural application of photo-z PDFs is the measurement of Weak Lensing (WL), i.e. the weak distortion of galaxy images due to the inhomogeneities of the Large Scale Structure of the Universe (LSS, made up of voids, filaments and halos) along the line of sight. The shear, or distortion of the galaxy shapes (ellipticities) due to the presence of matter between the observer and the lensed sources, is evaluated through its tangential component. The Excess Surface Density (ESD, i.e. a measurement of the density distribution of the lenses) is proportional to the tangential shear through a geometrical factor, which takes into account the angular diameter distances among observer, lens and lensed source galaxy. The distances in the geometrical factor are measured through photometric redshifts or, better, through their full posterior probability distributions. Up to now, such distributions have been measured with template fitting methods: our Machine Learning method METAPHOR has been employed to make a preliminary comparative study of the WL ESD with respect to the SED fitter results. Furthermore, a comparison has been performed between the ESD estimates obtained by using METAPHOR PDFs and those obtained by using photo-z point estimates. The outcome of the WL study is very promising, since we found that the use of point estimates and of the related PDFs leads to indistinguishable results, at least to the required accuracy. Most importantly, we found a similar trend for the ESD results when comparing our Machine Learning method with the performance of a template fitter, despite all the limits of Machine Learning techniques (incompleteness of the training dataset, low reliability of results extrapolated outside the knowledge base), which become particularly relevant in WL studies.
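As an illustration of the dummy PDF described in the abstract, the following minimal Python sketch builds a PDF with probability 1 in the redshift bin containing the single photo-z estimate and 0 elsewhere, and then stacks several such PDFs by averaging them bin by bin. It is not taken from the METAPHOR code: the function names, the redshift range, the bin width, and the choice of a simple mean for stacking are all assumptions made for this example.

```python
import numpy as np


def dummy_pdf(photo_z, z_min=0.0, z_max=1.0, bin_width=0.01):
    """Return (bin_edges, pdf) for a dummy PDF.

    The redshift range and bin width are illustrative assumptions,
    not values taken from the thesis.
    """
    n_bins = int(round((z_max - z_min) / bin_width))
    edges = np.linspace(z_min, z_max, n_bins + 1)
    pdf = np.zeros(n_bins)
    # Index of the bin whose interval contains the point estimate.
    # Estimates outside [z_min, z_max] are clipped to the first or last bin.
    idx = np.searchsorted(edges, photo_z, side="right") - 1
    idx = int(np.clip(idx, 0, n_bins - 1))
    pdf[idx] = 1.0  # all probability in one bin, zero everywhere else
    return edges, pdf


def stacked_pdf(pdfs):
    """Stack individual PDFs by averaging them bin by bin (assumed convention)."""
    return np.mean(np.asarray(pdfs), axis=0)


if __name__ == "__main__":
    estimates = [0.105, 0.105, 0.555]  # hypothetical photo-z point estimates
    individual = [dummy_pdf(z)[1] for z in estimates]
    stacked = stacked_pdf(individual)
    print(np.nonzero(stacked)[0], stacked[np.nonzero(stacked)])
```

Because each dummy PDF carries only the point estimate, the stacked dummy PDF in this sketch reduces to a normalized histogram of the photo-z estimates.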
